I’m happy to report that full connectivity was restored to the house a day or two before we were due to fly to NZ. Returning home on Sunday, I was happy to discover that we still had net :) I’ll talk more about the NZ trip and tramping the Kepler Track once I’ve sorted out the photos and loaded them to flickr. I have about 130 photos that I need to weed through, though that should be relatively quick compared to weeding my photos from the European holiday over Dec/Jan. For the Europe set, I’ve managed to get it down to under 300 from around 700 but it still needs a couple more goes. I should have the Kepler set up this week at least.

baby steps with web harvests

One of the things I’m interested in is working with data sets around web harvesting and archiving. I’ve spent a bit of time over the years exploring the Internet Archive and other web archives, and I’m hitting the point where I’d like to understand the sorts of information gathered when you harvest a bunch of websites. What can be discerned from a site’s structure, how does it change over time, are there any other useful directions to explore?

When you harvest web sites you end up with a bunch of files in the WARC format. So far, in my limited experience, a typical WARC file is about a gig and one harvest can contain lots of these files. Depending on how you set up your harvester, you can save all content on a site including office files, music, video and so on. A harvest captures that website at one moment in time, and with repeated harvests it’s possible to get a sense of how it might change over time. As part of learning how all this works, I’m using a small archive of 72 WARC files that roughly total 55GB.

Having successfully installed lots of software on my machine at home, I might actually be ready to start experimenting. I’ve been following the Getting Started guide for installing Warcbase (platform for managing web archives) and associated software on a mac mini. While time consuming, it’s actually been straightforward and installing software on the mac has seemed easier than installing similar stuff under windows a year or so back. Of that guide, I have completed steps 1, 2, 3, and 5. Step 4 involves installing Spark Notebook but the primary site seems to be down at the moment so I’ve installed gephi to handle data visualisation. As a result I am now running:

  • Homebrew – MacOS package manager
  • Maven3 – software project management tool
  • Warcbase – built on hadoop and hbase
  • Apache Spark – an engine for large-scale data processing
  • Gephi – data visualisation

In other words a bunch of tools for dealing with really large data sets installed on a really small computer :-) I’d originally bought the mac mini to migrate my photo collection from a much older Mac Pro and hadn’t considered it as a platform for doing large scale data stuff.  So far, it’s holding up though I am feeling the limits of having only 8GB of RAM.

All those tools can be used on really big systems and run across server clusters. Thankfully, they also work on a single system but you have to keep the data chunks small. I tried analysing the entire 55GB archive in one go but spark spat out a bunch of errors and crashed. Running it file by file, where each file is up to a gig, seems to be working so far.
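For what it's worth, the file by file approach can be sketched in a few lines of Python. This is only a rough sketch and not the Warcbase way of doing it: `plan_jobs`, `run_jobs` and `process_one` are made-up names, and `process_one` stands in for whatever actually runs the spark script on a single WARC file. It assumes the files end in `.warc` or `.warc.gz` and that each one should produce its own GDF file.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_jobs(warc_dir: Path, out_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every WARC file in warc_dir with its own GDF output path."""
    jobs = []
    for warc in sorted(warc_dir.iterdir()):
        if warc.name.endswith((".warc", ".warc.gz")):
            stem = warc.name.split(".warc")[0]
            jobs.append((warc, out_dir / f"{stem}.gdf"))
    return jobs


def run_jobs(
    jobs: List[Tuple[Path, Path]],
    process_one: Callable[[Path, Path], None],  # hypothetical per-file step
) -> List[Path]:
    """Call process_one(warc, gdf) for each job; return the WARC files that failed."""
    failed = []
    for warc, gdf in jobs:
        if gdf.exists():
            continue  # output already there from an earlier run, skip it
        try:
            process_one(warc, gdf)
        except Exception as err:
            # One bad file should not stop the rest of the batch.
            print(f"skipping {warc.name}: {err}")
            failed.append(warc)
    return failed
```

Keeping each file as its own job means a crash only costs one file's worth of work, and rerunning the batch picks up where it left off because finished outputs are skipped.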

There’s been no working internet at home for a couple of weeks so I’ve been hampered in what help I can look up but at least had all the software installed before we lost connection. Spark may have had issues for a different reason eg I may not have specified the directory path correctly but I couldn’t easily google the errors.

I’m trying out a script in spark to generate the site structure from each archive and this is typically producing a file of about 2-3KB from a 1GB file of data. The script is able to write to gephi’s file format, GDF. Gephi supports the ability to load lots of files and merge them into one. That means I can run a file by file analysis and then combine them at the visualisation stage. I haven’t worked out the code to run the script iteratively for each file and am manually changing the file name each time. The ugly image below is my first data load into gephi showing the interlinking URL nodes. I haven’t done anything with it; it is literally the first display screen. However it does indicate that I might at last be heading in a useful direction.

visualisation of website nodes
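Gephi can do the merging itself, but as a rough illustration of what combining the per-file results amounts to, here is a small Python sketch. It assumes every GDF file was written by the same script and so has identical `nodedef>` and `edgedef>` header lines; `split_gdf` and `merge_gdf` are names invented for the example. Node rows are de-duplicated as whole lines and edge rows are simply concatenated, so a link seen in two files appears twice.

```python
from typing import Iterable, List, Optional, Tuple


def split_gdf(text: str) -> Tuple[str, List[str], str, List[str]]:
    """Split GDF text into (node header, node rows, edge header, edge rows)."""
    node_header = edge_header = ""
    nodes: List[str] = []
    edges: List[str] = []
    target: Optional[List[str]] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("nodedef>"):
            node_header, target = line, nodes
        elif line.startswith("edgedef>"):
            edge_header, target = line, edges
        elif target is not None:
            target.append(line)
    return node_header, nodes, edge_header, edges


def merge_gdf(texts: Iterable[str]) -> str:
    """Merge several GDF documents that share the same column headers."""
    node_header = edge_header = ""
    seen = set()
    nodes: List[str] = []
    edges: List[str] = []
    for text in texts:
        nh, ns, eh, es = split_gdf(text)
        # Assumption: all inputs use the same columns, so refuse to mix layouts.
        if nh and node_header and nh != node_header:
            raise ValueError("node columns differ between files")
        if eh and edge_header and eh != edge_header:
            raise ValueError("edge columns differ between files")
        node_header = node_header or nh
        edge_header = edge_header or eh
        for row in ns:
            if row not in seen:  # keep each identical node row once
                seen.add(row)
                nodes.append(row)
        edges.extend(es)  # edges are kept as-is, repeats included
    out = [node_header] + nodes
    if edge_header:
        out += [edge_header] + edges
    return "\n".join(out) + "\n"
```

Something like `merge_gdf(p.read_text() for p in sorted(Path("gdf").glob("*.gdf")))` would then give one combined file to open in gephi, assuming the per-file outputs sit in a `gdf` folder.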

Next steps include learning how to write scripts myself and learning how to use gephi to produce a more meaningful visualisation.



keep on trying

I have been reading feminist literature for a fair chunk of my life (and had conversations with feminist friends and generally tried to live a more otherness-aware life) and I continue to discover things where I, as a white male, continue to fail. Or perhaps not so much fail, as not quite understand the perspective of others. Sometimes this happens repeatedly…for years…before I manage to get a clue.

In tech, a common pattern is for hiring managers to say “I don’t care who you are, just show me your hobby projects on github, or your think-pieces on medium” but a bit of reflection is all it takes to realize that screening based on free-time pursuits gets you more affluent white men than it does underemployed single moms.

Time to do stuff is not equal for everybody. In library circles that can mean time to spend on conference committees, to do volunteer work, to admin elists and so on. For most of my life I was a single, white male with a job and seemingly endless amounts of free time. That in turn begs the question of why haven’t I done so much more, which is answered by my sheer, bloody laziness and ability to procrastinate.

However I have continued to say that volunteering requires giving up your own time to do stuff for others. That’s easy when you’re a single, white male. That’s what privilege looks like. I now live with a partner and she has 3 kids. My ability to “give up” time no longer exists, I no longer have privileged control over my time. My time is shared with others. She has given up a lot of time for the kids.

That’s not a complaint about my current situation.

In the old days, I’d come home from work, switch on the TV, or the playstation, perhaps get round to cooking, or heating up a frozen dinner (more likely as I hate cooking), pour myself some wine and so on. These days, we return home, the number for dinner is variable and occasionally unknown, depending on the movement of the kids (15, 18, 21). Post dinner is variable depending on what others are doing…time on the playstation is negotiated. I’m playing skyrim again and I can’t spend entire weeks/months playing it like I could when I was alone. There might be alternative options eg my partner and I are currently watching Once Upon a Time with Ms15, however they sometimes like to watch Nashville together which I’m not interested in.

I seem to have little time to volunteer, or at least I think I have little time and that may be more a reflection that I’m still trying to think in the context of my old life while trying to adapt to my new life. Adaption is taking me a long time as I’ve been with my partner for 4 years and living all together full time for 2 years. I am reminded that as I think these sorts of things, I am still exercising privilege as these sorts of things are not “options” for others, nor do they have the luxury to ruminate on them.

#blogjune 2016 recap

So that’s it then, blogging over for another year. Here’s where I promise that I’ll start blogging again and do so more often. Like I do most years and then fail to deliver :-) With that said, I did manage to increase my blogging rate a couple of months ago and have had a steady increase in advance of June this time round. That suggests I might have enough ticker to keep going. I could point to the list of 20 or so ideas on my list of potential posts but I can do that most years…even in 2014 where I only blogged 4 times during June. It’s less the ideas and more the inclination; getting round to writing and expanding the idea on one device or another.

As I often state, I blog for me and no one else, to inhabit an online space of my own. However I do like to look at the stats though I care not whether they’re good or bad. 2010 was my best year ever on this platform, coincidentally that was also the first year of #blogjune and 2012 seems to be the last, really good year before the drop off and general decline. Stats perked up in 2014 but that was also my second best year for blogging with 30 posts in June.

blog statistics

#blogjune has been running for 7 years now and I’ve managed to make it to 30 posts for 3 of those 7 years, including this year. I’m pretty happy with this year’s effort:

  • 33 posts
  • 10,000 words, averaging around 300 words per post

In 2010, my first #blogjune, and my most prolific, I managed 19,000 words in 34 posts, around 560 words per post. I’m sorta curious what my stats are like for each June rather than the annual tallies, graphed above. However, I haven’t worked out how to export the data easily and to be honest my stats, like my care factor, are pretty low :)

I have another post semi-written on whisky and feel like I can probably write a few more. One idea that’s on my list is to go through all the beers I’ve rated via untappd and list them, pointing out my favourites. Perhaps I should also write a post on how much I drink which is actually less than you might assume from all my alcohol references.


five in the evening

a hacker trilogy

Unsurprisingly I’m into SF movies, and tech oriented stuff generally. I’ve long had an interest in hacking, have read many a book on the subject, watched films and occasionally dabbled though never broken into anything. The worst I got was writing password traps at uni to catch the unwary. There was a great book many years ago, that I devoured at uni and keep a print version on the shelf: Hackers: Heroes of the Computer Revolution. A book full of anecdotes of the early days of computers and the early hackers, people who created new code and established the frontiers of computing.

I’ve long had an idea in my head of what I like to call a cinematic hacker trilogy; three films that portray hacking and engage with its history. There’s been lots of films around hacking and some are good and some not so good but three seem to have stood out in my head:

I love all three though I think the third is my favourite for capturing the sense of history, spicing it with the thrill of the game and a decent soundtrack. I’ve just rewatched WarGames and it holds up well though the acting and dialogue are clearly artifacts of the 80s. However the basic idea of stealing written-down passwords remains true enough today; the weakest link is always people. Sneakers features Robert Redford and Ben Kingsley and is very smooth, with a hacking group working semi-legit but built by an old school hacker. There are other movies in the genre, good and bad, but this trio sits best in my head.

five in the dark

