techie librarian; meatier than a seahorse


Taglines…what do you use for your tagline: the subheading of your identity, the punchline by which people establish a connection? Mostly I pay them lip service, smiling occasionally at a clever one. My own tend to be variations on techie, librarian and eclectic, sometimes all three at once.

A rather wayward recent conversation, spinning down a rabbit hole of curiosity as things are wont to do when Matt Finch is involved, turned from roasting penguins to eating seahorses.

I participated in a workshop as part of NLS8 and the first activity was for everyone to sketch a scene, in 90 seconds, on a piece of A4, using at least one of three figures on a screen: two humans (or human-like figures) and a penguin. As is my wont, I immediately gave in to the dark side and sketched the two humans roasting the penguin. The second half of the activity was for each table to construct a cohesive story using those scenes as panels. They were two quick activities that worked really well as an icebreaker and got you thinking about how easy it is to come up with ideas under pressure.

The seahorses came later…or rather many years earlier:

to which I responded with my “meatier than seahorse” remark and commented elsewhere that while I have never eaten penguin, I have actually eaten seahorse.

Many years ago, 2003 I think (really must upload those photos to flickr), I spent a few weeks on an Intrepid trip in China with friends. We started in Beijing and went to the Beijing night markets, a place where you can eat just about anything, including silkworms and even scorpions on a stick. The scorpions were a wee bit scary but we figured they had to be OK as no one was dropping dead. As far as we could figure, they’re bred without their stingers.

While trying to order something else, there was a language issue and I ended up with seahorse on a stick. I think the scorpions were about 20 cents for five whereas the seahorse was a few Oz dollars for one. Our tour guide tried to talk us out of it but the shopowner insisted. So I paid for it and ate it. There wasn’t much flavour as it was primarily shell, with perhaps a tiny morsel of meat.

Matt suggested “meatier than a seahorse” as a bio and it immediately rang the right sort of bells, both physically and metaphorically. I am now using it for all my taglines :-)

a little low end

Ruminating further on my desire for more processing power, I’ve been thinking more about clusters and can’t help but feel that a Lego rack of tiny motherboards is a rather cute direction to head in. My general idea at the moment is to look at building a cluster significantly more powerful than a Mac mini but with a small footprint and not too expensive. While there are desktop towers and second-hand servers that can achieve much better performance, they take up a lot of physical space. A few things I’ve always been interested in in computing:

  • small size (or footprint) – I don’t want them to dominate the space
  • low weight
  • good battery life is also nice but less relevant in this scenario

The Intel NUC cluster is the high-end version of the sort of setup that could work for me. However, high-end and cutting edge isn’t the only solution, and comments on Twitter reminded me that there are other, cheaper options for home use, starting with the humble Raspberry Pi. It turns out there’s quite a bit of work in that area on a low-end approach to supercomputing. While the speed of an individual board won’t be huge, gains can be made in parallel computing, where a good number of cores and threads increases the work done; that may work out better, and cheaper, than a single NUC.

train station in Manarola, Italy.

There’s been a lot of work with Raspberry Pi clusters, running boards in parallel with anything from 4 boards up to 200+; someone has even published instructions on building a 4-board Pi cluster in a mere 29 minutes. However, the Pi isn’t the only option: another board that has developed a community is the Odroid series, which seems a wee bit more powerful than the Pi without being much more expensive.

The challenge with Pi/Odroid setups, I gather, is potentially around the ARM chipset, whereas the NUC, being Intel, is on a more common platform. ARM is relatively slower and doesn’t quite have the broad support of mainstream chipsets, though there seems to be a strong community around it. On the other hand, if you want to go down the Intel route, there are credit-card-sized computers, like the Intel Edison, based on x86 chips. They’re literally the size and thickness of a credit card and can boot standard Linux. Clusters of these are even smaller, with a 10-card cluster that looks like it could fit in the palm of your hand.

Realistically, while it’s nice to dream, I’m not actually that great with hardware stuff and I can see that 29-minute Pi cluster taking me most of the day…if I can get it to work at all. Yet it sounds so simple. I suspect it’s a matter of courage, patience and lots of google-fu. I get nervous when dealing with hardware and installing software, blindly running other people’s scripts and keeping my fingers crossed that if errors occur, they’re not too hard to resolve. The advantage of the cheaper approaches is that I’m not too badly out of pocket if I can’t get it to work: a few hundred vs a few thousand. The other question is whether a tenth of the budget produces more than a tenth of the power.
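That last question is just arithmetic, so here’s a back-of-the-envelope way to frame it. All the numbers in the example are invented purely for illustration, not real benchmarks for any of these boards:

```python
# Back-of-the-envelope maths with made-up, illustrative numbers --
# none of these are real benchmarks for any board.
def value_ratio(cluster_cost, cluster_power, single_cost, single_power):
    """How much compute-per-dollar a cheap cluster gives relative to a
    single pricier machine (>1.0 means the cheap option wins on value)."""
    return (cluster_power / cluster_cost) / (single_power / single_cost)

# e.g. a hypothetical $300 Pi cluster delivering 15 arbitrary 'units'
# of compute vs a hypothetical $3000 setup delivering 100 units:
ratio = value_ratio(300, 15, 3000, 100)  # 1.5x the power per dollar
```

So a cheap cluster doesn’t have to match the big box outright; it only has to beat it on power per dollar to be the better tinkering buy.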

using big data to create bad art

A few weeks back, I installed a lot of software on my computer at home with the plan to work out what to do with large data sets, particularly web archives. One of my roles at work is being responsible for managing and running the Library’s web archiving strategy and regularly harvesting publicly available government websites. That’s all well and good but you end up with a lot of data; I think there’s close to 5TB in the collection now. The next tricks revolve around what you can use the data for and what sorts of data are worthwhile to make accessible. At my current, non-NBN download speeds, I estimate it would take a few months to download 5TB of data, assuming a steady connection.

The dataset I’m using currently is a cohesive collection of publicly available websites containing approximately 68GB of data in 61 files. Each file is a compressed WARC file, WARC being the standard format for Web ARChive files. Following some excellent instructions, I ran the Scala code from step 1 in my local install of the Spark shell and successfully extracted the site structure. The code needed to be modified slightly to work with the pathname of my data set. Roughly:

  • run the Spark shell with sufficient memory (I’m using 6 of my 8GB of RAM)
  • run “:paste”
  • copy in the Scala code
  • hit “Control-D” to start the code analysing the data

I think that took around 20-30 minutes to run. The first time through, it crashed at the end as I’d left a couple of regular text files in the archive directory and the code sample didn’t handle those. Fair enough too, as it’s only sample code and not a full program with error detection and handling. I moved the text files out and ran it again. Second time through it finished happily.

The resultant file containing all the URLs and linkages was a total of 355kB, not bad for a starting data size of 68GB, and provides something a little more manageable to play with. The next step is to load the file into Gephi, an open-source data visualisation tool for networks and graphs. I still have little idea how to use Gephi effectively and am mostly just pressing different buttons and playing with layouts to see how stuff works. I haven’t quite got to the point of making visually interesting displays like the one shown in the tutorial, however I have managed to create some really ugly art:

ugly data analysis
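As an aside, the graph files Gephi reads are mostly plain text; its simple GDF format, for instance, can be generated from link pairs in a few lines. A hypothetical sketch with only the bare name/node1/node2 columns (real GDF supports many more attributes):

```python
def to_gdf(edges):
    """Serialise (src, dst) link pairs into a minimal GDF string that
    Gephi can open: a node table followed by an edge table."""
    nodes = sorted({n for edge in edges for n in edge})
    lines = ["nodedef>name VARCHAR"]
    lines.extend(nodes)
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR")
    lines.extend(f"{src},{dst}" for src, dst in edges)
    return "\n".join(lines)
```

Knowing the format is just text makes it much less mysterious when a load fails: you can open the file and eyeball it.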

I hit the point a while back where it’s no longer sufficient to play with sample bits and pieces and I need to sit down and learn stuff properly. To that end I ordered a couple of books on Apache Spark, then ordered another book, Programming in Scala, and I’m wondering whether I should also buy The Scala Cookbook. Or perhaps I shouldn’t try to do everything at once. I am reading both Spark books concurrently as they’re aimed at different audiences and take different approaches. However, after an initial spurt through the first couple of chapters, I haven’t touched them in a couple of weeks. I also need to learn how to use Gephi effectively, and there are a few tutorials available for that. I should explore other visualisation tools too and continue to look at what other sorts of tools can be used.

a lethargic 5

I’m happy to report that full connectivity was restored to the house a day or two before we were due to fly to NZ. Returning home on Sunday, I was happy to discover that we still had net :) I’ll talk more about the NZ trip and tramping the Kepler Track once I’ve sorted out the photos and loaded them to flickr. I have about 130 photos that I need to weed, though that should be relatively quick compared to weeding my photos from the European holiday over Dec/Jan. For the Europe set, I’ve managed to get it down to under 300 from around 700 but it still needs a couple more goes. I should have the Kepler set up this week at least.

While I’m in post NZ recovery, here’s 5 random things I’ve tweeted in recent months:

baby steps with web harvests

One of the things I’m interested in is working with data sets around web harvesting and archiving. I’ve spent a bit of time over the years exploring the Internet Archive and other web archives, and I’m hitting the point where I’d like to understand the sorts of information gathered when you harvest a bunch of websites. What can be discerned from a site’s structure? How does it change over time? Are there other useful directions to explore?

When you harvest websites you end up with a bunch of files in the WARC format. So far, in my limited experience, a typical WARC file is about a gig and one harvest can contain lots of these files. Depending on how you set up your harvester, you can save all the content on a site, including office files, music, video and so on. A harvest captures a website at one moment in time, and with repeated harvests it’s possible to get a sense of how it changes over time. As part of learning how all this works, I’m using a small archive of 72 WARC files that total roughly 55GB.
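Under the hood a WARC file is just a series of records, each with a small header block (a version line like “WARC/1.0”, then header fields, then a blank line and the payload). A toy parser for one record’s headers, purely a sketch and nothing like the production-grade handling in tools such as Warcbase:

```python
def parse_warc_headers(record):
    """Pull the version line and header fields out of a single WARC
    record's header block. Toy sketch only: real tools handle the full
    spec, gzip compression, payload digests and so on."""
    head, _, _payload = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return version, headers
```

Fields like WARC-Target-URI and WARC-Date are what make the “one moment in time” idea concrete: every captured page carries its own URL and timestamp.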

Having successfully installed lots of software on my machine at home, I might actually be ready to start experimenting. I’ve been following the Getting Started guide for installing Warcbase (a platform for managing web archives) and associated software on a Mac mini. While time consuming, it’s actually been straightforward, and installing software on the Mac has seemed easier than installing similar stuff under Windows a year or so back. Of that guide, I have completed steps 1, 2, 3 and 5. Step 4 involves installing Spark Notebook but the primary site seems to be down at the moment, so I’ve installed Gephi to handle data visualisation. As a result I am now running:

  • Homebrew – macOS package manager
  • Maven 3 – software project management tool
  • Warcbase – web archive platform built on Hadoop and HBase
  • Apache Spark – an engine for large-scale data processing
  • Gephi – data visualisation

In other words, a bunch of tools for dealing with really large data sets installed on a really small computer :-) I’d originally bought the Mac mini to migrate my photo collection from a much older Mac Pro and hadn’t considered it as a platform for doing large-scale data stuff. So far, it’s holding up, though I am feeling the limits of having only 8GB of RAM.

All those tools can be used on really big systems and run across server clusters. Thankfully, they also work on a single system, but you have to keep the data chunks small. I tried analysing the entire 55GB archive in one go but Spark spat out a bunch of errors and crashed. Running it file by file, where each file is up to a gig, seems to be working so far.

There’s been no working internet at home for a couple of weeks so I’ve been hampered in what help I can look up, but at least I had all the software installed before we lost the connection. Spark may have had issues for a different reason, e.g. I may not have specified the directory path correctly, but I couldn’t easily google the errors.

I’m trying out a script in Spark to generate the site structure from each archive; this typically produces a file of about 2-3kB from a 1GB file of data. The script is able to write to Gephi’s file format, GDF. Gephi supports loading lots of files and merging them into one, which means I can run a file-by-file analysis and then combine the results at the visualisation stage. I haven’t worked out the code to run the script iteratively for each file and am manually changing the file name each time. The ugly image below is my first data load into Gephi showing the interlinking URL nodes. I haven’t done anything with it; it is literally the first display screen. However, it does indicate that I might at last be heading in a useful direction.

visualisation of website nodes
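On the run-it-for-each-file gap: a small wrapper could loop over the archive directory instead of me editing the file name by hand. A sketch, where analyse_one stands in for whatever actually kicks off the analysis, and the .warc.gz naming is an assumption about the layout rather than my real setup:

```python
import glob
import os

def analyse_all(archive_dir, analyse_one):
    """Run analyse_one(warc_path, output_path) over every WARC file in
    a directory, producing one GDF per input, instead of manually
    editing the file name for each run."""
    outputs = []
    for path in sorted(glob.glob(os.path.join(archive_dir, "*.warc.gz"))):
        out = path + ".gdf"
        analyse_one(path, out)
        outputs.append(out)
    return outputs
```

A nice side effect of the glob pattern is that stray text files in the directory, the ones that crashed my earlier run, simply get skipped.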

Next steps include learning how to write scripts myself and learning how to use Gephi to produce a more meaningful visualisation.


keep on trying

I have been reading feminist literature for a fair chunk of my life (and had conversations with feminist friends and generally tried to live a more otherness-aware life), yet I continue to discover areas where I, as a white male, still fail. Or perhaps not so much fail as not quite understand the perspective of others. Sometimes this happens repeatedly…for years…before I manage to get a clue.

The latest one was this:

In tech, a common pattern is for hiring managers to say “I don’t care who you are, just show me your hobby projects on github, or your think-pieces on medium” but a bit of reflection is all it takes to realize that screening based on free-time pursuits gets you more affluent white men than it does underemployed single moms.

Time to do stuff is not equal for everybody. In library circles that can mean time to spend on conference committees, to do volunteer work, to admin elists and so on. For most of my life I was a single white male with a job and seemingly endless amounts of free time. That in turn raises the question of why I haven’t done so much more, which is answered by my sheer, bloody laziness and ability to procrastinate.

However, I have continued to say that volunteering requires giving up your own time to do stuff for others. That’s easy when you’re a single white male. That’s what privilege looks like. I now live with a partner and she has 3 kids. My ability to “give up” time no longer exists; I no longer have privileged control over my time. My time is shared with others. She has given up a lot of time for the kids.

That’s not a complaint about my current situation.

In the old days, I’d come home from work, switch on the TV or the PlayStation, perhaps get round to cooking, or more likely heat up a frozen dinner (as I hate cooking), pour myself some wine and so on. These days, we return home and the number for dinner is variable and occasionally unknown, depending on the movements of the kids (15, 18, 21). Post dinner is variable depending on what others are doing…time on the PlayStation is negotiated. I’m playing Skyrim again and I can’t spend entire weeks or months playing it like I could when I was alone. There might be alternative options, e.g. my partner and I are currently watching Once Upon a Time with Ms15, though they sometimes like to watch Nashville together, which I’m not interested in.

I seem to have little time to volunteer, or at least I think I have little time, and that may be more a reflection that I’m still trying to think in the context of my old life while adapting to my new one. Adaptation is taking me a long time, as I’ve been with my partner for 4 years and we’ve been living together fulltime for 2. I am reminded that even in thinking about these things I am still exercising privilege, as others don’t have these “options”, nor the luxury to ruminate on them.

#blogjune 2016 recap

So that’s it then, blogging over for another year. Here’s where I promise that I’ll start blogging again and do so more often, like I do most years, and then fail to deliver :-) With that said, I did manage to increase my blogging rate a couple of months ago and have had a steady increase in advance of June this time round. That suggests I might have enough ticker to keep going. I could point to the list of 20 or so ideas on my list of potential posts, but I can do that most years…even in 2014 when I only blogged 4 times during June. It’s less the ideas and more the inclination: getting round to writing and expanding the idea on one device or another.

As I often state, I blog for me and no one else, to inhabit an online space of my own. However, I do like to look at the stats, though I care not whether they’re good or bad. 2010 was my best year ever on this platform; coincidentally, that was also the first year of #blogjune. 2012 seems to be the last really good year before the drop-off and general decline. Stats perked up in 2014, but that was also my second best year for blogging, with 30 posts in June.

blog statistics

#blogjune has been running for 7 years now and I’ve managed to make it to 30 posts for 3 of those 7 years, including this year. I’m pretty happy with this year’s effort:

  • 33 posts
  • 10,000 words, averaging around 300 words per day

In 2010, my first #blogjune and my most prolific, I managed 19,000 words in 34 posts, around 560 words per post. I’m sorta curious what my stats are like for each June rather than the annual tallies graphed above. However, I haven’t worked out how to export the data easily and, to be honest, my stats, like my care factor, are pretty low :)

My top 5 posts were:

I also started, or attempted to start, a series on alcohol, or at least whisky:

I have another post semi-written on whisky and feel like I can probably write a few more. One idea on my list is to go through all the beers I’ve rated via untappd and list them, pointing out my favourites. Perhaps I should also write a post on how much I drink, which is actually less than you might assume from all my alcohol references.