threading delights

I’ve had the new machine a few days and I’m starting to get the hang of it, but learning, lots of learning. Finding linux equivalents of windows tools and then working out how to install them. Troubleshooting unexpected java errors trying to get spark shell to compile properly – turns out I had the JRE but not the full JDK, which meant I had to download more stuff and update some config files as well as pathname references so the system knows where to find stuff.
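For anyone hitting the same JRE-versus-JDK confusion, here is a rough check, sketched in Python to keep it dependency free. It relies on the fact that the compiler, javac, ships with the JDK but not with a plain JRE. The function names are my own inventions for illustration, and looking at the PATH is only a heuristic: a JDK that is installed but not on the PATH will be reported as missing.

```python
import shutil


def java_tools(which=shutil.which):
    """Report where (if anywhere) java and javac are found on the PATH."""
    return {tool: which(tool) for tool in ("java", "javac")}


def describe_java_install(tools):
    """Turn the lookup result into a one-line diagnosis."""
    if tools["java"] and tools["javac"]:
        return "JDK found"
    if tools["java"]:
        # The runtime is present but the compiler is not.
        return "JRE only: java is present but javac is missing"
    return "no Java found on PATH"


if __name__ == "__main__":
    print(describe_java_install(java_tools()))
```

If it reports "JRE only", installing the full JDK and pointing the relevant config at it is the likely fix.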

As it turns out I completely misread the new pages for Archives Unleashed and didn’t see the black menu bar at the top of the screen for all of the docs. Was a little too tired methinks. I installed stuff using old versions of the docs I found on the wayback machine and other bits. Consequently I’ve ended up with a more recent version of Archives Unleashed (a bit of a mouthful after the easier “warcbase”) with 0.10.1 instead of 0.9.0 and I’m running a current version of Spark Shell, 2.2.0, instead of 1.6.1. Anyway it all works…I think.

The next headache was that my harvest test data was still on the mac mini. I wasn’t sure how to get the data across as I couldn’t write to a windows hard drive from the mac. Then had the bright idea of copying the data, 56 files for a total of 80GB, to my home server via wifi. That took 6 hours…to the server, so I went away and did other things. Towards the end of that process I had a bit of time so I worked out that if I formatted a drive for the mac in exFAT format, I could install some utilities in linux to read it. That took an hour, half hour to copy to the drive, half an hour from the drive to linux. Phew.

Then I tried running the Scala code for extracting the site structure and ran into a few errors as about 15% of the files had developed an error somewhere along the way. I removed all the broken files leaving me with 47 usable ones. All up, it took 18 minutes to process the data, not quite as fast as I was hoping. On the other hand, the advantage of having lots of ram is that there was plenty of space to do other things. Running the same job on the mac mini with dual core CPU and 8GB RAM brought it to a grinding halt and nothing else was possible. On the new machine, I could run everything else normally including web browsing and downloads.
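Weeding out the broken files can be scripted. What follows is a minimal sketch in Python, not what I actually ran: it assumes the harvest files are gzipped and match `*.warc.gz`, and it only checks that each gzip stream decompresses cleanly from start to finish. It does not validate the WARC records inside. The function names are hypothetical.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so a large file never sits in memory.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad headers and checksums raise OSError, truncation raises EOFError,
        # and a corrupt compressed stream raises zlib.error.
        return False


def split_by_health(directory, pattern="*.warc.gz"):
    """Sort the matching files into (usable, broken) lists."""
    usable, broken = [], []
    for path in sorted(Path(directory).glob(pattern)):
        (usable if is_readable_gzip(path) else broken).append(path)
    return usable, broken
```

Calling `split_by_health` on the harvest directory returns the two lists, so the broken files can be moved aside before pointing spark shell at the rest.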


Regardless of whether I allocated 5gb, 10gb, 24gb, or even 28gb of RAM, time taken to process still hovered around 18 minutes. With 28gb allocated it only needed around 15gb to process, as can be seen in the above screenshot of htop. The other nice thing about htop is that it demonstrated that all 8 CPU threads were in use. Where I think I saved some time is that swap doesn’t seem to have been required which would have reduced some overheads. Either that, or I haven’t worked out how to use swap memory yet.

Still very early days.

some new tech

Following my fun in July when I hit a bit of a wall in playing with large data sets and brought my mac mini to a grinding halt, I ruminated on next steps. Wall aside, it was a wee bit frustrating that experiments on larger data sets took a long time to run and that’s been a bit off-putting to further progress. So I decided that I really did need a new machine and was going to get an intel NUC skull canyon as it was small and fast. I waited for Intel to announce their new 8th generation CPUs which they did recently. Unfortunately the upgrade to the current 6th generation Skull isn’t due till Q2 2018.

On the other hand, prices have been dropping on the barebones Skull and you can pick one up for around AUD$700. However a retailer pointed out to me recently that the ASUS VivoMini, while pricier, uses 7th generation CPUs. Plus it’s a cuter box. After some umming and ahhing, I ordered the vivomini with 32GB RAM and an additional 1TB drive (it includes a 256GB SSD in the m.2 port). The CPU is a 7th generation quad core i7. Total cost was around AUD$1,700 whereas a similarly set up Skull would have been around $1,400-1,500. It has a small footprint and sits nicely on top of the mac mini.


Picked it up yesterday and it booted straight into windows. Today, somewhat trepidatiously, I had a go at setting it up to dual boot with linux. The last few years I’ve been running linux via virtualbox on windows and that’s been sufficient. It’s been a long, long, long time since I set up a dual boot machine and that was using debian which was a wee bit challenging at the time.

This time round it was all easy as. I followed some straightforward instructions carefully and tested initially on a live boot via USB and then used that USB to install it properly. I’ve booted back and forth between windows and linux several times just to be sure and so far so good. I’m currently writing this blog via firefox in ubuntu. My next step was going to be to set up warcbase however that’s been deprecated as Ian Milligan and his team have received a new grant and are working on building an updated environment under their Archives Unleashed Toolkit. So I’ll play with that instead :) Regardless I’ll still need to get Apache Spark up and running which is likely my next step.

rabbit holes of adventure

Dinner table conversation tonight ended up chatting about Mystery House, which my partner played occasionally when she was younger. Mystery House is known as the first graphical adventure game. That of course led the conversation into interactive fiction, referencing the top shelf of my bookcase which contains pretty much all of Infocom’s text adventures. I remember Zork II was my first text adventure and fiendish it was. I relied on adventure columns in computer game magazines of the time for clues on how to solve difficult puzzles including the horrible baseball diamond puzzle, also known as the Oddly Angled Room.

In those days, I couldn’t google answers and would spend months stuck on a problem. Sometimes that could be a good thing but mostly it was bloody frustrating. While there was a certain sense of achievement in solving puzzles, it meant I couldn’t advance the story. Solving puzzles was essential to accessing further parts of the game. These days I think I prefer story telling and plot development though solving puzzles is nice too. Happily most games provide decent hint mechanisms and if I get desperate I can google for answers.

Much to the shock of my partner, I commented that I usually have my text adventure collection stored on all my active machines as they are part of my central core of files that migrate across my various computing environments. This sounds substantial until you realise that text adventures, having few graphics, don’t take up a lot of space. My entire interactive fiction archive is a little over 100MB, of which the complete works of Infocom account for 95%. Come to think of it, they were the only ones I was able to buy as a box set later, the Lost Treasures of Infocom, and load in a system independent format.

interactive fiction games

The other key adventure game company of the time was Level 9. Infocom were American based, while Level 9 were from the UK and I had several of their games. Regrettably, while I still have the boxes, I no longer have the equipment to read the discs. Later on, graphic adventures developed further with Magnetic Scrolls commencing with their first game, the fantastic The Pawn. I have several of their titles on my shelf too. Methinks I need to investigate further as to whether I can get these on my current machines. Come to think of it, I’ve barely mentioned Sierra Online who were responsible for Mystery House and later developed the King’s Quest and Space Quest series. Oh, and then there was Ultima…yet another rabbit hole…

big data is bigger than a mini

I broke my mac mini. Temporarily. I found its limits for big data stuff. It could handle running spark shell with 75% RAM allocated processing 70GB of gzipped WARC files to produce a 355KB gephi file of links. But it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory at around 95GB. To be honest, I’m rather impressed. I was impressed that it handled 70GB. The mini has 8GB of RAM of which I allocated 6GB to running spark shell. Everything else slowed to a halt. Ultimately even that wasn’t enough. I eventually worked out that I could split my data up into three chunks and process separately. I now have 3 gephi files totaling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
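Splitting the data into chunks is easy enough to do by hand, but here is one way it could be scripted so that each chunk ends up roughly the same size. This is a sketch in Python under my own assumptions, not the method I used: it takes a mapping of file name to size in bytes and uses a simple greedy rule, largest file first, each one going into whichever chunk is currently lightest. The function name is made up for illustration.

```python
def split_into_chunks(sizes, chunk_count):
    """Spread files over chunk_count groups with roughly equal total size.

    sizes maps a file name to its size in bytes.
    Returns the groups of names and the total size of each group.
    """
    chunks = [[] for _ in range(chunk_count)]
    totals = [0] * chunk_count
    # Place the largest files first, each into the currently lightest chunk.
    # Ties on size are broken by name so the result is repeatable.
    for name, size in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        lightest = totals.index(min(totals))
        chunks[lightest].append(name)
        totals[lightest] += size
    return chunks, totals
```

The sizes could be collected with `Path.stat().st_size` for each file, and each resulting chunk moved into its own directory to be processed in a separate run.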

lots of cars on the road in Borneo

The bigger question is: do I need to look at new hardware to handle this sort of stuff and if so, what direction do I go in? I’ve done a bunch of research and I’m unclear on what the best direction is…actually that’s not quite right, I know what the best direction is and it involves a chunk of money :-) What’s the best direction without spending stupid money? I figure there are three main groups:

  1. Intel NUC Skull Canyon vs Supermicro Superserver
  2. Odroid XU4 cluster – each board handles 2gb RAM and they have an octo-core (8 cores!) while Raspberry Pi-3 is only 1gb per board
  3. Mini tower – takes up more space but cheaper than option 1 for more RAM and core/threads with less cooling issues

The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.

In terms of cost, 2 is cheapest, then 3 then 1. 2 is also cutest :-) a cluster of 8 Odroid XU4 gives 16gb RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable ie can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300 while the SuperMicro requires a little more but works out cheaper in the long run. To go the Skull route means each upgrade requires a new skull at $1,400 a pop whereas the Supermicro requires around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.

I like the idea of option 2 and building a cluster of Odroid XU4 but to be honest, I’m not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven’t worked out the answer to how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis. The skull beats it on raw speed but for tasks tailored for lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price has recently dropped on the barebones Skull from around AUD$900 to $780. To that you would add RAM and SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and SSD, though it can also handle other types of drives.

Motorbike in Borneo

I’m not even sure I want to go down the “moar hardware” path at all. The mac mini can cope with the basic 70GB dataset and provide a sufficient environment to experiment and learn in. I’m fairly committed to going down this path and need to learn more about how to use spark and code in scala. I’m using scala mostly because it’s been referenced in the WARCbase guides I’ve started up with. I don’t know enough to know if it’s the best path forward; on the other hand, any path forward is a good one.


#blogjune 2017 recap

Done and dusted for another year. Here we are 4 days later and this is my first post since June finished. Stats are a funny old thing, the only ones I have to count are for folk who specifically visit my page. I have no idea how many other people are reading my posts via feed readers such as feedly. It’s possible to make a rough estimate as feedly does show a subscriber count for each of its feeds but I’m unclear as to its accuracy, having read conflicting accounts. All in all, it sounds like work is required to get accurate figures and my care factor is a little too low for that :-)

blog statistics

Looking at the graph above, direct access seems to have dropped off a little in 2017 but has been mostly steady for the last few years. With that said, the 2017 figure is based on the year so far ie the first six months. That suggests, if I can manage to keep blogging, that 2017 is shaping up to be my best year since at least 2014, and potentially since 2012. I think 2012 was the first year that wordpress broke down the difference between views and visitors, as noted by the darker shading in the column.

In terms of my #blogjune blogging, I managed to just scrape in:

  • 30 posts
  • 10,700 words, averaging 357 words per post

3 posts fewer but 700 words more than my 2016 effort. I think I managed to blog about most things I thought I would, though I never got round to blogging properly about whisky, despite having a few ideas in draft. My top 5 posts were:

which seems a mix of interesting and pedestrian, so here’s the next 5 as well:

which is a more interesting list of titles :-)

techie librarian; meatier than a seahorse


Tag lines…whatever do you use for your tagline: the subheading of your identity, the punchline by which people establish a connection. Mostly I pay them lip service, smiling occasionally at a clever one. My own tend to refer to variations of: techie, librarian and eclectic, sometimes all 3 at once.

Spinning down a rabbit hole of curiosity, as things are wont to do when Matt Finch is involved, a rather wayward recent conversation turned from roasting penguins to eating seahorses.

I participated in a workshop as part of NLS8 and the first activity was for everyone to sketch a scene, in 90 seconds, on a piece of A4 using at least one of three figures on a screen: 2 humans (or human-like) and a penguin. As is my wont, I immediately gave into the dark side and sketched the two humans roasting the penguin. The second half of the activity was for each table to construct a cohesive story using those scenes as panels. They were two quick activities that worked really well as an icebreaker and got you thinking about how easy it was to come up with ideas under pressure.

The seahorses came later…or rather many years earlier:

to which I responded with my “meatier than seahorse” remark and commented elsewhere that while I have never eaten penguin, I have actually eaten seahorse.

Many years ago, 2003 I think (really must upload those photos to flickr), I spent a few weeks on an Intrepid trip in China with friends. We started in Beijing and went to the Beijing night markets, a place where you can eat just about anything including silk worms and even scorpions on a stick. Scorpions were a wee bit scary but we figured they had to be ok as no one was dropping dead. As far as we can figure, they’re bred without their stinger.

While trying to order something else, there was a language issue, and I ended up with seahorse on a stick. I think the scorpions were about 20 cents for five whereas the seahorse was a few Oz dollars for one. Our tour guide tried to talk our way out of it but the shopowner insisted. So I paid for it and ate it. There wasn’t much flavour as it was primarily shell with perhaps a tiny morsel of meat.

Matt suggested “meatier than a seahorse” as a bio and it immediately rang the right sort of bells, both physically and metaphorically. I am now using it for all my taglines :-)