rabbit holes of adventure

Dinner table conversation tonight ended up on Mystery House, which my partner played occasionally when she was younger. Mystery House is known as the first graphical adventure game. That of course led the conversation into interactive fiction, referencing the top shelf of my bookcase which contains pretty much all of Infocom’s text adventures. I remember Zork II was my first text adventure and fiendish it was. I relied on adventure columns in computer game magazines of the time for clues on how to solve difficult puzzles including the horrible baseball diamond puzzle, also known as the Oddly Angled Room.

In those days, I couldn’t google answers and would spend months stuck on a problem. Sometimes that could be a good thing but mostly it was bloody frustrating. While there was a certain sense of achievement in solving puzzles, being stuck meant I couldn’t advance the story, as solving puzzles was essential to accessing further parts of the game. These days I think I prefer storytelling and plot development though solving puzzles is nice too. Happily most games provide decent hint mechanisms and if I get desperate I can google for answers.

Much to the shock of my partner, I commented that I usually have my text adventure collection stored on all my active machines as they are part of my central core of files that migrate across my various computing environments. This sounds substantial until you realise that text adventures, having few graphics, don’t take up a lot of space. My entire interactive fiction archive is a little over 100MB, of which the complete works of Infocom account for 95%. Come to think of it, they were the only ones I was able to buy as a box set later, the Lost Treasures of Infocom, and load in a system independent format.

interactive fiction games

The other key adventure game company of the time was Level 9. Infocom were American based, while Level 9 were from the UK and I had several of their games. Regrettably, while I still have the boxes, I no longer have the equipment to read the discs. Later on, graphic adventures developed further with Magnetic Scrolls, commencing with their first game, the fantastic The Pawn. I have several of their titles on my shelf too. Methinks I need to investigate further as to whether I can get these on my current machines. Come to think of it, I’ve barely mentioned Sierra On-Line who were responsible for Mystery House and later developed the King’s Quest and Space Quest series. Oh, and then there was Ultima…yet another rabbit hole…

big data is bigger than a mini

I broke my mac mini. Temporarily. I found its limits for big data stuff. It could handle running spark shell, with 75% of RAM allocated, to process 70GB of gzipped WARC files and produce a 355KB gephi file of links. But it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory at around 95GB. To be honest, I’m rather impressed that it handled 70GB. The mini has 8GB of RAM of which I allocated 6GB to running spark shell. Everything else slowed to a halt. Ultimately even that wasn’t enough. I eventually worked out that I could split my data up into three chunks and process each separately. I now have 3 gephi files totaling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
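The chunking idea can be sketched roughly as below. This is a minimal illustration in Python, not the Scala I ran in spark shell: it greedily assigns each file to whichever chunk is currently smallest, so the three chunks end up about the same size. The function name, file names and sizes are all made up for the example.

```python
def split_into_chunks(file_sizes, n_chunks=3):
    """Greedily split files into n_chunks groups of roughly equal total size.

    file_sizes: dict mapping file name -> size (any consistent unit).
    Returns a list of n_chunks lists of file names.
    """
    chunks = [[] for _ in range(n_chunks)]
    totals = [0] * n_chunks
    # Place the largest files first; each goes to the currently lightest chunk.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        lightest = totals.index(min(totals))
        chunks[lightest].append(name)
        totals[lightest] += size
    return chunks


# Hypothetical example: six WARC files with sizes in GB.
example_sizes = {
    "a.warc.gz": 40, "b.warc.gz": 35, "c.warc.gz": 30,
    "d.warc.gz": 35, "e.warc.gz": 30, "f.warc.gz": 30,
}
example_chunks = split_into_chunks(example_sizes)
for i, chunk in enumerate(example_chunks, start=1):
    total = sum(example_sizes[name] for name in chunk)
    print(f"chunk {i}: {total}GB in {len(chunk)} files")
```

Each chunk can then be processed in its own run, as long as the largest chunk stays under whatever the machine coped with before.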

The bigger question is: do I need to look at new hardware to handle this sort of stuff and if so, what direction do I go in? I’ve done a bunch of research and I’m unclear on what the best direction is…actually that’s not quite right, I know what the best direction is and it involves a chunk of money :-) What’s the best direction without spending stupid money? I figure there are three main groups:

  1. Intel NUC Skull Canyon vs Supermicro Superserver
  2. Odroid XU4 cluster – each board handles 2GB RAM and has an octo-core CPU (8 cores!) while the Raspberry Pi 3 is only 1GB per board
  3. Mini tower – takes up more space but cheaper than option 1 for more RAM and cores/threads with fewer cooling issues

The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.

In terms of cost, 2 is cheapest, then 3 then 1. 2 is also cutest :-) A cluster of 8 Odroid XU4 boards gives 16GB RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable, i.e. it can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300 while the Supermicro requires a little more but works out cheaper in the long run. To go the Skull route means each upgrade requires a new Skull at $1,400 a pop whereas the Supermicro requires around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.
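The Odroid numbers are just multiplication, sketched here in Python as a sanity check. The function name is made up for illustration, and the per-board figures are the ones quoted above.

```python
def cluster_totals(boards, ram_gb_per_board, cores_per_board):
    """Return (total RAM in GB, total cores) for a cluster of identical boards."""
    return boards * ram_gb_per_board, boards * cores_per_board


# 8 Odroid XU4 boards at 2GB RAM and 8 cores each.
odroid_ram, odroid_cores = cluster_totals(boards=8, ram_gb_per_board=2, cores_per_board=8)
print(f"Odroid cluster: {odroid_ram}GB RAM, {odroid_cores} cores")
```

One caveat worth keeping in mind: the 16GB is spread across eight separate boards, so any single process on one board still only sees 2GB.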

I like the idea of option 2 and building a cluster of Odroid XU4 but to be honest, I’m not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven’t worked out the answer to how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis. The skull beats it on raw speed but for tasks tailored for lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price has recently dropped on the barebones Skull from around AUD$900 to $780. To that you would add RAM and SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and SSD, though it can also handle other types of drives.

I’m not even sure I want to go down the “moar hardware” path at all. The mac mini can cope with the basic 70GB dataset and provide a sufficient environment to experiment and learn in. I’m fairly committed to going down this path and need to learn more about how to use spark and code in scala. I’m using scala mostly because it’s been referenced in the WARCbase guides I’ve started with. I don’t know enough to know if it’s the best path forward; on the other hand, any path forward is a good one.


identifying data

Wednesday and time to respond to an identity challenge from Paul :-) 4 questions about me and computer gear I like and I suspect question 1 and question 4 are going to be the hard ones. As this is a personal space, I tend not to talk about my work, or at least not directly. My about page provides hints of past and current jobs but that’s about it.

Who are you, and what do you do?

My name is snail. I use my real name at work though even there I’d prefer to use snail but all the systems are based around official names not nicknames. Sadly. Many folk know me as snail except security and the switchboard so turning up and asking for snail ain’t gonna work :-) I am the Online Resources Specialist Librarian at the State Library of NSW and I am responsible for working with eresources, dealing with vendors, contract management, budget management, EZproxy, eresource troubleshooting and support, eresource subscriptions and digital archive purchases…and stats…and more stats. I am the Library’s representative on the NSLA eResources Consortium. 3 years ago I implemented a project for whole of domain web harvesting of all government websites under *.nsw.gov.au and I’ve been running that ever since…I’ll be commencing the primary annual captures today. I may have been blogging about the web harvesting stuff recently :)

What hardware do you use?

At work, I have a basic laptop running Windows 7 plugged into a 24″ widescreen monitor, along with a Das Keyboard Professional 4 mechanical keyboard and a Logitech trackball. I have a Jabra bluetooth hub hooked up to the desk phone which is paired to my mobile hearing aid loop, enabling me to hear telephone calls through my hearing aids.

I have a personal laptop, a 2013 11″ Sony Vaio running Windows 10, which I use occasionally at work for external testing. At home, I have a mac mini connected to a 24″ widescreen monitor, with a Logitech G610 mechanical keyboard and a Logitech trackball. Behind the scenes I’m running a home server on a 4 bay QNAP TS-421 in RAID 5: each drive is 3TB for a total of 12TB raw, which I’m primarily using for backing up all my machines, running my iTunes server, and my photo archive. I have a 7″ Nexus (2013) tablet, a Samsung Galaxy S5 phone, and a Sony PRS-T2 ereader. Even a Psion 5mx that still works! I have several old keyboards too, assorted external hard drives and lots of USB sticks. :-)

And what software?

The machine at work is on Windows 7 and has just migrated to Office 365. The personal laptop is running Windows 10 and tends to run Open Office variants, has a VirtualBox running Linux Mint, and a few other odds and ends. The mac mini is running whatever is the current macOS and the phone and tablet are running Android. I’ve never been much good at this single operating environment malarkey :-) Some of my favourite software includes:

and more browser variants than I care to count including lynx.

What would be your dream setup?

I wish all my devices would talk better to each other, a universal standard for talking across different machines, operating systems and so on. More speed, more bandwidth and greater customisation options. I like things to look pretty, both the hardware and the software, and I don’t like it when fab looking customisations break things. I like working from home but like working near colleagues too and some way of merging the two environments would be fab. I want better ears to hear conversations and chit-chat.

board with keys

I spend a lot of my life in front of a keyboard. I have tried other sorts of things here and there but suspect I’m stuffed with anything other than a keyboard. A physical keyboard. I do not like the lack of physical feedback from virtual keyboards. I get by with phone or tablet; swiping + predictive text works well enough but is awkward for composing slabs of text and editing. I am most at home with a full keyboard.

Keyboards come in all shapes these days including small ones, big ones, some with fewer keys, some with all the keys, some with colours and flashing lights. Keyboards even have their own culture and groupies, depending on where you hang. “Tenkeyless” is a bit of a thing at the moment, referring to keyboards that don’t have a numeric keypad. When I was working vendor-side, I had a series of ThinkPad laptops (initially IBM then Lenovo, which took over IBM’s laptop division) and work supplied an external keyboard for the office. Both the laptop and external keyboards were decent; I still have the external IBM keyboard and used it recently and it holds up.

Logitech Wave

My keyboard of choice in recent years was the Logitech Wave. It was comfy and felt great to type on. When I find a keyboard I like I usually get the same keyboard for work and home. My current workplace supplies a basic keyboard which is fine as far as it goes but I replaced it with a Wave :-) If you can get away with it, it’s nice to swap in your own gear at work.

For a long time now, I’ve been reading about mechanical keyboards. These hark back to olden days when office typewriters were actual typewriters and computer keyboards emulated this approach. I’ve been trawling through my saved articles (using Pocket these days) and found this fabulous article by Justine Hyde on the love of typewriters. There’s something about the click clack of old school typewriters that appeals. I’ve been using computer keyboards since I was 12 and still have fond memories of learning to type on mum’s, or was it dad’s, portable typewriter.

I used to think mechanical keyboards were a bit of a trendy thing, focused on the noise of typing and hipster, old school style…not to mention being expensive. Mechanical keyboards can be very noisy. Most modern keyboards are membrane-based, where the board under the keys is a rubber membrane sending signals based on the character pressed, whereas mechanical keyboards have a specific switch for each key. Membrane keyboards are quiet and don’t disturb, whilst everyone knows if you’re using a mechanical keyboard. Stories continued to emerge of how nice mechanicals are to type on but still, they’re eccy: the Wave cost around $100 including mouse + wrist-rest, whereas a good mechanical keyboard is around $150+.

Recently I bit the bullet and went hunting for a decent mechanical keyboard. I have long admired Das Keyboard as dedicated keyboard enthusiasts; they even produce a keyboard with no labels for touch typists in pure, unadulterated style. I never actually learned to touch-type but suspect I would probably do ok on a blank keyboard. Instead I went looking for the Das Professional 4…with labels. The Das Pro 4 is regarded as the top end at around AUD$270 and doesn’t include a wrist-rest.

I popped in to Capitol Square in Sydney where there’s a nest of specialist computer shops but none had one and I eventually settled for the cheaper Logitech Orion G610 ($150) with Cherry Brown switches. A note on switches: these are the things that separate mechanical keyboards from typical membranes; each key has its own switch and Cherry is the top of the pile for switches. Cherry has several colours denoting noise and feel, with the clicky Cherry blues generally being the loudest. My reading suggested Cherry browns are quieter while maintaining decent tactility. Competing keyboard suppliers use Cherry too but some have developed their own switches, e.g. Logitech have developed Romer-G for their top end keyboards. I won’t get into the terminology of actuation points and so on as it can get a wee bit intense.

Das Keyboard Professional 4

I set up the Logitech G610 at home and oh my, it was an order of magnitude better than the Wave. Noisy yes, but so so nice to type on. However what I hadn’t understood was just how much faster mechanical keyboards are to type on; I thought I was a fairly quick typist but am faster still on a mechanical. Anyways I ended up ordering the Das Pro 4 (Cherry Brown) online and I set it up at work. While the G610 was an order of magnitude better than the Wave, the Das Pro 4 was that much better again and possibly quieter too. It was so much better than typing on the G610 that I’m now tempted to replace the G610 with a Das Pro 4 too. It’s all subjective: I was happy for years with the Wave; I was happy with the G610, and now I’m even happier with the Das Keyboard Professional 4.



a little low end

Ruminating further on my desire for more processing power, I’ve been thinking more on clusters and can’t help but feel that a lego rack of tiny motherboards is a rather cute direction to head in. My general idea at the moment is to look at building a cluster significantly more powerful than a mac mini but with a small footprint and not too expensive. While there are desktop towers and second-hand servers that can achieve much better performance, they take up a lot of physical space. A few things I’ve always been interested in when it comes to computing:

  • small size (or footprint) – I don’t want them to dominate the space
  • low weight
  • good battery life is also nice but less relevant in this scenario

The Intel NUC cluster is the high end version of the sort of setup that could work for me. However high end, cutting edge isn’t the only solution and comments on twitter reminded me that there are other, cheaper options for home use starting with the humble Raspberry Pi. Turns out there’s quite a bit of work in that area on a low end approach to supercomputing. While the overall speed per board won’t be huge, gains can be made in parallel computing, as a larger number of cores and threads increases the work done in these sorts of systems, and a cluster may work out better, and cheaper, than a single NUC.

train station in Manarola, Italy.

There’s been a lot of work with raspberry pi clusters and running boards in parallel with anything from 4 boards up to 200+; someone has even published instructions on building a 4 board Pi cluster in a mere 29 minutes. However the Pi isn’t the only option and another board that has developed a community is the Odroid series and they seem a wee bit more powerful than the Pi without being much more expensive.

The challenge I gather with Pi/Odroid setups is potentially around the ARM chipset, whereas the NUC, being Intel, is on a more common platform. ARM is a relatively slower chipset and doesn’t quite have the broad support of mainstream chipsets; however there seems to be a strong community around them. On the other hand, if you want to go down the Intel route, then there are credit card sized computers, like the Intel Edison, based on x86 chips. They’re literally the size/thickness of a credit card and can boot to standard linux. Clusters of these are even smaller, with a 10 card cluster that looks like it could fit in the palm of your hand.

Realistically, while it’s nice to dream I’m not actually that great with hardware stuff and I can see that 29 minute Pi cluster taking me most of the day…if I can get it to work at all. Yet it sounds so simple. I suspect it’s a matter of courage, patience and lots of google-fu. I get nervous when dealing with hardware and installing software, blindly running other people’s scripts and keeping my fingers crossed that if errors occur, they’re not too hard to resolve. The advantage of cheaper approaches is that I’m not too badly out of pocket if I can’t get it to work, a few hundred vs a few thousand. The other question is whether a tenth of the budget produces better than a tenth of the power?

gimme moar power!

I’m in my early days of playing with big data stuff though I’m finding the small systems I’m using a little on the slow side. Stuff takes a while to load, a while to run, a while to die. Lots of waiting. In Monday’s post, I commented about running code for extracting site structures from my data set and it taking 20-30 minutes before crashing when it encountered an unexpected file, then another 20-30 minutes to make a successful run. That did at least mean I had time to catch up on Masterchef :-) Interestingly, Gephi seemed to run faster on my work laptop (Windows 7) than it did on my mac mini. Mind you, neither system was built with this sort of processing in mind.

System 80 Computer (1982)

A friend recently introduced me to a new term, “nuc clusters”. These are based on Intel NUC (Next Unit of Computing) mini computers and they have been getting more and more impressive each year. They’re tiny computers, smaller than a mac mini but a little chunkier. The high end versions are close to AUD$1,000 and then you need to add high speed RAM and solid state drives (SSD). On the other hand, that includes a quad core Intel i7 chip, space for 32GB RAM and 2 high speed M.2 sockets for SSDs. For around $1,500 you end up with a “basic” system with some serious power on a really tiny footprint. Some enterprising folk have taken that a wee bit further and set up desktop server racks, networking multiple NUCs together into server clusters. There are naked versions that fit in a shoebox, where they’ve removed the casings and built mini racks for the motherboards, a briefcase version, and someone has even constructed a server rack out of lego.

In other words, while it’s a little pricey, you end up with some serious computing power that doesn’t take over your deskspace. While a 4 board system sounds really awesome, effectively running 16 cores, 128GB RAM and 4-8TB SSD (there are now 2TB SSDs but they’re not cheap), a starting price of $6,000 or so starts to sound scary. I had a look around at desktop tower based systems and a decently powered system probably starts around $3,000-4,000. Intel have recently announced a new CPU with 18 cores and 36 threads for around US$2,000 just for the CPU. Eep! Reading a few Whirlpool boards I discovered that there’s a big second-hand market for servers and racks. Prices are pretty good though they’re usually ex-server farm machines and take up a fair bit of physical space, not to mention I’m not especially confident with hardware stuff and getting one of these up and running may be a little more challenging.

Then of course, there are cloud based servers and clusters including Amazon Web Services (AWS) who even have an EMR (Elastic MapReduce) platform for running all the tools I’m currently playing with, and then some. Ian Milligan and co have even developed instructions for running warcbase in an AWS environment. I haven’t looked into these too much yet. Longer term I may need to give all this some more serious attention but for now the mac mini is at least adequate. Just.

using big data to create bad art

A few weeks back, I installed a lot of software on my computer at home with the plan to work out what to do with large data sets, particularly web archives. One of my roles at work is being responsible for managing and running the Library’s web archiving strategy and regularly harvesting publicly available government websites. That’s all fun and good but you end up with a lot of data and I think there’s close to 5TB in the collection now. The next tricks revolve around what you can use the data for and what sorts of data are worthwhile to make accessible. Under my current, non NBN, download speeds I estimate it would take a few months to download 5TB of data assuming a steady connection.
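That estimate is simple arithmetic, sketched below in Python. The connection speeds are hypothetical placeholders rather than measurements of my actual line, and the function name is made up for illustration.

```python
def download_days(size_tb, speed_mbps):
    """Days needed to download size_tb terabytes at a steady speed_mbps megabits/s."""
    size_bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = size_bits / (speed_mbps * 1e6)  # megabits per second -> bits per second
    return seconds / 86400                    # seconds -> days


# Assumed steady speeds in megabits per second; real connections rarely hold steady.
for speed in (4, 8, 25):
    print(f"{speed} Mbps: about {download_days(5, speed):.0f} days for 5TB")
```

At the slower assumed speeds this lands in the range of months, before allowing for dropouts or anything else sharing the connection.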

The dataset I’m using currently is a cohesive collection of publicly available websites containing approximately 68GB of data in 61 files. Each file is a compressed WARC file, WARC being the standard for Web ARChive files. Following some excellent instructions, I ran the scala code from step 1 in my local install of spark shell and successfully extracted the site structure. The code needed to be modified slightly to work with the pathname of my data set. The steps were roughly:

  • run Spark Shell with sufficient memory, I’m using 6 of my 8GB of RAM
  • run “:paste”
  • copy in scala code
  • hit “Control-D” to start the code analysing the data

I think that took around 20-30 minutes to run. The first time through, it crashed at the end as I’d left a couple of regular text files in the archive directory and the code sample didn’t handle those. Fair enough too, as it’s only sample code and not a full program with error detection and handling. I moved the text files out and ran it again. Second time through it finished happily.
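One way to avoid that sort of crash is to only pick up files that look like WARC archives in the first place. The sketch below is in Python rather than the Scala sample I was running, and is not part of warcbase; the function and file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def warc_files(directory):
    """Return sorted paths of files in directory that look like WARC archives."""
    suffixes = (".warc", ".warc.gz")
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.lower().endswith(suffixes)
    )


# Demo on a throwaway directory containing two archives and a stray text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("site-001.warc.gz", "site-002.warc.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in warc_files(tmp)]
print(found)
```

This only checks file names, so a mislabelled or corrupt file would still get through; moving stray files out of the directory, as I did, works just as well for a one-off run.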

The resultant file containing all the URLs and linkages was a total of 355KB, not bad for a starting data size of 68GB, and provides something a little more manageable to play with. Next step is to load the file into Gephi which is an open source data visualisation tool for networks and graphs. I still have little idea how to use Gephi effectively and am mostly just pressing different buttons and playing with layouts to see how stuff works. I haven’t quite got to the point of making visually interesting displays like the one shown in the tutorial, however I have managed to create some really ugly art:

ugly data analysis

I hit the point a while back where it’s no longer sufficient to play with sample bits and pieces and I need to sit down and learn stuff properly. To that end I ordered a couple of books on Apache Spark, then ordered another book, Programming in Scala, and am wondering whether I should also buy The Scala Cookbook. Or perhaps I shouldn’t try and do everything at once. I am reading both the Spark books concurrently as they’re aimed at different audiences and take different approaches. However after an initial spurt through the first couple of chapters, I haven’t touched them in a couple of weeks. I also need to learn how to use Gephi effectively and there are a few tutorials available for doing that. I should explore other visualisation tools as well and continue to look at what other sorts of tools can be used.