knuth

I often say professionally that I did a compsci major (though can never claim it officially) yonks ago but decided against becoming a programmer. That’s not a decision I regret mostly, though it must be said I continue to have strong leanings that direction. Scarily, it’s been over 25 years since those compsci days. Still, I learnt good stuff.

I recall in the second half of first year compsci, we had an older lecturer who was actually a mathematician who seemed to have come across into computers. I can say “older” as I’ve just found this bio which very briefly sums up a rather fascinating career. He may even have been one of my favourite lecturers as he liked to play with new ideas and brought stuff he knew from maths into computing. I was a very rare beast in compsci in that I was enrolled under a BA and not directly in compsci, and I did no further maths: I had done first year maths but it wasn’t quite my bag. Doherty was very big on mathematical ideas and on assessing the efficiency of algorithms.

I recall him talking about some weird algorithm for encrypting data; he worked through the basic idea in a lecture, and I think it was based on some sort of fractional encoding model. At the end of the lecture, he said the next assignment would be to implement it. I found the idea fascinating. The next assignment came out and sure enough it was on encryption, so I implemented the algorithm he’d talked about in Pascal, based on my lecture notes. The idea was that you’d write code to encrypt a paragraph of text, and code to decrypt it again. I was mostly successful, but because it relied on converting text to and from large decimal numbers, it rapidly lost accuracy on the Macs we were using at the time. Out of a sentence of 10 words, it started losing letters by the end of the first word.
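I no longer have the Pascal, but the gist of that fractional idea can be sketched in a few lines of Scala (the language I’ve been using for the warcbase stuff lately). This is my toy reconstruction, not Doherty’s actual scheme: the 27 character alphabet and the equal-width intervals are my assumptions, and like the Pascal version it leans on floating point precision.

```scala
// Toy fractional encoder: each character narrows an interval inside [0, 1),
// and the midpoint of the final interval is the "encrypted" number.
// A reconstruction of the general idea only; the alphabet and uniform
// interval widths are assumptions, not the original scheme.
object FractionalToy {
  val alphabet = "abcdefghijklmnopqrstuvwxyz "   // 27 symbols
  val width = 1.0 / alphabet.length

  def encode(text: String): Double = {
    var low = 0.0
    var high = 1.0
    for (c <- text) {
      val i = alphabet.indexOf(c)
      val range = high - low
      high = low + range * (i + 1) * width
      low = low + range * i * width
    }
    (low + high) / 2             // any value inside the final interval will do
  }

  def decode(value: Double, length: Int): String = {
    var v = value
    val sb = new StringBuilder
    for (_ <- 0 until length) {
      val i = math.min((v / width).toInt, alphabet.length - 1)
      sb += alphabet(i)
      v = v / width - i          // rescale the remainder back to [0, 1)
    }
    sb.toString
  }
}
```

A short message round-trips fine, but each character effectively multiplies the rounding error by 27, so a Double runs out of precision after a dozen or so characters, which is pretty much the failure mode I hit in Pascal.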

Turns out, I should have read the back page of the assignment. Doherty had decided that the technique was a little too experimental for first year compsci and had instead instructed everyone to use a hashing technique. I handed my assignment in and discussed with the class tutor what I’d done. He wasn’t familiar with the algorithm at all but was impressed that it worked and understood why it failed where it did. I got full marks and first year compsci was one of my few high distinctions at uni.

Anyway, Doherty would often quote Knuth as the foundation of modern computing. Knuth was all about the development of algorithms and understanding their efficiencies. Algorithms are really important as they represent techniques for solving particular sorts of problems, e.g. what is the best way to sort a random string of numbers? The answer varies depending on how many numbers are in the string, or even whether you can know the number of numbers. For very small sets, a bubble sort is sufficient, and from there you move on to binary searches, binary trees, and so on. I wasn’t always across the maths but really appreciated the underlying thinking around assessing approaches to problem solving. Plus Doherty was a fab lecturer with a bit of character.
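To make the efficiency point concrete, here’s a bubble sort in Scala (my sketch, nothing from the lectures): it can take up to n² comparisons, which is perfectly fine for a handful of numbers and hopeless for millions, and that difference is exactly what this kind of analysis captures.

```scala
// Bubble sort: repeatedly swap adjacent out-of-order pairs until a full pass
// makes no swaps. Simple, and O(n^2), so only sensible for very small sets.
object SortDemo {
  def bubbleSort(xs: Array[Int]): Array[Int] = {
    val a = xs.clone
    var swapped = true
    while (swapped) {
      swapped = false
      for (i <- 0 until a.length - 1) {
        if (a(i) > a(i + 1)) {
          val t = a(i); a(i) = a(i + 1); a(i + 1) = t
          swapped = true
        }
      }
    }
    a
  }
}
```

Once the data is sorted, a binary search can find any element in O(log n) steps rather than scanning the lot, which is one reason sorting and searching share a volume.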

So Knuth. He is best known for his series, The Art of Computer Programming, which has gone through a few editions and I wonder if it will ever actually be finished; the fourth volume is labeled 4A: Combinatorial Algorithms Part 1. Volume 4 is eventually expected to span four volumes: 4A, 4B, 4C and 4D, with 4B partially released as a series of fascicles, six so far. Volume 3 seems to be the most relevant for where I’m at today and where I’m looking to play; it’s around 750 pages devoted specifically to sorting and searching. So much of what we do online relies on being able to find stuff, and to find stuff well, it helps if the data has been ordered.

Knuth has been this name in my head though my life has gone in other directions. A few years ago, I did a google and found that not only were his books on Amazon, there was even a box set of Volumes 1-4A. I bit the bullet about 3 years ago and bought the set; it cost around US$180 at the time and looks really bloody good on the shelf. I haven’t read a great deal yet but have dipped in a few times and I’m planning to get into volume 3 properly at some point. I’ve recently been moving stuff around at home and don’t have a lot of space for books next to where my computer gear is these days. However, it turns out the mac mini sits nicely on top of the set, and my newest computer, the VivoMini, sits nicely on top of the mac. I sorta like the idea of these small computers sitting on Knuth’s foundation.

threading delights

I’ve had the new machine a few days and I’m starting to get the hang of it, but learning, lots of learning. Finding linux equivalents of windows tools and then working out how to install them. Troubleshooting unexpected java errors trying to get spark shell to compile properly; turns out I had the JRE but not the full JDK, which meant I had to download more stuff and update some config files as well as pathname references so the system knows where to find things.

As it turns out I completely misread the new pages for Archives Unleashed and didn’t see the black menu bar at the top of the screen for all of the docs. Was a little too tired methinks. I installed stuff using old versions of the docs I found on the wayback machine and other bits. Consequently I’ve ended up with a more recent version of Archives Unleashed (a bit of a mouthful after the easier “warcbase”), 0.10.1 instead of 0.9.0, and I’m running a current version of Spark Shell, 2.2.0, instead of 1.6.1. Anyway it all works…I think.

The next headache was that my harvest test data was still on the mac mini. I wasn’t sure how to get the data across as I couldn’t write to a windows hard drive from the mac. Then I had the bright idea of copying the data, 56 files for a total of 80GB, to my home server via wifi. That took 6 hours…to the server, so I went away and did other things. Towards the end of that process I had a bit of time so I worked out that if I formatted a drive for the mac in exFAT format, I could install some utilities in linux to read it. That took an hour: half an hour to copy to the drive, half an hour from the drive to linux. Phew.

Then I tried running the Scala code for extracting the site structure and ran into a few errors as about 15% of the files had developed an error somewhere along the way. I removed all the broken files, leaving me with 47 usable ones. All up, it took 18 minutes to process the data, not quite as fast as I was hoping. On the other hand, the advantage of having lots of RAM is that there was plenty of space to do other things. Running the same job on the mac mini with its dual core CPU and 8GB RAM brought it to a grinding halt and nothing else was possible. On the new machine, I could run everything else normally, including web browsing and downloads.


Regardless of whether I allocated 5GB, 10GB, 24GB, or even 28GB of RAM, the time taken to process still hovered around 18 minutes. With 28GB allocated it only needed around 15GB to process, as can be seen in the screenshot of htop above. The other nice thing about htop is that it demonstrated that all 8 CPU threads were in use. Where I think I saved some time is that swap doesn’t seem to have been required, which would have reduced some overhead. Either that, or I haven’t worked out how to use swap memory yet.

Still very early days.

some new tech

Following my fun in July when I hit a bit of a wall playing with large data sets and brought my mac mini to a grinding halt, I ruminated on next steps. Wall aside, it was a wee bit frustrating that experiments on larger data sets took a long time to run, and that’s been a bit off-putting to further progress. So I decided that I really did need a new machine and was going to get an intel NUC Skull Canyon as it was small and fast. I waited for Intel to announce their new 8th generation CPUs, which they did recently. Unfortunately the upgrade to the current 6th generation Skull isn’t due till Q2 2018.

On the other hand, prices have been dropping on the barebones Skull and you can pick one up for around AUD$700. However a retailer pointed out to me recently that the ASUS VivoMini, while pricier, uses 7th generation CPUs. Plus it’s a cuter box. After some umming and ahhing, I ordered the VivoMini with 32GB RAM and an additional 1TB drive (it includes a 256GB SSD in the m.2 port). The CPU is a 7th generation quad core i7. Total cost was around AUD$1,700 whereas a similarly specced Skull would have been around $1,400-1,500. It has a small footprint and sits nicely on top of the mac mini.


Picked it up yesterday and it booted straight into windows. Today, somewhat trepidatiously, I had a go at setting it up to dual boot with linux. The last few years I’ve been running linux via virtualbox on windows and that’s been sufficient. It’s been a long, long, long time since I set up a dual boot machine and that was using debian which was a wee bit challenging at the time.

This time round it was all easy as. I followed some straightforward instructions carefully, tested initially with a live boot via USB, and then used that USB to install it properly. I’ve booted back and forth between windows and linux several times just to be sure and so far so good. I’m currently writing this blog via firefox in ubuntu. My next step was going to be to set up warcbase; however, that’s been deprecated as Ian Milligan and his team have received a new grant and are working on building an updated environment under their Archives Unleashed Toolkit. So I’ll play with that instead :) Regardless I’ll still need to get Apache Spark up and running, which is likely my next step.

rabbit holes of adventure

Dinner table conversation tonight ended up on Mystery House, which my partner played occasionally when she was younger. Mystery House is known as the first graphical adventure game. That of course led the conversation into interactive fiction, referencing the top shelf of my bookcase which contains pretty much all of Infocom‘s text adventures. I remember Zork II was my first text adventure and fiendish it was. I relied on adventure columns in the computer game magazines of the time for clues on how to solve difficult puzzles, including the horrible baseball diamond puzzle, also known as the Oddly Angled Room.

In those days, I couldn’t google answers and would spend months stuck on a problem. Sometimes that could be a good thing but mostly it was bloody frustrating. While there was a certain sense of achievement in solving puzzles, being stuck meant I couldn’t advance the story: solving puzzles was essential to accessing further parts of the game. These days I think I prefer story telling and plot development, though solving puzzles is nice too. Happily most games provide decent hint mechanisms and if I get desperate I can google for answers.

Much to the shock of my partner, I commented that I usually have my text adventure collection stored on all my active machines as they are part of the central core of files that migrate across my various computing environments. This sounds substantial until you realise that text adventures have few graphics and don’t take up a lot of space. My entire interactive fiction archive is a little over 100MB, of which the complete works of Infocom account for 95%. Come to think of it, the Infocom games were the only ones I was able to buy later as a box set, the Lost Treasures of Infocom, in a system independent format.


The other key adventure game company of the time was Level 9. Infocom were American based, while Level 9 were from the UK, and I had several of their games. Regrettably, while I still have the boxes, I no longer have the equipment to read the discs. Later on, graphic adventures developed further with Magnetic Scrolls, commencing with their first game, the fantastic The Pawn. I have several of their titles on my shelf too. Methinks I need to investigate further as to whether I can get these on my current machines. Come to think of it, I’ve barely mentioned Sierra Online, who were responsible for Mystery House and later developed the King’s Quest and Space Quest series. Oh, and then there was Ultima…yet another rabbit hole…

big data is bigger than a mini

I broke my mac mini. Temporarily. I found its limits for big data stuff. It could handle running spark shell with 75% of RAM allocated, processing 70GB of gzipped WARC files to produce a 355KB gephi file of links. But it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory around the 95GB mark. To be honest, I’m rather impressed; I was impressed that it handled 70GB. The mini has 8GB of RAM, of which I allocated 6GB to running spark shell. Everything else slowed to a halt. Ultimately even that wasn’t enough. I eventually worked out that I could split my data up into three chunks and process them separately. I now have 3 gephi files totaling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
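Splitting the data was nothing fancier than a greedy pass over the file list: keep adding files to the current chunk until the next one would blow the budget, then start a new chunk. A minimal Scala sketch (the file names and the size cap here are made up, not my actual harvest files):

```scala
// Greedy chunking: batch (name, sizeGB) pairs so no batch exceeds the cap.
// Assumes each individual file is itself smaller than the cap.
object Chunker {
  def chunk(files: Seq[(String, Double)], capGB: Double): List[List[(String, Double)]] = {
    val out = scala.collection.mutable.ListBuffer.empty[List[(String, Double)]]
    var current = List.empty[(String, Double)]
    var total = 0.0
    for (f <- files) {
      if (total + f._2 > capGB && current.nonEmpty) {
        out += current.reverse   // close off the full chunk
        current = Nil
        total = 0.0
      }
      current = f :: current     // prepend, reversed when the chunk closes
      total += f._2
    }
    if (current.nonEmpty) out += current.reverse
    out.toList
  }
}
```

Running each chunk through spark shell separately then produces one gephi file per chunk, which is roughly how I ended up with three of them.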

The bigger question is: do I need to look at new hardware to handle this sort of stuff and if so, what direction do I go in? I’ve done a bunch of research and I’m unclear on what the best direction is…actually that’s not quite right, I know what the best direction is and it involves a chunk of money :-) What’s the best direction without spending stupid money? I figure there are three main groups:

  1. Intel NUC Skull Canyon vs Supermicro Superserver
  2. Odroid XU4 cluster – each board has 2GB RAM and an octo-core CPU (8 cores!) while the Raspberry Pi 3 only offers 1GB per board
  3. Mini tower – takes up more space but cheaper than option 1 for more RAM and core/threads with less cooling issues

The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.

In terms of cost, 2 is cheapest, then 3, then 1. 2 is also cutest :-) a cluster of 8 Odroid XU4s gives 16GB RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable, i.e. it can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300 while the Supermicro requires a little more but works out cheaper in the long run: going the Skull route means each upgrade requires a new Skull at $1,400 a pop, whereas the Supermicro costs around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.

I like the idea of option 2 and building a cluster of Odroid XU4s but to be honest, I’m not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven’t worked out how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis: the Skull beats it on raw speed, but for tasks tailored to lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price has recently dropped on the barebones Skull from around AUD$900 to $780; to that you would add RAM and SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and SSD, though it can also handle other types of drives.

I’m not even sure I want to go down the “moar hardware” path at all. The mac mini can cope with the basic 70GB dataset and provides a sufficient environment to experiment and learn in. I’m fairly committed to going down this path and need to learn more about how to use spark and code in scala. I’m using scala mostly because it’s been referenced in the WARCbase guides I’ve started out with. I don’t know enough to know if it’s the best path forward; on the other hand, any path forward is a good one.

 

identifying data

Wednesday and time to respond to an identity challenge from Paul :-) 4 questions about me and the computer gear I like, and I suspect questions 1 and 4 are going to be the hard ones. As this is a personal space, I tend not to talk about my work, or at least not directly. My about page provides hints of past and current jobs but that’s about it.

Who are you, and what do you do?

My name is snail. I use my real name at work though even there I’d prefer to use snail but all the systems are based around official names not nicknames. Sadly. Many folk know me as snail except security and the switchboard so turning up and asking for snail ain’t gonna work :-) I am the Online Resources Specialist Librarian at the State Library of NSW and I am responsible for working with eresources, dealing with vendors, contract management, budget management, EZproxy, eresource troubleshooting and support, eresource subscriptions and digital archive purchases…and stats…and more stats. I am the Library’s representative on the NSLA eResources Consortium. 3 years ago I implemented a project for whole of domain web harvesting of all government websites under *.nsw.gov.au and I’ve been running that ever since…I’ll be commencing the primary annual captures today. I may have been blogging about the web harvesting stuff recently :)

What hardware do you use?

At work, I have a basic laptop running Windows 7 plugged into a 24″ widescreen monitor, along with a Das Keyboard Professional 4 mechanical keyboard and a Logitech trackball. I have a Jabra bluetooth hub hooked up to the desk phone which is paired to my mobile hearing aid loop, enabling me to hear telephone calls through my hearing aids.

I have a personal laptop, a 2013 11″ Sony Vaio running Windows 10, which I use occasionally at work for external testing. At home, I have a mac mini connected to a 24″ widescreen monitor, with a Logitech G610 mechanical keyboard and a Logitech trackball. Behind the scenes I’m running a home server on a 4 bay QNAP TS-421 in RAID 5: each drive is 3TB for a total of 12TB, which I’m primarily using for backing up all my machines, running my itunes server, and my photo archive. I have a 7″ Nexus (2013) tablet, a Samsung Galaxy S5 phone, and a Sony PRS-T2 ereader. Even a Psion 5mx that still works! I have several old keyboards too, assorted external hard drives and lots of USB sticks. :-)

And what software?

The machine at work is on Windows 7 and has just migrated to Office 365. The personal laptop is running Windows 10 and tends to run Open Office variants, has a virtualbox running Linux Mint, and a few other odds and ends. The mac mini is running whatever is the current MacOS and the phone and tablet are running android. I’ve never been much good at this single operating environment malarkey :-) Some of my favourite software includes:

and more browser variants than I care to count including lynx.

What would be your dream setup?

I wish all my devices would talk better to each other, a universal standard for talking across different machines, operating systems and so on. More speed, more bandwidth and greater customisation options. I like things to look pretty, both the hardware and the software, and I don’t like it when fab looking customisations break things. I like working from home but like working near colleagues too and some way of merging the two environments would be fab. I want better ears to hear conversations and chit-chat.

board with keys

I spend a lot of my life in front of a keyboard. I have tried other sorts of things here and there but suspect I’m stuffed with anything other than a keyboard. A physical keyboard. I do not like the lack of physical feedback from virtual keyboards. I get by with a phone or tablet, and swiping plus predictive text works well enough, but it’s awkward for composing slabs of text and editing. I am most at home with a full keyboard.

Keyboards come in all shapes these days: small ones, big ones, some with fewer keys, some with all the keys, some with colours and flashing lights. Keyboards even have their own culture and groupies, depending on where you hang. “Tenkeyless” is a bit of a thing at the moment, referring to keyboards that don’t have a numeric keypad. When I was working vendor-side, I had a series of thinkpad laptops (initially IBM, then Lenovo which took over IBM’s laptop division) and work supplied an external keyboard for the office. Both the laptop and external keyboards were decent; I still have the external IBM keyboard, used it recently, and it holds up.

Logitech Wave

My keyboard of choice in recent years was the Logitech Wave. It was comfy and had a really nice feel to type on. When I find a keyboard I like I usually get the same keyboard for work and home. My current workplace supplies a basic keyboard which is good as far as it goes, but I replaced it with a wave :-) If you can get away with it, it’s nice to swap in your own gear at work.

For a long time now, I’ve been reading about mechanical keyboards. These hark back to olden days when office typewriters were actual typewriters and computer keyboards emulated this approach. I’ve been trawling through my saved articles (using pocket these days) and found this fabulous article by Justine Hyde on the love of typewriters. There’s something about the click clack of old school typewriters that appeal. I’ve been using computer keyboards since I was 12 and still have fond memories of learning to type on mum’s, or was it dad’s, portable typewriter.

I used to think mechanical keyboards were a bit of a trendy thing, focused on the noise of typing and hipster, old school style…not to mention being expensive. Mechanical keyboards can be very noisy. Most modern keyboards are membrane-based, where a rubber membrane under the keys registers each press, whereas mechanical keyboards have a dedicated switch for each key. Modern keyboards are quiet and don’t disturb anyone, whilst everyone knows if you’re using a mechanical keyboard. Stories continued to emerge of how nice mechanicals are to type on but still, they’re eccy: the wave cost around $100 including mouse and wrist-rest, whereas a good mechanical keyboard is around $150+.

Recently I bit the bullet and went hunting for a decent mechanical keyboard. I have long admired Das Keyboard as dedicated keyboard enthusiasts; they even produce a keyboard with no labels for touch typists, in pure, unadulterated style. I never actually learned to touch-type but suspect I would probably do ok on a blank keyboard. Instead I went looking for the Das Professional 4…with labels. The Das Pro 4 is regarded as top end at around AUD$270 and doesn’t include a wrist-rest.

I popped in to Capitol Square in Sydney where there’s a nest of specialist computer shops, but none had one and I eventually settled for the cheaper Logitech Orion G610 ($150) with Cherry Brown switches. A note on switches: these are the things that separate mechanical keyboards from typical membranes; each key has its own switch and Cherry is top of the pile for switches. Cherry has several colours denoting noise and feel, with the clicky Cherry Blues generally being the loudest. My reading suggested Cherry Browns are quieter while maintaining decent tactility. Competing keyboard suppliers use Cherry too but some have developed their own switches, e.g. Logitech developed Romer-G for their top end keyboards. I won’t get into the terminology of actuation points and so on as it can get a wee bit intense.

Das Keyboard Professional 4

I set up the Logitech G610 at home and oh my, it was an order of magnitude better than the Wave. Noisy yes, but so so nice to type on. However what I hadn’t understood was just how much faster mechanical keyboards are to type on; I thought I was a fairly quick typist but am faster still on a mechanical. The G610 is beautiful to type on. Anyways I ended up ordering the Das Pro 4 (Cherry Brown) online and set it up at work. While the Logitech G610 was an order of magnitude better than the Wave, the Das Pro 4 was that much better again, and possibly quieter too. It was so much better to type on than the G610 that I’m now tempted to replace the G610 with a Das Pro 4 too. It’s all subjective: I was happy for years with the Wave; I was happy with the G610; and now I’m even happier with the Das Keyboard Professional 4.