harvest testing

One of the difficulties of working with web harvests is that it is but one of several priorities and not even the key one. A lot of my job is focussed on managing the Library's eresources collection, dealing with suppliers and looking after budgets. In addition, I've been running the Library's web harvesting programme for about three and a half years now. The main crawl of NSW government websites was originally set up by Archive-It and these days I have it scheduled to run twice a year. There are other smaller crawls that are run throughout the year.

However, there's never been a lot of time for exploring the harvested content in detail and ensuring we're getting the material we think we are. We do run some testing by searching for specific content within the archive, eg budget papers, and checking that it contains all relevant content including spreadsheets and documents. There's only so much manual testing you can do when this particular archive is 3.5TB and contains around 74 million documents. The Archive-It software does provide some tools for checking crawl results and broadly indicating missed material.

However, as readers continue to explore the collection, they come across cases where we haven't fully captured the content we thought we had. A recent example is the Electoral Atlas of NSW 1856-2006, edited by Eamonn Clifford, Antony Green and David Clune. The State Library does hold it in print and, until recently, the digital content was hosted on the NSW Parliamentary website.

On initial inspection, it appeared that the content had been captured via harvests by both SLNSW and the National Library of Australia (NLA). The NLA version doesn't descend any further, while the SLNSW version does display the individual election results, eg 1984:

Election details of the 1984 New South Wales state election

However, all the links in the 1984 Election Links section return a "Not in Archive" message, and similarly for other years. In this example, there is some happy news in that the main Wayback Machine seems to have captured the site in full, including those pages we've missed. The question that I need to explore, and may need to ask Archive-It about, is why their crawl captured that information and ours didn't.
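As an aside, this kind of spot check can be scripted. A rough sketch of the idea, using the Internet Archive's CDX query endpoint as I understand it (the URL in the list is just a placeholder rather than the actual atlas link):

import scala.io.Source
import java.net.URLEncoder

// ask the CDX API whether a URL has ever been captured; an empty response means no capture
val urls = Seq("http://www.example.gov.au/electoral-atlas/1984")  // placeholder only
urls.foreach { u =>
  val query = "http://web.archive.org/cdx/search/cdx?limit=1&url=" + URLEncoder.encode(u, "UTF-8")
  val resp = Source.fromURL(query).mkString.trim
  println(if (resp.isEmpty) s"not captured: $u" else s"captured: $u")
}

The same approach should work against any archive that exposes a CDX-style endpoint, which would be handy for comparing what we captured against what the NLA or the Internet Archive did.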

As a side note, I’ve found the Wayback browser plugin (Firefox, Chrome) rather useful for finding archived versions of pages that no longer exist on websites.

 

bits and whiskies

Sat down at the computer for the first time in a while and installed docker. I have it installed on most of my machines and finally got round to it on the vivomini today. Was a simple matter to run:

sudo apt install docker.io

enter my password and off it went. Docker containers include everything you're likely to need to run a particular batch of software. Installing software is rarely simple and may rely on the presence of other packages, which can lead to a vicious circle of finding all the dependencies and installing them. In this case, I wanted to try the new-ish docker container for the Archives Unleashed Toolkit which, in its earlier days, had been a little challenging to set up in a non-docker environment. Whereas this version was dead simple via docker on a linux command line:

Step 1 sudo docker pull archivesunleashed/docker-aut
Step 2 sudo docker run --rm -it archivesunleashed/docker-aut

Both steps took a while but I think it was around 15-20 minutes altogether on my ADSL2 house wifi (my NBN option is HFC and that’s been delayed several months). When the second step finished I was greeted with the opening screen for the spark shell and ready to work. Very nice and will have more of a play later.
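When I do have that play, the sort of first command I expect to try is a simple domain count, something along these lines (a sketch from memory of the AUT docs; the package names, the helpers and the sample data path are assumptions and will vary between versions):

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

// count which domains turn up most often in the sample WARCs bundled with the image
RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)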

For now, I'm currently downloading Horizon Zero Dawn: The Frozen Wilds and rather looking forward to revisiting my favourite game of 2017, and possibly even my favourite game since Skyrim. Actually, I'm not sure about the latter, given I haven't actually stopped playing Skyrim. I have been playing a lot of Assassin's Creed: Origins over the last couple of months and it feels like there's still so much to explore. Some of it is a bit repetitive, yet it's wonderful exploring such a well realised version of Egypt and its surrounds in the time of Cleopatra. With that said, I'm at the point where I'm going to ease back and pop into it occasionally rather than have it as my primary game.

Then there was whisky. All the bottles I had opened in early November are now finished. Back then I had 9 bottles altogether with 5 open; now it's 9 bottles and 4 open. Actually I have an additional 7 bottles but they're each 50ml and combined are equivalent to a single bottle. My partner bought me a box of 4 peated malts for christmas, and I picked up a taster pack of 3 Loch Lomond whiskies. Whiskies opened include:

  • Hellyers Road 10 year old (46.2%) – a nice, soft dram from Tasmania. Usually retails around $90 and I think I’m on my second bottle.
  • Ben Nevis 18 year old (single cask, 54.7%) – strong but delish, loving this one and on to the second bottle. This was $240 and is part of a fund raiser for a new distillery in Corowa, NSW.
  • BenRiach Peated Cask Strength Single Malt (56%) – also strong and also delish. This was $150 and I have a suspicion that BenRiach is turning out to be one of my favourite distilleries after Highland Park and Overeem. I have also enjoyed their 17 year old PX cask.
  • Glenmorangie: The Duthac (43%) – more yum. This was a christmas present; it was released for travel retail and is primarily available at duty free shops in airports, Singapore in this instance. Partly finished in Pedro Ximénez casks. Sherry casks are my preference and the Pedro Ximénez (PX) seems to raise that a notch or two.

Speaking of Pedro, I rather like sherry straight too. I used to prefer ports and muscats, and even had a port barrel maturing at one stage. I suspect if I ever do another barrel it will be for sherry. Of sherries, the Pedro Ximénez or PX (though it seems irreverent to shorten it so) is turning out to be my favourite. I have been trying out various releases from cheap to expensive, the most expensive being around $55 for 350ml! My favourite, while a little pricey, seems to be the Cardenal Cisneros at $56/750ml, though that's cheap compared to whisky.

knuth

I often say professionally that I did a compsci major (though I can never claim it officially) yonks ago but decided against becoming a programmer. For the most part that's not a decision I regret, though it must be said I continue to have strong leanings in that direction. Scarily, it's been over 25 years since those compsci days. Still, I learnt good stuff.

I recall that in the second half of first year compsci we had an older lecturer, a maths lecturer who seemed to have come across into computing. I can say "older" as I've just found this bio which sums up very briefly a rather fascinating career. He may even have been one of my favourite lecturers as he liked to play with new ideas and introduced stuff he knew from maths into computing. I was a very rare beast in compsci in that I was enrolled under a BA and not directly in compsci, and I did no maths. I had done first year maths but it wasn't quite my bag. Doherty was very big on mathematical ideas and assessing the efficiency of algorithms.

I recall him talking about some weird algorithm for encrypting data; he worked through the basic idea in a lecture and I think it was based on some sort of fractional encoding model. At the end of the lecture, he said the next assignment would be to implement it. I found the idea of it fascinating. The next assignment came out and sure enough it was on encryption, so I implemented the algorithm he'd talked about in Pascal, based on my lecture notes. The idea was you'd write code to encrypt a paragraph of text, and code to decrypt the text. I was mostly successful, but because it relied on decimal conversion of larger numbers, it rapidly lost accuracy on the 8 bit macs we were using at the time. Out of a sentence of 10 words, it started losing letters by the end of the first word.
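With hindsight, my guess is it was something along the lines of arithmetic coding. A minimal sketch of that idea (in Scala rather than the Pascal I wrote, and not necessarily Doherty's exact scheme) shows both how it works and why naive floating point falls over on longer messages:

// toy model: each symbol owns a fixed slice of [0, 1)
val ranges = Map('a' -> (0.0, 0.4), 'b' -> (0.4, 0.7), 'c' -> (0.7, 0.9), 'd' -> (0.9, 1.0))

// encoding narrows [low, high) once per symbol and returns a single fraction
def encode(msg: String): Double = {
  var (low, high) = (0.0, 1.0)
  for (c <- msg) {
    val (cLow, cHigh) = ranges(c)
    val width = high - low
    high = low + width * cHigh
    low = low + width * cLow
  }
  (low + high) / 2
}

// decoding reverses the narrowing; the precision in a Double runs out quickly
def decode(code: Double, n: Int): String = {
  var x = code
  (1 to n).map { _ =>
    val (c, (cLow, cHigh)) = ranges.find { case (_, (lo, hi)) => x >= lo && x < hi }.get
    x = (x - cLow) / (cHigh - cLow)
    c
  }.mkString
}

decode(encode("abca"), 4)  // "abca" for short strings; longer ones degrade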

Turns out, I should have read the back page of the assignment. Doherty had decided that the technique was a little too experimental for first year compsci and had instead instructed everyone to use a hashing technique. I handed my assignment in and discussed with the class tutor what I’d done. He wasn’t familiar with the algorithm at all but was impressed that it worked and understood why it failed where it did. I got full marks and first year compsci was one of my few high distinctions at uni.

mini computers on top of computer books

Anyway, Doherty would often quote Knuth as the foundation of modern computing. Knuth was all about the development of algorithms and understanding their efficiencies. Algorithms are really important as they represent techniques for solving particular sorts of problems, eg what is the best way to sort a random list of numbers? The answer varies depending on how many numbers are in the list, or even whether you can know the number of numbers in advance. For very small sets, a bubble sort is sufficient; from there you move on to more efficient sorts, and to binary searches, binary trees and so on for finding things. I wasn't always across the maths but really appreciated the underlying thinking around assessing approaches to problem solving. Plus Doherty was a fab lecturer with a bit of character.
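To make that concrete, a bubble sort really is about as simple as sorting gets (my own quick sketch, not anything from the lectures): fine for a handful of numbers, but its roughly n² comparisons are why you reach for something better as the list grows.

// naive bubble sort: keep swapping adjacent out-of-order elements until done
def bubbleSort(xs: Array[Int]): Array[Int] = {
  val a = xs.clone()
  for (i <- a.indices; j <- 0 until a.length - 1 - i) {
    if (a(j) > a(j + 1)) {
      val tmp = a(j); a(j) = a(j + 1); a(j + 1) = tmp
    }
  }
  a
}

bubbleSort(Array(5, 3, 8, 1)).mkString(", ")  // 1, 3, 5, 8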

So Knuth. He is best known for his series, The Art of Computer Programming, which has gone through a few editions and I wonder if it will ever actually be finished; the fourth volume is labelled 4A: Combinatorial Algorithms Part 1. Volume 4 is eventually expected to span four volumes: 4A, 4B, 4C and 4D. Volume 4B has so far been partially released as a series of fascicles, six of which are out. Volume 3 seems to be the most relevant for where I'm at today and where I'm looking to play; it's around 750 pages devoted specifically to sorting and searching. So much of what we do online relies on being able to find stuff and, to find stuff well, it helps if the data has been ordered.

Knuth has been this name in my head even though my life has gone in other directions. A few years ago, I did a google and found that not only were his books on Amazon, there was even a box set of Volumes 1-4A. I bit the bullet about 3 years ago and bought the set; it cost around US$180 at the time and looks really bloody good on the shelf. I haven't read a great deal yet but have dipped in a few times and am planning to get into volume 3 properly at some point. I've recently been moving stuff around at home and don't have a lot of space for books next to where my computer gear is these days. However, it turns out the mac mini sits nicely on top of the set, and my newest computer, the VivoMini, sits nicely on top of the mac. I sorta like the idea of these small computers sitting on Knuth's foundation.

threading delights

I've had the new machine a few days and I'm starting to get the hang of it, but learning, lots of learning. Finding linux equivalents of windows tools and then working out how to install them. Troubleshooting unexpected java errors trying to get the spark shell running properly – turns out I had the JRE but not the full JDK, which meant I had to download more stuff and update some config files as well as pathname references so the system knows where to find everything.

As it turns out I completely misread the new pages for Archives Unleashed and didn't see the black menu bar at the top of the screen for all of the docs. Was a little too tired methinks. I installed stuff using old versions of the docs I found on the wayback machine and other bits. Consequently I've ended up with a more recent version of Archives Unleashed (a bit of a mouthful after the easier "warcbase"), 0.10.1 instead of 0.9.0, and I'm running a current version of Spark, 2.2.0, instead of 1.6.1. Anyway it all works…I think.

The next headache was that my harvest test data was still on the mac mini. I wasn't sure how to get the data across as I couldn't write to a windows hard drive from the mac. Then I had the bright idea of copying the data, 56 files for a total of 80GB, to my home server via wifi. That took 6 hours…to the server, so I went away and did other things. Towards the end of that process I had a bit of time, so I worked out that if I formatted a drive as exFAT on the mac, I could install some utilities in linux to read it. That took an hour: half an hour to copy to the drive, half an hour from the drive to linux. Phew.

Then I tried running the Scala code for extracting the site structure and ran into a few errors, as about 15% of the files had developed an error somewhere along the way. I removed all the broken files, leaving me with 47 usable ones. All up, it took 18 minutes to process the data, not quite as fast as I was hoping. On the other hand, the advantage of having lots of RAM is that there was plenty of space to do other things. Running the same job on the mac mini with its dual core CPU and 8GB RAM brought it to a grinding halt and nothing else was possible. On the new machine, I could run everything else normally including web browsing and downloads.
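For the record, the extraction is essentially the site link structure recipe from the AUT docs, something along these lines (a sketch only; the package names, the ExtractLinks/ExtractDomain helpers and the paths are from memory and will differ between versions):

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

// pull (crawl date, source domain, target domain) triples out of the WARCs,
// count how often each pairing occurs, and write the result out for Gephi
val links = RecordLoader.loadArchives("/data/warcs/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(l => (r._1, ExtractDomain(l._1), ExtractDomain(l._2))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)   // drop pairings that occur only a handful of times

links.saveAsTextFile("/data/sitelinks")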

htop2

Regardless of whether I allocated 5GB, 10GB, 24GB, or even 28GB of RAM, the time taken to process still hovered around 18 minutes. With 28GB allocated, it only needed around 15GB to process, as can be seen in the above screenshot of htop. The other nice thing about htop is that it demonstrated that all 8 CPU threads were in use. Where I think I saved some time is that swap doesn't seem to have been required, which would have reduced some overheads. Either that, or I haven't worked out how to use swap memory yet.

Still very early days.

some new tech

Following my fun in July when I hit a bit of a wall playing with large data sets and brought my mac mini to a grinding halt, I ruminated on next steps. Wall aside, it was a wee bit frustrating that experiments on larger data sets took a long time to run, and that's been a bit off-putting to further progress. So I decided that I really did need a new machine and was going to get an Intel NUC Skull Canyon as it was small and fast. I waited for Intel to announce their new 8th generation CPUs, which they did recently. Unfortunately the upgrade to the current 6th generation Skull isn't due till Q2 2018.

On the other hand, prices have been dropping on the barebones Skull and you can pick one up for around AUD$700. However, a retailer pointed out to me recently that the ASUS VivoMini, while pricier, uses 7th generation CPUs. Plus it's a cuter box. After some umming and ahhing, I ordered the VivoMini with 32GB RAM and an additional 1TB drive (it includes a 256GB SSD in the m.2 port). The CPU is a 7th generation quad core i7. Total cost was around AUD$1,700, whereas a similarly set up Skull would have been around $1,400-1,500. It has a small footprint and sits nicely on top of the mac mini.

36938037614_0820718b3e

Picked it up yesterday and it booted straight into windows. Today, somewhat trepidatiously, I had a go at setting it up to dual boot with linux. The last few years I’ve been running linux via virtualbox on windows and that’s been sufficient. It’s been a long, long, long time since I set up a dual boot machine and that was using debian which was a wee bit challenging at the time.

This time round it was all easy as. I followed some straightforward instructions carefully, tested initially on a live boot via USB, and then used that USB to install it properly. I've booted back and forth between windows and linux several times just to be sure and so far so good. I'm currently writing this blog via firefox in ubuntu. My next step was going to be to set up warcbase; however, that's been deprecated as Ian Milligan and his team have received a new grant and are working on building an updated environment under their Archives Unleashed Toolkit. So I'll play with that instead :) Regardless, I'll still need to get Apache Spark up and running, which is likely where I'll start.

rabbit holes of adventure

Dinner table conversation tonight ended up on Mystery House, which my partner played occasionally when she was younger. Mystery House is known as the first graphical adventure game. That of course led the conversation into interactive fiction, referencing the top shelf of my bookcase which contains pretty much all of Infocom's text adventures. I remember Zork II was my first text adventure and fiendish it was. I relied on adventure columns in the computer game magazines of the time for clues on how to solve difficult puzzles, including the horrible baseball diamond puzzle, also known as the Oddly Angled Room.

In those days, I couldn't google answers and would spend months stuck on a problem. Sometimes that could be a good thing but mostly it was bloody frustrating. While there was a certain sense of achievement in solving puzzles, being stuck meant I couldn't advance the story. Solving puzzles was essential to accessing further parts of the game. These days I think I prefer story telling and plot development, though solving puzzles is nice too. Happily most games provide decent hint mechanisms and if I get desperate I can google for answers.

Much to the shock of my partner, I commented that I usually have my text adventure collection stored on all my active machines, as they are part of the central core of files that migrates across my various computing environments. This sounds substantial until you realise that text adventures, having little in the way of graphics, don't take up a lot of space. My entire interactive fiction archive is a little over 100MB, of which the complete works of Infocom account for 95%. Come to think of it, they were the only ones I was able to buy as a box set later, the Lost Treasures of Infocom, and load in a system-independent format.

interactive fiction games

The other key adventure game company of the time was Level 9. Infocom were American, while Level 9 were from the UK, and I had several of their games. Regrettably, while I still have the boxes, I no longer have the equipment to read the discs. Later on, graphic adventures developed further with Magnetic Scrolls, commencing with their first game, the fantastic The Pawn. I have several of their titles on my shelf too. Methinks I need to investigate further as to whether I can get these on my current machines. Come to think of it, I've barely mentioned Sierra Online, who were responsible for Mystery House and later developed the King's Quest and Space Quest series. Oh, and then there was Ultima…yet another rabbit hole…

big data is bigger than a mini

I broke my mac mini. Temporarily. I found its limits for big data stuff. It could handle running the spark shell with 75% of RAM allocated, processing 70GB of gzipped WARC files to produce a 355KB gephi file of links. But it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory around the 95GB mark. To be honest, I'm rather impressed. I was impressed that it handled 70GB. The mini has 8GB of RAM, of which I allocated 6GB to running the spark shell. Everything else slowed to a halt. Ultimately even that wasn't enough. I eventually worked out that I could split my data up into three chunks and process them separately. I now have 3 gephi files totalling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
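The chunking itself was nothing clever; the idea was simply to break the list of WARC files into three batches and run the same extraction over each, roughly like this (an illustrative sketch only; the directory path is a placeholder):

// split the WARC files into three batches and run the extraction per batch
val warcs = new java.io.File("/data/warcs").listFiles
  .map(_.getPath).filter(_.endsWith(".gz")).sorted
val batches = warcs.grouped(math.ceil(warcs.length / 3.0).toInt).toList

batches.zipWithIndex.foreach { case (files, i) =>
  println(s"batch $i: ${files.length} files")
  // run the link extraction over just these files, writing each batch's
  // output to its own directory so the three gephi files stay separate
}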

lots of cars on the road in Borneo

The bigger question is: do I need to look at new hardware to handle this sort of stuff and, if so, what direction do I go in? I've done a bunch of research and I'm unclear on what the best direction is…actually that's not quite right, I know what the best direction is and it involves a chunk of money :-) What's the best direction without spending stupid money? I figure there are three main groups:

  1. Intel NUC Skull Canyon vs Supermicro Superserver
  2. Odroid XU4 cluster – each board has 2GB RAM and an octa-core CPU (8 cores!), while a Raspberry Pi 3 is only 1GB per board
  3. Mini tower – takes up more space but cheaper than option 1 for more RAM and cores/threads, with fewer cooling issues

The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.

In terms of cost, option 2 is cheapest, then 3, then 1. Option 2 is also the cutest :-) a cluster of 8 Odroid XU4s gives 16GB RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable, ie it can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300, while the Supermicro requires a little more but works out cheaper in the long run. Going the Skull route means each upgrade requires a new Skull at $1,400 a pop, whereas the Supermicro requires around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.

I like the idea of option 2 and building a cluster of Odroid XU4s but, to be honest, I'm not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven't worked out how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis. The Skull beats it on raw speed, but for tasks tailored for lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price has recently dropped on the barebones Skull from around AUD$900 to $780. To that you would add RAM and SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and SSD, though it can also handle other types of drives.

Motorbike in Borneo

I'm not even sure I want to go down the "moar hardware" path at all. The mac mini can cope with the basic 70GB dataset and provides a sufficient environment to experiment and learn in. I'm fairly committed to going down this path and need to learn more about how to use Spark and code in Scala. I'm using Scala mostly because it's been referenced in the WARCbase guides I've started out with. I don't know enough to know if it's the best path forward; on the other hand, any path forward is a good one.