zip is gone…almost

For many, many years…decades even, my main email/ISP etc was hosted on an outfit called Zip, or even zipworld. It was progressively swallowed up by larger and larger companies, till in 2015 it ended up with Telstra. Telstra recently announced that they were shutting down the smaller networks though I could seek an account with them if I liked.

Admittedly, the last few years I’ve been maintaining my zip account primarily as an email forwarder for sending/receiving email. At home, my partner has connectivity with another provider. My old website no longer works though I do have full backups (on my PC, external hard drive, and NAS), plus you can find it on the wayback machine.

Update: it’s not dead yet. Curiously, if I use “my.zipworld.com.au” instead of “www.zipworld.com.au”, my old site is still accessible :-) Of course, all the links I have that point to it are broken.

My primary email address (not zip), which was pointing to my zip account, now points to my gmail account. My old zip account is mostly used by a couple of elists, the odd family missive, and a lot of spam. The mail server hasn’t died yet, though I expect that will happen one day. For now I’m still successfully using it to send email…and spammers continue to use it successfully to send me email.

Anyways, I am a little sad to say goodbye to dear old zip. The big advantage in the early days was the work they did in maintaining a local usenet server and it was why I signed up in the first place. Of course, it’s been a long time since I used usenet either. Usenet was replaced by other things, and eventually there was twitter and facebook, which picked up some sense of community that I was missing.

harvest testing

One of the difficulties with working with web harvests is that it is but one of several priorities and not even the key one. A lot of my job is focussed on managing the Library’s eresources collection, dealing with suppliers and looking after budgets. In addition I’ve been running the Library’s web harvesting programme for about three and a half years now. The main crawl of NSW government websites was set up originally by Archive-It and these days I have it scheduled to run twice a year. There are other smaller crawls that are run throughout the year.

However there’s never been a lot of time for exploring the harvested content in detail and ensuring we’re getting the material we think we are. We do run some testing by searching for specific content within the archive, eg budget papers, and checking that it contains all relevant content including spreadsheets and documents. There’s only so much you can do to manually test when this particular archive is 3.5TB and contains around 74 million documents. The Archive-It software does provide some tools for checking crawl results and broadly indicating missed material.

However, as readers continue to explore the collection, they come across things where we haven’t fully captured the content we thought we had. A recent example is the Electoral Atlas of NSW 1856-2006 edited by Eamonn Clifford, Antony Green and David Clune. The State Library does hold it in print and until recently the digital content was hosted on the NSW Parliamentary website.

On initial inspection, it appeared that the content had been captured via the harvest both by SLNSW and the National Library (NLA). The NLA version doesn’t descend further while the NSW version does display the individual election results eg 1984:

Election details of the 1984 New South Wales state election

However, all the links in the 1984 Election Links section return a “Not in Archive” message, similarly for other years. In this example, there is some happy news in that the main wayback machine seems to have captured the site in full including those pages we’ve missed. The question that I need to explore, and may need to ask Archive-It about, is why their crawl captured that information and ours didn’t.
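A rough sketch of how the “Not in Archive” links could be checked against the main wayback machine in bulk. It assumes the public Wayback availability endpoint (archive.org/wayback/available) and its JSON reply with an archived_snapshots/closest entry; the function names are made up for illustration and nothing here has been run against the Electoral Atlas pages.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url):
    """Return the availability query URL for one page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_capture(reply):
    """Pick the closest capture URL out of a decoded JSON reply, or None."""
    closest = reply.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timeout=30):
    """Ask the Wayback Machine about one page (needs network access)."""
    with urlopen(build_query(page_url), timeout=timeout) as response:
        return closest_capture(json.load(response))
```

Feeding lookup() each broken link from an election page would give a list of which ones the main wayback machine holds, which is a useful starting point for a conversation with Archive-It.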

As a side note, I’ve found the Wayback browser plugin (Firefox, Chrome) rather useful for finding archived versions of pages that no longer exist on websites.

 

bits and whiskies

Sat down at the computer today for the first time in a while and installed docker. I have it installed on most of my machines and got round to it on the vivomini today. Was a simple matter to run:

sudo apt install docker.io

enter my password and off it went. Docker containers include everything you’re likely to need to run a particular batch of software. Installing software is rarely simple and may rely on the presence of other packages, which leads into a vicious circle of finding all the dependencies and installing them. In this case, I wanted to try the new-ish docker container for the Archives Unleashed Toolkit which, in earlier days, had been a little challenging to set up in a non-docker environment. This version, by contrast, was dead simple via docker on a linux command line:

Step 1 sudo docker pull archivesunleashed/docker-aut
Step 2 sudo docker run --rm -it archivesunleashed/docker-aut

Both steps took a while but I think it was around 15-20 minutes altogether on my ADSL2 house wifi (my NBN option is HFC and that’s been delayed several months). When the second step finished I was greeted with the opening screen for the spark shell and ready to work. Very nice and will have more of a play later.

For now, I’m currently downloading Horizon Zero Dawn: The Frozen Wilds and rather looking forward to revisiting my favourite game of 2017, and possibly even my favourite game since Skyrim. Actually, I’m not sure on the latter and I haven’t actually stopped playing Skyrim. I have been playing a lot of Assassin’s Creed: Origins over the last couple of months and it feels like there’s still so much to explore. Some of it is a bit repetitive yet it’s wonderful exploring such a well realised version of Egypt, in the time of Cleopatra, and its surrounds. With that said, I’m at the point where I’m going to ease back and pop into it occasionally rather than have it as my primary game.

Then there was whisky. All the bottles I had opened in early November are now finished. Back then I had 9 bottles altogether with 5 open, now I have 9 bottles and 4 open. Actually I have an additional 7 bottles but they’re each 50ml and combined are equivalent to a single bottle. My partner bought me a box of 4 peated malts for christmas, and I picked up a taster pack of 3 Loch Lomond whiskies. Whiskies opened include:

  • Hellyers Road 10 year old (46.2%) – a nice, soft dram from Tasmania. Usually retails around $90 and I think I’m on my second bottle.
  • Ben Nevis 18 year old (single cask, 54.7%) – strong but delish, loving this one and on to the second bottle. This was $240 and is part of a fund raiser for a new distillery in Corowa, NSW.
  • BenRiach Peated Cask Strength Single Malt (56%) – also strong and also delish. This was $150 and I have a suspicion that BenRiach is turning out to be one of my favourite distilleries after Highland Park and Overeem. I have also enjoyed their 17 year old PX cask.
  • Glenmorangie: The Duthac (43%) – more yum. This was a christmas present; it was released for travel retail and is primarily available at duty free places at airports, Singapore in this instance. Part finished in Pedro Ximénez casks. Sherry casks are my preferred and the Pedro Ximénez (PX) seems to raise that a notch or two.

Speaking of Pedro, I rather like sherry straight too. I used to prefer ports and muscats, and even had a port barrel maturing at one stage. I suspect if I ever do another barrel it will be for sherry. Of sherries, the Pedro Ximénez or PX (though it seems irreverent to shorten it such) is turning out to be my favourite. I have been trying out various releases from cheap to expensive, the most expensive being around $55 for 350ml! My favourite, while a little pricey, seems to be the Cardenal Cisneros at $56/750ml, though cheap compared to whisky.

knuth

I often say professionally that I did a compsci major (though can never claim it officially) yonks ago but decided against becoming a programmer. That’s not a decision I regret mostly, though it must be said I continue to have strong leanings that direction. Scarily, it’s been over 25 years since those compsci days. Still, I learnt good stuff.

I recall in the second half of first year compsci, we had an older lecturer at the time who was actually a maths lecturer who seemed to have come across into computers. I can say “older” as I’ve just found this bio which sums up very briefly a rather fascinating career. He may even have been one of my favourite lecturers as he liked to play with new ideas and introduced stuff he knew about from maths into computing. I was a very rare beast in compsci in that I was enrolled under BA and not directly in Compsci and I did no math. I had done first year math but it wasn’t quite my bag. Doherty was very big on mathematical ideas and assessing efficiencies of algorithms.

I recall him talking about some weird algorithm for encrypting data and he worked through the basic idea in a lecture; I think it was based on some sort of fractional encoding model. At the end of the lecture, he said the next assignment would be to implement it. I found the idea of it fascinating. The next assignment came out and sure enough it was on encryption so I implemented the algorithm in Pascal that he’d talked about based on my lecture notes. The idea was you’d write code to encrypt a paragraph of text, and code to decrypt the text. I was mostly successful but because it relied on decimal conversion of larger numbers, it rapidly lost accuracy on the 8 bit macs we were using at the time. Out of a sentence of 10 words, it started losing letters by the end of the first word.
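The accuracy problem is easy to reproduce today. The sketch below is not Doherty’s algorithm (the details are long gone); it is a stand-in that assumes the simplest possible fractional encoding, packing each character as a base-256 digit after the point. The names encode, decode and BASE are made up for illustration. With ordinary floating point the later letters fall off the end of the number, while exact fractions round-trip cleanly.

```python
from fractions import Fraction

BASE = 256  # assumption for this sketch: one base-256 "digit" per character


def encode(text, number=float):
    """Pack text into one value in [0, 1), i.e. 0.c1 c2 c3 ... in base 256."""
    value = number(0)
    scale = number(1)
    for byte in text.encode("ascii"):
        scale = scale / BASE
        value = value + byte * scale
    return value


def decode(value, length):
    """Peel the characters back off by repeatedly multiplying by the base."""
    chars = []
    for _ in range(length):
        value = value * BASE
        digit = int(value)  # the integer part is the next character
        chars.append(chr(digit))
        value = value - digit
    return "".join(chars)


message = "The quick brown fox"
print(repr(decode(encode(message), len(message))))            # float: tail is lost
print(repr(decode(encode(message, Fraction), len(message))))  # exact: full message
```

A double only carries 53 bits of precision, so with 8 bits per character only the first handful of letters survive, much like losing letters by the end of the first word.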

Turns out, I should have read the back page of the assignment. Doherty had decided that the technique was a little too experimental for first year compsci and had instead instructed everyone to use a hashing technique. I handed my assignment in and discussed with the class tutor what I’d done. He wasn’t familiar with the algorithm at all but was impressed that it worked and understood why it failed where it did. I got full marks and first year compsci was one of my few high distinctions at uni.

Anyway, Doherty would often quote Knuth as the foundation of modern computing. Knuth was all about the development of algorithms and understanding their efficiencies. Algorithms are really important as they represent techniques for solving particular sorts of problems eg what is the best way to sort a random string of numbers? The answer varies depending on how many numbers are in the string, or even whether you can know the number of numbers. For very small sets, a bubble sort is sufficient, and from there you move on to binary searches, binary trees, and so on. I wasn’t always across the math but really appreciated the underlying thinking around assessing approaches to problem solving. Plus Doherty was a fab lecturer with a bit of character.
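For anyone who hasn’t met it, a bubble sort looks something like the sketch below. This is a generic textbook version rather than anything from Doherty’s lectures, and the comparison counter is only there to show why it stops being sufficient as the list grows.

```python
def bubble_sort(numbers):
    """Return a sorted copy of numbers plus the number of comparisons made."""
    items = list(numbers)
    comparisons = 0
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if items[i] > items[i + 1]:
                # Neighbours are out of order, so swap them.
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break  # a full pass with no swaps means the list is in order
    return items, comparisons


print(bubble_sort([5, 2, 9, 1]))
```

A reversed list of 10 numbers takes 45 comparisons and a reversed list of 100 takes 4,950, so the work grows with the square of the list length.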

So Knuth. He is best known for his series, The Art of Computer Programming, which has gone through a few editions and I wonder if it will ever be actually finished; the fourth volume is actually labeled 4A: Combinatorial Algorithms Part 1. Volume 4 is eventually expected to cover 4 volumes: 4A, 4B, 4C, 4D. 4B has been partially released across several fascicles of which 6 have been released. Volume 3 seems to be the most relevant for where I’m at today and where I’m looking to play; #3 is around 750 pages devoted specifically to sorting and searching. So much of what we do online is reliant on being able to find stuff and to find stuff well, it helps if the data has been ordered.

Knuth has been this name in my head though my life has gone in other directions. A few years ago, I did a google and found that not only were his books on Amazon, there was even a box set of Volumes 1-4A. I bit the bullet about 3 years ago and bought the set, cost around US$180 at the time and looks really, bloody good on the shelf. I haven’t read a great deal yet but dipped in a few times and planning to get into volume 3 properly at some point. I’ve recently been moving stuff around at home and don’t have a lot of space for books next to where my computer gear is these days. However, it turns out, the mac mini sits nicely on top of the set, and my newest computer, the VivoMini sits nicely on top of the mac. I sorta like the idea of these small computers sitting on Knuth’s foundation.

threading delights

I’ve had the new machine a few days and I’m starting to get the hang of it, but learning, lots of learning. Finding linux equivalents of windows tools and then working out how to install them. Troubleshooting unexpected java errors trying to get spark shell to compile properly – turns out I had the JRE but not the full JDK which means I had to download more stuff and update some config files as well as pathname references so the system knows where to find stuff.

As it turns out I completely misread the new pages for Archives Unleashed and didn’t see the black menu bar at the top of the screen for all of the docs. Was a little too tired methinks. I installed stuff using old versions of the docs I found on the wayback machine and other bits. Consequently I’ve ended up with a more recent version of Archives Unleashed (a bit of a mouthful after the easier “warcbase”) with 0.10.1 instead of 0.9.0 and I’m running a current version of Spark Shell, 2.2.0, instead of 1.6.1. Anyway it all works…I think.

The next headache was that my harvest test data was still on the mac mini. I wasn’t sure how to get the data across as I couldn’t write to a windows hard drive from the mac. Then had the bright idea of copying the data, 56 files for a total of 80GB, to my home server via wifi. That took 6 hours…to the server, so I went away and did other things. Towards the end of that process I had a bit of time so I worked out that if I formatted a drive for the mac in exFAT format, I could install some utilities in linux to read it. That took an hour, half hour to copy to the drive, half an hour from the drive to linux. Phew.
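To put those copy times in perspective, here is the back-of-envelope arithmetic. It assumes the 80GB is in decimal units and that each leg of the drive copy took exactly half an hour, so treat the numbers as rough.

```python
def transfer_rate_mb_per_s(gigabytes, hours):
    """Average rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gigabytes * 1000 / (hours * 3600)


wifi = transfer_rate_mb_per_s(80, 6)     # mac mini to home server over wifi
drive = transfer_rate_mb_per_s(80, 0.5)  # one leg of the exFAT drive copy

print(f"wifi:  {wifi:.1f} MB/s")
print(f"drive: {drive:.1f} MB/s")
print(f"drive is about {drive / wifi:.0f}x faster per leg")
```

That works out to roughly 3.7 MB/s over wifi against about 44 MB/s per leg via the drive.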

Then I tried running the Scala code for extracting the site structure and ran into a few errors as about 15% of the files had developed an error somewhere along the way. I removed all the broken files leaving me with 47 usable ones. All up, it took 18 minutes to process the data, not quite as fast as I was hoping. On the other hand, the advantage of having lots of ram is that there was plenty of space to do other things. Running the same job on the mac mini with dual core CPU and 8GB RAM brought it to a grinding halt and nothing else was possible. On the new machine, I could run everything else normally including web browsing and downloads.
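One way to weed out broken files before a long processing run is to read each one through to the end and see whether it decompresses cleanly. The sketch below assumes the harvest files are gzip-compressed WARCs named *.warc.gz, which is a guess for illustration; the function names are made up too, and a clean read only proves the compression is intact, not that every record inside is valid.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses without an error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data, we only care about errors
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated file, or corrupt compressed data.
        return False


def split_usable(directory, pattern="*.warc.gz"):
    """Sort the files in a directory into usable and broken lists."""
    usable, broken = [], []
    for path in sorted(Path(directory).glob(pattern)):
        (usable if is_readable_gzip(path) else broken).append(path)
    return usable, broken
```

Calling split_usable() on the data directory returns the two lists, so the broken files can be moved aside rather than discovered one error at a time.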

htop screenshot

Regardless of whether I allocated 5gb, 10gb, 24gb, or even 28gb of RAM, time taken to process still hovered around 18 minutes. With 28gb allocated it only needed around 15gb to process, as can be seen in the above screenshot of htop. The other nice thing about htop is that it demonstrated that all 8 CPU threads were in use. Where I think I saved some time is that swap doesn’t seem to have been required which would have reduced some overheads. Either that, or I haven’t worked out how to use swap memory yet.

Still very early days.

some new tech

Following my fun in July when I hit a bit of a wall in playing with large data sets and brought my mac mini to a grinding halt, I ruminated on next steps. Wall aside, it was a wee bit frustrating that experiments on larger data sets took a long time to run and that’s been a bit off-putting to further progress. So I decided that I really did need a new machine and was going to get an intel NUC skull canyon as it was small and fast. I waited for Intel to announce their new 8th generation CPUs which they did recently. Unfortunately the upgrade to the current 6th generation Skull isn’t due till Q2 2018.

On the other hand, prices have been dropping on the barebones Skull and you can pick one up for around AUD$700. However a retailer pointed out to me recently that the ASUS VivoMini, while pricier, uses 7th generation CPUs. Plus it’s a cuter box. After some umming and ahhing, I ordered the vivomini with 32GB RAM and an additional 1TB drive (it includes a 256GB SSD in the m.2 port). The CPU is a 7th generation quad core i7. Total cost was around AUD$1,700 whereas a similarly set up Skull would have been around $1,400-1,500. It has a small footprint and sits nicely on top of the mac mini.


Picked it up yesterday and it booted straight into windows. Today, somewhat trepidatiously, I had a go at setting it up to dual boot with linux. The last few years I’ve been running linux via virtualbox on windows and that’s been sufficient. It’s been a long, long, long time since I set up a dual boot machine and that was using debian which was a wee bit challenging at the time.

This time round it was all easy as. I followed some straightforward instructions carefully and tested initially on a live boot via USB and then used that USB to install it properly. I’ve booted back and forth between windows and linux several times just to be sure and so far so good. I’m currently writing this blog via firefox in ubuntu. My next step was going to be to set up warcbase however that’s been deprecated as Ian Milligan and his team have received a new grant and are working on building an updated environment under their Archives Unleashed Toolkit. So I’ll play with that instead :) Regardless I’ll still need to get Apache Spark up and running which is likely my next step.

rabbit holes of adventure

Dinner table conversation tonight ended up chatting about Mystery House, which my partner played occasionally when she was younger. Mystery House is known as the first graphical adventure game. That of course led the conversation into interactive fiction, referencing the top shelf of my bookcase which contains pretty much all of Infocom’s text adventures. I remember Zork II was my first text adventure and fiendish it was. I relied on adventure columns in computer game magazines of the time for clues on how to solve difficult puzzles including the horrible baseball diamond puzzle, also known as the Oddly Angled Room.

In those days, I couldn’t google answers and would spend months stuck on a problem. Sometimes that could be a good thing but mostly it was bloody frustrating. While there was a certain sense of achievement in solving puzzles, it meant I couldn’t advance the story. Solving puzzles was essential to accessing further parts of the game. These days I think I prefer story telling and plot development though solving puzzles is nice too. Happily most games provide decent hint mechanisms and if I get desperate I can google for answers.

Much to the shock of my partner, I commented that I usually have my text adventure collection stored on all my active machines as they are part of my central core of files that migrate across my various computing environments. This sounds substantial until you realise that text adventures, having few graphics, don’t take up a lot of space. My entire interactive fiction archive is a little over 100MB, of which the complete works of infocom account for 95%. Come to think of it, they were the only ones I was able to buy as a box set later, the Lost Treasures of Infocom, and load in a system independent format.

interactive fiction games

The other key adventure game company of the time was Level 9. Infocom were American based, while Level 9 were from the UK and I had several of their games. Regrettably, while I still have the boxes, I no longer have the equipment to read the discs. Later on, graphic adventures developed further with Magnetic Scrolls commencing with their first game, the fantastic The Pawn. I have several of their titles on my shelf too. Methinks I need to investigate further as to whether I can get these on my current machines. Come to think of it, I’ve barely mentioned Sierra Online who were responsible for Mystery House and later developed the King’s Quest and SpaceQuest series. Oh, and then there was Ultima…yet another rabbit hole…