knuth

I often say professionally that I did a compsci major (though I can never claim it officially) yonks ago but decided against becoming a programmer. Mostly that’s not a decision I regret, though it must be said I continue to have strong leanings in that direction. Scarily, it’s been over 25 years since those compsci days. Still, I learnt good stuff.

I recall that in the second half of first year compsci we had an older lecturer, actually a maths lecturer who seemed to have come across into computers. I can say “older” as I’ve just found this bio, which sums up very briefly a rather fascinating career. He may even have been one of my favourite lecturers, as he liked to play with new ideas and introduced stuff he knew from maths into computing. I was a very rare beast in compsci in that I was enrolled under a BA rather than directly in compsci, and I did no maths beyond first year; I had done first year maths but it wasn’t quite my bag. Doherty was very big on mathematical ideas and on assessing the efficiency of algorithms.

I recall him talking about some weird algorithm for encrypting data; he worked through the basic idea in a lecture, and I think it was based on some sort of fractional encoding model. At the end of the lecture, he said the next assignment would be to implement it. I found the idea of it fascinating. The next assignment came out and sure enough it was on encryption, so I implemented the algorithm he’d talked about in Pascal, working from my lecture notes. The idea was that you’d write code to encrypt a paragraph of text, and code to decrypt it again. I was mostly successful, but because it relied on decimal conversion of larger numbers, it rapidly lost accuracy on the 8 bit macs we were using at the time. Out of a sentence of 10 words, it started losing letters by the end of the first word.
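
Looking back, I suspect what he described was something in the family of arithmetic coding. Purely as an illustration, and definitely not his actual algorithm, here’s a toy sketch of the fractional idea in Scala (the language I tinker with further down) that shows exactly the failure mode I hit: the fraction simply runs out of precision after a handful of letters.

```scala
// Toy "fractional encoding" sketch, arithmetic-coding flavoured: each letter
// narrows an interval within [0, 1), and the final fraction encodes the whole
// message. An illustration of the idea only, not the lecture algorithm.
val alphabet = "abcdefghijklmnopqrstuvwxyz "
val width = 1.0 / alphabet.length

def encode(msg: String): Double = {
  var (lo, hi) = (0.0, 1.0)
  for (c <- msg) {
    val i = alphabet.indexOf(c)
    val range = hi - lo
    hi = lo + range * (i + 1) * width
    lo = lo + range * i * width
  }
  (lo + hi) / 2                       // any value inside the final interval will do
}

def decode(code: Double, length: Int): String = {
  var x = code
  val sb = new StringBuilder
  for (_ <- 1 to length) {
    val i = math.min((x / width).toInt, alphabet.length - 1)
    sb += alphabet(i)
    x = (x - i * width) / width       // zoom back into the chosen sub-interval
  }
  sb.toString
}

decode(encode("hello"), 5)                  // "hello" - short messages survive
decode(encode("the quick brown fox"), 19)   // garbles towards the end: a 64 bit double
                                            // only holds ~11 letters' worth of precision;
                                            // single precision on an old mac held far fewer
```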

Turns out, I should have read the back page of the assignment. Doherty had decided that the technique was a little too experimental for first year compsci and had instead instructed everyone to use a hashing technique. I handed my assignment in and discussed with the class tutor what I’d done. He wasn’t familiar with the algorithm at all but was impressed that it worked and understood why it failed where it did. I got full marks and first year compsci was one of my few high distinctions at uni.

mini computers on top of computer books

Anyway, Doherty would often quote Knuth as the foundation of modern computing. Knuth was all about the development of algorithms and understanding their efficiencies. Algorithms are really important as they represent techniques for solving particular sorts of problems, e.g. what is the best way to sort a random string of numbers? The answer varies depending on how many numbers are in the string, or even whether you can know the number of numbers. For very small sets a bubble sort is sufficient, and from there you move on to binary searches, binary trees, and so on. I wasn’t always across the maths but really appreciated the underlying thinking around assessing approaches to problem solving. Plus Doherty was a fab lecturer with a bit of character.
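
For the record, and because it’s the kind of thing Doherty would have sketched on the board, here’s a minimal bubble sort in Scala rather than the Pascal of my uni days; nothing clever, just the quadratic baseline that the rest of Knuth’s sorting material improves upon.

```scala
// Bubble sort: keep sweeping through the array, swapping adjacent pairs that
// are out of order, until a full pass makes no swaps. Fine for tiny inputs,
// O(n^2) comparisons in general.
def bubbleSort(xs: Array[Int]): Array[Int] = {
  val a = xs.clone()
  var swapped = true
  while (swapped) {
    swapped = false
    for (i <- 0 until a.length - 1) {
      if (a(i) > a(i + 1)) {
        val t = a(i); a(i) = a(i + 1); a(i + 1) = t
        swapped = true
      }
    }
  }
  a
}

bubbleSort(Array(5, 1, 4, 2, 8))   // Array(1, 2, 4, 5, 8)
```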

So Knuth. He is best known for his series, The Art of Computer Programming, which has gone through a few editions and which I wonder will ever actually be finished; the fourth volume is labelled 4A: Combinatorial Algorithms Part 1, and Volume 4 is eventually expected to span four volumes: 4A, 4B, 4C and 4D. 4B has been partially released across several fascicles; six have been released so far. Volume 3 seems the most relevant for where I’m at today and where I’m looking to play; it’s around 750 pages devoted specifically to sorting and searching. So much of what we do online relies on being able to find stuff, and to find stuff well it helps if the data has been ordered.

Knuth has been this name in my head even though my life has gone in other directions. A few years ago, I did a google and found that not only were his books on Amazon, there was even a box set of Volumes 1-4A. I bit the bullet about 3 years ago and bought the set; it cost around US$180 at the time and looks really bloody good on the shelf. I haven’t read a great deal yet but have dipped in a few times and am planning to get into Volume 3 properly at some point. I’ve recently been moving stuff around at home and don’t have a lot of space for books next to where my computer gear is these days. However, it turns out the mac mini sits nicely on top of the set, and my newest computer, the VivoMini, sits nicely on top of the mac. I sorta like the idea of these small computers sitting on Knuth’s foundation.

more sheep (there are spoilers)

…which sounds like it could be a line introducing the planned Settlers of Catan movie.

But no, I of course refer to a second movie picking up on the themes of Philip K Dick‘s “Do Androids Dream of Electric Sheep?”. As an aside, thankfully, googling “android” and “dick” produced far less scary results than anticipated. Dick’s greatness far exceeds dodgy pr0n references. DADOES, as it is often shortened to, is possibly Dick’s second best novel; the best is generally regarded to be The Man in the High Castle. I tend not to disagree.

When the first version of the first movie was released in, I think, 1982 or ’83, I wasn’t able to see it. Dad, however, bought me the book. I was 14 at the time. It blew me away. I loved it so much. I’ve only read it once but it has always stuck with me. Vivid. I finally saw the film on its second release, as the Director’s Cut. Loved it. Also vivid. Seared into my mind. I saw it many times, both on big screen and small.

I suspect I’ve seen both versions several times since: comfortably double figures on the Director’s Cut, possibly double figures on the original. I love both, though it remains true that people generally prefer the version they saw first. There is also the much later Final Cut, which I’ve been meaning to watch recently but haven’t quite found the right moment to pop on. I’ve read enough to know where it differs and seen the other versions enough to work out where it fits visually.

Seriously: spoilers below.

Blade Runner is that rare, rare film that leaves out so much of the book yet captures its essence, shares its soul. Blade Runner 2049 also succeeded…mostly. I was riveted to the screen for most of it, for most of it was perfect. If it had stopped shortly after Agent K and Deckard met, or been effectively finished at that point, I would have been happy. But it didn’t finish there; there were fights and kidnapping and more fights, Terminator-esque. The side plot of the tycoon Wallace felt shallow and vacuous. Unnecessary. Even then, if the film had ended with K dying on the snowy steps it might have been redeemed. It felt so much like an interesting film with a Brady Bunch ending slapped on.

And yet, it was still so bloody good…so much of it perfect: visually grand (it needs a really big screen, a big space), musically and aurally wonderful. It still traces the question of what it means to be human, and explores new ripples. It was well paced; events, music, landscapes…connected. I loved its play with virtual characters, and the way it overlapped the virtual with the real…or a sense of the real. It’s all about the sense of real, and not necessarily the real itself.

techie librarian; meatier than a seahorse


Tag lines…whatever do you use for your tagline: the subheading of your identity, the punchline by which people establish a connection. Mostly I pay them lip service, smiling occasionally at a clever one. My own tend to be variations on techie, librarian and eclectic, sometimes all three at once.

In a rather wayward exchange, spinning down a rabbit hole of curiosity, as things are wont to do when Matt Finch is involved, a recent conversation turned from roasting penguins to eating seahorses.

I participated in a workshop as part of NLS8, and the first activity was for everyone to sketch a scene in 90 seconds on a piece of A4, using at least one of three figures shown on a screen: two humans (or human-like figures) and a penguin. As is my wont, I immediately gave in to the dark side and sketched the two humans roasting the penguin. The second half of the activity was for each table to construct a cohesive story using those scenes as panels. They were two quick activities that worked really well as an icebreaker and got you thinking about how easy it is to come up with ideas under pressure.

The seahorses came later…or rather many years earlier:

to which I responded with my “meatier than seahorse” remark and commented elsewhere that while I have never eaten penguin, I have actually eaten seahorse.

Many years ago, 2003 I think (really must upload those photos to flickr), I spent a few weeks on an Intrepid trip in China with friends. We started in Beijing and went to the Beijing night markets, a place where you can eat just about anything, including silk worms and even scorpions on a stick. The scorpions were a wee bit scary but we figured they had to be ok as no one was dropping dead. As far as we could tell, they’re bred without their stingers.

While trying to order something else, there was a language issue and I ended up with a seahorse on a stick. I think the scorpions were about 20 cents for five whereas the seahorse was a few Oz dollars for one. Our tour guide tried to talk our way out of it but the shop owner insisted. So I paid for it and ate it. There wasn’t much flavour as it was primarily shell with perhaps a tiny morsel of meat.

Matt suggested “meatier than a seahorse” as a bio and it immediately rang the right sort of bells, both physically and metaphorically. I am now using it for all my taglines :-)

a little low end

Ruminating further on my desire for more processing power, I’ve been thinking more about clusters and can’t help but feel that a Lego rack of tiny motherboards is a rather cute direction to head in. My general idea at the moment is to look at building a cluster significantly more powerful than a mac mini but with a small footprint and not too expensive. While there are desktop towers and second-hand servers that can achieve much better performance, they take up a lot of physical space. A few things have always interested me in computing:

  • small size (or footprint) – I don’t want them to dominate the space
  • low weight
  • good battery life is also nice but less relevant in this scenario

The Intel NUC cluster is the high end version of the sort of setup that could work for me. However, high end, cutting edge isn’t the only solution, and comments on twitter reminded me that there are other, cheaper options for home use, starting with the humble Raspberry Pi. Turns out there’s quite a bit of work in that area on a low end approach to supercomputing. While the speed of any single board won’t be huge, parallel workloads benefit from having plenty of cores and threads, so a stack of cheap boards may work out better, and cheaper, than a single NUC.
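
One rough way to sanity-check how much a pile of slow cores can actually buy is plain old Amdahl’s law; the fractions and board counts below are purely illustrative, not benchmarks of any real Pi or NUC.

```scala
// Amdahl's law: if a fraction p of a job can be parallelised, then n workers
// give a speedup of at most 1 / ((1 - p) + p / n). Illustrative numbers only.
def amdahl(p: Double, n: Int): Double = 1.0 / ((1 - p) + p / n)

amdahl(0.95, 4)    // ≈ 3.5x for a 4-board cluster on a mostly parallel job
amdahl(0.95, 16)   // ≈ 9.1x, diminishing returns as the serial 5% dominates
amdahl(0.50, 16)   // ≈ 1.9x, a half-serial job barely benefits at all
```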

train station in Manarola, Italy.

There’s been a lot of work with Raspberry Pi clusters running boards in parallel, with anything from 4 boards up to 200+; someone has even published instructions on building a 4 board Pi cluster in a mere 29 minutes. However, the Pi isn’t the only option; another board that has developed a community is the Odroid series, which seems a wee bit more powerful than the Pi without being much more expensive.

The challenge with Pi/Odroid setups, I gather, is potentially around the ARM chipset, whereas the NUC, being Intel, is on a more common platform. ARM is relatively slower and doesn’t quite have the broad support of mainstream chipsets, though there seems to be a strong community around it. On the other hand, if you want to go down the Intel route, there are credit-card-sized computers, like the Intel Edison, based on x86 chips. They’re literally the size and thickness of a credit card and can boot to standard linux. Clusters of these are even smaller, with a 10 card cluster that looks like it could fit in the palm of your hand.

Realistically, while it’s nice to dream, I’m not actually that great with hardware stuff and I can see that 29 minute Pi cluster taking me most of the day…if I can get it to work at all. Yet it sounds so simple. I suspect it’s a matter of courage, patience and lots of google-fu. I get nervous when dealing with hardware and installing software, blindly running other people’s scripts and keeping my fingers crossed that if errors occur, they’re not too hard to resolve. The advantage of the cheaper approaches is that I’m not too badly out of pocket if I can’t get it to work: a few hundred dollars versus a few thousand. The other question is whether a tenth of the budget produces better than a tenth of the power.

using big data to create bad art

A few weeks back, I installed a lot of software on my computer at home with the plan to work out what to do with large data sets, particularly web archives. One of my roles at work is being responsible for managing and running the Library’s web archiving strategy and regularly harvesting publicly available government websites. That’s all fun and good, but you end up with a lot of data; I think there’s close to 5TB in the collection now. The next tricks revolve around what you can use the data for and what sorts of data are worthwhile to make accessible. Under my current, non-NBN, download speeds I estimate it would take a few months to download 5TB of data, assuming a steady connection.
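
For the curious, the back-of-envelope behind “a few months”, assuming a sustained rate of about 5 Mbit/s, which is a guess at a typical non-NBN connection rather than a measurement:

```scala
// Rough download time for a ~5TB collection at an assumed 5 Mbit/s sustained.
val terabytes = 5.0
val bits      = terabytes * 1e12 * 8        // ≈ 4e13 bits
val mbps      = 5.0                         // assumed sustained Mbit/s
val seconds   = bits / (mbps * 1e6)         // ≈ 8 million seconds
val days      = seconds / 86400.0           // ≈ 93 days, i.e. roughly three months
```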

The dataset I’m using currently is a cohesive collection of publicly available websites containing approximately 68GB of data in 61 files. Each file is a compressed WARC file, WARC being the standard format for Web ARChive files. Following some excellent instructions, I ran the scala code from step 1 in my local install of spark shell and successfully extracted the site structure. The code needed to be modified slightly to work with the pathname of my data set. Roughly, the process was (there’s a sketch of the sort of snippet involved after the list):

  • run spark shell with sufficient memory (I give it 6 of my 8GB of RAM)
  • run “:paste”
  • copy in the scala code
  • hit “Control-D” to close the paste and set the code analysing the data
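
For anyone curious, the snippet I pasted in was along the lines of the warcbase tutorial’s link-structure script. This is a from-memory sketch, so treat the class and method names (RecordLoader, ExtractLinks, WriteGDF and so on) as assumptions to check against the actual guide; the path is made up for illustration.

```scala
// Sketch of a warcbase-style link extraction pasted into spark shell via :paste.
// Class/method names follow my memory of the warcbase tutorial and may differ;
// the path below is a hypothetical stand-in for my 61-file data set.
import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

val warcs = "/Users/me/webarchive/*.warc.gz"          // hypothetical path

val links = RecordLoader.loadArchives(warcs, sc)
  .keepValidPages()                                   // only proper HTML pages
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(l => (r._1, ExtractDomain(l._1), ExtractDomain(l._2))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()                                       // count identical (date, src, dst) triples
  .filter(r => r._2 > 5)                              // drop rare links to keep the graph readable

WriteGDF(links, "sitelinks.gdf")                      // a GDF file Gephi can open directly
```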

I think that took around 20-30 minutes to run. The first time through, it crashed at the end as I’d left a couple of regular text files in the archive directory and the code sample didn’t handle those. Fair enough too, as it’s only sample code and not a full program with error detection and handling. I moved the text files out and ran it again. Second time through it finished happily.

The resultant file containing all the URLs and linkages was a total of 355kb, which is not bad for a starting data size of 68GB and provides something a little more manageable to play with. The next step is to load the file into Gephi, an open source data visualisation tool for networks and graphs. I still have little idea how to use Gephi effectively and am mostly just pressing different buttons and playing with layouts to see how stuff works. I haven’t quite got to the point of making visually interesting displays like the one shown in the tutorial, however I have managed to create some really ugly art:

ugly data analysis

I hit the point a while back where it’s no longer sufficient to play with sample bits and pieces and I need to sit down and learn stuff properly. To that end I ordered a couple of books on Apache Spark, then ordered another book, Programming in Scala, and am wondering whether I should also buy The Scala Cookbook. Or perhaps I shouldn’t try to do everything at once. I am reading both the Spark books concurrently as they’re aimed at different audiences and take different approaches. However, after an initial spurt through the first couple of chapters, I haven’t touched them in a couple of weeks. I also need to learn how to use Gephi effectively, and there are a few tutorials available for doing that. I should explore other visualisation tools as well and continue to look at what other sorts of tools can be used.

a lethargic 5

I’m happy to report that full connectivity was restored to the house a day or two before we were due to fly to NZ. Returning home on Sunday, I was happy to discover that we still had net :) I’ll talk more about the NZ trip and tramping the Kepler Track once I’ve sorted out the photos and loaded them to flickr. I have about 130 photos that I need to weed, though that should be relatively quick compared to weeding my photos from the European holiday over Dec/Jan. For the Europe set, I’ve managed to get it down to under 300 from around 700, but it still needs a couple more goes. I should have the Kepler set up this week at least.

While I’m in post-NZ recovery, here are 5 random things I’ve tweeted in recent months:

baby steps with web harvests

One of the things I’m interested in is working with data sets around web harvesting and archiving. I’ve spent a bit of time over the years exploring the Internet Archive and other web archives, and I’m hitting the point where I’d like to understand the sorts of information gathered when you harvest a bunch of websites. What can be discerned from a site’s structure, how does it change over time, are there any other useful directions to explore?

When you harvest websites you end up with a bunch of files in the WARC format. So far, in my limited experience, a typical WARC file is about a gig and one harvest can contain lots of these files. Depending on how you set up your harvester, you can save all content on a site including office files, music, video and so on. A harvest captures that website at one moment in time, and with repeated harvests it’s possible to get a sense of how it changes over time. As part of learning how all this works, I’m using a small archive of 72 WARC files totalling roughly 55GB.

Having successfully installed lots of software on my machine at home, I might actually be ready to start experimenting. I’ve been following the Getting Started guide for installing Warcbase (a platform for managing web archives) and associated software on a mac mini. While time consuming, it’s actually been straightforward, and installing software on the mac has seemed easier than installing similar stuff under Windows a year or so back. Of that guide, I have completed steps 1, 2, 3 and 5. Step 4 involves installing Spark Notebook, but the primary site seems to be down at the moment, so I’ve installed Gephi to handle data visualisation. As a result I am now running:

  • Homebrew – macOS package manager
  • Maven 3 – software project management tool
  • Warcbase – built on Hadoop and HBase
  • Apache Spark – an engine for large-scale data processing
  • Gephi – data visualisation

In other words, a bunch of tools for dealing with really large data sets installed on a really small computer :-) I’d originally bought the mac mini to migrate my photo collection from a much older Mac Pro and hadn’t considered it as a platform for doing large scale data stuff. So far, it’s holding up, though I am feeling the limits of having only 8GB of RAM.

All those tools can be used on really big systems and run across server clusters. Thankfully, they also work on a single system, but you have to keep the data chunks small. I tried analysing the entire 55GB archive in one go but Spark spat out a bunch of errors and crashed. Running it file by file, where each file is up to a gig, seems to be working so far.

There’s been no working internet at home for a couple of weeks, so I’ve been hampered in what help I can look up, but at least I had all the software installed before we lost the connection. Spark may have had issues for a different reason, e.g. I may not have specified the directory path correctly, but I couldn’t easily google the errors.

I’m trying out a script in Spark to generate the site structure from each archive, and this is typically producing a file of about 2-3k from a 1GB file of data. The script writes to Gephi’s file format, GDF, and Gephi can load lots of files and merge them into one. That means I can run a file-by-file analysis and then combine the results at the visualisation stage. I haven’t worked out the code to run the script iteratively over each file and am manually changing the file name each time (a rough sketch of what that loop might look like is below the image). The ugly image below is my first data load into Gephi showing the interlinking URL nodes. I haven’t done anything with it; it is literally the first display screen. However, it does indicate that I might at last be heading in a useful direction.

visualisation of website nodes
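
Since I haven’t yet worked out the proper way to iterate, here’s a rough sketch of what the loop might look like, reusing the same warcbase-style calls as before; again, the class names and paths are assumptions for illustration rather than tested code.

```scala
// Hypothetical per-file version of the link extraction: run it once per WARC
// and write one GDF per file, which Gephi can then merge into a single workspace.
// warcbase class names are from memory and the paths are made up.
import java.io.File
import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

val warcDir   = new File("/Users/me/webarchive")      // hypothetical directory
val warcFiles = warcDir.listFiles.filter(_.getName.endsWith(".warc.gz"))

warcFiles.foreach { f =>
  val links = RecordLoader.loadArchives(f.getPath, sc)
    .keepValidPages()
    .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(l => (r._1, ExtractDomain(l._1), ExtractDomain(l._2))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()

  WriteGDF(links, s"gdf/${f.getName}.gdf")            // one small GDF per 1GB WARC
}
```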

Next steps include learning how to write the scripts myself and how to use Gephi to produce a more meaningful visualisation.