I’ve had the new machine a few days and I’m starting to get the hang of it, but there’s lots of learning: finding Linux equivalents of Windows tools and then working out how to install them, and troubleshooting unexpected Java errors while trying to get the Spark shell to compile properly. It turns out I had the JRE but not the full JDK, which meant I had to download more software and update some config files, as well as path references, so the system knows where to find things.
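For anyone chasing the same Java errors: the JRE ships `java` but not the compiler, so checking for `javac` tells you which one you have. A rough sketch of the check and the path fix, where the package name and JDK path are assumptions for an Ubuntu/Debian-style system and will vary by distro:

```shell
# The JRE provides `java` but not `javac`; only the full JDK has both.
if command -v javac >/dev/null 2>&1; then
  echo "JDK found: $(javac -version 2>&1)"
else
  echo "Only a JRE (or nothing): install a full JDK, e.g. the openjdk-8-jdk package"
fi

# Point tools at the JDK (path is an assumption; adjust for your distro):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```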
As it turns out, I completely misread the new pages for Archives Unleashed and didn’t see the black menu bar at the top of the screen linking to all of the docs. I was a little too tired, methinks. So I installed things using old versions of the docs I found on the Wayback Machine, plus other bits. Consequently I’ve ended up with a more recent version of Archives Unleashed (a bit of a mouthful after the easier “warcbase”), 0.10.1 instead of 0.9.0, and I’m running a current version of the Spark shell, 2.2.0, instead of 1.6.1. Anyway, it all works…I think.
The next headache was that my harvest test data was still on the Mac mini, and I wasn’t sure how to get it across as I couldn’t write to a Windows-formatted hard drive from the Mac. Then I had the bright idea of copying the data, 56 files totalling 80GB, to my home server via wifi. That took 6 hours just to reach the server, so I went away and did other things. Towards the end of that process I had a bit of time, so I worked out that if I formatted a drive on the Mac as exFAT, I could install some utilities on Linux to read it. That route took an hour: half an hour to copy to the drive, half an hour from the drive to Linux. Phew.
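For the record, the exFAT route is just a couple of commands on the Linux side. This is a sketch for a Debian/Ubuntu-style system; the package names are the ones in use around this era, and the device and mount point below are placeholders you’d substitute for your own:

```shell
# Install exFAT support (package names assume Debian/Ubuntu):
sudo apt-get install exfat-fuse exfat-utils

# Mount the drive read/write; /dev/sdb1 and /mnt/usb are placeholders:
sudo mkdir -p /mnt/usb
sudo mount -t exfat /dev/sdb1 /mnt/usb
```

Handy because exFAT is one of the few filesystems that macOS, Windows, and Linux can all read and write without fuss.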
Then I tried running the Scala code for extracting the site structure and ran into a few errors, as about 15% of the files had developed an error somewhere along the way. I removed all the broken files, leaving me with 47 usable ones. All up, it took 18 minutes to process the data, not quite as fast as I was hoping. On the other hand, the advantage of having lots of RAM is that there was plenty of headroom to do other things. Running the same job on the Mac mini, with its dual-core CPU and 8GB of RAM, brought it to a grinding halt and nothing else was possible. On the new machine I could run everything else normally, including web browsing and downloads.
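A quick way to find the broken files, assuming gzip-compressed WARCs (`.warc.gz`), is to test each compressed stream without extracting anything; `gunzip -t` reads the whole file and reports corruption:

```shell
# Flag any .warc.gz whose compressed stream is damaged or truncated.
# gunzip -t checks integrity without writing any output files.
for f in *.warc.gz; do
  gunzip -t "$f" 2>/dev/null || echo "broken: $f"
done
```

Anything it flags can be moved aside before pointing Spark at the directory.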
Regardless of whether I allocated 5GB, 10GB, 24GB, or even 28GB of RAM, the time taken to process still hovered around 18 minutes. With 28GB allocated, it only needed around 15GB to do the job, as can be seen in the above screenshot of htop. The other nice thing about htop is that it showed all 8 CPU threads in use. Where I think I saved some time is that swap doesn’t seem to have been required, which would have reduced some overhead. Either that, or I haven’t worked out how to use swap memory yet.
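For reference, the allocation happens when launching the Spark shell. `--driver-memory` is the standard Spark flag; the jar path and script name below are placeholders, not the actual files from this setup:

```shell
# Launch spark-shell with 10GB for the driver, the Archives Unleashed
# fat jar on the classpath, and a Scala script to run on startup.
# Jar path and script name are placeholders.
spark-shell --driver-memory 10g \
  --jars /path/to/aut-fatjar.jar \
  -i extract_site_structure.scala
```

Since the runtime barely moved between 5GB and 28GB, the job here looks I/O-bound rather than memory-bound, which would explain why extra allocation didn’t help.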
Still very early days.