One of the things I’m interested in is working with data sets around web harvesting and archiving. I’ve spent a bit of time over the years exploring the Internet Archive and other web archives, and I’m hitting the point where I’d like to understand the sorts of information gathered when you harvest a bunch of websites. What can be discerned from a site’s structure? How does it change over time? Are there other useful directions to explore?
When you harvest websites you end up with a bunch of files in the WARC format. So far, in my limited experience, a typical WARC file is about a gigabyte, and one harvest can contain many of these files. Depending on how you set up your harvester, you can save all the content on a site, including office files, music, video and so on. A harvest captures a website at one moment in time, and with repeated harvests it’s possible to get a sense of how it changes over time. As part of learning how all this works, I’m using a small archive of 72 WARC files totalling roughly 55GB.
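For reference, a WARC file is essentially a long sequence of records, each with its own headers, with the harvested content following the headers. A response record looks roughly like this (the values here are illustrative, not from my archive):

```
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:0fa3e3f7-...>
WARC-Date: 2016-05-12T10:20:30Z
WARC-Target-URI: http://example.org/
Content-Type: application/http; msgtype=response
Content-Length: 1234

HTTP/1.1 200 OK
Content-Type: text/html
...
```

Because every record carries a WARC-Date and a WARC-Target-URI, repeated harvests of the same site can be lined up over time.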
Having successfully installed lots of software on my machine at home, I might actually be ready to start experimenting. I’ve been following the Getting Started guide for installing Warcbase (a platform for managing web archives) and associated software on a Mac mini. While time-consuming, it’s actually been straightforward, and installing software on the Mac has seemed easier than installing similar tools under Windows a year or so back. Of that guide, I have completed steps 1, 2, 3 and 5. Step 4 involves installing Spark Notebook, but its primary site seems to be down at the moment, so I’ve installed Gephi to handle data visualisation instead. As a result I am now running:
- Homebrew – macOS package manager
- Maven 3 – software project management tool
- Warcbase – built on Hadoop and HBase
- Apache Spark – an engine for large-scale data processing
- Gephi – data visualisation
In other words, a bunch of tools for dealing with really large data sets, installed on a really small computer :-) I’d originally bought the Mac mini to migrate my photo collection from a much older Mac Pro and hadn’t considered it as a platform for doing large-scale data work. So far it’s holding up, though I am feeling the limits of having only 8GB of RAM.
All these tools can be used on really big systems and run across server clusters. Thankfully, they also work on a single machine, but you have to keep the data chunks small. I tried analysing the entire 55GB archive in one go, but Spark spat out a bunch of errors and crashed. Running it file by file, where each file is up to a gigabyte, seems to be working so far.
We’ve had no working internet at home for a couple of weeks, so I’ve been hampered in what help I can look up, but at least I had all the software installed before we lost the connection. Spark may have had issues for a different reason – e.g. I may not have specified the directory path correctly – but I couldn’t easily google the errors.
I’m trying out a script in Spark to generate the site structure from each archive file; this typically produces a 2–3KB file from 1GB of data. The script can write to Gephi’s file format, GDF, and Gephi can load lots of files and merge them into one. That means I can run the analysis file by file and then combine the results at the visualisation stage. I haven’t worked out the code to run the script iteratively over each file, so I’m manually changing the file name each time. The ugly image below is my first data load into Gephi, showing the interlinking URL nodes. I haven’t done anything with it – it is literally the first display screen – but it does indicate that I might at last be heading in a useful direction.
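GDF itself is just a text format – a table of nodes followed by a table of edges, each introduced by a header line – which is part of why merging several of these files in Gephi is straightforward. A made-up fragment of the kind of link graph this produces might look like:

```
nodedef>name VARCHAR,label VARCHAR
n0,example.org
n1,blog.example.org
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE
n0,n1,12.0
```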
Next steps include learning how to write these scripts myself and how to use Gephi to produce a more meaningful visualisation.
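As a stopgap for the manual renaming, the file-by-file runs could probably be wrapped in a small shell loop. This is only a sketch: the script name extract-links.scala, the directory layout and the environment-variable convention for passing paths into the Scala script are my own placeholders, not anything Warcbase prescribes.

```shell
# Run the Spark link-extraction script once per WARC file.
# extract-links.scala would read WARCBASE_INPUT/WARCBASE_OUTPUT via sys.env.
run_warcbase_per_file() {
  archive_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  for warc in "$archive_dir"/*.warc.gz; do
    [ -e "$warc" ] || continue          # glob matched nothing: do nothing
    base=$(basename "$warc" .warc.gz)
    # spark-shell -i runs the given script, so this is one bounded job per file
    WARCBASE_INPUT="$warc" WARCBASE_OUTPUT="$out_dir/$base.gdf" \
      spark-shell -i extract-links.scala
  done
}

# Usage: run_warcbase_per_file "$HOME/warcs" "$HOME/gdf-out"
```

Each invocation stays within the one-file-at-a-time limit that worked above, and the resulting .gdf files can all be dragged into Gephi and merged there.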