I broke my Mac mini. Temporarily. I found its limits for big data stuff. It could handle running Spark shell with 75% of its RAM allocated, processing 70GB of gzipped WARC files to produce a 355KB Gephi file of links, but it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory at around the 95GB mark. To be honest, I’m rather impressed; I was impressed that it handled 70GB at all. The mini has 8GB of RAM, of which I allocated 6GB to running Spark shell. Everything else slowed to a halt, and ultimately even that wasn’t enough. I eventually worked out that I could split my data into three chunks and process each one separately. I now have 3 Gephi files totaling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
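As a rough illustration of that kind of split, here is a minimal Python sketch that groups files into chunks that each stay under a size budget. It is not how I actually did it: the file names, the 10GB file size and the 70GB budget are all made up for the example, and `split_into_chunks` is a hypothetical helper, not part of Spark or WARCbase.

```python
from typing import Dict, List


def split_into_chunks(sizes_gb: Dict[str, float], budget_gb: float) -> List[List[str]]:
    """Group files, in name order, into chunks whose total size stays within budget_gb.

    A single file larger than the budget still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_total = 0.0
    for name in sorted(sizes_gb):
        size = sizes_gb[name]
        # Start a new chunk when adding this file would exceed the budget.
        if current and current_total + size > budget_gb:
            chunks.append(current)
            current, current_total = [], 0.0
        current.append(name)
        current_total += size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input: 20 gzipped WARC files of 10GB each (200GB in total).
files = {f"crawl-{i:02d}.warc.gz": 10.0 for i in range(20)}

# Assumed budget: 70GB per run, the size the mini is known to cope with.
chunks = split_into_chunks(files, budget_gb=70.0)

for n, chunk in enumerate(chunks, start=1):
    total = sum(files[name] for name in chunk)
    print(f"chunk {n}: {len(chunk)} files, {total:.0f}GB")
```

With those made-up numbers the 200GB ends up in three chunks, each of which would then be fed to Spark shell as its own run.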
The bigger question is: do I need to look at new hardware to handle this sort of stuff and, if so, what direction do I go in? I’ve done a bunch of research and I’m unclear on what the best direction is… actually that’s not quite right: I know what the best direction is and it involves a chunk of money :-) What’s the best direction without spending stupid money? I figure there are three main options:
- Intel NUC Skull Canyon vs Supermicro Superserver
- Odroid XU4 cluster: each board has 2GB of RAM and an octa-core CPU (8 cores!), while the Raspberry Pi 3 has only 1GB per board
- Mini tower: takes up more space but is cheaper than option 1 for more RAM and cores/threads, with fewer cooling issues
The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.
In terms of cost, option 2 is cheapest, then 3, then 1. Option 2 is also cutest :-) A cluster of 8 Odroid XU4 boards gives 16GB of RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable, i.e. it can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300, while the Supermicro requires a little more up front but works out cheaper in the long run. Going the Skull route means each upgrade requires a new Skull at $1,400 a pop, whereas the Supermicro requires around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.
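The Odroid cluster totals are just per-board figures multiplied out, which a few lines of Python can sketch. The per-board numbers (2GB, 8 cores) are the ones quoted above; the `Board` class and `cluster_totals` function are names invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Board:
    name: str
    ram_gb: int
    cores: int


def cluster_totals(board: Board, count: int) -> Tuple[int, int]:
    """Return (total RAM in GB, total cores) for `count` identical boards."""
    return board.ram_gb * count, board.cores * count


# Per-board figures as quoted in the post.
xu4 = Board("Odroid XU4", ram_gb=2, cores=8)

ram_gb, cores = cluster_totals(xu4, count=8)
print(f"8 x {xu4.name}: {ram_gb}GB RAM, {cores} cores")
print(f"RAM per core: {ram_gb / cores:.2f}GB")
```

One thing the totals hide: the 16GB is spread across 8 boards, so any single node still only has 2GB to work with.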
I like the idea of option 2 and building a cluster of Odroid XU4s, but to be honest, I’m not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven’t worked out how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis. The Skull beats it on raw speed, but for tasks tailored to lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price of the barebones Skull has recently dropped from around AUD$900 to $780. To that you would add RAM and an SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and an SSD, though it can also handle other types of drives.
I’m not even sure I want to go down the “moar hardware” path at all. The Mac mini can cope with the basic 70GB dataset and provides a sufficient environment to experiment and learn in. I’m fairly committed to going down the big data path and need to learn more about how to use Spark and code in Scala. I’m using Scala mostly because it’s been referenced in the WARCbase guides I’ve started out with. I don’t know enough to know if it’s the best path forward; on the other hand, any path forward is a good one.