big data is bigger than a mini

I broke my mac mini. Temporarily. I found its limits for big data stuff. It could handle running spark shell with 75% of its RAM allocated, processing 70GB of gzipped WARC files to produce a 355KB gephi file of links. But it took 30 minutes to do it. Today I hit it with around 200GB and it ran out of memory at around 95GB. To be honest, I’m rather impressed. I was impressed that it handled 70GB. The mini has 8GB of RAM, of which I allocated 6GB to running spark shell. Everything else slowed to a halt. Ultimately even that wasn’t enough. I eventually worked out that I could split my data up into three chunks and process them separately. I now have 3 gephi files totaling around 600KB (from 200GB of data) of URLs and links that I can do stuff with.
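The three-way split can be sketched roughly as below. This is only an illustration in Python, not the spark shell/scala code actually used: the file names and sizes are made up, and `split_into_chunks` is a hypothetical helper that greedily balances files across chunks by size.

```python
from typing import List, Tuple

WarcFile = Tuple[str, int]  # (file name, size in GB); both hypothetical


def split_into_chunks(files: List[WarcFile], n_chunks: int) -> List[List[WarcFile]]:
    """Greedily spread files over n_chunks groups of roughly equal total size."""
    chunks: List[List[WarcFile]] = [[] for _ in range(n_chunks)]
    totals = [0] * n_chunks
    # Place the largest files first, each into the currently lightest chunk.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        lightest = totals.index(min(totals))
        chunks[lightest].append((name, size))
        totals[lightest] += size
    return chunks


# Made-up input: 200 files of 1GB each, standing in for ~200GB of WARC data.
files = [(f"crawl-{i:03d}.warc.gz", 1) for i in range(200)]
chunks = split_into_chunks(files, 3)
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {len(chunk)} files, {sum(size for _, size in chunk)}GB")
```

With input like this, each chunk ends up at about a third of the total, which is below the roughly 95GB point where the single run fell over. Each chunk would then be processed in its own run.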

lots of cars on the road in Borneo

The bigger question is: do I need to look at new hardware to handle this sort of stuff and if so, what direction do I go in? I’ve done a bunch of research and I’m unclear on what the best direction is…actually that’s not quite right, I know what the best direction is and it involves a chunk of money :-) What’s the best direction without spending stupid money? I figure there are three main groups:

  1. Intel NUC Skull Canyon vs Supermicro Superserver
  2. Odroid XU4 cluster – each board handles 2GB RAM and they have an octo-core (8 cores!) while Raspberry Pi-3 is only 1GB per board
  3. Mini tower – takes up more space but cheaper than option 1 for more RAM and cores/threads with fewer cooling issues

The key requirements for me are RAM and thread count with the emphasis on a parallel processing environment.

In terms of cost, 2 is cheapest, then 3, then 1. 2 is also cutest :-) A cluster of 8 Odroid XU4 gives 16GB RAM and 64 cores and would probably cost a little over a grand to put together. The Supermicro is more expandable ie it can handle up to 128GB of RAM compared to the 32GB max of the Intel Skull. On the other hand, I can put together a fully loaded Skull for around AUD$1,300 while the Supermicro requires a little more but works out cheaper in the long run. To go the Skull route means each upgrade requires a new Skull at $1,400 a pop whereas the Supermicro requires around $2,000 to start up but has room for cheaper expansions. The Supermicro can be a little noisy and the Skull is quieter.
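The cluster arithmetic is easy to check with a few lines. This is a minimal sketch using only the per-board figures quoted above (2GB RAM and 8 cores per Odroid XU4); the `Board` class and the helper names are hypothetical, and raw totals say nothing about per-core speed or the overhead of networking the boards together.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Board:
    name: str
    ram_gb: int
    cores: int


def cluster_totals(board: Board, count: int) -> Tuple[int, int]:
    """Total RAM (GB) and total cores for `count` identical boards."""
    return board.ram_gb * count, board.cores * count


def boards_needed_for_ram(board: Board, target_ram_gb: int) -> int:
    """Boards required to reach at least target_ram_gb of combined RAM."""
    return math.ceil(target_ram_gb / board.ram_gb)


xu4 = Board("Odroid XU4", ram_gb=2, cores=8)

ram_gb, cores = cluster_totals(xu4, 8)
print(f"8 x {xu4.name}: {ram_gb}GB RAM, {cores} cores")

# RAM ceilings quoted in the post for the other two machines.
for label, max_ram_gb in [("Intel Skull", 32), ("Supermicro", 128)]:
    n = boards_needed_for_ram(xu4, max_ram_gb)
    print(f"{n} boards to match the {label} maximum of {max_ram_gb}GB")
```

On RAM alone, it would take a 16-board cluster to match a maxed-out Skull, which is worth keeping in mind when RAM is one of the key requirements.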

I like the idea of option 2 and building a cluster of Odroid XU4 but to be honest, I’m not great at hardware stuff and suspect that path holds a lot of pain and I may not come out the other side. Option 3 has space issues but plenty of expandability. I haven’t worked out how well an Odroid cluster would compare to the Intel Skull given a parallel processing emphasis. The Skull beats it on raw speed but for tasks tailored for lots of threads running concurrently I have no idea. I really like the Supermicro approach but it costs more and may be noisier. I noticed that the price has recently dropped on the barebones Skull from around AUD$900 to $780. To that you would add RAM and SSD. The Supermicro has a barebones price of around USD$800 and again you then add RAM and SSD, though it can also handle other types of drives.

Motorbike in Borneo

I’m not even sure I want to go down the “moar hardware” path at all. The mac mini can cope with the basic 70GB dataset and provide a sufficient environment to experiment and learn in. I’m fairly committed to going down this path and need to learn more about how to use spark and code in scala. I’m using scala mostly because it’s been referenced in the WARCbase guides I’ve started up with. I don’t know enough to know if it’s the best path forward; on the other hand, any path forward is a good one.

 

#blogjune 2017 recap

Done and dusted for another year. Here we are 4 days later and this is my first post since June finished. Stats are a funny old thing, the only ones I have to count are for folk who specifically visit my page. I have no idea how many other people are reading my posts via feed readers such as feedly. It’s possible to make a rough estimate as feedly does show a subscriber count for each of its feeds but I’m unclear as to its accuracy, having read conflicting accounts. All in all, it sounds like work is required to get accurate figures and my care factor is a little too low for that :-)

blog statistics

Looking at the graph above, direct access seems to have dropped off a little in 2017 but has been mostly steady for the last few years. With that said, the 2017 figure is based on the year so far ie the first six months. That suggests, if I can manage to keep blogging, that 2017 is shaping up to be my best year since at least 2014, and potentially since 2012. I think 2012 was the first year that wordpress broke down the difference between views and visitors, as noted by the darker shading in the column.

In terms of my #blogjune blogging, I managed to just scrape in:

  • 30 posts
  • 10,700 words, averaging 357 words per post

3 posts fewer but 700 words more than my 2016 effort. I think I managed to blog about most things I thought I would, though I never got round to blogging properly about whisky, despite having a few ideas in draft. My top 5 posts were:

which seems a mix of interesting and pedestrian, so here’s the next 5 as well:

which is a more interesting list of titles :-)

techie librarian; meatier than a seahorse

 

Tag lines…whatever do you use for your tagline: the subheading of your identity, the punchline by which people establish a connection. Mostly I pay them lip service, smiling occasionally at a clever one. My own tend to refer to variations of: techie, librarian and eclectic, sometimes all 3 at once.

A recent, rather wayward conversation, spinning down a rabbit hole of curiosity as things are wont to do when Matt Finch is involved, turned from roasting penguins to eating seahorses.

I participated in a workshop as part of NLS8 and the first activity was for everyone to sketch a scene, in 90 seconds, on a piece of A4 using at least one of three figures on a screen: 2 humans (or human-like) and a penguin. As is my wont, I immediately gave in to the dark side and sketched the two humans roasting the penguin. The second half of the activity was for each table to construct a cohesive story using those scenes as panels. They were two quick activities that worked really well as an icebreaker and got you thinking about how easy it was to come up with ideas under pressure.

The seahorses came later…or rather many years earlier:

to which I responded with my “meatier than seahorse” remark and commented elsewhere that while I have never eaten penguin, I have actually eaten seahorse.

Many years ago, 2003 I think (really must upload those photos to flickr), I spent a few weeks on an Intrepid trip in China with friends. We started in Beijing and went to the Beijing night markets, a place where you can eat just about anything including silk worms and even scorpions on a stick. Scorpions were a wee bit scary but we figured they had to be ok as no one was dropping dead. As far as we can figure, they’re bred without their stinger.

While trying to order something else, there was a language issue, and I ended up with seahorse on a stick. I think the scorpions were about 20 cents for five whereas the seahorse was a few Oz dollars for one. Our tour guide tried to talk our way out of it but the shopowner insisted. So I paid for it and ate it. There wasn’t much flavour as it was primarily shell with perhaps a tiny morsel of meat.

Matt suggested “meatier than a seahorse” as a bio and it immediately rang the right sort of bells, both physically and metaphorically. I am now using it for all my taglines :-)

money, money, money

This was given to me by old friends many years ago…I think it was a Kris Kringle present but my memory is a wee bit hazy. Quite simply, this is a moneybox snail as well as being one of my brighter snails. I smile a little every time I look at it. It used to sit on the bookcases in my old loungeroom but these days resides with the larger snail collection.

moneybox snail

filmfest 2017 roundup

Another filmfest finished. I ran out of puff toward the end and didn’t manage to write up film reviews for the final two days. I was also conscious of needing sleep as I anticipated a busy time at NLS8 the following weekend; in hindsight that was a very wise move as NLS8 was exhausting. I have a conference mode where I somehow assume the guise of someone vaguely extroverted and throw myself into things and chat to as many people as I can. However, doing so sucks the energy out of me and I have felt rather fatigued in the days since.

Back to filmfest and I think I managed 28 films this time, slightly up on 27 from last year. It was almost 29 but I chose not to go to a 9.30am screening on the final morning and chose a decent sleep-in and a casual brekky. That film would have been a doco on NASA’s Voyager programme, The Farthest, which I gather was really amazing.

Some things don’t change: the app is still dodgy, the search is still dodgy and the website is still dodgy eg the app has several search options except title search. My partner and I have different android phones, and when she hit the back button it returned her from the ticket to the movie list, whereas for me the back button returned me to the home screen. The desktop version of the website only displays the search option if your browser is a particular width; too narrow and it won’t work.

Tech issues aside, filmfest remained generally excellent and I think this was the first year I managed to avoid a dud film altogether, though Ms16 was less successful alas. This is a list, in alphabetical order, of all the films that stood out for me:

a puppet snail

This was given to me many years ago, and is one of my older snails. A family I used to hang out with in my church days spotted it and had to buy it for me. In the early days I used to wear it as a puppet to entertain their kids. These days it sits in a bowl, looking out on the world.

puppet snail