2. What is it?
https://github.com/addthis/hydra
● Hadoop-style distributed processing
framework, optimised for trees
● The Big Idea:
data processing = building and navigating
tree data structures
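One way to picture the Big Idea is the toy sketch below. It is not Hydra's job config format or API, just plain awk. The column layout (date, HTTP status), the sample records, and the function names `build_tree` and `node_count` are all made up for illustration. "Building" bumps a counter on every node along the path root -> date -> status; "navigating" reads the counter stored at one path.

```shell
# Toy illustration only: NOT Hydra's API or job config format.
# Input: tab-separated records with hypothetical columns <date> <status>.

# Build: add one to every node on the path root -> date -> status,
# then print each node as "<path>\t<count>".
build_tree() {
  awk -F '\t' '
    { count["root"]++; count["root/" $1]++; count["root/" $1 "/" $2]++ }
    END { for (path in count) printf "%s\t%d\n", path, count[path] }
  ' | sort
}

# Navigate: print the count stored at a single path of the built tree.
node_count() {
  awk -F '\t' -v path="$1" '$1 == path { print $2 }'
}

# Three made-up records; how many hits on 2014-02-16?
printf '2014-02-16\t200\n2014-02-16\t404\n2014-02-17\t200\n' |
  build_tree | node_count 'root/2014-02-16'
# prints 2
```

In this picture the query is just a path lookup, because the aggregation already happened while the tree was being built.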
4. Getting started (OSX)
# Prerequisites
brew install rabbitmq maven coreutils wget
# Check this works without a passphrase
ssh localhost
# Check that the GNU coreutils cmds
# (grm, gcp, gln, gmv) are on your PATH
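The two checks above can be scripted roughly as follows. This is a sketch, not part of Hydra: the function names `can_ssh_without_prompt` and `missing_cmds` are made up here. `BatchMode=yes` tells ssh never to prompt, so the call fails instead of asking for a passphrase; note that it will also fail if localhost's host key has not been accepted yet.

```shell
# Succeeds only if ssh to the given host works without any prompt.
# (Hypothetical helper name; BatchMode and ConnectTimeout are standard ssh options.)
can_ssh_without_prompt() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Prints each of the given commands that is NOT on the PATH, one per line.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

Usage: `can_ssh_without_prompt localhost || echo "ssh localhost needs a passphrase"` and `missing_cmds grm gcp gln gmv`, which prints nothing when all four commands are found.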
# Clone & build
git clone https://github.com/addthis/hydra.git
cd hydra
mvn package
5. Getting started (2)
# Start local stack
hydra-uber/bin/local-stack.sh start
hydra-uber/bin/local-stack.sh start
# yes, twice!
hydra-uber/bin/local-stack.sh seed
# UI should now be running
open http://localhost:5052
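If the UI takes a while to come up, a small retry loop can poll it before opening the browser. This is a generic sketch; the helper name `retry` and the choice of curl for polling are assumptions, not something the Hydra scripts provide.

```shell
# Run a command up to N times, one second apart, until it succeeds.
# Usage: retry <attempts> <command> [args...]
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Pause between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Usage: `retry 30 curl -sf -o /dev/null http://localhost:5052 && open http://localhost:5052`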
6. Hello world
# Sample job definition file available at
hydra-uber/local/sample/self-gen-tree.json
# Click ‘Create’, copy-paste the job config,
# save the job and click ‘Kick’ to run it.
# Click the ‘Q’ button to open the query UI
# and see the resulting data.
7. Analysing text files
# Tips:
## “files” source is broken. Use “mesh2”.
## Docs are out of date. Read the source code!
# Mesh filesystem root is here:
hydra-local/streams/
# Here’s an example job config I used to
# parse some TSV-formatted Apache logs
https://gist.github.com/cb372/9046464
8. Conclusions
● If you have Small Data,
use grep, awk, sort, uniq
● If you have Big Data,
use Hadoop
● If you really like trees,
use Hydra ;)