Big data analysis from the command line using GNU text utils.
Many big data analysis tasks can be implemented with utilities found on almost every computer. Using these utilities can save time and money and give a good first impression of a problem instance.
This presentation covers some historical background on the GNU text utils, what they are capable of, and when one should prefer command-line utils over modern Big Data technologies.
Your data isn't that big @ Big Things Meetup 2016-05-16
1. Your data isn't that big
Big data processing with bash
scripting via command line
Boaz Menuhin - Crosswise (Oracle)
2. About Crosswise (Oracle)
Crosswise identifies individuals across devices anonymously using Big-Data and
Machine Learning techniques.
Our processing pipeline is embedded with academic knowledge and engineering
expertise.
We process 1.3 petabyte every week.
Our batch production processing stack is made of Python, Scala, Java, AWS
Elastic MapReduce, Luigi, Apache Pig and Graphlab.
3. Big data is not a new problem
1950s - Mainframe
1969 - Unix
1977 - BSD
1983 - GNU
GNU’s coreutils package (formerly textutils)
Implementations and reimplementations of several well-known
infrastructure solutions for common problems, mostly
related to text processing
4. What to use command line utils for?
- For bulk processing
- To validate an assumption
- To debug intermediate results
- For data analysis
When to use command line utils?
- When it is possible
6. Machines are stronger #1
MacBook Pro 15-inch (2014)
2.3 GHz Intel Core i7 quad-core
16 GB RAM, 256 GB SSD
Equivalent to a c3.2xlarge
Can easily process several billion records on a single thread (using Python!)
in reasonable time
7. Machines are stronger #2
AWS EC2:
● m4.10xlarge: 40 vCPUs, 160 GB RAM - $2.394/$0.368 per hour
● r3.8xlarge: 32 vCPUs, 244 GB RAM, 640 GB ephemeral storage - $2.66/$0.3639 per hour
● SSD - $0.10 per GB per month => 5 TB ~ $500 per month
8. Law of large numbers
The average of a sample of size n converges to the average
of the whole data as n → ∞.
Implication:
We can sample our data and compute a statistic on it using simple
command-line utils; if we sample correctly, we can draw conclusions
about the whole data.
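A minimal sketch of this idea (the file name and column number are made up for illustration): draw a uniform random sample of lines with shuf and compute the mean of a numeric column with awk.
Example: shuf -n 100000 events.tsv | cut -f 3 | awk '{sum += $1} END {print sum/NR}'
If the sample is large enough and drawn uniformly, the sample mean of column 3 approximates the mean over the whole file.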
10. No need to...
- No need to write code
- Mature software:
- Google’s MapReduce theoretical paper was published
in 2004 (~20 years after GNU’s textutils)
- 30+ years of documentation and improvements
- GNU’s regex engine is superior to that of Java, Perl,
Python, Ruby and many others (Go is an exception).
22. Efficient utilization of resources
- Let the OS pipe data between processes (see the example after this list)
- No network traffic
- No storage between calculations
- No HDFS, no replication of data
- No map-reduce overhead
- No need to run a cluster
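For illustration (the file names and search string are made up), a single pipeline keeps everything streaming through OS pipes, with no intermediate files, no cluster and no shuffle:
Example: gzcat logs-*.gz | grep "ERROR" | cut -f 2 | sort | uniq -c > error_hist
Each stage runs as its own process and the kernel moves the data between them in memory.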
23. With a cherry on top
- Easy to test manually
- Using head and tail (see the example after this list)
- Easy to learn
- With great software comes great documentation
- Enhance your 1337-h4x0r reputation
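For example (the file and pipeline are illustrative), you can dry-run a pipeline on a small prefix of the input before committing to the full file:
Example: head -n 1000 big_file.tsv | cut -f 1,2 | sort | uniq -c | head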
24. Tools of the trade
awk, comm, cat, csplit, cut,
diff, grep, head, join, jo, jq, nl,
pv, sed, sort, split, uniq, tail, tr,
wc, parallel, xargs
25. Basics - cat, head, tail
cat - read a file and print it to STDOUT
Example: cat file.txt | ...
head - read the first lines of a file or stdin
tail - read the last lines of a file or stdin
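A couple of typical invocations (the file name is illustrative):
Example: head -n 100 file.txt (print the first 100 lines)
Example: tail -n +2 file.txt (print everything from line 2 on, e.g. to skip a header)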
26. Basics - cut
Project specific columns from a structured record
Example: cat file | cut -d$'\t' -f 1,2,6
Good to know:
- CPU expensive
- Does not deal well with empty columns
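When empty columns bite, a hedged alternative (file name illustrative) is awk with an explicit field separator, which keeps empty fields in their positions:
Example: cat file | awk -F'\t' '{print $1, $2, $6}'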
27. Basics - sort and uniq
sort - sort lines by column
* sort will not write to stdout until it has finished processing!
* GNU’s sort will use available cores to parallelize the computation (see the tuning example below)
uniq - eliminate adjacent duplicate lines and, with -c, count them
Example: cat names.txt | cut -f 1 | sort | uniq -c > hist
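A hedged sketch of GNU sort’s tuning flags (the values are illustrative, not recommendations): --parallel sets the number of threads and -S sets the in-memory buffer size.
Example: cat names.txt | cut -f 1 | sort --parallel=8 -S 4G | uniq -c > hist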
28. Intermediate - awk
A whole programming language available from the command line.
Example: Hello world - word count
cat blab | awk '
{for (i=1; i<=NF; i+=1) {arr[$i] += 1}}
END {for (w in arr) {print w, arr[w]}}
' | sort -k 2 -n -r | head -n 3
29. Advanced - parallel, xargs
parallel - use to parallelize a command (instead of a serial for loop)
● Use to fully utilize all cores of the machine.
Example: find . -name '*.gz' | parallel -k -j150% -n 1000 -m zgrep -H -n "STRING" {}
xargs - arrange stdin into arguments of another command
Example: s3cmd ls s3://bucket/path/ | awk '{print $4}' | xargs -n 5 s3cmd get
30. Working with JSON - jq
A very powerful JSON processor
Example: cat many.json | jq -r '.user_agent' | awk 'BEGIN{cntr=0} {cntr+=length($0)} END{print cntr/NR, NR}'
Sensitive to malformed JSON and exotic unicode chars
32. When to use command line utils?
“How big should a cluster be?”
What does the code do?
What machine is used (cores, memory, storage)?
How big is your data?
33. When to use command line utils?
My very own rule of thumb
- len(gzcat(raw data)) <= dozens of GB (see the quick check after this list)
- Task contains no more than 2 sort phases
- Want to validate a hypothesis fast
- Want to get stats fast
- Want to do it once
- No need to write UDFs (User Defined Functions)
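For the first rule of thumb, a quick hedged check (file name illustrative) of the uncompressed size in bytes:
Example: gzcat raw_data.gz | wc -c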
34. Further reading
Set operations with command line:
- http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/
Don't use Hadoop - your data isn't that big
- https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Command-line tools can be 235x faster than your Hadoop cluster:
- http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Comparison of regex engines (2007)
- https://swtch.com/~rsc/regexp/regexp1.html
Today there is more data, but computers are much more powerful
It is possible
Downloading to Amazon machines takes less time and costs less money
And what if you have more than 5 TB?
The law of large numbers says you can take a large enough sample of the data and it will behave like the full data
That is, we can make do with a small part of the data, one that can be processed on a single machine, in order to test our hypothesis