Big data analysis from the command line using GNU text utils.
Many big data analysis tasks can be implemented with utilities found on almost every computer. Using these utilities can save time and money and give a good first impression of a problem instance.
This presentation covers some historical background on the GNU text utils, what they are capable of, and when one should prefer command-line utils over modern Big Data technologies.
Your data isn't that big @ Big Things Meetup 2016-05-16
1. Your data isn't that big
Big data processing with bash
scripting via command line
Boaz Menuhin - Crosswise (Oracle)
2. About Crosswise (Oracle)
Crosswise identifies individuals across devices anonymously using Big-Data and
Machine Learning techniques.
Our processing pipeline is embedded with academic knowledge and engineering
expertise.
We process 1.3 petabyte every week.
Our batch production processing stack is made of Python, Scala, Java, AWS
Elastic MapReduce, Luigi, Apache Pig and Graphlab.
3. Big data is not a new problem
1950s - Mainframe
1969 - Unix
1977 - BSD
1983 - GNU
GNU’s coreutils package (formerly textutils)
Implementations and reimplementations of several well-known
infrastructure solutions for common problems, mostly
related to text processing
4. What to use command line utils for?
- For bulk processing
- To validate an assumption
- To debug intermediate results
- For data analysis
When to use command line utils?
- When it is possible
6. Machines are stronger #1
MacBook Pro 15-inch (2014)
2.3 GHz Intel Core i7 quad-core
16 GB RAM, 256 GB SSD
Equivalent to a c3.2xlarge
Can easily process several billion records on a single thread (using Python!)
in reasonable time
7. Machines are stronger #2
AWS EC2:
● m4.10xlarge: 40 vCPUs, 160 GB RAM - $2.394/$0.368 per hour
● r3.8xlarge: 32 vCPUs, 244 GB RAM, 640 GB ephemeral storage - $2.66/$0.3639 per hour
● SSD - $0.10 per GB per month => 5 TB ~ $500 per month
8. Law of large numbers
The average of a sample of size n converges to the average
of the whole data as n → ∞.
Implication:
We can sample our data and compute a statistic on it using simple
command-line utils; if we sample correctly, we can draw conclusions
about the whole data.
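A minimal sketch of this idea (the file name and column number are made up for illustration): draw a uniform random sample of lines with shuf and compute the mean of a numeric column with awk.
Example: shuf -n 100000 events.tsv | cut -f 3 | awk '{sum += $1} END {print sum/NR}'
If the sample is large enough and drawn uniformly, the sample mean of column 3 approximates the mean over the whole file.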
10. No need to...
- No need to write code
- Mature software:
- Google’s MapReduce theoretical paper was published
in 2004 (~20 years after GNU’s textutils)
- 30+ years of documentation and improvements
- GNU’s regex engine is superior to that of Java, Perl,
Python, Ruby and many others (Go is an exception).
22. Efficient utilization of resources
- Let the OS pipe data between processes (see the example after this list)
- No network traffic
- No storage between calculations
- No HDFS, no replication of data
- No map-reduce overhead
- No need to run a cluster
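For illustration (the file names and search string are made up), a single pipeline keeps everything streaming through OS pipes, with no intermediate files, no cluster and no shuffle:
Example: gzcat logs-*.gz | grep "ERROR" | cut -f 2 | sort | uniq -c > error_hist
Each stage runs as its own process and the kernel moves the data between them in memory.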
23. With a cherry on top
- Easy to test manually
- Using head and tail (see the example after this list)
- Easy to learn
- With great software comes great documentation
- Enhance your 1337-h4x0r reputation
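For example (the file and pipeline are illustrative), you can dry-run a pipeline on a small prefix of the input before committing to the full file:
Example: head -n 1000 big_file.tsv | cut -f 1,2 | sort | uniq -c | head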
24. Tools of the trade
awk, comm, cat, csplit, cut,
diff, grep, head, join, jo, jq, nl,
pv, sed, sort, split, uniq, tail, tr,
wc, parallel, xargs
25. Basics - cat, head, tail
cat - read a file and print it to STDOUT
Example: cat file.txt | ...
head - read the first lines of a file or stdin
tail - read the last lines of a file or stdin
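A couple of typical invocations (the file name is illustrative):
Example: head -n 100 file.txt (print the first 100 lines)
Example: tail -n +2 file.txt (print everything from line 2 on, e.g. to skip a header)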
26. Basics - cut
Project specific columns from a structured record
Example: cat file | cut -d$'\t' -f 1,2,6
Good to know:
- CPU expensive
- Does not deal well with empty columns
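When empty columns bite, a hedged alternative (file name illustrative) is awk with an explicit field separator, which keeps empty fields in their positions:
Example: cat file | awk -F'\t' '{print $1, $2, $6}'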
27. Basics - sort and uniq
sort - sort lines by column
* sort will not write to stdout until it has finished processing!
* GNU’s sort will use available cores to parallelize the computation (see the tuning example below)
uniq - eliminate adjacent duplicate lines and, with -c, count them
Example: cat names.txt | cut -f 1 | sort | uniq -c > hist
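A hedged sketch of GNU sort’s tuning flags (the values are illustrative, not recommendations): --parallel sets the number of threads and -S sets the in-memory buffer size.
Example: cat names.txt | cut -f 1 | sort --parallel=8 -S 4G | uniq -c > hist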
28. Intermediate - awk
A whole programming language available from the command line.
Example: Hello world - word count
cat blab | awk '
{for (i=1; i<=NF; i+=1) {arr[$i] += 1}}
END {for (w in arr) {print w, arr[w]}}
' | sort -k 2 -n -r | head -n 3
29. Advanced - parallel, xargs
parallel - use to parallelize a command (instead of a serial for loop)
● Use to fully utilize all cores of the machine.
Example: find . -name '*.gz' | parallel -k -j150% -n 1000 -m zgrep -H -n "STRING" {}
xargs - arrange stdin into arguments of another command
Example: s3cmd ls s3://bucket/path/ | awk '{print $4}' | xargs -n 5 s3cmd get
30. Working with JSON - jq
A very powerful JSON processor
Example: cat many.json | jq -r '.user_agent' | awk 'BEGIN{cntr=0} {cntr+=length($0)} END{print cntr/NR, NR}'
Sensitive to malformed JSON and exotic unicode chars
32. When to use command line utils?
“How big should a cluster be?”
What does the code do?
What machine is used (cores, memory, storage)?
How big is your data?
33. When to use command line utils?
My very own rule of thumb
- len(gzcat(raw data)) <= dozens of GB (see the quick check after this list)
- Task contains no more than 2 sort phases
- Want to validate a hypothesis fast
- Want to get stats fast
- Want to do it once
- No need to write UDFs (User Defined Functions)
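For the first rule of thumb, a quick hedged check (file name illustrative) of the uncompressed size in bytes:
Example: gzcat raw_data.gz | wc -c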
34. Further reading
Set operations with command line:
- http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/
Don't use Hadoop - your data isn't that big
- https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Command-line tools can be 235x faster than your Hadoop cluster:
- http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Comparison of regex engines (2007)
- https://swtch.com/~rsc/regexp/regexp1.html
Today there is more data, but computers are much more powerful
It is possible
Downloading to Amazon machines takes less time and costs less money
And what if you have more than 5 TB?
The law of large numbers says you can take a large enough sample of the data and it will behave like the full data
That is, we can make do with a small part of the data, one that can be processed on a single machine, in order to test our hypothesis