Your data isn't that big
Big data processing with bash
scripting via command line
Boaz Menuhin - Crosswise (Oracle)
About Crosswise (Oracle)
Crosswise identifies individuals across devices anonymously using Big-Data and
Machine Learning techniques.
Our processing pipeline is embedded with academic knowledge and engineering
expertise.
We process 1.3 petabytes every week.
Our batch production processing stack is made of Python, Scala, Java, AWS
Elastic MapReduce, Luigi, Apache Pig and Graphlab.
Big data is not a new problem
1950s - Mainframe
1969 - Unix
1977 - BSD
1983 - GNU
GNU’s coreutils package (formerly textutils, among others)
Implementations and reimplementations of several well-known
infrastructure solutions for common problems, mostly
regarding text processing
What to use command line utils for?
- For bulk processing
- To validate an assumption
- To debug intermediate results
- For data analysis
When to use command line utils?
- When it is possible
It is possible
Machines are stronger #1
Macbook Pro 15 inch (2014)
2.3 GHz Intel Core i7, quad core
16 GB RAM, 256 GB SSD
Equivalent to c3.2xlarge
Can easily process several billion records on a single thread (using Python!)
in a reasonable time
Machines are stronger #2
AWS EC2:
● m4.10xlarge: 40 vCPU, 160 GB RAM - $2.394/$0.368 per hour
● r3.8xlarge: 32 vCPU, 244 GB RAM, 640 GB ephemeral - $2.66/$0.3639 per
hour
● SSD - $0.10 per GB per month => 5 TB ~ $500 per month
Law of large numbers
The average of sampled data of size n converges to the average
of the whole data as n →∞.
Implication:
We can sample our data, compute a statistic on the sample using simple
command line utils, and, if we sample correctly, draw conclusions
about the whole data set.
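A minimal sketch of the sampling idea above. The file values.txt, the 10% sampling rate and the seed 42 are illustrative assumptions, not values from the talk; the data is simply the integers 1..100000.

```shell
# Hypothetical data set: the integers 1..100000, one per line.
workdir=$(mktemp -d)
seq 1 100000 > "$workdir/values.txt"

# Exact mean over the whole file.
full_mean=$(awk '{s += $1} END {printf "%.1f", s / NR}' "$workdir/values.txt")

# Mean over a roughly 10% sample: each line is kept with probability 0.10.
sample_mean=$(awk 'BEGIN {srand(42)} rand() < 0.10 {s += $1; n += 1}
                   END {printf "%.1f", s / n}' "$workdir/values.txt")

echo "full mean:   $full_mean"
echo "sample mean: $sample_mean"
```

The sample mean depends on the awk implementation's random generator, so the exact figure varies between machines, but it should land close to the full mean.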
It is expedient
No need to...
- No need to write code
- Mature software:
- Google’s MapReduce theoretical paper was published
in 2004 (~20 years after GNU’s textutils)
- 30+ years of documentation and improvements
- GNU’s regex engine is superior to that of Java, Perl,
Python, Ruby and many others (Go is an exception).
Regex engine comparison (2007)
Parallelism vs Concurrency*
* Rob Pike
The pipeline architecture |
cat file.txt | cut -f 4 | sort | uniq -c > hist_4
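To make the pipeline above concrete, here is a runnable sketch on a tiny, made-up tab-separated file. The column layout (user, date, country, browser) is an assumption for illustration only.

```shell
# Hypothetical tab-separated input; column 4 holds a browser name.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  u1 2016-05-16 IL chrome \
  u2 2016-05-16 US firefox \
  u3 2016-05-16 IL chrome \
  u4 2016-05-16 DE safari \
  u5 2016-05-16 US chrome > "$workdir/file.txt"

# Each stage is a separate process; the kernel streams data between them.
cat "$workdir/file.txt" | cut -f 4 | sort | uniq -c > "$workdir/hist_4"
cat "$workdir/hist_4"
```

The result is one line per distinct value of column 4, prefixed by its count.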
Efficient utilization of resources
- Let the OS pipe data between processes
- No network traffic
- No storage between calculations
- No HDFS, no replication of data
- No map-reduce overhead
- No need to run a cluster
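One way to see the "no storage between calculations" point is to read compressed input straight into a pipeline, so the decompressed data never touches the disk. This is a sketch on a made-up file, events.tsv.gz (country, event type); gzip -dc is used as the portable spelling of zcat/gzcat.

```shell
# Hypothetical compressed input: country and event type, tab separated.
workdir=$(mktemp -d)
printf '%s\t%s\n' IL click US view IL view DE click IL click \
  | gzip > "$workdir/events.tsv.gz"

# Decompress to stdout and aggregate in one pass; only the small result is stored.
gzip -dc "$workdir/events.tsv.gz" \
  | cut -f 1 | sort | uniq -c | sort -k 1,1nr > "$workdir/by_country"
cat "$workdir/by_country"
```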
With a cherry on top
- Easy to manually test
- Using head and tail
- Easy to learn
- With great software comes great documentation
- Enhance your 1337-h4x0r reputation
Tools of the trade
awk, comm, cat, csplit, cut,
diff, grep, head, join, jo, jq, nl,
pv, sed, sort, split, uniq, tail, tr,
wc, parallel, xargs
Basics - cat, head, tail
cat - read file and print to STDOUT
Example: cat file.txt | ...
head - read first lines of file/stdin
tail - read last lines of file/stdin
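A short sketch of the three basics on a made-up file with a header line. The file name and contents are assumptions for illustration.

```shell
# Hypothetical input: one header line followed by ten records.
workdir=$(mktemp -d)
printf 'name\tcount\n' > "$workdir/file.txt"
seq 1 10 | awk '{print "row" $1 "\t" $1}' >> "$workdir/file.txt"

# head: peek at the header and the first record.
first_two=$(cat "$workdir/file.txt" | head -n 2)

# tail: the last record only.
last_one=$(tail -n 1 "$workdir/file.txt")

# tail -n +2: everything from line 2 onward, i.e. skip the header.
body_lines=$(tail -n +2 "$workdir/file.txt" | wc -l | tr -d ' ')

echo "$first_two"
echo "$last_one"
echo "records without header: $body_lines"
```

Running a pipeline on `head -n 1000` of the input first is a cheap way to check it before processing the whole file.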
Basics - cut
Project specific columns from a structured record
Example: cat file | cut -d $'\t' -f 1,2,6
Good to know:
- CPU expensive
- Does not deal well with empty columns
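A runnable sketch of column projection with cut. The six-column layout is made up for illustration. Tab is already cut's default delimiter; the sketch passes it explicitly through a variable so that it works in any POSIX shell.

```shell
# Hypothetical six-column, tab-separated input.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  u1 IL chrome desktop 12 ok \
  u2 US firefox mobile 7 ok \
  u3 DE safari tablet 3 error > "$workdir/file"

# A literal tab character; cut accepts exactly one character as the delimiter.
tab=$(printf '\t')

# Keep columns 1, 2 and 6; the output is tab-separated as well.
projected=$(cat "$workdir/file" | cut -d "$tab" -f 1,2,6)
echo "$projected"
```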
Basics - sort and uniq
sort - sort lines by column
* sort will not write to stdout until end of processing!
* GNU’s sort will use available cores to parallelize the computation
uniq - eliminate adjacent duplicate lines and, with -c, count them
Example: cat names.txt | cut -f 1 | sort | uniq -c > hist
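A sketch of the histogram example on made-up data, extended with a second sort that ranks the values by frequency. The names are illustrative only.

```shell
# Hypothetical input: first name and last name, tab separated.
workdir=$(mktemp -d)
printf '%s\t%s\n' \
  dana cohen avi levi dana mizrahi noa levi dana peretz avi katz \
  > "$workdir/names.txt"

# uniq only merges adjacent duplicates, hence the sort in front of it.
cat "$workdir/names.txt" | cut -f 1 | sort | uniq -c > "$workdir/hist"

# Rank by count: -k 1,1 restricts the key to column 1, n = numeric, r = descending.
sort -k 1,1nr "$workdir/hist" > "$workdir/hist_ranked"
cat "$workdir/hist_ranked"
```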
Intermediate - awk
A whole programming language via command line.
Example: Hello world - Word Count
cat blab | awk '
{for (i=1; i<=NF; i+=1) {arr[$i] += 1}}
END {for (w in arr) {print w, arr[w]}}
' | sort -k 2 -n -r | head -n 3
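A runnable version of the word count on a two-line, made-up input. Note that the loop runs up to and including NF, so the last word of each line is counted too.

```shell
# Hypothetical input text. Word counts: the=3, cat=2, saw=1, dog=1, ran=1.
workdir=$(mktemp -d)
printf 'the cat saw the dog\nthe cat ran\n' > "$workdir/blab"

top=$(cat "$workdir/blab" | awk '
  {for (i = 1; i <= NF; i += 1) {arr[$i] += 1}}   # count every field of every line
  END {for (w in arr) {print w, arr[w]}}          # emit "word count" pairs, unordered
' | sort -k 2,2nr | head -n 2)
echo "$top"
```

The array lives in memory, so this approach fits as long as the number of distinct words does.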
Advanced - parallel, xargs
parallel - Use to parallelize a command (instead of a serial for loop)
● Use to fully utilize all cores of the machine.
Example: find . -name '*.gz' | parallel -k -j150% -n 1000 -m zgrep -H -n "STRING" {}
xargs - organise stdin into params of another command
Example: s3cmd ls s3://bucket/path/ | awk '{print $4}' |
xargs -n 5 s3cmd get
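A sketch of the same fan-out idea using only xargs, for machines where GNU parallel is not installed. This swaps in xargs -P (a GNU/BSD extension, not POSIX) in place of parallel, and plain grep on uncompressed, made-up log files in place of zgrep.

```shell
# Hypothetical input: six small log files, two matching lines each.
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  printf 'STRING here\nnothing\nSTRING again\n' > "$workdir/part_$i.log"
done

# -print0 / -0 keep unusual file names intact; -n 2 passes two files per grep call;
# -P 4 runs up to four grep processes at the same time.
total=$(find "$workdir" -name '*.log' -print0 \
  | xargs -0 -n 2 -P 4 grep -H -c "STRING" \
  | awk -F: '{s += $NF} END {print s}')
echo "matching lines: $total"
```

grep -c prints one "file:count" line per file, and the final awk adds up the counts regardless of the order in which the parallel jobs finish.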
Working with JSON - jq
Very powerful JSON processor
Example: cat many.json | jq -r '.user_agent' | awk 'BEGIN{cntr=0} {cntr+=length($0)} END{print cntr/NR, NR }'
Sensitive to malformed JSON and exotic Unicode chars
MapReduce analogy
Map - cut, grep, tr, awk, sed
Combine - awk
Sort - sort
Reduce - awk, wc, uniq, comm, join
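The analogy above can be sketched as a single pipeline. The input file requests.tsv (user, HTTP method, bytes) is made up for illustration; the stages are labelled in the comments.

```shell
# Hypothetical input: user, HTTP method, response bytes.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  u1 GET 120 \
  u2 POST 300 \
  u1 GET 80 \
  u3 GET 50 \
  u2 GET 40 > "$workdir/requests.tsv"

# Map:    grep keeps the GET rows, cut projects (user, bytes).
# Sort:   sort places identical keys next to each other.
# Reduce: awk sums bytes per key in one streaming pass, holding one key at a time.
grep -w 'GET' "$workdir/requests.tsv" \
  | cut -f 1,3 \
  | sort -k 1,1 \
  | awk -F '\t' '
      NR > 1 && $1 != key {print key "\t" sum; sum = 0}
      {key = $1; sum += $2}
      END {if (NR > 0) print key "\t" sum}
    ' > "$workdir/bytes_per_user"
cat "$workdir/bytes_per_user"
```

Because the reducer relies on sorted input, it needs memory for only one key, unlike the array-based word count.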
When to use command line utils?
“How big should a cluster be?”
What does the code do?
What machine is used (cores, memory, storage)?
How big is your data?
When to use command line utils?
My very own rule of thumb
- len(gzcat(raw data)) <= dozens of GB
- Task contains no more than 2 sort phases
- Want to validate a hypothesis fast
- Want to get stats fast
- Want to do it once
- No need to write UDFs (User Defined Functions)
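The first rule of thumb, len(gzcat(raw data)), can be checked without storing the decompressed data. A minimal sketch on a made-up file raw.gz; gzip -dc is the portable spelling of zcat/gzcat.

```shell
# Hypothetical compressed input: the integers 1..1000, one per line.
workdir=$(mktemp -d)
seq 1 1000 | gzip > "$workdir/raw.gz"

# Stream the decompressed bytes into wc -c; nothing is written to disk.
raw_bytes=$(gzip -dc "$workdir/raw.gz" | wc -c | tr -d ' ')
echo "uncompressed size: $raw_bytes bytes"
```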
Further reading
Set operations with command line:
- http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/
Don't use Hadoop - your data isn't that big
- https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Command-line tools can be 235x faster than your Hadoop cluster:
- http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Comparison of regex engines (2007)
- https://swtch.com/~rsc/regexp/regexp1.html
Thank you
+ We will be hiring (soon...)

Your data isn't that big @ Big Things Meetup 2016-05-16


Editor's notes

  1. Today there is more data, but computers are much stronger. It is possible.
  2. Downloading to Amazon's machines takes less time and less money. And what if you have more than 5 TB?
  3. The law of large numbers says that a large enough sample of the data will behave like the full data. In other words, we can make do with a small part of the data, one that can be processed on a single machine, to test our hypothesis.
  4. Who knows Go?
  5. First appeared in 1977.