HANDLING BIGGER DATA
What to do if your data’s too big
Data nerding
Your 5-7 things
❑ Bigger data
❑ Much bigger data
❑ Much bigger data storage
❑ Bigger data science teams
BIGGER DATA
Or, ‘data that’s a bit too big’
First, don’t panic
Computer storage
❑ 250 GB internal hard drive: (hopefully) permanent storage. The place where you store photos, data etc.
❑ 16 GB RAM: temporary storage. The place read_csv loads your dataset into.
❑ 2 TB external hard drive: a handy place to keep bigger data files.
Gigabytes, Terabytes etc.
Name | Size in bytes | Contains (roughly)
Byte | 1 | 1 character (‘a’, ‘1’ etc.)
Kilobyte | 1,000 | Half a printed page
Megabyte | 1,000,000 | 1 novella; 5 MB = complete works of Shakespeare
Gigabyte | 1,000,000,000 | 1 high-fidelity symphony recording; 10 m of shelved books
Terabyte | 1,000,000,000,000 | All the X-ray films in a large hospital; 10 TB = Library of Congress collection; 2.6 TB = Panama Papers leak
Petabyte | 1,000,000,000,000,000 | 2 PB = all US academic libraries; 10 PB = 1 hour’s output from the SKA telescope
Exabyte | 1,000,000,000,000,000,000 | 5 EB = all words ever spoken by humans
Zettabyte | 1,000,000,000,000,000,000,000 |
Yottabyte | 1,000,000,000,000,000,000,000,000 | Current storage capacity of the Internet
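The units above make for quick back-of-envelope checks. The sketch below is illustrative only: the helper names (human_size, fits_in_ram) and the assumption of 8 bytes per value (a typical 64-bit number) are not from the slides.

```python
# Decimal (power-of-1000) units, matching the table above.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(num_bytes):
    """Format a byte count using the largest unit that keeps it under 1000."""
    size = float(num_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000

def fits_in_ram(rows, columns, bytes_per_value=8, ram_bytes=16 * 1000**3):
    """Rough check: does a numeric table fit in RAM (default 16 GB)?

    Assumes every value takes bytes_per_value bytes; real DataFrames
    carry extra overhead, so treat a close call as a 'no'.
    """
    return rows * columns * bytes_per_value <= ram_bytes

print(human_size(2_600_000_000_000))        # size of the Panama Papers leak
print(fits_in_ram(100_000_000, 10))         # 100 million rows, 10 columns
print(fits_in_ram(1_000_000_000, 10))       # 1 billion rows, 10 columns
```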
Things to Try: Too Big
❑ Read data in ‘chunks’:
csv_chunks = pandas.read_csv('myfile.csv', chunksize=10000)
❑ Divide and conquer in your code (here: skip the first 10,000 data rows but keep the header):
csv_chunks = pandas.read_csv('myfile.csv', skiprows=range(1, 10001), chunksize=10000)
❑ Use parallel processing
❑ E.g. the Dask library
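Reading in chunks only helps if each chunk is processed and then discarded. A minimal sketch, assuming a single numeric column called value; the in-memory CSV text stands in for 'myfile.csv' so the example runs on its own, and chunked_mean is a hypothetical helper name.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'myfile.csv': one column, the numbers 1 to 100.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101))

def chunked_mean(source, column, chunksize):
    """Mean of one column, computed without loading the whole file at once."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += chunk[column].sum()
        count += len(chunk)
    return total / count

mean_value = chunked_mean(io.StringIO(csv_text), "value", chunksize=10)
print(mean_value)
```

Only running totals are kept between chunks, so memory use depends on chunksize rather than on the size of the file.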
Things to try: Too Slow
❑ Use %timeit to find where the speed problems are
❑ Use compiled Python (e.g. the Numba library)
❑ Use C code (via Cython)
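%timeit is an IPython/Jupyter magic; outside a notebook, the standard-library timeit module does the same job. A small sketch comparing two ways of summing numbers (slow_sum and fast_sum are made-up examples, not from the slides):

```python
import timeit

def slow_sum(n):
    """Sum 0..n-1 with an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    """Same result using the built-in sum, which loops in C."""
    return sum(range(n))

# Run each candidate 20 times and report total seconds taken.
slow_seconds = timeit.timeit(lambda: slow_sum(100_000), number=20)
fast_seconds = timeit.timeit(lambda: fast_sum(100_000), number=20)

print(f"loop:     {slow_seconds:.4f} s")
print(f"built-in: {fast_seconds:.4f} s")
```

Timings vary from machine to machine, so measure before deciding whether something like Numba or Cython is worth the effort.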
MUCH BIGGER DATA
Or, ‘What if it really doesn’t fit?’
Volume, Velocity, Variety
Much Faster Datastreams
Twitter firehose:
❑ Firehose averages 6,000 tweets per second
❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan)
❑ Twitter public streams = 1% of Firehose stream
Google index (2013):
❑ 30 trillion unique pages on the internet
❑ Google index = 100 petabytes (100 million gigabytes)
❑ 100 billion web searches a month
❑ Search returned in about ⅛ second
Distributed systems
❑ Store the data on multiple ‘servers’:
❑ Big idea: Distributed file systems
❑ Replicate data (server hardware breaks more often than you think)
❑ Do the processing on multiple servers:
❑ Lots of code does the same thing to different pieces of data
❑ Big idea: Map/Reduce
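The Map/Reduce idea can be sketched on one machine: the same function is applied to each piece of data (map), then the partial results are combined (reduce). This is a toy word count in plain Python, not Hadoop code; the pieces list and function names are illustrative.

```python
from collections import Counter
from functools import reduce

# Each string stands in for a piece of data held on a different server.
pieces = ["big data big", "data lakes", "big lakes big data"]

def map_piece(text):
    """Map step: count the words in one piece, independently of the others."""
    return Counter(text.split())

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# On a cluster, each map_piece call could run on a different machine.
partial_counts = [map_piece(piece) for piece in pieces]
word_counts = reduce(reduce_counts, partial_counts, Counter())

print(word_counts)
```

Because each map call touches only its own piece, the work can be spread over as many servers as there are pieces.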
Parallel Processors
❑ Laptop: 4 cores, 16 GB RAM, 256 GB disk
❑ Workstation: 24 cores, 1 TB RAM
❑ Clusters: as big as you can imagine…
Distributed filesystems
Your typical rack server...
Map/Reduce: Crowdsourcing for computers
Distributed Programming Platforms
Hadoop
❑ HDFS: distributed filesystem
❑ MapReduce engine: processing
Spark
❑ In-memory processing
❑ Because moving data around is the biggest bottleneck
Typical (Current) Ecosystem
HDFS
Spark
Python
R
SQL
Tableau
Publisher
Data warehouse
Anaconda comes with this…
Parallel Python Libraries
❑ Dask
❑ Datasets look like NumPy arrays, pandas DataFrames
❑ df.groupby(df.index).value.mean()
❑ Direct access into HDFS, S3 etc
❑ PySpark
❑ Also has DataFrames
❑ Connects to Spark
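The groupby expression above is written against the pandas API, which Dask's DataFrame is designed to mirror. The sketch below runs it in plain pandas on a tiny made-up table (the column name value and the index labels are assumptions); with Dask, the same expression would typically be followed by .compute() to get the result.

```python
import pandas as pd

# Tiny illustrative dataset: two groups, identified by the index label.
df = pd.DataFrame(
    {"value": [1.0, 3.0, 10.0, 20.0]},
    index=["a", "a", "b", "b"],
)

# Group rows that share an index label, then average 'value' per group.
# df.groupby(df.index).value.mean() is the attribute-access spelling.
means = df.groupby(df.index)["value"].mean()

print(means)
```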
MUCH BIGGER DATA STORAGE
Or, ‘Where do we put all this stuff?’
SQL Databases
❑ Row/column tables
❑ Keys
❑ SQL query language
❑ Joins etc (like Pandas)
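Tables, keys, queries and joins can all be tried with Python's built-in sqlite3 module. A minimal sketch with made-up tables (authors, talks) held in an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two row/column tables; talks.author_id is a key pointing at authors.id.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE talks ("
    "id INTEGER PRIMARY KEY, "
    "author_id INTEGER REFERENCES authors(id), "
    "title TEXT)"
)
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO talks VALUES (?, ?, ?)",
    [(1, 1, "Bigger data"), (2, 1, "Data lakes"), (3, 2, "Compilers")],
)

# Join the tables on the key and count talks per author.
rows = conn.execute(
    "SELECT authors.name, COUNT(talks.id) "
    "FROM authors JOIN talks ON talks.author_id = authors.id "
    "GROUP BY authors.name ORDER BY authors.name"
).fetchall()

print(rows)
```

The JOIN plays the same role as a pandas merge on a shared column.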
ETL (Extract - Transform - Load)
❑ Extract
❑ Extract data from multiple sources
❑ Transform
❑ Convert data into database formats (e.g. SQL)
❑ Load
❑ Load data into database
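The three ETL steps can be sketched as three small functions. Everything here is illustrative: the two CSV strings stand in for separate data sources, and the visits table and function names are assumptions rather than part of any particular ETL tool.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with slightly inconsistent formatting.
source_a = "page,visits\nHome,10\nabout,3\n"
source_b = "page,visits\nhome ,5\ncontact,2\n"

def extract(text):
    """Extract: read raw records from one source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: clean names and convert counts to integers."""
    return [(r["page"].strip().lower(), int(r["visits"])) for r in records]

def load(conn, rows):
    """Load: insert the cleaned rows into the database."""
    conn.executemany("INSERT INTO visits (page, visits) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, visits INTEGER)")
for source in (source_a, source_b):
    load(conn, transform(extract(source)))

totals = conn.execute(
    "SELECT page, SUM(visits) FROM visits GROUP BY page ORDER BY page"
).fetchall()
print(totals)
```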
Data warehouses
NoSQL Databases
❑ Not forced into row/column
❑ Lots of different types
❑ Key/value: can add a feature without rewriting tables
❑ Graph: stores nodes and edges
❑ Column: useful if you have a lot more reads than writes
❑ Document: general-purpose. MongoDB is commonly used.
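The data shapes behind three of these types can be imitated with plain Python structures. This is only a sketch of the shapes, not of any real NoSQL product or its API; all names and records are made up.

```python
import json

# Key/value: opaque values looked up by key. The second record gains a
# 'team' field without any table having to be rewritten.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ada"})
kv_store["user:2"] = json.dumps({"name": "Grace", "team": "data"})

# Document: nested records that need not share the same fields.
documents = [
    {"name": "Ada", "skills": ["maths"]},
    {"name": "Grace", "team": "data", "skills": ["compilers", "sql"]},
]
data_team = [doc["name"] for doc in documents if doc.get("team") == "data"]

# Graph: nodes plus the edges between them.
edges = [("Ada", "Grace"), ("Grace", "Linus")]
neighbours = {}
for left, right in edges:
    neighbours.setdefault(left, set()).add(right)
    neighbours.setdefault(right, set()).add(left)

print(json.loads(kv_store["user:2"]))
print(data_team)
print(sorted(neighbours["Grace"]))
```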
Data Lakes
BIGGER DATA SCIENCE TEAMS
Or, ‘Who does this stuff?’
Big Data Work
❑ Data Science
❑ Data Analysis
❑ Data Engineering
❑ Data Strategy
Big Data Science Teams
❑ Usually seen:
❑ Project manager
❑ Business analysts
❑ Data Scientists / Analysts: insight from data
❑ Data Engineers / Developers: data flow implementation, production systems
❑ Sometimes seen:
❑ Data Architect: data flow design
❑ User Experience / User Interface developer / Visual designer
Data Strategy
❑ Why should data be important here?
❑ Which business questions does this place have?
❑ What data does/could this place have access to?
❑ How much data work is already here?
❑ Who has the data science gene?
❑ What needs to change to make this place data-driven?
❑ People (training, culture)
❑ Processes
❑ Technologies (data access, storage, analysis tools)
❑ Data
Data Analysis
❑ What are the statistics of this dataset?
❑ E.g. which pages are popular
❑ Usually on already-formatted data, e.g. Google Analytics results
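A "which pages are popular" question often comes down to counting. A minimal sketch using the standard library, with a made-up list of page views standing in for real analytics data:

```python
from collections import Counter

# Hypothetical page-view log: one entry per visit.
page_views = ["/home", "/about", "/home", "/blog", "/home", "/blog"]

# Count visits per page and list the two most visited.
view_counts = Counter(page_views)
top_pages = view_counts.most_common(2)

print(top_pages)
```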
Data Science
❑ Ask an interesting question
❑ Get the data
❑ Explore the data
❑ Model the data
❑ Communicate and visualize your results
Data Engineering
❑ Big data storage
❑ SQL, NoSQL
❑ warehouses, lakes
❑ Cloud computing architectures
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ Big data analytics
❑ Distributed programming platforms
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ etc.
EXERCISES
Or, ‘Trying some of this out’
Exercises
❑ Use pandas read_csv() to read a data file in chunks
LEARNING MORE
Or, ‘books’
READING
“Books are a uniquely portable magic” – Stephen King
THANK YOU
sjterp@thoughtworks.com
