Dirty Data? Clean it up!
Or, how to do data science in the real world.
Dan Lynn
CEO, AgilData
@danklynn
dan@agildata.com
Patrick Russell
Independent Consultant (formerly Data Science @Craftsy)
@patrickrm101
patrick@patrickrussell.me
© Phil Mislinksi - www.pmimage.com
Patrick Russell - Bass
Data Scientist between things ;)
Dan Lynn - Guitar
CEO, AgilData
© Phil Mislinksi - www.pmimage.com
EXPERT SOLUTIONS AND SERVICES FOR
COMPLEX DATA PROBLEMS
At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on
the promise of Big Data and complex data infrastructures:
● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined
with 24×7 remote managed services for DBA/DevOps
● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data
pipeline orchestration, ETL, Data Engineering & DevOps, APIs and custom applications.
www.agildata.com
Hey, you’re a data scientist, right? Great!
We have millions of users. How can we use email
to monetize our user base better?
— Marketing
1 / (1 + exp(-x))
https://www.etsy.com/shop/NausicaaDistribution
Source: https://www.oreilly.com/ideas/2015-data-science-salary-survey
http://www.lavante.com/the-hub/ap-industry/lavante-and-spend-matters-look-at-how-dirty-vendor-data-impacts-your-bottom-line/
Data Cleansing
● Dates & Times
● Numbers & Strings
● Addresses
● Clickstream Data
● Handling missing data
● Tidy Data
Dates & Times
● Timestamps can mean different things
○ ingested_date, event_timestamp
● Clocks can’t be trusted
○ Server time: which server? Is it synchronized?
○ Client time? Is there a synchronizing time scheme?
● Timezones
○ What tz is your own data in?
○ Your email provider? Your adwords account? Your Google Analytics?
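A minimal sketch of the timezone point, assuming timestamps arrive as naive local-time strings and that each source's UTC offset is known. The `to_utc` helper and the two offsets are illustrative, not from the talk. Converting everything to UTC on ingest makes events from different systems comparable.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive_local, source_tz):
    """Interpret a naive 'YYYY-MM-DD HH:MM:SS' string in source_tz and return it in UTC."""
    local = datetime.strptime(naive_local, "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=source_tz).astimezone(timezone.utc)


# Fixed offsets keep the sketch dependency-free; named zones that follow
# daylight-saving rules are available via zoneinfo.ZoneInfo in Python 3.9+.
MOUNTAIN_DAYLIGHT = timezone(timedelta(hours=-6))
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))

web_event = to_utc("2016-06-01 09:00:00", MOUNTAIN_DAYLIGHT)
email_event = to_utc("2016-06-01 09:00:00", EASTERN_DAYLIGHT)

print(web_event.isoformat())
print(email_event.isoformat())
print(web_event - email_event)  # same wall-clock string, different instants
```

The same wall-clock string from two systems can be hours apart, which is why the source timezone has to be recorded alongside the data.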
Numbers & Strings
● Use the right types for your numbers (int, bigint, float, numeric,
etc.)
● Murphy’s Law of text inputs: If a user can put something in a text
field, anything and everything will happen.
● Watch out for floating point precision mistakes
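The floating point warning can be illustrated with a short sketch using only the Python standard library. The values are made up for illustration; the point is that exact equality on binary floats is unreliable, while `Decimal` or a tolerance-based comparison behaves as expected.

```python
import math
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimal values built from strings stay exact (useful for money).
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True

# When floats are unavoidable, compare with a tolerance instead of ==.
print(math.isclose(float_total, 0.3))  # True
```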
Addresses
● Parsing / validation is not something you want to do yourself
○ USPS has validation and zip lookup for US addresses:
https://www.usps.com/business/web-tools-apis/documentation-updates.htm
● Remember zip codes are strings. And the rest of the world does not
use U.S. zips.
● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor
IPs
○ https://www.maxmind.com/en/geoip2-city
○ This is ALWAYS approximate
● If you work with GIS data, we recommend PostGIS (http://postgis.net/)
○ Vanilla PostgreSQL also has the earthdistance module for great-circle distance
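For readers outside a database, great-circle distance can be sketched with the haversine formula. This is an illustrative stand-in, not a description of how earthdistance or PostGIS compute it; the function name and the mean-radius constant are assumptions. The last lines show why zip codes belong in string columns.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Zip codes are identifiers, not numbers: casting to int drops leading zeros.
zip_code = "02134"
print(str(int(zip_code)) == zip_code)  # False
```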
Clickstream Data
● User agent => Device: Don’t do this yourself (we use WURFL and Google
Analytics)
● Query strings follow the rules of text. Everything will show up
○ They might be truncated
○ URL encoding might be missing characters (%2 instead of %20)
○ Use a library to parse params (e.g. Python ships with urllib.parse.parse_qs; urlparse.parse_qs in Python 2)
● Even if your system already creates sessions (Tomcat, Google Analytics), don’t be
afraid to create your own sessions on top of the pageview data
○ You’ll get cross channel and cross device behavior this way
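A rough sketch of both clickstream points, using only the standard library. The 30-minute inactivity cutoff, the `(user_id, timestamp)` pageview shape, and the function names are assumptions for illustration, not the authors' implementation. Keying sessions on a user identifier rather than a cookie is what would let them span devices and channels.

```python
from datetime import datetime, timedelta
from urllib.parse import parse_qs, urlsplit

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def query_params(url):
    """Return query parameters as a dict of lists, keeping blank values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


def assign_sessions(pageviews):
    """Label each (user_id, timestamp) pageview with a per-user session number.

    A new session starts when the gap since that user's previous pageview
    exceeds SESSION_GAP.
    """
    last_seen = {}
    session_no = {}
    labeled = []
    for user_id, ts in sorted(pageviews):
        previous = last_seen.get(user_id)
        if previous is None:
            session_no[user_id] = 1
        elif ts - previous > SESSION_GAP:
            session_no[user_id] += 1
        last_seen[user_id] = ts
        labeled.append((user_id, ts, session_no[user_id]))
    return labeled


params = query_params("https://example.com/?utm_source=email&utm_campaign=spring%20sale&ref=")
views = [
    ("u1", datetime(2016, 6, 1, 9, 0)),
    ("u1", datetime(2016, 6, 1, 9, 10)),
    ("u1", datetime(2016, 6, 1, 10, 30)),
    ("u2", datetime(2016, 6, 1, 9, 5)),
]
sessions = assign_sessions(views)
print(params)
print([(user, number) for user, _, number in sessions])
```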
Missing / empty data
● Easy to overlook but important
● What does missing data mean in the context of your analysis?
○ Not collected (why not?)
○ Error state
○ N/A or undefined
○ Especially for histograms, missing data can lead to very poor conclusions.
● Does your data use sentinel values? (e.g. -9999 or “null”)
○ df['nps_score'] = df['nps_score'].replace(-9999, np.nan)
● Imputation
● Storage
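To show how much damage a sentinel can do, here is a small standard-library sketch. The sentinel set, the sample scores, and the `clean_scores` helper are made up for illustration; the pandas `replace` call on the slide does the equivalent job on a DataFrame column.

```python
from statistics import mean

SENTINELS = {-9999, "null", "", "N/A"}  # assumed sentinel values


def clean_scores(raw_scores):
    """Map sentinel values to None and everything else to float."""
    return [None if value in SENTINELS else float(value) for value in raw_scores]


raw = [9, 10, -9999, 7, "null", 8]

# Treating the numeric sentinel as a real score wrecks the average.
naive_mean = mean(v for v in raw if isinstance(v, (int, float)))

cleaned = clean_scores(raw)
observed = [v for v in cleaned if v is not None]

print(naive_mean)
print(mean(observed))
print(len(cleaned) - len(observed))  # number of missing values
```

Counting the missing values, rather than silently dropping them, keeps the "why is it missing?" question visible.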
Tidy Data
● Conceptual framework for structuring data for analysis and fitting
○ Each variable forms a column
○ Each observation is a row
○ Each type of observational unit forms a table
● Pretty much relational-database normal form, reframed for statistics
● Tidy can be different depending on the question asked
● R (dplyr, tidyr) and Python (pandas) have functions for making your
long data wide & wide data long (stack, unstack, melt, pivot)
● Paper: http://vita.had.co.nz/papers/tidy-data.pdf
● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
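As a library-free illustration of going from wide to long, here is a hand-rolled `melt`. This is a sketch of the idea only; the function, its parameters, and the sample data are assumptions, and in practice the pandas or tidyr functions named above would be used.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Turn wide rows (one column per variable) into long rows (one row per value)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows


# Wide: one column of email opens per month (hypothetical data).
wide = [
    {"user": "a", "jan": 3, "feb": 5},
    {"user": "b", "jan": 0, "feb": 2},
]

# Long/tidy: each row is one observation of (user, month, opens).
tidy = melt(wide, id_vars=["user"], var_name="month", value_name="opens")
for row in tidy:
    print(row)
```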
Tidy Data
● For example, marketplace transaction data might have 1 row per
transaction
● For an analysis of participants, you might want 1 row per participant instead
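The marketplace example could be sketched as follows, assuming each transaction row has hypothetical `buyer`, `seller`, and `amount` fields. The same facts are reshaped from 1 row per transaction to 1 row per participant.

```python
from collections import defaultdict


def per_participant(transactions):
    """Reshape 1-row-per-transaction data into 1-row-per-participant summaries."""
    summary = defaultdict(
        lambda: {"purchases": 0, "sales": 0, "spent": 0.0, "earned": 0.0}
    )
    for tx in transactions:
        buyer, seller = summary[tx["buyer"]], summary[tx["seller"]]
        buyer["purchases"] += 1
        buyer["spent"] += tx["amount"]
        seller["sales"] += 1
        seller["earned"] += tx["amount"]
    return dict(summary)


transactions = [
    {"buyer": "alice", "seller": "bob", "amount": 20.0},
    {"buyer": "carol", "seller": "bob", "amount": 15.0},
    {"buyer": "bob", "seller": "alice", "amount": 5.0},
]

participants = per_participant(transactions)
for name, row in sorted(participants.items()):
    print(name, row)
```

Neither shape is more correct; which one counts as tidy depends on whether the question is about transactions or about people.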
Hey, that’s a great model. How can we build it
into our decision-making process?
— Marketing
Operationalizing Data Science
● Doing an analysis once rarely delivers lasting value.
● The business needs continuous insight, so you need to get this stuff
into production.
○ Hosting
○ ETL
○ Pipelines
Hosting
● Delivering continuous analyses requires operational infrastructure
○ Database(s)
○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc.)
○ REST services / microservices
● These all have uptime requirements. You need to involve your (dev)ops
team earlier rather than later.
● Microservices / REST endpoints have architectural implications
● Visualization tools
○ Local (e.g. Jupyter, Zeppelin)
○ On-premise (Arcadia Data, Tableau, Qlik)
○ Hosted (Chartio)
● Visualization tools often require a SQL interface, thus….
ETL - Extract, Transform, Load
● Often used to herd data into some kind of data warehouse (e.g. RDBMS
+ star schema, Hadoop w/ unstructured data, etc.)
● Not just for data warehousing
● Not just for modeling
● No general solution
● Tooling
○ Apache Spark, Apache Sqoop
○ Commercial tools: Informatica, Vertica, SQL Server, DataVirtuality, etc.
● And then there is Apache Kafka…and the “NoETL” movement
○ Book: “I Heart Logs” by Jay Kreps
○ Replay history from the beginning of time as needed
ETL - Extract, Transform, Load - Example
● Not just for production runs
○ For example, Patrick does a lot of ad hoc time-to-event analysis on email opens,
transactions, visits.
■ Survival functions, etc...
○ Set up ETL that builds tables with the right shape to feed straight into models
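One possible "right shape" for time-to-event work is a table with a duration and an event flag per subject, which is the input survival models commonly take. The sketch below builds it for email opens; the input dictionaries, field names, and the `time_to_event_rows` helper are hypothetical, not the authors' ETL.

```python
from datetime import datetime


def time_to_event_rows(sends, opens, observed_until):
    """Build one row per email send: (send_id, duration_hours, opened).

    sends maps send_id -> send time; opens maps send_id -> first open time.
    Sends with no open by observed_until are right-censored (opened=False).
    """
    rows = []
    for send_id, sent_at in sorted(sends.items()):
        opened_at = opens.get(send_id)
        opened = opened_at is not None and opened_at <= observed_until
        end = opened_at if opened else observed_until
        duration_hours = (end - sent_at).total_seconds() / 3600
        rows.append((send_id, duration_hours, opened))
    return rows


sends = {
    "s1": datetime(2016, 6, 1, 9, 0),
    "s2": datetime(2016, 6, 1, 9, 0),
}
opens = {"s1": datetime(2016, 6, 1, 11, 30)}

rows = time_to_event_rows(sends, opens, observed_until=datetime(2016, 6, 2, 9, 0))
for row in rows:
    print(row)
```

Keeping the unopened send as a censored row, instead of dropping it, is what distinguishes a time-to-event table from a simple "time to open" average.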
Pipelines
● From data to model output
● Define dependencies and a DAG for the work
○ Steps are defined by taking the output of prior steps as their input
○ Luigi (http://luigi.readthedocs.io/en/stable/index.html)
○ Drake (https://github.com/Factual/drake)
○ scikit-learn has its own Pipeline
■ That can be part of your bigger pipeline
● Scheduling can be trickier than you think
○ Resource contention
○ Loose dependencies
○ Cron is fine but Jenkins works really well for this!
● Don’t be afraid to create and tear down full environments as steps
○ For example, spin up and configure an EMR cluster, do stuff, tear it down*
* make your VP of Infrastructure less miserable
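The dependency-and-DAG idea can be sketched with Python's standard-library `graphlib` (Python 3.9+). The step names are hypothetical; the tools listed above add scheduling, retries, and storage targets on top of this ordering problem.

```python
from graphlib import TopologicalSorter  # standard library in Python 3.9+

# Each step maps to the set of steps whose output it needs (hypothetical names).
dependencies = {
    "extract_events": set(),
    "extract_users": set(),
    "clean_events": {"extract_events"},
    "build_features": {"clean_events", "extract_users"},
    "train_model": {"build_features"},
}


def run_order(graph):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dependencies)
print(order)
```

Steps with no path between them, such as the two extracts, can run in either order or in parallel, which is where resource contention starts to matter.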
Pipelines - Luigi
● Written in Python. Steps implemented by subclassing Task
● Visualize your DAG
● Supports data in relational DBs, Redshift, HDFS, S3, file system
● Flexible and extensible
● Can parallelize jobs
● A workflow runs by executing the last step, which schedules all of its dependencies
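The "run the last step and let it pull in its dependencies" model can be sketched in plain Python. This is deliberately not Luigi's API: the `Step` class, the `done` flag, and the `run` function are stand-ins invented for illustration. In Luigi, as far as we understand it, completeness is judged from each task's output target rather than from an in-memory flag.

```python
class Step:
    """A tiny stand-in for a pipeline task (not Luigi's actual API)."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False


def run(step, log):
    """Run a step after its dependencies, skipping anything already complete."""
    if step.done:
        return
    for upstream in step.requires:
        run(upstream, log)
    log.append(step.name)  # real work would happen here
    step.done = True


extract = Step("extract")
clean = Step("clean", [extract])
features = Step("features", [clean])
report = Step("report", [features, clean])

log = []
run(report, log)  # only the last step is requested
print(log)
```

Because finished steps are skipped, requesting the final step again after a failure only reruns what is still missing.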
Pipelines - Drake
● JVM (written in Clojure)
● Like a Makefile but for data work
● Supports commands in Shell, Python, Ruby, Clojure
Pipelines - More Tools
● Oozie
○ The default job orchestration engine for Hadoop. Can chain together multiple jobs
to form a complete DAG.
○ Open source
● Kettle
○ Old-school, but still relevant.
○ Visual pipeline designer. Execution engine
○ Open source
● Informatica
○ Visual pipeline designer, mature toolset
○ Commercial
● Datavirtuality
○ Treats all your stores (including Google Analytics) like schemas in a single db
○ Great for microservice architectures
○ Commercial
© Patrick Coppinger
Thanks!
dan@agildata.com — patrick@craftsy.com
@danklynn — @patrickrm101
References
● I Heart Logs
○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
● Tidy Data
○ http://vita.had.co.nz/papers/tidy-data.pdf
Additional Tools
● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…)
● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)
● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data
● jq: fast command line tool for working with JSON (e.g. pipe cURL output to jq)
● psql (if you use postgresql or Redshift)

Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 

Dernier (20)

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016

  • 1. Dirty Data? Clean it up! Or, how to do data science in the real world. Dan Lynn CEO, AgilData @danklynn dan@agildata.com Patrick Russell Independent Consultant (formerly Data Science @Craftsy) @patrickrm101 patrick@patrickrussell.me
  • 2. © Phil Mislinksi - www.pmimage.com Patrick Russell - Bass Data Scientist between things ;) Dan Lynn - Guitar CEO, AgilData
  • 3. © Phil Mislinksi - www.pmimage.com EXPERT SOLUTIONS AND SERVICES FOR COMPLEX DATA PROBLEMS At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on the promise of Big Data and complex data infrastructures: ● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined with 24×7 remote managed services for DBA/DevOps ● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data pipeline orchestration, ETL, Data Engineering & DevOps, APIs and custom applications. www.agildata.com
  • 4. Hey, you’re a data scientist, right? Great! We have millions of users. How can we use email to monetize our user base better? — Marketing
  • 5. 1 / (1 + exp(-x))
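The expression on this slide is the logistic (sigmoid) function, 1 / (1 + exp(-x)). A minimal sketch of how it might be computed safely in plain Python follows; the function name `sigmoid` and the branch for negative inputs are illustrative choices, not something taken from the talk.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For very negative x, exp(-x) would overflow, so use the
    # algebraically equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0.0))  # 0.5
```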
  • 14. Data Cleansing ● Dates & Times ● Numbers & Strings ● Addresses ● Clickstream Data ● Handling missing data ● Tidy Data
  • 15. Dates & Times ● Timestamps can mean different things ○ ingested_date, event_timestamp ● Clocks can’t be trusted ○ Server time: which server? Is it synchronized? ○ Client time? Is there a synchronizing time scheme? ● Timezones ○ What tz is your own data in? ○ Your email provider? Your adwords account? Your Google Analytics?
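One way to act on the timezone questions above is to convert every timestamp to UTC at the edge and to refuse naive timestamps unless the source's convention is known. The sketch below assumes ISO 8601 input strings; the helper name `parse_utc`, its flag, and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(ts: str, assume_utc_if_naive: bool = False) -> datetime:
    """Parse an ISO 8601 timestamp and return it as an aware UTC datetime."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # A naive timestamp is ambiguous: only accept it when the caller
        # explicitly states that this source writes UTC.
        if not assume_utc_if_naive:
            raise ValueError(f"timestamp has no UTC offset: {ts!r}")
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# event_timestamp recorded by a client at UTC-06:00 (hypothetical value)
event = parse_utc("2016-06-01T09:30:00-06:00")
# ingested_date written by a server assumed to log in UTC (hypothetical value)
ingested = parse_utc("2016-06-01 15:30:05", assume_utc_if_naive=True)

lag = ingested - event
print(lag.total_seconds())  # 5.0
```

Keeping `event_timestamp` and `ingested_date` as separate columns, both in UTC, makes the lag between them a useful data-quality check in its own right.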
  • 16. Numbers & Strings ● Use the right types for your numbers (int, bigint, float, numeric etc) ● Murphy’s Law of text inputs: If a user can put something in a text field, anything and everything will happen. ● Watch out for floating point precision mistakes
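A small illustration of the last two bullets, using only the standard library: binary floats cannot represent most decimal fractions exactly, and free-text fields need defensive parsing. The helper `parse_amount` and its rules (strip whitespace, drop thousands separators, reject NaN and infinity) are assumptions made for this sketch.

```python
from decimal import Decimal, InvalidOperation

# Binary floating point cannot represent 0.1 exactly.
print(0.1 + 0.2 == 0.3)  # False
# Decimal keeps exact decimal arithmetic, e.g. for money.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True


def parse_amount(raw: str):
    """Return a Decimal for a user-entered amount, or None if it is not a finite number."""
    text = raw.strip().replace(",", "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    # Decimal accepts "NaN" and "Infinity"; treat those as invalid input.
    return value if value.is_finite() else None


for raw in ["1,234.50", " 12 ", "twelve", "", "NaN"]:
    print(repr(raw), "->", parse_amount(raw))
```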
  • 17. Addresses ● Parsing / validation is not something you want to do yourself ○ USPS has validation and zip lookup for US addresses: https://www.usps.com/business/web-tools-apis/documentation-updates.htm ● Remember zip codes are strings. And the rest of the world does not use U.S. zips. ● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor IPs ○ https://www.maxmind.com/en/geoip2-city ○ This is ALWAYS approximate ● If working with GIS, recommend http://postgis.net/ ○ Vanilla postgres also has earthdistance for great circle distance
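For readers without PostGIS or earthdistance at hand, great-circle distance can be approximated with the haversine formula. This is a plain-Python sketch, not the Postgres implementation; the function name, the mean Earth radius constant, and the sample coordinates are assumptions. The last lines show why zip codes belong in string columns.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for Denver and Boulder, Colorado.
print(round(great_circle_km(39.7392, -104.9903, 40.0150, -105.2705), 1))

# Zip codes are strings: casting to int silently drops leading zeros.
zip_code = "02134"
print(str(int(zip_code)) == zip_code)  # False
```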
  • 18. Clickstream Data ● User agent => Device: Don’t do this yourself (we use WURFL and Google Analytics) ● Query strings follow the rules of text. Everything will show up ○ They might be truncated ○ URL encoding might be missing characters (%2 instead of %20) ○ Use a library to parse params (e.g. Python ships with urlparse.parse_qs, which is urllib.parse.parse_qs in Python 3) ● If your system creates sessions (Tomcat, Google Analytics), don’t be afraid to create your own sessions on top of the pageview data ○ You’ll get cross channel and cross device behavior this way
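The two technical points above, parsing query strings with a library and building your own sessions from pageviews, might look roughly like this in Python 3. The URL, the 30-minute inactivity gap, and the `sessionize` helper are assumptions for illustration; the talk does not specify a session rule.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://example.com/landing?utm_source=newsletter&utm_campaign=spring%20sale&ref="
# keep_blank_values=True keeps parameters that are present but empty.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(params["utm_campaign"])  # ['spring sale']

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(pageviews):
    """Assign a per-user session number to (user_id, unix_ts) pageviews.

    A new session starts when a user has been inactive for longer than
    SESSION_GAP_SECONDS. Keying on user_id rather than on a device cookie
    is what lets one session span channels and devices.
    """
    last_seen = {}
    session_no = {}
    out = []
    for user, ts in sorted(pageviews):
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP_SECONDS:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append((user, session_no[user], ts))
    return out


views = [("u1", 0), ("u2", 100), ("u1", 600), ("u1", 2401)]
for row in sessionize(views):
    print(row)
```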
  • 20. Missing / empty data ● Easy to overlook but important ● What does missing data in the context of your analysis mean? ○ Not collected (why not?) ○ Error state ○ N/A or undefined ○ Especially for histograms, missing data can lead to very poor conclusions. ● Does your data use sentinel values? (e.g. -9999 or “null”) ○ df['nps_score'].replace(-9999, np.nan) ● Imputation ● Storage
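To make the sentinel-value bullet concrete, here is a short pandas sketch built around the `replace` call from the slide. The sample scores are invented; only the column name `nps_score` and the sentinel -9999 come from the slide.

```python
import numpy as np
import pandas as pd

# Invented sample: two responses were stored with the sentinel -9999.
df = pd.DataFrame({"nps_score": [9, 10, -9999, 7, -9999, 8]})

# Treating the sentinel as a real score wrecks any summary statistic.
naive_mean = df["nps_score"].mean()

# Convert the sentinel to NaN so pandas skips it in aggregations.
df["nps_score"] = df["nps_score"].replace(-9999, np.nan)
clean_mean = df["nps_score"].mean()

# Always report how much is missing alongside the statistic.
missing_share = df["nps_score"].isna().mean()

print(naive_mean, clean_mean, missing_share)
```

Whether to impute the missing scores or leave them as NaN depends on why they are missing, which is the question the slide asks first.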
  • 21. Tidy Data ● Conceptual framework for structuring data for analysis and fitting ○ Each variable forms a column ○ Each observation is a row ○ Each type of observational unit forms a table ● Pretty much normal form from relational databases for stats ● Tidy can be different depending on the question asked ● R (dplyr, tidyr) and Python (pandas) have functions for making your long data wide & wide data long (stack, unstack, melt, pivot) ● Paper: http://vita.had.co.nz/papers/tidy-data.pdf ● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
  • 22. Tidy Data ● Example might be marketplace transaction data with 1 row per transaction ● You might want to do analysis on participants, 1 row per participant
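The transaction-to-participant reshape could be sketched with pandas `melt` as below. The column names (`buyer_id`, `seller_id`, `amount`) and the data are hypothetical; the point is only that the tidy shape depends on the observational unit the question is about.

```python
import pandas as pd

# One row per transaction (hypothetical marketplace data).
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "buyer_id": ["alice", "bob", "alice"],
    "seller_id": ["carol", "carol", "bob"],
    "amount": [20.0, 35.0, 15.0],
})

# Wide to long: one row per (transaction, participant).
participants = transactions.melt(
    id_vars=["transaction_id", "amount"],
    value_vars=["buyer_id", "seller_id"],
    var_name="role",
    value_name="participant_id",
)
participants["role"] = participants["role"].str.replace("_id", "", regex=False)

# One row per participant, ready for participant-level analysis.
per_participant = (
    participants.groupby("participant_id")
    .agg(n_transactions=("transaction_id", "nunique"), total_amount=("amount", "sum"))
    .reset_index()
)
print(per_participant)
```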
  • 23. Hey, that’s a great model. How can we build it into our decision-making process? — Marketing
  • 25. Operationalizing Data Science ● Doing an analysis once rarely delivers lasting value. ● The business needs continuous insight, so you need to get this stuff into production. ○ Hosting ○ ETL ○ Pipelines
  • 26. Hosting ● Delivering continuous analyses requires operational infrastructure ○ Database(s) ○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc..) ○ REST services / microservices ● These all have uptime requirements. You need to involve your (dev)ops team earlier rather than later. ● Microservices / REST endpoints have architectural implications ● Visualization tools ○ Local (e.g. Jupyter, Zeppelin) ○ On-premise (Arcadia Data, Tableau, Qlik) ○ Hosted (Chartio) ● Visualization tools often require a SQL interface, thus….
  • 27. ETL - Extract, Transform, Load ● Often used to herd data into some kind of data warehouse (e.g. RDBMS + star schema, Hadoop w/ unstructured data, etc.) ● Not just for data warehousing ● Not just for modeling ● No general solution ● Tooling ○ Apache Spark, Apache Sqoop ○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality etc. ● And then there is Apache Kafka…and the “NoETL” movement ○ Book: “I <3 Logs” by Jay Kreps ○ Replay history from the beginning of time as needed
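As a toy illustration of the three ETL stages, the sketch below reads CSV text, cleans it, and loads it into an in-memory SQLite table using only the standard library. The table layout, column names, and sample rows are invented; a real job would use one of the tools listed above.

```python
import csv
import io
import sqlite3

# Invented raw export with messy email values.
RAW_CSV = """user_id,email,signup_date
1, Alice@Example.com ,2016-01-05
2,bob@example.com,2016-02-11
3,,2016-03-01
"""


def extract(text):
    """Extract: read raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix types, normalise emails, turn empty strings into NULLs."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        cleaned.append((int(row["user_id"]), email or None, row["signup_date"]))
    return cleaned


def load(conn, rows):
    """Load: write into the warehouse table; re-running replaces rows by key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
with_email = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NOT NULL").fetchone()[0]
print(with_email)  # 2
```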
  • 28. ETL - Extract, Transform, Load - Example ● Not just for production runs ○ For example, Patrick does a lot of ad hoc time-to-event analysis on email opens, transactions, visits. ■ Survival functions, etc. ○ Set up ETL that builds tables with the right shape to throw right into models
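For time-to-event work, "the right shape" usually means one row per subject with a duration and a flag saying whether the event was observed. Given such a table, a survival function can be estimated with the Kaplan-Meier method. The sketch below is a plain-Python version under that assumption, with invented data; it is not the speakers' code, and a dedicated survival-analysis library would normally be used instead.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    durations: time until the event or until observation stopped.
    observed:  True if the event happened, False if the subject was censored.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    survival = 1.0
    curve = []
    for t in sorted(events):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - events[t] / at_risk
        curve.append((t, survival))
    return curve


# Invented example: days from send until an email was opened.
days_to_open = [1, 2, 2, 5, 7, 7]
opened = [True, True, False, True, False, False]

curve = kaplan_meier(days_to_open, opened)
for t, s in curve:
    print(t, round(s, 4))
```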
  • 29. Pipelines ● From data to model output ● Define dependencies and define DAG for the work ○ Steps defined by assigning input as output of prior steps ○ Luigi (http://luigi.readthedocs.io/en/stable/index.html) ○ Drake (https://github.com/Factual/drake) ○ scikit-learn has its own Pipeline ■ That can be part of your bigger pipeline ● Scheduling can be trickier than you think ○ Resource contention ○ Loose dependencies ○ Cron is fine but Jenkins works really well for this! ● Don’t be afraid to create and tear down full environments as steps ○ For example, spin up and configure an EMR cluster, do stuff, tear it down* * make your VP of Infrastructure less miserable
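The idea of declaring dependencies and letting a DAG decide the execution order can be shown without any pipeline framework. This sketch uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+); the step names and the `run_pipeline` helper are hypothetical, and it leaves out what Luigi or Drake add on top, such as skipping steps whose outputs already exist.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs (hypothetical names).
dag = {
    "extract_events": set(),
    "extract_users": set(),
    "build_features": {"extract_events", "extract_users"},
    "fit_model": {"build_features"},
    "score_users": {"fit_model", "extract_users"},
}


def run_pipeline(dag, actions):
    """Run every step after all of its dependencies; return the order used.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        actions[step]()
    return order


log = []
# Stand-in actions that only record their own name when run.
actions = {name: (lambda n=name: log.append(n)) for name in dag}
order = run_pipeline(dag, actions)
print(order)
```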
  • 30. Pipelines - Luigi ● Written in Python. Steps implemented by subclassing Task ● Visualize your DAG ● Supports data in relational DBs, Redshift, HDFS, S3, file system ● Flexible and extensible ● Can parallelize jobs ● Workflow runs by executing last step which schedules all dependencies
  • 32. Pipelines - Drake ● JVM (written in Clojure) ● Like a Makefile but for data work ● Supports commands in Shell, Python, Ruby, Clojure
  • 33. Pipelines - More Tools ● Oozie ○ The default job orchestration engine for Hadoop. Can chain together multiple jobs to form a complete DAG. ○ Open source ● Kettle ○ Old-school, but still relevant. ○ Visual pipeline designer. Execution engine ○ Open source ● Informatica ○ Visual pipeline designer, mature toolset ○ Commercial ● Datavirtuality ○ Treats all your stores (including Google Analytics) like schemas in a single db ○ Great for microservice architectures ○ Commercial
  • 34. © Patrick Coppinger Thanks! dan@agildata.com — patrick@craftsy.com @danklynn — @patrickrm101
  • 35. References ● I Heart Logs ○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382 ● Tidy Data ○ http://vita.had.co.nz/papers/tidy-data.pdf
  • 36. Additional Tools ● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…) ● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…) ● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data ● jq: fast command line tool for working with JSON (e.g. pipe cURL to jq) ● psql (if you use PostgreSQL or Redshift)