Contenu connexe Similaire à Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning (20) Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning2. © Copyright 2000-2017 TIBCO Software Inc.
A key task to create appropriate analytic models in machine learning or deep learning is
the integration and preparation of data sets from various sources like files, databases, big
data storages, sensors or social networks. This step can take up to 50% of the whole
project.
This session compares different alternative techniques to prepare data, including extract-
transform-load (ETL) batch processing, streaming analytics ingestion, and data wrangling
within visual analytics. Various options and their trade-offs are shown in live demos using
different advanced analytics technologies and open source frameworks such as R, Python,
Apache Spark, Talend or KNIME. The session also discusses how this is related to visual
analytics, and best practices for how the data scientist and business user should work
together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these
concepts are actually leveraged to build the analytic model after data preparation
Comparison of Data Preprocessing vs. Data Wrangling vs. ETL vs. Streaming Ingestion
in Machine Learning / Deep Learning Projects
3. © Copyright 2000-2017 TIBCO Software Inc.
Key Takeaways
Ø Various languages, frameworks and tools for data preparation - trade-offs included
Ø Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
Ø Visual analytics and open source data science components are complementary
Ø Avoiding numerous components speeds up a data science project
… for Data Preparation in Data Science:
4. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
5. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
6. © Copyright 2000-2017 TIBCO Software Inc.
From Insight to Action - Closed Loop for Big Data Analytics
Insight ActionEVENTSEVENTS
7. © Copyright 2000-2017 TIBCO Software Inc.
From Insight to Action - Closed Loop for Big Data Analytics
Insight Action
MONITOR
PREDICT
ACT
DECIDE
MODEL
ACCESS
ANALYZE
WRANGLE
8. © Copyright 2000-2017 TIBCO Software Inc.
Analyst Reports 2016
Magic Quadrant for Advanced Analytics Platforms
The Forrester Wave: Enterprise Insight Platform SuitesMagic Quadrant for Data Integration Tools
Magic Quadrant for BI and Analytics
9. © Copyright 2000-2017 TIBCO Software Inc.
Demystify Data Science for the Business Analyst
Leverage Machine Learning
without
help of a Data Scientist
10. © Copyright 2000-2017 TIBCO Software Inc.
• Business User / Analyst
• Data Scientist
• Citizen Data Scientist
• Developer
User Roles
AI-DRIVEN VISUAL ANALYTICS
DATA DISCOVERY DASHBOARDS
DATA SCIENCE RE-IMAGINED
PREDICTIVE MACHINE LEARNING
STREAMING ANALYTICS
REAL TIME ACTIONABLE
11. © Copyright 2000-2017 TIBCO Software Inc.
• “The heart of data science”
• Domain knowledge is very important
• Often takes 60% to 80% of the whole analytical pipeline
• Get the best accuracy from machine learning algorithms on your datasets
• Cannot be fully automated (at least not in the beginning)
Data Preparation
http://www.slideshare.net/odsc/feature-engineering
Data Preparation
12. © Copyright 2000-2017 TIBCO Software Inc.
• Basics (select, filter, removal of duplicates, …)
• Sampling (balanced, stratisfied, ...)
• Data Partitioning (create training + validation + test data set, ...)
• Transformations (normalisation, standardisation, scaling, pivoting, ...)
• Binning (count-based, handling of missing values as its own group, …)
• Data Replacement (cutting, splitting, merging, ...)
• Weighting and Selection (attribute weighting, automatic optimization, ...)
• Attribute Generation (ID generation, ...)
• Imputation (replacement of missing observations by using statistical algorithms)
Data Cleaning
13. © Copyright 2000-2017 TIBCO Software Inc.
• Using domain knowledge of the data to create features that make machine learning algorithms work
• Fundamental to the application of machine learning
• Both difficult and expensive
• Part of Model Building, but also includes Data Preparation
Feature Engineering
The process of feature engineering
• Brainstorming Or Testing features
• Deciding what features to create
• Creating features
• Checking how the features work
with your model
• Improving your features if needed
• Go back to brainstorming/creating
more features until the work is done
14. © Copyright 2000-2017 TIBCO Software Inc.
Analytical Pipeline
1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment
16. © Copyright 2000-2017 TIBCO Software Inc.
Data Preparation in the Analytical Pipeline
1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment
Data Preprocessing
+
Data Wrangling
=
Success
17. Reference Architecture for Big Data Analytics
Operational Analytics
OperationsLive UI
SENSOR DATA
TRANSACTIONS
MESSAGE BUS
MACHINE DATA
SOCIAL DATA
Streaming AnalyticsAction
Aggregate
Rules
Stream Processing
Analytics
Correlate
Live Monitoring
Continuous query
processing
Alerts
Manual action,
escalation
HISTORICAL ANALYSIS
Data Sheets
BI
Data
Scientists
Cleansed
Data
History
Data Discovery
Enterprise Service Bus
ERP MDM DB WMS
SOA
Data Storage
Internal Data
Integration Bus
API
Event Server
Machine
Learning
Big Data
18. Reference Architecture for Big Data Analytics
Operational Analytics
OperationsLive UI
SENSOR DATA
TRANSACTIONS
MESSAGE BUS
MACHINE DATA
SOCIAL DATA
Streaming AnalyticsAction
Aggregate
Rules
Stream Processing
Analytics
Correlate
Live Monitoring
Continuous query
processing
Alerts
Manual action,
escalation
HISTORICAL ANALYSIS
Data Sheets
BI
Data
Scientists
Cleansed
Data
History
Data Discovery
Enterprise Service Bus
ERP MDM DB WMS
SOA
Data Storage
Internal Data
Integration Bus
API
Event Server
Machine
Learning
Big Data
ETL /
Data Ingestion
(Apache NiFi, Talend, …)
Streaming
Analytics
(Apache Flink, TIBCO StreamBase, …)
Data
Wrangling
(Trifacta, TIBCO Spotfire, …)
Data
Preparation
(R, Python, KNIME,
RapidMiner, …)
Big Data
Preparation
(MapReduce, Spark, …)
19. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
21. © Copyright 2000-2017 TIBCO Software Inc.
• create new column (extract)
• get title out of name (Mr., Mrs., Miss., Master., Other)
• create new column (aggregate)
• familiy size = 1+ SibSp + Parch
• create new column 'CabinFirstCharacter’
• extract the first character of the column 'cabin’
• remove duplicates in dataset
• add data to ‘NA’s (imputation)
• Age: ‘Average’ instead of ‘NA’ or discretize to bins;
• Cabin: Replace empty values with 'U' for Unknown
• use ‘data science functions’ to bring all data in a “similar shape” (e.g.
Scale / normalize / PCA / Box-Cox, …)
Examples for quality improvement and feature engineering
22. © Copyright 2000-2017 TIBCO Software Inc.
Overlapping!
ETL
Data
Wrangling
Streaming
Analytics
Data
Preprocessing
Big Data
Preparation
23. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
24. Frameworks for the Data Scientist
Many more ….
Programming
Language
Big Data
Framework
Deep Learning
Framework
25. © Copyright 2000-2017 TIBCO Software Inc.
• Built for the Data Scientist
• Includes data preprocessing functions (filter, extract, …)
• But also data science functions (scale, shuffle, PCA, …)
• Built for exploratory data analysis
• Focus on ”low level” coding
• Not built for enterprise scale deployment
• Commercial Enterprise Scale Runtime
• R: TIBCO Runtime for R (TERR), Microsoft R (former Revolution R)
Data Preprocessing with R
27. © Copyright 2000-2017 TIBCO Software Inc.
• Manipulate, clean and summarize unstructured data.
• Data manipulation operations such as applying filter, selecting specific
columns, sorting data, adding or deleting columns and aggregating data
• Very easy to learn and use dplyr functions
R Example: dplyr Package
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
28. © Copyright 2000-2017 TIBCO Software Inc.
• ’Data Science related’ Preprocessing (Center, scale, PCA, BoxCox, ...)
• Streamlines the model training process for complex regression and classification
problems
• Generic interface in front of hundreds of existing R model implementations (with
diverse APIs)
R Example: Caret Package
http://topepo.github.io/caret/index.html
30. © Copyright 2000-2017 TIBCO Software Inc.
• Built for the Developer and Data Scientist
• Built for processing big data (GB, TB, PB, …)
• Built-in elastic scalability
• Data processing at the edge (i.e. where the data is located)
• Commercial offerings
• Apache Hadoop / Spark: Hortonworks, Cloudera, MapR, Databricks …
• Focus on ”low level” coding
Data Preprocessing – Big Data Frameworks
31. © Copyright 2000-2017 TIBCO Software Inc.
Apache Spark
https://benfradet.github.io/blog/2015/12/16/Exploring-spark.ml-with-the-Titanic-Kaggle-competition
32. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
33. © Copyright 2000-2017 TIBCO Software Inc.
• Focus on ease-of-use and time-to-market / agility
• Development Environment + Runtime / Execution Server
• Visual “Coding”
• Code Generation
• Leverages Data Science frameworks like R or H2O.ai under
the hood respectively integrates them
• Leverages Big Data frameworks like Apache Hadoop or Spark
Data Preprocessing - by the (Citizen) Data Scientist
34. © Copyright 2000-2017 TIBCO Software Inc.
KNIME
https://www.linkedin.com/pulse/first-experience-knime-richard-soon
35. © Copyright 2000-2017 TIBCO Software Inc.
RapidMiner
https://rapidminer.com/resource/rapidminer-advanced-analytics-demonstration
36. © Copyright 2000-2017 TIBCO Software Inc.
RapidMiner
Filter Columns
Distance-based Outlier Detection
Easy Data Preparation:
• Many visual ML operators
• Intelligent recommendations
• Native Hadoop / Spark support
38. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
39. © Copyright 2000-2017 TIBCO Software Inc.
• Built for “everybody” - Business Analyst or (Citizen) Data Scientist
• Focus on ease-of-use and time-to-market / agility
• e.g. DataWrangler, Trifacta, TIBCO Spotfire
Data Wrangling
Trifacta
Wrangler
40. Inline Data Wrangling within Visual Analytics Tooling
http://marketo.tibco.com/rs/221-BCQ-142/images/how-integrated-data-wrangling-fuels-analytic-creativity.pdf
“When analysts are in the middle of discovery, stopping everything
and going back to another tool is jarring. It breaks their flow. They
have to come back and pick up later. Productivity plummets and
creative energy crashes.”
• Inline-Data Wrangling during exploratory analysis of data
• All-in-one tooling; done by one single user
• AI-driven data wrangling and visualization
• e.g. TIBCO Spotfire
41. © Copyright 2000-2017 TIBCO Software Inc.
Inline Data Wrangling
Inline
Data Wrangling
=
Visual Interactive
Data Analysis
+
Data Preprocessing
in a Single Tool
44. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
45. © Copyright 2000-2016 TIBCO Software Inc.
Dataflow Pipeline – Extract, Transform, Load
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
• Built for the developer
• Focus on ease-of-use and enterprise deployments
• Focus on visual coding
• Focus on complex integration and data quality
• Support for big data frameworks like Apache Hadoop / Spark
46. © Copyright 2000-2017 TIBCO Software Inc.
Pentaho: Loading, transforming and cleaning Titanic data
http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Pentaho_Data_Integration.pdf
47. © Copyright 2000-2017 TIBCO Software Inc.
Agenda
1) The Need for Data Preprocessing and Data Wrangling
2) Kaggle’s Titanic Dataset
3) Data Preprocessing - by the Data Scientist
4) Data Preprocessing - by the (Citizen) Data Scientist
5) Data Wrangling - by the Business Analyst or (Citizen) Data Scientist
6) ETL and DQ - by the Developer
7) Data Ingestion and Streaming Analytics - by the Developer
48. © Copyright 2000-2017 TIBCO Software Inc.
Streaming Analytics - Processing Pipeline
APIs
Adapters /
Channels
Integration
Messaging
Stream Ingest
Transformation
Aggregation
Enrichment
Filtering
Stream
Preprocessing
Process
Management
Analytics
(Real Time)
Applications
& APIs
Analytics /
DW Reporting
Stream
Outcomes
• Contextual Rules
• Windowing
• Patterns
• Analytics
• Deep ML
• …
Stream Analytics &
Processing
Index / SearchNormalization
Data Preprocessing
as piece of the puzzle
(batch or real time)
50. Streaming Analytics Frameworks and Products (no complete list!)
OPEN SOURCE CLOSED SOURCE
PRODUCT
FRAMEWORK
Azure Microsoft
Stream Analytics
http://www.kai-waehner.de/blog/2016/11/15/streaming-analytics-comparison-
open-source-frameworks-products-cloud-services/
51. © Copyright 2000-2017 TIBCO Software Inc.
TIBCO StreamBase: Loading, transforming and cleaning Titanic data
53. © Copyright 2000-2017 TIBCO Software Inc.
Key Takeaways
Ø Various languages, frameworks and tools for data preparation - trade-offs included
Ø Data Wrangling as important add-on to data preprocessing - best within visual analytics tool
Ø Visual analytics and open source data science components are complementary
Ø Avoiding numerous components speeds up a data science project
… for Data Preparation in Data Science:
54. Questions? Please contact me!
Kai Wähner
Technology Evangelist
kontakt@kai-waehner.de
@KaiWaehner
www.kai-waehner.de
LinkedIn