September 8-9, 2016
BigML, Inc 2
Basic Transformations
Poul Petersen
CIO, BigML, Inc
Creating Machine Learning Ready Data
BigML, Inc: Machine Learning-Ready Data
Basic Transformations
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
The Dream
CSV -> Dataset -> Model -> Profit!
The Reality
CRM
Web Accounts
Transactions
ML Ready?
Is all hope lost?
How do you even start?
Holistic Approach
• Define a clear idea of the goal.
• Understand what ML tasks will achieve the goal.
• Understand the data structure to perform those ML tasks.
• Find out what kind of data you have and make it ML-Ready:
  • Where is it, and how is it stored?
  • What are the features?
  • Can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: try it on a small scale.
• Accept that you might have to start over…
• But when it works, automate it!
Holistic Approach
Define Goal & ML Task
Understand ML Tasks
Goal -> ML Task
• Will this customer default on a loan? -> Classification
• How many customers will apply for a loan next month? -> Regression
• Is the consumption of this product unusual? -> Anomaly Detection
• Is the behavior of the customers similar? -> Cluster Analysis
• Are these products purchased together? -> Association Discovery
Holistic Approach
Required Data Structure
Classification
[Figure: Categorical; Training, Testing, Predicting]
Regression
[Figure: Numeric; Training, Testing, Predicting]
Anomaly Detection
Cluster Analysis
Association Discovery
Holistic Approach
Make Your Data ML-Ready
ML-Ready Data
Machine Learning algorithms consume instances of the question that you want to model.
Tabular Data: rows are Instances, columns are Fields (Features).
• Each row is one of the instances.
• Each column is a field that describes a property of the instance that is relevant to the question being modeled.
• Fields can be:
  • already present in your data
  • derived from your data
  • or generated using other fields.
!! Danger Ahead !!
Cleansing
Homogenize missing values and mixed types within the same feature, fix input errors, correct semantic issues, etc.

Original data:
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24  -             Rock     139
Blues alive    1990/03/01  281           Blues    239
Lonely planet  2002-11-19  5:32s         Techno   42
Dance, dance   02/23/1983  312           Disco    N/A
The wall       1943-01-20  218           Reagge   83
Offside down   1965-02-19  4 minutes     Techno   895
The alchemist  2001-11-21  418           Bluesss  178
Bring me down  18-10-98    328           Classic  21
The scarecrow  1994-10-12  269           Rock     734

Cleaned data:
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24                Rock     139
Blues alive    1990-03-01  281           Blues    239
Lonely planet  2002-11-19  332           Techno   42
Dance, dance   1983-02-23  312           Disco
The wall       1943-01-20  218           Reggae   83
Offside down   1965-02-19  240           Techno   895
The alchemist  2001-11-21  418           Blues    178
Bring me down  1998-10-18  328           Classic  21
The scarecrow  1994-10-12  269           Rock     734
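
The date and duration fixes in the cleansing example could be sketched in Python roughly as follows. This is a minimal illustration, not the method used in the talk: the function names, the list of accepted date layouts, and the set of missing-value markers are all assumptions chosen to cover the sample rows.

```python
import re
from datetime import datetime

# Date layouts seen in the sample rows. The list is an assumption; ambiguous
# inputs such as 01/02/2003 would need a project-specific rule.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
# Markers treated as "missing" (assumed from the sample data).
MISSING = {"", "-", "N/A"}


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(raw):
    """Return a duration in whole seconds, or None for a missing value."""
    text = raw.strip()
    if text in MISSING:
        return None
    if text.isdigit():                              # already seconds: "281"
        return int(text)
    match = re.fullmatch(r"(\d+):(\d+)s?", text)    # minutes:seconds, "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+) minutes?", text)   # "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return None
```

Genre misspellings such as "Bluesss" are not handled here; they usually need a lookup table of accepted values or a fuzzy match.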
Denormalizing
[Diagram: the users, artists, tracks, and albums tables are joined into a single table of Instances (millions) by Features]
Data is usually normalized in relational databases, but ML-Ready datasets need the information denormalized into a single file/dataset.
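
As a rough sketch of what the join produces, the snippet below flattens a few normalized tables into one row per instance. Everything in it is hypothetical: the `playbacks` table, the column names, and the sample values are invented for illustration, and in practice the join would normally be done in the database itself.

```python
# Hypothetical normalized tables keyed by id; names and values are illustrative.
artists = {10: {"artist_name": "Artist A"}, 11: {"artist_name": "Artist B"}}
albums = {
    20: {"album_title": "Album X", "artist_id": 10},
    21: {"album_title": "Album Y", "artist_id": 11},
}
tracks = {
    30: {"track_title": "Highway star", "album_id": 20},
    31: {"track_title": "Blues alive", "album_id": 21},
}
users = {"User001": {"country": "ES"}, "User005": {"country": "US"}}
playbacks = [
    {"user_id": "User001", "track_id": 30},
    {"user_id": "User005", "track_id": 31},
]


def denormalize(playbacks, users, tracks, albums, artists):
    """Follow the foreign keys so each playback becomes one flat row."""
    rows = []
    for play in playbacks:
        track = tracks[play["track_id"]]
        album = albums[track["album_id"]]
        artist = artists[album["artist_id"]]
        rows.append({
            "user_id": play["user_id"],
            **users[play["user_id"]],          # copy the user's own fields
            "track_title": track["track_title"],
            "album_title": album["album_title"],
            "artist_name": artist["artist_name"],
        })
    return rows


flat = denormalize(playbacks, users, tracks, albums, artists)
```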
Aggregating
When the entity to model is different from the entity in the provided data, an aggregation to get that entity might be needed.

Original data (list of playbacks):
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

Counting playbacks per user from the command line:
tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c

Aggregated data (list of users):
User     Num.Playbacks  Total Time  Pref.Device
User001  3              830         Tablet
User002  1              218         Smartphone
User003  3              1019        TV
User005  2              521         Tablet
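
The shell pipeline only counts playbacks. A Python sketch of the full per-user aggregation (count, total time, preferred device) might look like the following. The function name and output keys are assumptions, and "preferred device" is taken to mean the device with the most playbacks.

```python
from collections import Counter, defaultdict

# Playback rows from the slide's table: (content, duration in s, user, device).
PLAYBACKS = [
    ("Highway star", 190, "User001", "TV"),
    ("Blues alive", 281, "User005", "Tablet"),
    ("Lonely planet", 332, "User003", "TV"),
    ("Dance, dance", 312, "User001", "Tablet"),
    ("The wall", 218, "User002", "Smartphone"),
    ("Offside down", 240, "User005", "Tablet"),
    ("The alchemist", 418, "User003", "TV"),
    ("Bring me down", 328, "User001", "Tablet"),
    ("The scarecrow", 269, "User003", "Smartphone"),
]


def aggregate_by_user(playbacks):
    """Collapse one row per playback into one row per user."""
    counts = Counter()
    total_time = defaultdict(int)
    devices = defaultdict(Counter)
    for _content, duration, user, device in playbacks:
        counts[user] += 1
        total_time[user] += duration
        devices[user][device] += 1
    return {
        user: {
            "num_playbacks": counts[user],
            "total_time": total_time[user],
            # Assumption: preferred device = the one with the most playbacks.
            "pref_device": devices[user].most_common(1)[0][0],
        }
        for user in sorted(counts)
    }


users = aggregate_by_user(PLAYBACKS)
```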
Pivoting
Different values of a feature are pivoted to new columns in the resulting dataset.

Original data:
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

Aggregated data with pivoted columns (NP = number of playbacks, TT = total time):
User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
User001  3              830         Tablet       1      2          0              190    640        0
User002  1              218         Smartphone   0      0          1              0      0          218
User003  3              1019        TV           2      0          1              750    0          269
User005  2              521         Tablet       0      2          0              0      521        0
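
A minimal Python sketch of the pivot, assuming the device values are known up front so that every user gets the same columns. The function name and the tuple layout are illustrative; the `NP_`/`TT_` column prefixes follow the table.

```python
from collections import defaultdict

# (user, device, duration in seconds) taken from the playback table.
PLAYBACKS = [
    ("User001", "TV", 190),
    ("User005", "Tablet", 281),
    ("User003", "TV", 332),
    ("User001", "Tablet", 312),
    ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240),
    ("User003", "TV", 418),
    ("User001", "Tablet", 328),
    ("User003", "Smartphone", 269),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(playbacks, devices=DEVICES):
    """One row per user, with one count and one total-time column per device."""
    def empty_row():
        # Start every column at zero so absent devices still appear.
        return {f"{prefix}_{device}": 0
                for prefix in ("NP", "TT") for device in devices}

    rows = defaultdict(empty_row)
    for user, device, duration in playbacks:
        rows[user][f"NP_{device}"] += 1         # number of playbacks
        rows[user][f"TT_{device}"] += duration  # total time
    return dict(rows)


pivoted = pivot_by_device(PLAYBACKS)
```

Starting each row with zeros for every device keeps the number of columns identical across rows, which a tabular dataset needs.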
Time Windows
Create new features using values over different periods of time.
[Diagram: tables of Instances (millions) by Features (thousands) at successive times t=1, t=2, t=3]
Updates
Need a current view of the data, but new data only comes in batches of changes.
[Diagram: batches of changes for day 1, day 2, day 3 applied to a table of Instances by Features]
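
One way to rebuild a current view from change batches is to fold them, in chronological order, into a dictionary keyed by record id. The sketch below assumes each changed row carries an `id` field and only the fields that changed; the field names and sample loans are hypothetical.

```python
def apply_batches(batches, key="id"):
    """Fold batches of changed rows into a current view (last write wins).

    `batches` must be in chronological order. Each row is a dict holding
    the record key plus whichever fields were added or changed.
    """
    current = {}
    for batch in batches:
        for row in batch:
            # Create the record on first sight, then overlay the new values.
            current.setdefault(row[key], {}).update(row)
    return list(current.values())


# Hypothetical daily batches of loan updates.
day1 = [{"id": 1, "status": "Current", "rate": 0.10},
        {"id": 2, "status": "Current", "rate": 0.20}]
day2 = [{"id": 1, "status": "Late"}]
day3 = [{"id": 3, "status": "Current", "rate": 0.15}]

view = apply_batches([day1, day2, day3])
```

Here record 1 ends up with the status from day 2 and the rate from day 1, while records 2 and 3 are carried through unchanged.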
Structuring Output
After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,), but other "separators" like semicolon (;), colon (:), or pipe (|) can also be used.
• Each row must contain the same number of fields, but they can be null.
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8.
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
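
Most of these rules do not need to be implemented by hand. As a small sketch, Python's standard `csv` module quotes fields and doubles embedded quotes on its own; the sample rows below are made up to exercise those cases.

```python
import csv
import io

rows = [
    ["Name", "Genre", "Plays"],        # optional header line
    ["Dance, dance", "Disco", None],   # embedded comma, missing (null) value
    ['The "wall"', "Reggae", 83],      # embedded double quotes
]

# Written to an in-memory buffer here. For a real file, something like
# open(path, "w", newline="", encoding="utf-8") would satisfy the UTF-8 rule.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # minimal quoting by default
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The field with a comma and the field with quotes come out quoted, the inner quotes are doubled, and the `None` value is written as an empty field, so every row keeps three fields.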
Holistic Approach
Feature Engineering
Feature Engineering
• Flatline
  • Domain Specific Language for data generation and filtering
  • Works with datasets -> datasets
  • Lots of built-in functions
    • Sliding windows
    • Date/Time parsing
  • Flatline Editor (in UI)
  • https://github.com/bigmlcom/flatline
Feature Engineering
• Feature Engineering of Numeric features:
  • Discretization (percentiles, within percentiles, groups)
  • Replacement
  • Normalization
  • Exponentiation, Logarithms, Squares, etc.
  • Shock
• Feature Engineering of Text features:
  • Misspellings
  • Length
  • Number of subordinate sentences
  • Language
  • Levenshtein distance
• Stacking:
  • Compute a field using non-linear combinations of other fields
Holistic Approach
Test & Automate
Test & Automate
• Test - Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don’t want to hand code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important
Tools
• Command Line?
  • join, cut, awk, sed, sort, uniq
• Automation
  • Shell, Python, etc.
  • Talend
  • BigML: bindings, bigmler, API, WhizzML
• Relational DB
  • MySQL
• Non-Relational DB
  • MongoDB
Prosper
[Diagram of loan states: Submit Bids, Cancelled, Withdraw, Funded, Expired, Defaulted, Paid, Current, Late]
Q: Which new loans make it to funded? (Classification)
Q: Which funded loans make it to paid? (Classification)
Q: If funded, what will be the rate? (Regression)
Prosper
Data Provided in XML updates!!
[Pipeline diagram labels: fetch.sh (“curl”, daily), XML, import.py, export.sh, bigml.sh, Model, Predict, Share in gallery; fields: Status, LoanStatus, BorrowerRate]
Prosper
Tidbits and Lessons Learned…
• XML… yuck!
• MongoDB has CSV export and is record-based, so it is easy to handle a changing data structure.
• Feature Engineering
  • There are 5 different classes of “bad” loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML… yuck!
Diabetes
Fix Missing Values in a “Meaningful” Way
[Workflow diagram labels: Original Dataset, Filter Zeros, Clean Dataset, Model insulin, Predict insulin, Amended Dataset, Select insulin, Fixed Dataset]
Stock Prices
Shock: Deviations from Trend
(Current - (4-day moving avg)) / std dev
(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

date  volume  price
1     34353   314
2     44455   315
3     22333   315
4     52322   321
5     28000   320
6     31254   319
7     56544   323
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
14    21453   302

Moving-average windows (prices of up to the 4 preceding days), starting from day 2:
314
314 315
314 315 315
314 315 315 321
315 315 321 320
315 321 320 319
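
For readers without Flatline at hand, the same shock calculation can be approximated in plain Python. This is a sketch under stated assumptions, not Flatline's exact behavior: it uses the population standard deviation of the whole price column, averages whatever preceding days exist when fewer than four are available, and returns `None` for the first row, which has no preceding days.

```python
from statistics import mean, pstdev

# Prices for days 1..14 from the table.
PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def shock(prices, window=4):
    """(current price - mean of up to `window` preceding prices) / std dev."""
    spread = pstdev(prices)  # assumption: population std dev over all rows
    scores = []
    for i, price in enumerate(prices):
        previous = prices[max(0, i - window):i]
        if not previous:
            scores.append(None)  # first row: no trend to compare against
        else:
            scores.append((price - mean(previous)) / spread)
    return scores


scores = shock(PRICES)
```

On this data the most negative score falls on day 9, where the price drops to 287 after four days averaging 321.5.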
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example
BigML, Inc 36Machine Learning-Ready Data
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example

Contenu connexe

Tendances

BSSML17 - Logistic Regressions
BSSML17 - Logistic RegressionsBSSML17 - Logistic Regressions
BSSML17 - Logistic RegressionsBigML, Inc
 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering BigML, Inc
 
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...BigML, Inc
 
BSSML17 - Basic Data Transformations
BSSML17 - Basic Data TransformationsBSSML17 - Basic Data Transformations
BSSML17 - Basic Data TransformationsBigML, Inc
 
BSSML17 - Ensembles
BSSML17 - EnsemblesBSSML17 - Ensembles
BSSML17 - EnsemblesBigML, Inc
 
VSSML16 L6. Feature Engineering
VSSML16 L6. Feature EngineeringVSSML16 L6. Feature Engineering
VSSML16 L6. Feature EngineeringBigML, Inc
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML, Inc
 
BSSML17 - Feature Engineering
BSSML17 - Feature EngineeringBSSML17 - Feature Engineering
BSSML17 - Feature EngineeringBigML, Inc
 
VSSML17 L6. Time Series and Deepnets
VSSML17 L6. Time Series and DeepnetsVSSML17 L6. Time Series and Deepnets
VSSML17 L6. Time Series and DeepnetsBigML, Inc
 
BSSML16 L7. Feature Engineering
BSSML16 L7. Feature EngineeringBSSML16 L7. Feature Engineering
BSSML16 L7. Feature EngineeringBigML, Inc
 
BSSML17 - Deepnets
BSSML17 - DeepnetsBSSML17 - Deepnets
BSSML17 - DeepnetsBigML, Inc
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBigML, Inc
 
BSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBigML, Inc
 
BSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBigML, Inc
 
VSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 SessionsVSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 SessionsBigML, Inc
 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature EngineeringBigML, Inc
 
BSSML17 - Time Series
BSSML17 - Time SeriesBSSML17 - Time Series
BSSML17 - Time SeriesBigML, Inc
 
VSSML18. Feature Engineering
VSSML18. Feature EngineeringVSSML18. Feature Engineering
VSSML18. Feature EngineeringBigML, Inc
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data TransformationsBigML, Inc
 
BigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML, Inc
 

Tendances (20)

BSSML17 - Logistic Regressions
BSSML17 - Logistic RegressionsBSSML17 - Logistic Regressions
BSSML17 - Logistic Regressions
 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering
 
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
 
BSSML17 - Basic Data Transformations
BSSML17 - Basic Data TransformationsBSSML17 - Basic Data Transformations
BSSML17 - Basic Data Transformations
 
BSSML17 - Ensembles
BSSML17 - EnsemblesBSSML17 - Ensembles
BSSML17 - Ensembles
 
VSSML16 L6. Feature Engineering
VSSML16 L6. Feature EngineeringVSSML16 L6. Feature Engineering
VSSML16 L6. Feature Engineering
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 Release
 
BSSML17 - Feature Engineering
BSSML17 - Feature EngineeringBSSML17 - Feature Engineering
BSSML17 - Feature Engineering
 
VSSML17 L6. Time Series and Deepnets
VSSML17 L6. Time Series and DeepnetsVSSML17 L6. Time Series and Deepnets
VSSML17 L6. Time Series and Deepnets
 
BSSML16 L7. Feature Engineering
BSSML16 L7. Feature EngineeringBSSML16 L7. Feature Engineering
BSSML16 L7. Feature Engineering
 
BSSML17 - Deepnets
BSSML17 - DeepnetsBSSML17 - Deepnets
BSSML17 - Deepnets
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
BSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic Workflows
 
BSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBSSML17 - API and WhizzML
BSSML17 - API and WhizzML
 
VSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 SessionsVSSML17 Review. Summary Day 2 Sessions
VSSML17 Review. Summary Day 2 Sessions
 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature Engineering
 
BSSML17 - Time Series
BSSML17 - Time SeriesBSSML17 - Time Series
BSSML17 - Time Series
 
VSSML18. Feature Engineering
VSSML18. Feature EngineeringVSSML18. Feature Engineering
VSSML18. Feature Engineering
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data Transformations
 
BigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with Flatline
 

Similaire à VSSML16 L5. Basic Data Transformations

DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingBigML, Inc
 
MLSEV. Automating Decision Making
MLSEV. Automating Decision MakingMLSEV. Automating Decision Making
MLSEV. Automating Decision MakingBigML, Inc
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
The Past, Present, and Future of Machine Learning APIs
The Past, Present, and Future of Machine Learning APIsThe Past, Present, and Future of Machine Learning APIs
The Past, Present, and Future of Machine Learning APIsBigML, Inc
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedLaurenz Wuttke
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksDatabricks
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Data Engineering Data warehousing Pentaho
Data Engineering  Data warehousing  PentahoData Engineering  Data warehousing  Pentaho
Data Engineering Data warehousing PentahoPraveenHegde20
 
SH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxSH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxMongoDB
 
SH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxSH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxMongoDB
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design PatternsMongoDB
 
Past, present and future of predictive APIs - Poul Petersen
Past, present and future of predictive APIs - Poul PetersenPast, present and future of predictive APIs - Poul Petersen
Past, present and future of predictive APIs - Poul PetersenPAPIs.io
 
Modeling computer networks by colored Petri nets
Modeling computer networks by colored Petri netsModeling computer networks by colored Petri nets
Modeling computer networks by colored Petri netsDmitryZaitsev5
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Codemotion
 
CSEG1001 Lecture 1 Introduction to Computers
CSEG1001 Lecture 1 Introduction to ComputersCSEG1001 Lecture 1 Introduction to Computers
CSEG1001 Lecture 1 Introduction to ComputersDhiviya Rose
 

Similaire à VSSML16 L5. Basic Data Transformations (20)

DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision Making
 
MLSEV. Automating Decision Making
MLSEV. Automating Decision MakingMLSEV. Automating Decision Making
MLSEV. Automating Decision Making
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
The Past, Present, and Future of Machine Learning APIs
The Past, Present, and Future of Machine Learning APIsThe Past, Present, and Future of Machine Learning APIs
The Past, Present, and Future of Machine Learning APIs
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons Learned
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Python ml
Python mlPython ml
Python ml
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Data Engineering Data warehousing Pentaho
Data Engineering  Data warehousing  PentahoData Engineering  Data warehousing  Pentaho
Data Engineering Data warehousing Pentaho
 
SH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxSH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptx
 
SH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptxSH 1 - SES 1 - advanced_schema_design.pptx
SH 1 - SES 1 - advanced_schema_design.pptx
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design Patterns
 
Past, present and future of predictive APIs - Poul Petersen
Past, present and future of predictive APIs - Poul PetersenPast, present and future of predictive APIs - Poul Petersen
Past, present and future of predictive APIs - Poul Petersen
 
Modeling computer networks by colored Petri nets
Modeling computer networks by colored Petri netsModeling computer networks by colored Petri nets
Modeling computer networks by colored Petri nets
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
 
CSEG1001 Lecture 1 Introduction to Computers
CSEG1001 Lecture 1 Introduction to ComputersCSEG1001 Lecture 1 Introduction to Computers
CSEG1001 Lecture 1 Introduction to Computers
 

Plus de BigML, Inc

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingBigML, Inc
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationBigML, Inc
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceBigML, Inc
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesBigML, Inc
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector BigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionBigML, Inc
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLBigML, Inc
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyBigML, Inc
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorBigML, Inc
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsBigML, Inc
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsBigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleBigML, Inc
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIBigML, Inc
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object DetectionBigML, Inc
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image ProcessingBigML, Inc
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureBigML, Inc
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorBigML, Inc
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotBigML, Inc
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...BigML, Inc
 

Plus de BigML, Inc (20)

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in Manufacturing
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML Compliance
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective Anomalies
 
DutchMLSchool 2022 - My First Anomaly Detector
VSSML16 L5. Basic Data Transformations

  • 2. Basic Transformations: Creating Machine Learning Ready Data. Poul Petersen, CIO, BigML, Inc
  • 3. BigML, Inc. Machine Learning-Ready Data: Basic Transformations. Q: How does a physicist milk a cow? A: Well, first let us consider a spherical cow... Q: How does a data scientist build a model? A: Well, first let us consider perfectly formatted data…
  • 4. The Dream: CSV -> Dataset -> Model -> Profit!
  • 5. The Reality: CRM Web Accounts Transactions. ML Ready? Is all hope lost? How do you even start?
  • 6. Holistic Approach
      • Define a clear idea of the goal.
      • Understand what ML tasks will achieve the goal.
      • Understand the data structure to perform those ML tasks.
      • Find out what kind of data you have and make it ML-Ready:
          • where is it, how is it stored?
          • what are the features?
          • can you access it programmatically?
      • Feature Engineering: transform the data you have into the data you actually need.
      • Evaluate: Try it on a small scale.
      • Accept that you might have to start over….
      • But when it works, automate it!!!!
  • 7. Holistic Approach: Define Goal & ML Task
  • 8. Understand ML Tasks (Goal -> ML Task)
      • Will this customer default on a loan? -> Classification
      • How many customers will apply for a loan next month? -> Regression
      • Is the consumption of this product unusual? -> Anomaly Detection
      • Is the behavior of the customers similar? -> Cluster Analysis
      • Are these products purchased together? -> Association Discovery
  • 9. Holistic Approach: Required Data Structure
  • 10. Classification. Diagram labels: Categorical; Training; Testing; Predicting
  • 11. Regression. Diagram labels: Numeric; Training; Testing; Predicting
  • 12. Anomaly Detection
  • 13. Cluster Analysis
  • 14. Association Discovery
  • 15. Holistic Approach: Make Your Data ML-Ready
  • 16. ML-Ready Data. Machine Learning Algorithms consume instances of the question that you want to model.
      Tabular Data (rows are Instances, columns are Fields, also called Features):
      • Each row is one of the instances.
      • Each column is a field that describes a property of the instance that is relevant to the question being modeled.
      • Fields can already be present in your data, be derived from your data, or be generated using other fields.
      !! Danger Ahead !!
  • 17. Cleansing: Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
      Original data:
      Name            Date        Duration (s)  Genre    Plays
      Highway star    1984-05-24  -             Rock     139
      Blues alive     1990/03/01  281           Blues    239
      Lonely planet   2002-11-19  5:32s         Techno   42
      Dance, dance    02/23/1983  312           Disco    N/A
      The wall        1943-01-20  218           Reagge   83
      Offside down    1965-02-19  4 minutes     Techno   895
      The alchemist   2001-11-21  418           Bluesss  178
      Bring me down   18-10-98    328           Classic  21
      The scarecrow   1994-10-12  269           Rock     734
      Cleaned data (the duration of "Highway star" and the plays of "Dance, dance" are left empty):
      Name            Date        Duration (s)  Genre    Plays
      Highway star    1984-05-24                Rock     139
      Blues alive     1990-03-01  281           Blues    239
      Lonely planet   2002-11-19  332           Techno   42
      Dance, dance    1983-02-23  312           Disco
      The wall        1943-01-20  218           Reagge   83
      Offside down    1965-02-19  240           Techno   895
      The alchemist   2001-11-21  418           Blues    178
      Bring me down   1998-10-18  328           Classic  21
      The scarecrow   1994-10-12  269           Rock     734
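The cleansing step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the date layouts and missing-value markers are the ones visible in the table, and `GENRE_FIXES` is a hypothetical lookup of known misspellings, not part of the original deck.

```python
import re
from datetime import datetime

# Assumed inputs: the date layouts and missing-value markers seen in the slide's table.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
MISSING = {"", "-", "n/a"}
GENRE_FIXES = {"Bluesss": "Blues"}  # hypothetical lookup of known misspellings


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, trying each known layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_duration(raw):
    """Return the duration in whole seconds, or None when it is missing."""
    text = raw.strip().lower()
    if text in MISSING:
        return None
    if text.isdigit():
        return int(text)
    match = re.fullmatch(r"(\d+):(\d{1,2})s?", text)  # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", text)  # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    raise ValueError(f"unrecognized duration: {raw!r}")


def clean_row(name, date, duration, genre, plays):
    """Homogenize one raw row; missing values become None."""
    plays_text = plays.strip()
    return (
        name.strip(),
        normalize_date(date),
        normalize_duration(duration),
        GENRE_FIXES.get(genre, genre),
        None if plays_text.lower() in MISSING else int(plays_text),
    )
```

Unknown layouts raise an error instead of being guessed, so new kinds of dirty input surface early rather than silently corrupting the dataset.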
  • 18. Denormalizing: Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single file/dataset. Diagram labels: users; artists; tracks; albums; join; Instances; Features; (millions)
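As a rough illustration of denormalizing, the sketch below joins three small tables into one flat result using Python's built-in `sqlite3` module. The schema, column names, and the artist and album values are invented for the example; only the idea of joining normalized tables into a single dataset comes from the slide.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with a hypothetical normalized schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE albums  (id INTEGER PRIMARY KEY, title TEXT, artist_id INTEGER);
        CREATE TABLE tracks  (id INTEGER PRIMARY KEY, title TEXT, album_id INTEGER);
        INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
        INSERT INTO albums  VALUES (10, 'Album X', 1), (11, 'Album Y', 2);
        INSERT INTO tracks  VALUES (100, 'Highway star', 10),
                                   (101, 'Blues alive', 11),
                                   (102, 'Lonely planet', 10);
    """)
    return conn


def denormalize(conn):
    """Join tracks, albums and artists so each row is one self-contained instance."""
    query = """
        SELECT tracks.title, albums.title, artists.name
        FROM tracks
        JOIN albums  ON albums.id = tracks.album_id
        JOIN artists ON artists.id = albums.artist_id
        ORDER BY tracks.id
    """
    return conn.execute(query).fetchall()


rows = denormalize(build_example_db())
```

Each resulting row repeats the album and artist values, which is the redundancy a single ML-Ready file is expected to carry.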
  • 19. Aggregating: When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
      Original data (list of playbacks):
      Content         Genre    Duration  Play Time            User     Device
      Highway star    Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive     Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet   Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance    Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall        Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down    Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist   Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down   Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow   Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data (list of users):
      User     Num.Playbacks  Total Time  Pref.Device
      User001  3              830         Tablet
      User002  1              218         Smartphone
      User003  3              1019        TV
      User005  2              521         Tablet
      Command shown on the slide: tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
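The aggregation on this slide can be reproduced with a short Python sketch. It assumes that Total Time is the sum of the Duration column and that Pref.Device is the most frequent device per user, which matches the numbers shown; the function and key names are illustrative.

```python
from collections import Counter, defaultdict

# (content, duration, user, device) taken from the slide's list of playbacks.
PLAYBACKS = [
    ("Highway star", 190, "User001", "TV"),
    ("Blues alive", 281, "User005", "Tablet"),
    ("Lonely planet", 332, "User003", "TV"),
    ("Dance, dance", 312, "User001", "Tablet"),
    ("The wall", 218, "User002", "Smartphone"),
    ("Offside down", 240, "User005", "Tablet"),
    ("The alchemist", 418, "User003", "TV"),
    ("Bring me down", 328, "User001", "Tablet"),
    ("The scarecrow", 269, "User003", "Smartphone"),
]


def aggregate_by_user(playbacks):
    """Turn a list of playbacks into one row per user."""
    durations = defaultdict(list)
    devices = defaultdict(Counter)
    for _content, duration, user, device in playbacks:
        durations[user].append(duration)
        devices[user][device] += 1
    return {
        user: {
            "num_playbacks": len(durations[user]),
            "total_time": sum(durations[user]),
            # Most frequent device; ties are not handled specially here.
            "pref_device": devices[user].most_common(1)[0][0],
        }
        for user in sorted(durations)
    }


users = aggregate_by_user(PLAYBACKS)
```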
  • 20. Pivoting: Different values of a feature are pivoted to new columns in the result dataset.
      Original data:
      Content         Genre    Duration  Play Time            User     Device
      Highway star    Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive     Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet   Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance    Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall        Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down    Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist   Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down   Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow   Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data with pivoted columns:
      User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
      User001  3              830         Tablet       1      2          0              190    640        0
      User002  1              218         Smartphone   0      0          1              0      0          218
      User003  3              1019        TV           2      0          1              750    0          269
      User005  2              521         Tablet       0      2          0              0      521        0
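A minimal Python sketch of the pivot follows. It assumes the NP_ columns count playbacks per device and the TT_ columns sum the durations per device, which is consistent with the slide's numbers; the function name is illustrative.

```python
# (user, device, duration) taken from the slide's original data.
PLAYBACKS = [
    ("User001", "TV", 190),
    ("User005", "Tablet", 281),
    ("User003", "TV", 332),
    ("User001", "Tablet", 312),
    ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240),
    ("User003", "TV", 418),
    ("User001", "Tablet", 328),
    ("User003", "Smartphone", 269),
]


def pivot_by_device(playbacks):
    """Create one NP_<device> and one TT_<device> column per distinct device."""
    devices = sorted({device for _user, device, _duration in playbacks})
    table = {}
    for user, device, duration in playbacks:
        # Every user starts with a zero in every pivoted column.
        row = table.setdefault(
            user,
            {f"{prefix}_{d}": 0 for d in devices for prefix in ("NP", "TT")},
        )
        row[f"NP_{device}"] += 1
        row[f"TT_{device}"] += duration
    return table


pivoted = pivot_by_device(PLAYBACKS)
```

Initializing every column to zero keeps all rows the same width, which the later CSV export requires.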
  • 21. Time Windows: Create new features using values over different periods of time. Diagram labels: Instances (millions); Features (thousands); Time; t=1, t=2, t=3
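One way to read the time-window idea is to total a value per user over several look-back periods. The Python sketch below does that under assumptions of its own: the window lengths, the feature names, and the `as_of` reference date are all hypothetical choices, not taken from the deck.

```python
from datetime import date


def window_features(events, as_of, windows=(1, 3, 7)):
    """Sum each user's values over the last N days, for every N in windows.

    events is an iterable of (user, day, value) with day as a datetime.date.
    """
    features = {}
    for user, day, value in events:
        age = (as_of - day).days
        if age < 0:
            continue  # ignore events that happen after the reference date
        row = features.setdefault(user, {f"total_last_{w}d": 0 for w in windows})
        for w in windows:
            if age < w:
                row[f"total_last_{w}d"] += value
    return features


# Illustrative events: one user's playbacks (date and duration in seconds).
EVENTS = [
    ("User001", date(2015, 5, 12), 190),
    ("User001", date(2015, 5, 13), 312),
    ("User001", date(2015, 5, 15), 328),
]
features = window_features(EVENTS, as_of=date(2015, 5, 15))
```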
  • 22. Updates: Need a current view of the data, but new data only comes in batches of changes. Diagram labels: day 1; day 2; day 3; Instances; Features
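A current view can be rebuilt by replaying the batches of changes in order over the last snapshot. The sketch below is one possible approach, assuming each change record carries a hypothetical `id` key and an optional hypothetical `deleted` flag; the deck does not specify a record format.

```python
def apply_batches(snapshot, batches):
    """Replay batches of changes, oldest first, over a snapshot keyed by id."""
    current = dict(snapshot)
    for batch in batches:
        for record in batch:
            key = record["id"]
            if record.get("deleted"):
                current.pop(key, None)
            else:
                # Later values override earlier ones field by field.
                current[key] = {**current.get(key, {}), **record}
    return current


day1 = {1: {"id": 1, "status": "Current"}}
changes = [
    [{"id": 1, "status": "Late"}, {"id": 2, "status": "Current"}],  # day 2
    [{"id": 2, "deleted": True}],                                   # day 3
]
view = apply_batches(day1, changes)
```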
  • 23. Structuring Output: After all the data transformations, a CSV (“Comma-Separated Values”) file has to be generated, following the rules below:
      • A CSV file uses plain text to store tabular data.
      • In a CSV file, each row of the file is an instance.
      • Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
      • Each row must contain the same number of fields, but they can be null.
      • Fields can be quoted using double quotes (").
      • Fields that contain commas or line separators must be quoted.
      • Quotes (") in fields must be doubled ("").
      • The character encoding must be UTF-8.
      • Optionally, a CSV file can use the first line as a header to provide the names of each field.
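Python's standard `csv` module already follows these quoting rules, so a small wrapper is enough to produce a conforming file. The sketch below is illustrative: the function name and the sample rows are invented, and the row-width check is an added safeguard rather than something the slide prescribes.

```python
import csv
import io


def to_csv_bytes(header, rows):
    """Serialize rows as UTF-8 CSV with a header line and uniform row width."""
    width = len(header)
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writerow(header)
    for row in rows:
        if len(row) != width:
            raise ValueError(f"expected {width} fields, got {len(row)}: {row!r}")
        writer.writerow(row)  # None is written as an empty (null) field
    return buffer.getvalue().encode("utf-8")


data = to_csv_bytes(
    ["name", "note", "plays"],
    [["Dance, dance", 'He said "hi"', None], ["The wall", "", 83]],
)
```

The field containing a comma is quoted, the embedded quotes are doubled, and the missing value is left empty while the row keeps its three fields.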
  • 24. Holistic Approach: Feature Engineering
  • 25. Feature Engineering
      • Flatline
          • Domain Specific Language for data generation and filtering
          • Works with datasets -> datasets
          • Lots of built-in functions
          • Sliding windows
          • Date/Time parsing
      • Flatline Editor (in UI)
      • https://github.com/bigmlcom/flatline
  • 26. Feature Engineering
      • Feature Engineering of Numeric features:
          • Discretization (percentiles, within percentiles, groups)
          • Replacement
          • Normalization
          • Exponentiation, Logarithms, Squares, etc.
          • Shock
      • Feature Engineering of Text features:
          • Misspellings
          • Length
          • Number of subordinate sentences
          • Language
          • Levenshtein distance
      • Stacking:
          • Compute a field using non-linear combinations of other fields
  • 27. Holistic Approach: Test & Automate
  • 28. Test & Automate
      • Test - Evaluate
          • Did you meet the goal?
          • If not, did you discover something else useful?
          • If not, start over
          • If you did…
      • Automate - You don’t want to hand code that every time, right?
          • Consider tools that are easy to automate
              • scripting interface
              • APIs
          • The ability to maintain it is important
  • 29. Tools
      • Command Line?
          • join, cut, awk, sed, sort, uniq
      • Automation
          • Shell, Python, etc
          • Talend
          • BigML: bindings, bigmler, API, whizzml
      • Relational DB
          • MySQL
      • Non-Relational DB
          • MongoDB
  • 30. Prosper. Diagram labels (loan states): Submit; Bids; Cancelled; Withdraw; Funded; Expired; Defaulted; Paid; Current; Late
      • Q: Which new loans make it to funded? (Classification)
      • Q: Which funded loans make it to paid? (Classification)
      • Q: If funded, what will be the rate? (Regression)
  • 31. Prosper: Data Provided in XML updates!! Diagram labels: fetch.sh; “curl” daily; export.sh; import.py; XML; bigml.sh; Model; Predict; Share in gallery; Status; LoanStatus; BorrowerRate
  • 32. Prosper: Tidbits and Lessons Learned….
      • XML… yuck!
          • MongoDB has CSV export and is record based so it is easy to handle changing data structure.
      • Feature Engineering
          • There are 5 different classes of “bad” loans
          • Date cleanup
          • Type casting: floats and ints
      • Would be better to track over time
          • number of late payments
          • compare predictions and actuals
      • XML… yuck!
  • 33. Diabetes: Fix Missing Values in a “Meaningful” Way. Diagram labels: Filter Zeros; Model insulin; Predict insulin; Select insulin; Fixed Dataset; Amended Dataset; Original Dataset; Clean Dataset
  • 34. Stock Prices. Shock: Deviations from Trend
      Flatline expression: (/ (- (f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))
      That is: (Current - (4-day moving avg)) / std dev
      date  volume  price
      1     34353   314
      2     44455   315
      3     22333   315
      4     52322   321
      5     28000   320
      6     31254   319
      7     56544   323
      8     44331   324
      9     81111   287
      10    65422   294
      11    59999   300
      12    45556   302
      13    19899   301
      14    21453   302
      Windows of preceding prices shown on the slide: (314); (314, 315); (314, 315, 315); (314, 315, 315, 321); (315, 315, 321, 320); (315, 321, 320, 319)
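The shock feature can be approximated outside Flatline with a few lines of Python. This sketch makes two assumptions that the slide does not settle: the standard deviation is the population standard deviation of the whole price column, and rows with fewer than four preceding prices use whatever preceding prices exist (the first row gets no score). It is not a statement of how Flatline's `avg-window` or `standard-deviation` behave internally.

```python
import statistics

# Prices for dates 1 to 14, taken from the slide.
PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def shock_scores(prices, window=4):
    """Deviation of each price from the mean of the preceding prices, in std devs."""
    spread = statistics.pstdev(prices)  # assumption: population std dev of the column
    scores = []
    for i, price in enumerate(prices):
        previous = prices[max(0, i - window):i]  # up to `window` preceding prices
        if not previous:
            scores.append(None)  # the first row has no trend to deviate from
            continue
        scores.append((price - statistics.mean(previous)) / spread)
    return scores


scores = shock_scores(PRICES)
```

With the slide's data, the largest negative score falls on date 9, where the price drops from 324 to 287 against a preceding four-day average of 321.5.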
  • 35. Talend: Denormalization Example. https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
  • 36. Talend: Denormalization Example. https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/