Feature Hashing for
Scalable Machine Learning
Nick Pentreath
IBM
About
• About me
– @MLnick
– Principal Engineer at IBM working on machine
learning & Apache Spark
– Apache Spark PMC
– Author of Machine Learning with Spark
Agenda
• Intro to feature hashing
• HashingTF in Spark ML
• FeatureHasher in Spark ML
• Experiments
• Future Work
Intro to Feature Hashing
Encoding Features
• Most ML algorithms
operate on numeric
feature vectors
• Features are often
categorical – even
numerical features (e.g.
binning continuous
features)
(Figure: example numeric feature vector [2.1, 3.2, −0.2, 0.7])
Encoding Features
• “one-hot” encoding is
popular for categorical
features
• “bag of words” is
popular for text (or token
counts more generally)
(Figure: one-hot vector [0, 0, 1, 0, 0]; bag-of-words count vector [0, 2, 0, 1, 1])
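As a minimal sketch of the two encodings above, assuming a small fixed vocabulary (the city names and words below are made up for illustration and are not from the talk):

```python
from collections import Counter

def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if v == value else 0 for v in vocabulary]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabularies, chosen so the output matches the slide's vectors.
cities = ["london", "paris", "boston", "tokyo", "cairo"]
print(one_hot("boston", cities))  # [0, 0, 1, 0, 0]

vocab = ["a", "spark", "the", "hash", "trick"]
print(bag_of_words(["spark", "hash", "spark", "trick"], vocab))  # [0, 2, 0, 1, 1]
```

Both encodings need the full vocabulary up front, which is the mapping that feature hashing avoids storing.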
High Dimensional Features
• Many domains have very high-dimensional dense features
(e.g. images, video)
• Here we’re concerned with sparse feature domains, e.g.
online ads, ecommerce, social networks, video sharing,
text & NLP
• Model sizes can be very large even for simple models
The “Hashing Trick”
• Use a hash function to map feature values to
indices in the feature vector
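• Categorical example: the value “Boston” is hashed as “city=boston”;
the hash value modulo the vector size gives the feature index,
which is set to 1 (e.g. 0 0 1 0 0)
• Numeric example: the feature “Stock Price” is hashed as “stock_price”;
the hash value modulo the vector size gives the feature index,
which holds the feature value (e.g. 0 2.60 … 0 0)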
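The two examples on this slide can be sketched in a few lines. This is an illustration only: `zlib.crc32` stands in for whatever hash function a real implementation uses, and the vector size of 5 is copied from the slide's picture rather than being a realistic setting.

```python
import zlib

NUM_FEATURES = 5  # tiny vector size, as in the slide's illustration

def feature_index(key: str, num_features: int = NUM_FEATURES) -> int:
    """Hash the feature key, then take the hash modulo the vector size."""
    return zlib.crc32(key.encode("utf-8")) % num_features

vector = [0.0] * NUM_FEATURES
# Categorical feature: hash "name=value" and set an indicator of 1.
vector[feature_index("city=boston")] += 1.0
# Numeric feature: hash the name and store the value itself.
vector[feature_index("stock_price")] += 2.60
print(vector)
```

With so few slots the two keys may land on the same index, in which case their values simply add up; that is a hash collision.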
Feature Hashing: Pros
• Fast & Simple
• Preserves sparsity
• Memory efficient
– Limits feature vector size
– No need to store mapping feature name -> index
• Online learning
• Easy handling of missing data
• Feature engineering
Feature Hashing: Cons
• No inverse mapping => cannot go from feature
indices back to feature names
– Interpretability & feature importances
– But similar issues with other dim reduction techniques
(e.g. random projections, PCA, SVD)
• Hash collisions …
– Impact on accuracy of feature collisions
– Can use signed hash functions to alleviate part of it
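A signed hash adds a second hash that picks a sign of +1 or −1 for each feature, so that colliding features can partly cancel instead of always accumulating. The sketch below is one possible way to do this, assuming `zlib.crc32` with two different prefixes as the pair of hash functions; the slide only names the idea and does not prescribe an implementation.

```python
import zlib

def signed_slot(key: str, num_features: int):
    """Return (index, sign) for a feature key using two derived hashes."""
    data = key.encode("utf-8")
    index = zlib.crc32(b"index:" + data) % num_features
    sign = 1.0 if zlib.crc32(b"sign:" + data) % 2 == 0 else -1.0
    return index, sign

def encode_signed(keys, num_features: int):
    """Build a dense vector where each key contributes +1 or -1 at its index."""
    vector = [0.0] * num_features
    for key in keys:
        index, sign = signed_slot(key, num_features)
        vector[index] += sign
    return vector

keys = ["city=boston", "city=london", "device=mobile", "browser=firefox"]
print(encode_signed(keys, 4))
```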
HashingTF in Spark ML
HashingTF Transformer
• Transforms text (sentences) -> term frequency
vectors (aka “bag of words”)
• Uses the “hashing trick” to compute the feature
indices
• Feature value is term frequency (token count)
• Optional parameter to only return binary token
occurrence vector
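The behaviour described above can be imitated in plain Python. This is a sketch of the idea, not Spark's code: `hashing_tf`, its parameter names, the default size and the use of `zlib.crc32` are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

def hashing_tf(tokens, num_features=1 << 18, binary=False):
    """Map tokens to a sparse {index: value} term-frequency vector.

    With binary=True every present index gets 1.0 instead of a count.
    """
    counts = defaultdict(float)
    for token in tokens:
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        counts[idx] = 1.0 if binary else counts[idx] + 1.0
    return dict(sorted(counts.items()))

sentence = "the quick brown fox jumps over the lazy dog the end".split()
print(hashing_tf(sentence, num_features=32))
print(hashing_tf(sentence, num_features=32, binary=True))
```

The sparse dictionary holds only the non-zero entries, which is how hashing preserves sparsity.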
HashingTF Transformer
Hacking HashingTF
• HashingTF can be used for
categorical features…
• … but doesn’t fit neatly into
Pipelines
FeatureHasher in Spark ML
FeatureHasher
• Flexible, scalable feature encoding using
hashing trick
• Support multiple input columns (numeric or
string, i.e. categorical)
• One-shot feature encoder
• Core logic similar to HashingTF
FeatureHasher
• Operates on entire
Row
• Determining feature
index
– Numeric: feature
name
– String:
“feature=value”
• String encoding =>
effectively “one hot”
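The indexing rules on this slide can be sketched for a single row as follows. The function `hash_row`, the dict-based row and the `zlib.crc32` hash are illustrative assumptions, and skipping `None` values is this sketch's own choice for handling missing data.

```python
import zlib

def hash_row(row: dict, num_features: int) -> dict:
    """Encode one row of {column: value} as a sparse {index: value} vector.

    Numeric columns are hashed by column name and keep their value.
    String columns are hashed as "column=value" with value 1.0, which
    is effectively a one-hot encoding.
    """
    vector = {}
    for name, value in row.items():
        if value is None:  # missing value: contributes nothing
            continue
        if isinstance(value, str):
            key, weight = f"{name}={value}", 1.0
        else:
            key, weight = name, float(value)
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        vector[idx] = vector.get(idx, 0.0) + weight  # collisions add up
    return vector

row = {"city": "boston", "stock_price": 2.60, "country": None}
print(hash_row(row, 1 << 10))
```

All columns go through the same function in one pass, so no per-column index has to be fitted beforehand.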
FeatureHasher
Experiments
Text Classification
• Kaggle Email Spam Dataset
(Figure: AUC by hash bits, HashingTF vs CountVectorizer; x-axis: hash bits, 10 to 18; y-axis: AUC, 0.955 to 0.99)
Text Classification
• Adding regularization (regParam=0.01)
(Figure: AUC by hash bits, HashingTF vs CountVectorizer; x-axis: hash bits, 10 to 18; y-axis: AUC, 0.97 to 0.995)
Ad Click Prediction
• Criteo Display Advertising Challenge
– 45m examples, 34m features, 0.000003% sparsity
• Outbrain Click Prediction
– 80m examples, 15m features, 0.000007% sparsity
• Criteo Terabyte Log Data
– 7 day subset
– 1.5b examples, 300m features, 0.0000003% sparsity
Data
• Illustrative characteristics - Criteo DAC
(Figure: unique values per feature, in millions, 0 to 12; feature occurrence (%), 0% to 100%)
Challenges
Raw Data -> StringIndexer -> OneHotEncoder -> VectorAssembler -> OOM!
• Typical one-hot encoding pipeline failed consistently
Results
• Compare AUC for different # hash bits
(Figure: AUC by hash bits, 18 to 24. Outbrain panel: AUC axis 0.698 to 0.718. Criteo DAC panel: AUC axis 0.74 to 0.752)
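One way to get a feel for why more hash bits can help is to estimate how many distinct slots the original features occupy. The sketch below assumes an ideal, uniformly distributed hash, which real hash functions only approximate, and uses the feature counts quoted earlier for Outbrain (15m) and Criteo DAC (34m). It says nothing about AUC itself.

```python
import math

def expected_occupied(num_keys: int, num_buckets: int) -> float:
    """Expected number of non-empty buckets when num_keys distinct keys
    are hashed uniformly into num_buckets slots (num_buckets > 1)."""
    # m * (1 - (1 - 1/m)^n), computed in log space for numerical stability
    return num_buckets * (1.0 - math.exp(num_keys * math.log1p(-1.0 / num_buckets)))

for name, n in (("Outbrain", 15_000_000), ("Criteo DAC", 34_000_000)):
    for bits in (18, 20, 22, 24):
        m = 1 << bits
        share = expected_occupied(n, m) / n
        print(f"{name:10s} {bits} bits: about {share:.1%} distinct slots per original feature")
```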
Results
• Criteo 1T logs – 7 day subset
• Can train model on 1.5b examples
• 300m original features for this subset
• 2^24 hashed features (16m)
• Impossible with current Spark ML (OOM, 2GB
broadcast limit)
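The sizes involved are easy to check. The sketch below assumes one 8-byte double per feature for a single dense coefficient vector, which is a simplification; the slide does not say which data structure hit the 2GB limit.

```python
BYTES_PER_DOUBLE = 8  # assumption: one double-precision weight per feature

def dense_weight_bytes(num_features: int) -> int:
    """Bytes needed for one dense vector of double-precision weights."""
    return num_features * BYTES_PER_DOUBLE

for bits in (18, 20, 22, 24):
    n = 1 << bits
    print(f"{bits} bits -> {n:>10,} features, {dense_weight_bytes(n) / 2**20:4.0f} MiB")

# The 300m original features of the Criteo terabyte subset, for comparison.
print(f"300m features -> {dense_weight_bytes(300_000_000) / 2**30:.2f} GiB")
```

Under this assumption, 2^24 = 16,777,216 hashed features need 128 MiB, while 300m unhashed features need more than 2 GiB.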
Summary & Future Work
Summary
• Feature hashing is a fast, efficient, flexible tool for
feature encoding
• Can scale to high-dimensional sparse data, without
giving up much accuracy
• Supports multi-column “one-shot” encoding
• Avoids common issues with Spark ML Pipelines
using StringIndexer & OneHotEncoder at scale
Future Directions
• Include in Spark ML
– Watch SPARK-13969 for details
– Comments welcome!
• Signed hash functions
• Internal feature crossing & namespaces (à la Vowpal
Wabbit)
• DictVectorizer-like transformer => one-pass feature
encoder for multiple numeric & categorical columns
(with inverse mapping)
References
• Hash Kernels
• Feature Hashing for Large Scale Multitask
Learning
• Vowpal Wabbit
• Scikit-learn
Thank You.
@MLnick
spark.tc

Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick Pentreath

  • 1. Feature Hashing for Scalable Machine Learning Nick Pentreath IBM
  • 2. About • About me – @MLnick – Principal Engineer at IBM working on machine learning & Apache Spark – Apache Spark PMC – Author of Machine Learning with Spark
  • 3. Agenda • Intro to feature hashing • HashingTF in Spark ML • FeatureHasher in Spark ML • Experiments • Future Work
  • 5. Encoding Features • Most ML algorithms operate on numeric feature vectors • Features are often categorical – even numerical features (e.g. binning continuous features) 2.1 3.2 −0.2 0.7
  • 6. Encoding Features • “one-hot” encoding is popular for categorical features • “bag of words” is popular for text (or token counts more generally) 0 0 1 0 0 0 2 0 1 1
  • 7. High Dimensional Features • Many domains have very high dense feature dimension (e.g. images, video) • Here we’re concerned with sparse feature domains, e.g. online ads, ecommerce, social networks, video sharing, text & NLP • Model sizes can be very large even for simple models
  • 8. The “Hashing Trick” • Use a hash function to map feature values to indices in the feature vector Boston Hash “city=boston” 0 0 1 0 0 Modulo hash value to vector size to get index of feature Stock Price Hash “stock_price” 0 2.60 … 0 0 Modulo hash value to vector size to get index of feature
  • 9. Feature Hashing: Pros • Fast & Simple • Preserves sparsity • Memory efficient – Limits feature vector size – No need to store mapping feature name -> index • Online learning • Easy handling of missing data • Feature engineering
  • 10. Feature Hashing: Cons • No inverse mapping => cannot go from feature indices back to feature names – Interpretability & feature importances – But similar issues with other dim reduction techniques (e.g. random projections, PCA, SVD) • Hash collisions … – Impact on accuracy of feature collisions – Can use signed hash functions to alleviate part of it
  • 12. HashingTF Transformer • Transforms text (sentences) -> term frequency vectors (aka “bag of words”) • Uses the “hashing trick” to compute the feature indices • Feature value is term frequency (token count) • Optional parameter to only return binary token occurrence vector
  • 14. Hacking HashingTF • HashingTF can be used for categorical features… • … but doesn’t fit neatly into Pipelines
  • 16. FeatureHasher • Flexible, scalable feature encoding using hashing trick • Support multiple input columns (numeric or string, i.e. categorical) • One-shot feature encoder • Core logic similar to HashingTF
  • 17. FeatureHasher • Operates on entire Row • Determining feature index – Numeric: feature name – String: “feature=value” • String encoding => effectively “one hot”
  • 20. Text Classification • Kaggle Email Spam Dataset • [Figure: AUC by hash bits (10 to 18), HashingTF vs. CountVectorizer]
  • 21. Text Classification • Adding regularization (regParam=0.01) • [Figure: AUC by hash bits (10 to 18), HashingTF vs. CountVectorizer]
  • 22. Ad Click Prediction • Criteo Display Advertising Challenge – 45m examples, 34m features, 0.000003% sparsity • Outbrain Click Prediction – 80m examples, 15m features, 0.000007% sparsity • Criteo Terabyte Log Data – 7-day subset – 1.5b examples, 300m features, 0.0000003% sparsity
  • 23. Data • Illustrative characteristics - Criteo DAC • [Figure: Unique Values per Feature (millions) and Feature Occurrence (%)]
  • 24. Challenges • Raw Data -> StringIndexer -> OneHotEncoder -> VectorAssembler -> OOM! • Typical one-hot encoding pipeline failed consistently
  • 25. Results • Compare AUC for different # hash bits • [Figure: Outbrain AUC and Criteo DAC AUC by hash bits (18 to 24)]
  • 26. Results • Criteo 1T logs – 7-day subset • Can train model on 1.5b examples • 300m original features for this subset • 2^24 hashed features (16m) • Impossible with current Spark ML (OOM, 2GB broadcast limit)
  • 28. Summary • Feature hashing is a fast, efficient, flexible tool for feature encoding • Can scale to high-dimensional sparse data, without giving up much accuracy • Supports multi-column “one-shot” encoding • Avoids common issues with Spark ML Pipelines using StringIndexer & OneHotEncoder at scale
  • 29. Future Directions • Include in Spark ML – Watch SPARK-13969 for details – Comments welcome! • Signed hash functions • Internal feature crossing & namespaces (à la Vowpal Wabbit) • DictVectorizer-like transformer => one-pass feature encoder for multiple numeric & categorical columns (with inverse mapping)
  • 30. References • Hash Kernels • Feature Hashing for Large Scale Multitask Learning • Vowpal Wabbit • Scikit-learn
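
The index rule from slides 8 and 17 (hash the feature name for numeric columns, hash “feature=value” for string columns, then take the hash modulo the vector size) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark's FeatureHasher: the names `hash_index` and `hash_features` are hypothetical, MD5 from the standard library stands in for MurmurHash3, and skipping `None` values is one assumed way to handle missing data.

```python
import hashlib


def hash_index(key: str, num_features: int) -> int:
    """Map a string key to an index in [0, num_features).

    Assumption: MD5 is used only because it is stable across runs;
    the talk mentions MurmurHash3 as the commonly used hash.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(row: dict, num_bits: int = 18) -> dict:
    """Encode a row of mixed columns as a sparse {index: value} vector."""
    num_features = 1 << num_bits  # vector size is 2^num_bits
    vec = {}
    for name, value in row.items():
        if value is None:
            continue  # assumption: a missing value contributes nothing
        if isinstance(value, str):
            # String column: hash "feature=value", value 1.0 (one-hot style)
            key, weight = f"{name}={value}", 1.0
        else:
            # Numeric column: hash the feature name, keep the numeric value
            key, weight = name, float(value)
        idx = hash_index(key, num_features)
        # Colliding features are summed into the same slot
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"city": "boston", "stock_price": 2.60, "referrer": None}
vec = hash_features(row, num_bits=18)
```

No mapping from feature name to index is stored anywhere, which is the memory saving listed on slide 9. With `num_bits=24` the vector size is `1 << 24 = 16,777,216`, the roughly 16m hashed features quoted on slide 26.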

Editor's notes

  1. ML models require numeric features. Most features are not neatly encoded as numbers. Need to encode categorical features, e.g. categories, tags. Many numerical features are not useful in “raw form”, e.g. geo-location and time-of-day are more useful as features if binned or transformed into “indicator” variables. Need to transform text features into numeric vector representations, e.g. search keywords.
  2. One-hot encoding. Bag of words.
  3. Linear models in large sparse feature domains can have tens to hundreds of millions, even billions, of features. Features are often high cardinality: user and product ids, geo-locations, tags, search keywords. Feature interactions are commonly used, exploding the feature dimension even further. More complex models such as Factorization Machines still multiply the model size significantly.
  4. Hash function needs to spread feature indices evenly and minimise hash collisions. MurmurHash3 is commonly used. Similar to the “kernel trick”: here the “trick” is a fixed-size feature vector (often << feature dimension) => dimensionality reduction.
  5. Signed hash functions => unbiased estimate (collisions will tend to cancel each other out in aggregate)
  6. Transforms text (sentences) into term frequency vectors. Same as CountVectorizer except that feature locations in the resulting vector use hashed indices.
  7. Typical usage example
  8. “Stringify” workaround. Doesn’t fit into pipelines. Only works for categorical features.
  9. Small dataset: 2500 examples. Email text content. After tokenization, feature dim = 56k. Compare CountVectorizer on full vocabulary with HashingTF with varying hash bits. At lower dimensions AUC is actually higher! In sparse domains, feature hashing could be playing a regularization role?
  10. With regularization, CountVectorizer performs better. Performance of HashingTF is also better at higher hash bit counts.
  11. A few features dominate model size (e.g. probably user, ad, query ids etc.). A few “hot features” occur in many examples. Tails off quickly: power-law data is quite typical of interaction or user event data, e.g. search, recommendation, ads, social networks.
  12. OneHotEncoder OOM => ML Attribute metadata size?
  13. Spark ML issues with (a) high cardinality features; and (b) wide datasets
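
The signed hash function mentioned in note 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the talk or from Spark: the names `index_and_sign` and `signed_hash_counts` are hypothetical, MD5 stands in for MurmurHash3, and taking the sign from the lowest hash bit is one possible choice. Each token adds +1 or -1 to its slot, so colliding tokens tend to cancel in aggregate instead of always adding up.

```python
import hashlib


def index_and_sign(key: str, num_features: int):
    """Return (index, sign) for a key, both derived from one hash value.

    Assumption: the lowest bit picks the sign and the remaining bits
    pick the index, so the two are chosen from different bits.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    sign = 1.0 if (h & 1) == 0 else -1.0
    index = (h >> 1) % num_features
    return index, sign


def signed_hash_counts(tokens, num_features: int):
    """Signed term-frequency vector: each token occurrence adds its sign."""
    vec = [0.0] * num_features
    for tok in tokens:
        idx, sign = index_and_sign(tok, num_features)
        vec[idx] += sign
    return vec


tokens = ["spark", "hashing", "spark", "trick"]
vec = signed_hash_counts(tokens, num_features=8)
```

A token always gets the same sign, so repeated occurrences still accumulate; only distinct tokens that share a slot can offset each other.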