SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Using Embeddings to Understand 

the Variance and Evolution of
Data Science Skill Sets
Maryam Jahanshahi Ph.D.
Research Scientist
TapRecruit
http://bit.ly/pydatanyc-emb
Maryam Jahanshahi Ph.D.
Research Scientist
TapRecruit
http://bit.ly/pydatanyc-emb
Using Embeddings to Understand 

the Variance and Evolution of
Data Science Skill Sets
TapRecruit uses NLP to understand and 

organize natural language career content
Smart Editor 

for Job Descriptions
Active Pipeline 

Health Monitoring
Multifaceted Salary 

Estimation
Language matters in job descriptions
Same title,
Different job
Finance Manager
Kraft Foods
Finance Manager
Kraft Foods
Same Title
Junior (3 Years)
No Managerial Experience
Senior (6-8 Years)
Division Level Controller
Strategic Finance Role
MBA / CPA
Required Experience
Required Responsibility
Preferred Skill
Required Education
Different title,
Same job
Performance 

Marketing Manager
PocketGems
Senior Analyst,
Customer Strategy
The Gap
Mid-Level
Quantitative Focus
Expertise iBanking
Data Analysis Tools (SQL)
Consulting Experience
Preferred
MBA Preferred
Mid-Level
Quantitative Focus
Expertise iBanking
Relational Database Experience
External Consulting Experience
Preferred
BA degree in business, finance, MBA
Preferred
Required Experience
Required Skills
Required Experience
Required Skills
Preferred Experience
Required and 

Preferred Education
How have data science skills
changed over time?
How have data science skills changed over time?
Manual Feature Extraction Dynamic Topic Modeling
Adapted from Blei and Lafferty, ICML 2006.
1880
force
energy
motion
differ
light
1960
radiat
energy
electron
measure
ray
2000
state
energy
electron
magnet
field
1920
atom
theory
electron
energy
measure
Matter
Electron
Quantum
MBA
PhD
SQL
Tableau
PowerBIPython
Word embeddings capture semantic similarities
Proficiency programming in Python, Java or C++.
Word ContextContext
Experience in Python, Java or other object-oriented programming
languages
WordContext Context
Statistical modeling through software (e.g. SPSS) or programming language (e.g.
Python)
WordContext
French
German
Japanese
Esperanto
Language
Python
Programming
C++
Java
Object-
orientated
Embeddings capture entity relationships
Adapted from Stanford NLP GLoVE Project
Woman :: Queen as Man :: ?
Man
Woman
Queen
King
Hierarchies
McAdamCola
o
VodafoneVerizon
Viacom
Dauman
Exxon Tillerson
Wal-Mart McMillon
Comparatives 

and Superlatives
SlowestSlower
Slow Shortest
Shorter
Strongest
Stronge
r
Strong
Short
Pretrained embeddings facilitate fast prototyping
Final Application
Corpus Twitter Common Crawl GoogleNews Wikipedia
Tokens 27 B 42-840 B 100 B 6 B
Vocabulary Size 1.2 M 1.9-2.2 M 3 M 400 k
Algorithm GLoVE GLoVE word2vec GLoVE
Vector Length 25 - 200 d 300 d 300 d 50 - 300 d
Corpus Generation
Corpus Processing
Language Model
Generation
Language Model
Tuning
Problems with pretrained embedding models
Casing
Abbreviations vs Words
e.g. IT vs it
Out of Vocabulary Words Domain Specific Words & Acronyms
Polysemy
Words with multiple meanings
e.g. drive (a car) vs drive (results)
e.g. Chef (the job) vs Chef (the language)
Multi-word Expressions
Phrases that have new meanings
e.g. Front-end vs front + end
Tools for developing custom language models
Tokenization, POS tagging, Sentence
Segmentation, Dependency Parsing
Corpus Processing
CoreNLP SyntaxNet
Different word embedding models
(GLoVE, word2vec, fastText)
Language Modeling
Windows capture semantic similarity vs relatedness
Captures Semantic similarity, Substitutes
and Word-level differences
Small Window Size
Captures Semantic relatedness,
Alternatives and Domain-level differences
Large Window Size
Python
Java
Programming
C++
Language
French
German
Japanese
Esperanto
SPSS
Statistical
modeling
Object-orientated
SoftwarePython
Programming
C++
JavaLanguage
French
German
Japanese
Esperanto
Object-
orientated
Custom embeddings identified equal opportunity and
perks language
Custom embeddings identified ‘soft’ skills and
language around experience
I’ve got 300 dimensions…
but time ain’t one
Two flavors of dynamic embeddings
Trained Together
2015
2016
2017
2018
Balmer and Mandt, arXiv: 1702:08359.
Rudolph and Blei, arXiv: 1703:08052.
Independently Trained
2016
2017
2018
2015
Kim, Chiu, Kaneki, Hedge and Petrov, arXiv: 1405:3515.
Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411:3315.
The benefits of dynamic Bernoulli embeddings
Data efficient: Treats each time slice
as a sequential latent variable,
enabling time slices with sparse data.
Does not require stitching/
alignment:
Treating time slice as a variable
ensures embeddings are connected
across slices.
Dynamic embeddings
Balmer and Mandt, arXiv: 1702:08359.
Rudolph and Blei, arXiv: 1703:08052.
Data hungry: Sufficient data for each
time slice for a quality embedding.
Requires stitching: Each time slice
is trained independently, therefore
dimensions are not comparable
across slices.
Static embeddings
Kim, Chiu, Kaneki, Hedge and Petrov, arXiv: 1405:3515.
Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411:3315.

Demand for MBAs and PhDs falling
Data Science JobsAll Jobs
0
0.5
1
2016 2017 2018
0
0.5
1
2016 2017 2018
MBA
PhD
PhD
Data Science skills showing significant shifts
Tableau and PowerBI
0
1
2
3
2016 2017 2018
Tableau
PowerBI
Python vs Perl
2016 2017 2018
Python
Perl
1
0
Hadoop vs Spark
2016 2017 2018
Hadoop
Spark
1
0
regression :: Generalized Linear Models as
word2vec :: Exponential Family Embeddings
Members of the Exponential Family of Embeddings
Binary Data
Bernoulli Embedding
Count or Ordinal Data
Poisson Embedding
Context
Datapoint
Context
Mini Bagels
Cream cheese
Milk
Coffee
Orange Juice
Continuous Data
Gaussian Embedding
Context
Datapoint
Context
JFK-CDG
LGA-DCA
JFK-DFW
LAX-JFK
LAX-LGA
Context
Datapoin
t
Context
Proficiency
programming
Python
Java
C++
Members of the Exponential Family of Embeddings
Binary Data
Bernoulli Embedding
Context
Datapoin
t
Context
Proficiency
programming
Python
Java
C++
Count or Ordinal Data
Context
Datapoint
Context
10 Guidelines for A/B Testing / Emily Robinson
The Value of Null Results / Angel D’az
Words in Space / Rebecca Bilbro
… / James Powell
Why I Use Julia / Katharine Hyatt
Poisson Embedding
Poisson embeddings capture item similarities from
shopper behavior
Maruchan chicken ramen High Inner Product Combinations
Maruchan creamy chicken ramen
Maruchan oriental flavor ramen
Maruchan roast chicken ramen
Old Dutch potato chips & Budweiser Lager beer
Lays potato chips & DiGiorno frozen pizza
Yoplait strawberry yogurt Low Inner Product Combinations
Yoplait apricot mango yogurt
Yoplait strawberry orange smoothie
Yoplait strawberry banana yogurt
General Mills cinnamon toast & Tide Plus detergent
Beef Swanson Broth soup & Campbell Soup cans
Adapted from Rudolph, Ruiz, Mandt and Blei, arXiv: 1608.00778.
How have data science skills changed over time?
- Flavors of static word embeddings: The Corpus Issue
- Considerations for developing custom embedding models
- Flavors of dynamic models: Dynamic Bernoulli embeddings
- Other members of the Exponential Family of Embeddings
Thank you PyData NYC!
Maryam Jahanshahi Ph.D.
Research Scientist
TapRecruit
http://bit.ly/pydatanyc-emb
Thank you PyData NYC!
Maryam Jahanshahi Ph.D.
Research Scientist
TapRecruit
http://bit.ly/pydatanyc-emb

Contenu connexe

Tendances

Agile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddAgile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddSrinivasa GV
 
Is Groovy better for testing than Java?
Is Groovy better for testing than Java?Is Groovy better for testing than Java?
Is Groovy better for testing than Java?Trisha Gee
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroNumenta
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systemsXavier Amatriain
 
Learning to Translate with Joey NMT
Learning to Translate with Joey NMTLearning to Translate with Joey NMT
Learning to Translate with Joey NMTJulia Kreutzer
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperRishiVardhaniM
 
Frequently asked tcs technical interview questions and answers
Frequently asked tcs technical interview questions and answersFrequently asked tcs technical interview questions and answers
Frequently asked tcs technical interview questions and answersnishajj
 
Code Smells and Refactoring - Satyajit Dey & Ashif Iqbal
Code Smells and Refactoring - Satyajit Dey & Ashif IqbalCode Smells and Refactoring - Satyajit Dey & Ashif Iqbal
Code Smells and Refactoring - Satyajit Dey & Ashif IqbalCefalo
 

Tendances (8)

Agile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddAgile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tdd
 
Is Groovy better for testing than Java?
Is Groovy better for testing than Java?Is Groovy better for testing than Java?
Is Groovy better for testing than Java?
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems
 
Learning to Translate with Joey NMT
Learning to Translate with Joey NMTLearning to Translate with Joey NMT
Learning to Translate with Joey NMT
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python Developer
 
Frequently asked tcs technical interview questions and answers
Frequently asked tcs technical interview questions and answersFrequently asked tcs technical interview questions and answers
Frequently asked tcs technical interview questions and answers
 
Code Smells and Refactoring - Satyajit Dey & Ashif Iqbal
Code Smells and Refactoring - Satyajit Dey & Ashif IqbalCode Smells and Refactoring - Satyajit Dey & Ashif Iqbal
Code Smells and Refactoring - Satyajit Dey & Ashif Iqbal
 

Similaire à Using Embeddings to Understand the Variance and Evolution of Data Science... - Maryam Jahanshahi

Big Data Engineer Resume. Timely Delivery: We unde
Big Data Engineer Resume. Timely Delivery: We undeBig Data Engineer Resume. Timely Delivery: We unde
Big Data Engineer Resume. Timely Delivery: We undeLindsay Adams
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
SAP HANA SPS10- Text Analysis & Text Mining
SAP HANA SPS10- Text Analysis & Text MiningSAP HANA SPS10- Text Analysis & Text Mining
SAP HANA SPS10- Text Analysis & Text MiningSAP Technology
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
 
Python For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in HandPython For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in HandDido Grigorov
 
Ssas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiSsas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiKoray Kocabas
 
Shashwat Sher Resume
Shashwat Sher ResumeShashwat Sher Resume
Shashwat Sher ResumeShashwat Sher
 
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...IRJET Journal
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Loading text data from SAP source systems
Loading text data from SAP source systemsLoading text data from SAP source systems
Loading text data from SAP source systemsMarcelo Honores
 
Brochure - Software Development Learning Path
 Brochure - Software Development Learning Path Brochure - Software Development Learning Path
Brochure - Software Development Learning PathBoard Infinity
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5gdgsurrey
 
CA ERwin Data Modeler End User Presentation
CA ERwin Data Modeler End User PresentationCA ERwin Data Modeler End User Presentation
CA ERwin Data Modeler End User PresentationCA RMDM Latam
 

Similaire à Using Embeddings to Understand the Variance and Evolution of Data Science... - Maryam Jahanshahi (20)

Big Data Engineer Resume. Timely Delivery: We unde
Big Data Engineer Resume. Timely Delivery: We undeBig Data Engineer Resume. Timely Delivery: We unde
Big Data Engineer Resume. Timely Delivery: We unde
 
Manish SAS
Manish SASManish SAS
Manish SAS
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
SAP HANA SPS10- Text Analysis & Text Mining
SAP HANA SPS10- Text Analysis & Text MiningSAP HANA SPS10- Text Analysis & Text Mining
SAP HANA SPS10- Text Analysis & Text Mining
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
Industrail training in php
Industrail training in phpIndustrail training in php
Industrail training in php
 
Python For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in HandPython For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in Hand
 
Ssas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiSsas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesi
 
Shashwat Sher Resume
Shashwat Sher ResumeShashwat Sher Resume
Shashwat Sher Resume
 
Shaneeb@SAP_ABAP_compressed
Shaneeb@SAP_ABAP_compressedShaneeb@SAP_ABAP_compressed
Shaneeb@SAP_ABAP_compressed
 
slides.pdf
slides.pdfslides.pdf
slides.pdf
 
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Sap abap pdf
Sap abap pdfSap abap pdf
Sap abap pdf
 
Loading text data from SAP source systems
Loading text data from SAP source systemsLoading text data from SAP source systems
Loading text data from SAP source systems
 
Brochure - Software Development Learning Path
 Brochure - Software Development Learning Path Brochure - Software Development Learning Path
Brochure - Software Development Learning Path
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
 
Nicholas.Piggee resume
Nicholas.Piggee resumeNicholas.Piggee resume
Nicholas.Piggee resume
 
Rohan_ABAP_CV
Rohan_ABAP_CVRohan_ABAP_CV
Rohan_ABAP_CV
 
CA ERwin Data Modeler End User Presentation
CA ERwin Data Modeler End User PresentationCA ERwin Data Modeler End User Presentation
CA ERwin Data Modeler End User Presentation
 

Plus de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...PyData
 
Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...PyData
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...PyData
 

Plus de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...
 
Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...
 

Dernier

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Dernier (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Using Embeddings to Understand the Variance and Evolution of Data Science... - Maryam Jahanshahi

  • 1. Using Embeddings to Understand 
 the Variance and Evolution of Data Science Skill Sets Maryam Jahanshahi Ph.D. Research Scientist TapRecruit http://bit.ly/pydatanyc-emb
  • 2. Maryam Jahanshahi Ph.D. Research Scientist TapRecruit http://bit.ly/pydatanyc-emb Using Embeddings to Understand 
 the Variance and Evolution of Data Science Skill Sets
  • 3. TapRecruit uses NLP to understand and 
 organize natural language career content Smart Editor 
 for Job Descriptions Active Pipeline 
 Health Monitoring Multifaceted Salary 
 Estimation
  • 4.
  • 5. Language matters in job descriptions Same title, Different job Finance Manager Kraft Foods Finance Manager Kraft Foods Same Title Junior (3 Years) No Managerial Experience Senior (6-8 Years) Division Level Controller Strategic Finance Role MBA / CPA Required Experience Required Responsibility Preferred Skill Required Education Different title, Same job Performance 
 Marketing Manager PocketGems Senior Analyst, Customer Strategy The Gap Mid-Level Quantitative Focus Expertise iBanking Data Analysis Tools (SQL) Consulting Experience Preferred MBA Preferred Mid-Level Quantitative Focus Expertise iBanking Relational Database Experience External Consulting Experience Preferred BA degree in business, finance, MBA Preferred Required Experience Required Skills Required Experience Required Skills Preferred Experience Required and 
 Preferred Education
  • 6. How have data science skills changed over time?
  • 7. How have data science skills changed over time? Manual Feature Extraction Dynamic Topic Modeling Adapted from Blei and Lafferty, ICML 2006. 1880 force energy motion differ light 1960 radiat energy electron measure ray 2000 state energy electron magnet field 1920 atom theory electron energy measure Matter Electron Quantum MBA PhD SQL Tableau PowerBIPython
  • 8. Word embeddings capture semantic similarities Proficiency programming in Python, Java or C++. Word ContextContext Experience in Python, Java or other object-oriented programming languages WordContext Context Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python) WordContext French German Japanese Esperanto Language Python Programming C++ Java Object- orientated
  • 9. Embeddings capture entity relationships Adapted from Stanford NLP GLoVE Project Woman :: Queen as Man :: ? Man Woman Queen King Hierarchies McAdamCola o VodafoneVerizon Viacom Dauman Exxon Tillerson Wal-Mart McMillon Comparatives 
 and Superlatives SlowestSlower Slow Shortest Shorter Strongest Stronge r Strong Short
  • 10. Pretrained embeddings facilitate fast prototyping Final Application Corpus Twitter Common Crawl GoogleNews Wikipedia Tokens 27 B 42-840 B 100 B 6 B Vocabulary Size 1.2 M 1.9-2.2 M 3 M 400 k Algorithm GLoVE GLoVE word2vec GLoVE Vector Length 25 - 200 d 300 d 300 d 50 - 300 d Corpus Generation Corpus Processing Language Model Generation Language Model Tuning
  • 11. Problems with pretrained embedding models Casing Abbreviations vs Words e.g. IT vs it Out of Vocabulary Words Domain Specific Words & Acronyms Polysemy Words with multiple meanings e.g. drive (a car) vs drive (results) e.g. Chef (the job) vs Chef (the language) Multi-word Expressions Phrases that have new meanings e.g. Front-end vs front + end
  • 12. Tools for developing custom language models Tokenization, POS tagging, Sentence Segmentation, Dependency Parsing Corpus Processing CoreNLP SyntaxNet Different word embedding models (GLoVE, word2vec, fastText) Language Modeling
  • 13. Windows capture semantic similarity vs relatedness Captures Semantic similarity, Substitutes and Word-level differences Small Window Size Captures Semantic relatedness, Alternatives and Domain-level differences Large Window Size Python Java Programming C++ Language French German Japanese Esperanto SPSS Statistical modeling Object-orientated SoftwarePython Programming C++ JavaLanguage French German Japanese Esperanto Object- orientated
  • 14. Custom embeddings identified equal opportunity and perks language
  • 15. Custom embeddings identified ‘soft’ skills and language around experience
  • 16. I’ve got 300 dimensions… but time ain’t one
  • 17. Two flavors of dynamic embeddings Trained Together 2015 2016 2017 2018 Balmer and Mandt, arXiv: 1702:08359. Rudolph and Blei, arXiv: 1703:08052. Independently Trained 2016 2017 2018 2015 Kim, Chiu, Kaneki, Hedge and Petrov, arXiv: 1405:3515. Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411:3315.
  • 18. The benefits of dynamic Bernoulli embeddings Data efficient: Treats each time slice as a sequential latent variable, enabling time slices with sparse data. Does not require stitching/ alignment: Treating time slice as a variable ensures embeddings are connected across slices. Dynamic embeddings Balmer and Mandt, arXiv: 1702:08359. Rudolph and Blei, arXiv: 1703:08052. Data hungry: Sufficient data for each time slice for a quality embedding. Requires stitching: Each time slice is trained independently, therefore dimensions are not comparable across slices. Static embeddings Kim, Chiu, Kaneki, Hedge and Petrov, arXiv: 1405:3515. Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv: 1411:3315.

  • 19. Demand for MBAs and PhDs falling Data Science JobsAll Jobs 0 0.5 1 2016 2017 2018 0 0.5 1 2016 2017 2018 MBA PhD PhD
  • 20. Data Science skills showing significant shifts Tableau and PowerBI 0 1 2 3 2016 2017 2018 Tableau PowerBI Python vs Perl 2016 2017 2018 Python Perl 1 0 Hadoop vs Spark 2016 2017 2018 Hadoop Spark 1 0
  • 21. regression :: Generalized Linear Models as word2vec :: Exponential Family Embeddings
  • 22. Members of the Exponential Family of Embeddings Binary Data Bernoulli Embedding Count or Ordinal Data Poisson Embedding Context Datapoint Context Mini Bagels Cream cheese Milk Coffee Orange Juice Continuous Data Gaussian Embedding Context Datapoint Context JFK-CDG LGA-DCA JFK-DFW LAX-JFK LAX-LGA Context Datapoin t Context Proficiency programming Python Java C++
  • 23. Members of the Exponential Family of Embeddings Binary Data Bernoulli Embedding Context Datapoin t Context Proficiency programming Python Java C++ Count or Ordinal Data Context Datapoint Context 10 Guidelines for A/B Testing / Emily Robinson The Value of Null Results / Angel D’az Words in Space / Rebecca Bilbro … / James Powell Why I Use Julia / Katharine Hyatt Poisson Embedding
  • 24. Poisson embeddings capture item similarities from shopper behavior Maruchan chicken ramen High Inner Product Combinations Maruchan creamy chicken ramen Maruchan oriental flavor ramen Maruchan roast chicken ramen Old Dutch potato chips & Budweiser Lager beer Lays potato chips & DiGiorno frozen pizza Yoplait strawberry yogurt Low Inner Product Combinations Yoplait apricot mango yogurt Yoplait strawberry orange smoothie Yoplait strawberry banana yogurt General Mills cinnamon toast & Tide Plus detergent Beef Swanson Broth soup & Campbell Soup cans Adapted from Rudolph, Ruiz, Mandt and Blei, arXiv: 1608.00778.
  • 25. How have data science skills changed over time? - Flavors of static word embeddings: The Corpus Issue - Considerations for developing custom embedding models - Flavors of dynamic models: Dynamic Bernoulli embeddings - Other members of the Exponential Family of Embeddings
  • 26. Thank you PyData NYC! Maryam Jahanshahi Ph.D. Research Scientist TapRecruit http://bit.ly/pydatanyc-emb
  • 27. Thank you PyData NYC! Maryam Jahanshahi Ph.D. Research Scientist TapRecruit http://bit.ly/pydatanyc-emb