DataScience SG | Undergrad Series | 26 Sept 2019
ABOUT: Ivan Tan
2014: Enrolled into SMU
2014 - 2017: Internships, Research Assistant, personal projects (QA automation, web crawlers, Tableau dashboards)
2017: (1) SMU FYP: KDD Labs, a web-based data mining application
2018: Graduation (SMU Information Systems, Marketing Analytics)
2018: (2) Data Scientist, Tech In Asia: job recommender systems, churn prediction, network analysis, BI solutions
SMU Final Year Project (2017)
KDD Labs
• Introduction, Project Motivation
• Key Functionalities
• Challenges
• Demo
• Technology/Framework
• Project Takeaways
1
KDD Labs
A web-based analytics tool to
enable students to upload,
visualize and perform data mining
with no coding
PROJECT MOTIVATION
Used as a teaching tool in SMU’s IS424 Business
Analytics and Data Mining (2017)
Not meant to replace enterprise-grade or open-source
products like Python (sklearn, tensorflow, etc.) or SAS
Provide students who are new to the field some hands-
on experience
Overcome existing limitations such as installation, OS-
specific software by providing a web interface
What is KDD?
Knowledge Discovery in Databases (KDD)
1. Understanding of the application domain & goals
2. Creating a target data set: selecting or focusing on a
subset of variables, or samples
3. Data cleaning and preprocessing
4. Data reduction and projection (transformation,
dimensionality reduction, and so on) to find useful features
5. Choosing the data mining task.
○ Deciding whether the goal of the KDD process is
classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s)
○ Methods, models, parameters
7. Data mining - identifying patterns of interest
8. Analysing, Interpreting patterns, consolidating
discovered knowledge.
Reference: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
KDD Labs
Key Functionalities
● User Management - Login/Logout, Project history
● Data Management - Upload, Sharing, data attributes
● Data Exploration - Summary Statistics, Data
Visualisations (graphs)
● Data Preprocessing - Normalisation, Missing values
● Data Classification - Decision Trees, SVM, Ensemble,
XGBoost, Train/Test split, K-Fold, Prediction
● Data Clustering - KMeans, Hierarchical
● Data Association Rules - Market basket analysis,
Graph visualisation
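As an illustration of the market basket analysis item above, here is a minimal sketch of the two basic association-rule measures, support and confidence. The transactions and the `min_support` threshold are invented for this example and are not taken from KDD Labs.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def pair_rules(transactions, min_support=0.5):
    """Return (A, C, support, confidence) for item pairs meeting min_support."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s >= min_support:
            rules.append((a, b, s, confidence({a}, {b}, transactions)))
            rules.append((b, a, s, confidence({b}, {a}, transactions)))
    return rules

for a, c, s, conf in pair_rules(transactions):
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}")
```

A real tool would usually rely on an algorithm such as Apriori to avoid enumerating every itemset; the brute-force pair loop here is only meant to make the definitions concrete.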
Project
Challenges
No data analytics
background when we
started
Learn data mining and
develop web application
concurrently
Deploy application to
production (live usage) in
the same semester
Timeline:
May - Aug (Summer break): Design and develop basic functionalities (Login, Canvas, Preprocessing, Visualization)
Aug - Nov (School term): Iteratively push new features (Lab 1, Lab 2, Lab 3, Lab 4)
Nov: Final presentations
Technologies, Architecture
Roles & Responsibilities
2x Frontend
1x Backend
1x Data / Backend
1x Project Manager
◎ Understand and translate lab requirements to
specifications
◎ Write functions to interface with scikit-learn,
pandas, matplotlib libraries
◎ Manage CRUD/storage of data files
Python-based web
application for easy
integration with scikit-
learn libraries
Demo
Disclaimer
◎ Project is no longer being maintained. Please
don’t report bugs to me :)
◎ If you do find a user account and password
lying around, do not upload your
personal/private data
Last Deployed: 2017
KDD Labs
Key Takeaways
Intro to data mining
Introduction to basic Machine
Learning algorithms
(supervised/unsupervised),
evaluation methods
Importance of basics
Understanding fundamental
concepts is key before going into
advanced ML (AI, deep learning,
etc.)
Technical skills
Python (pandas, numpy) for
data wrangling, visualisation
(matplotlib) and ML (scikit-
learn)
Project Management
Prioritisation of tasks, time
management, etc. If you are ahead
of time, management will always
add more features
Right attitude &
mindset
You are limited only by your
imagination. You can set out to
achieve anything if you put your mind
to it!
(or when your grades depend on it 😄)
Jobs Recommender System
at Tech In Asia (TIA)
• Introduction
• Overview of Recommender Systems
• Modelling the data
• NLP with Word2Vec
• Regression
• Advice/Final Notes
2
About
Tech in Asia
Tech in Asia aims to build and serve
Asia’s tech and startup community.
Media bridges the information
gap, delivering insights and
news about the Asian tech
industry
Jobs bridges the talent gap,
linking Asian tech talent to the
right roles and employers
Studios is our newest team,
focused on bridging the
connection gap between
companies and their Asian tech
audience
Impact of Recommender Systems
35% of revenue from recommendations
30% of overall listening from recommendations
75% of content views from recommendations
References: https://digital.hbs.edu/platform-rctom/submission/discover-weekly-how-spotify-is-changing-the-way-we-consume-music/,
https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
Jobseekers
Discovery
Displaying recommended jobs to
users based on their profile (job title,
skillset, years of experience, etc.)
What do recommender systems achieve at TIA Jobs?
Employers
Filtering
Helping employers to sift through
hundreds of applications by
prioritising relevant candidates
Creating a
better match
between 2
parties
Content-based
● Uses the content for recommendation purposes - text processing techniques,
semantic information, etc (Item features)
Collaborative
filtering
● Uses the interaction of users with items - historical data such as clicks,
purchases, ratings, etc. E.g. “You may like this because users who bought this also
bought ...”
Hybrid ● Combination of various approaches
More readings here:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.1036&rep=rep1&type=pdf
First steps : Conduct Literature Research
Overview of Recommender Systems
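To make the collaborative filtering idea ("users who bought this also bought") concrete, here is a minimal co-occurrence sketch. The interaction log, the item names, and the `recommend` helper are all hypothetical; this is not the approach used at TIA, which is content-based as the next slide explains.

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: user -> set of items they interacted with.
interactions = {
    "u1": {"job_a", "job_b"},
    "u2": {"job_a", "job_b", "job_c"},
    "u3": {"job_a", "job_c"},
    "u4": {"job_d"},
}

def also_interacted(interactions):
    """Count, for each item, how often every other item co-occurs with it."""
    co_counts = defaultdict(Counter)
    for items in interactions.values():
        for item in items:
            for other in items:
                if other != item:
                    co_counts[item][other] += 1
    return co_counts

def recommend(user, interactions, top_n=2):
    """Rank items the user has not seen by co-occurrence with items they have."""
    seen = interactions[user]
    co_counts = also_interacted(interactions)
    scores = Counter()
    for item in seen:
        for other, count in co_counts[item].items():
            if other not in seen:
                scores[other] += count
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:top_n]

print(recommend("u1", interactions))
```

Note that user `u4` gets no recommendations because nobody else touched `job_d`. This cold-start weakness is one reason content-based features are attractive when interaction history is thin.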
Job-seekers
Resume + Profile info
- Current Job Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Our approach : Content-based/Knowledge-based
Modelling the data
Employers
Job postings
- Job Posting Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Creating a
‘similarity
score’ between
the 2 features
OCR / Entity
Extraction (3rd party)
Semi-structured
data
Score Table
Natural Language Processing with Word2Vec
ID | Resume (User)  | Job Posting    | Scoring
1  | Title          | Title          | Word2Vec (Job Description) + Cosine Similarity
2  | Years of Exp   | Years of Exp   | ..
3  | Skills (Array) | Skills (Array) | ..
4  | ...
What is Word2Vec?
Why?
“Word2vec is a two-layer neural net that
processes text. Its input is a text corpus
and its output is a set of vectors: feature
vectors for words in that corpus. While
Word2vec is not a deep neural network,
it turns text into a numerical form that
deep nets can understand.”
References: https://skymind.ai/wiki/word2vec, https://planetcalc.com/1721/
Data Scientist = [1, 0, 1, 0 … 1]
Data Analyst = [1, 1, 1, 0 … 1]
Data Engineer = [1, 0, 1, 0 … 1]
ML Engineer = [1, 1, 1, 0 … 1]
N-dimensional vector
How close/far are the titles
“Data Scientist” and “Data Analyst”?
String-based
(Levenshtein distance)
does not take into
account semantics. There
are many titles which are
similar/identical in
meaning but named very
differently
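For reference, a minimal sketch of the Levenshtein (edit) distance mentioned above. It counts character edits only, which is why two titles with the same meaning but different wording come out as far apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

# Titles that mean similar things can still be many edits apart.
print(levenshtein("Software Engineer", "Backend Developer"))
print(levenshtein("Data Scientist", "Data Analyst"))
```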
Training/Inference
Word2Vec for Job Titles
1. Training: Preprocessed Text Corpus (Documents) -> Word2Vec Model (can use custom or pre-trained).
Train on historical job postings data as ‘dictionary’.
2. Inference: How close/far are each of the job titles?
Use Cosine Similarity, K-Means clustering, etc.
[Full Stack Developer,
Software Engineer,
Backend Engineer,
Backend Developer]
[Data Scientist, Data
Engineer, Data Analyst,
Senior Data Scientist,
Business Intelligence
Manager]
[Graphic Designer,
Graphic Design Intern,
Graphic Design, UI/UX
Designer, Creative
Designer]
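The inference step can be sketched as follows. The 3-dimensional word vectors are made up for illustration; a trained Word2Vec model would supply real ones. Averaging word vectors to get a title vector is also an assumption, since the slides do not say how multi-word titles were pooled.

```python
import math

# Made-up 3-dimensional word vectors, for illustration only.
word_vectors = {
    "data":      [0.9, 0.1, 0.0],
    "scientist": [0.8, 0.2, 0.1],
    "analyst":   [0.7, 0.3, 0.1],
    "graphic":   [0.0, 0.1, 0.9],
    "designer":  [0.1, 0.2, 0.8],
}

def title_vector(title):
    """Average the vectors of the known words in a title (assumed pooling)."""
    vectors = [word_vectors[w] for w in title.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError(f"no known words in title: {title!r}")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

close = cosine_similarity(title_vector("Data Scientist"), title_vector("Data Analyst"))
far = cosine_similarity(title_vector("Data Scientist"), title_vector("Graphic Designer"))
print(f"Data Scientist vs Data Analyst:     {close:.3f}")
print(f"Data Scientist vs Graphic Designer: {far:.3f}")
```

With these toy vectors the two data titles score close to 1 and the design title scores much lower, which is the kind of grouping shown in the clusters above.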
Training/Inference
Word2Vec for Skills
Example:
Design/Marketing-related skills such as Adobe
Photoshop, After Effects, Brand Design,
Advertisement are ‘close’ to each other.
‘Skills’ vector reduced to 2 dimensions
using PCA for visualisation
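A minimal sketch of the PCA step, using NumPy's SVD instead of a dedicated PCA class. The 4-dimensional skill vectors are invented; real Word2Vec vectors would have many more dimensions.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Made-up 4-dimensional "skill" vectors, for illustration only.
skills = ["Adobe Photoshop", "After Effects", "Python", "SQL"]
vectors = [
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.8, 0.9],
]

points = pca_2d(vectors)
for skill, (x, y) in zip(skills, points):
    print(f"{skill:16s} {x:+.2f} {y:+.2f}")
```

The 2-D points can then go into any scatter plot. Related skills should land near each other if the underlying vectors capture that relation.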
Score Table
Putting it all together
For each User-Job pair (How relevant is this User-Job pair?):
Job Title Score (X1) | Skills Score (X2) | (Other features) ... | Score Y (0,1)
...
This brings up another problem: how
do we assign weights to each of these
features?
I.e. how important is the job title
compared to skills and other
features when evaluating a
candidate?
The Answer:
Regression - Using Historical Job Applications Data
Acceptance $Y_{ij}$ of each match between User $i$ and Job $j$
will be based on the following:
$Y_{ij} = X_{1,ij} W_1 + X_{2,ij} W_2 + \dots + \varepsilon$
Resume (User) x Description (Job) score | Accepted (Yes/No)
X1W1 | X2W2 | (Other features)... | 1 / 0
...
The weights $W_k$ can be filled in using the coefficients of the
independent variables from the regression results:
(1) Coefficients: how important each feature is
(title, skills, experience, etc.), used to compute the final score
(2) $R^2$: backtesting the model, a good measure of how well
the model performs (how much of the variation in job
application acceptances is explained by the independent
variables)
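The regression step can be sketched with ordinary least squares in NumPy. The eight user-job pairs are invented for illustration. The intercept column is an added assumption, since the formula on the slide has none. The slides also do not say which regression variant was used in practice.

```python
import numpy as np

# Made-up feature scores for eight user-job pairs (illustration only).
# Columns: job title score (X1), skills score (X2).
X = np.array([
    [0.9, 0.9],
    [0.8, 0.8],
    [0.9, 0.2],
    [0.8, 0.1],
    [0.2, 0.9],
    [0.1, 0.8],
    [0.2, 0.2],
    [0.1, 0.1],
])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # accepted (1) or not (0)

# Add an intercept column (an assumption; the slide's formula has none).
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ w - y||^2.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, w_title, w_skills = w

# R^2 = 1 - SS_res / SS_tot: share of the variance in y explained by the fit.
predictions = X_design @ w
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

def final_score(title_score, skills_score):
    """Weighted sum of feature scores using the fitted coefficients."""
    return intercept + w_title * title_score + w_skills * skills_score

print(f"w_title={w_title:.3f}, w_skills={w_skills:.3f}, R^2={r_squared:.3f}")
```

In this toy data the title score separates accepted from rejected pairs better than the skills score, so it receives the larger weight. With a 0/1 outcome, logistic regression is a common alternative to the linear form shown here.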
We have historical
user job applications
and acceptance data!
Pair | User ID | Job ID | Final Score | Rank
1    | 1       | 1      | 0.9         | 1
2    | 1       | 2      | 0.8         | 2
3    | 1       | 3      | 0.7         | 3
4    | ...     | ...
Business Logic
Post-Processing
Eg. Recommendations for User ID 1
Re-ranking by other factors such as
- Date published of job posting
- Popularity
- Paid/Boosting
- Randomization
...
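A minimal sketch of such a post-processing step, re-ranking by recency and paid boosting. The final scores reuse the example table above. The posting ages, the boost flag, the weights, and the linear freshness decay are hypothetical choices and not TIA's actual business logic.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    job_id: int
    final_score: float      # model score for this user-job pair
    days_since_posted: int
    boosted: bool = False   # paid/boosted posting

def rerank(candidates, recency_weight=0.1, boost_bonus=0.05, max_age_days=30):
    """Sort candidates by model score plus small recency and boost adjustments."""
    def adjusted(c):
        # Freshness decays linearly from 1 (posted today) to 0 (max_age_days or older).
        freshness = max(0.0, 1 - c.days_since_posted / max_age_days)
        bonus = boost_bonus if c.boosted else 0.0
        return c.final_score + recency_weight * freshness + bonus
    return sorted(candidates, key=adjusted, reverse=True)

# Recommendations for User ID 1, with invented ages and boost flags.
candidates = [
    Candidate(job_id=1, final_score=0.9, days_since_posted=30),
    Candidate(job_id=2, final_score=0.8, days_since_posted=0, boosted=True),
    Candidate(job_id=3, final_score=0.7, days_since_posted=15),
]

ranked_ids = [c.job_id for c in rerank(candidates)]
print(ranked_ids)
```

Keeping the adjustments small relative to the model score means business rules nudge the order without overriding relevance.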
API/REST
Website
Scheduling ETLs, data pipelines
Scala ML Libraries, big data
processing
Prototyping, developing POC,
visualisation, evaluation
What I learned building a job recommender system
Practical Advice
If you’re a jobseeker...
Avoid graphical CVs
Unless:
◎ You’re a graphic designer/UI/UX designer
◎ You’re certain that the CV ends up
in the hiring manager’s inbox
and not on a recruitment/jobs
platform
If you’re a jobseeker/employer...
Avoid fancy job titles (You know what I’m
talking about!)
Coding Ninja, Software Ninjaneer, Tax
Wrangler, Full Stack Magician, Retail
Jedi, Digital Overlord, Chief Everything
Officer …
Unless:
◎ No
Data Science
Final Notes
You have to be
passionate
● There is a lot to learn in this
field, and you have to keep up
constantly
● Rapid advancement in data
science/AI. FOMO is real!
Leverage the great
community
● A unique feature of Data Science
practitioners/community -
competitive yet collaborative
● Reach out to experts in different
areas, mentors
● Personally benefited from DSSG
talks and workshops
FOMO: Fear of Missing Out
Thank you for your time :)
Any questions?
linkedin.com/in/yongsiang
ivan@techinasia.com

DataScience SG | Undergrad Series | 26th Sep 19

  • 2. QA automation, Web- crawlers, Tableau dashboards Internships, Research Assistant, personal projects KDD Labs - Web-based data mining application 1SMU FYP SMU Information Systems, Marketing Analytics Graduation Job Recommender Systems, churn prediction, network analysis, BI solutions 2Data Scientist, Tech In Asia ABOUT 2014 - 2017 2017 2018 2018 2014 Enrolled into SMU Ivan Tan
  • 3. SMU Final Year Project (2017) KDD Labs • Introduction, Project Motivation • Key Functionalities • Challenges • Demo • Technology/Framework • Project Takeaways 1
  • 4. . KDD Labs . A web-based analytics tool to enable students to upload, visualize and perform data mining with no coding PROJECT MOTIVATION Used as a teaching tool in SMU’s IS424 Business Analytics and Data Mining (2017) Not meant to replace enterprise-grade or open-source products like Python (sklearn, tensorflow, etc.) or SAS Provide students who are new to the field some hands- on experience Overcome existing limitations such as installation, OS- specific software by providing a web interface
  • 5. What is KDD? Knowledge Discovery in Databases (KDD) 1. Understanding of the application domain & goals 2. Creating a target data set: selecting or focusing on a subset of variables, or samples 3. Data cleaning and preprocessing 4. Data reduction and projection. (transformation, dimensionality reduction, so on) - Useful features 5. Choosing the data mining task. ○ Deciding whether the goal of the KDD process is classification, regression, clustering, etc. 6. Choosing the data mining algorithm(s) ○ Methods, models, parameters 7. Data mining - identifying patterns of interest 8. Analysing, Interpreting patterns, consolidating discovered knowledge. Reference: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
  • 6. 6
  • 7. KDD Labs Key Functionalities ● User Management - Login/Logout, Project history ● Data Management - Upload, Sharing, data attributes ● Data Exploration - Summary Statistics, Data Visualisations (graphs) ● Data Preprocessing - Normalisation, Missing values ● Data Classification - Decision Trees, SVM, Ensemble, XGBoost, Train/Test split, XGBoost, K-Fold, Prediction ● Data Clustering - KMeans, Hierarchical ● Data Association Rules - Market basket analysis, Graph visualisation
  • 8. Project Challenges No data analytics background when we started Learn data mining and develop web application concurrently Deploy application to production (live usage) to in same semester Design and develop basic functionalities (Login, Canvas, Preprocessing, Visualization) May Aug NovLab 1 Lab 2 Lab 3 Lab 4 Iteratively push new features (School term) ( Summer break) Final presentations
  • 9.
  • 10. Technologies, Architecture Roles & Responsibilities 2x Frontend 1x Backend 1x Data / Backend 1x Project Manager ◎ Understand and translate lab requirements to specifications ◎ Write functions to interface with scikit-learn, pandas, matplotlib libraries ◎ Manage CRUD/storage of data files Python-based web application for easy integration with scikit- learn libraries
  • 11. Demo Disclaimer ◎ Project is no longer being maintained. Please don’t report bugs to me :) ◎ If you do find a user account and password lying around, do not upload your personal/private data Last Deployed: 2017
  • 12. KDD Labs Key Takeaways Intro to data mining Introduction to basic Machine Learning algorithms (supervised/unsupervised), evaluation methods Importance of basics Understanding fundamentals concepts is key before going into advanced ML (AI, deep learning, etc.) Technical skills Python (pandas, numpy) for data wrangling, visualisation (matplotlib) and ML (scikit- learn) Project Management Prioritisation of tasks, time management, etc. If you are ahead of time, management will always add more features Right attitude & mindset You are only limited to your imaginations. You can set out to achieve anything if you put your mind to it! (or when your grades depend on it 😄)
  • 13. Jobs Recommender System at Tech In Asia (TIA) • Introduction • Overview of Recommender Systems • Modelling the data • NLP with Word2Vec • Regression • Advice/Final Notes 2 About Tech in Asia Tech in Asia aims to build and serve Asia’s tech and startup community. Media bridges the information gap, delivering insights and news about the Asian tech industry Jobs bridges the talent gap, linking Asian tech talent to the right roles and employers Studios is our newest team, focused on bridging the connection gap between companies and their Asian tech audience
  • 14. 35%of revenue from recommendations 30%of overall listening from recommendations 75%of content views from recommendations Reference: https://digital.hbs.edu/platform-rctom/submission/discover-weekly-how-spotify-is-changing-the-way-we-consume-music/, https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers, Impact of Recommender Systems
  • 15. Jobseekers Discovery Displaying recommended jobs to users based on their profile (job title, skillset, years of experience, etc.) What do recommender systems achieve at TIA Jobs? Employers Filtering Helping employers to sift through hundreds of applications by prioritising relevant candidates Creating a better match between 2 parties
  • 16. Content-based ● Uses the content for recommendation purposes - text processing techniques, semantic information, etc (Item features) Collaborative filtering ● Uses the interaction of users with items - historical data such as clicks, purchases, ratings, etc. I.e. “You may like this because users who bought this also bought” Hybrid ● Combination of various approaches More readings here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.1036&rep=rep1&type=pdf First steps : Conduct Literature Research Overview of Recommender Systems
  • 17. Job-seekers Resume + Profile info - Current Job Title - Skills [] - Years of Experience - Summary - Industry ... Our approach : Content-based/Knowledge-based Modelling the data Employers Job postings - Job Posting Title - Skills [] - Years of Experience - Summary - Industry ... Creating a ‘similarity score’ between the 2 features OCR / Entity Extraction (3rd party) Semi-structured data
  • 18. Score Table Natural Language Processing with Word2Vec ID Resume (User) Job Posting Scoring 1 Title Title Word2Vec (Job Description) + Cosine Similarity 2 Years of Exp Years of Exp .. 3 Skills (Array) Skills (Array) .. 4 ...
  • 19. What is Word2Vec Why? “Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.” References: https://skymind.ai/wiki/word2vec, https://planetcalc.com/1721/ Data Scientist = [1, 0, 1, 0 … 1] Data Analyst = [1, 1, 1, 0 … 1] Data Engineer = [1, 0, 1, 0 … 1] ML Engineer = [1, 1, 1, 0 … 1] N-dimensional vector How close/far are the titles “Data Scientist” and “Data Analyst”? String-based (Levenshtein distance) does not take into account semantics. There are many titles which are similar/identical in meaning but named very differently
  • 20. Training/Inference Word2Vec for Job Titles Train on historical job postings data as ‘dictionary’ Word2Vec Model (can use custom or pre- trained) Training Preprocessed Text Corpus (Documents) 1 2 Inference How close/far are the each of the job titles Use Cosine Similarity, K-Means clustering, etc. [Full Stack Developer, Software Engineer, Backend Engineer, Backend Developer] [Data Scientist, Data Engineer, Data Analyst, Senior Data Scientist, Business Intelligence Manager] [Graphic Designer, Graphic Design Intern, Graphic Design, UI/UX Designer, Creative Designer]
  • 21. Training/Inference Word2Vec for Skills Example: Design/Marketing-related skills such as Adobe Photoshop, After Effects, Brand Design, Advertisement are ‘close’ to each other. ‘Skills’ vector reduced to 2 dimensions using PCA for visualisation
  • 22. 23 Score Table Putting it all together For each User-Job (How relevant is this User-Job pair?) Score Job Title Score X1 Skills Score X2 (Other features) ... Y(0,1) ... This brings up another problem.. How do we assign weights to each of these features? I.e. How important is the job title as compared to skills and other features when evaluating a candidate
  • 23. The Answer: Regression - Using Historical Job Applications Data
      We have historical user job applications and acceptance data!
      The acceptance $Y_{ij}$ of each User $i$, Job $j$ match is based on the following:
        $Y_{ij} = X_{1,ij} W_1 + X_{2,ij} W_2 + \dots + \varepsilon$
      Resume (User) x Description (Job) score | Accepted (Yes/No)
      X1W1, X2W2, (Other features)...         | 1 / 0
      ...
      The weights $W_k$ can be filled in using the coefficients of the independent variables from the regression results:
      (1) Coefficients: the weightage of each feature (title, skills, exp, etc.), used to compute the final score
      (2) R²: backtesting the model; a good measure of how well the model performs (how much of the variation in job application acceptances is explained by the independent variables)
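
As a rough illustration of the idea on this slide, the sketch below fits the weights by ordinary least squares (no intercept, matching the formula as written) and then uses them to score a new User-Job pair. The history rows are invented numbers, the helper names `fit_weights` and `final_score` are hypothetical, and the talk does not say which regression variant or library was actually used.

```python
def fit_weights(X, y):
    """Ordinary least squares without intercept.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * target for row, target in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [value / pivot for value in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def final_score(features, weights):
    """Weighted sum of feature scores for one User-Job pair."""
    return sum(x * w for x, w in zip(features, weights))


# Hypothetical history: [title score, skills score] and accepted (1) or not (0).
history_X = [[0.9, 0.8], [0.8, 0.3], [0.2, 0.7], [0.1, 0.2], [0.7, 0.9], [0.3, 0.4]]
history_y = [1, 1, 0, 0, 1, 0]

weights = fit_weights(history_X, history_y)
print("weights (title, skills):", weights)
print("score for a new pair:", final_score([0.85, 0.6], weights))
```

Because the outcome is 0/1, a logistic regression would be a natural alternative; a plain linear fit is kept here only because it mirrors the slide's formula directly.
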
  • 24. Pair | User ID | Job ID | Final Score | Rank
      1    | 1       | 1      | 0.9         | 1
      2    | 1       | 2      | 0.8         | 2
      3    | 1       | 3      | 0.7         | 3
      4    | ...     | ...
      E.g. recommendations for User ID 1
      Business Logic Post-Processing: re-ranking by other factors such as
        - Date published of job posting
        - Popularity
        - Paid/Boosting
        - Randomization
        ...
      Tools used for:
        - API/REST, website
        - Scheduling ETLs, data pipelines
        - Scala ML libraries, big data processing
        - Prototyping, developing POC, visualisation, evaluation
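
The post-processing step could look something like the sketch below, which adjusts the model's final score by recency and paid boosting before sorting. The `Candidate` fields, the `rerank` function, and the weights (`recency_weight`, `boost_bonus`, `max_age_days`) are all hypothetical; the slides name the re-ranking factors but not how they were combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    job_id: int
    model_score: float       # final score from the recommender
    days_since_posted: int
    is_boosted: bool = False  # paid/boosted posting


def rerank(candidates, recency_weight=0.1, boost_bonus=0.05, max_age_days=30):
    """Sort candidates by model score plus assumed business-logic adjustments."""
    def adjusted(c):
        # Freshness decays linearly from 1 (posted today) to 0 (max_age_days or older).
        freshness = max(0.0, 1.0 - c.days_since_posted / max_age_days)
        bonus = boost_bonus if c.is_boosted else 0.0
        return c.model_score + recency_weight * freshness + bonus
    return sorted(candidates, key=adjusted, reverse=True)


# Recommendations for one user, using the scores from the table above;
# posting ages and boost flags are invented.
recommendations = [
    Candidate(job_id=1, model_score=0.9, days_since_posted=30),
    Candidate(job_id=2, model_score=0.8, days_since_posted=0, is_boosted=True),
    Candidate(job_id=3, model_score=0.7, days_since_posted=15),
]

for rank, c in enumerate(rerank(recommendations), start=1):
    print(rank, c.job_id)
```

Keeping these adjustments outside the model means business rules can change without retraining, at the cost of the final ordering no longer reflecting the learned score alone.
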
  • 25. What I learned building a job recommender system: Practical Advice
      If you’re a jobseeker... avoid graphical CVs, unless:
        ◎ You’re a graphic designer/UI/UX
        ◎ You’re certain that the CV will end up in the hiring manager’s inbox and not on a recruitment/jobs platform
      If you’re a jobseeker/employer... avoid fancy job titles (you know what I’m talking about!): Coding Ninja, Software Ninjaneer, Tax Wrangler, Full Stack Magician, Retail Jedi, Digital Overlord, Chief Everything Officer … Unless:
        ◎ No
  • 26. Data Science: Final Notes
      You have to be passionate
        ● There is a lot to learn in this field and to keep up with constantly
        ● Rapid advancement in data science/AI. FOMO (Fear of Missing Out) is real!
      Leverage the great community
        ● A unique feature of the Data Science practitioner community: competitive yet collaborative
        ● Reach out to experts in different areas, mentors
        ● Personally benefited from DSSG talks and workshops
  • 27. Thank you for your time :) Any questions? linkedin.com/in/yongsiang ivan@techinasia.com

Editor's notes

  1. Knowledge Discovery in Databases (KDD) Process
      - Developing an understanding of the application domain, the relevant prior knowledge, and the goals of the end-user.
      - Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
      - Data cleaning and preprocessing: removal of noise or outliers; collecting the necessary information to model or account for noise; strategies for handling missing data fields; accounting for time sequence information and known changes.
      - Data reduction and projection: finding useful features to represent the data depending on the goal of the task; using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
      - Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.
      - Choosing the data mining algorithm(s): selecting method(s) to be used for searching for patterns in the data; deciding which models and parameters may be appropriate; matching a particular data mining method with the overall criteria of the KDD process.
      - Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
      - Interpreting mined patterns.
      - Consolidating discovered knowledge.
  2. Challenges before embarking on the project. As a group of five year-3 students back then, we had no idea what data mining/data science was all about! (None of us had taken this module either.) Imagine trying to build an application to teach a topic that you are not even familiar with… The challenge was therefore to learn and build concurrently (all while taking other courses and generally being a student). As if that were not crazy enough (illustrate with a timeline), we deployed the project iteratively as IS424 was being taught (students were our guinea pigs). Grateful for having great sponsors to guide us along in terms of domain knowledge, with weekly iterations/consultations. We started learning analytics from the foundations.
  3. https://wiki.smu.edu.sg/is480/IS480_Team_wiki%3A_2017T1_Team_Atoms_Project_Management
  4. Context/overview before I conclude the first part with takeaways
  5. I don’t even know who is paying for the server
  6. Backup slide in case website goes down
  7. Key takeaways from the project
      - A ‘sneak preview’ into Machine Learning algorithms, which sparked my interest to continue pursuing data science.
      - Built a strong foundation in simple supervised/unsupervised ML concepts such as classification (decision trees), regression, association, and evaluation metrics (accuracy, precision, confusion matrix).
      - Once you understand the basics, most advanced algorithms build upon that foundation. Do not skip the basics and jump into deep learning, neural networks, etc., because simple problems require simple solutions (anything more is overkill).
      - You have to be genuinely passionate about data science to pursue it. A lot of hard work, perseverance and self-learning is crucial, and it is never enough.
      - Proficiency in Python (data cleaning, wrangling); 80-20 data cleaning.
      - If you are ahead of time, your boss will add new features.
      - Having the right attitude and mindset is important! You can do anything if you put your mind to it (or when your grades depend on it)!
  8. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.1036&rep=rep1&type=pdf
  9. Focus on job title
  10. https://www.cbinsights.com/research/most-absurd-tech-job-titles/
  11. You have to be passionate
      - There is a lot (really) to learn in this field and to keep up with constantly. You might think you know a lot about an area, but most of the time it’s just the tip of the iceberg. (Personally, that’s what the excitement is all about.) If you are not genuinely passionate you will lose interest quickly.
      - The field advances very, very quickly, with new publications in AI or new tools every now and then. FOMO is real. I don’t have a solution for that.
      Make full use of the helpful community
      - There will be a lot of times when you feel stuck in a particular area. Reach out to people in the community, ask experts, mentors, whatever; just do whatever it takes! (Stack Overflow will be your best friend.)
      - Despite this being a very competitive field, people are very helpful and mostly willing to share.
      - Personally benefited a lot from attending DSSG talks every now and then (that’s why I’m here!).