Details
For September, DataScience SG is starting a new series specially for undergraduates. The series aims to showcase undergraduates' and fresh graduates' project work.
The series is meant to encourage youths to pursue careers in data science and artificial intelligence, and to let employers come in and recruit talent for their companies.
In this inaugural meetup for the series, we have the following speakers sharing their projects and how those projects helped them in their current careers.
DSSG strongly encourages current undergraduates and fresh graduates to join us in this series. It is still open to the general community!
Details:
Ivan is currently a Data Scientist at Tech In Asia (TIA), with experience in developing recommender systems, customer churn prediction, network analysis and driving BI solutions through data visualization and analytics. He graduated from SMU in 2018 with a Bachelor of Science (Information Systems) and a major in Marketing Analytics.
Ivan will share his Final Year Project from his undergraduate days at SMU, KDDLabs, a web-based data mining application, explaining the team's motivations, challenges and key takeaways. In addition, he will talk about his first data product at TIA: a recommender system that helps better connect jobseekers with employers and vice versa.
LinkedIn: https://www.linkedin.com/in/yongsiang/
FYP: http://smu.sg/kddlabs
2. About Ivan Tan (timeline)
- 2014: Enrolled into SMU (Information Systems, Marketing Analytics)
- 2014 to 2017: Internships, Research Assistant, personal projects (QA automation, web crawlers, Tableau dashboards)
- 2017: SMU FYP: KDD Labs, a web-based data mining application
- 2018: Graduation
- 2018: Data Scientist, Tech In Asia (job recommender systems, churn prediction, network analysis, BI solutions)
4. KDD Labs
A web-based analytics tool to
enable students to upload,
visualize and perform data mining
with no coding
PROJECT MOTIVATION
Used as a teaching tool in SMU’s IS424 Business
Analytics and Data Mining (2017)
Not meant to replace enterprise-grade or open-source
products like Python (sklearn, tensorflow, etc.) or SAS
Provides students who are new to the field with some
hands-on experience
Overcomes limitations such as installation and OS-specific
software by providing a web interface
5. What is KDD?
Knowledge Discovery in Databases (KDD)
1. Understanding of the application domain & goals
2. Creating a target data set: selecting or focusing on a
subset of variables, or samples
3. Data cleaning and preprocessing
4. Data reduction and projection. (transformation,
dimensionality reduction, so on) - Useful features
5. Choosing the data mining task.
○ Deciding whether the goal of the KDD process is
classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s)
○ Methods, models, parameters
7. Data mining - identifying patterns of interest
8. Analysing and interpreting patterns, consolidating
discovered knowledge.
Reference: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
7. KDD Labs
Key Functionalities
● User Management - Login/Logout, Project history
● Data Management - Upload, Sharing, data attributes
● Data Exploration - Summary Statistics, Data
Visualisations (graphs)
● Data Preprocessing - Normalisation, Missing values
● Data Classification - Decision Trees, SVM, Ensembles,
XGBoost, Train/Test split, K-Fold, Prediction
● Data Clustering - KMeans, Hierarchical
● Data Association Rules - Market basket analysis,
Graph visualisation
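The classification functionality above maps directly onto the scikit-learn primitives the application wrapped. A minimal sketch of the train/test split, decision tree and k-fold flow on toy data (this is not the actual KDD Labs code):

```python
# Minimal sketch of the classification workflow KDD Labs exposed:
# train/test split, a decision tree, and k-fold evaluation.
# Uses the built-in iris toy dataset, not KDD Labs data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)          # hold-out accuracy
cv_scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(f"test accuracy: {test_acc:.2f}, 5-fold mean: {cv_scores.mean():.2f}")
```

The web UI essentially collected these parameters (algorithm, split ratio, number of folds) from the user and passed them to calls like these.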
8. Project
Challenges
No data analytics
background when we
started
Learn data mining and
develop web application
concurrently
Deploy application to
production (live usage)
in the same semester
Timeline (May to Nov, Labs 1 to 4):
- May to Aug (summer break): design and develop basic functionalities (Login, Canvas, Preprocessing, Visualization)
- Aug to Nov (school term): iteratively push new features
- Nov: final presentations
10. Technologies, Architecture
Roles & Responsibilities
2x Frontend
1x Backend
1x Data / Backend
1x Project Manager
◎ Understand and translate lab requirements to
specifications
◎ Write functions to interface with scikit-learn,
pandas, matplotlib libraries
◎ Manage CRUD/storage of data files
Python-based web application for easy
integration with scikit-learn libraries
11. Demo
Disclaimer
◎ Project is no longer being maintained. Please
don’t report bugs to me :)
◎ If you do find a user account and password
lying around, do not upload your
personal/private data
Last Deployed: 2017
12. KDD Labs
Key Takeaways
Intro to data mining
Introduction to basic Machine
Learning algorithms
(supervised/unsupervised),
evaluation methods
Importance of basics
Understanding fundamental concepts
is key before going into
advanced ML (AI, deep learning,
etc.)
Technical skills
Python (pandas, numpy) for
data wrangling, visualisation
(matplotlib) and ML (scikit-learn)
Project Management
Prioritisation of tasks, time
management, etc. If you are ahead
of time, management will always
add more features
Right attitude &
mindset
You are only limited by your
imagination. You can set out to
achieve anything if you put your mind
to it!
(or when your grades depend on it 😄)
13. Jobs Recommender System
at Tech In Asia (TIA)
• Introduction
• Overview of Recommender Systems
• Modelling the data
• NLP with Word2Vec
• Regression
• Advice/Final Notes
About
Tech in Asia
Tech in Asia aims to build and serve
Asia’s tech and startup community.
Media bridges the information
gap, delivering insights and
news about the Asian tech
industry
Jobs bridges the talent gap,
linking Asian tech talent to the
right roles and employers
Studios is our newest team,
focused on bridging the
connection gap between
companies and their Asian tech
audience
14. Impact of Recommender Systems
35% of revenue from recommendations
30% of overall listening from recommendations
75% of content views from recommendations
Reference: https://digital.hbs.edu/platform-rctom/submission/discover-weekly-how-spotify-is-changing-the-way-we-consume-music/,
https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
15. What do recommender systems achieve at TIA Jobs?
Jobseekers - Discovery: displaying recommended jobs to users based on their profile (job title, skillset, years of experience, etc.)
Employers - Filtering: helping employers sift through hundreds of applications by prioritising relevant candidates
Creating a better match between the 2 parties
16. Overview of Recommender Systems
First step: conduct literature research
Content-based
● Uses item content for recommendation purposes - text processing techniques, semantic information, etc. (item features)
Collaborative filtering
● Uses the interactions of users with items - historical data such as clicks, purchases, ratings, etc. I.e. "You may like this because users who bought this also bought..."
Hybrid
● Combination of various approaches
More readings here:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.1036&rep=rep1&type=pdf
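As a toy illustration of the collaborative-filtering idea ("users who bought this also bought"), item-to-item co-occurrence counting over hypothetical purchase histories (note this is not TIA's approach, which is content-based):

```python
# Toy item-to-item collaborative filtering: recommend the items that
# co-occur most often with a given item across user histories.
# The histories are hypothetical example data.
from collections import Counter
from itertools import permutations

histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"laptop", "monitor"},
    {"mouse", "keyboard"},
]

co_counts = {}  # item -> Counter of items seen together with it
for items in histories:
    for a, b in permutations(items, 2):
        co_counts.setdefault(a, Counter())[b] += 1

def also_bought(item, k=2):
    """Top-k items most often co-occurring with `item`."""
    return [i for i, _ in co_counts[item].most_common(k)]

print(also_bought("laptop"))
```

Real systems replace raw counts with normalised similarity (e.g. cosine over user-item vectors), but the "co-interaction" signal is the same.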
17. Modelling the data
Our approach: content-based/knowledge-based
Job-seekers - Resume + profile info (semi-structured data via 3rd-party OCR / entity extraction):
- Current Job Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Employers - Job postings:
- Job Posting Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Goal: creating a 'similarity score' between the 2 sets of features
18. Score Table
Natural Language Processing with Word2Vec
ID | Resume (User) | Job Posting | Scoring
1 | Title | Title | Word2Vec (Job Description) + Cosine Similarity
2 | Years of Exp | Years of Exp | ...
3 | Skills (Array) | Skills (Array) | ...
4 | ... | ... | ...
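A minimal sketch of the title-scoring row above: embed each title as a vector and score pairs with cosine similarity. The vectors here are toy stand-ins; in practice they would come from a Word2Vec model (e.g. by averaging the word vectors of each title):

```python
# Cosine similarity between title embeddings. The 3-d vectors below
# are hypothetical stand-ins for real Word2Vec output.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

titles = {  # hypothetical embeddings
    "Data Scientist": np.array([0.9, 0.8, 0.1]),
    "Data Analyst":   np.array([0.8, 0.9, 0.2]),
    "Retail Jedi":    np.array([0.1, 0.2, 0.9]),
}

resume_title = titles["Data Scientist"]
for name, vec in titles.items():
    print(f"{name}: {cosine_similarity(resume_title, vec):.3f}")
```

With well-trained embeddings, "Data Scientist" scores high against "Data Analyst" and low against unrelated titles, regardless of spelling.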
19. What is Word2Vec? Why?
“Word2vec is a two-layer neural net that
processes text. Its input is a text corpus
and its output is a set of vectors: feature
vectors for words in that corpus. While
Word2vec is not a deep neural network,
it turns text into a numerical form that
deep nets can understand.”
References: https://skymind.ai/wiki/word2vec, https://planetcalc.com/1721/
Data Scientist = [1, 0, 1, 0 … 1]
Data Analyst = [1, 1, 1, 0 … 1]
Data Engineer = [1, 0, 1, 0 … 1]
ML Engineer = [1, 1, 1, 0 … 1]
N-dimensional vector
How close/far are the titles
“Data Scientist” and “Data Analyst”?
String-based
(Levenshtein distance)
does not take into
account semantics. There
are many titles which are
similar/identical in
meaning but named very
differently
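The string-similarity shortcoming is easy to demonstrate, here with Python's stdlib difflib standing in for edit-distance measures:

```python
# String similarity rates "Data Scientist" and "Data Analyst" as close
# (shared spelling), but "Data Scientist" and "Machine Learning
# Engineer" as far apart even though the roles overlap semantically.
from difflib import SequenceMatcher

def string_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(string_sim("Data Scientist", "Data Analyst"))
print(string_sim("Data Scientist", "Machine Learning Engineer"))
```

Embeddings trained on job-posting text can recover the semantic closeness that string matching misses.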
20. Training/Inference
Word2Vec for Job Titles
Train on
historical job
postings data as
‘dictionary’
Word2Vec Model
(can use custom or pre-trained)
1. Training: preprocessed text corpus (documents)
2. Inference: how close/far is each job title from the others?
Use Cosine Similarity, K-Means clustering, etc.
[Full Stack Developer,
Software Engineer,
Backend Engineer,
Backend Developer]
[Data Scientist, Data
Engineer, Data Analyst,
Senior Data Scientist,
Business Intelligence
Manager]
[Graphic Designer,
Graphic Design Intern,
Graphic Design, UI/UX
Designer, Creative
Designer]
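The inference step (cosine similarity, K-Means clustering over title vectors) can be sketched with scikit-learn. The 2-D embeddings below are hypothetical stand-ins for trained Word2Vec title vectors:

```python
# Cluster job-title embeddings with K-Means, as in the inference step.
# The 2-D vectors are toy stand-ins for real Word2Vec embeddings.
import numpy as np
from sklearn.cluster import KMeans

titles = ["Data Scientist", "Data Engineer", "Data Analyst",
          "Full Stack Developer", "Software Engineer", "Backend Developer"]
vecs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # data roles
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # engineering roles
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vecs)
for title, label in zip(titles, km.labels_):
    print(label, title)
```

With good embeddings the clusters come out like the groups on the slide: data roles together, engineering roles together, design roles together.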
22. Score Table
Putting it all together
For each User-Job pair (how relevant is this User-Job pair?):
Job Title Score X1 | Skills Score X2 | (Other features) ... | Y(0,1)
...
This brings up another problem: how
do we assign weights to each of these
features?
I.e. how important is the job title
compared to skills and other
features when evaluating a
candidate?
23. The Answer:
Regression - using historical job applications data
Acceptance Yij of each (Useri, Jobj) match
will be based on the following:
Yij = X1W1 + X2W2 + … + ε
Resume (User) x Description (Job) score | Accepted (Yes/No)
X1W1 X2W2 (Other features) ... | 1 / 0
...
Wi can be filled using coefficients of
independent variables from regression
results
(1) Coefficients: weightage of how important each feature
(Title, skills, exp, etc.) - to compute final score
(2) R2: backtesting the model - a good measure of how well
the model performs (how much of the variation in job
application acceptances is explained by the independent
variables)
We have historical
user job applications
and acceptance data!
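A minimal sketch of learning the weights Wi from historical application outcomes, using ordinary least squares on toy data (the feature values are hypothetical, and a logistic regression would be the more natural fit for a 0/1 outcome):

```python
# Fit feature weights W from historical (feature-score, accepted) pairs
# with ordinary least squares. All numbers are toy/hypothetical; in
# practice X holds per-pair title/skills/experience similarity scores
# and y records whether the application was accepted.
import numpy as np

X = np.array([        # columns: [title score X1, skills score X2]
    [0.9, 0.8],
    [0.8, 0.9],
    [0.7, 0.7],
    [0.2, 0.3],
    [0.3, 0.2],
    [0.1, 0.2],
])
y = np.array([1, 1, 1, 0, 0, 0])  # accepted yes/no

W, *_ = np.linalg.lstsq(X, y, rcond=None)  # learned feature weights
scores = X @ W                             # final relevance scores
print("weights:", W)
```

The fitted coefficients play the role of the slide's Wi: they say how much the title score matters relative to the skills score when producing the final User-Job relevance.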
24. Business Logic
Post-Processing
E.g. recommendations for User ID 1:
Pair | User ID | Job ID | Final Score | Rank
1 | 1 | 1 | 0.9 | 1
2 | 1 | 2 | 0.8 | 2
3 | 1 | 3 | 0.7 | 3
4 | ... | ... | ... | ...
Re-ranking by other factors such as
- Date published of job posting
- Popularity
- Paid/Boosting
- Randomization
...
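The post-processing step can be sketched as a re-ranking that blends the model score with business factors such as freshness and boosting. The factor names and weights below are hypothetical:

```python
# Re-rank model recommendations with business rules: boosted (paid)
# postings get a multiplier, older postings decay. The weights and
# data are hypothetical illustrations.
jobs = [  # (job_id, model_score, days_since_published, is_boosted)
    (1, 0.9, 30, False),
    (2, 0.8, 2,  True),
    (3, 0.7, 1,  False),
]

def final_score(model_score, days_old, boosted,
                boost_factor=1.5, decay=0.01):
    freshness = max(0.0, 1.0 - decay * days_old)  # older -> lower
    score = model_score * freshness
    return score * boost_factor if boosted else score

ranked = sorted(jobs, key=lambda j: final_score(j[1], j[2], j[3]),
                reverse=True)
print([j[0] for j in ranked])
```

Here the boosted posting overtakes a higher raw model score, which is exactly the kind of business override the slide describes (randomization would be one more term in the scoring function).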
API/REST
Website
Scheduling ETLs, data pipelines
Scala ML Libraries, big data
processing
Prototyping, developing POC,
visualisation, evaluation
25. What I learned building a job recommender system
Practical Advice
If you’re a jobseeker... If you’re a jobseeker/employer...
Graphical CVs
Unless:
◎ You’re a graphic designer/UI/UX
◎ You’re certain that the CV will end up
in the hiring manager’s inbox
and not on a recruitment/jobs
platform
Fancy Job Titles (You know what I’m
talking about!)
Coding Ninja, Software Ninjaneer, Tax
Wrangler, Full Stack Magician, Retail
Jedi, Digital Overlord, Chief Everything
Officer …
Unless:
◎ No
26. Data Science
Final Notes
You have to be
passionate
● There is a lot to learn and to keep
up with constantly in this field
● Rapid advancement in data
science/AI. FOMO is real!
Leverage the great
community
● A unique feature of Data Science
practitioners/community -
competitive yet collaborative
● Reach out to experts in different
areas, mentors
● Personally benefited from DSSG
talks and workshops
FOMO: Fear of Missing Out
27. Thank you for your time :)
Any questions?
linkedin.com/in/yongsiang
ivan@techinasia.com
Speaker notes
Knowledge Discovery in Databases (KDD) Process
Developing an understanding of the application domain, the relevant prior knowledge, the goals of the end-user
Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
Data reduction and projection.
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the KDD process.
Data mining.
Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.
Interpreting mined patterns.
Consolidating discovered knowledge.
Challenges before embarking on the project
As a group of 5 year-3 students back then, we had no idea what data mining/data science was all about! (None of us had taken the module either.)
Imagine trying to build an application to teach a topic that you are not even familiar with… The challenge was to build an application to teach a topic we were not experienced in: learn and build concurrently (all while taking other courses and generally being a student).
As if that was not crazy enough, (illustrate with a timeline) we had to deploy the project iteratively as IS424 was being taught (students were our guinea pigs).
Grateful for having great sponsors to guide us along in terms of domain knowledge, weekly iterations/consultations
Start learning analytics from foundations
Context/overview before I conclude first part with takeaways
I don’t even know who is paying for the server
Backup slide in case website goes down
Key takeaways from the project
A ‘sneak preview’ into Machine Learning algorithms, sparked interest to continue pursuing data science
Built a strong foundation in simple supervised/unsupervised ML concepts such as classification (decision trees), regression, association, and evaluation metrics (accuracy, precision, confusion matrix)
Once you understand the basics, most advanced algorithms build upon that foundation. Do not skip the basics and jump straight into deep learning, neural networks, etc., because simple problems call for simple solutions (anything more is overkill)
You have to be genuinely passionate about data science to pursue it. A lot of hard work, perseverance and self-learning is crucial, and it never feels like enough
Proficiency in Python (data cleaning, wrangling) - the 80-20 split: most of the time goes into data cleaning
If you are ahead of time, your boss will add new features
Having the right attitude and mindset is important! You can do anything if you put your mind to it (or when your grades depend on it)!
You have to be passionate
There is a lot (really) to learn and keep up with constantly in this field. You might think you know a lot about an area but most of the time it's just the tip of the iceberg (but personally, that's what the excitement is all about)
If you are not genuinely passionate, you will lose interest quickly
The field advances very, very quickly with new publications in AI or new tools every now and then. FOMO is real. I don’t have a solution for that.
Make full use of helpful community
There will be a lot of times where you feel stuck in a particular area - reach out to people in the community, ask experts, mentors, whatever, just do whatever it takes! (Stackoverflow will be your best friend)
Despite being a very competitive field, people are very helpful and are mostly willing to share
Personally benefited a lot from attending DSSG talks every now and then (that’s why I’m here!)