Details
For September, DataScience SG is starting a new series specially for undergraduates. The series aims to showcase undergraduates' and fresh graduates' project work.
The series is meant to encourage youths to pursue careers in data science and artificial intelligence, and to let employers come in and recruit talent for their companies.
In this inaugural meetup for the series, we have the following speakers sharing their projects and how those projects helped them in their current careers.
DSSG strongly encourages current undergraduates and fresh graduates to join us in this series. It is still open to the general community!
Details:
Ivan is currently a Data Scientist at Tech In Asia (TIA), with experience in developing recommender systems, customer churn prediction, network analysis and driving BI solutions through data visualization and analytics. He graduated from SMU in 2018 with a Bachelor of Science (Information Systems) and a major in Marketing Analytics.
Ivan will share his Final Year Project from his undergraduate days at SMU, KDDLabs, a web-based data mining application, explaining the team's motivations, challenges and key takeaways. In addition, he will talk about his first data product at TIA: a recommender system that helps better connect jobseekers with employers and vice versa.
LinkedIn: https://www.linkedin.com/in/yongsiang/
FYP: http://smu.sg/kddlabs
2. About Ivan Tan (timeline)
- 2014: Enrolled into SMU (Information Systems, Marketing Analytics)
- 2014 to 2017: Internships, Research Assistant, personal projects (QA automation, web crawlers, Tableau dashboards)
- 2017: SMU FYP: KDD Labs, a web-based data mining application
- 2018: Graduation
- 2018: Data Scientist, Tech In Asia (job recommender systems, churn prediction, network analysis, BI solutions)
4. KDD Labs
A web-based analytics tool to
enable students to upload,
visualize and perform data mining
with no coding
PROJECT MOTIVATION
Used as a teaching tool in SMU’s IS424 Business
Analytics and Data Mining (2017)
Not meant to replace enterprise-grade or open-source
products like Python (sklearn, tensorflow, etc.) or SAS
Provides students who are new to the field with some
hands-on experience
Overcomes limitations such as installation and OS-specific
software by providing a web interface
5. What is KDD?
Knowledge Discovery in Databases (KDD)
1. Understanding of the application domain & goals
2. Creating a target data set: selecting or focusing on a
subset of variables, or samples
3. Data cleaning and preprocessing
4. Data reduction and projection. (transformation,
dimensionality reduction, so on) - Useful features
5. Choosing the data mining task.
○ Deciding whether the goal of the KDD process is
classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s)
○ Methods, models, parameters
7. Data mining - identifying patterns of interest
8. Analysing and interpreting patterns, consolidating
discovered knowledge.
Reference: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
7. KDD Labs
Key Functionalities
● User Management - Login/Logout, Project history
● Data Management - Upload, Sharing, data attributes
● Data Exploration - Summary Statistics, Data
Visualisations (graphs)
● Data Preprocessing - Normalisation, Missing values
● Data Classification - Decision Trees, SVM, Ensembles,
XGBoost, Train/Test split, K-Fold, Prediction
● Data Clustering - KMeans, Hierarchical
● Data Association Rules - Market basket analysis,
Graph visualisation
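The classification functionality above maps directly onto the scikit-learn primitives the application wrapped. A minimal sketch of the train/test split, decision tree and k-fold flow on toy data (this is not the actual KDD Labs code):

```python
# Minimal sketch of the classification workflow KDD Labs exposed:
# train/test split, a decision tree, and k-fold evaluation.
# Uses the built-in iris toy dataset, not KDD Labs data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)          # hold-out accuracy
cv_scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(f"test accuracy: {test_acc:.2f}, 5-fold mean: {cv_scores.mean():.2f}")
```

The web UI essentially collected these parameters (algorithm, split ratio, number of folds) from the user and passed them to calls like these.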
8. Project
Challenges
No data analytics
background when we
started
Learn data mining and
develop web application
concurrently
Deploy application to
production (live usage)
in the same semester
Timeline (May to Nov, Labs 1 to 4):
- May to Aug (summer break): design and develop basic functionalities (Login, Canvas, Preprocessing, Visualization)
- Aug to Nov (school term): iteratively push new features
- Nov: final presentations
10. Technologies, Architecture
Roles & Responsibilities
2x Frontend
1x Backend
1x Data / Backend
1x Project Manager
◎ Understand and translate lab requirements to
specifications
◎ Write functions to interface with scikit-learn,
pandas, matplotlib libraries
◎ Manage CRUD/storage of data files
Python-based web application for easy
integration with scikit-learn libraries
11. Demo
Disclaimer
◎ Project is no longer being maintained. Please
don’t report bugs to me :)
◎ If you do find a user account and password
lying around, do not upload your
personal/private data
Last Deployed: 2017
12. KDD Labs
Key Takeaways
Intro to data mining
Introduction to basic Machine
Learning algorithms
(supervised/unsupervised),
evaluation methods
Importance of basics
Understanding fundamental concepts
is key before going into
advanced ML (AI, deep learning,
etc.)
Technical skills
Python (pandas, numpy) for
data wrangling, visualisation
(matplotlib) and ML (scikit-learn)
Project Management
Prioritisation of tasks, time
management, etc. If you are ahead
of time, management will always
add more features
Right attitude &
mindset
You are only limited by your
imagination. You can set out to
achieve anything if you put your mind
to it!
(or when your grades depend on it 😄)
13. Jobs Recommender System
at Tech In Asia (TIA)
• Introduction
• Overview of Recommender Systems
• Modelling the data
• NLP with Word2Vec
• Regression
• Advice/Final Notes
About
Tech in Asia
Tech in Asia aims to build and serve
Asia’s tech and startup community.
Media bridges the information
gap, delivering insights and
news about the Asian tech
industry
Jobs bridges the talent gap,
linking Asian tech talent to the
right roles and employers
Studios is our newest team,
focused on bridging the
connection gap between
companies and their Asian tech
audience
14. Impact of Recommender Systems
35% of revenue from recommendations
30% of overall listening from recommendations
75% of content views from recommendations
Reference: https://digital.hbs.edu/platform-rctom/submission/discover-weekly-how-spotify-is-changing-the-way-we-consume-music/,
https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
15. What do recommender systems achieve at TIA Jobs?
Jobseekers - Discovery: displaying recommended jobs to users based on their profile (job title, skillset, years of experience, etc.)
Employers - Filtering: helping employers sift through hundreds of applications by prioritising relevant candidates
Creating a better match between the 2 parties
16. Overview of Recommender Systems
First step: conduct literature research
Content-based
● Uses item content for recommendation purposes - text processing techniques, semantic information, etc. (item features)
Collaborative filtering
● Uses the interactions of users with items - historical data such as clicks, purchases, ratings, etc. I.e. "You may like this because users who bought this also bought..."
Hybrid
● Combination of various approaches
More readings here:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.1036&rep=rep1&type=pdf
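As a toy illustration of the collaborative-filtering idea ("users who bought this also bought"), item-to-item co-occurrence counting over hypothetical purchase histories (note this is not TIA's approach, which is content-based):

```python
# Toy item-to-item collaborative filtering: recommend the items that
# co-occur most often with a given item across user histories.
# The histories are hypothetical example data.
from collections import Counter
from itertools import permutations

histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"laptop", "monitor"},
    {"mouse", "keyboard"},
]

co_counts = {}  # item -> Counter of items seen together with it
for items in histories:
    for a, b in permutations(items, 2):
        co_counts.setdefault(a, Counter())[b] += 1

def also_bought(item, k=2):
    """Top-k items most often co-occurring with `item`."""
    return [i for i, _ in co_counts[item].most_common(k)]

print(also_bought("laptop"))
```

Real systems replace raw counts with normalised similarity (e.g. cosine over user-item vectors), but the "co-interaction" signal is the same.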
17. Modelling the data
Our approach: content-based/knowledge-based
Job-seekers - Resume + profile info (semi-structured data via 3rd-party OCR / entity extraction):
- Current Job Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Employers - Job postings:
- Job Posting Title
- Skills []
- Years of Experience
- Summary
- Industry
...
Goal: creating a 'similarity score' between the 2 sets of features
18. Score Table
Natural Language Processing with Word2Vec
ID | Resume (User) | Job Posting | Scoring
1 | Title | Title | Word2Vec (Job Description) + Cosine Similarity
2 | Years of Exp | Years of Exp | ...
3 | Skills (Array) | Skills (Array) | ...
4 | ... | ... | ...
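A minimal sketch of the title-scoring row above: embed each title as a vector and score pairs with cosine similarity. The vectors here are toy stand-ins; in practice they would come from a Word2Vec model (e.g. by averaging the word vectors of each title):

```python
# Cosine similarity between title embeddings. The 3-d vectors below
# are hypothetical stand-ins for real Word2Vec output.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

titles = {  # hypothetical embeddings
    "Data Scientist": np.array([0.9, 0.8, 0.1]),
    "Data Analyst":   np.array([0.8, 0.9, 0.2]),
    "Retail Jedi":    np.array([0.1, 0.2, 0.9]),
}

resume_title = titles["Data Scientist"]
for name, vec in titles.items():
    print(f"{name}: {cosine_similarity(resume_title, vec):.3f}")
```

With well-trained embeddings, "Data Scientist" scores high against "Data Analyst" and low against unrelated titles, regardless of spelling.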
19. What is Word2Vec? Why?
“Word2vec is a two-layer neural net that
processes text. Its input is a text corpus
and its output is a set of vectors: feature
vectors for words in that corpus. While
Word2vec is not a deep neural network,
it turns text into a numerical form that
deep nets can understand.”
References: https://skymind.ai/wiki/word2vec, https://planetcalc.com/1721/
Data Scientist = [1, 0, 1, 0 … 1]
Data Analyst = [1, 1, 1, 0 … 1]
Data Engineer = [1, 0, 1, 0 … 1]
ML Engineer = [1, 1, 1, 0 … 1]
N-dimensional vector
How close/far are the titles
“Data Scientist” and “Data Analyst”?
String-based
(Levenshtein distance)
does not take into
account semantics. There
are many titles which are
similar/identical in
meaning but named very
differently
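The string-similarity shortcoming is easy to demonstrate, here with Python's stdlib difflib standing in for edit-distance measures:

```python
# String similarity rates "Data Scientist" and "Data Analyst" as close
# (shared spelling), but "Data Scientist" and "Machine Learning
# Engineer" as far apart even though the roles overlap semantically.
from difflib import SequenceMatcher

def string_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(string_sim("Data Scientist", "Data Analyst"))
print(string_sim("Data Scientist", "Machine Learning Engineer"))
```

Embeddings trained on job-posting text can recover the semantic closeness that string matching misses.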
20. Training/Inference
Word2Vec for Job Titles
Train on
historical job
postings data as
‘dictionary’
Word2Vec Model
(can use custom or pre-trained)
1. Training: preprocessed text corpus (documents)
2. Inference: how close/far is each job title from the others?
Use Cosine Similarity, K-Means clustering, etc.
[Full Stack Developer,
Software Engineer,
Backend Engineer,
Backend Developer]
[Data Scientist, Data
Engineer, Data Analyst,
Senior Data Scientist,
Business Intelligence
Manager]
[Graphic Designer,
Graphic Design Intern,
Graphic Design, UI/UX
Designer, Creative
Designer]
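The inference step (cosine similarity, K-Means clustering over title vectors) can be sketched with scikit-learn. The 2-D embeddings below are hypothetical stand-ins for trained Word2Vec title vectors:

```python
# Cluster job-title embeddings with K-Means, as in the inference step.
# The 2-D vectors are toy stand-ins for real Word2Vec embeddings.
import numpy as np
from sklearn.cluster import KMeans

titles = ["Data Scientist", "Data Engineer", "Data Analyst",
          "Full Stack Developer", "Software Engineer", "Backend Developer"]
vecs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # data roles
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # engineering roles
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vecs)
for title, label in zip(titles, km.labels_):
    print(label, title)
```

With good embeddings the clusters come out like the groups on the slide: data roles together, engineering roles together, design roles together.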
22. Score Table
Putting it all together
For each User-Job pair (how relevant is this User-Job pair?):
Job Title Score X1 | Skills Score X2 | (Other features) ... | Y(0,1)
...
This brings up another problem: how
do we assign weights to each of these
features?
I.e. how important is the job title
compared to skills and other
features when evaluating a
candidate?
23. The Answer:
Regression - using historical job applications data
Acceptance Yij of each (Useri, Jobj) match
will be based on the following:
Yij = X1W1 + X2W2 + … + ε
Resume (User) x Description (Job) score | Accepted (Yes/No)
X1W1 X2W2 (Other features) ... | 1 / 0
...
Wi can be filled using coefficients of
independent variables from regression
results
(1) Coefficients: weightage of how important each feature
(Title, skills, exp, etc.) - to compute final score
(2) R2: backtesting the model - a good measure of how well
the model performs (how much of the variation in job
application acceptances is explained by the independent
variables)
We have historical
user job applications
and acceptance data!
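A minimal sketch of learning the weights Wi from historical application outcomes, using ordinary least squares on toy data (the feature values are hypothetical, and a logistic regression would be the more natural fit for a 0/1 outcome):

```python
# Fit feature weights W from historical (feature-score, accepted) pairs
# with ordinary least squares. All numbers are toy/hypothetical; in
# practice X holds per-pair title/skills/experience similarity scores
# and y records whether the application was accepted.
import numpy as np

X = np.array([        # columns: [title score X1, skills score X2]
    [0.9, 0.8],
    [0.8, 0.9],
    [0.7, 0.7],
    [0.2, 0.3],
    [0.3, 0.2],
    [0.1, 0.2],
])
y = np.array([1, 1, 1, 0, 0, 0])  # accepted yes/no

W, *_ = np.linalg.lstsq(X, y, rcond=None)  # learned feature weights
scores = X @ W                             # final relevance scores
print("weights:", W)
```

The fitted coefficients play the role of the slide's Wi: they say how much the title score matters relative to the skills score when producing the final User-Job relevance.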
24. Business Logic
Post-Processing
E.g. recommendations for User ID 1:
Pair | User ID | Job ID | Final Score | Rank
1 | 1 | 1 | 0.9 | 1
2 | 1 | 2 | 0.8 | 2
3 | 1 | 3 | 0.7 | 3
4 | ... | ... | ... | ...
Re-ranking by other factors such as
- Date published of job posting
- Popularity
- Paid/Boosting
- Randomization
...
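The post-processing step can be sketched as a re-ranking that blends the model score with business factors such as freshness and boosting. The factor names and weights below are hypothetical:

```python
# Re-rank model recommendations with business rules: boosted (paid)
# postings get a multiplier, older postings decay. The weights and
# data are hypothetical illustrations.
jobs = [  # (job_id, model_score, days_since_published, is_boosted)
    (1, 0.9, 30, False),
    (2, 0.8, 2,  True),
    (3, 0.7, 1,  False),
]

def final_score(model_score, days_old, boosted,
                boost_factor=1.5, decay=0.01):
    freshness = max(0.0, 1.0 - decay * days_old)  # older -> lower
    score = model_score * freshness
    return score * boost_factor if boosted else score

ranked = sorted(jobs, key=lambda j: final_score(j[1], j[2], j[3]),
                reverse=True)
print([j[0] for j in ranked])
```

Here the boosted posting overtakes a higher raw model score, which is exactly the kind of business override the slide describes (randomization would be one more term in the scoring function).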
API/REST
Website
Scheduling ETLs, data pipelines
Scala ML Libraries, big data
processing
Prototyping, developing POC,
visualisation, evaluation
25. What I learned building a job recommender system
Practical Advice
If you’re a jobseeker... If you’re a jobseeker/employer...
Graphical CVs
Unless:
◎ You’re a graphic designer/UI/UX
◎ You’re certain that the CV will end up
in the hiring manager’s inbox
and not on a recruitment/jobs
platform
Fancy Job Titles (You know what I’m
talking about!)
Coding Ninja, Software Ninjaneer, Tax
Wrangler, Full Stack Magician, Retail
Jedi, Digital Overlord, Chief Everything
Officer …
Unless:
◎ No
26. Data Science
Final Notes
You have to be
passionate
● There is a lot to learn and to keep
up with constantly in this field
● Rapid advancement in data
science/AI. FOMO is real!
Leverage the great
community
● A unique feature of Data Science
practitioners/community -
competitive yet collaborative
● Reach out to experts in different
areas, mentors
● Personally benefited from DSSG
talks and workshops
FOMO: Fear of Missing Out
27. Thank you for your time :)
Any questions?
linkedin.com/in/yongsiang
ivan@techinasia.com
Speaker notes
Knowledge Discovery in Databases (KDD) Process
Developing an understanding of the application domain, the relevant prior knowledge, the goals of the end-user
Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
Data reduction and projection.
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the KDD process.
Data mining.
Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.
Interpreting mined patterns.
Consolidating discovered knowledge.
Challenges before embarking on the project
As a group of 5 year-3 students back then, we had no idea what data mining/data science was all about! (None of us had taken the module either.)
Imagine trying to build an application to teach a topic that you are not even familiar with… The challenge was to build an application to teach a topic we were not experienced in: learn and build concurrently (all while taking other courses and generally being a student).
As if that was not crazy enough, (illustrate with a timeline) we had to deploy the project iteratively as IS424 was being taught (students were our guinea pigs).
Grateful for having great sponsors to guide us along in terms of domain knowledge, weekly iterations/consultations
Start learning analytics from foundations
Context/overview before I conclude first part with takeaways
I don’t even know who is paying for the server
Backup slide in case website goes down
Key takeaways from the project
A ‘sneak preview’ into Machine Learning algorithms, sparked interest to continue pursuing data science
Built a strong foundation in simple supervised/unsupervised ML concepts such as classification (decision trees), regression, association, and evaluation metrics (accuracy, precision, confusion matrix)
Once you understand the basics, most advanced algorithms build upon that foundation. Do not skip the basics and jump straight into deep learning, neural networks, etc., because simple problems call for simple solutions (anything more is overkill)
You have to be genuinely passionate about data science to pursue it. A lot of hard work, perseverance and self-learning is crucial, and it never feels like enough
Proficiency in Python (data cleaning, wrangling) - the 80-20 split: most of the time goes into data cleaning
If you are ahead of time, your boss will add new features
Having the right attitude and mindset is important! You can do anything if you put your mind to it (or when your grades depend on it)!
You have to be passionate
There is a lot (really) to learn and keep up with constantly in this field. You might think you know a lot about an area but most of the time it's just the tip of the iceberg (but personally, that's what the excitement is all about)
If you are not genuinely passionate, you will lose interest quickly
The field advances very, very quickly with new publications in AI or new tools every now and then. FOMO is real. I don’t have a solution for that.
Make full use of helpful community
There will be a lot of times where you feel stuck in a particular area - reach out to people in the community, ask experts, mentors, whatever, just do whatever it takes! (Stackoverflow will be your best friend)
Despite being a very competitive field, people are very helpful and are mostly willing to share
Personally benefited a lot from attending DSSG talks every now and then (that’s why I’m here!)