2. whoami
• Greek
• 7 years as a researcher
  – High performance computing, network security, social network analysis
• Specific role: between data scientists and engineers
3. My first experience with data science
• EGEE pan-European grid cluster, 2002
• Thousands of analytics jobs from CERN labs
• MPI jobs
• Power of around 10,000 CPUs
• My first submitted jobs were a particle simulation and a parallel version of Conway’s Game of Life
4. The importance of data science
Source: IBM analytics, http://www-935.ibm.com/services/us/gbs/thoughtleadership/ninelevers/
5. The problem of “unicorn” data scientists
Statistical analysis
- Math
- Data Mining
- Machine Learning
- Graph mining
- Data Visualization
Computer Science
- Advanced/High performance computing
- Visualization
Database
- Data engineering
- Data warehousing
Domain expertise
- Finance
- Advertising
- Physics
6. Top daily activities
• Data cleaning (painful)
• Data processing (boring)
• Data modeling (starting to get fun)
• Statistical analysis, machine learning, data mining (yeaaahhh)
• Visualization (exciting)
• Report (back to painful stuff)
7. From data to actions
End users
Teams
Actions
Insights
Summaries and aggregations
Data Foundation
Data sources
8. Data sources - Data engineers
• Most data sources encountered contain at least one of:
  – Unclean data (for example, inconsistent formats)
  – Incomplete data (sampling)
  – Noise
• Data engineers capture, process, and store data sources
• Hadoop, MapReduce, HBase, Cassandra, Python scripts
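The "inconsistent formats" problem can be illustrated with a minimal Python sketch. The candidate date formats, the sample values, and the helper name normalize_date are all assumptions invented for illustration; they do not come from the talk.

```python
from datetime import datetime

# Hypothetical formats that a single messy source might mix together.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def normalize_date(raw):
    """Return the date as ISO 8601 text, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # unparseable: flag it as unclean instead of guessing


# Invented sample values: three spellings of one date plus a junk entry.
raw_dates = ["2014-03-01", "01/03/2014", " 1 Mar 2014 ", "n/a"]
cleaned = [normalize_date(d) for d in raw_dates]
```

Returning None for unparseable values keeps the unclean rows visible, so they can be counted and reported rather than silently dropped.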
9. Data Foundation
• The basic foundation where all data and analytic results are stored
• Combined scientific and engineering effort
• Heavy data modeling driven by analytics requirements
• A good foundation means less time spent retrieving and querying data
• Summaries and aggregations are helpful for large-scale data
• If there is no data foundation, spend your initial effort on building one
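As a rough illustration of the summaries and aggregations mentioned above, the sketch below collapses raw records into one row per caller. The record layout, the sample values, and the function name summarize_calls are hypothetical; a real data foundation would more likely do this in a database or on a cluster.

```python
from collections import defaultdict

# Hypothetical call records: (caller_id, duration_in_seconds).
call_records = [
    ("alice", 120),
    ("bob", 30),
    ("alice", 45),
    ("carol", 300),
    ("bob", 90),
]


def summarize_calls(records):
    """Aggregate raw records into one summary row per caller."""
    summary = defaultdict(lambda: {"calls": 0, "total_seconds": 0})
    for caller, seconds in records:
        summary[caller]["calls"] += 1
        summary[caller]["total_seconds"] += seconds
    return dict(summary)


per_caller = summarize_calls(call_records)
```

Analyses that only need per-caller totals can then read the small summary instead of rescanning every raw record.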
10. Validation
• Critical part of the analytics process
• Validating against the ground truth is not always feasible
• Finding representative training sets is hard
• Open source and social network data sometimes help with validation
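When some ground truth is available, one common way to validate predictions is to compute precision and recall against it. The sketch below is only an illustration of that idea, not the method used in the talk; the subscriber IDs and the function name precision_recall are made up.

```python
def precision_recall(predicted, actual):
    """Compare a set of predicted positives against the ground truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: subscribers flagged as likely churners
# versus the subscribers who actually cancelled.
predicted_churners = {"s1", "s2", "s3", "s4"}
actual_churners = {"s2", "s3", "s5"}
precision, recall = precision_recall(predicted_churners, actual_churners)
```

Precision answers "how many flagged subscribers really left", while recall answers "how many of those who left were flagged".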
11. Engineering side
• A good data scientist needs to have a good engineering side
• Not expert level; enough to reach the prototyping stage
• Big teams have engineers side by side with data scientists
  – Engineers gain the domain expertise
  – Data scientists acquire engineering skills to facilitate the handover of their analytics processes
• Which brings us to the question: what tools/languages/skills/methodologies should I learn?
12. Data Scientist Toolkit
• R, Python, Java
• Hadoop, HDFS, MapReduce, Spark
• HBase, Pig, Hive, Impala
• SQL, RDBMS
• SciPy, NumPy, scikit-learn
• D3.js, Tableau, Gephi
• SAS, Matlab, SPSS
• NoSQL, MongoDB, Cassandra
• Neo4j, FlockDB
• MS-Excel
Which tools should I learn? As many as you can
Bold: my skillsets
13. But I know only R, will I have a hard time?
• Tricky question
• The window of opportunity for pure analysts is getting smaller
  – Company-specific statement
• Even when paired with an engineer, knowledge transfer is hard if you stick stubbornly to one toolkit/technology/methodology
• The churn analysis example
14. Churning
• Apart from regular contract termination, customers leave the provider early
• Churn analysis tries to identify and quantify the reasons behind churning
• Variables for investigation
  – Call quality (calls being dropped)
  – Network coverage (poor 3G/4G quality where I live)
  – Prices and bundles
  – My friends left the provider
• Country- and culture-specific problem
15. Churn analysis
• Billions of call and SMS records
• Millions of subscribers
• Thousands of contract cancellations (5-10% of total subscribers)
• Subscribers interact with a very small number of people (fewer than 5)
• Insight: canceling customers are 7x more likely to be linked (country: US)
• Action: identify churners’ social groups and take action to prevent them from leaving
[Diagram: CDR database, Data, Insights]
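One possible way to quantify a "linked churners" effect is to compare the churn rate of subscribers who are connected to a churner with the rate among everyone else. The sketch below does this on a tiny invented call graph; the subscriber IDs, the links, and the helper names are hypothetical, and the toy numbers are not meant to reproduce the 7x figure from the slide.

```python
from collections import defaultdict

# Hypothetical call graph: each pair is two subscribers who call each other.
links = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g"), ("h", "i")]
subscribers = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
churners = {"a", "b", "h"}


def build_neighbors(pairs):
    """Turn an undirected edge list into an adjacency mapping."""
    neighbors = defaultdict(set)
    for u, v in pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def churn_rate(group, churned):
    """Fraction of the group that churned (0.0 for an empty group)."""
    return len(group & churned) / len(group) if group else 0.0


neighbors = build_neighbors(links)
# Subscribers with at least one direct contact who churned.
linked = {s for s in subscribers if neighbors.get(s, set()) & churners}
unlinked = subscribers - linked

rate_linked = churn_rate(linked, churners)
rate_unlinked = churn_rate(unlinked, churners)
# Lift: how much more often linked subscribers churn than unlinked ones.
lift = rate_linked / rate_unlinked if rate_unlinked else float("inf")
```

On real call detail records the graph would be far too large for in-memory sets like these, which is where the engineering tools listed earlier come in.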
16. Domain expertise
• Opinions differ on whether data scientists should have domain expertise
• Domain expertise vs machine learning
• Opinions so far are divided
• Cases where non-experts outperform experts
• No point in worrying: most data scientists who join large companies do not have domain expertise
17. The importance of visualization
• All performed analyses should be accompanied by the appropriate visualization
• Do not get stuck on Excel / matplotlib graphs
• Add infographics, custom heatmaps, and Google Maps to your skill arsenal
18. Visualization leads to great insights
• Understanding data through visualization
• Data scientists with expert visualization skills are rare
• Relying on professional UI/UX experts is not always the solution for data products
• Examples: spatial and SNA graph representation
19. Do not stand isolated from the business owners
• Use cases define the requirements of what you are trying to solve
• Isolation from use cases leads to generic models that do not fit real-life problems
• Sales people are paired with data scientists to address customer needs
• Data scientists can answer all the hard questions around data!
• Cases where top sales people were data scientists or engineers
• Data scientists can even become CEOs of leading companies!
20. Sense of privacy
• Environments like telcos and social network companies deal with private and sensitive data
• Companies enforce security and privacy measures to prevent data leakage
• Dealing with massive amounts of data requires a great sense of responsibility
• Confidentiality protection ensures that specific individuals are not pinpointed
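One simple measure in this spirit is to replace direct identifiers with keyed hashes before analysts see the data. The sketch below uses HMAC-SHA256 from the Python standard library; the key, the phone numbers, and the function name pseudonymize are placeholders invented for illustration. Pseudonymization like this is only one layer and does not by itself guarantee that individuals cannot be re-identified.

```python
import hashlib
import hmac

# Assumption: in practice the key lives in a secret store, not in source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(subscriber_id, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(key, subscriber_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Invented records: (phone_number, call_duration_in_seconds).
records = [
    ("+30-210-0000001", 120),
    ("+30-210-0000002", 45),
    ("+30-210-0000001", 30),
]
masked = [(pseudonymize(number), seconds) for number, seconds in records]
```

Because the same number always maps to the same pseudonym, records can still be joined and aggregated per subscriber without exposing the number itself.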
Demystify the “Big Data” role: “I know Java” vs “I know programming” paradigm
GE has 600 data scientists
Alternative: Drew Conway’s Venn diagram (hacking skills, math & statistics, substantive expertise)
Graph databases are exciting
The churn analysis shown before was done by a non-expert
Data artisans: employees who possess a blend of technical skills and business acumen that enables them to extract actionable insight from huge volumes of data, despite their lack of experience with it. This demonstrates that businesses don’t always need a data scientist to interpret data effectively