Cloudera
Data Science
Challenge
Presentation by @dougneedham
Cloudera: Certified Data Scientist
 This is the goal. What are the requirements?
 Requirement 1. DS-200 Test.
 Requirement 2. Data Science Challenge.
 http://www.cloudera.com/content/cloudera/en/training/certification/ccp-ds/challenge/challenge3.html
DS-200 Test
 Here are the sections of the exam:
 Data Acquisition
 Data Evaluation
 Data Transformation
 Machine Learning Basics
 Clustering
 Classification
 Collaborative Filtering
 Model/Feature Selection
 Probability
 Visualization
 Optimization
Data Science Challenge Itself
 The resources needed for the challenge.
 A Cluster:
 7 nodes (1 NameNode, 6 DataNodes).
 1 Node Cloudera Manager
 1 Node Cloudera Director
 Cloudera Director requires a particular AMI for the East Coast AWS region
(ami-3218595b, RHEL 6.4 x86_64). It took a bit of time to get this right. The
version I used is highly dependent on RHEL 6.4.
Cluster Management
 Cloudera Director made it easy to create the cluster once I used the
proper AMI. As noted previously, it only really works well with RHEL 6.4.
 Cloudera Manager allowed for management and monitoring.
 Demo here:
 Restarting the cluster is non-trivial.
 Clusters are meant to stay up. A bit of verification work was needed
when the cluster was shut down over the holidays.
The problems
 The challenge is made up of 3 problems. Each one could be a larger
effort in and of itself to solve.
 The data had to be transformed in order to process it.
 The sophisticated portion of the challenge was the actual processing of
the data: machine learning, graph analysis, statistical confidence,
etc.
 Then the output needed to be tweaked a bit in order to conform to the
deliverable specification.
Problem 1
 Flight Delays.
 SmartFly’s business is providing its customers with timely travel
information and notifications about flights, hotels, destination weather,
traffic getting to the airport, and anything else that can help make the
travel experience smoother. Their product team has come up with the
idea of using the flight data that they have been collecting to predict
whether customers’ flights will be delayed so that they can respond
proactively. They’ve now contacted you to help them test out the
viability of the idea.
 From a given set of historical flights, create an ordered list based on the
probability that a given set of future scheduled flights will be delayed or
not.
Problem 2
 Web site log analytics.
 Congratulations! You have just published your first book on data
science, advanced analytics, and predictive modeling. You’ve
decided to use your skills as a data scientist to create and optimize a
website promoting your book, and you have started several ad
campaigns on a popular search engine in order to drive traffic to your
site.
 Provide statistics about a web-site where your new book is featured.
Problem 3
 Who should Follow whom?
 Winklr is a curiously popular social network for fans of the sitcom Happy
Days. Users can post photos, write messages, and most importantly,
follow each other’s posts and content. This helps users keep up with
new content from their favorite users on the site.
Rules
 Individual Contributions Only
You must participate in this challenge only on an individual basis; teams
are not permitted.
 Sharing
Any sharing of code or solutions or collaboration with another person or
entity is strictly forbidden.
 Tools
You may use any tools or software you desire to complete the
challenge.
 Prerequisites
You must have successfully passed Data Science Essentials (DS–200)
Deliverables
 Problem1 – Ordered list of flights
 Problem 2 – JSON file, populated from a Python dictionary, containing specific
answers to questions.
 Problem3 – Top 70,000 connections that should be recommended.
 In addition to the deliverables stated above for each problem, you must provide a solution
abstract and the complete set of source code used to solve the challenge problems, as
described below.
 Solution Abstract
 The solution abstract should be a brief write-up in PDF format that addresses the following
points:
 For each part you needed to do this:
 Explain your methodology including approach, assumptions, software and algorithms used, testing
and validation techniques applied, model selection criteria, and total time spent.
 Please include in your solution abstract any information that can be used to understand the
logic behind your approach and all steps taken, including data preparation, modeling,
validation, analysis, visualization, etc. The solution abstract should typically be 3 to 5 pages
and no more than 6 pages.
 Complete Source Code
 Tarball or zip file of all source code used to complete the challenge, including programs,
scripts, and other artifacts.
 My GitHub repository is linked at the end of this presentation.
Scoring
 Submission Scoring
 Submissions will be scored as follows. Each problem part will be scored
independently. The score for each part will be a composite of the
percentage correct for all submitted solutions for that part and the score
assigned to the corresponding section of the solution abstract. The scores
for the three parts will be weighted and combined into a final composite
score.
 The percentage correct for each part will be scored against a golden
master of known correct answers. Note that some questions may have
more than one correct answer, and partial credit may be awarded.
 The solution abstract will be scored according to objective criteria about
your approach and general mastery of the tools and techniques. Writing
quality and formatting will not contribute to the score, except in cases
where the writing is so poor as to impact understanding.
Did anyone notice anything?
 Each of the 3 problems highlights a very different aspect of Data
Science.
 Machine Learning
 Statistical Analysis
 Graph Analysis
 Each of these is an area that people specialize in. I, for one,
intend to dive deep into Graph Analysis. It is quite interesting from what I
have seen so far.
 The code has to be straightforward. While it is not clear they will do so for
this particular challenge, in at least one of the prior challenges the code was run
independently as part of the Cloudera grading process.
Bringing us to the question: What is
a Data Scientist?
http://nirvacana.com/thoughts/becoming-a-data-scientist/
If you do the challenge, are you a purple
squirrel?
Who is a Data Scientist?
 Who here has seen the Indiana Jones movies?
 Marcus Brody and Henry “Indiana” Jones were both archeologists.
 Both lectured and taught archeology.
 Both understood the tenets of Archeology
 Both knew what finding an artifact means.
 Both could speak intelligently about the significance of any finds associated with the
search.
 Only Indiana went on the “quest” – Why?
 Those “intangible” skills of being able not only to survive, but to thrive in a chaotic
environment played to “Indy’s” strengths.
 https://www.youtube.com/watch?v=PgfpIV29Ccc
 There are many types of data scientist:
http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists
 I think the environment affects the success of a data scientist.
From Data Science Central
Opinion time
 This challenge has reminded me of some of my first efforts building a data
warehouse, where I had to build the pipeline of data from our source
systems into our Operational Data Store, and out to our Star Schema for
daily reports.
 It has many of the earmarks of a real-world project.
 The biggest difference between this challenge and a “real world problem”
is that there are known solutions, and someone knows those solutions.
 In “the real world”, we have things like user acceptance testing.
This is a more opinionated way of judging success or failure than the
objective measure of: is the data in the answer set?
Doug’s Problem Solving approach
 This is the approach I took; it may or may not be useful for others to apply.
 Analysis. I started with some basic numbers, just browsing through the data with the “Data
Science at the Command Line” toolkit. This is very handy for getting a feel for things.
 Based on some general understanding this analysis provided, create a “pipeline”
 Generally the data has to be transformed to a usable structure for the particular method of solving
the problem.
 Do some basics with the problem solving method, Stats, ML, Graph, etc…
 Get some data back out of that tool, then format output to specification.
 Iterate.
 I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to
2, back to 3.
 This method allowed me to give myself some “space”, and actually look at each problem
with fresh eyes on more than one occasion.
 Breaking each problem down into the basics of Input, Process, Output allowed me to have
“working” code for each problem really quickly. Then, through tuning, analysis, research, and some
time to think about the problem, I was able to come up with each unique solution.
 It also allows me to refactor the code, having given each problem time to “rest”.
 Very much like a painting, broad strokes first, details emerge as the painting progresses.
 Another benefit is, if I am able to get the data all the way through the pipeline, it becomes obvious
where the performance bottlenecks are for the pipeline.
 This method does take a bit of time.
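As a rough illustration of the Input, Process, Output breakdown described above, here is a minimal Python skeleton. It is a sketch only: the column names, the sample data, and the function names `read_input`, `process`, and `write_output` are hypothetical and do not come from the challenge code. The point is that each stage is a separate function, so any one of them can be refined on a later pass without touching the others.

```python
import csv
import io


def read_input(text):
    """Input: parse raw comma-separated text into rows (dicts keyed by header)."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Process: placeholder analysis; here, count rows per value of 'origin'."""
    counts = {}
    for row in rows:
        counts[row["origin"]] = counts.get(row["origin"], 0) + 1
    return counts


def write_output(counts):
    """Output: format results to a made-up specification, one 'key,count' line each."""
    return "\n".join(f"{key},{counts[key]}" for key in sorted(counts))


# Hypothetical sample data standing in for a real input file.
raw = "origin,dest\nCVG,ATL\nCVG,ORD\nATL,CVG\n"
result = write_output(process(read_input(raw)))
print(result)
```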
Solution 1
 Type of problem: Machine Learning
 Use Python to format the data.
 Create an individual set of files based on point of origin (Departing
airport)
 Use these individual files to create a model using Spark MLLib per
airport.
 Run the scheduled flights through the model, then multiply the score of the
model (the area under the ROC curve, a measure of how well the model separates
delayed from on-time flights) by the output of the prediction (which is either 1 or 0).
 This weights each predicted delay by the quality of the model that produced it,
so the ordered list reflects how much confidence we can place in each prediction.
 Code problem1.sh, and PredictFlights.scala
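The scoring step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in PredictFlights.scala: the flight records, airport codes, and AUC values are invented, and `split_by_origin` and `rank_flights` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical scheduled flights: (flight id, origin airport, prediction 1=delayed / 0=on time).
scheduled = [
    ("F1", "CVG", 1),
    ("F2", "CVG", 0),
    ("F3", "ATL", 1),
    ("F4", "ORD", 1),
]

# Hypothetical area under the ROC curve for each per-airport model.
auc_by_airport = {"CVG": 0.71, "ATL": 0.83, "ORD": 0.64}


def split_by_origin(flights):
    """Group flights by departing airport, mirroring the one-file-per-airport step."""
    groups = defaultdict(list)
    for flight_id, origin, prediction in flights:
        groups[origin].append((flight_id, prediction))
    return groups


def rank_flights(flights, auc):
    """Score each flight as AUC(origin model) * prediction, highest score first."""
    scored = []
    for origin, rows in split_by_origin(flights).items():
        for flight_id, prediction in rows:
            scored.append((flight_id, auc[origin] * prediction))
    # Sort by score descending; the flight id breaks ties deterministically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranking = rank_flights(scheduled, auc_by_airport)
for flight_id, score in ranking:
    print(flight_id, round(score, 2))
```

Flights predicted as on time all score 0 and end up tied at the bottom of the list, so a real submission would presumably need a finer-grained signal to order them.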
What the heck is a ROC?
 This comes from: http://gim.unmc.edu/dxtests/roc3.htm
 There are four possible outcomes when a predictive model is checked against known data:
 True Positive
 True Negative
 False Positive
 False Negative.
 The higher the area under the Yellow line, the better.
 This is used for model validation to ensure that the model is making accurate
predictions against known data.
 http://en.wikipedia.org/wiki/Receiver_operating_characteristic
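To make the four outcomes and the area under the curve concrete, here is a small Python sketch. The labels and scores are made up, and the function names are hypothetical. It assumes both classes are present in the data (otherwise the rates would divide by zero), and it uses the trapezoidal rule, which is one common way to compute the area.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical known outcomes (1 = delayed) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
print(round(auc, 4))
```

An area of 1.0 means every positive example is scored above every negative one; 0.5 is what random scoring would give.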
Solution 2
 Type of problem: Statistical reporting
 Python Streaming, so lots of MapReduce code. This could probably be
replaced with Spark; at the time, this path seemed to be the most
straightforward.
 A bit of analysis.
 Collect some numbers: statistically significant numbers, that is.
 Format the data in JSON.
 Code: Problem2.sh
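The map, reduce, and JSON steps above can be sketched as follows. This is not the code in Problem2.sh: the log format (a tab-separated line whose first field is an experiment id), the sample lines, and the key name `lines_per_experiment` are assumptions made for illustration. In a real streaming job the mapper and reducer would read standard input and write standard output; here they are plain generators so the flow can be run in one process.

```python
import json
from itertools import groupby


def mapper(lines):
    """Emit (experiment, 1) per log line; assumes the experiment id is the first tab-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1


def reducer(pairs):
    """Sum the counts per key; pairs must arrive sorted by key."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


# Hypothetical log lines: experiment id, then an action.
log_lines = [
    "exp1\tvisit\n",
    "exp2\tvisit\n",
    "exp1\tsignup\n",
]

# sorted() stands in for the shuffle-and-sort phase between map and reduce.
counts = dict(reducer(sorted(mapper(log_lines))))

# Collect the answers in a Python dictionary, then serialize it as JSON.
answers = {"lines_per_experiment": counts}
print(json.dumps(answers, sort_keys=True))
```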
What makes this data science?
 Isn’t this the same thing as business analysis?
 What makes the difference is the latter part of questions 4 and 5.
 Here they are:
 Question 4: “How many full days of data, starting from the first day, are required to
determine that the newsletter signup rate for experiment one is better than
experiment two at the 99% confidence level?”
 Question 5: “Using a z-test, determine how many full days of data, starting from the
first full day, are needed to confirm that experiment four earns more revenue than
experiment three at the 99% confidence level.”
 The accurate measurement of confidence is what makes this different. I have built a
number of data warehouse environments for Retail, Finance, and Health Care.
Even with the Chain of Custody information built in to provide for data traceability, I
have seen very few decisions based on the output of the data warehouse. Why is
this?
 No one has discussed confidence levels in the data. Having a rational conversation
about the “confidence” and statistical significance of the data allows for more
rational decision making.
 <Opinion>This is one of the key differentiators of data science versus business
analytics. </Opinion>
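One way to approach a "how many full days are required" question is to accumulate the data day by day and stop at the first day on which a z-test clears the 99% level. The Python sketch below does this for a signup rate, as in question 4. It is a sketch under assumptions, not the challenge solution: the daily numbers are invented, the test is a pooled two-proportion z-test, and the 99% level is treated as one-sided ("better than"). The revenue comparison in question 5 would need a z-test on means instead.

```python
from math import sqrt
from statistics import NormalDist


def z_two_proportions(success_a, total_a, success_b, total_b):
    """Pooled two-proportion z statistic for rate A minus rate B."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (rate_a - rate_b) / std_err


def days_needed(daily_a, daily_b, confidence=0.99):
    """Return the first day count at which A beats B at the one-sided confidence level, or None."""
    critical = NormalDist().inv_cdf(confidence)  # about 2.326 for 0.99
    succ_a = tot_a = succ_b = tot_b = 0
    for day, ((sa, ta), (sb, tb)) in enumerate(zip(daily_a, daily_b), start=1):
        # Accumulate all data from the first day through the current day.
        succ_a, tot_a = succ_a + sa, tot_a + ta
        succ_b, tot_b = succ_b + sb, tot_b + tb
        if z_two_proportions(succ_a, tot_a, succ_b, tot_b) > critical:
            return day
    return None


# Hypothetical (signups, visitors) per full day for two experiments.
experiment_one = [(30, 200), (28, 200), (33, 200), (31, 200)]
experiment_two = [(20, 200), (22, 200), (19, 200), (21, 200)]
print(days_needed(experiment_one, experiment_two))
```

Testing repeatedly as data accumulates inflates the chance of a false positive, so this stopping rule is a simplification of what a careful analysis would do.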
Solution 3
 Type of problem: Graph Analysis
 Create a Master Graph.
 Run Page Rank to identify centrality.
 Create many small graphs for individual users.
 Mask the Master Graph, and PageRank Graph.
 Multiply together the centrality, the in-degree of a possible follower,
and the inverse of the length of the path from this particular user
to a candidate vertex to be followed.
 This code takes over 48 hours to run.
 Code: Problem3.sh, and AnalyzeGraph.scala
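The scoring idea above (centrality times in-degree times the inverse of the path length) can be sketched on a toy graph in plain Python. This is not AnalyzeGraph.scala: the edges are invented, the function names are hypothetical, PageRank is a basic power iteration, and the in-degree is taken to be that of the candidate vertex, which is one reading of the description. The masking step is not reproduced.

```python
from collections import deque

# Hypothetical follow edges: (follower, followed).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "c"), ("d", "c")]

nodes = sorted({n for edge in edges for n in edge})
out_links = {n: [] for n in nodes}
in_degree = {n: 0 for n in nodes}
for src, dst in edges:
    out_links[src].append(dst)
    in_degree[dst] += 1


def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread their rank evenly."""
    n = len(out_links)
    rank = {node: 1.0 / n for node in out_links}
    for _ in range(iterations):
        dangling = sum(rank[node] for node, outs in out_links.items() if not outs)
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in out_links}
        for node, outs in out_links.items():
            for target in outs:
                new_rank[target] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank


def path_lengths(out_links, start):
    """Breadth-first search distances (in hops) from start along follow edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in out_links[node]:
            if target not in dist:
                dist[target] = dist[node] + 1
                queue.append(target)
    return dist


def recommend(user, rank):
    """Rank reachable, not-yet-followed candidates by centrality * in-degree / path length."""
    already = set(out_links[user]) | {user}
    scores = {
        candidate: rank[candidate] * in_degree[candidate] / hops
        for candidate, hops in path_lengths(out_links, user).items()
        if candidate not in already
    }
    return sorted(scores, key=lambda c: (-scores[c], c))


rank = pagerank(out_links)
print(recommend("a", rank))
```

Running a breadth-first search per user is the expensive part of this sketch; on a large graph a distributed graph library would be the more realistic choice.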
Graph Analysis
 As Graphs get really large it becomes difficult to visualize them.
 However, I was able to “subset” the master graph based on the
recommendation output of my process.
 I was expecting to see one big clump of nodes tightly connected. This
would be the “Target” to follow.
 I was also expecting to see two smaller clumps of nodes, loosely
connected to the larger clump. These are the “followers”. As we
recommend that they follow the more popular node, they will
become more closely connected to this user.
 Here is the output from Gephi that shows whether the code worked or
not.
Gephi output
Where to go from here?
 Spark.
 Scala.
 Learn these topics.
 Teach these topics.
 Especially for folks planning on sitting for Data Science challenge 4:
Learn Scala. Learn Spark.
 Oh, and keep studying about Graphs…
 Code located here: https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3

Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 

Cloudera Data Science Challenge 3 Solution by Doug Needham

  • 2. Cloudera: Certified Data Scientist  This is the goal. What are the requirements?  Requirement 1. DS-200 Test.  Requirement 2. Data Science Challenge.  http://www.cloudera.com/content/cloudera/en/training/certification/ccp-ds/challenge/challenge3.html
  • 3. DS-200 Test  Here are the sections of the exam:  Data Acquisition  Data Evaluation  Data Transformation  Machine Learning Basics  Clustering  Classification  Collaborative Filtering  Model/Feature Selection  Probability  Visualization  Optimization
  • 4. Data Science Challenge Itself  The resources needed for the challenge.  A Cluster:  7 nodes (1 NameNode, 6 DataNodes).  1 Node Cloudera Manager  1 Node Cloudera Director  Cloudera Director requires a particular AMI for the East Coast AWS region (ami-3218595b, RHEL 6.4 x86_64). It took a bit of time to get this right. The version I used is highly dependent on RHEL 6.4.
  • 5. Cluster Management  Cloudera Director made it easy to create the cluster once I used the proper AMI. As noted previously, it only really works well with RHEL 6.4.  Cloudera Manager allowed for management and monitoring.  Demo here:  Restarting the cluster is non-trivial.  Clusters are meant to stay up. A bit of verification work was needed when the cluster was shut down over the holidays.
  • 6. The problems  The challenge is made up of 3 problems. Each one could be a larger effort in and of itself to solve.  The data had to be transformed in order to process it.  The sophisticated portion of the challenge was the actual processing of the data, Machine Learning, Graph analysis, Statistical Confidence, etc…  Then the output needed to be tweaked a bit in order to conform to the deliverable specification.
  • 7. Problem 1  Flight Delays.  SmartFly’s business is providing its customers with timely travel information and notifications about flights, hotels, destination weather, traffic getting to the airport, and anything else that can help make the travel experience smoother. Their product team has come up with the idea of using the flight data that they have been collecting to predict whether customers’ flights will be delayed so that they can respond proactively. They’ve now contacted you to help them test out the viability of the idea.  From a given set of historical flights, produce a list of future scheduled flights ordered by the probability that each one will be delayed.
  • 8. Problem 2  Web site log analytics.  Congratulations! You have just published your first book on data science, advanced analytics, and predictive modeling. You’ve decided to use your skills as a data scientist to create and optimize a website promoting your book, and you have started several ad campaigns on a popular search engine in order to drive traffic to your site.  Provide statistics about a web-site where your new book is featured.
  • 9. Problem 3  Who should Follow whom?  Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.
  • 10. Rules  Individual Contributions Only: You must participate in this challenge only on an individual basis; teams are not permitted.  Sharing: Any sharing of code or solutions or collaboration with another person or entity is strictly forbidden.  Tools: You may use any tools or software you desire to complete the challenge.  Prerequisites: You must have successfully passed Data Science Essentials (DS-200).
  • 11. Deliverables  Problem 1: Ordered list of flights.  Problem 2: JSON file, generated from a Python dictionary, containing specific answers to the questions.  Problem 3: Top 70,000 connections that should be recommended.  In addition to the deliverables stated above for each problem, you must provide a solution abstract and the complete set of source code used to solve the challenge problems, as described below.  Solution Abstract  The solution abstract should be a brief write-up in PDF format that addresses the following points for each part:  Explain your methodology including approach, assumptions, software and algorithms used, testing and validation techniques applied, model selection criteria, and total time spent.  Please include in your solution abstract any information that can be used to understand the logic behind your approach and all steps taken, including data preparation, modeling, validation, analysis, visualization, etc. The solution abstract should typically be 3 to 5 pages and no more than 6 pages.  Complete Source Code  Tarball or zip file of all source code used to complete the challenge, including programs, scripts, and other artifacts.  My GitHub is linked at the end of this presentation.
  • 12. Scoring  Submission Scoring  Submissions will be scored as follows. Each problem part will be scored independently. The score for each part will be a composite of the percentage correct for all submitted solutions for that part and the score assigned to the corresponding section of the solution abstract. The scores for the three parts will be weighted and combined into a final composite score.  The percentage correct for each part will be scored against a golden master of known correct answers. Note that some questions may have more than one correct answer, and partial credit may be awarded.  The solution abstract will be scored according to objective criteria about your approach and general mastery of the tools and techniques. Writing quality and formatting will not contribute to the score, except in cases where the writing is so poor as to impact understanding.
  • 13. Did anyone notice anything?  Each of the 3 problems highlights a very different aspect of Data Science.  Machine Learning  Statistical Analysis  Graph Analysis  Each of these is an area that people specialize in. I, for one, intend to dive deep into Graph Analysis. It is quite interesting from what I have seen so far.  The code has to be straightforward. It is not clear whether they will do so for this particular challenge, but in at least one of the prior challenges the code was run independently as part of the Cloudera grading process.
  • 14. Bringing us to the question of What is a Data Scientist ? http://nirvacana.com/thoughts/becoming-a-data-scientist/
  • 15. If you do the challenge are you a purple squirrel?
  • 16. Who is a Data Scientist?  Who here has seen the Indiana Jones movies?  Marcus Brody and Henry “Indiana” Jones were both archeologists.  Both lectured and taught archeology.  Both understood the tenets of archeology.  Both knew what finding an artifact means.  Both could speak intelligently about the significance of any finds associated with the search.  Only Indiana went on the “quest”. Why?  Those “intangible” skills of being able not only to survive, but to thrive in a chaotic environment played to “Indy’s” strengths.  https://www.youtube.com/watch?v=PgfpIV29Ccc  There are many types of data scientist: http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists  I think the environment affects the success of a data scientist.
  • 17. From Data Science Central
  • 18. Opinion time  This challenge has reminded me of some of my first efforts building a data warehouse, where I had to build the pipeline of data from our source systems into our Operational Data Store, and out to our Star Schema for daily reports.  It has many of the earmarks of a real-world project.  The biggest difference between this challenge and a “real world problem” is that there are known solutions, and someone knows those solutions.  In “the real world”, we have things like user acceptance testing. That is a more opinionated way of judging success or failure than the objective measure of: is the data in the answer set?
  • 19. Doug’s Problem Solving approach  This is the approach I took; it may or may not be useful for others to apply.  Analysis. I started with some basic numbers, and just browsed through the data with the “Data Science at the Command Line” toolkit. This is very handy for getting a feel for things.  Based on the general understanding this analysis provided, create a “pipeline”.  Generally the data has to be transformed to a usable structure for the particular method of solving the problem.  Do some basics with the problem solving method: Stats, ML, Graph, etc.  Get some data back out of that tool, then format the output to specification.  Iterate.  I did this for problem 1, moved to problem 2, then finally problem 3. Then I went back to 1, back to 2, back to 3.  This method allowed me to give myself some “space”, and actually look at each problem with fresh eyes on more than one occasion.  Breaking each problem down into the basics of Input, Process, Output allowed me to have “working” code for each problem really quickly. Then, through tuning, analysis, research, and some time to think about the problem, I was able to come up with each unique solution.  It also allowed me to refactor the code, having given each problem time to “rest”.  Very much like a painting: broad strokes first, details emerge as the painting progresses.  Another benefit is that once I am able to get the data all the way through the pipeline, it becomes obvious where the performance bottlenecks are.  This method does take a bit of time.
  • 20. Solution 1  Type of problem: Machine Learning  Use Python to format the data.  Create an individual set of files based on point of origin (departing airport).  Use these individual files to create one model per airport using Spark MLlib.  Run the scheduled flights through the model, then multiply the score of the model (area under the ROC curve, used here as a measure of accuracy) by the output of the prediction (which is either 1 or 0).  This lets us weight each predicted delay by how much we trust the model that produced it.  Code: problem1.sh and PredictFlights.scala
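The scoring step described in Solution 1 can be sketched in a few lines of Python. This is only an illustration of "model AUC multiplied by the 0/1 prediction"; it is not the code from PredictFlights.scala, and the airport codes, AUC values, flight IDs, and the `rank_flights` helper are all made up for the example.

```python
# Hypothetical per-airport model quality (area under the ROC curve).
# These numbers are invented for illustration; they are not the author's results.
airport_auc = {"SFO": 0.71, "ORD": 0.64, "JFK": 0.58}

# Hypothetical scheduled flights: (flight_id, origin, predicted_label),
# where predicted_label is 1 (delayed) or 0 (not delayed).
scheduled = [
    ("F1", "ORD", 1),
    ("F2", "SFO", 1),
    ("F3", "JFK", 0),
    ("F4", "JFK", 1),
]


def rank_flights(flights, auc_by_airport):
    """Score each flight as AUC(origin model) * predicted label, highest first."""
    scored = [
        (flight_id, auc_by_airport[origin] * label)
        for flight_id, origin, label in flights
    ]
    # sorted() is stable, so flights with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_flights(scheduled, airport_auc)
for flight_id, score in ranking:
    print(flight_id, round(score, 2))
```

One consequence of multiplying by a hard 0/1 label is that every flight predicted as not delayed scores 0, so those flights are tied at the bottom of the list.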
  • 21. What the heck is a ROC?  This comes from: http://gim.unmc.edu/dxtests/roc3.htm  Every prediction made by a model falls into one of four categories:  True Positive  True Negative  False Positive  False Negative.  The ROC curve plots the true positive rate against the false positive rate. The higher the area under the yellow line, the better.  This is used for model validation to ensure that the model is making accurate predictions against known data.  http://en.wikipedia.org/wiki/Receiver_operating_characteristic
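To make the four outcome counts and the area under the ROC curve concrete, here is a minimal pure-Python sketch. The labels and scores are toy values chosen for the example, and the helper names are hypothetical; in practice a library routine would normally be used instead.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, from (0, 0) to (1, 1).

    Assumes both classes are present, otherwise a rate would divide by zero.
    """
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 = delayed, 0 = not delayed; scores are model confidences.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
print(round(auc, 4))
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.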
  • 22. Solution 2  Type of problem: Statistical reporting  Python streaming, so lots of MapReduce code. It could probably be replaced with Spark, but at the time this path seemed to be the most straightforward.  A bit of analysis.  Collect some numbers (statistically significant numbers, that is).  Format the data in JSON.  Code: Problem2.sh
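As a rough illustration of the streaming MapReduce style mentioned in Solution 2, the sketch below counts events per experiment with a mapper and a reducer. The tab-separated log layout (timestamp, experiment, event) and all sample records are assumptions made for the example; the actual challenge log format and the contents of Problem2.sh are not shown in the slides.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'experiment:event<TAB>1' for every well-formed log record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed records
        _timestamp, experiment, event = fields
        yield f"{experiment}:{event}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _key, count in group)


# Hypothetical log records, including one malformed line.
log = [
    "2014-10-01T00:01\texp1\tview",
    "2014-10-01T00:02\texp1\tsignup",
    "2014-10-01T00:03\texp2\tview",
    "2014-10-01T00:04\texp1\tview",
    "bad record",
]

# sorted() stands in for the shuffle/sort phase between map and reduce.
counts = dict(reducer(sorted(mapper(log))))
print(counts)
```

In a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output, with the framework performing the sort between them.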
  • 23. What makes this data science?  Isn’t this the same thing as business analysis?  What makes the difference is the latter part of questions 4 and 5.  Here they are:  Question 4: “How many full days of data, starting from the first day, are required to determine that the newsletter signup rate for experiment one is better than experiment two at the 99% confidence level?”  Question 5: “Using a z-test, determine how many full days of data, starting from the first full day, are needed to confirm that experiment four earns more revenue than experiment three at the 99% confidence level.”  The accurate measurement of confidence is what makes this different. I have built a number of data warehouse environments in Retail, Finance, and Health Care. Even with the chain-of-custody information built in to provide for data traceability, I have seen very few decisions based on the output of the data warehouse. Why is this?  No one has discussed confidence levels in the data. Having a conversation about the “confidence” and statistical significance of the data allows for more rational decision making.  <Opinion>This is one of the key differentiators of data science versus business analytics. </Opinion>
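A question like Question 4 can be approached by accumulating the data one full day at a time and running a z-test after each day. The sketch below uses a pooled two-proportion z-test with a one-sided 99% critical value; whether the challenge expected a one-sided or two-sided test is not stated in the slides, so that choice, the daily numbers, and the function names are all assumptions. Question 5 concerns revenue rather than a rate, so it would need a z-test on means instead.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(success_a, total_a, success_b, total_b):
    """Pooled two-proportion z statistic for rate(A) minus rate(B)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0  # no variation at all, so no evidence either way
    return (p_a - p_b) / std_err


def days_needed(daily_a, daily_b, confidence=0.99):
    """First number of full days at which A beats B at the given confidence.

    daily_a and daily_b are lists of (signups, visitors), one entry per day.
    Returns None if the threshold is never reached.
    """
    critical = NormalDist().inv_cdf(confidence)  # one-sided critical value
    sa = ta = sb = tb = 0
    for day, ((s_a, n_a), (s_b, n_b)) in enumerate(zip(daily_a, daily_b), start=1):
        sa, ta, sb, tb = sa + s_a, ta + n_a, sb + s_b, tb + n_b
        if two_proportion_z(sa, ta, sb, tb) > critical:
            return day
    return None


# Hypothetical daily (signups, visitors) for two experiments.
experiment_one = [(120, 1000)] * 5
experiment_two = [(100, 1000)] * 5
print(days_needed(experiment_one, experiment_two))
```

Testing repeatedly as each day arrives is the procedure the question implies; it is worth noting that repeated testing inflates the chance of a false positive compared with a single test at a fixed sample size.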
  • 24. Solution 3  Type of problem: Graph Analysis  Create a Master Graph.  Run PageRank to identify centrality.  Create many small graphs for individual users.  Mask the Master Graph and the PageRank Graph.  Multiply together the centrality, the number of in-degrees for a possible follower, and the inverse of the length of the path from this particular user to a candidate vertex to be followed.  This code takes over 48 hours to run.  Code: Problem3.sh and AnalyzeGraph.scala
  • 25. Graph Analysis  As graphs get really large, it becomes difficult to visualize them.  However, I was able to “subset” the master graph based on the recommendation output of my process.  I was expecting to see one big clump of tightly connected nodes. This would be the “Target” to follow.  I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”; as we recommend that they follow the more popular node, they will become more closely connected to this user.  Here is the output from Gephi that shows whether the code worked or not.
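The scoring formula in Solution 3 (centrality times in-degree times inverse path length) can be illustrated on a toy graph in plain Python. This is a sketch under several assumptions: it applies the in-degree to the candidate being recommended, uses a simple power-iteration PageRank, skips the graph-masking steps, and uses invented user names. It is not the author's AnalyzeGraph.scala.

```python
from collections import deque


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each user to the users they follow."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, followees in graph.items():
            # A user who follows nobody spreads their rank over everyone.
            targets = followees if followees else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def in_degrees(graph):
    """Number of followers each user has."""
    counts = {node: 0 for node in graph}
    for followees in graph.values():
        for target in followees:
            counts[target] += 1
    return counts


def distances_from(graph, source):
    """Breadth-first path lengths from source along 'follows' edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in graph[node]:
            if target not in dist:
                dist[target] = dist[node] + 1
                queue.append(target)
    return dist


def recommend(graph, user, top_n=3):
    """Rank reachable, not-yet-followed users by rank * in-degree / distance."""
    rank = pagerank(graph)
    indeg = in_degrees(graph)
    already = set(graph[user]) | {user}
    scores = {
        cand: rank[cand] * indeg[cand] / dist
        for cand, dist in distances_from(graph, user).items()
        if cand not in already
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical follow graph: every user appears as a key.
graph = {
    "richie": ["potsie"],
    "potsie": ["fonzie", "ralph"],
    "ralph": ["fonzie"],
    "fonzie": ["richie"],
    "joanie": ["fonzie"],
}
print(recommend(graph, "richie"))
```

Running a breadth-first search per user is one plausible reason a job like this gets slow on a large graph, although the slides do not say where the 48 hours were spent.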
  • 27. Where to go from here?  Spark.  Scala.  Learn these topics.  Teach these topics.  Especially for folks planning on sitting for Data Science Challenge 4: Learn Scala. Learn Spark.  Oh, and keep studying about Graphs…  Code located here: https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3