Applied Data Analytics 
Building a real data product
Github Repository 
http://bit.ly/1eLBzki 
Matrix Factorization 
http://slidesha.re/15Qssf0 
Links to various resources
Goals for this Course 
● Apply the ideas and tools learned during all previous program courses 
● Use a real-world data set with actionable predictions 
● Present a completed project to faculty and peers 
● Build a data project portfolio 
What are your goals? 
● Understand the Data Science Pipeline 
● Understand what a complete data product looks like 
● Be able to set up and implement a data product in Python
Some Logistics 
This is a small class, so I’m hoping for lots of participation! 
Course materials can be found in three places: 
● iPython: http://bit.ly/1gJ73Tt 
● Github: https://github.com/DistrictDataLabs/science-bookclub 
● Slides: on slideshare or on Blackboard 
Recommended Reading: 
● Matrix Factorization: A simple tutorial and implementation 
● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
Agenda - Day One 
● Review Data Products 
● Review Data Science Pipeline 
● Discuss architecture of the data product we’re going to build. 
● Setting up our project 
● Ingestion of Goodreads Data 
● Lunch 
● Creating a command line admin program 
● Wrangling of Goodreads Data 
● A computational data store
Agenda - Day Two 
● Review current state of recommender project 
● Matrix math review 
● Introduction to matrix factorization 
● Building a recommender system 
● Reporting with Jinja2 
● Lunch 
● Presentations of Capstone Projects 
● Course wrap-up
Building Data Products
“A data product is a product that is 
based on the combination of data 
and algorithms.” 
Hilary Mason 
“A data application acquires its value from the 
data itself, and creates more data as a result. 
It’s not just an application with data; it’s a 
data product. Data science enables the 
creation of data products.” 
Mike Loukides 
The Data Science Pipeline
1. Data Ingestion 
2. Data Munging and Wrangling 
3. Computation and Analyses 
4. Modeling and Application 
5. Reporting and Visualization
Data Ingestion 
● There is a world of data out 
there. How do we get it? Web 
crawlers, APIs, sensors? Python 
and other web scripting 
languages are well suited to 
this task. 
● The real question is how we can 
deal with such a giant volume 
and velocity of data. 
● Big Data and Data Science often 
require ingestion specialists!
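
A minimal sketch of an ingestion step, assuming Python 3 and the standard library only. The function name ingest, the fetch hook, and the file naming scheme are hypothetical and are not taken from the course repository. The response is written to disk unmodified so that later stages can always start again from the raw data.

```python
import hashlib
import os
import urllib.request


def ingest(url, raw_dir, fetch=None):
    """Fetch `url` and store the response body, unmodified, under `raw_dir`.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns bytes. It defaults to a plain urllib download and exists so the
    function can be exercised without network access.
    """
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=10) as response:
                return response.read()

    body = fetch(url)
    os.makedirs(raw_dir, exist_ok=True)
    # Name the file after a hash of the URL so that fetching the same
    # resource again overwrites the old copy instead of duplicating it.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".raw"
    path = os.path.join(raw_dir, name)
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

Passing a stub such as fetch=lambda url: b"<xml/>" makes the function easy to test offline.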
Data Wrangling 
● Warehousing the data means 
storing the data in as raw a form 
as possible. 
● Extract, transform, and load 
operations move data to 
operational storage locations. 
● Filtering, aggregation, 
normalization and 
denormalization all ensure data is 
in a form it can be computed on. 
● Annotated training sets must be 
created for ML tasks.
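
The filtering, normalization, loading, and aggregation steps above can be sketched as follows. This is an illustration under stated assumptions: it uses the standard-library sqlite3 module in place of SQLAlchemy to stay self-contained, the records and column names are invented, and a rating of 0 is assumed to mean "not rated".

```python
import sqlite3

# Hypothetical raw records, shaped like parsed review data.
RAW_REVIEWS = [
    {"reader": "Ann ", "title": "Dune", "rating": "5"},
    {"reader": "ann", "title": "Emma", "rating": "0"},  # assumed: 0 = not rated
    {"reader": "bob", "title": "Dune", "rating": "4"},
    {"reader": "bob", "title": "Dune", "rating": "4"},  # duplicate row
]


def wrangle(records, conn):
    """Filter, normalize, and load raw review records into `conn`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        " reader TEXT, title TEXT, rating INTEGER,"
        " PRIMARY KEY (reader, title))"
    )
    for record in records:
        rating = int(record["rating"])  # normalize: text to integer
        if rating < 1 or rating > 5:    # filter: drop unrated rows
            continue
        # The primary key makes a repeated (reader, title) pair overwrite
        # the earlier row instead of duplicating it.
        conn.execute(
            "INSERT OR REPLACE INTO reviews VALUES (?, ?, ?)",
            (record["reader"].strip().lower(), record["title"].strip(), rating),
        )
    conn.commit()


def average_ratings(conn):
    """Aggregate: mean rating per title, returned as a {title: mean} dict."""
    rows = conn.execute("SELECT title, AVG(rating) FROM reviews GROUP BY title")
    return dict(rows.fetchall())
```

Calling wrangle(RAW_REVIEWS, sqlite3.connect(":memory:")) loads two rows: the unrated review is filtered out and the duplicate collapses into one.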
Computation and Analyses 
● Hypothesis driven computation 
includes design and development 
of predictive models. 
● Many models have to be trained 
or constrained into a 
computational form like a Graph 
database, and this is time 
consuming. 
● Other data products like indices, 
relations, classifications, and 
clusters may be computed.
Modeling and Application 
This is the part we’re most familiar with. 
Supervised classification, Unsupervised 
clustering - Bayes, Logistic Regression, 
Decision Trees, and other models. 
This is also where the money is.
Reporting and Visualization 
● Often overlooked, this part is 
crucial, even if we have data 
products. 
● Humans recognize patterns 
better than machines. Human 
feedback is crucial in Active 
Learning and remodeling (error 
detection). 
● Mashups and collaborations 
generate more data, and 
therefore more value!
Don’t forget feedback! 
(Active Learning for Data 
Products)
What we’re going to build today 
SCIENCE BOOKCLUB!! 
● A book club that chooses what to 
read via a recommender system. 
● Uses GoodReads data to ingest 
and return feedback on books. 
● Statistical model is a non-negative 
matrix factorization 
● Reporting using Jinja (almost a 
web app)
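
To make the statistical model concrete, here is a small sketch of non-negative matrix factorization with NumPy. It uses masked multiplicative updates in the style of Lee and Seung, which is one common way to fit such a model and not necessarily the method used in the course code; the recommended tutorial, for instance, uses gradient descent. The function name nmf, its parameters, and the convention that 0 marks a missing rating are assumptions made for illustration.

```python
import numpy as np


def nmf(R, k=2, steps=500, eps=1e-9, seed=0):
    """Factor a non-negative ratings matrix R (readers x books) into W @ H.

    Zeros in R are treated as missing ratings and masked out of the
    updates, so only observed ratings influence the fit.
    """
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)
    mask = (R > 0).astype(float)
    W = rng.random((R.shape[0], k)) + eps  # reader-by-feature weights
    H = rng.random((k, R.shape[1])) + eps  # feature-by-book weights
    for _ in range(steps):
        # Multiplicative updates: every factor is a ratio of non-negative
        # terms, so W and H stay non-negative at each step.
        H *= (W.T @ (mask * R)) / (W.T @ (mask * (W @ H)) + eps)
        W *= ((mask * R) @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H
```

The product W @ H fills in every cell of the matrix, including the ones that were missing in R; those filled-in values are the predicted ratings a recommender can rank.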
Workflow 
1. Setting up a Python skeleton 
2. Creating and Running Tests 
3. Wading in with a configuration 
4. Ingestion with urllib and requests 
5. Creating a command line admin with argparse 
6. Wrangling with BeautifulSoup and SQLAlchemy 
7. Modeling with numpy 
8. Reporting with Jinja2
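
As an illustration of step 5, a command line admin program built with argparse might expose one subcommand per pipeline stage. The program name, subcommands, and options below are hypothetical and are not taken from the course repository.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(
        prog="bookclub-admin",  # hypothetical program name
        description="Administrative commands for the book club project.",
    )
    commands = parser.add_subparsers(dest="command")
    commands.required = True  # fail with a usage message if no command is given

    ingest = commands.add_parser("ingest", help="download raw data")
    ingest.add_argument("url", help="resource to fetch")
    ingest.add_argument("--out", default="data/raw", help="raw data directory")

    commands.add_parser("wrangle", help="load raw data into the database")

    report = commands.add_parser("report", help="render the HTML report")
    report.add_argument("--top", type=int, default=10,
                        help="number of recommendations to list")
    return parser
```

For example, build_parser().parse_args(["report", "--top", "3"]) yields a namespace with command set to "report" and top set to 3, which a dispatcher can then route to the matching module.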
Octavo Architecture (really clear DSP) 
[Diagram: components in pipeline order] 
1. Ingestion Module (requests.py) 
2. Raw Data Storage 
3. Wrangling Module (BeautifulSoup, SQLAlchemy) 
4. Computational Data Storage 
5. Recommender Module (Numpy) 
6. Reporting Module (Matplotlib, Jinja2)
Let’s dive into some code!

Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 

Building Data Products with Python (Georgetown)

  • 1. Applied Data Analytics Building a real data product
  • 2. Github Repository http://bit.ly/1eLBzki Matrix Factorization http://slidesha.re/15Qssf0 Links to various resources
  • 3. Goals for this Course ● Apply the ideas and tools learned during all previous program courses ● Use a real world data set with actionable prediction ● Present a completed project to faculty and peers ● Build a data project portfolio What are your goals? ● Understand the Data Science Pipeline ● Understand what a complete data product looks like ● Be able to set up and implement a data product in Python
  • 4. Some Logistics This is a small class, I’m hoping for lots of participation! Course materials can be found in two places: ● iPython: http://bit.ly/1gJ73Tt ● Github: https://github.com/DistrictDataLabs/science-bookclub ● Slides: on slideshare or on Blackboard Recommended Reading: ● Matrix Factorization: A simple tutorial and implementation ● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
  • 5. Agenda - Day One ● Review Data Products ● Review Data Science Pipeline ● Discuss architecture of the data product we’re going to build. ● Setting up our project ● Ingestion of Goodreads Data ● Lunch ● Creating a command line admin program ● Wrangling of Goodreads Data ● A computational data store
  • 6. Agenda - Day Two ● Review current state of recommender project ● Matrix math review ● Introduction to matrix factorization ● Building a recommender system ● Reporting with Jinja2 ● Lunch ● Presentations of Capstone Projects ● Course wrap-up
  • 8. “A data product is a product that is based on the combination of data and algorithms.” Hilary Mason
  • 9.
  • 10. “A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.” Mike Loukides
  • 11.
  • 12. The Data Science Pipeline
  • 13. Data Ingestion Data Munging and Wrangling Computation and Analyses Modeling and Application Reporting and Visualization
  • 14. Data Ingestion ● There is a world of data out there. How do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are custom made for this task. ● The real question is how we can deal with such a giant volume and velocity of data. ● Big Data and Data Science often require ingestion specialists!
  • 15. Data Wrangling ● Warehousing the data means storing the data in as raw a form as possible. ● Extract, transform, and load operations move data to operational storage locations. ● Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on. ● Annotated training sets must be created for ML tasks.
  • 16. Computation and Analyses ● Hypothesis-driven computation includes the design and development of predictive models. ● Many models have to be trained or constrained into a computational form like a graph database, and this is time-consuming. ● Other data products like indices, relations, classifications, and clusters may be computed.
  • 17. Modeling and Application This is the part we’re most familiar with: supervised classification and unsupervised clustering with Bayes, logistic regression, decision trees, and other models. This is also where the money is.
  • 18. Reporting and Visualization ● Often overlooked, this part is crucial, even if we have data products. ● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection). ● Mashups and collaborations generate more data, and therefore more value!
  • 19. Don’t forget feedback! (Active Learning for Data Products)
  • 20. What we’re going to build today SCIENCE BOOKCLUB!! ● A book club that chooses what to read via a recommender system. ● Uses GoodReads data to ingest and return feedback on books. ● Statistical model is a non-negative matrix factorization ● Reporting using Jinja (almost a web app)
  • 21. Workflow 1. Setting up a Python skeleton 2. Creating and Running Tests 3. Wading in with a configuration 4. Ingestion with urllib and requests 5. Creating a command line admin with argparse 6. Wrangling with BeautifulSoup and SQLAlchemy 7. Modeling with numpy 8. Reporting with Jinja2
  • 22. Octavo Architecture (really clear DSP) ● Ingestion Module: requests.py ● Raw Data Storage ● Wrangling Module: BeautifulSoup, SQLAlchemy ● Computational Data Storage ● Recommender Module: Numpy ● Reporting Module: Jinja2, Matplotlib
  • 23. Let’s dive into some code!
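
As a starting point for the modeling step (non-negative matrix factorization with numpy), here is a minimal sketch. It is not the course's own implementation: it assumes masked multiplicative updates, and the names `factorize`, `masked_error`, and `RATINGS`, along with the toy ratings themselves, are made up for illustration. Zeros are treated as "not yet rated", which is also an assumption.

```python
import numpy as np


def factorize(ratings, n_features=2, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative ratings matrix R as P @ Q.

    P is (users x features) and Q is (features x items). Zeros in
    `ratings` are treated as missing and are masked out of the updates.
    """
    rng = np.random.default_rng(seed)
    R = np.asarray(ratings, dtype=float)
    mask = (R > 0).astype(float)
    n_users, n_items = R.shape

    # Strictly positive starting values so the updates are well defined.
    P = rng.random((n_users, n_features)) + 0.1
    Q = rng.random((n_features, n_items)) + 0.1

    for _ in range(n_iter):
        # Multiplicative updates only rescale entries by non-negative
        # ratios, so P and Q stay non-negative throughout.
        P *= ((mask * R) @ Q.T) / ((mask * (P @ Q)) @ Q.T + eps)
        Q *= (P.T @ (mask * R)) / (P.T @ (mask * (P @ Q)) + eps)
    return P, Q


def masked_error(ratings, P, Q):
    """Sum of squared errors over the observed (non-zero) ratings only."""
    R = np.asarray(ratings, dtype=float)
    observed = R > 0
    return float(np.sum((R - P @ Q)[observed] ** 2))


# Hypothetical toy data: 5 readers x 4 books, 0 means "not rated yet".
RATINGS = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

P, Q = factorize(RATINGS)
predicted = P @ Q  # dense matrix: also fills in the unrated cells
```

The unrated cells of `predicted` are what a recommender would rank to choose the next book. The number of latent features (`n_features`) and the iteration count are tuning choices, not values taken from the slides.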