1

Mining Social Web APIs
with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
New York City - 28 October 2013
2

Intro
3

Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
4

Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
5

The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
6

Overview
Intro (5 mins)
Module 1 - Virtual Machine Setup (10 mins)
Module 2 - Mining Twitter (40 mins)
Module 3 - Mining Facebook (35 mins)
BREAK (30 mins)
Module 4 - Mining LinkedIn (40 mins)
Module 5 - Open Hack (40 mins)
Final Q&A; Wrap Up (10 mins)
7

Module Format
~10-15 minutes of exposition
I talk; you listen

~25-30 minutes of independent (or collaborative) work
You hack while I walk around and help you

~5 minutes of Q&A
You ask; I try to answer
8

Workshop Objective

To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit

Not to listen to me talk to you for 3 hours
9

Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs. a presentation)
Wider than it is deep
There's only so much you can do in a few hours

I'm available 24/7 this week (and beyond) to help you be successful
10

Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers

Or you want to learn to do these things now...
And you're a quick learner
11

Module 1: Virtual Machine Setup
12

Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system
13

But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed

You get the idea...
14

The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...

IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
15

What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
16

Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step

Because it's great for collaboration
Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad
Think of it as "executable paper"
17
18
19

VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!

Basically:
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
20

What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers

Go to <the URL provided in the session> and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
I'll verbally provide the connection details (port and password)
21

A Hosted Virtual Machine
Yes, please.
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later

Stand by for the step-by-step instructions on how to do it
I'll publish a post on it in the next day or so
22
23

Module 2: Mining Twitter
24

Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
25

Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
26

API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters
JSON responses
Cursors (not quite pagination)
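
The cursor idea can be sketched as follows. This is a minimal illustration, not the client code used in the workshop notebooks: fetch_page and FAKE_PAGES are hypothetical stand-ins for an authenticated API call, included only so the loop runs offline. The convention of starting at cursor -1 and stopping when next_cursor is 0 follows Twitter's v1.1 cursoring behavior.

```python
# Hypothetical in-memory "API": maps a cursor value to one page of results.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 100},
    100: {"ids": [4, 5], "next_cursor": 200},
    200: {"ids": [6], "next_cursor": 0},
}


def fetch_page(cursor):
    """Stand-in for an authenticated request that accepts a cursor parameter."""
    return FAKE_PAGES[cursor]


def collect_ids(fetch, max_items=None):
    """Follow next_cursor values until the API signals the end with 0."""
    ids = []
    cursor = -1  # -1 asks for the first page
    while cursor != 0:
        page = fetch(cursor)
        ids.extend(page["ids"])
        if max_items is not None and len(ids) >= max_items:
            return ids[:max_items]
        cursor = page["next_cursor"]
    return ids


all_ids = collect_ids(fetch_page)
print(all_ids)
```

Unlike page numbers, a cursor is an opaque token: you cannot jump to "page 3", only ask for whatever comes after the page you just received.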
27

Twitter is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, and Nina]
28

What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
29

What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations

(financial) symbols
stock tickers

media
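
As a rough sketch of how entities can be pulled out of a tweet, the snippet below reads the "entities" field of a tweet dictionary. The tweet itself is hand-written and heavily trimmed (its values are made up), and extract_entities is an illustrative helper name, not a function from the workshop notebooks. The field names (user_mentions, hashtags, urls) follow the v1.1 tweet JSON format.

```python
# A hand-written, heavily trimmed tweet in the shape of the v1.1 JSON format.
# Real tweets carry far more metadata; the values here are invented.
tweet = {
    "text": "Learning to mine #Twitter data with @SocialWebMining http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "Twitter"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
    },
}


def extract_entities(status):
    """Return (screen_names, hashtags, urls) from one tweet dictionary."""
    entities = status.get("entities", {})
    screen_names = [m["screen_name"] for m in entities.get("user_mentions", [])]
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    urls = [u["expanded_url"] for u in entities.get("urls", [])]
    return screen_names, hashtags, urls


print(extract_entities(tweet))
```

Because the API has already parsed the 140 characters for you, no regular expressions are needed to find mentions, hashtags, or links.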
30

Data Mining Is...

Counting
Comparing
Filtering
Ranking
31

Histograms

A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
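
To make the binning idea concrete, here is a minimal pure-Python sketch that groups values into fixed-width ranges and prints a text histogram. The retweet counts are invented, and the histogram helper is an illustrative name; the workshop notebooks plot with IPython Notebook's plotting support rather than with this function.

```python
def histogram(values, bin_width):
    """Group values into fixed-width bins and count how many fall in each."""
    bins = {}
    for value in values:
        start = (value // bin_width) * bin_width  # left edge of the bin
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items()))


# Made-up retweet counts for a handful of tweets.
retweet_counts = [0, 3, 7, 12, 15, 18, 25, 41, 44, 49]

for start, frequency in histogram(retweet_counts, 10).items():
    print(f"{start:>3}-{start + 9:<3} {'#' * frequency}")
```

Note that this sketch simply omits empty bins (here, 30-39); a plotted histogram would show that range as a gap.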
32

Plotting with IPython Notebook
33

Example: Histogram of Retweets
34

Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)

Acquire
Get the data

Analyze
Count things

Summarize
Plot the results
35

Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
36

Module 3: Mining Facebook
37

Objectives

Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how the Open Graph protocol extends Facebook's Social Graph API

Be able to analyze likes from Facebook pages and friends
38

Facebook Primitives

Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls
39

API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))

JSON responses
Traditional pagination
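
The field expansion request above can be assembled in Python roughly as follows. This sketch only builds the URL; it makes no network call and omits the access token a real request needs. The graph_url helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlencode


def graph_url(object_id, fields, base="http://graph.facebook.com"):
    """Build a Graph API URL with a field-expansion query string."""
    # Keep commas, dots, and parentheses readable instead of percent-encoding.
    query = urlencode({"fields": fields}, safe="(),.")
    return f"{base}/{object_id}/?{query}"


url = graph_url("ptwobrussell", "id,name,friends.fields(likes.limit(10))")
print(url)
```

The nested syntax reads inside-out: for each friend, fetch up to 10 likes, and return them alongside the top-level id and name in a single response.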
40

Facebook is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, and Nina]
41

Facebook API Explorer

Go to https://developers.facebook.com/tools/explorer
Really, go there right now...
42

Retrieve Your Likes
43

Facebook Permissions
44

Facebook Permissions
45

Explore Facebook Pages
Names of pages
MiningTheSocialWeb
CrossFit
OReilly

Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500
46

Social Media Analysis Framework

Recall the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
47

Embedded Visualizations with IPython NB
48

Social Network Diagram with D3
49

Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2
(Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends and likes from pages of interest
If you have time...
Remaining examples
50

Module 4: Mining LinkedIn
51

Objectives
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning

Be able to employ geocoding services to arrive at a set of coordinates
from a textual reference to a location
Visualize geographic data with cartograms
52

LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing
53

API Requests

(Weirdly) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)

XML responses
CSV address book download
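
Since responses arrive as XML, a small parsing sketch may help. The sample document below is an assumption about what a response to the field selector could look like (element names mirror the selector fields; the values are invented), and parse_person is an illustrative helper, not part of any LinkedIn library.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body for a selector such as
# ~:(first-name,last-name,headline); the values are invented.
SAMPLE_XML = """<person>
  <first-name>Ada</first-name>
  <last-name>Lovelace</last-name>
  <headline>Analyst at Example Corp</headline>
</person>"""


def parse_person(xml_text):
    """Turn a <person> element into a plain dictionary of its fields."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


print(parse_person(SAMPLE_XML))
```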
54

Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data
55

Clustering

An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions
56

Example: Clustered Job Titles
57

3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
58

Jaccard Similarity
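
A minimal sketch of Jaccard similarity applied to job titles: the size of the intersection of two sets divided by the size of their union. The tokens helper is a deliberately simple normalization step chosen for illustration (lowercase, split on whitespace), and treating two empty sets as identical is a convention picked here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def tokens(title):
    """Very light normalization: lowercase and split on whitespace."""
    return title.lower().split()


t1 = tokens("Chief Technology Officer")
t2 = tokens("Chief Executive Officer")
print(jaccard(t1, t2))
```

The two titles share 2 of 4 distinct words ("chief" and "officer"), so the score is 2/4. A threshold on this score is one simple way to decide whether two titles belong in the same cluster.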
59

k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to
compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Ki, effectively
creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and
reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each
iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between
iterations. Generally speaking, relatively few iterations are required for convergence.
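
The four steps above can be sketched in plain Python for 2-D points. This is a teaching sketch under stated assumptions, not the implementation used in the book's examples: it picks the initial centroids from the data points themselves (a common variation on "random points in the data space"), uses a fixed seed for repeatability, and leaves the centroid of an empty cluster where it is.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, seed=0, max_iterations=100):
    """Cluster 2-D points; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k initial centroids (here, k distinct data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # leave an empty cluster's centroid where it is
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The cluster labels (0 or 1) depend on which points were picked initially, so compare group membership rather than the label values themselves.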
60

k-Means: Initialize
61

k-Means: Step 1
62

k-Means: Step 2
63

k-Means: Step 3
64

k-Means: (Fast-Forward) Step 9
65

Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
66

Cartograms
67

Unless you use a Dorling Cartogram
68

Social Media Analysis Framework

Remember: Use the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
69

Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API
connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/export-settings
and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code

Create a Bing Maps portal account and get your API key for Examples 3-8 and
beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...
70

Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
71

Module 5: Open Hack
72

Objectives

To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness
73

Social Media Analysis Framework

Remember:
Aspire
Acquire
Analyze
Summarize
74

Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)
75

Final Q&A; Wrap Up
76

Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts

Mining Social Web APIs with IPython Notebook (Strata 2013)

  • 1. 1 Mining Social Web APIs with IPython Notebook Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com New York City - 28 October 2013
  • 3. 3 Hello, My Name Is ... Matthew Background in Computer Science Data mining & machine learning CTO @ Digital Reasoning Systems Data mining; machine learning Author @ O'Reilly Media 5 published books on technology Principal @ Zaffra Selective boutique consulting
  • 4. 4 Transforming Curiosity Into Insight An open source software (OSS) project http://bit.ly/MiningTheSocialWeb2E A book http://bit.ly/135dHfs Accessible to (virtually) everyone Virtual machine with turn-key coding templates for data science experiments Think of the book as "premium" support for the OSS project
  • 5. 5 The Social Web Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+ 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  • 6. 6 Overview Intro (5 mins) Module 1 - Virtual Machine Setup (10 mins) Module 2 - Mining Twitter (40 mins) Module 3 - Mining Facebook (35 mins) BREAK (30 mins) Module 4 - Mining LinkedIn (40 mins) Module 5 - Open Hack (40 mins) Final Q&A; Wrap Up (10 mins)
  • 7. 7 Module Format ~10-15 minutes of exposition I talk; you listen ~25-30 minutes of independent (or collaborative) work You hack while I walk around and help you ~5 minutes of Q&A You ask; I try to answer
  • 8. 8 Workshop Objective To send you away as a social web hacker Broad working knowledge popular social web APIs Hands-on experience hacking on social web data with a common toolkit Not to listen to me talk to you for 3 hours
  • 9. 9 Just a Few More Things This workshop is... An adaptation of Mining the Social Web, 2nd Edition More of a guided hacking session where you follow along (vs a preso) Wider than it is deeper There's only so much you can do in a few hours I'm available 24/7 this week (and beyond) to help you be successful
  • 10. 10 Assumptions At some point in your life, you have Programmed with Python Worked with JSON Made requests and processed responses to/from web servers Or you want to learn to do these things now... And you're a quick learner
  • 11. 11 Module 1: Virtual Machine Setup
  • 12. 12 Why do you need a VM? To save time Because installation and configuration management is harder than it first appears So that you can focus on the task at hand instead So that I can support you regardless of your hardware and operating system
  • 13. 13 But I can do all of that myself... True... If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages Including scientific computing tools that require underlying C/C++ code to be compiled Which requires specific versions of developer libraries to be installed You get the idea...
  • 14. 14 The Virtual Machine Experience Vagrant A nice abstraction around virtual machine providers One ring to rule them all Virtualbox, VMWare, AWS, ... IPython Notebook The easiest way to program with Python A better REPL (interpreter) Great for hacking
  • 15. 15 What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile Starts up a Virtualbox instance Uses Chef to provision it Installs OS patches/updates Installs MTSW software dependencies Starts IPython Notebook server on port 8888
  • 16. 16 Why Should I Use IPython Notebook? Because it's great for hacking And hacking is usually the first step Because it's great for collaboration Sharing/publishing results is trivial Because the UX is as easy as working in a notepad Think of it as "executable paper"
  • 19. 19 VM Quick Start Instructions Go to http://MiningTheSocialWeb.com/quick-start/ Follow the instructions And watch the screencasts! Basically: Install Virtualbox & Vagrant Run "vagrant up" in a terminal to start a guest VM Then, go to http://localhost:8888 on your host machine's web browser
  • 20. 20 What Could Be Easier? A hosted version of the VM! But only for a few hours during this workshop Because it costs money to run these servers Go to <the URL provided in the session> and pick a machine Do not share the URLs outside of this workshop! Please don't try to hack the machines I'll verbally provide the connection details (port and password)
  • 21. 21 A Hosted Virtual Machine Yes, please. Is it free? Perhaps... ...Sign up for the AWS free tier at http://aws.amazon.com/free/ But not right now. Do it later Stand by for the step-by-step instructions on how to do it I'll publish a post on it in the next day or so
  • 24. 24 Objectives Be able to identify Twitter primitives Understand tweet metadata and how to use it Learn how to extract entities such as user mentions, hashtags, and URLs from tweets Apply techniques for performing frequency analysis with Python Be able to plot histograms of Twitter data with IPython Notebook
  • 25. 25 Twitter Primitives Account Types: "Anything" "Following" Relationships Favorites Retweets Replies (Almost) No Privacy Controls
  • 26. 26 API Requests RESTful requests Everything is a "resource" You GET, PUT, POST, and DELETE resources Standard HTTP "verbs" Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining Streaming API filters JSON responses Cursors (not quite pagination)
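The example request on this slide can be assembled programmatically. The following is a minimal sketch that only builds the GET URL with the standard library; `build_resource_url` is a hypothetical helper name, and an actual call to the Twitter API would additionally need OAuth credentials, which are not shown here.

```python
from urllib.parse import urlencode

# API root and version as shown in the slide's example request.
API_ROOT = "https://api.twitter.com/1.1"


def build_resource_url(resource, **params):
    """Return the GET URL for a resource such as 'statuses/user_timeline'.

    Hypothetical helper for illustration only; it does not sign or send
    the request.
    """
    url = f"{API_ROOT}/{resource}.json"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


url = build_resource_url("statuses/user_timeline", screen_name="SocialWebMining")
print(url)
```

Treating each endpoint as a resource path plus query parameters mirrors the "everything is a resource" idea: only the path and the parameters change between requests.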
  • 27. 27 Twitter is an Interest Graph Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina
  • 28. 28 What's in a Tweet? 140 Characters ... ... Plus ~5KB of metadata! Authorship Time & location Tweet "entities" Replying, retweeting, favoriting, etc.
  • 29. 29 What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters @usermentions #hashtags URLs multiple variations (financial) symbols stock tickers media
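As a rough illustration of why entities are "easy to get at", the sketch below pulls user mentions, hashtags, and URLs out of tweet dictionaries shaped like the v1.1 JSON responses and counts the hashtags. The two sample tweets are hand-written stand-ins rather than real API output, and `extract_entities` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written stand-ins shaped like v1.1 tweet JSON; not real API output.
sample_tweets = [
    {
        "text": "Hacking with @ptwobrussell at #strataconf http://t.co/abc",
        "entities": {
            "user_mentions": [{"screen_name": "ptwobrussell"}],
            "hashtags": [{"text": "strataconf"}],
            "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
        },
    },
    {
        "text": "IPython Notebook rocks #StrataConf #python",
        "entities": {
            "user_mentions": [],
            "hashtags": [{"text": "StrataConf"}, {"text": "python"}],
            "urls": [],
        },
    },
]


def extract_entities(tweets):
    """Collect screen names, hashtags, and expanded URLs from tweet dicts."""
    mentions, hashtags, urls = [], [], []
    for tweet in tweets:
        entities = tweet.get("entities", {})
        mentions += [m["screen_name"] for m in entities.get("user_mentions", [])]
        hashtags += [h["text"] for h in entities.get("hashtags", [])]
        urls += [u["expanded_url"] for u in entities.get("urls", [])]
    return mentions, hashtags, urls


mentions, hashtags, urls = extract_entities(sample_tweets)

# Frequency analysis: lowercase hashtags so case variants count together.
hashtag_counts = Counter(tag.lower() for tag in hashtags)
print(hashtag_counts.most_common())
```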
  • 31. 31 Histograms A chart that is handy for frequency analysis They look like bar charts...except they're not bar charts Each value on the x-axis is a range (or "bin") of values Not categorical data Each value on the y-axis is the combined frequency of values in each range
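To make the binning idea concrete, here is a small sketch that groups numeric values into fixed-width ranges and prints a text histogram. The retweet counts are made up for illustration, `histogram` is a hypothetical helper name, and empty bins are simply omitted; in the notebook you would more likely plot with a charting library.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower edge to the count of values in [edge, edge + bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up retweet counts, purely for illustration.
retweet_counts = [0, 1, 1, 3, 4, 7, 8, 12, 15, 15, 22, 41]
bin_width = 10
bins = histogram(retweet_counts, bin_width)

# Each row is a range of values (x-axis); the bar length is its frequency (y-axis).
for edge, frequency in bins.items():
    print(f"{edge:>3}-{edge + bin_width - 1:<3} {'#' * frequency}")
```

Unlike a bar chart over categories, each row here covers a range of values, so changing `bin_width` changes the shape of the chart without changing the data.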
  • 34. 34 Social Media Analysis Framework A memorable four step process to guide data science experiments: Aspire To test a hypothesis (answer a question) Acquire Get the data Analyze Count things Summarize Plot the results
  • 35. 35 Exercises Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook Fill in Example 1-1 with credentials and begin work Execute each example sequentially Customize queries Explore tweet metadata; count tweet entities; plot histograms of results Explore the "Chapter 9 (Twitter Cookbook)" notebook Think of it as a collection of building blocks
  • 37. 37 Objectives Be able to identify Facebook primitives Learn about Facebook’s Social Graph API and how to make API requests Understand how Open Graph protocol extends Facebook's Social Graph API Be able to analyze likes from Facebook pages and friends
  • 38. 38 Facebook Primitives Account Types: People & Pages Mutual Connections Likes Shares Comments Extensive Privacy Controls
  • 39. 39 API Requests Social Graph API requests Not RESTful but easy to learn and use Special "field expansion" syntax Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10)) JSON responses Traditional pagination
  • 40. 40 Facebook is an Interest Graph Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina
  • 41. 41 Facebook API Explorer Go to https://developers.facebook.com/tools/explorer Really, go there right now...
  • 45. 45 Explore Facebook Pages Names of pages MiningTheSocialWeb CrossFit OReilly Web URLs (OGP extensions to Facebook's Social Graph) http://www.imdb.com/title/tt0117500
  • 46. 46 Social Media Analysis Framework Recall the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 49. 49 Exercises Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook Paste the value and execute the cell just before Example 2-1 Execute examples sequentially (try to at least make it to Example 2-10) Analyze your likes, your friends and likes from pages of interest If you have time... Remaining examples
  • 51. 51 Objectives Learn about LinkedIn’s Developer Platform Understand how clustering works A fundamental type of machine learning Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location Visualize geographic data with cartograms
  • 52. 52 LinkedIn Primitives Account Types: People, Companies The data seems "more closely held" than Facebook or Twitter No FOAF visibility Richest data source Profile descriptions from mutual connections A little messier than it first appears Not necessarily a bad thing
  • 53. 53 API Requests (Weirdly) RESTful Requests Not really RESTful Field selector syntax http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url) XML responses CSV address book download
  • 54. 54 Is LinkedIn an Interest Graph? Fundamentally: yes. But not so much at the developer API level Less trivial to find some of the "pivots" No Skills API (yet) But the data is there (mostly in profile descriptions) for your direct connections Companies, job titles, job descriptions Lots of richness is tucked away in human language data
  • 55. 55 Clustering An unsupervised machine learning technique Think: an algorithm that organizes the data into partitions
  • 57. 57 3 Steps to Clustering Your Data Normalization Compare (similarity/distance measurement) n-grams, edit distance, and Jaccard are common, but your imagination is the limit Why can't you just compare everything to everything? Dimensionality Reduction Ideally, your clustering algorithm will mitigate the pain k-means is among the most common clustering techniques in use
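The first two steps (normalize, then compare) can be sketched with job titles and Jaccard similarity. This is a minimal illustration, not the book's implementation: the abbreviation table, the `normalize` and `jaccard` helper names, and the sample titles are all assumptions made for the example.

```python
import re

# Illustrative abbreviation table; a real one would be much larger.
ABBREVIATIONS = {
    "sr": "senior",
    "jr": "junior",
    "cto": "chief technology officer",
}


def normalize(title):
    """Lowercase a job title, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return set(expanded)


def jaccard(title_a, title_b):
    """Jaccard similarity: size of the token intersection over size of the union."""
    a, b = normalize(title_a), normalize(title_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard("Sr. Software Engineer", "Senior Software Engineer"))
print(jaccard("Software Engineer", "Data Engineer"))
```

Comparing every title to every other title takes on the order of n² comparisons, which is why the slide raises dimensionality reduction as the third step.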
  • 59. 59 k-Means Explained 1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk. 2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons. 3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.) 4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
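The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the book's code: the initial values are sampled from the data points themselves (a common variant of picking random points in the data space), the `k_means` and `squared_distance` names are hypothetical, and the sample points are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1 (variant): use k randomly chosen data points as initial values.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its current members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

The result depends on the initial values in general, so real-world use often runs the algorithm several times with different seeds and keeps the best partition.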
  • 65. 65 Geocoding Transforming a location to a set of coordinates Nashville, TN => (36.16783905029297, -86.77816009521484) A harder problem than it first appears The Bing API is especially generous Requires an account sign up: http://bingmapsportal.com Use the API key with the geopy package
  • 67. 67 Unless you use a Dorling Cartogram
  • 68. 68 Social Media Analysis Framework Remember: Use the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 69. 69 Exercises Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM A deviation from instructions in Example 3-6 is necessary for remote VMs See http://bit.ly/mtsw-ch03-helper-code Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond Try clustering your contacts in Example 3-12 Try Example 3-13 (visualizing data in Google Earth) at home...
  • 70. 70 Social Media Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+: 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  • 72. 72 Objectives To work on "loose ends" or areas of interest from previous modules To hack on code in notebooks not yet encountered To set up the virtual machine on your own box if you haven't yet To collaborate/talk and otherwise make the most of our togetherness
  • 73. 73 Social Media Analysis Framework Remember: Aspire Acquire Analyze Summarize
  • 74. 74 Recommendations Set up your own development environment if you haven't already Appendix A Text Mining & Natural Language Processing Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages) Graph Mining Chapter 7 (Mining GitHub) Analyzing Semantic Markup Chapter 8 (Mining the Semantically Marked-Up Web)
  • 76. 76 Free Stuff http://MiningTheSocialWeb.com Mining the Social Web 2E Chapter 1 (Chimera) http://bit.ly/13XgNWR Source Code (GitHub) http://bit.ly/MiningTheSocialWeb2E http://bit.ly/1fVf5ej (numbered examples) Screencasts (Vimeo) http://bit.ly/mtsw2e-screencasts