Mining Social Web APIs with IPython Notebook (Strata 2013)

1

Mining Social Web APIs
with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
New York City - 28 October 2013

3

Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting

4

Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project

5

The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)

6

Overview
Intro (5 mins)
Module 1 - Virtual Machine Setup (10 mins)
Module 2 - Mining Twitter (40 mins)
Module 3 - Mining Facebook (35 mins)
BREAK (30 mins)
Module 4 - Mining LinkedIn (40 mins)
Module 5 - Open Hack (40 mins)
Final Q&A and Wrap Up (10 mins)

7

Module Format
~10-15 minutes of exposition
I talk; you listen

~25-30 minutes of independent (or collaborative) work
You hack while I walk around and help you

~5 minutes of Q&A
You ask; I try to answer

8

Workshop Objective

To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit

Not to listen to me talk to you for 3 hours

9

Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a preso)
Wider than it is deep
There's only so much you can do in a few hours

I'm available 24/7 this week (and beyond) to help you be successful

10

Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers

Or you want to learn to do these things now...
And you're a quick learner

11

Module 1: Virtual Machine Setup

12

Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system

13

But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed

You get the idea...

14

The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...

IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking

15

What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888

16

Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step

Because it's great for collaboration
Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad
Think of it as "executable paper"

19

VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!

Basically:
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser

20

What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers

Go to <the URL provided in the session> and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
I'll verbally provide the connection details (port and password)

21

A Hosted Virtual Machine
Yes, please.
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later

Stand by for the step-by-step instructions on how to do it
I'll publish a post on it in the next day or so

24

Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook

25

Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls

26

API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters
JSON responses
Cursors (not quite pagination)
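
A minimal sketch of the request format and of cursoring, written for Python 3. The helper names `build_url` and `iter_cursored` are made up for illustration, and `fetch` is a hypothetical callable standing in for an authenticated HTTP client; real API v1.1 requests also need OAuth credentials, which this sketch leaves out. It assumes the usual cursor convention: start at -1 and stop when the response reports a `next_cursor` of 0.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build a GET URL for a resource such as 'statuses/user_timeline'."""
    query = urlencode(sorted(params.items()))
    return "{0}/{1}.json?{2}".format(API_ROOT, resource, query)


def iter_cursored(fetch, resource, items_key, **params):
    """Yield items from a cursored resource, one page at a time.

    `fetch` is a hypothetical callable: it takes a URL and returns the
    decoded JSON response as a dict.
    """
    cursor = -1  # -1 asks for the first page
    while cursor != 0:  # 0 means there are no more pages
        page = fetch(build_url(resource, cursor=cursor, **params))
        for item in page[items_key]:
            yield item
        cursor = page["next_cursor"]


# Reproduces the example URL from the slide
print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
```

Unlike page numbers, a cursor is an opaque value handed back by the server, so the only safe way to walk a result set is to follow `next_cursor` from one response to the next.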

27

Twitter is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, Nina]

28

What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.

29

What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations

(financial) symbols
stock tickers

media

30

Data Mining Is...

Counting
Comparing
Filtering
Ranking

31

Histograms

A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
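
To make the binning concrete, here is a small sketch that groups values into fixed-width ranges and draws a text histogram. The function name `histogram`, the bin width, and the retweet counts are all assumptions chosen for illustration; in the notebook you would more likely hand the raw values to a plotting library.

```python
from collections import Counter


def histogram(values, bin_width):
    """Return sorted (bin_start, frequency) pairs for fixed-width bins."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return sorted(bins.items())


# Made-up retweet counts for nine tweets
retweet_counts = [0, 3, 7, 12, 15, 18, 41, 44, 95]
width = 10

for start, freq in histogram(retweet_counts, width):
    # x-axis: a range of values; y-axis: how many values fell in that range
    print("{0:>3}-{1:<3} {2}".format(start, start + width - 1, "#" * freq))
```

Note that this sketch skips empty bins, whereas a plotted histogram would show them as gaps along the x-axis.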

32

Plotting with IPython Notebook

33

Example: Histogram of Retweets

34

Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)

Acquire
Get the data

Analyze
Count things

Summarize
Plot the results

35

Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks

37

Objectives

Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how Open Graph protocol extends Facebook's Social Graph
API

Be able to analyze likes from Facebook pages and friends

38

Facebook Primitives

Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls

39

API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))

JSON responses
Traditional pagination
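
A sketch of building a field-expansion request and of walking paginated results, written for Python 3. The helper names `build_graph_url` and `iter_pages` are invented here, and `fetch` is a hypothetical callable that returns the decoded JSON for a URL. The sketch assumes the common Graph API response shape, with a `data` list and a `paging.next` URL; treat the optional `access_token` query parameter as an assumption too.

```python
from urllib.parse import urlencode

GRAPH_ROOT = "http://graph.facebook.com"


def build_graph_url(object_id, fields, access_token=None):
    """Build a Graph API URL that uses field expansion syntax."""
    params = {"fields": fields}
    if access_token:
        params["access_token"] = access_token
    # Keep commas and parentheses readable instead of percent-encoding them
    return "{0}/{1}/?{2}".format(GRAPH_ROOT, object_id, urlencode(params, safe=",()"))


def iter_pages(fetch, url):
    """Yield every item in 'data', following 'paging.next' links."""
    while url:
        page = fetch(url)
        for item in page.get("data", []):
            yield item
        url = page.get("paging", {}).get("next")


# Reproduces the example URL from the slide
print(build_graph_url("ptwobrussell", "id,name,friends.fields(likes.limit(10))"))
```

Pagination here is "traditional" in the sense that each response carries the complete URL of the next page, so the client never has to assemble it.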

40

Facebook is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, Nina]

41

Facebook API Explorer

Go to https://developers.facebook.com/tools/explorer
Really, go there right now...

45

Explore Facebook Pages
Names of pages
MiningTheSocialWeb
CrossFit
OReilly

Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500

46


Recall the same four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize

47

Embedded Visualizations with IPython NB

48

Social Network Diagram with D3

49

Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2
(Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends and likes from pages of interest
If you have time...
Remaining examples

51

Objectives
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning

Be able to employ geocoding services to arrive at a set of coordinates
from a textual reference to a location
Visualize geographic data with cartograms

52

LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing

53

API Requests

(Weirdly) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)

XML responses
CSV address book download

54

Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data

55

Clustering

An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions

56

Example: Clustered Job Titles

57

3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
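
A compact sketch of the first two steps applied to job titles, written for Python 3. It normalizes titles, compares them with Jaccard similarity over word sets, and groups them greedily. The abbreviation table, the 0.5 threshold, and the function names are assumptions made for illustration. The greedy pass compares each title only to the first member of each existing cluster, which is one simple way to avoid comparing everything to everything.

```python
# Hypothetical abbreviation table used for normalization
REPLACEMENTS = {
    "sr.": "senior",
    "sr": "senior",
    "cto": "chief technology officer",
}


def normalize(title):
    """Lowercase a title and expand a few known abbreviations."""
    words = title.lower().replace(",", " ").split()
    return " ".join(REPLACEMENTS.get(w, w) for w in words)


def jaccard(a, b):
    """Jaccard similarity of the word sets of two normalized titles."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def greedy_cluster(titles, threshold=0.5):
    """Put each title in the first cluster whose seed is similar enough."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if jaccard(normalize(title), normalize(cluster[0])) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])  # no match: start a new cluster
    return clusters


titles = ["Sr. Software Engineer", "Senior Software Engineer",
          "Software Engineer", "Chief Technology Officer", "CTO"]
for cluster in greedy_cluster(titles):
    print(cluster)
```

The result depends on the order of the input and on the threshold, so it is worth trying a few values before trusting the clusters.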

59

k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to
compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Ki, effectively
creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and
reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each
iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between
iterations. Generally speaking, relatively few iterations are required for convergence.
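
The four steps above can be sketched in plain Python 3. This is an illustrative version, not the code from the notebooks: the function names are made up, and step 1 is simplified to sampling k of the input points as starting centroids rather than picking arbitrary points in the data space.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iterations=100):
    """Return (centroids, assignments) for a list of point tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1 (simplified, see lead-in)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons)
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

The cluster labels themselves are arbitrary; what matters is which points end up sharing a label.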

64

k-Means: (Fast-Forward) Step 9

65

Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
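
A hedged sketch of how the API key and geopy fit together. It assumes a recent geopy release, where `Bing(api_key)` builds a geocoder whose `geocode()` method returns an object with `latitude` and `longitude` attributes, or `None` when nothing matches; older releases returned a different shape. The wrapper names `make_bing_geocoder` and `to_coordinates` are made up for illustration.

```python
def make_bing_geocoder(api_key):
    """Create a Bing geocoder (needs the third-party geopy package)."""
    from geopy.geocoders import Bing  # imported lazily: pip install geopy
    return Bing(api_key)


def to_coordinates(geocoder, place):
    """Return (latitude, longitude) for a place name, or None if not found."""
    location = geocoder.geocode(place)
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Usage (requires network access and your own key from the Bing Maps portal):
# geocoder = make_bing_geocoder("YOUR_BING_API_KEY")
# print(to_coordinates(geocoder, "Nashville, TN"))
```

Passing the geocoder in as an argument makes it easy to swap providers, or to substitute a stub when you have no network.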

67

Unless you use a Dorling Cartogram

68


Remember: Use the same four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize

69

Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API
connection and follow along with the first few examples
Download your connections as a CSV file from
http://www.linkedin.com/people/export-settings and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code

Create a Bing Maps portal account and get your API key for Examples 3-8 and
beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...

70

Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)

72

Objectives

To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness

73


Remember:
Aspire
Acquire
Analyze
Summarize

74

Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)

76

Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts
