Slides from my Strata / Hadoop World 2013 (NYC) hands-on workshop.
Workshop Description from http://strataconf.com/stratany2013/public/schedule/detail/30863
Social web properties such as Twitter, Facebook, LinkedIn, and Google+ have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from Mining the Social Web (2nd Edition) with IPython Notebook.
Each module consists of a brief period in which each attendee will customize the corresponding notebook for the module with their own account credentials with the remainder of the module devoted to learning what data is available from the API and exercises demonstrating analysis of the data—all from a pre-populated IPython Notebook. Even attendees with minimal programming experience should be able to walk away from this workshop with a working knowledge of the material and be equipped with sample code that can be easily repurposed given the design of this tutorial.
Time will be set aside at the end of each module’s follow-along presentation for attendees to hack on the code, discuss examples, and ask any lingering questions.
3. 3
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
4. 4
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
5. 5
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+ 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
7. 7
Module Format
~10-15 minutes of exposition
I talk; you listen
~25-30 minutes of independent (or collaborative) work
You hack while I walk around and help you
~5 minutes of Q&A
You ask; I try to answer
8. 8
Workshop Objective
To send you away as a social web hacker
Broad working knowledge popular social web APIs
Hands-on experience hacking on social web data with a common toolkit
Not to listen to me talk to you for 3 hours
9. 9
Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a preso)
Wider than it is deeper
There's only so much you can do in a few hours
I'm available 24/7 this week (and beyond) to help you be successful
10. 10
Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers
Or you want to learn to do these things now...
And you're a quick learner
12. 12
Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system
13. 13
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
14. 14
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
Virtualbox, VMWare, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
15. 15
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a Virtualbox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
16. 16
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
19. 19
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install Virtualbox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
20. 20
What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers
Go to <the URL provided in the session> and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
I'll verbally provide the connection details (port and password)
21. 21
A Hosted Virtual Machine
Yes, please.
Is it free?
Perhaps...
...Sign-up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later
Standby for the step-by-step instructions on how to do it
I'll publish a post on it in the next day or so
24. 24
Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
26. 26
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?
screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
27. 27
Twitter is an Interest Graph
Johnny
Araya
Roberto
Mercedes
Rodolfo
Hernández
Ana
Jorge
Nina
28. 28
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
29. 29
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
31. 31
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
34. 34
Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
35. 35
Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
37. 37
Objectives
Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how Open Graph protocol extends Facebook's Social Graph
API
Be able to analyze likes from Facebook pages and friends
39. 39
API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?
fields=id,name,friends.fields(likes.limit(10))
JSON responses
Traditional pagination
40. 40
Facebook is an Interest Graph
Johnny
Araya
Roberto
Mercedes
Rodolfo
Hernández
Ana
Jorge
Nina
45. 45
Explore Facebook Pages
Names of pages
MiningTheSocialWeb
CrossFit
OReilly
Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500
46. 46
Social Media Analysis Framework
Recall the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
49. 49
Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2
(Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends and likes from pages of interest
If you have time...
Remaining examples
51. 51
Objectives
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning
Be able to employ geocoding services to arrive at a set of coordinates
from a textual reference to a location
Visualize geographic data with cartograms
52. 52
LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing
53. 53
API Requests
(Weirdly) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)
XML responses
CSV address book download
54. 54
Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data
57. 57
3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
59. 59
k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to
compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Kn—effectively
creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and
reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each
iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between
iterations. Generally speaking, relatively few iterations are required for convergence.
65. 65
Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
68. 68
Social Media Analysis Framework
Remember: Use the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
69. 69
Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API
connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/
export-settings and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code
Create a Bing Maps portal account and get your API key for Examples 3-8 and
beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...
70. 70
Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+ 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
72. 72
Objectives
To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To setup the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness
74. 74
Recommendations
Setup your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)