Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.
1. Discovering Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with
Hohyon Ryu & Nicholas Woodward
Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
2. Memes
• Short, similar phrases found in many different sources
– Re-use, shared temporal context
• Evolutionary mutation & propagation as they transmit from source to source
• Reveal implicit connections between the sources, individuals and communities involved
March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2
4. Google/NYT Living Stories
livingstories.googlelabs.com
5. Related Work
• Jure Leskovec et al. (KDD’09): blogs
– quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
– Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
– Mine “popular passages” from complete texts
– MapReduce “shingling” approach
– Popular passages found are local, not global
6. MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
– 48 Dell R610 nodes
• 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
• 48GB RAM with ~1.5TB disk per node
• With 1 NameNode & 47 DataNodes (8 cores each), up to 376 parallel Mappers
– 16 Dell R710 (same CPU configuration)
• 144GB RAM with ~0.8TB disk per node
– Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
7. Datasets
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset
8. Processing Architecture
• Blogs08 Test Collection
– 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
– Decruft & Language Identification
– HTML Strip & Near-Duplicate Detection
– 16M posts, 960GB
• Common Phrase Extraction (3 MapReduce Stages)
– 15K posts, 43GB
• Common Phrase Ranking (1 MapReduce Process)
– Daily Top 200 Phrases
– 6.2M phrases, 2GB
• Common Phrase Clustering (1 MapReduce Process)
– 75K phrases, 2.6MB
• Meme Browser
– 68K memes
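The "Daily Top 200 Phrases" ranking step can be pictured as a per-day top-k selection. The slide does not say how phrases are scored, so the sketch below assumes each record already carries a numeric score (for example, the number of distinct posts containing the phrase that day). The function name `daily_top_phrases` and the record layout are hypothetical, and the code runs in plain Python rather than on Hadoop.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout (not from the slides): (day, phrase, score).
Record = Tuple[str, str, int]


def daily_top_phrases(records: Iterable[Record], k: int = 200) -> Dict[str, List[str]]:
    """Keep the k highest-scoring phrases for each day, best first."""
    by_day = defaultdict(list)
    for day, phrase, score in records:
        by_day[day].append((score, phrase))
    # nlargest sorts by score first; ties fall back to the phrase text.
    return {
        day: [phrase for _, phrase in heapq.nlargest(k, items)]
        for day, items in by_day.items()
    }


if __name__ == "__main__":
    toy = [
        ("2008-01-01", "a", 3),
        ("2008-01-01", "b", 5),
        ("2008-01-01", "c", 1),
        ("2008-01-02", "d", 2),
    ]
    print(daily_top_phrases(toy, k=2))
```

In a MapReduce setting the day would presumably serve as the key, with the top-k selection done in the reducer; that mapping is an assumption, not something the slide states.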
9. Creating the Shingle Table
• e.g. trigram shingles for: what do you think of
– what do you
– do you think
– you think of
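The trigram example above can be reproduced with a few lines of Python. This is a minimal sketch: lowercasing and whitespace tokenization are assumptions, and the `shingle_table` helper (mapping each shingle to the set of documents containing it) is one hypothetical reading of what the "shingle table" holds, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shingles(text: str, n: int = 3) -> List[str]:
    """Return all contiguous n-word shingles of `text`, in order."""
    tokens = text.lower().split()  # assumption: lowercase + whitespace tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shingle_table(docs: Dict[str, str], n: int = 3) -> Dict[str, Set[str]]:
    """Hypothetical shingle table: shingle -> ids of documents containing it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for shingle in shingles(text, n):
            table[shingle].add(doc_id)
    return dict(table)


if __name__ == "__main__":
    print(shingles("what do you think of"))
    # ['what do you', 'do you think', 'you think of']
```

Texts shorter than `n` words simply produce no shingles.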
11. Common Phrase (CP) Detection
• Mapper: Merge adjacent shingles into memes (ignoring small gaps)
• Reducer: Find the set of documents in which each meme occurs
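The mapper and reducer roles can be sketched in plain Python as follows. Several details are assumptions not given on the slide: the mapper receives a precomputed set of "common" shingles, a gap is counted as the number of non-common shingle positions between two common ones, and `max_gap` is a hypothetical parameter. The functions `map_phrases` and `reduce_phrases` are illustrative names, not Hadoop APIs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def map_phrases(doc_id: str, text: str, common: Set[str],
                n: int = 3, max_gap: int = 1) -> Iterator[Tuple[str, str]]:
    """Merge runs of adjacent common shingles into phrases; emit (phrase, doc_id)."""
    tokens = text.lower().split()
    # Start positions of shingles that appear in the common-shingle set.
    hits = [i for i in range(len(tokens) - n + 1)
            if " ".join(tokens[i:i + n]) in common]
    if not hits:
        return
    start = prev = hits[0]
    for pos in hits[1:]:
        if pos - prev - 1 > max_gap:  # gap too large: close the current phrase
            yield " ".join(tokens[start:prev + n]), doc_id
            start = pos
        prev = pos
    yield " ".join(tokens[start:prev + n]), doc_id


def reduce_phrases(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group mapper output by phrase: phrase -> set of documents containing it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for phrase, doc_id in pairs:
        table[phrase].add(doc_id)
    return dict(table)


if __name__ == "__main__":
    common = {"what do you", "do you think", "you think of"}
    docs = {"a": "so what do you think of this", "b": "what do you think of"}
    pairs = [p for d, t in docs.items() for p in map_phrases(d, t, common)]
    print(reduce_phrases(pairs))
```

On a real cluster the grouping done by `reduce_phrases` would be handled by the framework's shuffle, with the reducer only collecting document ids per key.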
18. Thank You!
Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease
• Joint Work with
– Hohyon (Will) Ryu
• InfoChimps (Summer’11)
• Indeed.com (Summer’12)
– Nicholas Woodward (TACC)
• Latin American Network Information Center (LANIC)
Support
• FCT of Portugal / UT CoLab
• Amazon Web Services
• UT Austin LIFT Award
• John P. Commons Fellowship