Discovering Memes in Social Media

                              Matt Lease
                        School of Information
                      University of Texas at Austin
                        ml@ischool.utexas.edu
                              @mattlease

                             Joint Work with
                     Hohyon Ryu & Nicholas Woodward


Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
Memes
• Short, similar phrases found in
  many different sources
  – Re-use, shared temporal context
• Evolutionary mutation &
  propagation as they are transmitted
  from source to source
• Reveal implicit connections
  between the sources, individuals
  and communities involved
  March 21, 2012   ACM SIGKDD - Austin Chapter Meeting   2
MemeBrowser & Critical Literacy




Google/NYT Living Stories




                 livingstories.googlelabs.com
Related Work
• Jure Leskovec et al. (KDD’09): blogs
     – quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
     – Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
     – Mine “popular passages” from complete texts
     – MapReduce “shingling” approach
     – Popular passages found are local, not global

MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
   – 48 Dell R610 nodes
         • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
         • 48GB RAM with ~1.5TB disk per node
         • With 1 NameNode & 47 DataNodes (8 cores each), up to 376 parallel Mappers
   – 16 Dell R710 (same CPU configuration)
         • 144GB RAM with ~0.8TB disk per node
   – Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
Datasets
• TREC Blogs08 Collection
     – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
     – 28M permalinks (January 2008 – January 2009)
     – 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
     – http://www.icwsm.org/data/
     – 44 million blog posts (August - September, 2008)
     – 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

Processing Architecture
                                                               Blogs08 Test Collection
                                                                  28M posts, 1.4TB
       Preprocessing (Pseudo-MapReduce)
       Decruft & Language Identification
       HTML Strip & Near-Duplicate Detection                       16M posts, 960GB



       Common Phrase Extraction
                                                                    15K posts, 43GB
       3 MapReduce Stages

       Common Phrase Ranking
       Daily Top 200 Phrases                                       6.2M phrases, 2GB
       1 MapReduce Process

       Common Phrase Clustering
                                                                   75K phrases, 2.6MB
       1 MapReduce Process

       Meme Browser
                                                                      68K memes


Creating the Shingle Table
• e.g. trigram shingles for: what do you think of

  – what do you
  – do you think
  – you think of
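A minimal sketch of the shingling step, assuming whitespace tokenization and a table that maps each shingle to the (document, position) pairs where it occurs. The function names `shingles` and `build_shingle_table` and the table layout are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the word n-gram shingles of `text`, in order of occurrence."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the (doc_id, position) pairs where it occurs.

    `docs` is a dict of doc_id -> text (an assumed input format).
    """
    table = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, shingle in enumerate(shingles(text, n)):
            table[shingle].append((doc_id, pos))
    return dict(table)


print(shingles("what do you think of"))
# ['what do you', 'do you think', 'you think of']
```

Shingles that occur in more than one document are the candidates for common phrases in the later stages.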




Grouping Shingles by Document
• Mapper: trivial grouping; Reducer: Identity
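One way to picture this stage is as a re-keying job, sketched below as a single-process simulation rather than a real Hadoop job. It assumes the shingle table maps each shingle to a list of (doc_id, position) pairs. That layout and the names `map_shingle_row` and `group_by_document` are hypothetical.

```python
from collections import defaultdict


def map_shingle_row(shingle, postings):
    """Mapper: re-key one shingle-table row by document."""
    for doc_id, pos in postings:
        yield doc_id, (pos, shingle)


def group_by_document(shingle_table):
    """Simulate the shuffle, then apply an identity reducer per document."""
    grouped = defaultdict(list)
    for shingle, postings in shingle_table.items():
        for doc_id, value in map_shingle_row(shingle, postings):
            grouped[doc_id].append(value)
    # Identity reducer: values pass through unchanged; they are only sorted
    # by position here so that the output is easy to read.
    return {doc_id: sorted(values) for doc_id, values in grouped.items()}
```

In a real MapReduce job the framework's shuffle performs the grouping, which is why the reducer can be the identity.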




Common Phrase (CP) Detection
• Mapper:
  Merge adjacent
  shingles into memes
  (ignoring small gaps)

• Reducer:
  Find set of
  documents in which
  each meme occurs
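The two steps above can be sketched as follows. This is a single-process illustration under stated assumptions, not the authors' implementation: the mapper is given a document's word list plus the start positions of its shingles that also occur elsewhere, and `max_gap` (a hypothetical parameter) is the number of missing shingle positions tolerated inside one phrase.

```python
from collections import defaultdict


def merge_shingles(words, common_positions, n=3, max_gap=1):
    """Mapper step: merge runs of common shingles into longer phrases.

    Positions that are at most `max_gap` shingles apart join the same run;
    the words inside the gap are kept in the merged phrase.
    """
    phrases = []
    run_start = prev = None
    for pos in sorted(common_positions):
        if prev is not None and pos - prev <= max_gap + 1:
            prev = pos                      # extend the current run
            continue
        if prev is not None:                # close the previous run
            phrases.append(" ".join(words[run_start:prev + n]))
        run_start = prev = pos              # start a new run
    if prev is not None:
        phrases.append(" ".join(words[run_start:prev + n]))
    return phrases


def phrase_documents(doc_phrases):
    """Reducer step: for each phrase, collect the documents containing it."""
    docs_by_phrase = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            docs_by_phrase[phrase].add(doc_id)
    return dict(docs_by_phrase)
```

For example, with the words of "so what do you think of this plan" and common trigram positions 1 and 3, a `max_gap` of 1 yields the single phrase "what do you think of", while a `max_gap` of 0 yields two separate trigrams.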
Ranking Memes




Clustering Memes
• Mapper:
  Single-link
  hierarchical
  clustering with
  cosine similarity
• Reducer:
  create/merge
  clusters
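A minimal single-machine sketch of single-link clustering with cosine similarity over sparse bag-of-words vectors. It relies on the fact that cutting a single-link hierarchy at a similarity threshold gives the connected components of the graph linking every pair at or above that threshold, so a union-find structure is enough. The threshold value, the word-count vectors, and the function names are assumptions, and the sketch does not reproduce the mapper/reducer split.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with similarity >= threshold is linked."""
    vectors = [Counter(p.split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)       # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())
```

The all-pairs loop is quadratic in the number of phrases, which is why indexing and distribution matter at scale.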


Efficiency: Meme Clustering



• From WEKA ARFF format to sparse representation
   – From ~96 hours → 11 hours
• Indexed vs. un-indexed
   – From 11 hours → 16 minutes (single core)
   – From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
   – From 11 hours → 34 minutes (un-indexed)
   – From 16 minutes → 3 minutes (indexed)
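The timings above can be turned into speedup factors with a few lines of arithmetic. The sketch below only restates the slide's numbers in minutes; the labels and the `speedup` helper are illustrative names.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of running time before an optimization to running time after."""
    return before_minutes / after_minutes


# (before, after) in minutes, taken from the timings listed above.
timings = {
    "ARFF to sparse representation": (96 * 60, 11 * 60),
    "indexing, single core": (11 * 60, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * 60, 34),
    "distribution, indexed": (16, 3),
}

for label, (before, after) in timings.items():
    print(f"{label}: {speedup(before, after):.1f}x")
```

Taken together, going from roughly 96 hours to 3 minutes is a factor of about 1,900, and indexing alone accounts for about 41x on a single core.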
Meme Browser: Original Interface




Meme Browser: Current Interface




Meme Evolution (Leskovec et al.’09)




Thank You!
• Joint Work with
  – Hohyon (Will) Ryu
     • InfoChimps (Summer’11)
     • Indeed.com (Summer’12)
  – Nicholas Woodward (TACC)
     • Latin American Network Information Center (LANIC)
• Matt Lease
  – ml@ischool.utexas.edu
  – www.ischool.utexas.edu/~ml
  – @mattlease
• Support
  – FCT of Portugal / UT CoLab
  – Amazon Web Services
  – UT Austin LIFT Award
  – John P. Commons Fellowship
