Subreddit Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
Find your Reddit Subculture
2007 - Impersonal Web
2017 - Personal Web
Reddit Comment Dataset
2 billion comments
1 million subreddits
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
Data Pipeline
Ingestion / Processing User Interface
Challenge 1: Data Size
Every month on Reddit: 60k subreddits, 3 million unique authors
● Reddit is too big to cluster directly!
● The raw clustering matrix has about 200 billion elements (60k x 3 million = 180 billion).
Solution 1a: Filtering
Every month on Reddit: 6k active subreddits, 30k active authors
● Filter for activity: 100 comments/month
● The active clustering matrix has about 200 million elements (6k x 30k = 180 million)
● Now about 1000 times faster to cluster
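The filtering step can be sketched as follows. This is a minimal illustration, not the author's pipeline code: the function name `filter_active`, the input layout, and the choice to apply the 100-comment threshold to both authors and subreddits are assumptions, since the slide does not say which side the threshold applies to.

```python
from collections import Counter

def filter_active(counts, min_comments=100):
    """counts maps (subreddit, author) -> number of comments in one month.

    Assumption: an author or subreddit is "active" when its monthly total
    reaches min_comments. Only pairs where both sides are active are kept.
    """
    per_author = Counter()
    per_subreddit = Counter()
    for (subreddit, author), n in counts.items():
        per_author[author] += n
        per_subreddit[subreddit] += n
    return {
        (s, a): n
        for (s, a), n in counts.items()
        if per_author[a] >= min_comments and per_subreddit[s] >= min_comments
    }

# Dense matrix sizes implied by the slide figures.
raw_elements = 60_000 * 3_000_000          # 180 billion, quoted as ~200 billion
active_elements = 6_000 * 30_000           # 180 million, quoted as ~200 million
reduction = raw_elements // active_elements  # factor of 1000
```

The factor of 1000 is the reduction in matrix size; the slide's "1000 times faster" follows if clustering cost grows roughly linearly with the number of matrix elements.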
Solution 1b: PCA
Every month on Reddit: 6k active subreddits, 300 shared interests
● PCA transforms author space into shared-interest space by finding correlations
● PCA shrinks dimensionality by another factor of 100 (30k active authors to 300 shared interests)
Challenge 2: Slow PCA
Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.
PCA scales as O(MI), where M is the number of matrix elements and I is the number of interests after PCA.
PCA takes over 80% of total time!
Solution 2: Random PCA
Use Facebook Research's randomized PCA (FBPCA, 2014) on a single node.
FBPCA is O(M ln(I)).
For 250 interests, FBPCA is 45 times faster! One FBPCA worker is 5x faster than 9 full PCA workers for an average-sized month.
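The 45x figure matches the ratio of the two complexity estimates: dividing O(MI) by O(M ln(I)) cancels M and leaves I / ln(I). A quick check of that arithmetic, assuming equal constant factors for both algorithms (a simplification; the helper name `estimated_speedup` is only for illustration):

```python
import math

def estimated_speedup(num_interests):
    """Ratio of the O(M*I) cost to the O(M*ln(I)) cost; M cancels out."""
    return num_interests / math.log(num_interests)

print(round(estimated_speedup(250), 1))  # 45.3
```

Spread over 9 workers, a 45x per-worker advantage leaves about 45 / 9 = 5x, which is consistent with the single-node comparison quoted above.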
Challenge 3: Finding K for K-Means Clustering
The number of clusters is not the same as the number of PCA shared interests.
Clustering can happen on more than one scale.
Football, Baseball, TV, Movies
Solution 3: Silhouette Analysis
Silhouette analysis reveals a clustering scale at small k.
It also reveals a second clustering scale of around 400 clusters in this case.
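For reference, the silhouette score behind this kind of analysis can be sketched in plain Python. For each point, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to any other cluster; the point's silhouette is (b - a) / max(a, b). The function below is an illustrative sketch under those standard definitions, not the code used in the project, and it assumes at least two clusters and no exactly coincident points.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        total += (b - a) / max(a, b)
    return total / len(points)
```

To look for clustering scales, one would run K-means for a range of k values, score each labeling with this function, and look for peaks in the score as a function of k.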
A Happy Medium
Too impersonal Too personalized
David Lyon
PhD Physics from the University of Illinois
Doing GPU simulations
I love hiking, table tennis, and astrophysics
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
Next Steps - Popular Topics by Cluster
Find the popular topics within each cluster using Term Frequency-Inverse Document Frequency (TF-IDF) or LDA.
Terms are the 1-grams and 2-grams used in each cluster, and the document frequency is computed over all of Reddit for that month.
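A rough sketch of the TF-IDF variant described above. The slide does not define what counts as a "document", so this example assumes each comment is one document, that term frequency is counted across a cluster's comments, and that a smoothed IDF is acceptable. The names `ngrams` and `top_terms` are hypothetical.

```python
import math
from collections import Counter

def ngrams(text):
    """Lowercased 1-grams and 2-grams from whitespace-split text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def top_terms(cluster_comments, all_comments, top_n=3):
    """Rank a cluster's terms by TF (within the cluster) times IDF (over all comments)."""
    doc_freq = Counter()
    for comment in all_comments:
        doc_freq.update(set(ngrams(comment)))  # count each term once per comment
    term_freq = Counter()
    for comment in cluster_comments:
        term_freq.update(ngrams(comment))
    n_docs = len(all_comments)
    # Smoothed IDF: a term present in every comment scores exactly 0.
    scores = {
        term: tf * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, tf in term_freq.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

Terms that are frequent inside a cluster but rare across the whole month rise to the top, while words used everywhere drop to a score of zero.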
Challenge 2:
Every month on Reddit: 6k active subreddits, 30k active authors
● Too many individual authors
● Need to cluster by shared interests, not by author
Random PCA
Complexity of PCA is O(mnk) for m rows, n input columns, and k output columns.
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (Halko, Martinsson, and Tropp, 2009)
Fast Randomized SVD (Facebook Research, 2014)
Complexity of Random PCA is O(mn ln(k)).
For k=100, Random PCA is more than 20x faster!
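The core idea of the randomized approach can be sketched in NumPy: project the matrix onto a small random subspace, orthonormalize, and run an exact SVD on the much smaller projected matrix. This is an illustrative sketch of the random range finder from the paper cited above, not the FBPCA implementation. It uses a Gaussian test matrix for simplicity, which does not by itself reach the ln(k) cost (the paper obtains that with a structured random matrix). The parameter names `oversample` and `n_iter` are assumptions for this example.

```python
import numpy as np

def randomized_svd(a, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of a using a random range finder."""
    rng = np.random.default_rng(seed)
    _, n = a.shape
    ell = min(n, k + oversample)
    # Sample the range of a with a random Gaussian test matrix.
    q, _ = np.linalg.qr(a @ rng.standard_normal((n, ell)))
    # Power iterations sharpen the basis when singular values decay slowly.
    for _ in range(n_iter):
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)
    # Exact SVD of the small projected matrix, then map back to the original space.
    u_small, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    return (q @ u_small)[:, :k], s[:k], vt[:k]
```

For PCA, the columns of the input would be centered first; the rows of `vt` then play the role of the principal directions.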
Before PCA
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 ... Auth999,999 Auth1,000,000
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
After PCA
Sub       Sporting  Fictional  Political
Football  80        2          1
Baseball  90        3          2
TV        6         80         77
Movies    2         80         20
Anatomy of a Reddit Comment
Body, Author, Date, Subreddit
Group by Month
Group by Subreddit
Count #comments by author per subreddit
Normalize authors so each author has mean = 0 and variance = 1
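The steps above can be sketched with the standard library alone. The tuple layout `(date, subreddit, author, body)`, the `YYYY-MM-DD` date format, and the function name `author_columns` are assumptions for illustration; the real dataset and pipeline may differ.

```python
from collections import Counter
from statistics import mean, pstdev

def author_columns(comments, month):
    """Build one standardized column of per-subreddit comment counts per author.

    comments: iterable of (date, subreddit, author, body) tuples, with the
    date as a 'YYYY-MM-DD' string (assumed format). month: 'YYYY-MM'.
    """
    # Group by month and subreddit, counting comments per (subreddit, author).
    counts = Counter(
        (subreddit, author)
        for date, subreddit, author, _body in comments
        if date.startswith(month)
    )
    subreddits = sorted({s for s, _ in counts})
    authors = sorted({a for _, a in counts})
    columns = {}
    for author in authors:
        col = [counts[(s, author)] for s in subreddits]
        mu, sigma = mean(col), pstdev(col)
        # Standardize to mean 0, variance 1; constant columns become all zeros.
        columns[author] = [(x - mu) / sigma if sigma else 0.0 for x in col]
    return subreddits, columns
```

Standardizing each author keeps very prolific commenters from dominating the correlations that PCA picks up.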
Growth in Number of Subreddits
40 subreddits
1 million subreddits
Week 4 Challenges
● Spark for iterative machine learning, because Spark can run MapReduce in memory
● By reducing the dimension of data,
● No streaming - clustering requires lots of data and clusters change slowly, but the time window was reduced from monthly to daily
Clustering is Universal
Galaxies cluster into superclusters of ~100k members
The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
Subreddit Clustering
Monthly graph from 10k subreddits x 2 million authors = 20 billion matrix entries
Drastically reduce the size of data using Principal Component Analysis, normalized so that larger subreddits aren't favored
Cluster in the reduced-dimensional space using K-means
Topics within Clusters based on relative frequency of 1-grams and
Social media brings us closer
Continual contact with over 1 billion people
We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend or ban from community!
● Online communities become bubbles isolated from each other
