Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petabytes and LOLs"

Social Media, Happiness,
Petabytes and LOLs
Roddy Lindsay, Data Scientist, Facebook

June 1, 2009

Lots of data is generated on Facebook
▪ 200 million active users
▪ More than 20 million users update their statuses at least once each day
▪ More than 850 million photos uploaded to the site each month
▪ More than 8 million videos uploaded each month
▪ More than 1 billion pieces of content (web links, news stories, blog
posts, notes, photos, etc.) shared each week
▪ More than 2.5 million events created each month
▪ More than 25 million active user groups exist on the site

Lots of data is generated on Facebook
▪ Undoubtedly a very rich data set (and large...we’re talking petabytes)
▪ Many different groups clamoring for data:
▪ Internal analysts
▪ FB Engineers
▪ Advertisers
▪ Page owners
▪ Platform/Connect developers
▪ Marketers
▪ Academics

Challenges
▪ How can Facebook satisfy all the different consumers of data?
▪ What are the challenges?
▪ 1. Infrastructure

Facebook’s Data Infrastructure
▪ Attempt 1: Oracle Data Warehouse (2005)
▪ Business analysts already familiar with tools, SQL
▪ Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP)
▪ e.g. growth by country and demographic
▪ When growth took off (2007), ETL processes to load and roll-up data
started taking a very long time
▪ A single machine (or even several machines) was not going to cut it much
longer for data volumes at that scale...

▪ Attempt 2: Hadoop (2007)
▪ Open-source framework for running Map-Reduce on a cluster of
commodity machines, as well as a distributed file system for long-term
storage
▪ Map-Reduce (invented at Google) provides a way to process large data sets
that scales linearly with the number of machines in the cluster... if your
data doubles in size, just buy twice as many computers
▪ Hadoop initially developed by Doug Cutting, now an Apache project led by
the Grid Computing team at Yahoo!
▪ Much faster ETL when transform and load is distributed across a
cluster
▪ Engineers able to write jobs in Java and Python
▪ Not a viable solution for analysts who can write SQL but not code
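
To make the Map-Reduce model above concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and the grouping step that a real cluster performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


word_counts = run_job(["happy birthday", "Happy friday", "birthday cake"])
print(word_counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is the property the slide refers to.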

▪ Attempt 3: Hive (2008)
▪ SQL-like query language, table partitioning schema, and metadata
store built on top of Hadoop
▪ Developed at Facebook, now an Apache subproject
▪ Also includes:
▪ Web interface for constructing queries on the fly without using a shell
▪ Live support for query problems from the data team
▪ Easy integration with charts and dashboards
▪ One-click scheduling
▪ CSV/Excel export

▪ Example: “Find the number of status updates mentioning ‘swine flu’
per day last month”

▪ SELECT a.date, count(1)
▪ FROM status_updates a
▪ WHERE a.status LIKE '%swine flu%'
▪ AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
▪ GROUP BY a.date
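
The query above can be tried on a laptop without a Hive cluster. The sketch below runs the same SELECT against an in-memory SQLite database from Python; SQLite stands in for Hive here, and the table contents are made-up sample rows, not Facebook data. An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-01", "worried about swine flu"),
        ("2009-05-01", "swine flu closed my school"),
        ("2009-05-02", "no more swine flu news please"),
        ("2009-05-02", "great weather today"),
        ("2009-06-01", "swine flu again"),  # outside the date range
    ],
)

query = """
    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
    ORDER BY a.date
"""
daily_counts = conn.execute(query).fetchall()
print(daily_counts)
```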

▪ Easily extendable to new operators
▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie”

▪ FROM (
▪ FROM status_updates b
▪ SELECT SENTIMENT(b.status, 'terminator') AS sentiment
▪ WHERE b.status LIKE '%terminator%'
▪ AND b.date >= '2009-05-01' AND b.date <= '2009-05-31') a
▪ SELECT a.sentiment, count(1)
▪ GROUP BY a.sentiment
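
The idea of extending the query language with a new operator can be illustrated with a toy user-defined function. The sketch below registers a hypothetical SENTIMENT function with SQLite from Python. This is an assumption-laden stand-in: it does not use Hive's extension mechanism, the word lists are invented, and the FROM-first Hive syntax is rewritten as an ordinary subquery because SQLite does not accept it.

```python
import sqlite3

# Tiny invented word lists; a real sentiment operator would be far richer.
POSITIVE = {"loved", "great", "awesome"}
NEGATIVE = {"hated", "boring", "awful"}


def sentiment(status, keyword):
    """Toy scorer: label a status by counting positive vs. negative words."""
    words = set(status.lower().split())
    if keyword.lower() not in words:
        return None
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


conn = sqlite3.connect(":memory:")
conn.create_function("SENTIMENT", 2, sentiment)  # name, arg count, callable
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-21", "loved terminator"),
        ("2009-05-22", "terminator was boring"),
        ("2009-05-23", "terminator was awesome"),
        ("2009-05-24", "seeing terminator tonight"),
    ],
)

query = """
    SELECT a.sentiment, count(1)
    FROM (
        SELECT SENTIMENT(b.status, 'terminator') AS sentiment
        FROM status_updates b
        WHERE b.status LIKE '%terminator%'
          AND b.date >= '2009-05-01' AND b.date <= '2009-05-31'
    ) a
    GROUP BY a.sentiment
    ORDER BY a.sentiment
"""
sentiment_counts = conn.execute(query).fetchall()
print(sentiment_counts)
```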

▪ Successfully decentralized the querying and consumption of data
across the company
▪ Instead of 10 dedicated data analysts, we trained a few hundred
▪ Everyone is able to answer 95% of his or her data questions with
minimal training
▪ Dedicated data scientists, instead of working on an endless queue of
ad-hoc requests, can spend their time performing complex analyses
and building scalable systems on top of Hadoop/Hive
▪ Machine Learning systems
▪ Rich reporting for clients + Page owners
▪ Text analytics

Facebook text analytics
▪ Lexicon (Spring 2008)
▪ Started as an intern project to test Hadoop
▪ First external deployment of a Hadoop-powered system at Facebook
(and one of the first anywhere)
▪ Simple idea: count the number of occurrences of words and bigrams
on Facebook Walls per day, plot them on a line graph
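
The "simple idea" behind Lexicon can be sketched in a few lines of Python. This is an illustration only, not the Lexicon implementation: the function name ngrams_per_day and the sample posts are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict


def ngrams_per_day(posts):
    """Count words and bigrams per day from (date, text) pairs."""
    counts = defaultdict(Counter)
    for date, text in posts:
        words = text.lower().split()
        counts[date].update(words)
        # Bigrams are adjacent word pairs, joined back into one string.
        counts[date].update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


posts = [
    ("2008-04-01", "happy birthday"),
    ("2008-04-01", "happy birthday to you"),
    ("2008-04-02", "party tonight"),
]
daily = ngrams_per_day(posts)

# One point per day for a chosen term: the data behind a line graph.
series = [(day, daily[day]["happy birthday"]) for day in sorted(daily)]
print(series)
```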

▪ “New” Lexicon (Fall 2008), beta preview
▪ Leveraged Hive’s structured metadata and the raw computational
power of a 600-node Hadoop cluster
▪ Slices by age, gender, region
▪ Sentiment analysis
▪ Common user interests
▪ Associations graph of similar keywords, with age and gender axes

Sentiment: “iron man” (blue) vs.
“indiana jones” (yellow)

▪ Hadoop and Hive make this all possible
▪ Consider “Associations” (similar words and phrases)
▪ Need to measure the co-occurrence of each term with every single
other word and bigram, relative to the baseline probability of
occurrence (TF-IDF)... and keep demographic metadata around for fun
▪ Typical job generates several TB of data along the way
▪ Absolutely need a cluster of machines
▪ Distributed computation opens up the possibilities for text analytics
algorithms!
▪ And... the software is free!
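
To show why "Associations" is expensive, here is a small Python sketch of scoring term pairs by how often they co-occur compared with their baseline frequencies. The slide mentions TF-IDF; this sketch swaps in a simpler lift ratio (observed co-occurrence divided by the rate expected if the terms were independent) and should not be read as the Lexicon algorithm. The function name and sample posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_lift(posts):
    """Score each term pair by co-occurrence relative to an independent baseline."""
    n = len(posts)
    term_docs = Counter()  # number of posts containing each term
    pair_docs = Counter()  # number of posts containing both terms of a pair
    for text in posts:
        terms = sorted(set(text.lower().split()))
        term_docs.update(terms)
        pair_docs.update(combinations(terms, 2))
    return {
        (a, b): (both / n) / ((term_docs[a] / n) * (term_docs[b] / n))
        for (a, b), both in pair_docs.items()
    }


posts = ["beach sunshine", "beach sunshine", "beach homework", "homework exams"]
lift = association_lift(posts)
for pair, score in sorted(lift.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

The number of candidate pairs grows roughly with the square of the vocabulary, which is why a job like this produces so much intermediate data at Facebook scale.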

Text Analytics
▪ Text analytics is clearly useful in the “macro”:
▪ Big data sets
▪ Big compute clusters
▪ Big consumers (corporations)
▪ What about in the micro?
▪ Small data sets
▪ B, not PB
▪ Small consumers
▪ Individual people analyzing their own data

HappyFactor
▪ Facebook Application (personal project, not associated with Facebook)
▪ Idea: ask people privately how happy they are and what they are doing
▪ Uses random text messages to ensure a good sample and to collect data
easily
▪ Provide users with trends on their happiness (by day, week, month, etc.)
▪ When are you happiest?
▪ Sift through the unstructured text to find patterns in behavior that
correlate with happiness and unhappiness
▪ Which activities make you happiest?
▪ Which people in your life make you happiest?
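
As a rough illustration of sifting through the text for patterns, the Python sketch below averages the happiness score over every entry that mentions a given word. It is an assumption about how such an analysis could start, not HappyFactor's actual method: the function name, the min_count threshold, and the sample entries are all invented, and a raw average is only a first step toward a real correlation.

```python
from collections import defaultdict
from statistics import mean


def average_score_by_word(entries, min_count=2):
    """Average happiness score for each word seen in at least min_count entries."""
    scores = defaultdict(list)
    for score, description in entries:
        # Use a set so a word repeated within one entry is counted once.
        for word in set(description.lower().split()):
            scores[word].append(score)
    return {w: mean(s) for w, s in scores.items() if len(s) >= min_count}


entries = [
    (9, "dinner with sarah"),
    (8, "hiking with sarah"),
    (3, "commuting"),
    (4, "commuting in traffic"),
]
by_word = average_score_by_word(entries)
print(by_word)
```

Note that a filler word like "with" scores as high as "sarah" here, so a usable version would at least need stop-word filtering.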

HappyFactor
▪ Just as corporations can learn about (and improve) themselves through
text analytics...
▪ Why not humans?

On a scale from 1 to 10, how happy are
you right now? Reply with your score and
an optional description of what you are
doing.
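
A reply to the prompt above has a simple shape: a score from 1 to 10, optionally followed by free text. The Python sketch below shows one way such a reply might be parsed. The pattern and the parse_reply name are hypothetical, and the choice to reject anything without a leading score is an assumption, not a description of how HappyFactor handles messages.

```python
import re

# A leading score of 1-10, optional separator punctuation, then free text.
REPLY_PATTERN = re.compile(r"^\s*(10|[1-9])\b[\s,.:;-]*(.*)$", re.DOTALL)


def parse_reply(message):
    """Return (score, description) or None if no leading 1-10 score is found."""
    match = REPLY_PATTERN.match(message)
    if match is None:
        return None
    description = match.group(2).strip()
    return int(match.group(1)), description or None


print(parse_reply("8 dinner with friends"))
print(parse_reply("10"))
print(parse_reply("feeling okay"))
```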

In sum...
▪ Analyzing large data sets is a challenging problem that requires
significant investment (both human and financial) in infrastructure
▪ We’re only now learning what we can do with Facebook data, since we
developed the infrastructure to support it
▪ Distributed computation and structured metadata allow for a powerful
new class of text analytics algorithms
▪ Text analytics has applications well beyond enterprise data-mining...
▪ ...could it potentially make the world a happier place?
