What the #(&*$ is Big Data?
A Holistic View of Data and
Algorithms
Alice Zheng, GraphLab
Strata Conference, Santa Clara
February, 2014
Background
• Machine Learning
• Enable machines to understand the world
• Play with data
• GraphLab
• Unleash data science!
• Enable non-ML experts to play with data
• This talk: a look at Big Data and Machine
Learning from a tool builder’s perspective
Strata Conf, Feb 2014 2
DATA
What is Data?
• Data is an extension of ourselves
• Pictures, texts, messages, logs
• Sensors and devices
• Measurements and experiments
• Data is organic; it is wild and messy
• Data proliferates
Producers of Big Data
• Tech industry
• Google, Microsoft, Facebook, Amazon, Twitter, …
• Consumer/Retail
• Walmart, Target, Amazon, Netflix, …
• Telecom
• Verizon, AT&T, Telefonica, …
• Finance
• Thomson Reuters, Dow Jones, …
• Health care and monitoring
• Personal health metrics, health care records, …
• Science
• Genome research, high energy physics, astronomy, NASA, …
• Etc.
Facebook
• 1.11 billion active users [March 2013]
• 665 million daily users on average [March 2013]
• Daily data amount: [Aug 2012]
• 500+ TB data
• 2.5 billion pieces of content
• 2.7 billion “Like” actions
• 300 million photos
• Scans 105 TB data every ½ hour
• 100+ PB data stored on a single Hadoop cluster [Aug 2012]
Data Sources: [Yahoo! news] [TechCrunch]
System Event Logs
ETW (Event Tracing for Windows)
• Logs of kernel and application events
• Up to 100K events per second
• Binary log size: ~200 MB every 2-5
minutes
• 20-50 TB/year from one machine
• ~50 PB/year from 1000 machines
Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
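The yearly volumes on this slide follow from the per-log figures by simple arithmetic. A minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and a machine that logs around the clock; the function name is illustrative only:

```python
MB, TB, PB = 10**6, 10**12, 10**15   # decimal units (an assumption)
MINUTES_PER_YEAR = 365 * 24 * 60

def yearly_bytes(log_mb, minutes_per_log):
    """Bytes written per year if one log of log_mb MB is produced every minutes_per_log minutes."""
    return log_mb * MB * MINUTES_PER_YEAR / minutes_per_log

low = yearly_bytes(200, 5)    # one 200 MB log every 5 minutes
high = yearly_bytes(200, 2)   # one 200 MB log every 2 minutes

print(f"one machine:   {low / TB:.1f} to {high / TB:.1f} TB/year")
print(f"1000 machines: {1000 * low / PB:.1f} to {1000 * high / PB:.1f} PB/year")
```

Under these assumptions one machine produces roughly 21 to 53 TB per year, in line with the 20-50 TB on the slide, and the ~50 PB figure for 1000 machines sits at the upper end of the range.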
A Picture of Big Data
[Figure: bubble chart of data sets. Axes: Structure and Total Size / Year (GB, TB, PB, EB). Size of bubble = size of a single record (log-scale). Categories: Science, Tech, Other. Data sets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter.]
TAKING THE LEAP
ALGORITHMS
The Way to Insight
• What do people do with Big Data?
• Myriad algorithms for myriad tasks
• Two disparate examples
• What movies would Bob like? – discovering
recommendations from a crowd
• Why is my machine so slow? – diagnosing
systems using event logs
Algorithm Example 1:
A Recommender System
What Movies Would Bob Like?
• Bob watched “Silver Linings Playbook”
and “Twin Peaks.” What else might Bob
like?
• Given movie selections of many users,
make recommendations for individuals
User-Movie Interaction Matrix
[Matrix: rows = users (Bob, Anna, David, Ethan); columns = movies (Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive)]
Finding Similar Movies
• Jaccard similarity between a pair of movies =
  (num users who watched both) / (num users who watched either)
• If every user who watched either movie ends up watching both, then the two movies must be very similar.
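The Jaccard similarity above can be sketched in a few lines. The watch sets below are made-up sample data (an assumption, not the talk's actual matrix), chosen so that the Silver Linings Playbook / Hunger Games pair comes out at 1/3:

```python
from fractions import Fraction

# Hypothetical watch sets: movie -> set of users who watched it.
watched = {
    "Silver Linings Playbook": {"Bob", "Anna"},
    "Hunger Games": {"Anna", "Ethan"},
    "Twin Peaks": {"Bob", "Anna", "David"},
    "Mulholland Drive": {"Anna", "David"},
}

def jaccard(a, b):
    """(# users who watched both) / (# users who watched either)."""
    either = a | b
    if not either:
        # No viewers at all: treat the similarity as 0 (a design choice).
        return Fraction(0)
    return Fraction(len(a & b), len(either))

sim = jaccard(watched["Silver Linings Playbook"], watched["Hunger Games"])
print(sim)
```

Exact fractions are used here only so the result reads the same way as the numbers on the slides; floats would work equally well.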
Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
Movie Similarity Matrix
                        | Silver Linings Playbook | Hunger Games | Twin Peaks | Iron Man 3 | Mulholland Drive
Silver Linings Playbook | 1                       | 1/3          | 2/3        | 0          | 1/3
Hunger Games            | 1/3                     | 1            | 1/4        | 0          | 1/3
Twin Peaks              | 2/3                     | 1/4          | 1          | 0          | 2/3
Iron Man 3              | 0                       | 0            | 0          | 1          | 0
Mulholland Drive        | 1/3                     | 1/3          | 2/3        | 0          | 1
Making New Recommendations
recs = []
for movie in user.preferences:
    # k movies most similar to one the user already likes
    new_movies = Sim[movie, :].topk(k)
    recs.extend(new_movies)
# rank all candidates by similarity score
recs.sort()
• Equivalently, take the vector-matrix product
• vector = the user’s preferences
• matrix = movie similarity matrix
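The vector-matrix view can be made concrete with the similarity values from the Movie Similarity Matrix slide. This is a minimal sketch: the function name, the choice to skip movies the user has already watched, and the cutoff `k` are assumptions, not part of the talk.

```python
from fractions import Fraction as F

movies = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Item-item similarity matrix, values taken from the Movie Similarity Matrix slide.
sim = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]

def recommend(preferences, sim, movies, k=2):
    """Score each movie as (preference vector) x (similarity matrix)."""
    n = len(movies)
    scores = [sum(preferences[i] * sim[i][j] for i in range(n)) for j in range(n)]
    # Rank only movies the user has not watched yet, highest score first.
    unseen = [(movies[j], scores[j]) for j in range(n) if preferences[j] == 0]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)[:k]

# Bob watched "Silver Linings Playbook" and "Twin Peaks".
bob = [1, 0, 1, 0, 0]
recs = recommend(bob, sim, movies)
print(recs)
```

For Bob, the top-scoring unseen movie under these assumptions is Mulholland Drive (1/3 + 2/3 = 1), followed by Hunger Games (1/3 + 1/4 = 7/12).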
Key Ideas
• During training: compute item-item similarity matrix
• Making recommendations: take vector-matrix product
Algorithm Example 2:
Diagnosing a slow computer
Why is My Machine So Slow?
• Slow machines are frustrating!
• Diagnose slowness via event logs
ETW – Event Tracing for Windows
• Fine-grained event tracing
• Up to 100,000 events per second
Excerpt of Sample ETW log
Diagnosing Slowness
• Start from slow thread
• Walk backwards to construct wait graph
[Figure: wait graph along a time axis. Threads shown: Firefox, Network Stack, Search Indexer, Anti-Virus Checker. Wait edges labeled: TCP/IP packet, File Lock, File Lock.]
Key Algorithm Ideas
• The insight is a wait graph
• Constructing the graph involves repeated
queries into a large set of events
• Iterate:
• What was the current thread waiting on?
• Go to the source of the wait
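The iterate step above can be sketched as a backwards walk over wait events. Everything concrete here is hypothetical: the event tuple layout, the timestamps, and which thread blocks which are invented for illustration and are not the ETW schema.

```python
from collections import defaultdict

# Hypothetical wait events: (timestamp, waiting_thread, blocking_thread, resource).
events = [
    (10, "Search Indexer", "Anti-Virus Checker", "File Lock"),
    (20, "Network Stack", "Search Indexer", "File Lock"),
    (30, "Firefox", "Network Stack", "TCP/IP packet"),
]

def build_wait_chain(events, slow_thread, at_time):
    """Walk backwards from the slow thread, following what each thread waited on."""
    # Index the events by waiting thread so each step is a lookup, not a full scan.
    by_thread = defaultdict(list)
    for ts, waiter, blocker, resource in sorted(events):
        by_thread[waiter].append((ts, blocker, resource))

    chain, seen = [], set()
    current, t = slow_thread, at_time
    while current not in seen:            # guard against cycles
        seen.add(current)
        earlier = [w for w in by_thread.get(current, []) if w[0] <= t]
        if not earlier:
            break                         # nothing earlier to blame: end of the chain
        ts, blocker, resource = earlier[-1]   # most recent wait before time t
        chain.append((current, blocker, resource))
        current, t = blocker, ts          # go to the source of the wait
    return chain

chain = build_wait_chain(events, "Firefox", at_time=35)
for waiter, blocker, resource in chain:
    print(f"{waiter} waited on {blocker} ({resource})")
```

Each loop iteration is one query into the event set, which is why the cost of the lookup dominates once the log is large.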
What links these algorithms and data?
DATA STRUCTURES – THE BRIDGE
Between Data and Algorithms
• Data structures
• Organized data
• Optimized for certain computations
• The key to efficient analysis
• Algorithms prefer certain data structures
• Raw data is amenable to certain data structures
[Diagram: Data → (amenable) → Data Structures ← (preference) ← Algorithms]
The Disconnect
• Machine Learning research – largely disconnected
from implementation
• Some recent advances in large-scale ML are rediscovering
known data structures
• Next-gen ML tools need well-tailored data structures
[Diagram: Machine Learning (statistics, optimization, linear algebra, …) on one side; Data Structures (lists, trees, tables, graphs, …) on the other]
Two Useful Data Structures
• Flat tables
• Graphs
Data Structure 1: Flat Table
Flat Tables
• Rows and columns
• Rows = records
• Columns can be typed
• A lot of raw data looks like flat tables!
Example 1: User-Item interaction data
User    | Item                    | Rating | Time
Alice   | Breaking Bad, Season 1  | 3      | …
Charlie | Twilight                | 2      |
Bob     | Silver Linings Playbook | 4      |
Frank   | American Hustle         | 2      |
Tina    | Plan 9 From Outer Space | 4      |
Bob     | Twin Peaks              | 2      |
Diana   | Dr. Strangelove         | 5      |
…
Example 2: Event log data
Timestamp | Name         | PID  | CPU | Stack | …
447590409 | audiodg.exe  | 1848 | 1   | ntkrnlpa.exe!KeSetEvent, ntkrnlpa.exe!WaitForLock
447590411 | csrss.exe    | 460  | 0   | …
447590415 | iexplore.exe | 2478 | 1   | kernel64.exe!WaitForMultipleObjects
…
Variations of Flat Tables
• Query vs. computation
• Random access (in-memory) vs.
sequential access (on-disk)
• Column vs. row-wise representation
• Indexed or not
• Distributed or not
• Key-value stores (hash tables)
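The column vs. row-wise distinction is easy to see on the rating table from Example 1. A small sketch using plain Python containers (the variable names are illustrative): the row-wise layout keeps each record together, while the column-wise layout lets a computation touch only the column it needs.

```python
# Row-wise: one tuple per record; convenient for appending or fetching whole records.
rows = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]

# Column-wise: one list per column; convenient for scanning a single attribute.
columns = {
    "User": [r[0] for r in rows],
    "Item": [r[1] for r in rows],
    "Rating": [r[2] for r in rows],
}

# The same aggregate computed from each layout.
mean_from_rows = sum(r[2] for r in rows) / len(rows)                   # visits every record
mean_from_columns = sum(columns["Rating"]) / len(columns["Rating"])   # reads one column only
print(mean_from_rows, mean_from_columns)
```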
Data Structure 1.5: Indexed Flat Table
Example of Indexed Flat Table
User    | Item                    | Rating
Alice   | Breaking Bad, Season 1  | 3
Charlie | Twilight                | 2
Bob     | Silver Linings Playbook | 4
Frank   | American Hustle         | 2
Tina    | Plan 9 From Outer Space | 4
Bob     | Twin Peaks              | 2
Diana   | Dr. Strangelove         | 5
…
Index: on the User column
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6
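An index of this kind can be sketched as a dictionary from a column value to the row numbers that hold it. This is an illustrative in-memory version with hypothetical function names, not how any particular tool implements indexing:

```python
from collections import defaultdict

rows = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]

def build_index(rows, column):
    """Map each distinct value in `column` to the 1-based row numbers containing it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows, start=1):   # 1-based, as on the slide
        index[row[column]].append(row_number)
    return index

user_index = build_index(rows, column=0)

def items_rated_by(user):
    """Answer the query with index lookups instead of a full table scan."""
    return [rows[n - 1][1] for n in user_index.get(user, [])]

print(user_index["Bob"])
print(items_rated_by("Bob"))
```

Building the index costs one pass over the table; after that, each query touches only the matching rows.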
Back to the Recommender
• Training: compute a matrix
• Recommending: vector-matrix product
• Raw data: user-item interaction log
• Load in as flat table
• Build index (user-item matrix)
• Iterate through the users to train
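The three steps above (load, index by user, iterate through users) can be sketched end to end using the Jaccard similarity from earlier in the talk. The interaction log is made-up sample data, and the function name `train` is an assumption:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# Hypothetical user-item interaction log, loaded as a flat table of (user, item) rows.
log = [
    ("Bob", "Silver Linings Playbook"), ("Bob", "Twin Peaks"),
    ("Anna", "Silver Linings Playbook"), ("Anna", "Hunger Games"),
    ("Anna", "Twin Peaks"), ("Anna", "Mulholland Drive"),
    ("David", "Twin Peaks"), ("David", "Mulholland Drive"),
    ("Ethan", "Hunger Games"),
]

def train(log):
    """Compute item-item Jaccard similarities for every pair that co-occurs."""
    # Build the index: user -> set of items.
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)

    # Iterate through the users, counting viewers per item and per item pair.
    watched = defaultdict(int)
    both = defaultdict(int)
    for items in items_by_user.values():
        for item in items:
            watched[item] += 1
        for a, b in combinations(sorted(items), 2):
            both[(a, b)] += 1

    sim = {}
    for (a, b), n_both in both.items():
        n_either = watched[a] + watched[b] - n_both
        sim[(a, b)] = sim[(b, a)] = Fraction(n_both, n_either)
    return sim

sim = train(log)
print(sim[("Silver Linings Playbook", "Hunger Games")])
```

Pairs that no user watched together never appear in `sim`; their similarity is implicitly 0, which keeps the result sparse.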
ML on Flat Tables
• Anything where data is represented as
feature vectors
• Computations operate on rows
• Stochastic gradient descent
• K-means clustering
• … or columns
• Decision tree family
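As one illustration of a row-wise computation, here is a minimal stochastic gradient descent sketch that fits a line to a tiny flat table, one row at a time. The data, learning rate, and epoch count are assumptions chosen for the example:

```python
import random

# Hypothetical flat table: each row is (feature x, target y), generated from y = 2x + 1.
rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

def sgd_linear(rows, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by updating the parameters after every single row."""
    rng = random.Random(seed)      # seeded so the run is repeatable
    w, b = 0.0, 0.0
    order = list(rows)
    for _ in range(epochs):
        rng.shuffle(order)         # visit the rows in a random order each epoch
        for x, y in order:
            err = (w * x + b) - y  # prediction error on this row
            w -= lr * err * x      # gradient of 0.5 * err**2 with respect to w
            b -= lr * err          # gradient of 0.5 * err**2 with respect to b
    return w, b

w, b = sgd_linear(rows)
print(round(w, 3), round(b, 3))
```

Each update needs only one row, which is why this family of algorithms sits comfortably on top of a flat table that is read sequentially.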
Data Structure 2: Graph
Example
[Figure: graph whose vertices are people: Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
Implementation 1: Edge List
• A simple flat table!
• Additional columns = edge attributes (e.g., user rating
of movie, time watched, etc.)
User    | Item
Alice   | Breaking Bad, Season 1
Charlie | Twilight
Bob     | Silver Linings Playbook
Frank   | American Hustle
Tina    | Plan 9 From Outer Space
Bob     | Twin Peaks
Diana   | Dr. Strangelove
…
Implementation 2: Edge List + Vertex List
• Two flat tables
• Pre-computed join on VertexID
Vertex table:
VertexID | Name                    | Age | Genre
1        | Alice                   | 50  |
2        | Charlie                 | 26  |
3        | Bob                     | 33  |
…
100001   | Silver Linings Playbook |     | Romance
100002   | Iron Man 3              |     | Action
100003   | Twin Peaks              |     | Thriller
Edge table:
SrcVertex | DstVertex
1         | 389944
2         | 136782
3         | 100001
4         | 572639
5         | 200835
3         | 100003
…
Graph Operations
• get_neighbors():
1. Query indexed flat table
2. Join with vertex table on VertexID or Name
Edge rows returned by the index query:
User | Movie                   | Rating
Bob  | Silver Linings Playbook | 4
Bob  | Twin Peaks              | 2
Vertex rows after the join:
VertexID | Name                    | Age | Genre
3        | Bob                     | 33  |
100001   | Silver Linings Playbook |     | Romance
100003   | Twin Peaks              |     | Thriller
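The two steps of get_neighbors() can be sketched over a vertex table and an edge table held as plain Python containers. Only rows visible on the slides are used; treating edges as directed from SrcVertex to DstVertex, and the function and variable names, are assumptions:

```python
from collections import defaultdict

# Vertex table: VertexID -> attributes (rows taken from the slides).
vertices = {
    1: {"Name": "Alice", "Age": 50},
    2: {"Name": "Charlie", "Age": 26},
    3: {"Name": "Bob", "Age": 33},
    100001: {"Name": "Silver Linings Playbook", "Genre": "Romance"},
    100002: {"Name": "Iron Man 3", "Genre": "Action"},
    100003: {"Name": "Twin Peaks", "Genre": "Thriller"},
}

# Edge table: (SrcVertex, DstVertex), restricted to edges whose endpoints appear above.
edges = [(3, 100001), (3, 100003)]

# Index on SrcVertex, built once.
edges_by_src = defaultdict(list)
for src, dst in edges:
    edges_by_src[src].append(dst)

def get_neighbors(vertex_id):
    """Step 1: query the indexed edge table. Step 2: join with the vertex table."""
    neighbor_ids = edges_by_src.get(vertex_id, [])
    return [dict(VertexID=v, **vertices[v]) for v in neighbor_ids if v in vertices]

neighbors = get_neighbors(3)   # Bob
print([n["Name"] for n in neighbors])
```

Finding the neighbors of a movie would need a second index on DstVertex, which is one example of the "how much to pre-compute" trade-off.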
Graph Operations
• get_subgraph():
• get_neighbors(), instantiate new table with subset of
rows of old tables
• Find edges/vertices with attribute = x
• Filter old tables
• Hypergraph – edges span more than 2 vertices
• Just add more columns to the edge table
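Both filtering and get_subgraph() reduce to selecting rows of the edge table. A small sketch on the rating table used earlier; the helper names are hypothetical:

```python
# Edge table with one attribute column: (User, Item, Rating).
edges = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]

def filter_edges(edges, predicate):
    """Find edges with attribute = x: keep the rows for which predicate(row) is true."""
    return [row for row in edges if predicate(row)]

def get_subgraph(edges, vertex_set):
    """New edge table containing only edges whose two endpoints are both in vertex_set."""
    return [row for row in edges if row[0] in vertex_set and row[1] in vertex_set]

rated_four = filter_edges(edges, lambda row: row[2] == 4)
bob_graph = get_subgraph(edges, {"Bob", "Silver Linings Playbook", "Twin Peaks"})
print(rated_four)
print(bob_graph)
```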
Back to Syslog Mining
• Wait graph construction = search and filter
• Iterate:
• get_neighbors()
• filter on edge and vertex attribute to find culprits
• Sequential process
• Underlying event graph is enormous
• SLOW
ML on Graphs
• Graphical models (Bayes nets)
• Belief propagation
• Gibbs sampling
• Random walk on Markov chains
• PageRank
• Some algos are implementable on either
• Matrix factorization
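As an example of a graph algorithm running directly on an edge list, here is a minimal PageRank sketch by power iteration. The four-vertex graph, the damping factor of 0.85, and the fixed iteration count are assumptions made for illustration:

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical graph: A -> B -> C -> A, plus D -> C.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items()):
    print(vertex, round(score, 4))
```

Every iteration is a pass over the edge table, so the same computation can be expressed either with a graph API or as repeated joins and aggregations on flat tables.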
Graphs vs. Tables
• Closely related
• Graphs can be implemented on top of tables
• … yet different
• What key operations to optimize
• How much to pre-compute
• Indexes
• Joins
• Filters
Popular Implementations
Flat Tables
[Chart: tools placed along two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Tools shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame.]
Graphs
[Chart: same two axes. Tools shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph.]
Conclusions
• Fast and scalable analysis hinges upon
efficient data structures
• Match the algo to the data structure
• Morph raw data into the data structure
[Diagram: Raw Data → Data Structure → Algorithm → Insight]
Advertising
• GraphLab Tutorial this afternoon!
• “Large Scale Machine Learning Cookbook
Using GraphLab”
• Ballroom G, 1:30pm to 5pm

 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 

Recently uploaded (20)

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 

What the Bleep is Big Data? A Holistic View of Data and Algorithms

  • 1. What the #(&*$ is Big Data? A Holistic View of Data and Algorithms Alice Zheng, GraphLab Strata Conference, Santa Clara February, 2014
  • 2. Background • Machine Learning • Enable machines to understand the world • Play with data • GraphLab • Unleash data science! • Enable non-ML experts to play with data • This talk: a look at Big Data and Machine Learning from a tool builder’s perspective Strata Conf, Feb 2014 2
  • 4. What is Data? • Data is an extension of ourselves • Pictures, texts, messages, logs • Sensors and devices • Measurements and experiments • Data is organic; it is wild and messy • Data proliferates
  • 5. Producers of Big Data • Tech industry • Google, Microsoft, Facebook, Amazon, Twitter, … • Consumer/Retail • Walmart, Target, Amazon, Netflix, … • Telecom • Verizon, AT&T, Telefonica, … • Finance • Thomson Reuters, Dow Jones, … • Health care and monitoring • Personal health metrics, health care records, … • Science • Genome research, high energy physics, astronomy, NASA, … • Etc.
  • 6. Facebook • 1.11 billion active users [March 2013] • 665 million daily users on average [March 2013] • Daily data amount: [Aug 2012] • 500+ TB of data • 2.5 billion pieces of content • 2.7 billion “Like” actions • 300 million photos • Scans 105 TB of data every ½ hour • 100+ PB of data stored on a single Hadoop cluster [Aug 2012] • Data sources: [Yahoo! news] [TechCrunch]
  • 7. System Event Logs • ETW (Event Tracing for Windows) • Logs of kernel and application events • Up to 100K events per second • Binary log size: ~200 MB every 2-5 minutes • 20-50 TB/year from one machine • ~50 PB/year from 1000 machines • Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
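The per-machine yearly figure on slide 7 follows from the log size and the logging interval. A quick arithmetic check, assuming decimal units (1 TB = 10^6 MB) and a machine that logs continuously all year; the function name is illustrative only:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def tb_per_year(mb_per_log, minutes_per_log):
    """Yearly log volume in TB for one machine (assumes 1 TB = 10**6 MB)."""
    mb_per_minute = mb_per_log / minutes_per_log
    return mb_per_minute * MINUTES_PER_YEAR / 1e6


# ~200 MB every 5 minutes (slow end) and every 2 minutes (fast end).
low = tb_per_year(200, 5)
high = tb_per_year(200, 2)
print(f"{low:.1f} to {high:.1f} TB/year per machine")
```

Both ends land in the slide's 20-50 TB/year range, and multiplying the fast end by 1000 machines gives roughly the ~50 PB/year quoted.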
  • 8. A Picture of Big Data • [Bubble chart. Axis labels: Total Size / Year (GB, TB, PB, EB) and Structure. Size of bubble = size of a single record (log scale). Legend: Science, Tech, Other. Data sets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter]
  • 9. TAKING THE LEAP
  • 11. The Way to Insight • What do people do with Big Data? • Myriad algorithms for myriad tasks • Two disparate examples • What movies would Bob like? – discovering recommendations from a crowd • Why is my machine so slow? – diagnosing systems using event logs
  • 12. Algorithm Example 1: A Recommender System
  • 13. What Movies Would Bob Like? • Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like? • Given movie selections of many users, make recommendations for individuals
  • 14. User-Movie Interaction Matrix • Columns (movies): Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Rows (users): Bob, Anna, David, Ethan
  • 15. Finding Similar Movies • Jaccard similarity between a pair of movies = (num users who watched both) / (num users who watched either) • If every user who watched either movie ended up watching both, then the two movies must be very similar.
  • 16. User-Movie Interaction Matrix • Columns (movies): Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Rows (users): Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = ?
  • 19. User-Movie Interaction Matrix • Columns (movies): Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Rows (users): Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
  • 20. Movie Similarity Matrix
                                 Silver Linings  Hunger  Twin   Iron   Mulholland
                                 Playbook        Games   Peaks  Man 3  Drive
      Silver Linings Playbook    1               1/3     2/3    0      1/3
      Hunger Games               1/3             1       1/4    0      1/3
      Twin Peaks                 2/3             1/4     1      0      2/3
      Iron Man 3                 0               0       0      1      0
      Mulholland Drive           1/3             1/3     2/3    0      1
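The Jaccard computation from slide 15 can be sketched in a few lines of Python. The transcript does not preserve the cells of the user-movie matrix, so the viewing data below is hypothetical: it was chosen to agree with the off-diagonal similarity values on slide 20 and with Bob's two movies on slide 13, but it is not the author's data. Iron Man 3 has no viewers here, and the sketch defines the 0/0 case as 0.

```python
from fractions import Fraction

# Hypothetical viewing data (assumption, not taken from the slides).
WATCHED = {
    "Bob": {"Silver Linings Playbook", "Twin Peaks"},
    "Anna": {"Silver Linings Playbook", "Hunger Games",
             "Twin Peaks", "Mulholland Drive"},
    "David": {"Twin Peaks", "Mulholland Drive"},
    "Ethan": {"Hunger Games"},
}


def viewers_by_movie(watched):
    """Invert user -> movies into movie -> set of users."""
    viewers = {}
    for user, movies in watched.items():
        for movie in movies:
            viewers.setdefault(movie, set()).add(user)
    return viewers


def jaccard(viewers, movie_a, movie_b):
    """(users who watched both) / (users who watched either)."""
    a = viewers.get(movie_a, set())
    b = viewers.get(movie_b, set())
    either = a | b
    if not either:
        return Fraction(0)  # nobody watched either movie
    return Fraction(len(a & b), len(either))


VIEWERS = viewers_by_movie(WATCHED)
sim = jaccard(VIEWERS, "Silver Linings Playbook", "Hunger Games")
print(sim)
```

Exact fractions are used so the result can be compared directly with the 1/3 shown on slide 19.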
  • 21. Making New Recommendations
      recs = []
      for movie in user.preferences:
          new_movies = Sim[movie, :].topk(k)   # the k movies most similar to this one
          recs.extend(new_movies)              # collect candidates from every watched movie
      recs.sort()                              # order candidates by similarity score
    • Equivalently, take the vector-matrix product • vector = the user’s preferences • matrix = movie similarity matrix
  • 22. Key Ideas • During training: compute item-item similarity matrix • Making recommendations: take vector-matrix product
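The vector-matrix product from slides 21 and 22 can be sketched as follows. The similarity values are the ones listed on slide 20; the `recommend` helper, its `k` parameter, and the tie-breaking by title are illustrative assumptions rather than the talk's implementation.

```python
from fractions import Fraction as F

MOVIES = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Item-item similarity values as listed on slide 20.
SIM = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]


def recommend(watched, k=2):
    """Score every movie as preferences x SIM, then keep the top k unseen ones."""
    n = len(MOVIES)
    prefs = [1 if movie in watched else 0 for movie in MOVIES]
    # Vector-matrix product: scores[j] = sum_i prefs[i] * SIM[i][j]
    scores = [sum(prefs[i] * SIM[i][j] for i in range(n)) for j in range(n)]
    unseen = [(scores[j], MOVIES[j]) for j in range(n) if not prefs[j]]
    unseen.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(title, score) for score, title in unseen[:k]]


bob_recs = recommend({"Silver Linings Playbook", "Twin Peaks"})
print(bob_recs)
```

For Bob, each unseen movie's score is the sum of its similarities to the two movies he watched, which is the same thing the loop on slide 21 accumulates one movie at a time.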
  • 23. Algorithm Example 2: Diagnosing a slow computer
  • 24. Why is My Machine So Slow? • Slow machines are frustrating! • Diagnose slowness via event logs
  • 25. ETW – Event Tracing for Windows • Fine-grained event tracing • Up to 100,000 events per second • [Figure: excerpt of sample ETW log]
  • 26. Diagnosing Slowness • Start from slow thread • Walk backwards to construct wait graph • [Wait-graph diagram; labels: Firefox, Time, Network Stack, TCP/IP packet, Search Indexer, File Lock, Anti-Virus Checker, File Lock]
  • 27. Key Algorithm Ideas • The insight is a wait graph • Constructing the graph involves repeated queries into a large set of events • Iterate: • What was the current thread waiting on? • Go to the source of the wait
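The iteration on slide 27 can be sketched as a backward walk over wait events. Everything concrete below is an assumption made for illustration: the tuple layout, the timestamps, and the order of the dependencies (the transcript keeps only the diagram labels from slide 26, not its edges). Real ETW records are far richer, and a thread may wait on several things at once, so this sketch builds a single chain rather than a full graph.

```python
# Hypothetical, simplified wait events:
# (timestamp, waiting_thread, blocking_thread, resource)
EVENTS = [
    (100, "Search Indexer", "Anti-Virus Checker", "File Lock"),
    (105, "Network Stack", "Search Indexer", "File Lock"),
    (110, "Firefox", "Network Stack", "TCP/IP packet"),
]


def build_wait_chain(events, slow_thread):
    """Start from the slow thread and repeatedly jump to the source of its wait."""
    # Index the most recent wait per thread so each step is one lookup.
    latest_wait = {}
    for _timestamp, waiter, blocker, resource in sorted(events):
        latest_wait[waiter] = (blocker, resource)

    chain = []
    seen = {slow_thread}
    current = slow_thread
    while current in latest_wait:
        blocker, resource = latest_wait[current]
        chain.append((current, resource, blocker))
        if blocker in seen:  # guard against cyclic waits
            break
        seen.add(blocker)
        current = blocker
    return chain


chain = build_wait_chain(EVENTS, "Firefox")
for waiter, resource, blocker in chain:
    print(f"{waiter} waited on {resource} held by {blocker}")
```

The dictionary lookup stands in for the "repeated queries into a large set of events" that the slide mentions; on real logs that query is the expensive part.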
  • 28. What links these algorithms and data?
  • 29. DATA STRUCTURES – THE BRIDGE
  • 30. Between Data and Algorithms • Data structures • Organized data • Optimized for certain computations • The key to efficient analysis • Algorithms prefer certain data structures • Raw data is amenable to certain data structures • [Diagram labels: Data, Algorithms, Data Structures, Amenable, Preference]
  • 31. The Disconnect • Machine Learning research – largely disconnected from implementation • Some recent advances in large-scale ML are rediscovering known data structures • Next-gen ML tools need well-tailored data structures • [Diagram: Machine Learning (Statistics, optimization, linear algebra, …) and Data Structures (Lists, trees, tables, graphs, …)]
  • 32. Two Useful Data Structures • Flat tables • Graphs
  • 33. Data Structure 1: Flat Table
  • 34. Flat Tables • Rows and columns • Rows = records • Columns can be typed • A lot of raw data looks like flat tables!
  • 35. Example 1: User-Item interaction data
      User     Item                      Rating  Time
      Alice    Breaking Bad, Season 1    3       …
      Charlie  Twilight                  2
      Bob      Silver Linings Playbook   4
      Frank    American Hustle           2
      Tina     Plan 9 From Outer Space   4
      Bob      Twin Peaks                2
      Diana    Dr. Strangelove           5
      …
  • 36. Example 2: Event log data
      Timestamp  Name          PID   CPU  Stack …
      447590409  audiodg.exe   1848  1    ntkrnlpa.exe!KeSetEvent ntkrnlpa.exe!WaitForLock
      447590411  csrss.exe     460   0    …
      447590415  iexplore.exe  2478  1    kernel64.exe!WaitForMultipleObjects …
  • 37. Variations of Flat Tables • Query vs. computation • Random access (in-memory) vs. sequential access (on-disk) • Column vs. row-wise representation • Indexed or not • Distributed or not • Key-value stores (hash tables)
  • 38. Data Structure 1.5: Indexed Flat Table
  • 41. Example of Indexed Flat Table • Index query: What items did Bob rate? • Index of “Bob” points to rows 3 and 6
      User     Item                      Rating
      Alice    Breaking Bad, Season 1    3
      Charlie  Twilight                  2
      Bob      Silver Linings Playbook   4
      Frank    American Hustle           2
      Tina     Plan 9 From Outer Space   4
      Bob      Twin Peaks                2
      Diana    Dr. Strangelove           5
      …
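The index lookup on slide 41 can be sketched with a dictionary that maps each user to the row numbers where that user appears. The rows are the ones shown on the slide; the helper name and the choice of 1-based row numbers (to match the slide's "rows 3 and 6") are assumptions of this sketch.

```python
from collections import defaultdict

# Rows from the slide's example table: (User, Item, Rating).
ROWS = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def build_index(rows, column=0):
    """Map each value in `column` to the 1-based numbers of the rows holding it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows, start=1):
        index[row[column]].append(row_number)
    return dict(index)


user_index = build_index(ROWS)

# "What items did Bob rate?" becomes a lookup instead of a full table scan.
bob_rows = user_index["Bob"]
bob_items = [ROWS[n - 1][1] for n in bob_rows]
print(bob_rows, bob_items)
```

Building the index costs one pass over the table; after that, each query touches only the matching rows.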
  • 42. Back to the Recommender • Training: compute a matrix • Recommending: vector-matrix product • Raw data: user-item interaction log • Load in as flat table • Build index (user-item matrix) • Iterate through the users to train
  • 43. ML on Flat Tables • Anything where data is represented as feature vectors • Computations operate on rows • Stochastic gradient descent • K-means clustering • … or columns • Decision tree family
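As an illustration of a computation that operates on rows, here is a minimal stochastic gradient descent sketch that fits a one-parameter linear model by visiting the table one row at a time. The data, learning rate, and epoch count are made up for the example, and rows are visited in a fixed order to keep the run deterministic (practical SGD usually shuffles them).

```python
# Hypothetical flat table: each row is (feature x, target y), with y = 2 * x.
ROWS = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def sgd_fit(rows, learning_rate=0.1, epochs=100):
    """Fit y ~ w * x by updating w after every single row."""
    w = 0.0
    for _ in range(epochs):
        for x, y in rows:
            error = w * x - y
            # Gradient of 0.5 * error**2 with respect to w is error * x.
            w -= learning_rate * error * x
    return w


weight = sgd_fit(ROWS)
print(round(weight, 6))
```

Each update reads exactly one row, which is why row-oriented or sequentially scanned tables suit this family of algorithms.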
  • 44. Data Structure 2: Graph
  • 45. Example • [Graph diagram with vertices: Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
  • 46. Implementation 1: Edge List • A simple flat table! • Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)
      User     Item
      Alice    Breaking Bad, Season 1
      Charlie  Twilight
      Bob      Silver Linings Playbook
      Frank    American Hustle
      Tina     Plan 9 From Outer Space
      Bob      Twin Peaks
      Diana    Dr. Strangelove
      …
  • 47. Implementation 2: Edge List + Vertex List • Two flat tables • Pre-computed join on VertexID
      VertexID  Name                     Age  Genre
      1         Alice                    50
      2         Charlie                  26
      3         Bob                      33
      …
      100001    Silver Linings Playbook       Romance
      100002    Iron Man 3                    Action
      100003    Twin Peaks                    Thriller

      SrcVertex  DstVertex
      1          389944
      2          136782
      3          100001
      4          572639
      5          200835
      3          100003
      …
  • 48. Graph Operations • get_neighbors(): 1. Query indexed flat table
  • 50. Graph Operations • get_neighbors(): 1. Query indexed flat table 2. Join with vertex table on VertexID or Name
      User  Movie                    Rating
      Bob   Silver Linings Playbook  4
      Bob   Twin Peaks               2

      VertexID  Name                     Age  Genre
      3         Bob                      33
      100001    Silver Linings Playbook       Romance
      100003    Twin Peaks                    Thriller
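The two steps of get_neighbors() can be sketched on top of the edge and vertex tables from slide 47. The table contents come from the slides; the dictionary layout, the helper names, and the decision to follow only outgoing edges are assumptions of this sketch. Vertices that the slide elides (for example 389944) simply have no attributes here.

```python
from collections import defaultdict

# Vertex table keyed by VertexID (values shown on slide 47).
VERTICES = {
    1: {"name": "Alice", "age": 50},
    2: {"name": "Charlie", "age": 26},
    3: {"name": "Bob", "age": 33},
    100001: {"name": "Silver Linings Playbook", "genre": "Romance"},
    100002: {"name": "Iron Man 3", "genre": "Action"},
    100003: {"name": "Twin Peaks", "genre": "Thriller"},
}

# Edge table: (SrcVertex, DstVertex).
EDGES = [(1, 389944), (2, 136782), (3, 100001),
         (4, 572639), (5, 200835), (3, 100003)]


def build_edge_index(edges):
    """Index edge rows by source vertex (step 1 relies on this)."""
    index = defaultdict(list)
    for row_number, (src, _dst) in enumerate(edges):
        index[src].append(row_number)
    return index


EDGE_INDEX = build_edge_index(EDGES)


def get_neighbors(vertex_id):
    """Step 1: query the indexed edge table. Step 2: join with the vertex table."""
    neighbors = []
    for row_number in EDGE_INDEX.get(vertex_id, []):
        dst = EDGES[row_number][1]
        attributes = VERTICES.get(dst, {})  # empty if the vertex row is unknown
        neighbors.append((dst, attributes))
    return neighbors


bob_neighbors = get_neighbors(3)
print([attrs["name"] for _vertex_id, attrs in bob_neighbors])
```

Pre-computing the join, as slide 47 suggests, would amount to storing the attribute dictionaries alongside each edge row so that step 2 needs no second lookup.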
  • 51. Graph Operations • get_subgraph(): • get_neighbors(), instantiate new table with subset of rows of old tables • Find edges/vertices with attribute = x • Filter old tables • Hypergraph – edges span more than 2 vertices • Just add more columns to the edge table
  • 52. Back to Syslog Mining • Wait graph construction = search and filter • Iterate: • get_neighbors() • filter on edge and vertex attribute to find culprits • Sequential process • Underlying event graph is enormous • SLOW
  • 53. ML on Graphs • Graphical models (Bayes nets) • Belief propagation • Gibbs sampling • Random walk on Markov chains • PageRank • Some algos are implementable on either • Matrix factorization
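PageRank, listed on slide 53 as a random-walk algorithm, can be written directly against an edge list. This is a plain power-iteration sketch, not GraphLab's implementation; the edges are hypothetical (only the vertex names are borrowed from slide 45), and the damping factor and iteration count are conventional defaults chosen for the example.

```python
# Hypothetical directed edges between some of the people named on slide 45.
EDGES = [
    ("Anna", "Bob"),
    ("Diana", "Bob"),
    ("Charlie", "Bob"),
    ("Bob", "Sam"),
    ("Sam", "Anna"),
]


def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration over an edge list; dangling mass is spread uniformly."""
    nodes = sorted({vertex for edge in edges for vertex in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each vertex splits its rank evenly among its out-neighbors.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


ranks = pagerank(EDGES)
top = max(ranks, key=ranks.get)
print(top)
```

Every iteration is a pass over the edges grouped by source vertex, which is the access pattern a graph structure (or an edge table indexed by source) is meant to make cheap.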
  • 54. Graphs vs. Tables • [Diagram labels: Tables, Graphs]
  • 55. Graphs vs. Tables • Closely related • Graphs can be implemented on top of tables • … yet different • What key operations to optimize • How much to pre-compute • Indexes • Joins • Filters
  • 57. Flat Tables • [Quadrant chart. Axis labels: Random Access (In Memory) vs. Sequential Access (On Disk); Querying (Interactive) vs. Computation (Batch). Tools shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame]
  • 58. Graphs • [Quadrant chart. Axis labels: Random Access (In-Memory) vs. Sequential Access (On Disk); Querying (Interactive) vs. Computation (Batch). Tools shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph]
  • 59. Conclusions • Fast and scalable analysis hinges upon efficient data structures • Match the algo to the data structure • Morph raw data into the data structure • [Diagram labels: Raw Data, Data Structure, Algorithm, Insight]
  • 60. Advertising • GraphLab Tutorial this afternoon! • “Large Scale Machine Learning Cookbook Using GraphLab” • Ballroom G, 1:30pm–5pm

Editor's Notes

  1. In order to understand the problems involved in Big Data Analysis, we have to switch from a learning- or modeling-centric view to the data-centric view.