What the Bleep is Big Data? A Holistic View of Data and Algorithms
1. What the #(&*$ is Big Data?
A Holistic View of Data and Algorithms
Alice Zheng, GraphLab
Strata Conference, Santa Clara
February, 2014
2. Background
• Machine Learning
• Enable machines to understand the world
• Play with data
• GraphLab
• Unleash data science!
• Enable non-ML experts to play with data
• This talk: a look at Big Data and Machine Learning from a tool builder’s perspective
Strata Conf, Feb 2014 2
4. What is Data?
• Data is an extension of ourselves
• Pictures, texts, messages, logs
• Sensors and devices
• Measurements and experiments
• Data is organic; it is wild and messy
• Data proliferates
5. Producers of Big Data
• Tech industry
• Google, Microsoft, Facebook, Amazon, Twitter, …
• Consumer/Retail
• Walmart, Target, Amazon, Netflix, …
• Telecom
• Verizon, AT&T, Telefonica, …
• Finance
• Thomson Reuters, Dow Jones, …
• Health care and monitoring
• Personal health metrics, health care records, …
• Science
• Genome research, high energy physics, astronomy, NASA, …
• Etc.
6. Facebook
• 1.11 billion active users [March 2013]
• 665 million daily users on average [March 2013]
• Daily data amount: [Aug 2012]
• 500+ TB data
• 2.5 billion pieces of content
• 2.7 billion “Like” actions
• 300 million photos
• Scans 105 TB data every half hour
• 100+ PB data stored on a single Hadoop cluster [Aug 2012]
Data Sources: [Yahoo! news] [TechCrunch]
7. System Event Logs
ETW (Event Tracing for Windows)
• Logs of kernel and application events
• Up to 100K events per second
• Binary log size: ~200 MB every 2-5 minutes
• 20-50 TB/year from one machine
• ~50 PB/year from 1000 machines
Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
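The yearly volumes on this slide follow from the stated log rate. Below is a quick back-of-the-envelope check. It assumes decimal units (1 TB = 1,000,000 MB) and a machine that logs continuously all year; neither assumption is stated on the slide.

```python
# Back-of-the-envelope check of the ETW volume figures.
# Assumptions (not from the slide): decimal units, logging 24/7 all year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes
LOG_MB = 200                      # binary log size per interval (from the slide)


def tb_per_year(interval_minutes):
    """Yearly volume in TB when one 200 MB log is written every interval."""
    logs_per_year = MINUTES_PER_YEAR / interval_minutes
    return logs_per_year * LOG_MB / 1_000_000  # MB -> TB


low = tb_per_year(5)   # slow end: one log every 5 minutes
high = tb_per_year(2)  # fast end: one log every 2 minutes
fleet_pb = high * 1000 / 1000  # 1000 machines, then TB -> PB

print(f"{low:.1f} to {high:.1f} TB/year from one machine")
print(f"about {fleet_pb:.0f} PB/year from 1000 machines at the fast rate")
```

The result lands at roughly 21 to 53 TB per machine per year, which is consistent with the rounded 20-50 TB and ~50 PB figures.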
8. A Picture of Big Data
[Figure: bubble chart of Big Data sources. Axes: Structure and Total Size / Year (GB, TB, PB, EB). Size of bubble = size of a single record (log-scale). Categories: Science, Tech, Other. Sources shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter.]
11. The Way to Insight
• What do people do with Big Data?
• Myriad algorithms for myriad tasks
• Two disparate examples
• What movies would Bob like? – discovering recommendations from a crowd
• Why is my machine so slow? – diagnosing systems using event logs
13. What Movies Would Bob Like?
• Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like?
• Given movie selections of many users, make recommendations for individuals
15. Finding Similar Movies
• Jaccard similarity between a pair of movies:
similarity = (num users who watched both) / (num users who watched either)
• If every user who watched either movie ended up watching both, the similarity is 1: the two movies must be very similar.
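As a minimal sketch of the ratio above, the snippet below computes Jaccard similarity from sets of viewers. The viewer names and the `watched` mapping are hypothetical illustration data, not taken from the talk.

```python
def jaccard(watchers_a, watchers_b):
    """Jaccard similarity: |watched both| / |watched either|."""
    a, b = set(watchers_a), set(watchers_b)
    either = a | b
    if not either:          # nobody watched either movie
        return 0.0
    return len(a & b) / len(either)


# Hypothetical viewing data: movie title -> set of users who watched it.
watched = {
    "Silver Linings Playbook": {"Alice", "Bob", "Carol"},
    "Twin Peaks": {"Bob", "Carol"},
    "Iron Man 3": {"Dave"},
}

slp_vs_twin_peaks = jaccard(watched["Silver Linings Playbook"], watched["Twin Peaks"])
slp_vs_iron_man = jaccard(watched["Silver Linings Playbook"], watched["Iron Man 3"])
print(slp_vs_twin_peaks, slp_vs_iron_man)
```

Two of the three users who watched either of the first two movies watched both, so their similarity is 2/3; the third movie shares no viewers with the first and scores 0.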
20. Movie Similarity Matrix
                         Silver Linings  Hunger  Twin   Iron   Mulholland
                         Playbook        Games   Peaks  Man 3  Drive
Silver Linings Playbook  1               1/3     2/3    0      1/3
Hunger Games             1/3             1       1/4    0      1/3
Twin Peaks               2/3             1/4     1      0      2/3
Iron Man 3               0               0       0      1      0
Mulholland Drive         1/3             1/3     2/3    0      1
21. Making New Recommendations
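recs = []
for movie in user.preferences:
    # top-k most similar movies, read from this movie's row of the similarity matrix
    new_movies = Sim[movie, :].topk()
    recs.extend(new_movies)
# order the pooled candidates
recs.sort()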
• Equivalently, take the vector-matrix product
• vector = the user’s preferences
• matrix = movie similarity matrix
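The vector-matrix formulation can be sketched with the similarity matrix from slide 20. This is an illustrative version only: the 0/1 preference vector, the `recommend` helper, and the choice to skip already-watched movies are assumptions, not the speaker's implementation.

```python
from fractions import Fraction as F

movies = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Item-item similarity matrix copied from slide 20 (exact fractions).
sim = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]


def recommend(preferences, k=2):
    """Score every movie as (preference vector) x (similarity matrix)."""
    scores = [
        sum(p * sim[i][j] for i, p in enumerate(preferences))
        for j in range(len(movies))
    ]
    # Assumption: only recommend movies the user has not watched yet.
    unseen = [(scores[j], movies[j]) for j, p in enumerate(preferences) if p == 0]
    unseen.sort(reverse=True)
    return [title for _, title in unseen[:k]]


# Bob watched "Silver Linings Playbook" and "Twin Peaks".
bob = [1, 0, 1, 0, 0]
print(recommend(bob))
```

For Bob, the unseen movies score 1 (Mulholland Drive), 7/12 (Hunger Games) and 0 (Iron Man 3), so the first two are returned.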
22. Key Ideas
• During training: compute item-item similarity matrix
• Making recommendations: take vector-matrix product
24. Why is My Machine So Slow?
• Slow machines are frustrating!
• Diagnose slowness via event logs
25. ETW – Event Tracing for Windows
• Fine-grained event tracing
• Up to 100,000 events per second
Excerpt of Sample ETW log
26. Diagnosing Slowness
• Start from slow thread
• Walk backwards to construct wait graph
[Figure: wait graph drawn along a time axis. Nodes: Firefox, Network Stack, Search Indexer, Anti-Virus Checker. Edge labels: TCP/IP packet, File Lock, File Lock.]
27. Key Algorithm Ideas
• The insight is a wait graph
• Constructing the graph involves repeated queries into a large set of events
• Iterate:
• What was the current thread waiting on?
• Go to the source of the wait
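The iteration above can be sketched as a backward walk over wait records. Everything here is a simplification for illustration: the `waits` mapping, its thread names and its edges are hypothetical, each thread waits on only one thing (so the result is a chain, the simplest wait graph), and a dictionary lookup stands in for a query into the event log.

```python
# Hypothetical wait records: thread -> (resource it waited on, thread holding it).
# Real ETW events carry timestamps, stacks and much more.
waits = {
    "Firefox": ("File Lock", "Search Indexer"),
    "Search Indexer": ("File Lock", "Anti-Virus Checker"),
}


def build_wait_chain(slow_thread, waits):
    """Walk backwards from the slow thread, following each wait to its source."""
    chain = []
    current = slow_thread
    seen = {current}
    while current in waits:                 # "What was the current thread waiting on?"
        resource, holder = waits[current]
        chain.append((current, resource, holder))
        if holder in seen:                  # stop if the waits form a cycle
            break
        seen.add(holder)
        current = holder                    # "Go to the source of the wait"
    return chain


chain = build_wait_chain("Firefox", waits)
for waiter, resource, holder in chain:
    print(f"{waiter} waited on {resource} held by {holder}")
```

The walk stops at the first thread that was not itself waiting on anything, which is the candidate culprit.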
30. Between Data and Algorithms
• Data structures
• Organized data
• Optimized for certain computations
• The key to efficient analysis
• Algorithms prefer certain data structures
• Raw data is amenable to certain data structures
[Figure: Data → (amenable) → Data Structures ← (preference) ← Algorithms]
31. The Disconnect
• Machine Learning research – largely disconnected from implementation
• Some recent advances in large-scale ML are rediscovering known data structures
• Next-gen ML tools need well-tailored data structures
[Figure: two separate boxes. Machine Learning (statistics, optimization, linear algebra, …) and Data Structures (lists, trees, tables, graphs, …).]
32. Two Useful Data Structures
• Flat tables
• Graphs
34. Flat Tables
• Rows and columns
• Rows = records
• Columns can be typed
• A lot of raw data looks like flat tables!
35. Example 1
User     Item                     Rating  Time
Alice    Breaking Bad, Season 1   3       …
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
User-Item interaction data
36. Example 2
Timestamp  Name          PID   CPU  Stack                                …
447590409  audiodg.exe   1848  1    ntkrnlpa.exe!KeSetEvent
                                    ntkrnlpa.exe!WaitForLock
447590411  csrss.exe     460   0    …
447590415  iexplore.exe  2478  1    kernel64.exe!WaitForMultipleObjects
…
Event log data
37. Variations of Flat Tables
• Query vs. computation
• Random access (in-memory) vs. sequential access (on-disk)
• Column vs. row-wise representation
• Indexed or not
• Distributed or not
• Key-value stores (hash tables)
41. Example of Indexed Flat Table
User     Item                     Rating
Alice    Breaking Bad, Season 1   3
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
Index
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6
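A minimal sketch of such an index, assuming the table is held in memory as a list of tuples and the index is a plain dictionary from user name to 1-based row numbers. The names `user_index` and `items_rated_by` are illustrative.

```python
from collections import defaultdict

# The flat table from the slide: (user, item, rating).
rows = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]

# Build the index once: user -> list of 1-based row numbers.
user_index = defaultdict(list)
for row_number, (user, _item, _rating) in enumerate(rows, start=1):
    user_index[user].append(row_number)


def items_rated_by(user):
    """Answer the query via the index instead of scanning every row."""
    return [rows[n - 1][1] for n in user_index.get(user, [])]


print(user_index["Bob"])        # row numbers for Bob
print(items_rated_by("Bob"))    # the items on those rows
```

Building the index costs one pass over the table; after that, each lookup touches only the matching rows.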
42. Back to the Recommender
• Training: compute a matrix
• Recommending: vector-matrix product
• Raw data: user-item interaction log
• Load in as flat table
• Build index (user-item matrix)
• Iterate through the users to train
43. ML on Flat Tables
• Anything where data is represented as feature vectors
• Computations operate on rows
• Stochastic gradient descent
• K-means clustering
• … or columns
• Decision tree family
46. Implementation 1: Edge List
• A simple flat table!
• Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)
User     Item
Alice    Breaking Bad, Season 1
Charlie  Twilight
Bob      Silver Linings Playbook
Frank    American Hustle
Tina     Plan 9 From Outer Space
Bob      Twin Peaks
Diana    Dr. Strangelove
…
47. Implementation 2: Edge List + Vertex List
• Two flat tables
• Pre-computed join on VertexID
Vertex list:
VertexID  Name                     Age  Genre
1         Alice                    50
2         Charlie                  26
3         Bob                      33
…
100001    Silver Linings Playbook       Romance
100002    Iron Man 3                    Action
100003    Twin Peaks                    Thriller

Edge list:
SrcVertex  DstVertex
1          389944
2          136782
3          100001
4          572639
5          200835
3          100003
…
50. Graph Operations
• get_neighbors():
1. Query indexed flat table
2. Join with vertex table on VertexID or Name
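Result of step 1 (rows of the indexed edge table):
User  Movie                    Rating
Bob   Silver Linings Playbook  4
Bob   Twin Peaks               2

Result of step 2 (matching rows of the vertex table):
VertexID  Name                     Age  Genre
3         Bob                      33
100001    Silver Linings Playbook       Romance
100003    Twin Peaks                    Thriller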
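The two steps of get_neighbors() can be sketched on top of two in-memory tables. This is a hedged illustration, not GraphLab's API: the dictionary-based vertex table, the `edges_by_src` index and the returned record shape are all assumptions.

```python
from collections import defaultdict

# Vertex table keyed by VertexID (values from the slides).
vertices = {
    3: {"name": "Bob", "age": 33},
    100001: {"name": "Silver Linings Playbook", "genre": "Romance"},
    100003: {"name": "Twin Peaks", "genre": "Thriller"},
}

# Edge table: (SrcVertex, DstVertex, rating).
edges = [
    (3, 100001, 4),
    (3, 100003, 2),
]

# Index on the source vertex, so a lookup does not scan the whole edge table.
edges_by_src = defaultdict(list)
for row, (src, _dst, _rating) in enumerate(edges):
    edges_by_src[src].append(row)


def get_neighbors(vertex_id):
    """Step 1: query the indexed edge table. Step 2: join with the vertex table."""
    neighbors = []
    for row in edges_by_src.get(vertex_id, []):
        _src, dst, rating = edges[row]
        neighbors.append({**vertices[dst], "rating": rating})
    return neighbors


for neighbor in get_neighbors(3):
    print(neighbor)
```

Pre-computing the join, as slide 47 suggests, would trade memory for skipping the second lookup on every call.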
51. Graph Operations
• get_subgraph():
• get_neighbors(), instantiate new table with subset of rows of old tables
• Find edges/vertices with attribute = x
• Filter old tables
• Hypergraph – edges span more than 2 vertices
• Just add more columns to the edge table
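Both operations reduce to filtering rows of the existing tables. The sketch below assumes edges are stored as dictionaries so that attributes are ordinary columns; the helper names and the extra edge from vertex 1 are hypothetical.

```python
vertices = {
    1: {"name": "Alice"},
    3: {"name": "Bob"},
    100001: {"name": "Silver Linings Playbook"},
    100003: {"name": "Twin Peaks"},
}

edges = [
    {"src": 3, "dst": 100001, "rating": 4},
    {"src": 3, "dst": 100003, "rating": 2},
    {"src": 1, "dst": 100001, "rating": 5},  # hypothetical extra edge
]


def filter_edges(edges, attribute, value):
    """Find edges with attribute = value by filtering the edge table."""
    return [e for e in edges if e.get(attribute) == value]


def get_subgraph(vertices, edges, vertex_id):
    """New tables holding only the rows that touch vertex_id."""
    sub_edges = [e for e in edges if vertex_id in (e["src"], e["dst"])]
    keep = {vertex_id}
    for e in sub_edges:
        keep.update((e["src"], e["dst"]))
    sub_vertices = {v: vertices[v] for v in keep}
    return sub_vertices, sub_edges


bob_vertices, bob_edges = get_subgraph(vertices, edges, 3)
print(sorted(bob_vertices), len(bob_edges))
print(filter_edges(edges, "rating", 4))
```

A hyperedge would fit the same layout by adding further vertex columns to each edge row.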
52. Back to Syslog Mining
• Wait graph construction = search and filter
• Iterate:
• get_neighbors()
• filter on edge and vertex attribute to find culprits
• Sequential process
• Underlying event graph is enormous
• SLOW
53. ML on Graphs
• Graphical models (Bayes nets)
• Belief propagation
• Gibbs sampling
• Random walk on Markov chains
• PageRank
• Some algorithms can be implemented on either tables or graphs
• Matrix factorization
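As one concrete instance of a random-walk algorithm on a graph, here is a small power-iteration sketch of PageRank. The three-node graph, the damping factor of 0.85 and the fixed iteration count are illustrative assumptions; every link target must also appear as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration: repeatedly push each node's rank along its out-links."""
    nodes = sorted(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random-jump share that every node receives.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph: node -> list of nodes it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for node, score in sorted(ranks.items()):
    print(node, round(score, 3))
```

Each pass only needs a node's out-edges, which is exactly the neighbor lookup that a graph data structure is built to make cheap.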
55. Graphs vs. Tables
• Closely related
• Graphs can be implemented on top of tables
• … yet different
• What key operations to optimize
• How much to pre-compute
• Indexes
• Joins
• Filters
57. Flat Tables
[Figure: flat-table tools placed on two axes. One axis: Random Access (In Memory) vs. Sequential Access (On Disk). Other axis: Querying (Interactive) vs. Computation (Batch). Tools shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame.]
58. Graphs
[Figure: graph tools placed on two axes. One axis: Random Access (In-Memory) vs. Sequential Access (On Disk). Other axis: Querying (Interactive) vs. Computation (Batch). Tools shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph.]
59. Conclusions
• Fast and scalable analysis hinges upon efficient data structures
• Match the algorithm to the data structure
• Morph raw data into the data structure
[Figure: Raw Data → Data Structure → Algorithm → Insight]
60. Advertising
• GraphLab Tutorial this afternoon!
• “Large Scale Machine Learning Cookbook Using GraphLab”
• Ballroom G, 1:30pm—5pm
Editor's Notes
To understand the problems involved in Big Data analysis, we have to switch from a learning- or modeling-centric view to a data-centric view.