SlideShare a Scribd company logo
1 of 38
Download to read offline
Graph Analysis Trends and 
Opportunities 
Dr. Jason Riedy and Dr. David Bader 
Georgia Institute of Technology 
CMG Performance & Capacity, November 2014
Dr. Jason Riedy 
É Research Scientist II, Computational Science and 
Engineering 
É PhD UC Berkeley, 2010 
É Major developer of STING, community-el, and other 
used graph analysis codes 
É PI or co-PI on > 5 current funded graph analysis 
projects 
É Primary author of the Graph500 specification 
É Program Committees for HPC conferences including 
IPDPS, HiPC, ICPP 
É 20+ referreed publications, dozens of cited technical 
reports,  350 citations, etc. 
É Widely used code in packages like LAPACK, BLAS; 
contributions ranging from git to GNU R and Octave 
Riedy, Bader— Graph Analysis CMG, Nov 2014 2 / 38
Outline 
Graph Analysis Introduction 
Motivation and Applications 
Data Volumes and Velocities 
Methods 
Tools 
Hardware 
Summary and Opportunities 
Riedy, Bader— Graph Analysis CMG, Nov 2014 3 / 38
(insert prefix here)-scale data analysis 
Cyber-security Identify anomalies, malicious actors 
Health care Finding outbreaks, population epidemiology 
Social networks Advertising, searching, grouping 
Intelligence Decisions at scale, regulating algorithms 
Systems biology Understanding interactions, drug design 
Power grid Disruptions, conservation 
Simulation Discrete events, cracking meshes 
É Graphs are a unifying motif for data analysis. 
É Changing and dynamic graphs are important! 
Riedy, Bader— Graph Analysis CMG, Nov 2014 4 / 38
Why Graphs? 
É Smaller, more generalized than 
raw data. 
É Taught (roughly) to all CS 
students... 
É Semantic attributions can 
capture essential relationships. 
É Traversals can be faster than 
filtering DB joins. 
É Provide clear phrasing for 
queries about relationships. 
Often next step after dense and sparse linear algebra. 
Riedy, Bader— Graph Analysis CMG, Nov 2014 5 / 38
Graphs: A Fundamental Abstraction 
Structure for “unstructured” data 
É Traditional uses: 
É Route planning on fixed routes 
É Logistic planning between sources, routes, destinations 
É Increasing uses: 
É Computer security: Identify anomalies (e.g. spam, viruses, 
hacks) as they occur, insider threats, control access, 
localize malware 
É Data / intelligence integration: Find smaller, relevant 
subsets of massive, “unstructured” data piles 
É Recommender systems (industry): Given activities, 
automatically find other interesting data. 
Riedy, Bader— Graph Analysis CMG, Nov 2014 6 / 38
Application: Analyzing Twitter for Social Good 
ICPP 2010 public tweets 
Image credit: bioethicsinstitute.org 
Riedy, Bader— Graph Analysis CMG, Nov 2014 7 / 38
Application: Social Network Analysis 
Problems 
É Detecting “communities” 
automatically 
É Identifying important 
individuals 
É Given a few members, 
finding a joint community 
É Finding actual anomalies 
What techniques can scale to 
massive, noisy, changing 
populations? 
http://xkcd.com/723 
http://xkcd.com/802 
Riedy, Bader— Graph Analysis CMG, Nov 2014 8 / 38
And more applications... 
É Cybersecurity 
É Determine if new packets are allowed or represent new 
threat in  5ms... 
É Is the transfer a virus? Illicit? 
É Credit fraud forensics ) detection ) monitoring 
É Integrate all the customer’s data 
É Becoming closer to real-time, massive scale 
É Bioinformatics 
É Construct gene sequences, analyze protein interactions, 
map brain interactions 
É Amount of new data arriving is growing massively 
É Power network planning, monitoring, and re-routing 
É Already nation-scale problem 
É As more power sources come online (rooftop solar)... 
Riedy, Bader— Graph Analysis CMG, Nov 2014 9 / 38
No shortage of data... 
Existing (some out-of-date) data volumes 
NYSE 1.5 TB generated daily into a maintained 8 PB archive 
Google “Several dozen” 1PB data sets (CACM, Jan 2010) 
LHC 15 PB per year (avg. 21 TB daily) 
Wal-Mart 536 TB, 1B entries daily (2006) 
EBay 2 PB traditional DB, and 6.5PB streaming, 17 trillion records, 
1.5B records/day, web click = 50–150 details. (2009) 
Facebook  1B monthly users... 
É All data is rich and semantic (graphs!) and changing. 
É Base data rates include items and not relationships. 
Riedy, Bader— Graph Analysis CMG, Nov 2014 10 / 38
Data velocities 
Data volumes 
NYSE 1.5TB daily 
LHC 41TB daily 
NG seq. 150GB per 
machine daily 
Facebook Who knows? 
Data transfer 
É 1 Gb Ethernet: 8.7TB daily at 100%, 
5-6TB daily realistic 
É PB disk rack, parallel 10GE: 1.7PB daily 
streaming read/write 
É CPU $Memory: QPI,HT: 
5+PB/day@100% 
Data growth 
É Facebook:  2/yr 
É Twitter:  10/yr 
É Growing sources: Health, 
sensors, security 
Speed growth 
É Ethernet/IB/etc.: 4 in next 2 years? 
É Memory: Slow growth, possible bump? 
É Direct storage: flash, then what? 
Riedy, Bader— Graph Analysis CMG, Nov 2014 11 / 38
Streaming graph data 
Data Rates 
Networks: 
É Gigabit ethernet: 81k – 1.5M packets per second 
É Over 130 000 flows per second on 10 GigE 
Person-level, from www.statisticsbrain.com: 
É 58M posts per day on Twitter (671 / sec) 
É 1M links shared per 20 minutes on Facebook 
Opportunities 
É Often analyze only changes, not entire graph 
É Throughput  latency: Different levels of concurrency 
Riedy, Bader— Graph Analysis CMG, Nov 2014 12 / 38
Methods 
Graph Analysis Introduction 
Methods 
Example Algorithm: BFS, Graph500 
Methods for Streaming Data 
Algorithmic Disruptions 
Tools 
Hardware 
Summary and Opportunities 
Riedy, Bader— Graph Analysis CMG, Nov 2014 13 / 38
General approaches: Static and Streaming 
Different approaches 
É High-performance static graph analysis 
É Techniques apply to unchanging massive graphs 
É Provides useful after-the-fact information, starting points. 
É Serves many existing applications well: market research, 
much bio-  health-informatics... 
É Massive-scale algorithms need to be O(jEj) or 
approximated down to it. 
É High-performance streaming graph analysis 
É Focus: smaller dynamic changes within massive graphs 
É Streaming data, not CS-style streaming algorithms 
É Find trends or new information as they appear. 
É Serves upcoming applications: fault or threat detection, 
trend analysis, online prediction... 
É Can be O(jEj)? O(Vol(V))? 
É Less data ) faster, more efficient, lower latency 
Riedy, Bader— Graph Analysis CMG, Nov 2014 14 / 38
Breadth-First Search 
The problem... 
Build a tree from a starting vertex by repeatedly visiting all 
immediate, unvisited neighbors. At each traversal, record a 
parent. Repeat until there are no unvisited neighbors. 
É O(jVj + jEj), but problem-dependent parallel performace 
É Core of many scalable, parallel graph algorithms 
É Non-deterministic when any parent works 
É Base of the Graph500 benchmark, “fastest traversal” 
É Isn’t it done yet? Nope. 
Riedy, Bader— Graph Analysis CMG, Nov 2014 15 / 38
Graph500 Performance History 
Plot courtesy of Scott Beamer, UC Berkeley 
Riedy, Bader— Graph Analysis CMG, Nov 2014 16 / 38
Graph500 Perf/Cores History 
Plot courtesy of Scott Beamer, UC Berkeley 
Riedy, Bader— Graph Analysis CMG, Nov 2014 17 / 38
Graph500 Perf v. Size, Summer 2014 
Riedy, Bader— Graph Analysis CMG, Nov 2014 18 / 38
Streaming Queries 
Different kinds of questions 
É How are individual graph metrics (e.g. clustering 
coefficients) changing? 
É What are the patterns in the changes? 
É Are there seasonal variations? 
É What are responses to events? 
É What are temporal anomalies in the graph? 
É Do key members in clusters / communities change? 
É Are there indicators of event responses before they are 
obvious? 
New kinds of queries, new challenges... 
Riedy, Bader— Graph Analysis CMG, Nov 2014 19 / 38
Performance on Streaming Graphs 
Work at Georgia Tech 
É Triangle counting / clustering coefficients 
É Up to 130k graph updates per second on X5570 (Nehalem-EP, 
2.93GHz) 
É Connected components  spanning forest 
É Over 88k graph updates per second on X5570 
É Community detection  maintenance 
É Up to 100 million updates per second, 4-socket 40-core 
Westmere-EX 
É (Note: Most updates do not change communities...) 
É Incremental PageRank 
É Reduce lower latency by  2 over restarting 
É Betweenness centrality 
É O(jVj  (jVj + jEj)), can be sampled 
É Speed-ups of 40–150 over static recomputation 
Riedy, Bader— Graph Analysis CMG, Nov 2014 20 / 38
Algorithmic Disruptions 
É Current: Computing on the data as it arrives, not 
recomputing over all data. 
É Faster, lower latency, lower power... 
É New algorithms for old problems. 
É Many practical parallel graph algorithms are not 
“work-efficient.” 
É New work is finding work-efficient and practical methods: 
Connected components (CMU: Shun, Dhulipala, Blelloch, 
SPAA 2014), betweenness centrality (McLaughlin and 
Bader, SC14) 
É Approximations and coping with errors 
É There is very little approximation theory for graph 
algorithms. 
É Not sure which metrics are sensitive to sampling, errors... 
(Zakrzewska and Bader, PPAM 2013) 
Riedy, Bader— Graph Analysis CMG, Nov 2014 21 / 38
Tools 
Graph Analysis Introduction 
Methods 
Tools 
Graph Databases 
Cluster/Cloud Tools 
“Capability” Tools 
HPC Tools 
Streaming Tools 
Software Disruptions 
Hardware 
RSieudym, Bmadear—rGyraaphnAdnalyOsispportunities CMG, Nov 2014 22 / 38
Tools 
Rough Categories 
Graph DB Neo4j, Sparksee (was DEX), AllegroGraph, 
Sesame, Titan, Flock... 
Clusters/Cloud GraphX, Pregel, giraph, pegasus... 
“Capability” igraph, networkX 
HPC KDT / GraphBLAS, GraphLab, NetworKIT, GraphCT 
Streaming GT STINGER 
Riedy, Bader— Graph Analysis CMG, Nov 2014 23 / 38
Graph Databases 
Pros 
É Incredibly flexible data models 
É Large ecosystem: 
É query and viz tools 
É data management tools 
Cons 
É Standard query languages do not support most 
algorithms. 
É The flexibility costs performance. Analysis algorithms run 
10 – 100 slower than more specific analysis tools, at 
least. (“A Performance Evaluation of Open Source Graph 
Databases,” R. McColl, et al., 2014) 
Riedy, Bader— Graph Analysis CMG, Nov 2014 24 / 38
Cluster/Cloud Tools 
Pros 
É Growing ecosystem, large buzz 
É Simple to write simple analyses. 
É Often the only systems that handle hardware failure! 
Cons 
É Performance can be comparable to graph databases... 
É Often incredibly difficult to write more complex algorithms 
É Clusters are expensive compared to single-node. 
É Many more power supplies 
É Wasted memory on OSes 
Riedy, Bader— Graph Analysis CMG, Nov 2014 25 / 38
“Capability” Tools 
Pros 
É Stockpiles of algorithms 
É Available for many interactive environments (e.g. R) 
É Good solution for exploring analysis of small data sets 
Cons 
É Rarely ever parallel 
É Often cannot scale to large problems 
Riedy, Bader— Graph Analysis CMG, Nov 2014 26 / 38
HPC Tools 
Pros 
É HPC: Fast. Really fast. Often fastest. 
É Scale to large problems 
É Exist for traditional HPC boxes, “cloud” allocations, etc. 
É Also for large-memory servers! 
Cons 
É Distributed-memory versions use very focused models for 
performance 
É GraphBLAS: Sparse matrix - sparse vector product 
É GraphLab: Vertex programs 
É If your problem does not fit the model... 
É Algorithms still being developed 
Riedy, Bader— Graph Analysis CMG, Nov 2014 27 / 38
Streaming Tools 
Pros 
É Great fit for streaming problems! 
É Astounding speed-ups over static re-analysis. Speed-up 
grows with problem size. 
É Can target high throughput or low latency. 
Cons 
É There really aren’t many tools... (STINGER at GT) 
É Terminology is very much in flux... 
É Algorithms are still being designed... 
Riedy, Bader— Graph Analysis CMG, Nov 2014 28 / 38
Software Disruptions 
É New algorithms are being developed, tuning can be 
astronomically hard. 
É “Work-efficient” is not always fastest, need sampling and 
run-time algorithm selection (McLaughlin and Bader, SC 
2014) 
É Combinations: Let each tool do what it does well. 
É Cloud/cluster: Fantastic for data extraction 
É HPC tools: Fantastic for analysis 
É Combination: Kang and Bader, MTAAP 2010, reduce 
analysis time by five orders of magnitude. 
É Cloud extraction ! streaming processing: Demonstrated 
with STINGER at Research@Intel 2013, GraphLab Workshop 
2013 
Riedy, Bader— Graph Analysis CMG, Nov 2014 29 / 38
Hardware 
Graph Analysis Introduction 
Methods 
Tools 
Hardware 
Architecture Requirements 
Existing Platforms 
Disruptive Platform Changes 
Summary and Opportunities 
Riedy, Bader— Graph Analysis CMG, Nov 2014 30 / 38
Architecture Requirements for Efficiency 
The issues 
É Runtime is dominated by latency 
É “Random” accesses to global graph and data storage 
É Can hot-spot: Many accesses to the same place 
É Essentially no computation to hide the latency 
É Access pattern is problem dependent 
É Prefetching can hinder performance 
É Often only want a small portion of data 
É Most parts suffer from abysmal locality in memory 
É Cannot require a nuclear reactor. 
Riedy, Bader— Graph Analysis CMG, Nov 2014 31 / 38
Architecture Requirements for Efficiency 
Some desires 
É Large memory capacity 
É Low latency, high bandwidth, high injection rate 
É For very small messages! 
É Latency tolerance (threading...) 
É Light-weight, localized synchronization 
É Global address space 
É Partitioning is nigh impossible 
É Ghost nodes everywhere 
É Algorithms are difficult enough to implement 
Riedy, Bader— Graph Analysis CMG, Nov 2014 32 / 38
Existing Platforms 
É Distributed memory / cluster 
É Cloud-ish: Slow network, massive storage 
É HPC-ish: Fast network, less storage 
É Shared memory 
É Single motherboard: Ultra-fast network, little storage 
É Many motherboards: Tricky... 
É Accelerators: Tiny memory, incredible bandwidth 
Now start combining the platforms... 
Riedy, Bader— Graph Analysis CMG, Nov 2014 33 / 38
Mapping Problems to Platforms 
É Distributed memory / cluster 
É Cloud-ish: Fantastic for massive storage, extraction 
É HPC-ish: Great for known, forensic analysis on extracted 
graph 
É All of them eat power. 
É Shared memory 
É Highly-threaded, single node: Focused analysis, streaming 
É Highly-threaded, multi-node: Often hard to extract enough 
parallelism (Cray XMT / URiKA) 
É Multi-node virtual shared memory: Re-eval in progress 
É Single node often eats less power, but... 
É Accelerators 
É Very, very focused analysis 
É Can be very energy-efficient (McLaughlin, Riedy, Bader, 
HPEC 2014) 
Riedy, Bader— Graph Analysis CMG, Nov 2014 34 / 38
Distruptive Platform Changes 
É In next 3–5 years, memory is going to change. 
É 3D stacked memory (IBM, NVIDIA) 
É Hybrid memory cube (HMC Cons., Micron, Intel) 
É Programming logic layer on-chip 
É Possibly non-volatile 
É Order of magnitude higher bandwidth 
É Order of magnitude lower energy cost 
É This is happening. You can obtain HMC-FPGA 
combinations for testing. 
É Interconnects are changing. 
É Processor ,memory ,accelerator (NVLink, Phi) 
É Data-center networks finally may change, not just nGbE 
Riedy, Bader— Graph Analysis CMG, Nov 2014 35 / 38
Summary and Opportunities 
We live in interesting times. 
É Graph analysis tools, platforms are developing rapidly. 
É Only just starting to combine platforms and map problems 
appropriately. 
É Performance is developing rapidly. 
É New algorithms, improved implementations, better 
platform choices 
É New approaches like streaming and approximation 
É Even bigger changes are coming. 
É Can you imagine a PB of non-volatile storage at nearly RAM 
speed and latency? 
Riedy, Bader— Graph Analysis CMG, Nov 2014 36 / 38
STINGER: Where do you get it? 
www.cc.gatech.edu/stinger/ 
Gateway to 
É code, 
É development, 
É documentation, 
É presentations... 
Remember: Still academic code, but 
maturing. 
Users / contributors / questioners: 
Georgia Tech, PNNL, CMU, Berkeley, 
Intel, Cray, NVIDIA, IBM, Federal 
Government, Ionic Security, Citi 
Riedy, Bader— Graph Analysis CMG, Nov 2014 37 / 38
Acknowledgment of support 
Riedy, Bader— Graph Analysis CMG, Nov 2014 38 / 38

More Related Content

What's hot

Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
 

What's hot (20)

The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Machine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLabMachine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLab
 
Learning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data AugmentationLearning to Compose Domain-Specific Transformations for Data Augmentation
Learning to Compose Domain-Specific Transformations for Data Augmentation
 
Three Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceThree Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data Science
 
Icml2017 overview
Icml2017 overviewIcml2017 overview
Icml2017 overview
 
Towards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsTowards Visualization Recommendation Systems
Towards Visualization Recommendation Systems
 
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
CLIM Program: Remote Sensing Workshop, Foundations Session: A Discussion - Br...
 
DataHub
DataHubDataHub
DataHub
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
ArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network Analysis
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 

Viewers also liked

STINGER: Multi-threaded Graph Streaming
STINGER: Multi-threaded Graph StreamingSTINGER: Multi-threaded Graph Streaming
STINGER: Multi-threaded Graph Streaming
Jason Riedy
 

Viewers also liked (10)

STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
 
Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportuni...
Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportuni...Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportuni...
Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportuni...
 
Graph Exploitation Seminar, 2011
Graph Exploitation Seminar, 2011Graph Exploitation Seminar, 2011
Graph Exploitation Seminar, 2011
 
Expert Insight on Implementing New Service Channels
Expert Insight on Implementing New Service ChannelsExpert Insight on Implementing New Service Channels
Expert Insight on Implementing New Service Channels
 
Network Challenge: Error and Sensitivity Analysis
Network Challenge: Error and Sensitivity AnalysisNetwork Challenge: Error and Sensitivity Analysis
Network Challenge: Error and Sensitivity Analysis
 
Introduction to STINGER
Introduction to STINGERIntroduction to STINGER
Introduction to STINGER
 
Community Detection with Networkx
Community Detection with NetworkxCommunity Detection with Networkx
Community Detection with Networkx
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social Media
 
STINGER: Multi-threaded Graph Streaming
STINGER: Multi-threaded Graph StreamingSTINGER: Multi-threaded Graph Streaming
STINGER: Multi-threaded Graph Streaming
 

Similar to Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014

Current Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data BenchmarkingCurrent Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data Benchmarking
eXascale Infolab
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
ranjit banshpal
 

Similar to Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014 (20)

Current Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data BenchmarkingCurrent Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data Benchmarking
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale Analytics
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Data analysis
Data analysisData analysis
Data analysis
 
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
 
Come diventare data scientist - Paolo Pellegrini
Come diventare data scientist - Paolo PellegriniCome diventare data scientist - Paolo Pellegrini
Come diventare data scientist - Paolo Pellegrini
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
Massive Data Analysis- Challenges and Applications
Massive Data Analysis- Challenges and ApplicationsMassive Data Analysis- Challenges and Applications
Massive Data Analysis- Challenges and Applications
 
Ets train ppt_big_data_basics_v2.0
Ets train ppt_big_data_basics_v2.0Ets train ppt_big_data_basics_v2.0
Ets train ppt_big_data_basics_v2.0
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
The role of data engineering in data science and analytics practice
The role of data engineering in data science and analytics practiceThe role of data engineering in data science and analytics practice
The role of data engineering in data science and analytics practice
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016
 
Hadoop dev 01
Hadoop dev 01Hadoop dev 01
Hadoop dev 01
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019
 
IDC Perspectives on Big Data Outside of HPC
IDC Perspectives on Big Data Outside of HPCIDC Perspectives on Big Data Outside of HPC
IDC Perspectives on Big Data Outside of HPC
 

More from Jason Riedy

STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
Jason Riedy
 

More from Jason Riedy (20)

Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13
 
Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
Graph analysis and novel architectures
Graph analysis and novel architecturesGraph analysis and novel architectures
Graph analysis and novel architectures
 
GraphBLAS and Emus
GraphBLAS and EmusGraphBLAS and Emus
GraphBLAS and Emus
 
Reproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to ArchitectureReproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to Architecture
 
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
 
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to ArchitectureICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Novel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondNovel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and Beyond
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with Microbenchmarks
 
CRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateCRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery Update
 
Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018
 
Graph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesGraph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New Architectures
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsHigh-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
 
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive GraphsSIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
 

Recently uploaded

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 

Recently uploaded (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014

  • 1. Graph Analysis Trends and Opportunities Dr. Jason Riedy and Dr. David Bader Georgia Institute of Technology CMG Performance & Capacity, November 2014
  • 2. Dr. Jason Riedy É Research Scientist II, Computational Science and Engineering É PhD UC Berkeley, 2010 É Major developer of STING, community-el, and other used graph analysis codes É PI or co-PI on > 5 current funded graph analysis projects É Primary author of the Graph500 specification É Program Committees for HPC conferences including IPDPS, HiPC, ICPP É 20+ referreed publications, dozens of cited technical reports, 350 citations, etc. É Widely used code in packages like LAPACK, BLAS; contributions ranging from git to GNU R and Octave Riedy, Bader— Graph Analysis CMG, Nov 2014 2 / 38
  • 3. Outline Graph Analysis Introduction Motivation and Applications Data Volumes and Velocities Methods Tools Hardware Summary and Opportunities Riedy, Bader— Graph Analysis CMG, Nov 2014 3 / 38
  • 4. (insert prefix here)-scale data analysis Cyber-security Identify anomalies, malicious actors Health care Finding outbreaks, population epidemiology Social networks Advertising, searching, grouping Intelligence Decisions at scale, regulating algorithms Systems biology Understanding interactions, drug design Power grid Disruptions, conservation Simulation Discrete events, cracking meshes É Graphs are a unifying motif for data analysis. É Changing and dynamic graphs are important! Riedy, Bader— Graph Analysis CMG, Nov 2014 4 / 38
  • 5. Why Graphs? É Smaller, more generalized than raw data. É Taught (roughly) to all CS students... É Semantic attributions can capture essential relationships. É Traversals can be faster than filtering DB joins. É Provide clear phrasing for queries about relationships. Often next step after dense and sparse linear algebra. Riedy, Bader— Graph Analysis CMG, Nov 2014 5 / 38
  • 6. Graphs: A Fundamental Abstraction Structure for “unstructured” data É Traditional uses: É Route planning on fixed routes É Logistic planning between sources, routes, destinations É Increasing uses: É Computer security: Identify anomalies (e.g. spam, viruses, hacks) as they occur, insider threats, control access, localize malware É Data / intelligence integration: Find smaller, relevant subsets of massive, “unstructured” data piles É Recommender systems (industry): Given activities, automatically find other interesting data. Riedy, Bader— Graph Analysis CMG, Nov 2014 6 / 38
  • 7. Application: Analyzing Twitter for Social Good ICPP 2010 public tweets Image credit: bioethicsinstitute.org Riedy, Bader— Graph Analysis CMG, Nov 2014 7 / 38
  • 8. Application: Social Network Analysis Problems É Detecting “communities” automatically É Identifying important individuals É Given a few members, finding a joint community É Finding actual anomalies What techniques can scale to massive, noisy, changing populations? http://xkcd.com/723 http://xkcd.com/802 Riedy, Bader— Graph Analysis CMG, Nov 2014 8 / 38
  • 9. And more applications... É Cybersecurity É Determine if new packets are allowed or represent new threat in 5ms... É Is the transfer a virus? Illicit? É Credit fraud forensics ) detection ) monitoring É Integrate all the customer’s data É Becoming closer to real-time, massive scale É Bioinformatics É Construct gene sequences, analyze protein interactions, map brain interactions É Amount of new data arriving is growing massively É Power network planning, monitoring, and re-routing É Already nation-scale problem É As more power sources come online (rooftop solar)... Riedy, Bader— Graph Analysis CMG, Nov 2014 9 / 38
  • 10. No shortage of data... Existing (some out-of-date) data volumes NYSE 1.5 TB generated daily into a maintained 8 PB archive Google “Several dozen” 1PB data sets (CACM, Jan 2010) LHC 15 PB per year (avg. 21 TB daily) Wal-Mart 536 TB, 1B entries daily (2006) EBay 2 PB traditional DB, and 6.5PB streaming, 17 trillion records, 1.5B records/day, web click = 50–150 details. (2009) Facebook 1B monthly users... É All data is rich and semantic (graphs!) and changing. É Base data rates include items and not relationships. Riedy, Bader— Graph Analysis CMG, Nov 2014 10 / 38
  • 11. Data velocities Data volumes NYSE 1.5TB daily LHC 41TB daily NG seq. 150GB per machine daily Facebook Who knows? Data transfer É 1 Gb Ethernet: 8.7TB daily at 100%, 5-6TB daily realistic É PB disk rack, parallel 10GE: 1.7PB daily streaming read/write É CPU $Memory: QPI,HT: 5+PB/day@100% Data growth É Facebook: 2/yr É Twitter: 10/yr É Growing sources: Health, sensors, security Speed growth É Ethernet/IB/etc.: 4 in next 2 years? É Memory: Slow growth, possible bump? É Direct storage: flash, then what? Riedy, Bader— Graph Analysis CMG, Nov 2014 11 / 38
  • 12. Streaming graph data Data Rates Networks: É Gigabit ethernet: 81k – 1.5M packets per second É Over 130 000 flows per second on 10 GigE Person-level, from www.statisticsbrain.com: É 58M posts per day on Twitter (671 / sec) É 1M links shared per 20 minutes on Facebook Opportunities É Often analyze only changes, not entire graph É Throughput latency: Different levels of concurrency Riedy, Bader— Graph Analysis CMG, Nov 2014 12 / 38
  • 13. Methods Graph Analysis Introduction Methods Example Algorithm: BFS, Graph500 Methods for Streaming Data Algorithmic Disruptions Tools Hardware Summary and Opportunities Riedy, Bader— Graph Analysis CMG, Nov 2014 13 / 38
  • 14. General approaches: Static and Streaming Different approaches É High-performance static graph analysis É Techniques apply to unchanging massive graphs É Provides useful after-the-fact information, starting points. É Serves many existing applications well: market research, much bio- health-informatics... É Massive-scale algorithms need to be O(jEj) or approximated down to it. É High-performance streaming graph analysis É Focus: smaller dynamic changes within massive graphs É Streaming data, not CS-style streaming algorithms É Find trends or new information as they appear. É Serves upcoming applications: fault or threat detection, trend analysis, online prediction... É Can be O(jEj)? O(Vol(V))? É Less data ) faster, more efficient, lower latency Riedy, Bader— Graph Analysis CMG, Nov 2014 14 / 38
  • 15. Breadth-First Search The problem... Build a tree from a starting vertex by repeatedly visiting all immediate, unvisited neighbors. At each traversal, record a parent. Repeat until there are no unvisited neighbors. É O(jVj + jEj), but problem-dependent parallel performace É Core of many scalable, parallel graph algorithms É Non-deterministic when any parent works É Base of the Graph500 benchmark, “fastest traversal” É Isn’t it done yet? Nope. Riedy, Bader— Graph Analysis CMG, Nov 2014 15 / 38
  • 16. Graph500 Performance History Plot courtesy of Scott Beamer, UC Berkeley Riedy, Bader— Graph Analysis CMG, Nov 2014 16 / 38
  • 17. Graph500 Perf/Cores History Plot courtesy of Scott Beamer, UC Berkeley Riedy, Bader— Graph Analysis CMG, Nov 2014 17 / 38
  • 18. Graph500 Perf v. Size, Summer 2014 Riedy, Bader— Graph Analysis CMG, Nov 2014 18 / 38
  • 19. Streaming Queries Different kinds of questions É How are individual graph metrics (e.g. clustering coefficients) changing? É What are the patterns in the changes? É Are there seasonal variations? É What are responses to events? É What are temporal anomalies in the graph? É Do key members in clusters / communities change? É Are there indicators of event responses before they are obvious? New kinds of queries, new challenges... Riedy, Bader— Graph Analysis CMG, Nov 2014 19 / 38
  • 20. Performance on Streaming Graphs Work at Georgia Tech É Triangle counting / clustering coefficients É Up to 130k graph updates per second on X5570 (Nehalem-EP, 2.93GHz) É Connected components spanning forest É Over 88k graph updates per second on X5570 É Community detection maintenance É Up to 100 million updates per second, 4-socket 40-core Westmere-EX É (Note: Most updates do not change communities...) É Incremental PageRank É Reduce lower latency by 2 over restarting É Betweenness centrality É O(jVj (jVj + jEj)), can be sampled É Speed-ups of 40–150 over static recomputation Riedy, Bader— Graph Analysis CMG, Nov 2014 20 / 38
  • 21. Algorithmic Disruptions É Current: Computing on the data as it arrives, not recomputing over all data. É Faster, lower latency, lower power... É New algorithms for old problems. É Many practical parallel graph algorithms are not “work-efficient.” É New work is finding work-efficient and practical methods: Connected components (CMU: Shun, Dhulipala, Blelloch, SPAA 2014), betweenness centrality (McLaughlin and Bader, SC14) É Approximations and coping with errors É There is very little approximation theory for graph algorithms. É Not sure which metrics are sensitive to sampling, errors... (Zakrzewska and Bader, PPAM 2013) Riedy, Bader— Graph Analysis CMG, Nov 2014 21 / 38
  • 22. Tools Graph Analysis Introduction Methods Tools Graph Databases Cluster/Cloud Tools “Capability” Tools HPC Tools Streaming Tools Software Disruptions Hardware RSieudym, Bmadear—rGyraaphnAdnalyOsispportunities CMG, Nov 2014 22 / 38
  • 23. Tools Rough Categories Graph DB Neo4j, Sparksee (was DEX), AllegroGraph, Sesame, Titan, Flock... Clusters/Cloud GraphX, Pregel, giraph, pegasus... “Capability” igraph, networkX HPC KDT / GraphBLAS, GraphLab, NetworKIT, GraphCT Streaming GT STINGER Riedy, Bader— Graph Analysis CMG, Nov 2014 23 / 38
  • 24. Graph Databases Pros É Incredibly flexible data models É Large ecosystem: É query and viz tools É data management tools Cons É Standard query languages do not support most algorithms. É The flexibility costs performance. Analysis algorithms run 10 – 100 slower than more specific analysis tools, at least. (“A Performance Evaluation of Open Source Graph Databases,” R. McColl, et al., 2014) Riedy, Bader— Graph Analysis CMG, Nov 2014 24 / 38
  • 25. Cluster/Cloud Tools Pros É Growing ecosystem, large buzz É Simple to write simple analyses. É Often the only systems that handle hardware failure! Cons É Performance can be comparable to graph databases... É Often incredibly difficult to write more complex algorithms É Clusters are expensive compared to single-node. É Many more power supplies É Wasted memory on OSes Riedy, Bader— Graph Analysis CMG, Nov 2014 25 / 38
  • 26. “Capability” Tools Pros É Stockpiles of algorithms É Available for many interactive environments (e.g. R) É Good solution for exploring analysis of small data sets Cons É Rarely ever parallel É Often cannot scale to large problems Riedy, Bader— Graph Analysis CMG, Nov 2014 26 / 38
  • 27. HPC Tools Pros É HPC: Fast. Really fast. Often fastest. É Scale to large problems É Exist for traditional HPC boxes, “cloud” allocations, etc. É Also for large-memory servers! Cons É Distributed-memory versions use very focused models for performance É GraphBLAS: Sparse matrix - sparse vector product É GraphLab: Vertex programs É If your problem does not fit the model... É Algorithms still being developed Riedy, Bader— Graph Analysis CMG, Nov 2014 27 / 38
  • 28. Streaming Tools Pros É Great fit for streaming problems! É Astounding speed-ups over static re-analysis. Speed-up grows with problem size. É Can target high throughput or low latency. Cons É There really aren’t many tools... (STINGER at GT) É Terminology is very much in flux... É Algorithms are still being designed... Riedy, Bader— Graph Analysis CMG, Nov 2014 28 / 38
  • 29. Software Disruptions É New algorithms are being developed, tuning can be astronomically hard. É “Work-efficient” is not always fastest, need sampling and run-time algorithm selection (McLaughlin and Bader, SC 2014) É Combinations: Let each tool do what it does well. É Cloud/cluster: Fantastic for data extraction É HPC tools: Fantastic for analysis É Combination: Kang and Bader, MTAAP 2010, reduce analysis time by five orders of magnitude. É Cloud extraction ! streaming processing: Demonstrated with STINGER at Research@Intel 2013, GraphLab Workshop 2013 Riedy, Bader— Graph Analysis CMG, Nov 2014 29 / 38
  • 30. Hardware Graph Analysis Introduction Methods Tools Hardware Architecture Requirements Existing Platforms Disruptive Platform Changes Summary and Opportunities Riedy, Bader— Graph Analysis CMG, Nov 2014 30 / 38
  • 31. Architecture Requirements for Efficiency The issues É Runtime is dominated by latency É “Random” accesses to global graph and data storage É Can hot-spot: Many accesses to the same place É Essentially no computation to hide the latency É Access pattern is problem dependent É Prefetching can hinder performance É Often only want a small portion of data É Most parts suffer from abysmal locality in memory É Cannot require a nuclear reactor. Riedy, Bader— Graph Analysis CMG, Nov 2014 31 / 38
  • 32. Architecture Requirements for Efficiency Some desires É Large memory capacity É Low latency, high bandwidth, high injection rate É For very small messages! É Latency tolerance (threading...) É Light-weight, localized synchronization É Global address space É Partitioning is nigh impossible É Ghost nodes everywhere É Algorithms are difficult enough to implement Riedy, Bader— Graph Analysis CMG, Nov 2014 32 / 38
  • 33. Existing Platforms É Distributed memory / cluster É Cloud-ish: Slow network, massive storage É HPC-ish: Fast network, less storage É Shared memory É Single motherboard: Ultra-fast network, little storage É Many motherboards: Tricky... É Accelerators: Tiny memory, incredible bandwidth Now start combining the platforms... Riedy, Bader— Graph Analysis CMG, Nov 2014 33 / 38
  • 34. Mapping Problems to Platforms É Distributed memory / cluster É Cloud-ish: Fantastic for massive storage, extraction É HPC-ish: Great for known, forensic analysis on extracted graph É All of them eat power. É Shared memory É Highly-threaded, single node: Focused analysis, streaming É Highly-threaded, multi-node: Often hard to extract enough parallelism (Cray XMT / URiKA) É Multi-node virtual shared memory: Re-eval in progress É Single node often eats less power, but... É Accelerators É Very, very focused analysis É Can be very energy-efficient (McLaughlin, Riedy, Bader, HPEC 2014) Riedy, Bader— Graph Analysis CMG, Nov 2014 34 / 38
  • 35. Distruptive Platform Changes É In next 3–5 years, memory is going to change. É 3D stacked memory (IBM, NVIDIA) É Hybrid memory cube (HMC Cons., Micron, Intel) É Programming logic layer on-chip É Possibly non-volatile É Order of magnitude higher bandwidth É Order of magnitude lower energy cost É This is happening. You can obtain HMC-FPGA combinations for testing. É Interconnects are changing. É Processor ,memory ,accelerator (NVLink, Phi) É Data-center networks finally may change, not just nGbE Riedy, Bader— Graph Analysis CMG, Nov 2014 35 / 38
  • 36. Summary and Opportunities We live in interesting times. É Graph analysis tools, platforms are developing rapidly. É Only just starting to combine platforms and map problems appropriately. É Performance is developing rapidly. É New algorithms, improved implementations, better platform choices É New approaches like streaming and approximation É Even bigger changes are coming. É Can you imagine a PB of non-volatile storage at nearly RAM speed and latency? Riedy, Bader— Graph Analysis CMG, Nov 2014 36 / 38
  • 37. STINGER: Where do you get it? www.cc.gatech.edu/stinger/ Gateway to É code, É development, É documentation, É presentations... Remember: Still academic code, but maturing. Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi Riedy, Bader— Graph Analysis CMG, Nov 2014 37 / 38
  • 38. Acknowledgment of support Riedy, Bader— Graph Analysis CMG, Nov 2014 38 / 38