Sector & Sphere: Distributed Data Storage and Parallel Processing Engine
Yunhong Gu, Univ. of Illinois at Chicago
What is Sector/Sphere?
Overview
Motivation: Super-computer model: expensive, data I/O bottleneck. Sector/Sphere model: inexpensive, parallel data I/O, data locality.
Motivation: Parallel/distributed programming with MPI, etc.: flexible and powerful, but too complicated. Sector/Sphere model (cloud model): the cluster appears as a single unit to the developer, with a simplified programming interface, but it is limited to certain data-parallel applications.
Motivation: Systems for single data centers: require additional effort to locate and move data. Sector/Sphere model: supports wide-area data collection and distribution.
Sector Distributed File System [architecture diagram]: Security Server (user accounts, data protection, system security); Masters (metadata, scheduling, service provider); Slaves (storage and processing); Clients (system access tools, application programming interfaces). Control connections are labeled SSL; data moves over UDT, with encryption optional.
Sector Distributed File System
Sector: Performance
UDT: UDP-based Data Transfer
Sector: Fault Tolerance
Sector: Security
Sector: Tools and API
Sphere: Simplified Data Processing
Sphere: Simplified Data Processing
Serial version:
    for each file F in (SDSS datasets)
        for each image I in F
            findBrownDwarf(I, …);
Sphere version:
    SphereStream sdss;
    sdss.init("sdss files");
    SphereProcess myproc;
    myproc.run(sdss, "findBrownDwarf", …);
    myproc.read(result);
User-defined function:
    findBrownDwarf(char* image, int isize, char* result, int rsize);
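The nested loop on this slide can be tried locally with a small stand-in driver. The sketch below is not the Sphere API: `applyToAll`, `countBrightPixels`, and the `int` return type (bytes written to `result`) are hypothetical, and only the four-parameter argument list is taken from the slide.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical UDF using the parameter list shown on the slide.
// Assumption: it returns the number of bytes written to `result`.
int countBrightPixels(char* image, int isize, char* result, int rsize) {
    assert(image != nullptr && result != nullptr);
    assert(rsize >= static_cast<int>(sizeof(int)));
    int bright = 0;
    for (int i = 0; i < isize; ++i) {
        if (static_cast<unsigned char>(image[i]) > 200) ++bright;
    }
    std::memcpy(result, &bright, sizeof(bright));
    return static_cast<int>(sizeof(bright));
}

using Udf = int (*)(char*, int, char*, int);

// Local stand-in for "for each file F, for each image I in F":
// applies the UDF to every record and collects one int per record.
std::vector<int> applyToAll(std::vector<std::vector<std::string>>& files, Udf udf) {
    std::vector<int> results;
    for (auto& file : files) {
        for (auto& image : file) {
            char buf[sizeof(int)];
            const int written = udf(&image[0], static_cast<int>(image.size()),
                                    buf, static_cast<int>(sizeof(buf)));
            assert(written == static_cast<int>(sizeof(int)));
            int value = 0;
            std::memcpy(&value, buf, sizeof(value));
            results.push_back(value);
        }
    }
    return results;
}
```

Calling `applyToAll(files, countBrightPixels)` runs the UDF serially over every record. In Sphere the same per-record function would instead be dispatched to the nodes that hold the data.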
Sphere: Data Movement
Sphere/UDF vs. MapReduce
Sphere/UDF vs. MapReduce
Why Doesn't Sector Split Files?
Load Balance
Fault Tolerance
Open Cloud Testbed
Open Cloud Testbed
The TeraSort Benchmark
TeraSort: Each 100-byte record has a 10-byte key and a 90-byte value. Stage 1: hash each record on the first 10 bits of its key into one of 1024 buckets (Bucket-0 to Bucket-1023). Stage 2: sort each bucket on its local node.
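The two stages can be sketched on a single machine as follows. This is a minimal illustration, not Sphere code: the names `teraBucket` and `bucketAndSort` are hypothetical, and it assumes "the first 10 bits" means all 8 bits of key byte 0 followed by the top 2 bits of key byte 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kKeySize = 10;      // bytes of key per record
constexpr std::size_t kRecordSize = 100;  // key + 90-byte value
constexpr std::size_t kBuckets = 1024;    // 2^10 buckets

// Stage 1: bucket index taken from the first 10 bits of the key
// (assumed: byte 0 in full, then the two most significant bits of byte 1).
std::uint32_t teraBucket(const std::string& record) {
    assert(record.size() == kRecordSize);
    const auto b0 = static_cast<unsigned char>(record[0]);
    const auto b1 = static_cast<unsigned char>(record[1]);
    return (static_cast<std::uint32_t>(b0) << 2) | static_cast<std::uint32_t>(b1 >> 6);
}

// Stage 1 + Stage 2: distribute records into buckets, then sort each
// bucket by its 10-byte key (std::string compares bytes as unsigned).
std::vector<std::vector<std::string>> bucketAndSort(const std::vector<std::string>& records) {
    std::vector<std::vector<std::string>> buckets(kBuckets);
    for (const auto& r : records) {
        buckets[teraBucket(r)].push_back(r);
    }
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end(), [](const std::string& x, const std::string& y) {
            return x.compare(0, kKeySize, y, 0, kKeySize) < 0;
        });
    }
    return buckets;
}
```

Because the bucket index is a prefix of the key, reading the sorted buckets in order from 0 to 1023 yields a globally sorted sequence, so no merge step is needed after stage 2.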
Performance Results: TeraSort (run time in seconds; Sector v1.16 vs Hadoop 0.17)
Sites                             Data size   Sphere   Hadoop (3 replicas)   Hadoop (1 replica)
UIC                               300GB       1265     2889                  2252
UIC + StarLight                   600GB       1361     2896                  2617
UIC + StarLight + Calit2          900GB       1430     4341                  3069
UIC + StarLight + Calit2 + JHU    1.2TB       1526     6675                  3702
Performance Results: TeraSort
The MalStone Benchmark: http://code.google.com/p/malgen/
MalStone: Text record format: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID
Example record: 00000000005000000043852268954353585368|2008-11-08 17:56:52.422640|3857268954353628599|1|000000497829
Transform: each text record becomes a key (Site ID) and a value (Time, Flag).
Stage 1: process each record and hash it into buckets (site-000X to site-999X) according to a 3-byte site ID prefix (000-999).
Stage 2: compute the infection rate for each merchant.
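A single-machine sketch of this flow is shown below. It rests on assumptions that go beyond the slide: the bucket is taken to be the first three digits of the site ID, and the "infection rate" is simplified to the fraction of a site's records whose compromise flag is 1, which is not the full MalStone definition. The names `parseEvent`, `siteBucket`, and `flaggedRatePerSite` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Event {
    std::string eventId, timestamp, siteId;
    bool compromised;
    std::string entityId;
};

// Parses "Event ID|Timestamp|Site ID|Compromise Flag|Entity ID".
// Returns false unless there are exactly 5 fields and the flag is 0 or 1.
bool parseEvent(const std::string& line, Event& out) {
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '|')) f.push_back(field);
    if (f.size() != 5 || (f[3] != "0" && f[3] != "1")) return false;
    out = Event{f[0], f[1], f[2], f[3] == "1", f[4]};
    return true;
}

// Assumed bucketing rule: first three digits of the site ID (000-999).
// Returns -1 if the ID is too short or does not start with digits.
int siteBucket(const std::string& siteId) {
    if (siteId.size() < 3) return -1;
    int b = 0;
    for (int i = 0; i < 3; ++i) {
        if (siteId[i] < '0' || siteId[i] > '9') return -1;
        b = b * 10 + (siteId[i] - '0');
    }
    return b;
}

// Stage 1: group (flagged, total) counts by bucket, then by site.
// Stage 2: per site, flagged / total (a simplified rate, see lead-in).
std::map<std::string, double> flaggedRatePerSite(const std::vector<std::string>& lines) {
    std::map<int, std::map<std::string, std::pair<long, long>>> buckets;
    for (const auto& line : lines) {
        Event e;
        if (!parseEvent(line, e)) continue;  // skip malformed records
        const int b = siteBucket(e.siteId);
        if (b < 0) continue;
        auto& counts = buckets[b][e.siteId];
        if (e.compromised) ++counts.first;
        ++counts.second;
    }
    std::map<std::string, double> rates;
    for (const auto& bucket : buckets) {
        for (const auto& site : bucket.second) {
            rates[site.first] = static_cast<double>(site.second.first) /
                                static_cast<double>(site.second.second);
        }
    }
    return rates;
}
```

Grouping by bucket first mirrors the slide's stage 1: all records for one site land in the same bucket, so stage 2 can compute each site's rate without looking at any other bucket.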
Performance Results: MalStone (processing 10 billion records on 20 local OCT nodes)
System                     MalStone-A   MalStone-B
Hadoop                     454m 13s     840m 50s
Hadoop Streaming/Python    87m 29s      142m 32s
Sector/Sphere              33m 40s      43m 44s
* Courtesy of Collin Bennett and Jonathan Seidman of Open Data Group.
System Monitoring (Testbed)
System Monitoring (Sector/Sphere)
For More Information

Sector Sphere 2009
