SlideShare une entreprise Scribd logo
1  sur  14
Sector vs. Hadoop A Brief Comparison Between the Two Systems
Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.

Contenu connexe

Tendances

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemNilaNila16
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemBhavesh Padharia
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemsrikanthhadoop
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 

Tendances (18)

Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop
HadoopHadoop
Hadoop
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
hadoop
hadoophadoop
hadoop
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
HADOOP
HADOOPHADOOP
HADOOP
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hdfs
HdfsHdfs
Hdfs
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Cppt
CpptCppt
Cppt
 

Similaire à A Brief Comparison of Sector and Hadoop Distributed Systems

Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing datapreetik9044
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET Journal
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nageSantosh Nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSRaheemUnnisa1
 

Similaire à A Brief Comparison of Sector and Hadoop Distributed Systems (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hdfs
HdfsHdfs
Hdfs
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
 
HDFS
HDFSHDFS
HDFS
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 

Dernier

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

A Brief Comparison of Sector and Hadoop Distributed Systems

  • 1. Sector vs. Hadoop A Brief Comparison Between the Two Systems
  • 2. Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
  • 3. Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
  • 4. History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
  • 5. Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
  • 6. Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
  • 7. Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
  • 8. Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
  • 9. Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
  • 10. In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
  • 11. Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
  • 12. Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
  • 13. Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
  • 14. Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.