Hadoop and object stores – can we do it better?
Gil Vernik, Trent Gray-Donald
IBM
Speakers..
§ Gil Vernik
- IBM Research from 2010
- Architect, 25+ years of development experience
- Active in open source
- Recent interest: Big Data engines and object stores
§ Trent
- IBM Distinguished Engineer
- Architect on Watson Data Platform
- Historically worked on the IBM Java VM
Twitter: @vernikgil
Agenda
§ Storage for unstructured data
§ Introduction to object storage – why it is needed and what it is
§ HDFS and object stores – differences
§ Real world usage
- How Hadoop accesses object stores
- Understanding the issues
- An alternative approach
§ SETI usage
Storage for unstructured data
§ HDFS or a similar distributed file system
§ Object storage
- On premise, cloud based, hybrid, etc.
- IBM Cloud Object Storage
- Amazon S3
- OpenStack Swift, Azure Blob Storage, etc.
§ NoSQL databases / key-value stores
(Diagram: raw data is ingested into unstructured data storage and later read with a schema.)
HDFS - Summary
§ Hadoop Distributed File System (distributed, and Hadoop-native).
§ Stores large amounts of unstructured data in arbitrary formats.
§ Default internal block size is large - usually 64 MB or 128 MB, depending on the Hadoop version.
§ Blocks are replicated.
§ Write once – read many (append allowed)
§ (Often) collocated with compute capacity.
§ Need an HDFS client to work with HDFS.
§ Hadoop FS shell is widely used with HDFS.
What is an object store?
§ An object store is well suited to storing files, which it treats as data objects.
§ Each data object carries the data itself plus rich metadata.
§ Capable of storing huge amounts of unstructured data.
§ On premise, cloud based, hybrid, etc.
Object storage
Good things about object stores
§ Resilient store: data will not be lost.
§ Fault tolerant: object stores are designed to keep operating during failures.
§ Various security models – data is safe.
§ Can be easily accessed for write or read flows.
§ (effectively) infinitely scalable – EB and beyond.
§ Low cost, long term storage solution.
Organize data in the object store
§ Data objects are organized inside buckets (s3) or containers (Swift).
§ A data object's name may contain delimiters, usually “/”.
§ Conceptual grouping via these delimiters allows hierarchical organization, analogous to directories in a
file system but without the overhead or scalability limits of many directories (a small Spark sketch follows the example paths below).
mytalks/year=2016/month=5/day=24/data-palooza.pdf
mytalks/year=2017/month=5/day=24/hadoop-strata.pdf
mytalks/year=2017/month=6/day=07/spark-summit.pdf
(Slide diagram labels: bucket, data object.)
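A minimal Spark sketch of that idea (the bucket name and the s3a scheme are assumptions, not from the slide): because the keys share the mytalks/year=2017/ prefix, a job can select only the 2017 talks even though no real directories exist.

// Read only the objects whose names start with mytalks/year=2017/
val talks2017 = sc.binaryFiles("s3a://mybucket/mytalks/year=2017/*")
talks2017.keys.collect().foreach(println)   // prints the matching object names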
Object storage is not a file system
§ Write once – no append in place
§ Usually eventually consistent
§ Accessed via RESTful API, SDKs available for many languages.
§ Each data object has a unique URI.
§ Rename in an object store is not an atomic operation (unlike on file systems).
- Rename = GET and PUT/COPY and DELETE (a rename sketch appears at the end of this slide).
§ Object creation is atomic.
- Writing a file is not.
§ Examples
- Store raw data for archive, raw IoT sensor data.
- Export old data from database and store it as objects.
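To make the rename cost concrete, here is a minimal sketch of how a "rename" has to be emulated against an S3-style API, using the AWS Java SDK from Scala; the bucket and key names here are made up for illustration:

import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()
val (bucket, oldKey, newKey) = ("mybucket", "result/_temporary/part-00001", "result/part-00001")
s3.copyObject(bucket, oldKey, bucket, newKey)  // server-side copy of the entire object
s3.deleteObject(bucket, oldKey)                // then delete the original
// Two separate requests; a failure in between can leave the object visible under both names.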
The usual dilemma
(Diagram: Workers 1-3 with collocated HDFS on one side; Workers 1-3 reading from a separate object storage tier on the other.)
Data locality (HDFS collocated with compute):
• Impossible to scale storage without scaling compute.
• Difficult to share HDFS data more globally.
No data locality (object storage separate from compute):
• Storage can be scaled independently from compute.
• Data is easily shared and can be accessed from different locations.
The usual dilemma
(Same diagram as the previous slide, now annotated with cost and versatility.)
Data locality (HDFS collocated with compute):
• Impossible to scale storage without scaling compute.
• Difficult to share HDFS data more globally.
• Higher cost, less versatile, potentially faster.
No data locality (object storage separate from compute):
• Storage can be scaled independently from compute.
• Data is easily shared and can be accessed from different locations.
• Lower cost, more versatile, fast enough.
Choose your storage
(Three diagram slides pairing Big Data engines with different storage back ends.)
Hadoop ecosystem
§ The Hadoop FileSystem interface is the common way engines interact with underlying storage (illustrated just after this slide)
§ Hadoop ships with various storage connectors that implement the FileSystem interface
§ Many Big Data engines rely on these Hadoop storage connectors
Object Storage ( S3 API, Swift API, Azure API)
HDFS
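As a rough illustration (not from the slides): the URI scheme in a path selects which FileSystem connector handles it, so the same code can target HDFS or an object store; credentials and endpoint configuration are omitted here.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
val hdfs = FileSystem.get(new URI("hdfs://myhdfs/"), conf)   // resolves to the HDFS connector
val s3a  = FileSystem.get(new URI("s3a://mybucket/"), conf)  // resolves to the s3a object store connector
println(hdfs.getClass.getName)
println(s3a.getClass.getName)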
Apache Spark
§ Apache Spark is a fast and general engine for large-scale data processing
§ Written in Scala, with APIs in Scala, Python, Java, and R
§ Very active Big Data project
§ Apache Spark combines Spark SQL, streaming, machine learning, graph processing and complex
analytics (MapReduce plus) in a single engine and is able to optimize programs across all of these
paradigms
§ Spark can handle multiple object stores as a data source
§ Spark depends on Hadoop connectors to interact with objects
Example: persist collection as an object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3a://mybucket/data.txt")

REST calls against S3 by API verb (GET, HEAD, PUT, DELETE) for the Hadoop s3a connector: revealed on the next slide.
Example: persist collection as an object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3a://mybucket/data.txt")

API              GET   HEAD   PUT   DELETE
Hadoop (s3a) S3  158   361    26    16
The deep dive into the numbers
§ What is wrong?
- We observed that some Hadoop components are highly inefficient when working with object stores
- Two major reasons
- The existing algorithms Hadoop uses to persist distributed data sets are not optimized for object stores.
- The cost of supporting FS shell operations and of treating the object store as a file system. This has a
negative effect on the Hadoop connectors.
We can make it much better
We did it better
It doesn’t have to be like this
Fault tolerance algorithms in the write flows
§ Output committers are the Hadoop components responsible for persisting the data sets generated by
MapReduce jobs. They are designed to be fault tolerant.
..result/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
..result/_temporary/0/task_201702221313_0000_m_000000/part-00000
..result/part-00001
(Diagram: an input data set runs through wordcount and the result is persisted as the object “result”, producing the temporary paths above.)
Output committers and object stores
§ Output committers use temporary files and directories for every write operation and then rename them.
§ These temporary files are how the committers achieve fault tolerance of the write flows. Hadoop ships
FileOutputCommitter versions 1 and 2 (a configuration sketch follows this slide).
§ File systems support atomic rename, which fits this paradigm perfectly.
§ Object stores do not support rename natively; copy and delete are used instead.
..result/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
..result/_temporary/0/task_201702221313_0000_m_000000/part-00000
..result/part-00001
This leads to dozens of expensive requests targeted to the object
store
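As a hedged aside, the committer algorithm is selected through Hadoop's mapreduce.fileoutputcommitter.algorithm.version property; version 2 reduces, but does not eliminate, the rename work (this is the "CV2" variant compared later in the deck).

// Ask Hadoop's FileOutputCommitter to use algorithm version 2 from a Spark job
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")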
Hadoop FS shell operations and Hadoop connectors
§ To be 100% compliant with the Hadoop ecosystem, every Hadoop connector must support the FS shell
operations.
§ FS shell operations are frequently used with HDFS
§ FS shell operations are not object store friendly
- not native object store operations: operations on files/directories such as copy, rename, etc.
- not optimized for object stores: uploading an object first creates a temporary object and then renames it to
the final name
§ Object store vendors provide CLI tools that are preferable to the Hadoop FS shell commands.
./bin/hadoop fs -mkdir -p hdfs://myhdfs/a/b/c/
./bin/hadoop fs -put mydata.txt hdfs://myhdfs/a/b/c/data.txt
Hadoop FS shell operations and analytic flows
§ The code that enables the FS shell indirectly hurts the analytic flows in the Hadoop connectors by
performing operations that are not inherent to those flows
- recursive directory creation (empty objects), checking whether a directory exists, etc.
- supporting move, rename, recursive listing of directories, etc.
§ Analytic flows such as Spark or MapReduce do not directly need these FS shell operations
Hadoop FS shell operations and analytic flows
§ What do analytic flows need?
- Object listing
- Create new objects (object name may contain “/” to indicate pseudo-directory)
- Read objects
- Get data partitions (data partition is the unit of data parallelism for each MapReduce task)
- Delete
Analytic flows need only a small subset of the FileSystem functionality (sketched below)
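A sketch only, and not Stocator's actual API: the narrow surface an analytics-oriented connector needs could look roughly like the following trait (all names are ours, for illustration).

import java.io.{InputStream, OutputStream}

trait AnalyticObjectStore {
  def list(prefix: String): Seq[String]           // object listing under a key prefix
  def create(key: String): OutputStream           // create a new object; "/" in the key only implies hierarchy
  def open(key: String): InputStream              // read an object
  def partitions(key: String, splitSize: Long): Seq[(Long, Long)] // (offset, length) splits, one per task
  def delete(key: String): Boolean
}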
Why does supporting FS shell affect analytic flows?
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-0001
……
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000008_8/part-0008
(Diagram: nine partitions, 0-8, of a distributed data set are persisted in parallel as a single object “data.txt”, producing the temporary paths above.)
Why does supporting FS shell affect analytic flows?
Operation (SD = Spark Driver, SE = Spark Executor) and the file it touches:
1. Spark Driver: make directories recursively
   ..data.txt/_temporary/0
2. Spark Executor: make directories recursively
   ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
3. (SE): write task temporary object
   ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-00001
4. (SE): list directory
   ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
5. (SE): rename task temporary object to job temporary object
   ..data.txt/_temporary/0/task_201702221313_0000_m_000001/part-00001
6. (SD): list job temporary directories recursively
   ..data.txt/_temporary/0/task_201702221313_0000_m_000001
7. (SD): rename job temporary object to final name
   ..data.txt/part-00001
8. (SD): write SUCCESS object
   ..data.txt/_SUCCESS
Certain Hadoop components are designed to work with
file systems, not with object stores
An opinionated object store connector for
Spark can provide significant gains
Stocator – the next-gen object store connector
§ An advanced connector designed for object stores. It doesn't create temporary files and directories for write
operations and still provides fault tolerance, including speculative execution.
§ It doesn't go through the Hadoop connector modules; it interacts with the object store directly. This makes
Stocator significantly faster for write flows and generates far fewer REST calls.
§ Supports analytic flows rather than shell commands
§ Implements the Hadoop FileSystem interface.
§ No need to modify Spark or Hadoop (see the wiring sketch below)
§ Stocator doesn’t need local HDFS
Stocator adapted for analytic flows
https://github.com/SparkTC/stocator
Released under Apache License 2.0
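A hedged wiring sketch: because Stocator implements the Hadoop FileSystem interface, it can be mapped to a URI scheme through the standard fs.<scheme>.impl Hadoop property with no changes to Spark or Hadoop. The scheme and class name below follow the Stocator README of this period; check them against the version you use, and note that credential and endpoint properties are omitted.

// Map the s3d:// scheme to the Stocator connector (credentials and endpoint settings omitted)
sc.hadoopConfiguration.set("fs.s3d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
// From here on, paths such as s3d://mybucket/data.txt resolve to Stocator.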
Hadoop and objects
Object Storage ( S3 API, Swift API, Azure API)
HDFS
Where to find Stocator
§ IBM Cloud Object Storage
§ Based on the open source stocator
§ Bluemix Spark as a Service
§ IBM Data Science Experience
§ Open source - https://github.com/SparkTC/stocator (see the dependency sketch below)
- Stocator-core module, stocator-openstack-swift connector
- Apache License 2.0
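If you build against the open-source module yourself, it is published to Maven Central under the com.ibm.stocator group per the project README; an illustrative sbt line follows (the version string is a placeholder, check the repository for a current release):

libraryDependencies += "com.ibm.stocator" % "stocator" % "<version>"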
Example: persist collection as object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3d://mybucket/data.txt")

API              GET   HEAD   PUT   DELETE
Stocator S3        1     2    11     0
Hadoop (s3a) S3  158   361    26    16
Compare performance of Stocator
(Bar chart: run time in seconds, 0 to 800, for Teragen, Copy, Terasort, Wordcount, Read (50GB), Read (500GB), and TPC-DS, comparing Stocator, Hadoop Swift, and S3a. Stocator speedups per workload: 18x, 10x, 9x, 2x, 1x, 1x, 1x**.)
** Comparing Stocator to S3a
* 40Gbps in accesser tier
§ Stocator is much faster for write-intensive workloads
§ Stocator is as good for read-intensive workloads
S3a connector is improving*
(Bar chart: run time in seconds, 0 to 800, for Teragen, Copy, Terasort, Wordcount, Read (50 GB), Read (500 GB), and TPC-DS, comparing Stocator, S3a, S3a CV2, and S3a CV2+FU. Stocator speedups per workload: 1.5x, 1.3x, 1.3x, 1.1x, 1x, 1x, 1x**.)
** Comparing Stocator to S3a with CV2 and FU
§ File Output Committer Algorithm 2 halves the number of renames (CV2)
§ Fast Upload introduces streaming on output (FU)
§ Stocator is still faster for write-intensive workloads and as good for read-intensive ones
Compare number of REST operations*
(Bar chart: number of RESTful operations, 0 to 45,000, for Teragen, Copy, Terasort, Wordcount, Read (50GB), Read (500GB), and TPC-DS, comparing Stocator, Hadoop Swift, and S3a. Reduction with Stocator per workload: 21x, 16x, 15x, 16x, 2x, 2x, 2x**.)
** Comparing Stocator to S3a with CV2 and FU
* 40Gbps in accesser tier
§ Stocator issues many fewer REST operations
§ Fewer operations mean
• lower overhead
• lower cost
IBM Spark@SETI
§ The SETI Institute: headquartered in Mountain View, CA; founded 1984; 150 scientists, researchers and staff.
§ The mission of the SETI Institute is to explore the potential for extra-terrestrial life….
§ Allen Telescope Array (ATA): 42 receiving dishes, each 6 m in diameter, observing 1 GHz to 10 GHz
(Photo: The Allen Telescope Array)
The Spark@SETI Project – By the Numbers
§ 200 million signal events
§ 14 million complex amplitude files in Object Store
- Signal of interest
- Each binary file contains a 90 second ‘snapshot’ of raw antenna voltages
- 14M files = 1TB of raw signal data; feature extraction for clustering ~12 hours
§ Long duration observations = 2 beams @ 2.5TB each
- Wide-band analysis…. 5TB processed for wideband detection in approximately 13.5 hours
wall time.
Visit our joint talk with SETI at Spark Summit San Francisco, Wednesday, June 7 5:40 PM – 6:10 PM
“Very large data files, object stores, and deep learning –
lessons learned while looking for signs of extra-terrestrial life”
Lessons learned
Object storage provides a good alternative to HDFS
The existing Hadoop ecosystem doesn't work efficiently with object stores
There is nothing fundamentally wrong with object stores; the inefficiency comes from software
components that are not adapted to them
We demonstrated Stocator – an object store connector
Gil Vernik (gilv@il.ibm.com), Trent Gray-Donald (trent@ca.ibm.com)
