Big Data for Quality Engineers
Ahmed Misbah
Agenda
• Introduction to Big Data
– Problem with traditional Large Scale Systems
– Requirements for the new approach
– Hadoop’s Approach
– Batch Processing and Stream Processing
• Big Data Technologies
– Batch Processing Technologies
– Stream Processing Technologies
• Testing Big Data Solutions
Rules
• Phones silent
• No laptops
• Questions/Discussions welcome at any time
• 10-minute break every hour
INTRODUCTION TO BIG DATA
PROBLEMS WITH TRADITIONAL
LARGE SCALE SYSTEMS
Traditional Large Scale Computing
• Traditionally, computation has been
processor-bound:
– Small amounts of data
– Lots of complex processing
• Early solution: Bigger computers!!
– Faster processor(s)
– More memory
Distributed Systems (1/3)
• More computers instead of bigger computers
• Distributed systems evolved
• Use multiple machines for a single job
Distributed Systems (2/3)
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t
be trying for bigger computers, but for more
systems of computers”
Grace Hopper
Distributed Systems (3/3)
Problems with Distributed Systems
(1/2)
• Programming for traditional distributed
systems is complex:
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
Problems with Distributed Systems
(2/2)
“Failure is the defining difference between
distributed and local programming, so you
have to design distributed systems with the
expectation of failure”
Ken Arnold, CORBA Designer
The Data Bottleneck (1/4)
• Moore’s Law has held firm for over 40 years:
– Processing power has roughly doubled every two years
– Processing speed is no longer the problem
• Getting the data to the processor becomes the
bottleneck
The Data Bottleneck (2/4)
• Example:
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the
processor ≈ 22 minutes
– Actual time will be worse since most servers have
less than 100GB of RAM
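
The arithmetic behind this example can be checked with a short sketch. It assumes 1 GB = 1000 MB, and the ten-disk case is an illustrative assumption that is not on the slide: it shows how the time falls if the same data is spread over disks that are read in parallel.

```python
def transfer_minutes(data_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read data_gb gigabytes at rate_mb_per_s (1 GB = 1000 MB assumed)."""
    seconds = (data_gb * 1000) / rate_mb_per_s
    return seconds / 60


# Figures from the slide: 100 GB read from one disk at 75 MB/sec.
single_disk = transfer_minutes(100, 75)

# Assumption for illustration: the same 100 GB split evenly over 10 disks read in parallel.
ten_disks = transfer_minutes(100 / 10, 75)

print(f"1 disk: {single_disk:.1f} min, 10 disks in parallel: {ten_disks:.1f} min")
# prints "1 disk: 22.2 min, 10 disks in parallel: 2.2 min"
```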
The Data Bottleneck (3/4)
• Typically, data is stored in a central location
• Data is copied to the processors at runtime
• Acceptable for limited amounts of data
The Data Bottleneck (4/4)
• Modern systems have much more data
– Terabytes/day
– Petabytes/year
• A new approach is required
REQUIREMENTS FOR THE NEW
APPROACH
Requirements for the new approach
(1/2)
• Partial failure support:
– Failure of a component should result in a graceful
degradation of the application performance
– It should not lead to a complete failure of the entire
system
• Data recoverability:
– If a component of the system fails, its workload should
be assumed by still-functioning units in the system
• Component recovery:
– If a component fails then recovers, it should be able to
rejoin the system without requiring full system restart
Requirements for the new approach
(2/2)
• Consistency:
– Component failures during execution of a job
should not affect the outcome of the job
• Scalability:
– Adding load to the system should result in graceful
degradation in performance and not the failure of
the entire system
– Increasing resources should support proportional
increase in load capacity
HADOOP’S APPROACH
A new approach to distributed
computing!
• Distribute data when the data is being stored
• Run computation where the data is stored
Core Concept (1/3)
• Distribute the data as it is initially stored in the
system
• Individual nodes can work on the data local to
those nodes
• No data transfer over the network is required
for initial processing
Core Concept (2/3)
• Applications are written in high-level code
• Developers need not worry about network
programming or low-level infrastructure
• Nodes talk to each other as little as possible
Core Concept (3/3)
• Data is spread among machines in advance
• Computation happens where the data is
stored
• Data is replicated multiple times on the
system for increased availability and reliability
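
A minimal sketch of why replication improves availability. The round-robin placement, the node names, and the helper functions are assumptions made for illustration; real HDFS placement also takes racks into account, which is not modelled here.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a node that has not failed."""
    return [b for b, replicas in placement.items()
            if any(node not in failed for node in replicas)]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)
survivors = readable_blocks(placement, failed={"node1", "node2"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

With three replicas on distinct nodes, losing any two nodes still leaves at least one copy of every block.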
Fault Tolerance
• If a node fails, the master will detect the failure
and re-assign the work to a different node on the
system
• Restarting a task does not require
communication with nodes working on other
portions of the data
• If a failed node restarts, it is automatically added
back to the system and assigned a new task
• If a node appears to be running slowly, the
master can redundantly execute another instance
of the same task
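
The re-assignment behaviour can be sketched as a toy scheduler. This is a simulation under stated assumptions, not Hadoop's scheduler: `run_job`, the task names, and the node names are hypothetical, and a failure is modelled simply as membership in the `failing` set.

```python
from collections import deque


def run_job(tasks, workers, failing):
    """Run every task once; work given to a failing worker is re-queued for another node."""
    pending = deque(tasks)
    live = list(workers)
    results, turn = {}, 0
    while pending:
        if not live:
            raise RuntimeError("no live workers left")
        task = pending.popleft()
        worker = live[turn % len(live)]
        turn += 1
        if worker in failing:
            live.remove(worker)   # the master detects the failure
            pending.append(task)  # only this task is re-assigned; other tasks are untouched
            continue
        results[task] = worker    # task completed on this worker
    return results


results = run_job(tasks=["split-0", "split-1", "split-2", "split-3"],
                  workers=["node1", "node2", "node3"],
                  failing={"node2"})
print(results)
```

Every task finishes even though one node fails, and no task result depends on the failed node.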
BATCH PROCESSING VS STREAM
PROCESSING
Batch Processing
• Also known as History-based processing
• Processing is executed against large data
already stored in some storage medium (e.g.
HDFS or S3)
Stream Processing
• Processing is executed against small batches of data
arriving continuously from a stream
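
A minimal sketch of processing a continuous stream in small batches. The generator `sensor_readings` is a hypothetical stand-in for a real source, and batches are cut by record count for simplicity; Spark Streaming, for example, cuts batches by time interval instead.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from a possibly endless iterator."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def sensor_readings():
    """Hypothetical endless source: emits 0, 1, 2, ..."""
    n = 0
    while True:
        yield n
        n += 1


# Process only the first three batches of the endless stream.
batch_sums = [sum(batch) for batch in islice(micro_batches(sensor_readings(), 4), 3)]
print(batch_sums)  # prints [6, 22, 38]
```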
BIG DATA TECHNOLOGIES
Batch Processing Technologies (1/2)
• Hadoop
Batch Processing Technologies (2/2)
• Spark
Stream Processing Technologies (1/2)
• Spark Streaming
Stream Processing Technologies (2/2)
• Apache Storm
• Apache Flink
Supporting Technologies
• Apache Kafka
• Akka
TESTING BIG DATA TECHNOLOGIES
Hadoop MapReduce (1/3)
• LocalJobRunner:
– Does not require any Hadoop daemons to be
running
– Uses the local file system instead of HDFS
• MRUnit:
– Built on top of JUnit
– Works with Mockito Framework to provide
required mock objects
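
The idea behind both tools is to exercise map and reduce logic without a running cluster. The sketch below shows that idea in plain Python; it does not use the MRUnit or Hadoop APIs, and `map_words`, `reduce_counts`, and `run_locally` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def run_locally(lines):
    """Tiny in-process map -> shuffle -> reduce pipeline; no cluster involved."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Unit-level checks: mapper and reducer are tested in isolation.
assert map_words("Big data") == [("big", 1), ("data", 1)]
assert reduce_counts("big", [1, 1, 1]) == ("big", 3)

# Job-level check: the whole pipeline runs on a small in-memory input.
word_counts = run_locally(["big data", "Big tests"])
print(word_counts)
```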
Hadoop MapReduce (2/3)
• Apache Hue:
– Is an open source Web interface for analyzing data
with Apache Hadoop
Hadoop MapReduce (3/3)
• MapReduce Job Tracker Web Interface
Apache Spark (1/3)
• Run locally using Eclipse or IntelliJ
• Run using Spark Standalone
• Spark Testing Base:
– For implementing unit tests for Spark code
• Spark Validator:
– A library you can include in your Spark job to validate
the counters and perform operations on success
Apache Spark (2/3)
• Spark UI and History Server:
Apache Spark (3/3)
• Apache Zeppelin (using Sparklet on
Windows)
Performance Testing Tools
• Gatling
• Yahoo Cloud Serving Benchmark (YCSB)
• Jumbune
• Netflix Inviso
• TestDFSIO
• TeraSort
• NNBench
• MRbench
• BigBench
More tools
• https://github.com/Intel-bigdata/HiBench
• https://github.com/yahoo/streaming-benchmarks
• https://github.com/tdas/spark-streaming-benchmark
• https://github.com/BBVA/spark-benchmarks
• https://github.com/databricks/spark-perf
Monitoring Tools
• https://ambari.apache.org/
• https://github.com/groupon/sparklint
• https://github.com/linkedin/dr-elephant
• https://github.com/ibm-research-ireland/sparkoscope
• https://supergloo.com/spark-monitoring/spark-performance-monitoring-tools/
Important Considerations
• Number of clusters/nodes
• Hardware Specifications (HDD or SSD)
• Application/Environment Configurations (no.
of cores, no. of partitions, no. of threads,
disk/memory persistence, etc.)
• Data format (Text, Sequence, Avro, etc.)
• Data size
• Compression (Snappy, Gzip, etc.)
• Number of Reducers (MapReduce)
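
One way to turn these considerations into a test plan is to enumerate the combinations to benchmark. The sketch below is only an illustration: the dimension names and values are hypothetical and should be replaced with the settings relevant to the system under test.

```python
from itertools import product

# Hypothetical values for three of the considerations listed above.
dimensions = {
    "nodes": [3, 6],
    "data_format": ["text", "sequence", "avro"],
    "compression": ["none", "snappy", "gzip"],
}

# One dictionary per configuration to run, covering every combination.
test_matrix = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
print(len(test_matrix), "configurations, e.g.", test_matrix[0])
```

The matrix grows multiplicatively with each added dimension, so in practice it is often trimmed to the combinations that matter most.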
Spark Best Practices
• https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
• https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel
• https://databricks.com/glossary/spark-tuning
• https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
• https://www.bi4all.pt/en/news/en-blog/apache-spark-best-practices/
Sampling
• Sampling is defined as: “the act, process, or
technique of selecting a representative part of
a population for the purpose of determining
parameters or characteristics of the whole
population” - Merriam-Webster dictionary
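
When the data set is too large to load at once, a uniform sample can still be drawn in a single pass. The sketch below uses reservoir sampling (Algorithm R), a technique chosen here for illustration and not named on the slide; `reservoir_sample` and its `seed` parameter are hypothetical names.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)   # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = record  # keep the new record with probability k / (i + 1)
    return sample


# range() stands in for a large data set that is read one record at a time.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

Fixing the seed makes the sample reproducible, which helps when a failing test has to be re-run on the same records.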
Useful Resources (1/3)
• Benchmarking Hadoop and HBase on Violin
• Benchmarking Cassandra on Violin
• http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/
Useful Resources (2/3)
• http://blog.cloudera.com/blog/2015/08/ycsb-the-open-standard-for-nosql-benchmarking-joins-cloudera-labs/
• https://discuss.zendesk.com/hc/en-us/articles/200864057-Running-DFSIO-MapReduce-benchmark-test
• http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
Useful Resources (3/3)
• http://bdaafall2015.readthedocs.io/en/latest/nnbench.html
• http://bdaafall2015.readthedocs.io/en/latest/mrbench.html
Thank You!
