This presentation introduces the Big Data topic to Software Quality Assurance Engineers. It can also be useful for Software Developers and other software professionals.
2. Agenda
• Introduction to Big Data
– Problem with traditional Large Scale Systems
– Requirements for the new approach
– Hadoop’s Approach
– Batch Processing and Stream Processing
• Big Data Technologies
– Batch Processing Technologies
– Stream Processing Technologies
• Testing Big Data Solutions
3. Rules
• Phones silent
• No laptops
• Questions/discussions are welcome at any time
• 10-minute break every hour
6. Traditional Large Scale Computing
• Traditionally, computation has been
processor-bound:
– Small amounts of data
– Lots of complex processing
• Early solution: Bigger computers!!
– Faster processor(s)
– More memory
8. Distributed Systems (1/3)
• More computers instead of bigger computers
• Distributed systems evolved
• Use multiple machines for a single job
9. Distributed Systems (2/3)
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
we didn’t try to grow a larger ox. We shouldn’t
be trying for bigger computers, but for more
systems of computers”
Grace Hopper
11. Problems with Distributed Systems
(1/2)
• Programming for traditional distributed
systems is complex:
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
12. Problems with Distributed Systems
(2/2)
“Failure is the defining difference between
distributed and local programming, so you
have to design distributed systems with the
expectation of failure”
Ken Arnold, CORBA Designer
13. The Data Bottleneck (1/4)
• Moore’s Law has held firm for over 40 years:
– Transistor counts (and with them processing
power) double roughly every two years
– Processing speed is no longer the problem
• Getting the data to the processor becomes the
bottleneck
14. The Data Bottleneck (2/4)
• Example:
– Typical disk data transfer rate: 75 MB/sec
– Time taken to transfer 100 GB of data to the
processor ≈ 22 minutes
– Actual time will be worse, since most servers have
less than 100 GB of RAM and cannot hold the data
all at once
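The 22-minute figure above can be checked with simple arithmetic:

```python
# Back-of-envelope check of the transfer-time figure on this slide.
disk_rate_mb_s = 75           # typical disk transfer rate (MB/sec)
data_mb = 100 * 1024          # 100 GB expressed in MB
minutes = data_mb / disk_rate_mb_s / 60
print(round(minutes, 1))      # → 22.8
```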
15. The Data Bottleneck (3/4)
• Typically, data is stored in a central location
• Data is copied to the processors at runtime
• Acceptable for limited amounts of data
16. The Data Bottleneck (4/4)
• Modern systems have much more data
– Terabytes/day
– Petabytes/year
• A new approach is required
18. Requirements for the new approach
(1/2)
• Partial failure support:
– Failure of a component should result in a graceful
degradation of the application performance
– It should not lead to a complete failure of the entire
system
• Data recoverability:
– If a component of the system fails, its workload should
be assumed by still-functioning units in the system
• Component recovery:
– If a component fails then recovers, it should be able to
rejoin the system without requiring full system restart
19. Requirements for the new approach
(2/2)
• Consistency:
– Component failures during execution of a job
should not affect the outcome of the job
• Scalability:
– Adding load to the system should result in graceful
degradation in performance and not the failure of
the entire system
– Increasing resources should support proportional
increase in load capacity
21. A new approach to distributed
computing!
• Distribute data when the data is being stored
• Run computation where the data is stored
22. Core Concept (1/3)
• Distribute the data as it is initially stored in the
system
• Individual nodes can work on the data local to
those nodes
• No data transfer over the network is required
for initial processing
23. Core Concept (2/3)
• Applications are written in high-level code
• Developers need not worry about network
programming or low-level infrastructure
• Nodes talk to each other as little as possible
24. Core Concept (3/3)
• Data is spread among machines in advance
• Computation happens where the data is
stored
• Data is replicated multiple times on the
system for increased availability and reliability
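The replication idea above can be sketched in a few lines. This is a toy illustration, not HDFS's actual placement policy; `place_blocks` and the node names are hypothetical:

```python
from itertools import cycle

# Toy sketch: spread data blocks across nodes with 3x replication,
# assigning replicas round-robin (hypothetical helper, not the HDFS algorithm).
def place_blocks(blocks, nodes, replication=3):
    placement = {}
    ring = cycle(nodes)
    for block in blocks:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

layout = place_blocks(["blk1", "blk2"], ["node1", "node2", "node3", "node4"])
print(layout["blk1"])  # → ['node1', 'node2', 'node3']
```

With three copies of every block, any single node can fail and each block is still readable from two other machines.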
25. Fault Tolerance
• If a node fails, the master will detect the failure
and re-assign the work to a different node on the
system
• Restarting a task does not require
communication with nodes working on other
portions of the data
• If a failed node restarts it is automatically added
back to the system and assigned a new task
• If a node appears to be running slowly, the
master can redundantly execute another instance
of the same task
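The re-assignment behaviour above can be illustrated with a minimal scheduler sketch. All names (`schedule`, `failed`) are hypothetical and this is not Hadoop's actual scheduler:

```python
# Minimal sketch: the master assigns tasks round-robin over healthy nodes,
# so a failed node's work automatically lands on the survivors.
def schedule(tasks, nodes, failed=()):
    healthy = [n for n in nodes if n not in failed]
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

before = schedule(["t1", "t2", "t3"], ["n1", "n2", "n3"])
after = schedule(["t1", "t2", "t3"], ["n1", "n2", "n3"], failed={"n2"})
print(after)  # → {'t1': 'n1', 't2': 'n3', 't3': 'n1'}
```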
28. Batch Processing
• Also known as History-based processing
• Processing is executed against large data
already stored in some storage medium (e.g.
HDFS or S3)
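The classic batch job over stored data is word count, which the MapReduce model splits into map, shuffle, and reduce phases. A plain-Python sketch of those three phases (running on a list instead of HDFS):

```python
from collections import defaultdict

# Map phase: each input line becomes (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group all values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the grouped counts per word.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # → {'big': 3, 'data': 2, 'ideas': 1}
```

In a real cluster each map runs on the node holding its block of input, and only the shuffled pairs cross the network.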
37. Hadoop MapReduce (1/3)
• LocalJobRunner:
– Does not require any Hadoop daemons to be
running
– Uses the local file system instead of HDFS
• MRUnit:
– Built on top of JUnit
– Works with Mockito Framework to provide
required mock objects
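MRUnit's pattern is to drive a mapper with one input record and assert the exact output pairs (its Java API chains `withInput`/`withOutput`). The same idea in plain Python, with a hypothetical `mapper` standing in for the class under test:

```python
# MRUnit-style check in plain Python: one input record in, exact pairs out.
def mapper(line):
    return [(word.lower(), 1) for word in line.split()]

# Equivalent of withInput(...).withOutput(...).runTest() in MRUnit.
assert mapper("Hello World") == [("hello", 1), ("world", 1)]
print("mapper test passed")
```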
38. Hadoop MapReduce (2/3)
• Apache Hue:
– An open-source Web interface for analyzing
data with Apache Hadoop
40. Apache Spark (1/3)
• Run locally using Eclipse or IntelliJ
• Run using Spark Standalone
• Spark Testing Base:
– For implementing unit tests for Spark code
• Spark Validator:
– A library you can include in your Spark job to validate
the counters and perform operations on success
48. Important Considerations
• Number of clusters/nodes
• Hardware Specifications (HDD or SSD)
• Application/Environment Configurations (no.
of cores, no. of partitions, no. of threads,
disk/memory persistence, etc.)
• Data format (Text, Sequence, Avro, etc.)
• Data size
• Compression (Snappy, Gzip, etc.)
• Number of Reducers (MapReduce)
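Several of the knobs above map directly onto job configuration. A hedged fragment with example values only (the right numbers depend entirely on the cluster and workload):

```properties
# Example values only — tune per cluster and workload.
spark.executor.cores          4
spark.sql.shuffle.partitions  200
spark.io.compression.codec    snappy
mapreduce.job.reduces         10
```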
50. Sampling
• Sampling is defined as: “the act, process, or
technique of selecting a representative part of
a population for the purpose of determining
parameters or characteristics of the whole
population” – Merriam-Webster dictionary
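For test data, sampling usually means drawing a fixed-size random subset of the records. A minimal sketch; the 5% ratio and the integer "records" are stand-ins:

```python
import random

random.seed(42)  # fixed seed keeps the sketch repeatable
population = list(range(1000))           # stand-in for a large record set
sample = random.sample(population, 50)   # 5% sample, without replacement
print(len(sample))  # → 50
```

Because `random.sample` draws without replacement, no record appears twice in the subset.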
51. Useful Resources (1/3)
• Benchmarking Hadoop and HBase on Violin
• Benchmarking Cassandra on Violin
• http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/