This presentation introduces the Big Data topic to Software Quality Assurance Engineers. It can also be useful for Software Developers and other software professionals.
2. Agenda
• Introduction to Big Data
– Problem with traditional Large Scale Systems
– Requirements for the new approach
– Hadoop’s Approach
– Batch Processing and Stream Processing
• Big Data Technologies
– Batch Processing Technologies
– Stream Processing Technologies
• Testing Big Data Solutions
3. Rules
• Phones silent
• No laptops
• Questions/discussions are welcome at any time
• 10-minute break every hour
6. Traditional Large Scale Computing
• Traditionally, computation has been
processor-bound:
– Small amounts of data
– Lots of complex processing
• Early solution: Bigger computers!!
– Faster processor(s)
– More memory
8. Distributed Systems (1/3)
• More computers instead of bigger computers
• Distributed systems evolved
• Use multiple machines for a single job
9. Distributed Systems (2/3)
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
we didn’t try to grow a larger ox. We shouldn’t
be trying for bigger computers, but for more
systems of computers”
Grace Hopper
11. Problems with Distributed Systems
(1/2)
• Programming for traditional distributed
systems is complex:
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
12. Problems with Distributed Systems
(2/2)
“Failure is the defining difference between
distributed and local programming, so you
have to design distributed systems with the
expectation of failure”
Ken Arnold, CORBA Designer
13. The Data Bottleneck (1/4)
• Moore’s Law has held firm for over 40 years:
– Transistor counts (and with them processing
power) double roughly every two years
– Processing speed is no longer the problem
• Getting the data to the processor becomes the
bottleneck
14. The Data Bottleneck (2/4)
• Example:
– Typical disk data transfer rate: 75 MB/sec
– Time taken to transfer 100 GB of data to the
processor ≈ 22 minutes
– Actual time will be worse, since most servers have
less than 100 GB of RAM and cannot hold the data
all at once
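The 22-minute figure above can be checked with simple arithmetic:

```python
# Back-of-envelope check of the transfer-time figure on this slide.
disk_rate_mb_s = 75           # typical disk transfer rate (MB/sec)
data_mb = 100 * 1024          # 100 GB expressed in MB
minutes = data_mb / disk_rate_mb_s / 60
print(round(minutes, 1))      # → 22.8
```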
15. The Data Bottleneck (3/4)
• Typically, data is stored in a central location
• Data is copied to the processors at runtime
• Acceptable for limited amounts of data
16. The Data Bottleneck (4/4)
• Modern systems have much more data
– Terabytes/day
– Petabytes/year
• A new approach is required
18. Requirements for the new approach
(1/2)
• Partial failure support:
– Failure of a component should result in a graceful
degradation of the application performance
– It should not lead to a complete failure of the entire
system
• Data recoverability:
– If a component of the system fails, its workload should
be assumed by still-functioning units in the system
• Component recovery:
– If a component fails then recovers, it should be able to
rejoin the system without requiring full system restart
19. Requirements for the new approach
(2/2)
• Consistency:
– Component failures during execution of a job
should not affect the outcome of the job
• Scalability:
– Adding load to the system should result in graceful
degradation in performance and not the failure of
the entire system
– Increasing resources should support proportional
increase in load capacity
21. A new approach to distributed
computing!
• Distribute data when the data is being stored
• Run computation where the data is stored
22. Core Concept (1/3)
• Distribute the data as it is initially stored in the
system
• Individual nodes can work on the data local to
those nodes
• No data transfer over the network is required
for initial processing
23. Core Concept (2/3)
• Applications are written in high-level code
• Developers need not worry about network
programming or low-level infrastructure
• Nodes talk to each other as little as possible
24. Core Concept (3/3)
• Data is spread among machines in advance
• Computation happens where the data is
stored
• Data is replicated multiple times on the
system for increased availability and reliability
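The replication idea above can be sketched in a few lines. This is a toy illustration, not HDFS's actual placement policy; `place_blocks` and the node names are hypothetical:

```python
from itertools import cycle

# Toy sketch: spread data blocks across nodes with 3x replication,
# assigning replicas round-robin (hypothetical helper, not the HDFS algorithm).
def place_blocks(blocks, nodes, replication=3):
    placement = {}
    ring = cycle(nodes)
    for block in blocks:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

layout = place_blocks(["blk1", "blk2"], ["node1", "node2", "node3", "node4"])
print(layout["blk1"])  # → ['node1', 'node2', 'node3']
```

With three copies of every block, any single node can fail and each block is still readable from two other machines.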
25. Fault Tolerance
• If a node fails, the master will detect the failure
and re-assign the work to a different node on the
system
• Restarting a task does not require
communication with nodes working on other
portions of the data
• If a failed node restarts it is automatically added
back to the system and assigned a new task
• If a node appears to be running slowly, the
master can redundantly execute another instance
of the same task
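The re-assignment behaviour above can be illustrated with a minimal scheduler sketch. All names (`schedule`, `failed`) are hypothetical and this is not Hadoop's actual scheduler:

```python
# Minimal sketch: the master assigns tasks round-robin over healthy nodes,
# so a failed node's work automatically lands on the survivors.
def schedule(tasks, nodes, failed=()):
    healthy = [n for n in nodes if n not in failed]
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

before = schedule(["t1", "t2", "t3"], ["n1", "n2", "n3"])
after = schedule(["t1", "t2", "t3"], ["n1", "n2", "n3"], failed={"n2"})
print(after)  # → {'t1': 'n1', 't2': 'n3', 't3': 'n1'}
```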
28. Batch Processing
• Also known as History-based processing
• Processing is executed against large data
already stored in some storage medium (e.g.
HDFS or S3)
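The classic batch job over stored data is word count, which the MapReduce model splits into map, shuffle, and reduce phases. A plain-Python sketch of those three phases (running on a list instead of HDFS):

```python
from collections import defaultdict

# Map phase: each input line becomes (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group all values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the grouped counts per word.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # → {'big': 3, 'data': 2, 'ideas': 1}
```

In a real cluster each map runs on the node holding its block of input, and only the shuffled pairs cross the network.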
37. Hadoop MapReduce (1/3)
• LocalJobRunner:
– Does not require any Hadoop daemons to be
running
– Uses the local file system instead of HDFS
• MRUnit:
– Built on top of JUnit
– Works with Mockito Framework to provide
required mock objects
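MRUnit's pattern is to drive a mapper with one input record and assert the exact output pairs (its Java API chains `withInput`/`withOutput`). The same idea in plain Python, with a hypothetical `mapper` standing in for the class under test:

```python
# MRUnit-style check in plain Python: one input record in, exact pairs out.
def mapper(line):
    return [(word.lower(), 1) for word in line.split()]

# Equivalent of withInput(...).withOutput(...).runTest() in MRUnit.
assert mapper("Hello World") == [("hello", 1), ("world", 1)]
print("mapper test passed")
```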
38. Hadoop MapReduce (2/3)
• Apache Hue:
– An open-source Web interface for analyzing
data with Apache Hadoop
40. Apache Spark (1/3)
• Run locally using Eclipse or IntelliJ
• Run using Spark Standalone
• Spark Testing Base:
– For implementing unit tests for Spark code
• Spark Validator:
– A library you can include in your Spark job to validate
the counters and perform operations on success
48. Important Considerations
• Number of clusters/nodes
• Hardware Specifications (HDD or SSD)
• Application/Environment Configurations (no.
of cores, no. of partitions, no. of threads,
disk/memory persistence, etc.)
• Data format (Text, Sequence, Avro, etc.)
• Data size
• Compression (Snappy, Gzip, etc.)
• Number of Reducers (MapReduce)
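Several of the knobs above map directly onto job configuration. A hedged fragment with example values only (the right numbers depend entirely on the cluster and workload):

```properties
# Example values only — tune per cluster and workload.
spark.executor.cores          4
spark.sql.shuffle.partitions  200
spark.io.compression.codec    snappy
mapreduce.job.reduces         10
```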
50. Sampling
• Sampling is defined as: “the act, process, or
technique of selecting a representative part of
a population for the purpose of determining
parameters or characteristics of the whole
population” – Merriam-Webster dictionary
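For test data, sampling usually means drawing a fixed-size random subset of the records. A minimal sketch; the 5% ratio and the integer "records" are stand-ins:

```python
import random

random.seed(42)  # fixed seed keeps the sketch repeatable
population = list(range(1000))           # stand-in for a large record set
sample = random.sample(population, 50)   # 5% sample, without replacement
print(len(sample))  # → 50
```

Because `random.sample` draws without replacement, no record appears twice in the subset.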
51. Useful Resources (1/3)
• Benchmarking Hadoop and HBase on Violin
• Benchmarking Cassandra on Violin
• http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/