5.
X-ray – 30MB
3D CT scan – 1GB
3D MRI – 150MB
Mammograms – 120MB
Growing – 20-40%/year
Premature babies' (preemies) health monitoring
◦ University of Ontario & IBM
◦ 16 different data streams
◦ 1260 data points per second
Early treatment
6. Data structure and storage
Analytical methods & Processing power
Parallelization needed
12. Hadoop – a framework for processing Big Data
Two main components:
◦ HDFS
◦ MapReduce
Thousands of nodes in a cluster
13. HDFS – a distributed fault-tolerant file system designed
to run on commodity hardware
Main characteristics
◦ Scalability
◦ High availability
◦ Large files
◦ Common hardware
◦ Streaming data access – write once, read many times
14. NameNode
◦ Master
◦ Controls storage
◦ Stores metadata about files
Name, path, size, block size, block IDs, ...
DataNode
◦ Slave
◦ Stores data in blocks
16. Files are stored in blocks
◦ Large files are split
Size: 64, 128, 256 MB …
Block metadata is stored in NameNode memory
◦ Limiting factor: ~150 bytes per file, directory, or block object
◦ 3 GB of memory ≈ 10 million single-block files
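The 3 GB figure can be sanity-checked: each file contributes one file object plus its block objects, at roughly 150 bytes of NameNode heap apiece. A quick illustration in Python (the function name is ours, not a Hadoop API):

```python
# Rule-of-thumb cost of one NameNode namespace object (file, directory, or block).
BYTES_PER_OBJECT = 150

def namenode_memory(num_files, blocks_per_file=1):
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million single-block files -> 20 million objects -> 3 GB of heap.
print(namenode_memory(10_000_000) / 1e9)  # 3.0
```

This is why HDFS favors a small number of large files over many small ones: memory cost is per object, not per byte.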
17. Seek time – 10 ms
Transfer rate – 100 MB/s
Block size – 100 MB, so seek time is ~1% of the transfer time (1 s)
The number of map & reduce tasks depends on the block size
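The 1% relationship can be checked with the slide's numbers (an illustrative calculation, not part of Hadoop):

```python
SEEK_TIME_S = 0.010        # 10 ms average disk seek
TRANSFER_RATE_MBPS = 100   # 100 MB/s sustained transfer rate

def seek_overhead(block_size_mb):
    # Fraction of total I/O time spent seeking when reading one block.
    transfer_time = block_size_mb / TRANSFER_RATE_MBPS
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)

print(seek_overhead(100))  # ~0.0099 -> seeking is ~1% of the time
print(seek_overhead(1))    # 0.5 -> with 1 MB blocks, half the time is seeking
```

Larger blocks amortize the fixed seek cost, which is why HDFS blocks are orders of magnitude bigger than filesystem blocks.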
20. First replica – same node as the client
Second – off-rack (a different rack)
Third – same rack as the second, different node
Next… – random nodes (tries to avoid placing
too many replicas on the same rack)
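The policy above can be sketched as a toy function (a simplification for illustration; the rack/node data structure and names are our assumptions, not the HDFS API):

```python
import random

def place_replicas(client, racks):
    """Simplified sketch of default HDFS replica placement.
    client: (rack, node) where the writer runs; racks: {rack: [nodes]}."""
    client_rack, client_node = client
    # 1st replica: the node the client runs on.
    first = (client_rack, client_node)
    # 2nd replica: a node on a different (off-rack) rack.
    other_rack = random.choice([r for r in racks if r != client_rack])
    second = (other_rack, random.choice(racks[other_rack]))
    # 3rd replica: same rack as the 2nd, but a different node.
    third = (other_rack,
             random.choice([n for n in racks[other_rack] if n != second[1]]))
    return [first, second, third]

racks = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
print(place_replicas(("r1", "n1"), racks))
```

Two racks hold all three replicas, balancing write bandwidth against rack-failure tolerance.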
22. Programming model for data processing
◦ Functional programming – directed acyclic graph
Hadoop supports: Java, Ruby, Python, C++
Associative array
◦ <key,value> pairs
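A toy map function makes the <key,value> view concrete (word-count style; the names are illustrative only):

```python
def map_fn(line):
    # Emit one <word, 1> pair per word in the input line.
    return [(word, 1) for word in line.split()]

print(map_fn("big data big cluster"))
# [('big', 1), ('data', 1), ('big', 1), ('cluster', 1)]
```

The framework then groups pairs by key, so a reducer sees <'big', [1, 1]> and can aggregate.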
23. Job - unit of work
◦ Input data
◦ Map & Reduce program
◦ Configuration information
A job is divided into tasks
◦ Map tasks
◦ Reduce tasks
24. JobTracker
◦ Coordinates all jobs by scheduling tasks to run on
TaskTrackers
◦ Keeps job progress records
◦ Reschedules tasks in case of failure
TaskTrackers
◦ Run tasks
◦ Send progress reports to the JobTracker
26. Hadoop divides the input to a MapReduce job into
fixed-size pieces of work – input splits
Creates one map task per split
◦ Runs the user-defined map function
The split size tends to be the size of an HDFS
block
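Splitting by block size can be sketched as follows (a simplification of what Hadoop's input format does; the function is illustrative, not the actual API):

```python
MB = 1024 * 1024

def input_splits(file_size, block_size=128 * MB):
    # One split (and hence one map task) per HDFS block by default.
    splits, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))  # (start offset, length) in bytes
        offset += length
    return splits

# A 300 MB file with 128 MB blocks -> 3 splits -> 3 map tasks.
print(len(input_splits(300 * MB)))  # 3
```

Matching splits to blocks means each map task can read its entire input from a single node.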
27. Data locality optimization
◦ Run the map task on a node where the input data
resides in HDFS.
◦ Data-local (a), rack-local (b), and off-rack (c) map
tasks.
28. Output – <Key, Value> pairs
Written to local disk – NOT to HDFS!
◦ Map output is processed by reduce tasks to
produce the final output
◦ No replicas needed
<Key, Value> pairs are sorted
If a node fails before the reduce phase, its map tasks are rerun
29. TaskTrackers read region files remotely (RPC)
Invoke the Reduce function (aggregation)
Output is stored in HDFS
Reduce tasks don't have the advantage of data locality
◦ Input to reduce – the output from all mappers
31. Minimizes the data transferred between map
and reduce tasks
Runs on the map output
"Reduce on the map side"
max is associative, so it works as a combiner:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
mean is not, so it cannot be used directly:
mean(0, 20, 10, 25, 15) = 14
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
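The max/mean contrast can be reproduced directly (helper names are ours, for illustration):

```python
def distributed(combine_fn, partitions):
    # Apply the function per partition (combiner step),
    # then once more across the partial results (reduce step).
    partials = [combine_fn(p) for p in partitions]
    return combine_fn(partials)

data = [0, 20, 10, 25, 15]
partitions = [[0, 20, 10], [25, 15]]

# max commutes with partitioning, so the combiner is safe:
print(distributed(max, partitions))   # 25, same as max(data)

def mean(xs):
    return sum(xs) / len(xs)

# mean of partial means weights partitions equally, so it differs:
print(mean(data))                     # 14.0
print(distributed(mean, partitions))  # 15.0
```

To combine means correctly, the combiner would have to emit (sum, count) pairs and defer the division to the reducer.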
32. Java (Ruby, Python, C++)
◦ Good for programmers
Pig
◦ A scripting language with a focus on dataflows
◦ Uses the Pig Latin language
◦ Allows merging, filtering, and applying functions
Hive
◦ Uses HiveQL – similar to SQL (used by Facebook)
◦ Provides a database query interface
HBase
36. #!/usr/bin/env bash
for year in all/*
do
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" |
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
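For comparison, the same scan could be written in Python (a sketch, not a Hadoop Streaming mapper; the filename is hypothetical, and the field offsets mirror the awk substr() calls above):

```python
# max_temperature.py (hypothetical) -- per-year max temperature scan.
import glob
import gzip
import os

def max_temp(lines):
    best = None
    for line in lines:
        temp = int(line[87:92])   # awk substr($0, 88, 5), 1-indexed
        quality = line[92]        # awk substr($0, 93, 1)
        if temp != 9999 and quality in "01459":
            if best is None or temp > best:
                best = temp
    return best

for path in sorted(glob.glob("all/*.gz")):
    year = os.path.basename(path)[:-3]  # strip the .gz suffix
    with gzip.open(path, "rt") as f:
        print(year, max_temp(f))
```

Like the shell version, this processes years sequentially on one machine, which is exactly the limitation the next slide discusses.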
37. Run parts of the program in parallel
◦ Process different years in different processes
Problems
◦ Unequal-size pieces
◦ Combining partial results needs processing time
◦ Single-machine processing limits
◦ Long processing time
44. Hadoop: The Definitive Guide, 3rd Edition
◦ http://it-ebooks.info/book/635/
Big Data: A Revolution That Will Transform How We Live, Work, and Think
http://hadoop.apache.org/
http://architects.dzone.com/articles/how-hadoop-mapreduce-works