3. ● Data analysis is becoming more and more important
● The complexity of analysis keeps increasing
● Meanwhile, the data we analyze grows big, fast!
s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
5. Hadoop: Intro
Hadoop is an open source Java framework aimed at data-intensive distributed applications.
It enables applications to work with thousands of nodes and petabytes of data.
6. Hadoop: Intro
Hadoop was inspired by Google's MapReduce and the Google File System (GFS).
http://labs.google.com/papers/mapreduce.html
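To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the example. It is only an illustration in plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run in parallel on many nodes, and the shuffle moves data between them over the network.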
7. Hadoop: HDFS
HDFS is a distributed, scalable filesystem designed to store large files.
In combination with the Hadoop JobTracker it provides data locality: tasks are scheduled on or near the nodes that hold their input blocks.
By default it automatically replicates every block to 3 DataNodes, preferably with 2 copies on two DataNodes within the same rack and one copy in another rack.
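The placement rule described on this slide can be sketched roughly as follows. This is a simplified illustration of "two copies in one rack, one in another", not the actual HDFS placement code; `place_replicas` and the rack and node names are hypothetical.

```python
import random


def place_replicas(racks, rng=random):
    """Pick 3 DataNodes for one block: 2 in one rack, 1 in another.

    `racks` maps a rack name to the list of DataNode names in that rack.
    Returns a list of (rack, node) tuples.
    """
    # The rack holding two replicas needs at least two nodes.
    pair_candidates = [r for r, nodes in racks.items() if len(nodes) >= 2]
    if not pair_candidates:
        raise ValueError("no rack has two DataNodes for the replica pair")
    pair_rack = rng.choice(pair_candidates)

    # The third replica goes to any other non-empty rack.
    single_candidates = [
        r for r, nodes in racks.items() if r != pair_rack and nodes
    ]
    if not single_candidates:
        raise ValueError("no second rack available for the third replica")
    single_rack = rng.choice(single_candidates)

    pair = rng.sample(racks[pair_rack], 2)
    single = rng.choice(racks[single_rack])
    return [(pair_rack, pair[0]), (pair_rack, pair[1]), (single_rack, single)]
```

Spreading the copies over two racks means a block survives the loss of a whole rack, while keeping two copies in one rack limits cross-rack traffic.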
8. Hadoop: HDFS
● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes
9. Hadoop: HDFS
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
18. Hadoop Pig
-- Load the employee records; PigStorage(',') splits each line on commas.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);
grouped_by_department = GROUP data BY department;
-- Per department: number of employees and sum of wages.
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;
-- ORDER sorts ascending by default; append DESC to list the highest totals first.
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
DUMP total_limited;
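For readers new to Pig, the script above does roughly what the following plain Python does on an in-memory list of rows. This is only a sketch of the semantics (group, aggregate, order, limit); the function name `total_wage_by_department` and the tuple layout are assumptions taken from the schema in the script, and nothing here runs on Hadoop.

```python
from collections import defaultdict


def total_wage_by_department(rows, limit=10):
    """rows: iterable of (first_name, last_name, age, wage, department)."""
    count = defaultdict(int)
    total = defaultdict(float)
    # GROUP BY department, then count employees and sum wages per group.
    for _first, _last, _age, wage, department in rows:
        count[department] += 1
        total[department] += wage
    result = [(dept, count[dept], total[dept]) for dept in count]
    # ORDER BY total_wage (ascending, like Pig's default), then LIMIT.
    result.sort(key=lambda row: row[2])
    return result[:limit]
```

The point of Pig is that the same few lines of intent are compiled into MapReduce jobs and run across the cluster instead of in one process.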
19.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    price:float,
    country:chararray
);
-- Optional: restrict to one author (MATCHES takes a regular expression).
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
-- Join both relations on book_id, using 12 reducers.
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
-- Per book: number of sales and total revenue.
total_sales_by_book =
    FOREACH grouped_by_book
    GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.price) AS total_sales;
STORE total_sales_by_book INTO 'book_sale_results';
20. UDF
● Custom Load and Store classes
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...
Take a look at the PiggyBank.
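As an illustration of the kind of logic a custom extraction UDF might wrap, here is a small plain Python function that pulls the request date out of a combined-format web server log line. It is a sketch only: `extract_date` is a hypothetical name, the timestamp layout is assumed to be the usual `[10/Oct/2000:13:55:36 -0700]` form, and registering such a function with Pig is not shown.

```python
import re
from datetime import datetime

# Timestamp as it usually appears in combined log lines (assumed layout),
# e.g. [10/Oct/2000:13:55:36 -0700]
_TIMESTAMP = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]"
)


def extract_date(log_line):
    """Return the request date as 'YYYY-MM-DD', or None if none is found."""
    match = _TIMESTAMP.search(log_line)
    if match is None:
        return None
    # %b expects English month abbreviations under the default C locale.
    parsed = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return parsed.date().isoformat()
```

Before writing something like this yourself, check whether the PiggyBank already ships a function that does the job.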