Hadoop Big Data Intro
2/16/2013
Hadoop/BigData Intro
Provided agenda
Addition: Theory from papers
Addition: Demo/code samples
Addition: System architecture
Goal: develop some theory
Agenda
● Introduction to Big Data
● Basic Concepts
● Hadoop
  – Overview of Hadoop
  – Working with HDFS / Map Reduce
  – Architecture
  – Anatomy of File write / read
  – Admin and Development
● Introduce other components of the Hadoop ecosystem
Agenda (2)
● Hive / HBase / Pig / Sqoop
● Map Reduce
  – Features
  – Architecture
  – Working
  – Job Execution
We can cover this circa-2005 agenda in 3 hours with some additions. A hands-on lab is needed to understand the content.
Big Data defn.
● Big data: too big to run SQL queries on
● Lots of data (cover the Google approach, which is what Hadoop is based on)
Replacing Legacy Systems

10x

Building Applications on Hadoop, Competitive Gap
Astyanax
DevOps, Packaging, Chaos Monkey, AWS,
Zookeeper

Modifying the Hadoop Components, JIRA

3-4x
Big Data Basic Concepts
● Storing large amounts of data and doing something with them
  – Some sort of analytics
    ● Easy: Tableau, Datameer
    ● Competitive Advantage
      – Small scale analytics: R, stats 202, DemographicsWeblog
      – Large scale analytics:
        ● cs246
        ● Should be able to define analytics POCs based on the next slide, which are domain specific
Big Data Analytics
Big Data started in 2000, 2 design
problems @Google, 1998-2000
There is a separate Big Data product for each
use case.
● Google Design Problems/GFS:
  – Store internet pages on hard drives
  – Unstructured data
    ● Collect HTML and links; images?
    ● 20+ billion web pages x 20 KB = 400+ TB
    ● 1 computer reads 30-35 MB/sec from disk
    ● ~4 months to read the web
    ● ~1,000 hard drives to store the web
    ● Source: Jure Leskovec slides, cs246
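The arithmetic behind these numbers can be checked with a short sketch. The page count, page size and disk read rate come from the slide; decimal units (1 KB = 1,000 bytes) and the 400 GB per-drive capacity are assumptions chosen here for illustration.

```java
// Back-of-the-envelope check of the slide's numbers (decimal units assumed).
final class WebScaleEstimate {
    static final double PAGES = 20e9;           // 20+ billion pages (from the slide)
    static final double BYTES_PER_PAGE = 20e3;  // 20 KB per page (from the slide)

    /** Total corpus size in terabytes (1 TB = 1e12 bytes). */
    static double corpusTerabytes() {
        return PAGES * BYTES_PER_PAGE / 1e12;
    }

    /** Days for a single machine to read the corpus at the given MB/sec. */
    static double daysToRead(double megabytesPerSecond) {
        double seconds = PAGES * BYTES_PER_PAGE / (megabytesPerSecond * 1e6);
        return seconds / 86_400.0;
    }

    /** Drives needed for a given per-drive capacity in GB (the capacity is an assumption). */
    static long drivesNeeded(double driveGigabytes) {
        return (long) Math.ceil(PAGES * BYTES_PER_PAGE / (driveGigabytes * 1e9));
    }

    public static void main(String[] args) {
        System.out.printf("corpus: %.0f TB%n", corpusTerabytes());
        System.out.printf("read at 35 MB/s: %.0f days%n", daysToRead(35));
        System.out.printf("read at 30 MB/s: %.0f days%n", daysToRead(30));
        System.out.println("drives at 400 GB each: " + drivesNeeded(400));
    }
}
```

At 30-35 MB/sec a single reader needs roughly 130 to 155 days, which is where the "~4 months" figure comes from.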
Google M/R
● Once the data is on 1k machines...
  – How to run an algorithm over 1k disk drives?
  – Traditional method: read the file into memory. Can't put the web pages into memory, and reading the data would saturate the network.
  – Soln: Map Reduce. Move the code to the data via mappers and reducers, which are placed on the same computer as the data
● GFS paper/MapReduce paper. Hadoop = GFS + M/R
Google GFS
● The html/links/images were stored in BigTable. Store html pages into files. Many pages per file. Why? Seeks are expensive ($), store crawl
● 2 parts: What is a file system? SB (superblock) = collection of inodes
R/W in a file system
● Read the contents of foo.txt
  – Go to the superblock, find the location of the datablocks from the pointer in the superblock for foo.txt, and read them into memory
● Write into foo.txt
  – Go to the superblock, write the contents into new datablocks, and append the addresses of the datablocks into the superblock entry for foo.txt.
Distribute file system across servers
● Superblocks => GFS master => Hadoop NameNode
● inodes => chunkservers => Hadoop DataNode
R/W in distributed file system
● Read from HDFS foo.txt:
  – Go to the namenode, find the datablock where the data is, read the data into memory on the client machine. What is the difference?
● Write into HDFS foo.txt:
  – Go to the namenode, find an empty block, tell the client to send data to an empty block on the datanode, append the addresses of the new blocks into the NN for foo.txt. What is the difference?
● Client, Network
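The read and write paths above can be sketched as a toy model in plain Java. This is not the HDFS API: ToyDfs, BlockLocation and the round-robin block placement are hypothetical names and choices made only for illustration. The point it shows is that the "namenode" holds addresses only, while the contents live on the "datanodes".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: all class and method names here are hypothetical, not the HDFS API.
final class ToyDfs {
    /** Address of one block: which datanode holds it and under which block id. */
    static final class BlockLocation {
        final int dataNode;
        final int blockId;
        BlockLocation(int dataNode, int blockId) {
            this.dataNode = dataNode;
            this.blockId = blockId;
        }
    }

    // "NameNode": file name -> ordered block addresses (metadata only, no file contents).
    private final Map<String, List<BlockLocation>> nameNode = new HashMap<>();
    // "DataNodes": each one maps block id -> block contents.
    private final List<Map<Integer, String>> dataNodes = new ArrayList<>();
    private int nextBlockId = 0;

    ToyDfs(int numDataNodes) {
        for (int i = 0; i < numDataNodes; i++) {
            dataNodes.add(new HashMap<Integer, String>());
        }
    }

    /** Write path: pick a datanode, send the data there, then record the address in the namenode. */
    void append(String file, String chunk) {
        int blockId = nextBlockId++;
        int target = blockId % dataNodes.size();   // round-robin placement, an assumption of this sketch
        dataNodes.get(target).put(blockId, chunk);
        nameNode.computeIfAbsent(file, f -> new ArrayList<>()).add(new BlockLocation(target, blockId));
    }

    /** Read path: look up the block addresses, then fetch each block from its datanode. */
    String read(String file) {
        StringBuilder contents = new StringBuilder();
        for (BlockLocation loc : locations(file)) {
            contents.append(dataNodes.get(loc.dataNode).get(loc.blockId));
        }
        return contents.toString();
    }

    /** Metadata lookup only; an unknown file has no blocks. */
    List<BlockLocation> locations(String file) {
        List<BlockLocation> found = nameNode.get(file);
        return found == null ? Collections.<BlockLocation>emptyList() : found;
    }

    public static void main(String[] args) {
        ToyDfs dfs = new ToyDfs(3);
        dfs.append("foo.txt", "hello ");
        dfs.append("foo.txt", "world");
        System.out.println(dfs.read("foo.txt"));
    }
}
```

Compared with the single-machine file system, every lookup and every block transfer here would cross the network in a real cluster, which is the difference the slide asks about.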
Hadoop HDFS, List of files in
system, blocks file contents
HDFS Demo
● List of files
● NN+DN website
  – http://<name_node_address>:50070/
  – Where is the DN? Port: 50075
● Logs demo
● Running in single-node pseudo-distributed (PD) mode
  – JVM processes are threads vs. separate JVM processes for each service.
  – Global variables in mappers work in PD mode but not in a cluster
● /etc/init.d. Do not download and install the tarball
File R/W system issue
● Cache/Disk Drives
● Before writing from memory to disk the power goes out. Lost data
Write to Memory

Write to Disk
Failures
● Commodity servers fail. One server @Google may stay up 3 years (1,000 days)
● If you have 1,000 servers, expect to lose 1/day
● With 1M machines, 1,000 machines fail every day!
● Google 3y vs. elsewhere once per 3w? Why? 20 servers?
● GFS paper/restart failed M/R tasks. Not in Hadoop
● Most system designs neglect failure, except Netflix Chaos Monkey
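The failure numbers follow from simple expected-value arithmetic. A minimal sketch, assuming failures are independent and each server fails on average once every 1,000 days (both are simplifying assumptions, not measured data):

```java
// Expected failures per day under an independence assumption.
final class FailureEstimate {
    /** Expected number of machines failing per day. */
    static double expectedFailuresPerDay(long machines, double meanDaysBetweenFailures) {
        return machines / meanDaysBetweenFailures;
    }

    /** Probability that at least one machine fails on a given day. */
    static double probAtLeastOneFailure(long machines, double meanDaysBetweenFailures) {
        double oneMachineSurvivesTheDay = 1.0 - 1.0 / meanDaysBetweenFailures;
        return 1.0 - Math.pow(oneMachineSurvivesTheDay, machines);
    }

    public static void main(String[] args) {
        System.out.println("1,000 servers: " + expectedFailuresPerDay(1_000, 1_000) + " failures/day");
        System.out.println("1M servers:    " + expectedFailuresPerDay(1_000_000, 1_000) + " failures/day");
        System.out.printf("20 servers: %.3f chance of a failure today%n", probAtLeastOneFailure(20, 1_000));
        System.out.printf("1,000 servers: %.3f chance of a failure today%n", probAtLeastOneFailure(1_000, 1_000));
    }
}
```

With 20 servers a failure on any given day is rare (about 2%); with 1,000 servers it is more likely than not, so the design has to treat failure as the normal case.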
What is Hadoop?
● An implementation of GFS/Map Reduce in Java.
  – Used at Yahoo, LinkedIn, Facebook, Netflix, Twitter
    ● What did each contribute? Use cases?
  – Doug Cutting (Cloudera)/Lucene
  – v1.0 vs v2.0
  – Hadoop Components: HBase, Flume, Sqoop, Zookeeper, Oozie, Pig, Hive
HDFS
● HDFS (Hadoop Distributed File System) is a distributed file system.
  – Unlimited capacity: to add more capacity, add more nodes
  – A file's superblock (SB) info is stored in the NN (NameNode) server. Inodes or datablocks are stored in DN (DataNode) servers.
  – Replicate for data locality & error detection/recovery
    ● Replicate a data block 3x. Why?
  – HDFS:
    ● Append only file system (copy Google Paper)
HDFS
● What is your file system on your laptop?
  – Append only or Random R/W?
● When is append only bad?
  – Digression: RMW (read-modify-write). Editing a Word document is what? Append only or RMW?
  – Design exercise: 200Gb in files. How many files are there?
  – Does this fit in memory?
HDFS Design exercise
● Many files combined into a smaller number of large files. How to access the smaller files?
  – Slower to access for reads
  – If RMW: add the modifications into new blocks in HDFS. Finding the new blocks and reading them into memory is slower than sequential access on a single-node file system
  – Faster to delete the old file and create a new file with sequential blocks in place.
Solns
1) On every write to memory, also write to disk
  – Why Good?
  – Why Bad?
2) Lose the data when the power goes out
  – Why Good?
  – Why Bad?
● FSCK: file system consistency check
Agenda: Admin and Development
● HDFS/MR Administration. HBase, etc. are different
  – 24x7 SLA
    ● Hot standbys for maintenance
    ● HDFS: Recovery from user error, restore the file I just deleted
    ● HDFS/MR: Recovery from failures (not automated in Hadoop)
    ● MR lagging mapper, cascading failures
Development
● Apache S/W development practices
  – Jenkins, Jira tickets
  – Repos
HDFS Schemas
● Do you store 20B files on HDFS by file name?
  – What happens with multiple files with the same name? e.g. test.txt?
  – Create metadata, partitions
● HDFS Schemas:
  – Avro
  – Parquet
    ● Dremel column store/encoding
Map Reduce Intro(1)
● Map Reduce
  – Designed in 2000, when there was very little memory in commodity PCs, ~4GB or less. These aren't enterprise class servers.
  – This isn't the case today. Multi-CPU/multi-core 192 GB machines are much more reliable, with different use cases
  – The M/R idiom is being replaced with non-MR systems.
  – What we don't cover
    ● Google F1
Map Reduce Intro
● There are 3 parts to how Map Reduce works:
  – Mapper
  – Shuffle
  – Reducer
● There are 3 parts to a Map Reduce program:
  – Mappers
  – Reducers
  – Driver
● These 2 concepts aren't the same. People get these mixed up.
Map Reduce Part 1
● 1k node cluster; bring the code to the data. Reduce network traffic
● Programming idiom
● Divide task into mappers.
● Examples of what can be divided and combined
  – Try dividing first; assume you can combine anything you can divide
  – Divide the input file into single lines, send one line to each server, process each line
Word Count
● I can count a text file of words with a single program.
● I can split the file across mappers and have the mappers count the words in parallel
[Diagram: four FileLine boxes, each feeding its own Mapper]
Word Count
● The mappers output K/V pairs onto the network. These are not plain Java Strings or arbitrary Java objects; they are Hadoop's serializable types!
  – Keys: Comparable, Writable (WritableComparable)
  – Values: Writable
● Network saturates with multiple M/R jobs.
[Diagram: Network feeding four Reducers]
Shuffle/Reduce Part 2/3
● The K/V pairs are sent over the network to certain destinations based on 2 rules:
  – 1) all values for a given K go to the same reducer
  – 2) each reducer sees its keys in sorted order
  – 3) Output in 2 forms: the _SUCCESS marker and part-00000 files
  – Custom partitioner to send K to a specific Reducer
  – Grouping Comparator: group keys to reducer
  – Sorting Comparator: can modify the sort order for compound keys
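The two shuffle rules can be illustrated with an in-memory sketch that uses no Hadoop classes. ShuffleSketch and its method names are hypothetical; the partition formula (non-negative hash modulo the number of reducers) mirrors the arithmetic of Hadoop's default hash partitioner, and a TreeMap stands in for the sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory sketch of map -> shuffle -> reduce for word count; no Hadoop classes are used.
final class ShuffleSketch {
    /** Rule 1: the same key always maps to the same reducer index. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs word count and returns one sorted key -> count map per reducer. */
    static List<TreeMap<String, Integer>> wordCount(List<String> lines, int numReducers) {
        // Shuffle target: one sorted multimap per reducer (rule 2: TreeMap keeps keys sorted).
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            shuffled.add(new TreeMap<String, List<Integer>>());
        }

        // Map phase: each line emits (word, 1); the partitioner routes each pair.
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                shuffled.get(partition(word, numReducers))
                        .computeIfAbsent(word, k -> new ArrayList<>())
                        .add(1);
            }
        }

        // Reduce phase: each reducer walks its keys in sorted order and sums the values.
        List<TreeMap<String, Integer>> output = new ArrayList<>();
        for (TreeMap<String, List<Integer>> bucket : shuffled) {
            TreeMap<String, Integer> counts = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> entry : bucket.entrySet()) {
                int sum = 0;
                for (int v : entry.getValue()) {
                    sum += v;
                }
                counts.put(entry.getKey(), sum);
            }
            output.add(counts);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("b c a");
        System.out.println(wordCount(lines, 2));
    }
}
```

Each element of the returned list plays the role of one reducer's output file. A custom partitioner would replace partition(), and a sorting comparator would be passed to the TreeMap constructor.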
Map Reduce Word Count
M/R Program/Mapper
// Old (org.apache.hadoop.mapred) API. The fields word, one, caseSensitive and
// patternsToSkip belong to the enclosing Mapper class, which the slide does not show.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    for (String pattern : patternsToSkip) {
        line = line.replaceAll(pattern, "");
    }
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);                      // emit (word, 1)
        reporter.incrCounter(Counters.INPUT_WORDS, 1);
    }
}
M/R Program Reducer
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up the counts for this key
        }
        output.collect(key, new IntWritable(sum));
    }
}
M/R Program Driver
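public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // the reducer doubles as the combiner
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }
}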
AVRO/Protocol Buffers
● Avro is a serialization format for Mappers. Splittable, with a human-readable (JSON) schema. Not as small as PB (Protocol Buffers).
● Avro object
Protocol Buffers
● Used internally at Google, compact serialization: https://code.google.com/p/protobuf/
● "proto bufs": not just serialization, closest to binary. Used internally in Hadoop.
Why do we need Avro, Protobufs?
Binary: no parser, fast, small. OK for objects; maybe this is like Hibernate
Thrift
● Add a server to send/receive objects and do the serialization/deserialization
Map Reduce References
● What can I do with each text line?
  – Easy: ETL patterns:
    ● Match patterns
    ● Count the number of occurrences of tokens
    ● Processing files
  – Harder: Machine Learning/DM. What can't be easily done?
    ● K-means clustering
    ● Ullman book: Mining Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
    ● Jimmy Lin book: http://www.umiacs.umd.edu/~jimmylin/
MRv2
● 2 versions of M/R
  – v1: old api, import xxx.mapred, JT/TT
  – v2: new api, import xxx.mapreduce, RM/NM/JH
● YARN, in Hadoop 2.x, maintains backward compatibility with M/R v1.
  – Devs start shifting to Hadoop 2.x YARN for new bug fixes
YARN Daemons
● hadoop-hdfs-datanode
● hadoop-hdfs-namenode
● hadoop-yarn-resourcemanager
● hadoop-yarn-nodemanager
● hadoop-yarn-proxyserver
● hadoop-hdfs-SecondaryNameNode
● hadoop-hdfs-JournalNode
● hadoop-Hdfs-zkfc
● hadoop-Httpfs
● hadoop-mapreduce-HistoryServer
YARN->Enterprise
● Encrypted/Pluggable Shuffle/Sort
● Httpfs rewrite or proxyserver
● V2 user authentication/permissions.
  – Apache Sentry
    ● Separate authorization policies per database/schema
    ● Users have to customize for shared data structures (tables/metadata, hbase, search, zk). Not in any distro!
    ● Schema metadata needs fine-grained auth.
● Web app proxy/part of RM to reduce attacks on the exposed RM web server
Map Reduce Demo
● Word Count demo
  – HDFS NameNode: http://localhost:50070/
  – HDFS DataNode: http://localhost:50075/
  – ResourceManager: http://localhost:8088
  – JobHistory Server: http://jhs_host:19888
● Logging mistakes
  – Logging added to M/R jobs grows in proportion to the data size and the number of times the program is run. A 1TB file means 1TB of logs. Processing 100GB 10x
  – Logs fill up the disk and crash the system
  – Zookeeper logs
M/R Pipelines
● The successful organizations never write Mappers/Reducers directly. They use higher level tools like Pig, Hive, etc.
● Defn:
  – Workflow: a series of M/R jobs
  – Pipeline: the output of one M/R job is the input to another
● Apache Crunch, modeled after Google FlumeJava
Google FlumeJava
● Introduction of data pipelines based on multiple M/R stages
● Define a parallel collection with a set of parallel operations
● Much easier to use than M/R programming. Contrast w/UDFs. Fewer lines of source:
Apache Crunch
● Not just M/R
  – Faster to specify a data processing pipeline w/an API you can customize, instead of writing Pig/Hive scripts: MRPipelines
  – YARN, next version of M/R
  – Supports Apache Spark: SparkPipelines
  – Can keep in memory vs. spill to disk: MemPipelines
Case Study of old systems
● Older generation of Hadoop components: Hadoop, Pig, Hive.
● Gives insight into the stability/capability of products
Hive at LinkedIn (bottom left). All 3
similar
LinkedIn
● Pig+DataFu
● Hive bottom left corner
● Teradata+Hadoop
Netflix, Block Diagram
Yahoo Block Diagram, Pig, Hive,
Spark, Storm
Yahoo
● Targeting Content, not Search
● 3k Pig jobs in production
● Hive in small use for analysts, Pig in heavy production use. Non-MR in use now. Matches Google's progression
Mapper Failures
● What happens? Google's paper restarts failed tasks. NS
● Hadoop isn't auto recovery
● Hadoop Mapper/Reducer Worker Failure:
  – Reschedule on another worker
  – Speculative Execution
  – Completed ok, in progress reset
  – (ADD FROM VIDEO)
● Master failure: abort and return fail to client
M/R Runtime
● Balancing Cluster capacity
  – # mappers >> number of nodes
  – # reducers << # mappers
  – One HDFS chunk per mapper. Careful w/small files. Why? Won't just “run”. Need admin
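Why small files hurt can be seen from the one-chunk-per-mapper rule. The sketch below assumes a 128 MB block size and no packing of several small files into one split; both are assumptions for illustration, and SplitEstimate is a hypothetical helper, not a Hadoop class.

```java
// Rough count of map tasks under "one HDFS chunk per mapper".
final class SplitEstimate {
    /** Map tasks for equally sized files: one task per block, at least one per file. */
    static long mapTasks(long fileCount, long bytesPerFile, long blockBytes) {
        long blocksPerFile = Math.max(1, (bytesPerFile + blockBytes - 1) / blockBytes);  // ceiling division
        return fileCount * blocksPerFile;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long tb = 1L << 40;
        long block = 128 * mb;   // assumed block size
        System.out.println("1 TB in one file:    " + mapTasks(1, tb, block) + " map tasks");
        System.out.println("1 TB in 1 MB files:  " + mapTasks(tb / mb, mb, block) + " map tasks");
    }
}
```

The same terabyte produces 8,192 map tasks as one file and over a million as 1 MB files, each task paying its own startup cost.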
Bad Design
● Combiners
  – Reduce network traffic. Google has special switches for network latency/throughput
  – job.setCombinerClass(IntSumReducer.class);
  – Combiner can execute 0, 1 or many times. Why?
● Combiner demo:
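A minimal stand-in for the demo, in plain Java rather than on a cluster. Because the framework decides whether and how often the combiner runs, the combine function has to give the same final answer after 0, 1 or many passes. Summation has that property. CombinerSketch and the chunking scheme are hypothetical, chosen only to imitate partial aggregation of map output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Shows that a sum combiner is safe to apply any number of times.
final class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Applies the combiner `rounds` times to chunks of the map output, then reduces. */
    static int reduceWithCombiner(List<Integer> mapOutput, int rounds, int chunkSize) {
        List<Integer> current = mapOutput;
        for (int r = 0; r < rounds; r++) {
            List<Integer> combined = new ArrayList<>();
            for (int i = 0; i < current.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, current.size());
                combined.add(sum(current.subList(i, end)));   // partial sum for this chunk
            }
            current = combined;
        }
        return sum(current);   // the reducer applies the same function to whatever arrives
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1, 1, 1, 1, 1);
        for (int rounds = 0; rounds <= 3; rounds++) {
            System.out.println(rounds + " combiner pass(es): " + reduceWithCombiner(ones, rounds, 3));
        }
    }
}
```

Every line prints 7. A function such as an average would not survive this treatment unchanged, which is why only associative and commutative operations are safe as combiners.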
Greedy Scheduling
● Google Borg (not published)
● Mesos/YARN
  – Linux cgroups/containers
  – Allocate memory/CPU to each task
  – IO not implemented; Sync/Async
Batch ETL/Apache
● Apache Pig
  – PigLatin for ETL
  – No metadata generated
  – Demo
● Pig Lipstick
  – For debugging Pig M/R DAGs
  – Demo
Writing SQL queries in M/R
● Select * from /tmp/sqlqueries
● Select a
● What is the problem with implementing SQL queries in M/R? What do you get w/a db that you don't get with SQL on M/R?
After GFS, M/R; Google Sawzall
● Contributions:
  – High level procedural language, simpler than SQL, operating on unstructured data
  – How to deal with performance problems with sparse data records?
    ● Protobufs (used in Hadoop). Dense serialization format to reduce network traffic/disk space
  – Multiple jobs, multiple users
    ● Workqueue (Apache Oozie)
Apache Pig
● Paper: Chris Olston, Stanford/Yahoo Research
● Related to Google Sawzall
● Contributions:
  – PigLatin, like a unix pipe model
  – Cat a file, grep and count # of the word 'foo', sed/awk
  – All data are tuples
  – Write M/R jobs at a higher level than Java Mappers/Reducers
  – Write multistage M/R pipelines
Pig example
-- SQL: SELECT col1, col2 FROM table;
data = LOAD 'table';
cols = FOREACH data GENERATE $0, $1;
DUMP cols;
Pig Join example
-- SQL: SELECT t1.col2, t2.col2 FROM t1 JOIN t2 ON t1.col1 = t2.col1;
t1 = LOAD 't1';
t2 = LOAD 't2';
cols1 = FOREACH t1 GENERATE $0, $1;
cols2 = FOREACH t2 GENERATE $0, $1;
joined = JOIN cols1 BY $0, cols2 BY $0;
-- joined has four fields: cols1.$0, cols1.$1, cols2.$0, cols2.$1
cols = FOREACH joined GENERATE $1, $3;
DUMP cols;
Pig example
#--------------------
# Map Reduce Plan
#--------------------
digraph plan {
compound=true;
node [shape=rect];
s46641938_in [label="", style=invis, height=0, width=0];
s46641938_out [label="", style=invis, height=0, width=0];
subgraph cluster_46641938 {...
Apache Pig Lipstick
● Debug tool for Pig jobs, open sourced by Netflix
SQL queries over GFS?
● Google Tenzing
  – Distributed Worker Pool
  – Modify M/R API
    ● Streaming M/R jobs
  – SQL92
Apache Tez (WIP)
● Streaming of MR Job1 into MR Job2, like Tenzing
● Patches in Pig/Hive
Apache Hive
● FB data warehouse paper
  – Introduce tables into HDFS (schema)
  – Requires DB to store metadata
  – HiveQL
● Solved problems
  – Easy for analysts to use, w/o writing MR jobs
  – Stored metadata, unlike Pig
  – Supports user queries w/joins
  – Doesn't support UPDATE. Can't update a file in HDFS. Files are immutable.
Hive QL
● Create table foo(id int)
● Create table foo(id int) location '/tmp/data/data.txt'.
  – Hive moves the data.txt file! Looks like Hive deleted it. Use an external table; when it is dropped nothing happens. Non-external table data is deleted after the table is dropped.
● We can parse in csv files; this is different than a database b/c we are dealing with unstructured data.
● Create table foo1(username string, map<String,int>) row format delimited fields terminated by '; '
● Map is an aggregate type
Schema on read vs Schema on write
● Data has to match the schema for a database. Process the data, then import it into the db. Everything has to match: columns, format, etc.
● Hadoop is schema on read; you can create any schema. It doesn't drop a column that isn't defined, as a DB would.
● Typically, loading data into a database requires some clean-up program to get all the data into the right form, with the right number of columns and the right data ranges.
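Schema on read can be sketched in a few lines of plain Java: the raw lines are stored untouched, and a schema (here just a delimiter and a column count) is applied only when the data is read. SchemaOnRead is a hypothetical helper, not Hive code; filling short rows with nulls is an assumption modeled on how delimited text tables commonly behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Applies a schema at read time; the raw lines are never modified or rejected.
final class SchemaOnRead {
    /** Projects raw lines onto `columns` fields: short rows yield nulls, extra fields are ignored for this read. */
    static List<String[]> read(List<String> rawLines, String delimiterRegex, int columns) {
        List<String[]> rows = new ArrayList<>();
        for (String line : rawLines) {
            String[] fields = line.split(delimiterRegex, -1);   // -1 keeps trailing empty fields
            String[] row = new String[columns];                 // missing columns stay null
            System.arraycopy(fields, 0, row, 0, Math.min(fields.length, columns));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice,30,NY", "bob", "carol,41,SF,extra");
        for (String[] row : read(raw, ",", 2)) {
            System.out.println(Arrays.toString(row));
        }
        // A wider schema over the same raw data exposes the fields the first read ignored.
        for (String[] row : read(raw, ",", 4)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

No line is rejected at load time, and the fourth field of the last row is still available to any later schema that asks for it. A schema-on-write database would have required cleaning these rows before import.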
Hive Serdes
● Use this to import data without the preprocessing a database would need
CREATE TABLE access_log (
  remote_ip STRING, request_date STRING, method STRING, request STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) .* .* \\[([^\\]]+)\\] \"([^ ]*) ([^ ]*) ([^ \"]*)\".*",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE;
Hive Impl
● External Hive tables are directories in HDFS. You can delete the files and the tables will be empty.
● Or you can add data into the directories and have the tables grow
● Hive adds a schema to HDFS
● HiveServer2:
  – Security, multiple clients
● Hive+Tez, Hive-0.12+
Hive Metastore Changes
Hcatalog/WebHCat
HiveQL
● SELECT * FROM table1;
● SELECT col1, col2 FROM table2;
● Writing data into Hive
  – LOAD DATA INPATH '/user/dc/tmp' INTO TABLE table1;
  – LOAD DATA INPATH '/user/dc/tmp' OVERWRITE INTO TABLE table1; (deletes the existing data before writing)
Google BigTable
● Paper
● Contributions
  – Chubby (metadata)
  – GQL: limited syntax compiled into M/R, control over data placement (locality, MR perf) and format
● Data model: 1 table w/millions of columns, billions of rows
Apache HBase
Bigtable clone: Zookeeper, HDFS, Avro, Thrift, REST. Data model: 1 table w/millions of columns & billions of rows
HBase
● Partition billions of rows into regions
HBase W/R
● Write: client -> write into WAL & memstore. Autosharding (src: Cloudera). Presplit regions!
Apache HBase
● Schema design is the critical point
  – Schema design shows understanding of architecture & implementation for the use case
  – Rows and column families. Why?
  – Wibidata Apache KijiSchema
HBase Client Design
● Do things in parallel, then merge the results. Not JDBC
● Mistake:
private void doMultipleClients(final Class<? extends Test> cmd) throws IOException {
  final List<Thread> threads = new ArrayList<Thread>(this.N);
  final int perClientRows = R/N;
  for (int i = 0; i < this.N; i++) {
    Thread t = new Thread (Integer.toString(i)) {
Correct
● github
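A minimal sketch of the "do things in parallel, then merge the results" idea, using a thread pool and futures instead of hand-managed threads. It does not use the HBase client API: scanRange is a hypothetical stand-in for a per-region scan, and splitting rows into equal ranges is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: scanRange stands in for a per-region scan; it is not the HBase client API.
final class ParallelScanSketch {
    /** Hypothetical per-range work: counts the rows in [start, end). */
    static long scanRange(long start, long end) {
        return Math.max(0, end - start);
    }

    /** Splits [0, totalRows) into `clients` ranges, scans them in parallel, and merges the partial counts. */
    static long parallelCount(long totalRows, int clients) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            long perClient = (totalRows + clients - 1) / clients;   // ceiling, so no remainder rows are dropped
            for (int i = 0; i < clients; i++) {
                final long start = Math.min(totalRows, i * perClient);
                final long end = Math.min(totalRows, start + perClient);
                partials.add(pool.submit(() -> scanRange(start, end)));
            }
            long merged = 0;
            for (Future<Long> partial : partials) {
                merged += partial.get();   // merge step: wait for each range and add its count
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for scans", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelCount(1_000_003, 7));
    }
}
```

The ceiling division matters when the row count is not a multiple of the client count: plain integer division would leave the remainder rows unscanned.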
[Diagram: three Clients, Zookeeper, three HBase RegionServers (RS)]
Google Dremel
● End of M/R
● select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, '[0-9]*') AND wp_namespace = 0;
● 35B rows in 10s/~35GB. How? 2 tricks
Google Dremel(2)
Cloudera Impala
● Google Dremel
  – Partition key and Parquet column store schema
● GFS/MR are obsolete, F1
Apache Spark
● UC Berkeley BDAS, in CDH5, support here, statement from ULM changing
  – Tachyon/Mesos/Spark/Shark/MLBase
Apache Storm
● Storm in HDP 2.X
  – Instead of multiple threads across multiple servers.
  – Sample code
Hands On Labs
● Install Hadoop on Amazon EC2.
  – Goal: learn config/logs/how things run in HDFS & M/R
    ● HDFS hands on
    ● M/R hands on, no programming
● M/R programming
  – Goal: understand the internals of M/R. Understand the implications of production, how to balance 1k M/R jobs in a cluster (programming Java M/R)
    ● cs246, cs246h
● Individual Components
Hands on Labs
● Systems labs
  – How to create a Data Repository?
    ● HDFS Schemas
  – Zookeeper, coordination and distributed programming
  – YARN/Mesos examples
  – Spark/Storm

 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to HadoopAnandMHadoop
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Apache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxApache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxMiraj Godha
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 

Similaire à Training (20)

Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
HADOOP
HADOOPHADOOP
HADOOP
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Apache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxApache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptx
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Hadoop
HadoopHadoop
Hadoop
 
Anju
AnjuAnju
Anju
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 

Plus de Doug Chang

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkDoug Chang
 
Odersky week1 notes
Odersky week1 notesOdersky week1 notes
Odersky week1 notesDoug Chang
 
Spark Streaming Info
Spark Streaming InfoSpark Streaming Info
Spark Streaming InfoDoug Chang
 
Capital onehadoopclass
Capital onehadoopclassCapital onehadoopclass
Capital onehadoopclassDoug Chang
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintroDoug Chang
 
L'Oreal Tech Talk
L'Oreal Tech TalkL'Oreal Tech Talk
L'Oreal Tech TalkDoug Chang
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013Doug Chang
 
Bigtop june302013
Bigtop june302013Bigtop june302013
Bigtop june302013Doug Chang
 
Bigtop elancesmallrev1
Bigtop elancesmallrev1Bigtop elancesmallrev1
Bigtop elancesmallrev1Doug Chang
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC frameworkDoug Chang
 
Demographics andweblogtargeting
Demographics andweblogtargetingDemographics andweblogtargeting
Demographics andweblogtargetingDoug Chang
 

Plus de Doug Chang (12)

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Hapi
HapiHapi
Hapi
 
Odersky week1 notes
Odersky week1 notesOdersky week1 notes
Odersky week1 notes
 
Spark Streaming Info
Spark Streaming InfoSpark Streaming Info
Spark Streaming Info
 
Capital onehadoopclass
Capital onehadoopclassCapital onehadoopclass
Capital onehadoopclass
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
 
L'Oreal Tech Talk
L'Oreal Tech TalkL'Oreal Tech Talk
L'Oreal Tech Talk
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013
 
Bigtop june302013
Bigtop june302013Bigtop june302013
Bigtop june302013
 
Bigtop elancesmallrev1
Bigtop elancesmallrev1Bigtop elancesmallrev1
Bigtop elancesmallrev1
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC framework
 
Demographics andweblogtargeting
Demographics andweblogtargetingDemographics andweblogtargeting
Demographics andweblogtargeting
 

Dernier

Control-Plan-Training.pptx for the Automotive standard AIAG
Control-Plan-Training.pptx for the Automotive standard AIAGControl-Plan-Training.pptx for the Automotive standard AIAG
Control-Plan-Training.pptx for the Automotive standard AIAGVikrantPawar37
 
A Comprehensive Exploration of the Components and Parts Found in Diesel Engines
A Comprehensive Exploration of the Components and Parts Found in Diesel EnginesA Comprehensive Exploration of the Components and Parts Found in Diesel Engines
A Comprehensive Exploration of the Components and Parts Found in Diesel EnginesROJANE BERNAS, PhD.
 
-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hhmhamadhawlery16
 
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdfkushkruthik555
 
Human Resource Practices TATA MOTORS.pdf
Human Resource Practices TATA MOTORS.pdfHuman Resource Practices TATA MOTORS.pdf
Human Resource Practices TATA MOTORS.pdfAditiMishra247289
 
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILES
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILESABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILES
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILESsriharshaganjam1
 
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!Mint Automotive
 
Welcome to Auto Know University Orientation
Welcome to Auto Know University OrientationWelcome to Auto Know University Orientation
Welcome to Auto Know University Orientationxlr8sales
 
Mastering Mercedes Engine Care Top Tips for Rowlett, TX Residents
Mastering Mercedes Engine Care Top Tips for Rowlett, TX ResidentsMastering Mercedes Engine Care Top Tips for Rowlett, TX Residents
Mastering Mercedes Engine Care Top Tips for Rowlett, TX ResidentsRowlett Motorwerks
 
Lighting the Way Understanding Jaguar Car Check Engine Light Service
Lighting the Way Understanding Jaguar Car Check Engine Light ServiceLighting the Way Understanding Jaguar Car Check Engine Light Service
Lighting the Way Understanding Jaguar Car Check Engine Light ServiceImport Car Center
 
Building a Future Where Everyone Can Ride and Drive Electric by Bridget Gilmore
Building a Future Where Everyone Can Ride and Drive Electric by Bridget GilmoreBuilding a Future Where Everyone Can Ride and Drive Electric by Bridget Gilmore
Building a Future Where Everyone Can Ride and Drive Electric by Bridget GilmoreForth
 
Pros and cons of buying used fleet vehicles.pptx
Pros and cons of buying used fleet vehicles.pptxPros and cons of buying used fleet vehicles.pptx
Pros and cons of buying used fleet vehicles.pptxjennifermiller8137
 

Dernier (12)

Control-Plan-Training.pptx for the Automotive standard AIAG
Control-Plan-Training.pptx for the Automotive standard AIAGControl-Plan-Training.pptx for the Automotive standard AIAG
Control-Plan-Training.pptx for the Automotive standard AIAG
 
A Comprehensive Exploration of the Components and Parts Found in Diesel Engines
A Comprehensive Exploration of the Components and Parts Found in Diesel EnginesA Comprehensive Exploration of the Components and Parts Found in Diesel Engines
A Comprehensive Exploration of the Components and Parts Found in Diesel Engines
 
-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh
 
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
 
Human Resource Practices TATA MOTORS.pdf
Human Resource Practices TATA MOTORS.pdfHuman Resource Practices TATA MOTORS.pdf
Human Resource Practices TATA MOTORS.pdf
 
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILES
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILESABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILES
ABOUT REGENERATIVE BRAKING SYSTEM ON AUTOMOBILES
 
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!
Can't Roll Up Your Audi A4 Power Window Let's Uncover the Issue!
 
Welcome to Auto Know University Orientation
Welcome to Auto Know University OrientationWelcome to Auto Know University Orientation
Welcome to Auto Know University Orientation
 
Mastering Mercedes Engine Care Top Tips for Rowlett, TX Residents
Mastering Mercedes Engine Care Top Tips for Rowlett, TX ResidentsMastering Mercedes Engine Care Top Tips for Rowlett, TX Residents
Mastering Mercedes Engine Care Top Tips for Rowlett, TX Residents
 
Lighting the Way Understanding Jaguar Car Check Engine Light Service
Lighting the Way Understanding Jaguar Car Check Engine Light ServiceLighting the Way Understanding Jaguar Car Check Engine Light Service
Lighting the Way Understanding Jaguar Car Check Engine Light Service
 
Building a Future Where Everyone Can Ride and Drive Electric by Bridget Gilmore
Building a Future Where Everyone Can Ride and Drive Electric by Bridget GilmoreBuilding a Future Where Everyone Can Ride and Drive Electric by Bridget Gilmore
Building a Future Where Everyone Can Ride and Drive Electric by Bridget Gilmore
 
Pros and cons of buying used fleet vehicles.pptx
Pros and cons of buying used fleet vehicles.pptxPros and cons of buying used fleet vehicles.pptx
Pros and cons of buying used fleet vehicles.pptx
 

Training

  • 1. Hadoop Big Data Intro 2/16/2013 Hadoop/BigData Intro Provided agenda Addition:Theory from papers Addition: Demo/code samples Addition: System architecture Goal: develop some theory
  • 2. Agenda ● Introduction to Big Data ● Basic Concepts ● Hadoop Overview of Hadoop Working with HDFS / Map Reduce Architecture Anatomy of File write / read Admin and Development Introduce other components of Hadoop ecosystem
  • 3. Agenda (2) Hive / HBase / Pig / Sqoop Map Reduce Features - Architecture ● ● Working Job Execution We can cover this circa 2005 agenda in 3h w/some additions. Need hands on lab to understand the content.
  • 4. Big Data defn. ● Big data, too big to run SQL queries on ● Lots of data (cover Google approach which is what Hadoop is based on) Replacing Legacy Systems 10x Building Applications on Hadoop, Compet Gap Astyanax DevOps, Packaging, Chaos Monkey, AWS, Zookeeper Modifying the Hadoop Components, JIRA 3-4x
  • 5. Big Data Basic Concepts ● Storing large amounts of data and doing something with them – Some sort of analytics ● Easy: Tableau, Datameer ● Competitive Advantage – Small scale analytics: R, stats 202, DemographicsWeblog – Large scale analytics: ● cs246 ● Should be able to define analytics POCs based on the next slide, which are domain specific
  • 7. Big Data started in 2000, 2 design problems @Google, 1998-2000 There is a separate Big Data product for each use case. ● Google Design Problems/GFS: – Store internet pages on hard drives – Unstructured data ● Collect HTML and Links; images? ● 20+ billion web pages x 20KB = 400+ TB ● 1 computer reads 30-35 MB/sec from disk ● ~4 months to read the web ● ~1,000 hard drives to store the web ● Source: Jure Leskovec slides, cs246
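The figures on slide 7 can be rechecked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes); the class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the slide's numbers (decimal units assumed).
final class WebScaleEstimate {
    // Total bytes needed to store `pages` pages of `bytesPerPage` bytes each.
    static double totalBytes(double pages, double bytesPerPage) {
        return pages * bytesPerPage;
    }

    // Days one machine needs to read `bytes` sequentially at `bytesPerSecond`.
    static double daysToRead(double bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond / 86_400.0;
    }

    public static void main(String[] args) {
        double bytes = totalBytes(20e9, 20e3); // 20 billion pages x 20 KB
        System.out.printf("total size: %.0f TB%n", bytes / 1e12);
        System.out.printf("read at 35 MB/s: %.0f days%n", daysToRead(bytes, 35e6));
        System.out.printf("read at 30 MB/s: %.0f days%n", daysToRead(bytes, 30e6));
        System.out.printf("per drive over 1,000 drives: %.0f GB%n", bytes / 1000 / 1e9);
    }
}
```

At 30-35 MB/sec the single-machine read time lands between four and five months, which matches the slide's "~4 months" estimate.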
  • 8. Google M/R ● Once the data is on 1k machines... – How to run an algorithm over 1k disk drives? – Traditional method: read file into memory. Can't put webpages into memory & reading data would saturate network. ● Soln: Map Reduce. Move the code to the data via mappers and reducers, which are placed on the same computer as the data. GFS paper/MapReduce paper. Hadoop = GFS+M/R
  • 9. Google GFS ● The html/links/images were stored in BigTable. ● Store html pages into files. Many pages per file. Why? Seeks $, store crawl 2 parts: What is a file system? SB=Collection of inodes
  • 10. R/W in a file system ● Read the contents of foo.txt – Go to superblock, find location of datablocks from pointer in superblock for foo.txt and read them into memory ● Write into foo.txt – Go to superblock, write contents into new datablocks and append addresses of datablocks into superblock entry for foo.txt.
  • 11. Distribute file system across servers ● Superblocks=>GFS master=>Hadoop NameNode ● inodes=>chunkservers=>Hadoop DataNode
  • 12. R/W in distributed file system ● Read from HDFS foo.txt: – Go to namenode, find datablock where data is, read data into memory on client machine. What is the difference? ● Write into HDFS foo.txt: – Go to namenode, find empty block, tell client to send data to an empty block on the datanode, append the addresses of the new blocks into NN for foo.txt. What is the difference? ● Client, Network
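The read and write paths on slides 10 to 12 can be sketched as a toy in-memory model. This is an illustration only, not the HDFS API: `ToyDfs`, `append` and `read` are hypothetical names, and replication, the network and real block sizes are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: no replication, no network, no real HDFS classes.
final class ToyDfs {
    // "NameNode": file name -> ordered list of block ids (metadata only).
    private final Map<String, List<Integer>> nameNode = new HashMap<>();
    // "DataNodes": block id -> block contents (the data itself).
    private final Map<Integer, String> dataNodes = new HashMap<>();
    private int nextBlockId = 0;

    // Write: put the data in a new block, then append the block id to the file's entry.
    void append(String file, String data) {
        int blockId = nextBlockId++;
        dataNodes.put(blockId, data);
        nameNode.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
    }

    // Read: ask the "NameNode" where the blocks are, then fetch each block in order.
    String read(String file) {
        StringBuilder out = new StringBuilder();
        for (int blockId : nameNode.getOrDefault(file, Collections.<Integer>emptyList())) {
            out.append(dataNodes.get(blockId));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyDfs dfs = new ToyDfs();
        dfs.append("foo.txt", "hello ");
        dfs.append("foo.txt", "world");
        System.out.println(dfs.read("foo.txt"));
    }
}
```

The split between the two maps is the point of the slide: metadata lives in one place, data in another, and in a real cluster every lookup and block transfer crosses the network.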
  • 13. Hadoop HDFS, List of files in system, blocks file contents
  • 14. HDFS Demo ● List of files ● NN+DN website – http://<name_node_address>:50070/ – Where is the DN? Port:50075 ● Logs demo ● Running in single node PD mode – JVM processes are threads vs. separate JVM processes for each service. – Global vars in mappers good in PD not in cluster ● /etc/init.d. Do not download and install tar ball
  • 15. File R/W system issue ● ● Cache/Disk Drives Before writing from memory to disk power goes out. Lost data Write to Memory Write to Disk
  • 16. Failures ● Commodity servers fail. One server @G may stay up 3 years (1,000 days) ● If you have 1,000 servers, expect to lose 1/day ● With 1M machines, 1,000 machines fail every day! ● Google 3y vs else once 3w? Why? 20Servers? ● GFS paper/restart failed M/R tasks. Not in Hadoop ● Most system designs neglect failure except Netflix ChaosM
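The failure arithmetic on slide 16 follows from one division. A minimal sketch, assuming failures are spread evenly over each server's lifetime; the names are illustrative.

```java
// Expected failures per day if each server lasts `meanDaysUp` days on average
// and failures are spread evenly over time (a simplifying assumption).
final class FailureRate {
    static double expectedFailuresPerDay(long servers, double meanDaysUp) {
        return servers / meanDaysUp;
    }

    public static void main(String[] args) {
        System.out.println(expectedFailuresPerDay(1_000, 1_000.0));
        System.out.println(expectedFailuresPerDay(1_000_000, 1_000.0));
    }
}
```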
  • 17. What is Hadoop? ● An implementation of GFS/Map Reduce in Java. – Used at Yahoo, LinkedIn, Facebook, Netflix, Twitter ● What did each contribute? Use cases? – Doug Cutting (Cloudera)/Lucene – v1.0 vs v2.0 – Hadoop Components, HBase, Flume, Sqoop, Zookeeper, Oozie, Pig, Hive
  • 18. HDFS ● HDFS is a distributed file system. Hadoop Distributed File System – Unlimited capacity; to add more capacity, add more nodes – A file's SB info is stored in a NN server. Inodes or datablocks are stored in DN servers. – Replicate for data locality & error detection/recovery ● Replicate a data block 3x. Why? – HDFS: ● Append only file system (copy Google Paper)
  • 19. HDFS ● What is your file system on your laptop? – ● Append only or Random R/W? When is append only bad? – Digression:RMW. Editing a word document is what? Append only or RMW? – Design exercise: 200Gb in files. How many files are there? – Does this fit in memory?
  • 20. HDFS Design exercise ● Many files combined into smaller number of large files. How to access smaller files? – Slower to access for reads – If RMW; add modify into new blocks in HDFS. Find the new blocks and read them into memory is slower than sequential access on a single node file system – Faster to delete old file and create a new file with sequential blocks in place.
  • 21. Solns ● 1) on write, write to disk every time you write to memory – Why Good? – Why Bad? ● 2) lose the data when the power goes out – Why Good? – Why Bad? ● FSCK; File System Check Consistency
  • 22. Agenda: Admin and Development ● HDFS/MR Administration. HBase,etc. different – 24x7 SLA ● ● ● ● Hot standbys for maintenance HDFS:Recovery from User error, restore the file I just deleted HDFS/MR Recovery from failures, (not automated in Hadoop) MR lagging mapper, cascading failures
  • 23. Development ● Apache S/W development practices – Jenkins, Jira tickets – Repos
  • 24. HDFS Schemas ● Do you store 20B files on HDFS by file name? – – ● What happens with multiple files with same name? e.g. test.txt? Create metadata, partitions HDFS Schemas: – Avro – Parquet ● Dremel column store/encoding
  • 25. Map Reduce Intro(1) ● Map Reduce – Designed in 2000, when there was very little memory in commodity PCs, ~4GB or less. These aren't enterprise class servers. – This isn't the case today. MultiCPU/MultiCore 192gb machines are much more reliable with different use cases – M/R idiom is being replaced with non MR systems. – What we don't cover ● Google F1
  • 26. Map Reduce Intro ● There are 3 parts to how Map Reduce works: – Mapper – Shuffle – Reducer ● There are 3 parts to a Map Reduce program: – Mappers – Reducers – Driver ● These 2 concepts aren't the same. People get these mixed up.
  • 27. Map Reduce Part 1 ● 1k node cluster; bring the code to the data. Reduce network traffic ● Programming idiom ● Divide task into mappers. ● Examples of what can be divided and combined – Try dividing first, assume you can combine anything you can divide – Divide input file into single lines, send one line to each server, process each line
  • 28. Word Count ● I can count a text file of words with a single program. ● I can split the file across mappers and have the mappers count the words in parallel. Diagram: each FileLine feeds one Mapper (4 shown).
  • 29. Word Count ● The mappers output K/V pairs onto the network. These are not Java Strings or Java objects! – Keys: Comparable, Writable – Values: Writable ● Network saturates with multiple M/R jobs. Diagram: Network feeds 4 Reducers.
  • 30. Shuffle/Reduce Part 2/3 ● The K/V pairs are sent to the network. The K/V pairs are sent to certain destinations based on 2 rules: – 1) all values for a given K go to the same reducer – 2) each reducer receives its keys in sorted order ● Output in 2 forms, _SUCCESS and part-00000 ● Custom partitioner to send K to specific Reducer ● Grouping Comparator: group keys to reducer ● Sorting Comparator: can modify sort order for compound keys
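The three phases from slides 26 to 30 can be simulated in plain Java without a cluster. This is a minimal sketch, not Hadoop code: `WordCountSim` and its methods are invented for illustration, and the hash-modulo partitioning is an assumption standing in for whatever partitioner a real job uses.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle -> reduce for word count.
final class WordCountSim {
    // Map phase: one input line becomes a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle phase: every key is routed to exactly one reducer (rule 1),
    // and a TreeMap keeps each reducer's keys sorted (rule 2).
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> pairs, int reducers) {
        List<TreeMap<String, List<Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < reducers; i++) parts.add(new TreeMap<>());
        for (Map.Entry<String, Integer> kv : pairs) {
            int p = Math.floorMod(kv.getKey().hashCode(), reducers);
            parts.get(p).computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return parts;
    }

    // Reduce phase: sum all values seen for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Run the three phases and merge the per-reducer outputs into one map.
    static TreeMap<String, Integer> run(List<String> lines, int reducers) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        TreeMap<String, Integer> result = new TreeMap<>();
        for (TreeMap<String, List<Integer>> part : shuffle(pairs, reducers)) {
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                result.put(e.getKey(), reduce(e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"), 2));
    }
}
```

In a real job each reducer would write its own part file; merging into one map here is only for readability.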
  • 32. M/R Program/Mapper
        // word, one, caseSensitive, patternsToSkip and Counters are assumed to be
        // members of the enclosing Mapper class (not shown on the slide).
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();
          for (String pattern : patternsToSkip) {
            line = line.replaceAll(pattern, "");
          }
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
            reporter.incrCounter(Counters.INPUT_WORDS, 1);
          }
        }
  • 33. M/R Program Reducer
        class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
  • 34. M/R Program Driver
        public class WordCount extends Configured implements Tool
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
  • 35. AVRO/Protocol Buffers ● ● Avro is a serialization format for Mappers. Splittable and human readable. Not as small as PB.
  • 37. Protocol Buffers ● Used internally at Google, compact serialization https://code.google.com/p/protobuf/ ● “proto bufs”, not just serialization, closest to binary. Used internally in Hadoop.
  • 38. Why do we need Avro,Protobufs? Binary: no parser, fast, small. OK for objects maybe this is like Hibernate
  • 39. Thrift ● Add a server to send/receive objects and do the serialization/deserialization
  • 40. Map Reduce References ● What can I do with each text line? – Easy: ETL patterns: ● Match patterns ● Count number of occurrences of tokens ● Processing files – Harder: Machine Learning/DM ● What can't be easily done? ● K-means clustering ● Ullman book: Mining massive datasets: http://infolab.stanford.edu/~ullman/mmds.html ● Jimmy Lin book: http://www.umiacs.umd.edu/~jimmylin/
  • 41. MRv2 ● 2 versions of M/R – v1: old api, import xxx.mapred, JT/TT – v2: new api, import xxx.mapreduce, RM/NM/JH ● YARN, in Hadoop 2.x, maintains backward compatibility with M/R v1. – Devs start shifting to Hadoop 2.x YARN for new bug fixes
  • 43. YARN->Enterprise ● Encrypted/Pluggable Shuffle/Sort ● Httpfs rewrite or proxyserver ● V2 user authentication/permissions. – Apache Sentry ● ● ● ● Separate authorization policies per database/schema Users have to customize for shared data structures (tables/metadata,(hbase,search,zk). Not in any distro! Schema metadata needs fine grained auth. Web app proxy/part of RM to reduce attacks on exposed RM web server
  • 44. Map Reduce Demo ● Word Count demo – HDFS NameNode: http://localhost:50070/ – HDFS DataNode: http://localhost:50075/ – ResourceManager: http://localhost:8088 – JobHistory Server: http://jhs_host:19888 ● Logging mistakes – Logging added to M/R jobs grows in proportion to data size and the number of times the program runs. 1TB file means 1TB logs. Processing 100GB 10x – Logs fill up the disk and crash the system – Zookeeper logs
  • 45. M/R Pipelines ● The successful organizations never write direct Mappers/Reducers. They use higher level tools like Pig, Hive, etc. ● Defn: – Workflow: series of M/R jobs – Pipeline: output of one M/R job is the input to another ● Apache Crunch modeled after Google FlumeJava
  • 46. Google FlumeJava ● Introduction of data pipelines based on multiple M/R stages ● Define a parallel collection with a set of parallel operations ● Much easier to use than M/R programming. Contrast w/UDFs. Fewer lines of source:
  • 47. Apache Crunch ● Not just M/R – Faster to specify w/API a data processing pipeline you can customize instead of writing Pig/Hive scripts, MRPipelines – YARN, next version of M/R – Supports Apache Spark, SparkPipelines – Can keep in memory vs. spill to disk, MemPipelines
  • 48. Case Study of old systems ● ● Older generation of Hadoop Components, Hadoop, Pig, Hive. Gives insight to stability/capability of products
  • 49. Hive at LinkedIn (bottom left). All 3 similar
  • 50. LinkedIn ● Pig+DataFu ● Hive bottom left corner ● Teradata+Hadoop
  • 52. Yahoo Block Diagram, Pig, Hive, Spark, Storm
  • 53. Yahoo ● Targeting Content, not Search ● 3k Pig jobs in production ● Hive in small use for analysts, Pig in heavy production use. Non MR in use now. Matches Google's progression
  • 54. Mapper Failures ● What happens? Google's paper restarts failed tasks. NS ● Hadoop isn't auto recovery ● Hadoop Mapper/Reducer Worker Failure: – Reschedule on another worker – Speculative Execution – Completed ok, in progress reset (ADD FROM VIDEO) ● Master failure, abort and return fail to client
  • 55. M/R Runtime ● Balancing Cluster capacity – #m>>num nodes – #r<<#m – One HDFS chunk/mapper. Careful w/small files. Why? Won't just “run” Need admin
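The "careful w/small files" warning on slide 55 can be made concrete by counting map tasks. This is a rough sketch that assumes one mapper per HDFS block and a 128 MB block size; both are assumptions for illustration, as are the names, and real input formats can behave differently.

```java
// Rough map-task count, assuming one mapper per block and non-empty files.
final class SplitCount {
    static long mappers(long fileCount, long fileBytes, long blockBytes) {
        long blocksPerFile = (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
        return fileCount * blocksPerFile;
    }

    public static void main(String[] args) {
        long block = 128L << 20; // assumed 128 MB block size
        // Roughly the same amount of data, laid out two different ways:
        System.out.println("10,000 files of 1 MB: " + mappers(10_000, 1L << 20, block) + " mappers");
        System.out.println("1 file of 10 GB:      " + mappers(1, 10L << 30, block) + " mappers");
    }
}
```

Many small files mean many short-lived mappers, each paying startup cost for very little work, which is one reason slide 20 suggests combining small files into a smaller number of large ones.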
  • 56. Bad Design ● Combiners – Reduce network traffic. Google has special switches for network latency/throughput – job.setCombinerClass(IntSumReducer.class); – Combiner can execute 0, 1 or many times. Why? Combiner demo:
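The effect of a combiner on network traffic can be seen by counting the pairs a mapper would emit with and without local pre-aggregation. A minimal sketch in plain Java with invented names; it imitates the idea only and does not use the Hadoop combiner API.

```java
import java.util.Map;
import java.util.TreeMap;

// Compare what one mapper sends over the network with and without a combiner.
final class CombinerSim {
    // Without a combiner: one (word, 1) pair per token in the input split.
    static int pairsWithoutCombiner(String split) {
        String t = split.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // With a combiner: counts are pre-summed locally, one pair per distinct word.
    static Map<String, Integer> combine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        String split = "to be or not to be";
        System.out.println("pairs without combiner: " + pairsWithoutCombiner(split));
        System.out.println("pairs with combiner:    " + combine(split).size());
    }
}
```

Because addition is associative and commutative, the final totals are the same whether this local step runs zero, one or many times, which is why the same class can serve as both combiner and reducer for word count.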
  • 57. Greedy Scheduling ● Google Borg (not published) ● Mesos/YARN – Linux cGroups/containers – Allocate memory/CPU to each task – IO not implemented; Sync/Async
  • 58. Batch ETL/Apache ● Apache Pig – PigLatin for ETL – No metadata generated – Demo ● Pig Lipstick – For debugging Pig M/R DAGs – Demo
  • 59. Writing SQL queries in M/R ● Select * from /tmp/sqlqueries ● Select a ● What is the problem with implementing SQL queries in M/R? What do you get w/a db you don't get with SQL M/R?
  • 60. After GFS, M/R; Google Sawzall ● Contributions: – High level procedural language simpler than SQL operating on unstructured data – How to deal with performance problems with sparse data records? ● Protobufs (used in Hadoop). Dense serialization format to reduce network traffic/disk space – Multiple jobs, multiple users ● Workqueue (Apache Oozie)
  • 61. Apache Pig ● Paper, Chris Olston, Stanford/Yahoo Research ● Related to Google Sawzall ● Contributions: – PigLatin, like a unix pipe model – Cat a file, grep and count # of the word 'foo', sed/awk – all data are tuples – Write M/R jobs at a higher level than Java Mappers/Reducers – Write multistage M/R pipelines
  • 62. Pig example Select col1,col2 from table; data = LOAD 'table'; cols = FOREACH data GENERATE $0,$1; DUMP cols;
  • 63. Pig Join example ● Select t1.col2,t2.col2 from t1 join t2 on t1.col1=t2.col1 ● t1 = load 't1'; t2 = load 't2'; ● cols1 = foreach t1 generate $0,$1; ● cols2 = foreach t2 generate $0,$1; ● joined = JOIN cols1 by $0, cols2 by $0; ● cols = foreach joined generate $1,$3; ● Dump cols;
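What the join on slide 63 computes can be written as a small hash join in plain Java. This is a sketch of the semantics only, not of how Pig executes a join; rows are modelled as `String[]`, and `JoinSketch` is an invented name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Inner join of t1 and t2 on column 0, projecting (t1.col2, t2.col2).
final class JoinSketch {
    static List<String[]> join(List<String[]> t1, List<String[]> t2) {
        // Build side: index t1 rows by their join key (column 0).
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : t1) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe side: for each t2 row, emit one output row per matching t1 row.
        List<String[]> out = new ArrayList<>();
        for (String[] right : t2) {
            for (String[] left : index.getOrDefault(right[0], Collections.<String[]>emptyList())) {
                out.add(new String[] {left[1], right[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> t1 = Arrays.asList(new String[] {"1", "a"}, new String[] {"2", "b"});
        List<String[]> t2 = Arrays.asList(new String[] {"2", "x"}, new String[] {"3", "y"});
        for (String[] row : join(t1, t2)) System.out.println(String.join(",", row));
    }
}
```

The projected columns correspond to `$1` and `$3` of the joined relation: the second column of each input.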
  • 64. Pig example ● -------------------- # Map Reduce Plan #---------------------- ● digraph plan ● { compound=true; ● ● ● ● node [shape=rect]; s46641938_in [label="", style=invis, height=0, width=0]; s46641938_out [label="", style=invis, height=0, width=0]; subgraph cluster_46641938 {...
  • 65. Apache Pig Lipstick ● Debug tool for Pig jobs open sourced by Netflix
  • 66. SQL queries over GFS? ● Google Tenzing – Distributed Worker Pool – Modify M/R API ● Streaming M/R jobs – SQL92
  • 67. Apache Tez (WIP) ● ● Streaming of MR Job1 into MRJob2 like Tenzing Patches in Pig/Hive
  • 68. Apache Hive ● FB data warehouse paper – Introduce tables into HDFS (schema) – Requires DB to store metadata – HiveQL ● Solved problems – Easy for analysts to use, w/o writing MR jobs – Stored metadata unlike PIG – Supports user queries w/joins – Doesn't support UPDATE. Can't update a file in HDFS. Files are immutable.
  • 69. Hive QL ● Create table foo(id int) ● Create table foo(id int) location '/tmp/data/data.txt'. – Hive moves the data.txt file! Looks like Hive deleted it. ● Use external table; when dropped nothing happens. Non external table data is deleted after table dropped. ● We can parse in csv files, this is different than a database b/c we are dealing with unstructured data. ● Create table foo1(username string, map<String,int>) row format delimited fields terminated by '; ' Map is an aggregate type
  • 70. Schema on read vs Schema on write ● Data has to match schema for database. Process data then import into db. Everything has to match, columns, format, etc... ● Hadoop is schema on read; can create any schema. Doesn't drop a column not defined like in DB. ● Typically loading data into a database requires some clean up program to get all the data in the right form with the right number of columns with the right data ranges.
  • 71. Hive Serdes ● Use this to import in data without processing like in database
        CREATE TABLE access_log (
          remote_ip STRING, request_date STRING, method STRING, request STRING, protocol STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
          "input.regex" = "([^ ]) . . [([^]]+)] "([^ ]) ([^ ]) ([^ "])" *",
          "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
        )
        STORED AS TEXTFILE;
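The idea behind a regex SerDe, applying a pattern to each raw line at read time and mapping capture groups to columns, can be tried in plain Java. The pattern below is an assumption written for a common-log-style line; it is not the regex from the slide, which did not survive extraction intact, and `AccessLogParser` is an invented name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read-time parsing of one access-log line into five columns.
final class AccessLogParser {
    // Assumed layout: ip ident user [date] "method request protocol" ...
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\".*$");

    // Returns {remote_ip, request_date, method, request, protocol},
    // or null when the line does not match the assumed layout.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    public static void main(String[] args) {
        String sample =
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(String.join(" | ", parse(sample)));
    }
}
```

The raw file stays untouched on disk; only the reader decides how to slice it into columns, which is the schema-on-read point from slide 70.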
  • 72. Hive Impl ● External Hive tables are directories in HDFS. You can delete the files and the tables will be empty. ● Or you can add data into directories and have the tables grow ● Hive adds a schema to HDFS ● HiveServer2: – Security, multiple clients ● Hive+Tez, Hive-0.12+
  • 74. HiveQL ● Select * from table1; ● Select col1,col2 from table2; ● Writing data into Hive – Load DATA inpath '/user/dc/tmp' into table1; – Load data inpath '/user/dc/tmp' OVERWRITE into table1; (DELETEs first before writing)
  • 75. Google BigTable ● Paper ● Contributions – Chubby(metadata) – GQL; limited syntax compiled into M/R, control over data placement(locality MR perf) and format Data model: 1 table w/millions columns. Billions rows
  • 76. Apache HBase Bigtable Clone: Zookeeper, HDFS, AVRO, Thrift , REST. Data model: 1 table w/millions columns & Billions rows
  • 80. Apache HBase ● Schema design critical point – Schema design shows understanding of architecture & implementation to use case – Rows and Column families. Why? – Wibidata Apache KijiSchema
  • 81. HBase Client Design ● Do things in parallel then merge the results. Not JDBC ● Mistake:
        private void doMultipleClients(final Class<? extends Test> cmd) throws IOException {
          final List<Thread> threads = new ArrayList<Thread>(this.N);
          final int perClientRows = R/N;
          for (int i = 0; i < this.N; i++) {
            Thread t = new Thread (Integer.toString(i)) {
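One way to read "do things in parallel then merge the results" is a fixed thread pool with one task per key range. This is a minimal sketch under assumptions: `scanRange` is a hypothetical stand-in for a real per-range HBase read, the ceiling division is a choice made here, and the slide does not say which part of its fragment is the mistake.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Split the row space into ranges, scan them in parallel, then merge partial results.
final class ParallelScanSketch {
    // Hypothetical stand-in for a per-range read: counts the rows in [start, end).
    static long scanRange(int start, int end) {
        return end - start;
    }

    static long scanAll(int rows, int clients) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int perClient = (rows + clients - 1) / clients; // ceiling, so no rows are dropped
            for (int i = 0; i < clients; i++) {
                final int start = Math.min(rows, i * perClient);
                final int end = Math.min(rows, start + perClient);
                Callable<Long> task = () -> scanRange(start, end);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                try {
                    total += part.get(); // merge step: wait for each range and add it up
                } catch (Exception e) {
                    throw new IllegalStateException("range scan failed", e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(scanAll(1000, 7));
    }
}
```

A pool with futures keeps the merge explicit and bounds the number of threads, in contrast to creating and tracking raw `Thread` objects by hand.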
  • 83. Google Dremel ● End of M/R ● select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, '[0-9]*') AND wp_namespace = 0; ● 35B rows in 10s/~35GB. How? 2 tricks
  • 85. Cloudera Impala ● Google Dremel – Partition key and Parquet column store schema
  • 87. Apache Spark ● UC Berkeley BDAS, in CDH5, support here, statement from ULM changing – Tachyon/Mesos/Spark/Shark/MLBase
  • 88. Apache Storm ● Storm in HDP 2.X – Instead of multiple threads across multiple servers. – Sample code
  • 89. Hands On Labs ● Install Hadoop on Amazon EC2. – Goal: learn config/logs/how things run in HDFS & M/R ● HDFS hands on ● M/R hands on, no programming ● M/R programming – Goal: understand internals of M/R. Understand implications of production, how to balance 1k M/R jobs in a cluster (programming Java M/R) ● cs246, cs246h ● Individual Components
  • 90. Hands on Labs ● Systems labs – How to create a Data Repository? ● HDFS Schemas – Zookeeper, coordination and distributed programming – YARN/Mesos examples – Spark/Storm

Editor's notes

  1. When you create and delete files you are adding/removing inodes from the superblock. When you add contents to a file and save it like adding text in word you are adding data blocks to an inode.
  2. Write: write into the write-ahead log and the in-memory store, then persist to disk files. Small disk files have to be merged into larger files to reduce search time for reads. Rowkeys are sorted and split into regions. Sequential design vs. random key: avoid hotspots limiting cluster throughput. HBase regions are autosplit when they get too big on writes.
  3. Distributed tree over cluster. log(n)