Introduction to
Big Data technology
Radosław Stankiewicz
Hacker | Big Data Nerd | Entrepreneur | Trainer
3
Src: computing.co.uk , https://www.flickr.com/photos/barron/15483113 , tech.co
Agenda
Introduction -> Map Reduce -> Pig -> Hive -> Ambari
4
Introduction
5
Volume
6
Variety
7
A|123|10$
B|555|20$
Y|333|15$
{
'typ'='A',
'id'=123,
'kwota'='10$'
}
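A minimal Python sketch of handling this variety, assuming the two sample records above describe the same fields (typ, id, kwota); the field names and formats are illustrative only:

import json

def parse_delimited(line):
    # "A|123|10$" -> {'typ': 'A', 'id': 123, 'kwota': '10$'}
    typ, rec_id, kwota = line.strip().split('|')
    return {'typ': typ, 'id': int(rec_id), 'kwota': kwota}

def parse_json(doc):
    # the same record arriving as a JSON document
    record = json.loads(doc)
    record['id'] = int(record['id'])
    return record

print(parse_delimited('A|123|10$'))
print(parse_json('{"typ": "A", "id": 123, "kwota": "10$"}'))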
Velocity
OLAP
Real Time
Batch
Streaming
Interactive analytics
8
Value
10
Classifying the problem
• A database of Warsaw streets, data in JSON format, optimizing waste collection for one of the service providers
• Events from a transactional database and credit cards for better fraud detection
• A system that finds good car offers across many sites - web crawling, data parsing, analysis of car price trends
• A central repository of contract scans, TBs of data, several hundred new documents arriving every day
11
Origins
• too much data
• server failures
• slow relational databases
12
13
14
Architecture
15 source: Hortonworks
The Hadoop ecosystem
16 source: Hortonworks
17
HDFS - Namenode, Datanode
18
● User Commands
o dfs
o fsck
● Administration Commands
o datanode
o dfsadmin
o namenode
dfs:
appendToFile cat chgrp chmod chown copyFromLocal copyToLocal count cp du
dus expunge get getfacl getfattr getmerge ls lsr mkdir moveFromLocal
moveToLocal mv put rm rmr setfacl setfattr setrep stat tail test text touchz
hdfs dfs -put localfile1 localfile2 /user/tmp/hadoopdir
hdfs dfs -getmerge /user/hadoop/output/ localfile
commands
19
HDFS - permissions
• almost POSIX
• Users, Groups
• chmod, chgrp, chown
• ACLs
• getfacl, setfacl (examples below)
• permission checking can be switched off
• additionally:
• Apache Knox
• Apache Ranger
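Example invocations of the commands listed above (the path, user and group names are hypothetical):

hdfs dfs -chmod 750 /user/projects/data
hdfs dfs -chown alice:analysts /user/projects/data
hdfs dfs -setfacl -m user:bob:r-x /user/projects/data
hdfs dfs -getfacl /user/projects/data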
YARN architecture
21
Map Reduce Framework
22
Map Reduce Framework
23
[diagram: map tasks (M) feeding reduce tasks (R)]
Mapper
#!/usr/bin/env python
import sys

# read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print '%s\t%s' % (word, 1)

line = "Ala ma kota"
Ala 1
ma 1
kota 1
24
Reducer
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input arrives sorted by key, so counts for the same word are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s,%s' % (current_word, current_count)
        current_count = count
        current_word = word

# flush the last group
if current_word == word:
    print '%s,%s' % (current_word, current_count)
ala 1
ala 1
bela 1
dela 1
ala,2
bela,1
dela,1
25
Running a streaming job
cat input.txt | ./mapper.py | sort | ./reducer.py

bin/yarn jar [..]/hadoop-*streaming*.jar \
    -file mapper.py -mapper ./mapper.py \
    -file reducer.py -reducer ./reducer.py \
    -input /tmp/wordcount/input -output /tmp/wordcount/output
26
Map Reduce in Java
(input) <k1, v1> -> map -> <k2, v2> -> combine ->
<k2, v2> -> reduce -> <k3, v3> (output)
1) Mapper
2) Reducer
3) run
public class WordCount extends Configured
implements Tool {
public static class TokenizerMapper{...}
public static class IntSumReducer{...}
public int run(...){...}
}
27
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

public static class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }

  public void setup(...) {...}
  public void cleanup(...) {...}
  public void run(...) {...}
}

value = "Ala ma kota"
Ala,1
ma,1
kota,1
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }

  public void setup(...) {...}
  public void cleanup(...) {...}
  public void run(...) {...}
}
kota,(1,1,1,1)
kota,4
Main

public int run(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(res);
}

yarn jar wc.jar WordCount /tmp/wordcount/input /tmp/wordcount/output
What's next?
• Map Reduce in Java
• Testing with MRUnit
• Joins (see the streaming sketch below)
• Avro
• Custom Key, Value
• Chaining multiple jobs
• Custom Input, Output
31
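As a teaser for the "Joins" topic, a minimal sketch of a reduce-side join with Hadoop Streaming, assuming two hypothetical comma-separated inputs keyed by id (users: "id,name", orders: "id,amount"); the mapper tags every record with its source file, the reducer crosses the two groups per key:

#!/usr/bin/env python
# join_mapper.py - tag every record with the file it came from
import os
import sys

# Hadoop Streaming exposes the current input file as an environment variable
# (mapreduce_map_input_file, or map_input_file on older releases)
source = os.environ.get('mapreduce_map_input_file', os.environ.get('map_input_file', ''))
tag = 'U' if 'users' in source else 'O'
for line in sys.stdin:
    key, rest = line.strip().split(',', 1)
    print('%s\t%s,%s' % (key, tag, rest))

#!/usr/bin/env python
# join_reducer.py - cross the user and order records that share a key
import sys

current_key, users, orders = None, [], []

def emit(key, users, orders):
    for u in users:
        for o in orders:
            print('%s,%s,%s' % (key, u, o))

for line in sys.stdin:
    key, value = line.strip().split('\t', 1)
    tag, rest = value.split(',', 1)
    if key != current_key:
        if current_key is not None:
            emit(current_key, users, orders)
        current_key, users, orders = key, [], []
    (users if tag == 'U' else orders).append(rest)

if current_key is not None:
    emit(current_key, users, orders)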
Workshop
https://notehub.org/jkrqs
Introduction to data processing
with Pig
33
Pig architecture
34
Is it worth it?
The top 5 pages visited by users aged 18 to 25
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and
// store it accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma);
String key = line.substring(firstComma, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable,
Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable,
Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users
18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Pig architecture
40
Execution mode
Interactive or Batch
41
Execution mode
Local or Distributed
42
Execution mode
Map Reduce or Tez
43
Data types
44
int long float double
chararray datetime boolean
bytearray biginteger bigdecimal
Complex types
45
tuple bag map
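A short Pig Latin sketch of declaring these complex types in a load schema (the file name and field layout are hypothetical):

A = LOAD 'data' AS (t:tuple(a:int, b:chararray), bg:bag{row:tuple(x:int)}, m:map[]);
DESCRIBE A;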
Pig Latin basics -
case sensitivity
• A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
  B = GROUP A BY f1;
  C = FOREACH B GENERATE COUNT ($0);
  DUMP C;
• The relation names A, B, and C (so-called aliases) are case sensitive.
• Case also matters for:
• the field names f1, f2, and f3
• the relation names A, B, C
• the function names PigStorage, COUNT
• Except for the keywords: LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP
46
assert, and, any, all, arrange, as, asc, AVG, bag,
BinStorage, by, bytearray, BIGINTEGER, BIGDECIMAL,
cache, CASE, cat, cd, chararray, cogroup, CONCAT,
copyFromLocal, copyToLocal, COUNT, cp, cross,
datetime, %declare, %default, define, dense, desc,
describe, DIFF, distinct, double, du, dump, e, E,
eval, exec, explain, f, F, filter, flatten, float,
foreach, full, generate, group, help, if, illustrate,
import, inner, input, int, into, is, join, kill, l, L,
left, limit, load, long, ls, map, matches, MAX, MIN,
mkdir, mv, not, null, onschema, or, order, outer,
output, parallel, pig, PigDump, PigStorage, pwd, quit,
register, returns, right, rm, rmf, rollup, run,
sample, set, ship, SIZE, split, stderr, stdin, stdout,
store, stream, SUM, TextLoader, TOKENIZE, through,
tuple, union, using, void
47
Keywords
First steps
data = LOAD 'input' AS (query:CHARARRAY);
A = LOAD 'data' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
STORE A INTO '/tmp/result' USING PigStorage(';');
48
First steps
SAMPLE
DESCRIBE
DUMP
EXPLAIN
ILLUSTRATE
49
Next steps - operating on
data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
50
Next steps - operating on
data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
51
Next steps - operating on
data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
D = FOREACH C GENERATE name, scholarship*semestre as funds;
52
Next steps - operating on
data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age;
53
Next steps - operating on
data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age;
F = FOREACH E GENERATE group as age, AVG(A.scholarship);
54
Performance
Tez, projections, filtering, joins (see the sketch below)
55
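A small Pig Latin sketch of the projection and filtering advice, assuming a hypothetical 'logs' file: keeping only the needed columns and rows before grouping or joining reduces the data shuffled between stages, and running with Tez (pig -x tez) removes much of the per-job MR overhead.

A = LOAD 'logs' USING PigStorage('\t') AS (user:chararray, url:chararray, ts:long, payload:chararray);
B = FOREACH A GENERATE user, url;      -- project early: drop unused columns
C = FILTER B BY url IS NOT NULL;       -- filter early: drop unneeded rows
D = GROUP C BY url;
E = FOREACH D GENERATE group AS url, COUNT(C) AS hits;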
What's next?
• Pig ships with a whole lot of built-in functions
• UDFs
• Facebook DataFu, PiggyBank
• write your own (Java, JVM languages, Python, Ruby etc.)
• unit testing with PigUnit
56
Workshop
57
https://notehub.org/k2glz
Introduction to data analysis
with Hive
58
Architecture
59
Unique features of Hive
SQL queries over flat files, e.g. CSV (see the sketch below)
60
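A minimal HiveQL sketch of querying a flat CSV file; the table layout and location are hypothetical:

CREATE EXTERNAL TABLE sales (id INT, product STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales';

SELECT product, SUM(amount) AS total
FROM sales
GROUP BY product;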
Unique features of Hive
Much faster analysis - no need to write Map Reduce
Optimization, part of the operations executed in memory instead of as MR
61
Unique features of Hive
Nearly unlimited integration options - MongoDB, Elasticsearch,
HBase
62
Unique features of Hive
Integration of BI and DWH tools with Hive through JDBC (see the sketch below)
63
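A minimal Java sketch of that JDBC integration, assuming HiveServer2 listens on localhost:10000 and reusing the tablename1 table created later in this deck; host, port and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver and connection URL
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT foo, bar FROM tablename1 LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}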
Hive CLI
Interactive mode:
hive
Batch mode:
hive -e 'select foo from bar'
hive -f '/path/to/my/script.q'
hive -f 'hdfs://namenode:port/path/to/my/script.q'
more options: hive --help
64
Data types
INT, TINYINT, SMALLINT, BIGINT
BOOLEAN
DECIMAL
FLOAT, DOUBLE
STRING
BINARY
TIMESTAMP
ARRAY, MAP, STRUCT, UNION
DATE
CHAR
VARCHAR
65
Query syntax
SELECT, INSERT, UPDATE
GROUP BY
UNION
LEFT, RIGHT, INNER, FULL OUTER JOIN
OVER, RANK
(NOT) IN, HAVING
(NOT) EXISTS
66
Data Definition Language
• CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
• DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
• TRUNCATE TABLE
• ALTER DATABASE/SCHEMA, TABLE, VIEW
• MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
• SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES,
PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
• DESCRIBE DATABASE/SCHEMA, table_name, view_name
67
Tables
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
68
First steps in Hive
CREATE TABLE tablename1 (foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename1;
INSERT INTO TABLE tablename1 PARTITION (ds='2014') select_statement1 FROM
from_statement;
69
Other file formats? SerDe
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
70
Other file formats? SerDe
CREATE TABLE table (
foo STRING, bar STRING)
STORED AS TEXTFILE; ← or SEQUENCEFILE, ORC, AVRO or PARQUET
71
Advantages, drawbacks, comparison
Hive                          Pig
declarative                   procedural
temporary tables              pipelines
rely on the optimizer         more control over the implementation
UDF, Transform                UDF, streaming
SQL drivers                   data pipeline splits
72
Stinger
http://hortonworks.com/labs/stinger/
73
Tips & Tricks
hive.vectorized.execution.enabled=true
ORC
hive.execution.engine=tez
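A minimal sketch of applying these settings in a Hive session and converting an existing table to ORC (reusing the page_view table from the earlier slide; the target table name is hypothetical):

SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;

CREATE TABLE page_view_orc STORED AS ORC
AS SELECT * FROM page_view;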
John Lund Stone Getty Images
74
What's next?
• Integrations with Solr, Elasticsearch, MongoDB, HBase
• UDF
• multi table inserts
• JDBC
75
Workshop
source: HikingArtist
76
https://notehub.org/hz5od
Now what?
78
Spark
Want to know more?
Our trainings allow for individual work with
every participant
• we work in groups of 4-8 people
• the program can be tailored to the group's
expectations
• we address and answer each participant's
individual questions
• we have much more time :)
A training tailored to you
Interested in the workshop topics?
Have a look at the training programs:
Designing Big Data solutions with Apache
Hadoop & Family
Text and natural language data analysis
Big Data processing with Apache Spark
Foundations of machine learning in Python
Supported by
Workshop 2
https://notehub.org/njwv7
Sources
• HikingArtist.com - drawings
• hortonworks.com - HDP architecture
• apache.org - Pig, Hive and Hadoop graphics
