FIWARE Big Data ecosystem : Cosmos
Joaquin Salvachua
Andres Muñoz
Universidad Politécnica de Madrid
Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
www.slideshare.net/jsalvachua
FIWARE	Architecture
1
Cosmos	Ecosystem
• The batch part has been widely developed through the adoption and/or the in-house creation
of the following tools:
• A Hadoop As A Service (HAAS) engine, either the ''official'' one based on OpenStack's Sahara,
or the light version based on a shared Hadoop cluster.
• Cosmos GUI, for Cosmos account management.
• Cosmos Auth, managing OAuth2-based authentication tokens for the Cosmos REST APIs.
• Cosmos Proxy, enforcing OAuth2-based authentication and authorization for the Cosmos REST APIs (a minimal request sketch follows after this list).
• Cosmos Hive Auth Provider, a custom auth provider for HiveServer2 based on OAuth2.
• Tidoop API, for MapReduce job submission and management.
• Cosmos Admin, a set of admin scripts.
• The batch part is enriched with the following tools, which make up the so-called Cosmos
Ecosystem: Cygnus, a tool for connecting the Orion Context Broker with several data storages, HDFS
and STH (see below) included, in order to create context data historics.
• STH Comet, a MongoDB context data storage for short-term historics.
2
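For illustration, this is a minimal sketch (not official Cosmos client code) of how an application could call a Cosmos REST API such as WebHDFS once an OAuth2 token has been obtained from Cosmos Auth; the endpoint, the path and the X-Auth-Token header name are assumptions, so check them against your Cosmos deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CosmosProxyRequest {

    public static void main(String[] args) throws Exception {
        String token = "my-oauth2-token"; // hypothetical token issued by Cosmos Auth
        URL url = new URL(
            "http://cosmos.example.org:14000/webhdfs/v1/user/myuser?op=liststatus");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Cosmos Proxy is expected to validate this token before forwarding the request
        conn.setRequestProperty("X-Auth-Token", token);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}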
3
Big	Data:
What	is	it	and	how	much	data	is	there
What	is	big	data?
4
> small data
What	is	big	data?
5
> big data
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg
How	much	data	is	there?
6
Data growth forecast
7
(Chart data, Cisco forecast, 2012 → 2017:)
Global users (billions):              2.3 → 3.6
Global networked devices (billions):  12 → 19
Global broadband speed (Mbps):        11.3 → 39
Global traffic (zettabytes):          0.5 → 1.4
http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast
8
How	to	deal	with	it:
Distributed	storage	and	computing
What happens if one shelving unit is not enough?
9
You buy more shelves…
… and you create an index
10
“The Avengers”, 1-100, shelf 1
“The Avengers”, 101-125, shelf 2
“Superman”, 1-50, shelf 2
“X-Men”, 1-100, shelf 3
“X-Men”, 101-200, shelf 4
“X-Men”, 201-225, shelf 5
(Illustration: the indexed volumes of The Avengers, Superman and X-Men spread across shelves 1 to 5.)
What about distributed computing?
11
12
Distributed	storage:
The	Hadoop reference
Hadoop Distributed File System (HDFS)
13
• Based on Google File System
• Large files are stored across multiple machines (Datanodes) by
splitting them into blocks that are distributed
• Metadata is managed by the Namenode
• Scalable by simply adding more Datanodes
• Fault-tolerant, since HDFS replicates each block (3 replicas by default)
• Security based on authentication (Kerberos) and authorization
(permissions, HACLs)
• It is managed like a Unix-like file system, and can also be accessed programmatically (see the Java sketch below)
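Besides the shell and REST interfaces described next, HDFS can also be reached from application code through Hadoop's Java FileSystem API. A minimal read sketch follows; the Namenode address and the file path are illustrative assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical Namenode address; adjust to the target cluster
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:8020");

        // the FileSystem abstraction hides the block and replica handling done by HDFS
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/user/myuser/somefile.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}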
HDFS architecture
14
(Diagram: a Namenode managing the metadata for Datanode0, Datanode1 … DatanodeN, spread over Rack 1 and Rack 2, with blocks 1, 2 and 3 replicated across the Datanodes.)
Namenode metadata:
Path                     Replicas   Block IDs
/user/user1/data/file0   2          1,3
/user/user1/data/file1   3          2,4,5
…                        …          …
15
Managing	HDFS:
File	System	Shell
HTTP	REST	API
File System Shell
16
• The File System Shell includes various shell-like commands that directly interact with
HDFS
• The FS shell is invoked by any of these scripts:
– bin/hadoop fs
– bin/hdfs dfs
• All FS Shell commands take URI paths as arguments:
– scheme://authority/path
– Available schemes: file (local FS), hdfs (HDFS)
– If no scheme is specified, hdfs is assumed
• It is necessary to connect to the cluster via SSH
• Full commands reference
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
File System Shell examples
17
$	hadoop fs -cat webinar/abriefhistoryoftime_page1
CHAPTER	1
OUR	PICTURE	OF	THE	UNIVERSE
A	well-known scientist (some say it was Bertrand	Russell)	once	gave a	public lecture on astronomy.	He	
described how the earth orbits around the sun and	how the sun,	in	turn,	orbits around the center	of	a	vast
$	hadoop fs -mkdir webinar/afolder
$	hadoop fs -ls webinar
Found 4	items
-rw-r--r-- 3	frb cosmos							3431	2014-12-10	14:00	/user/frb/webinar/abriefhistoryoftime_page1
-rw-r--r-- 3	frb cosmos							1604	2014-12-10	14:00	/user/frb/webinar/abriefhistoryoftime_page2
-rw-r--r-- 3	frb cosmos							5257	2014-12-10	14:00	/user/frb/webinar/abriefhistoryoftime_page3
drwxr-xr-x			- frb cosmos										0	2015-03-10	11:09	/user/frb/webinar/afolder
$	hadoop fs -rmr webinar/afolder
Deleted hdfs://cosmosmaster-gi/user/frb/webinar/afolder
HTTP REST API
18
• The HTTP REST API supports the complete File System
interface for HDFS
• It relies on the webhdfs scheme for URIs
– webhdfs://<HOST>:<HTTP_PORT>/<PATH>
• HTTP URLs are built as:
– http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=…
• Full API specification
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
HTTP REST API examples
19
$	curl –X	GET	“http://cosmos.lab.fi-
ware.org:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb”
CHAPTER	1
OUR	PICTURE	OF	THE	UNIVERSE
A	well-known scientist (some say it was Bertrand	Russell)	once	gave a	public lecture on astronomy.	He	described how the
earth orbits around the sun and	how the sun,	in	turn,	orbits around the center	of	a	vast
$	curl -X	PUT	"http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}
$	curl –X	GET	"http://cosmos.lab.fi-ware.org:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[{"pathSuffix":"abriefhistoryoftime_page1","type":"FILE","length":3431,"owner":"frb","group
":"cosmos","permission":"644","accessTime":1425995831489,"modificationTime":1418216412441,"blockSize":67108864
,"replication":3},{"pathSuffix":"abriefhistoryoftime_page2","type":"FILE","length":1604,"owner":"frb","group":"cosmos","
permission":"644","accessTime":1418216412460,"modificationTime":1418216412500,"blockSize":67108864,"replication
":3},{"pathSuffix":"abriefhistoryoftime_page3","type":"FILE","length":5257,"owner":"frb","group":"cosmos","permission"
:"644","accessTime":1418216412515,"modificationTime":1418216412551,"blockSize":67108864,"replication":3},{"pathS
uffix":"afolder","type":"DIRECTORY","length":0,"owner":"frb","group":"cosmos","permission":"755","accessTime":0,"mo
dificationTime":1425995941361,"blockSize":0,"replication":0}]}}
$	curl -X	DELETE	"http://cosmos.lab.fi-
ware.org:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}
20
Distributed	computing:
The	Hadoop reference
Hadoop was	created	by	Doug	
Cutting	at	Yahoo!...
21
… based on the MapReduce patent by Google
Well,	MapReduce was	really	
invented	by	Julius	Caesar
22
Divide et
impera*
* Divide and conquer
An example
23-27
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
(Illustration, animated across slides 23 to 27: several Mappers read books from the library in parallel, e.g. LATIN REF1 P45, GREEK REF2 P128, EGYPT REF3 P12, LATIN REF4 P73, LATIN REF5 P34, EGYPT REF6 P10, GREEK REF7 P20, GREEK REF8 P230, and emit the page count of every Latin book they find: "LATIN pages 45", "LATIN pages 73", "LATIN pages 34". The single Reducer accumulates the emitted values, 45 (ref 1) + 73 (ref 4) + 34 (ref 5), and once all the Mappers are idle it outputs the TOTAL: 152.)
MapReduce applications
28
• MapReduce applications	are	commonly	written	in	Java
– Can	be	written	in	other	languages	through	Hadoop Streaming
• They	are	executed	in	the	command	line
$ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>
• A MapReduce job consists of:
– A driver, a piece of software where inputs, outputs, formats, etc. are defined, and which acts as the entry point for launching the job
– A set of Mappers, given by a piece of software defining their behaviour
– A set of Reducers, given by a piece of software defining their behaviour
• There are 2 APIs
– org.apache.hadoop.mapred → the old one
– org.apache.hadoop.mapreduce → the new one
• Hadoop is	distributed	with	MapReduce examples
– [HADOOP_HOME]/hadoop-examples.jar
Implementing the example
29
• The	input	will	be	a	single	big	file	containing:
symbolae botanicae,latin,230
mathematica,greek,95
physica,greek,109
ptolomaics,egyptian,120
terra,latin,541
iustitia est vincit,latin,134
• The	mappers	will	receive	pieces	of	the	above	file,	which	will	be	read	line	by	
line
– Each	line	will	be	represented	by	a	(key,value)	pair,	i.e.	the	offset	on	the	file	and	the	real	
data	within	the	line,	respectively
– For	each	input	pair	a	(key,value)	pair	will	be	output,	i.e.	a	common	“num_pages”	key	and	
the	third	field	in	the	line
• The	reducers	will	receive	arrays	of	pairs	produced	by	the	mappers,	all	
having	the	same	key	(“num_pages”)
– For	each	array	of	pairs,	the	sum	of	the	values	will	be	output	as	a	(key,value)	pair,	in	this	
case	a	“total_pages”	key	and	the	sum	as	value
Implementing the example: JCMapper.class
30
public static class JCMapper extends
    Mapper<Object, Text, Text, IntWritable> {

  private final Text globalKey = new Text("num_pages");
  private final IntWritable bookPages = new IntWritable();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // value holds one CSV line: title,language,num_pages
    String[] fields = value.toString().split(",");
    System.out.println("Processing " + fields[0]);

    if (fields[1].equals("latin")) {
      // the page count arrives as text and must be parsed into an int
      bookPages.set(Integer.parseInt(fields[2]));
      context.write(globalKey, bookPages);
    } // if
  } // map
} // JCMapper
Implementing the example: JCReducer.class
31
public static class JCReducer extends
    Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable totalPages = new IntWritable();

  @Override
  public void reduce(Text globalKey, Iterable<IntWritable> bookPages,
      Context context) throws IOException, InterruptedException {
    int sum = 0;

    // add up the page counts reported for the "num_pages" key
    for (IntWritable val : bookPages) {
      sum += val.get();
    } // for

    totalPages.set(sum);
    context.write(globalKey, totalPages);
  } // reduce
} // JCReducer
Implementing the example: JC.class
32
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new Configuration(), new JC(), args);
  System.exit(res);
} // main

@Override
public int run(String[] args) throws Exception {
  Configuration conf = this.getConf();
  Job job = Job.getInstance(conf, "julius caesar");
  job.setJarByClass(JC.class);
  job.setMapperClass(JCMapper.class);
  // the reducer can also act as a combiner, since it just sums values
  job.setCombinerClass(JCReducer.class);
  job.setReducerClass(JCReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job.waitForCompletion(true) ? 0 : 1;
} // run
Non HDFS inputs for MapReduce
33
• Certain Hadoop extensions are currently being developed under the codename
"Tidoop" (Telefónica Investigación y Desarrollo + Hadoop :)
– https://github.com/telefonicaid/fiware-tidoop
• These extensions allow using generic non-HDFS data located at:
– CKAN (beta available)
– MongoDB-based Short-Term Historic (roadmap)
• Other specific public repos UNDER STUDY… suggestions accepted :)
Non HDFS inputs for MapReduce: driver
34
@Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "ckan mr");
job.setJarByClass(CKANMapReduceExample.class);
job.setMapperClass(CKANMapper.class);
job.setCombinerClass(CKANReducer.class);
job.setReducerClass(CKANReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// start of the relevant part
job.setInputFormatClass(CKANInputFormat.class);
CKANInputFormat.addCKANInput(job, ckanInputs);
CKANInputFormat.setCKANEnvironmnet(job, ckanHost, ckanPort,
sslEnabled, ckanAPIKey);
CKANInputFormat.setCKANSplitsLength(job, splitsLength);
// end of the relevant part
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
} // run
35
Querying	tools
Hive
Querying tools
36
• The MapReduce paradigm may be hard to understand and, what is worse, to use
• Indeed, many data analysts just need to query the data
– If possible, by using already well-known languages
• For that reason, some querying tools appeared in the Hadoop
ecosystem
– Hive and its HiveQL language → quite similar to SQL
– Pig and its Pig Latin language → a new language
Hive and HiveQL
37
• HiveQL reference
– https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• All	the	data	is	loaded	into	Hive	tables
– Not	real	tables	(they	don’t	contain	the	real	data)	but	metadata	
pointing	to	the	real	data	at	HDFS
• The best thing is that Hive uses pre-defined MapReduce jobs
behind the scenes!
• Column	selection
• Fields	grouping
• Table	joining
• Values	filtering
• …
• Important remark: since MapReduce is used by Hive, the
queries may take some time to produce a result
Hive CLI
38
$ hive
hive history
file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive>select column1,column2,otherColumns from mytable
where column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL =
http://cosmosmaster-
gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -
Dmapred.job.tracker=cosmosmaster-gi:8021 -kill
job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
Hive Java API
39
• Hive	CLI	is	OK	for	human-driven	testing	purposes
• But	it	is	not	usable	by	remote	applications
• Hive	has	no	REST	API
• Hive	has	several	drivers	and	libraries
• JDBC	for	Java
• Python
• PHP
• ODBC	for	C/C++
• Thrift	for	Java	and	C++
• https://cwiki.apache.org/confluence/display/Hive/HiveClient
• A	remote	Hive	client	usually	performs:
• A	connection to	the	Hive	server	(TCP/10000)
• The	query	execution
Hive Java API: get a connection
40
private static Connection getConnection(String ip, String port,
    String user, String password) {
  try {
    // load the Hive JDBC driver
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  } catch (ClassNotFoundException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch

  try {
    return DriverManager.getConnection("jdbc:hive://" + ip
        + ":" + port + "/default?user=" + user + "&password="
        + password);
  } catch (SQLException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch
} // getConnection
Hive Java API: do the query
41
private static void doQuery() {
  try {
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery(
        "select column1,column2,"
        + "otherColumns from mytable where "
        + "column1='whatever' and "
        + "columns2 like '%whatever%'");

    while (res.next()) {
      String column1 = res.getString(1);
      int column2 = res.getInt(2);
    } // while

    res.close(); stmt.close(); con.close();
  } catch (SQLException e) {
    System.exit(0);
  } // try catch
} // doQuery
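A minimal sketch of how the two snippets above could be wired together, assuming getConnection() and doQuery() are members of the same class and that the host, port and credentials below are placeholder values:

import java.sql.Connection;

public class HiveClientExample {

    // shared connection used by doQuery(), as in the snippet above
    private static Connection con;

    public static void main(String[] args) {
        // placeholder endpoint and credentials; the Hive server listens on TCP/10000
        con = getConnection("cosmos.lab.fi-ware.org", "10000", "myuser", "mypassword");

        if (con != null) {
            doQuery();
        } // if
    } // main

    // ... getConnection() and doQuery() as defined in the previous slides ...
} // HiveClientExample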
Hive tables creation
42
• Both locally, using the CLI, and remotely, using the Java API, tables are created with this
command:
hive> create external table...
• CSV-like HDFS files
hive> create external table <table_name> (<field1_name>
<field1_type>, ..., <fieldN_name> <fieldN_type>) row
format delimited fields terminated by '<separator>'
location '/user/<username>/<path>/<to>/<the>/<data>';
• JSON-like HDFS files
hive> create external table <table_name> (<field1_name>
<field1_type>, ..., <fieldN_name> <fieldN_type>) row
format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/user/<username>/<path>/<to>/<the>/<data>';
• Cygnus automatically creates the Hive tables for the entities it is notified about!
Thank you!
http://fiware.org
Follow @FIWARE on Twitter
FIWARE Big Data ecosystem : Cygnus
Joaquin Salvachua
Universidad Politécnica de Madrid
Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
www.slideshare.net/jsalvachua