Hadoop and IoT Sinergija 2014

Hadoop and IoT
Darko Marjanović
Đorđe Stepanić
Miloš Milovanović

AGENDA
BIG DATA
HADOOP AND IOT MODEL
HADOOP
IOT
HADOOP DATA PROCESSING
HIVE
STINGER INITIATIVE
Q&A

BIG DATA
Big Data describes collections of data sets so large and complex that they are
difficult to capture, process, store, search and analyze using conventional
database systems.
Anything that Won't Fit in Excel.
*Definition taken from (www.bigdata-startups.com)

BIG DATA DIMENSIONS
1992 100 GB/day
2002 100 GB/second
2013 28,000 GB/second
2018 50,000 GB/second
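
The figures above mix per-day and per-second rates, which makes the growth hard to compare at a glance. A minimal sketch that normalizes them to GB per day, assuming only the four values on the slide and decimal units (the helper name gb_per_day is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# (year, amount in GB, time unit) exactly as listed on the slide.
RATES = [
    (1992, 100, "day"),
    (2002, 100, "second"),
    (2013, 28_000, "second"),
    (2018, 50_000, "second"),
]


def gb_per_day(amount, unit):
    """Convert a rate to GB per day; only 'day' and 'second' are handled."""
    if unit == "second":
        return amount * SECONDS_PER_DAY
    return amount


for year, amount, unit in RATES:
    print(year, f"{gb_per_day(amount, unit):,} GB/day")
```

Read this way, the 2002 rate is 86,400 times the 1992 rate, and the 2013 rate works out to roughly 2.4 billion GB per day.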

HADOOP
Apache Hadoop is an open-source software framework for storage and large-scale
processing of data-sets on clusters of commodity hardware.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and thus should be automatically handled in software
by the framework.

HADOOP COMPONENTS
Hadoop Common
HDFS
MapReduce
YARN (starting with Hadoop 2.x)

HADOOP HDFS
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system
written in Java for the Hadoop framework.

HADOOP MAPREDUCE
MapReduce is a programming model and an associated implementation for processing
and generating large data sets with a parallel, distributed algorithm on a cluster.
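
To make the programming model concrete, here is a minimal single-process sketch of the classic word count in plain Python. It only imitates the map, shuffle and reduce steps in memory; it does not use the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big cluster"]))
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the shuffle moves data between them over the network.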

HADOOP YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is now characterized as a large-scale, distributed operating
system for big data applications.

HADOOP ECOSYSTEM
The main groups of tools in the Hadoop ecosystem:
Data Ingestion (Flume, Sqoop …)
Data Processing (Pig, Hive, Storm …)
Cluster Management (Ambari)
Security (Knox)

DATA INGESTION
Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data.
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured datastores such as relational databases.
WebHDFS REST API

SQOOP AND WEB HDFS API EXAMPLE
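The original example on this slide is not preserved. As a stand-in, the sketch below builds a WebHDFS request URL and a Sqoop import command line without executing either. The host name, paths, table, JDBC URL and user are hypothetical, and port 50070 is assumed as the NameNode HTTP port of a Hadoop 2.x cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, user=None, port=50070):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    params = {"op": op}
    if user is not None:
        params["user.name"] = user  # simple (non-Kerberos) authentication
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def sqoop_import_args(jdbc_url, table, target_dir, username):
    """Build the argument list for importing one relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
    ]


# Hypothetical cluster and database names, for illustration only.
print(webhdfs_url("namenode.example.com", "/data/iot", "LISTSTATUS"))
print(" ".join(sqoop_import_args(
    "jdbc:mysql://db.example.com/sensors", "readings", "/data/iot/readings", "etl")))
```

The first URL could be fetched with any HTTP client to list a directory; the argument list could be passed to subprocess.run on a machine where Sqoop is installed.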

UBIQUITOUS COMPUTING & INTERNET OF THINGS
Ubiquitous computing - a trend (wave) in computing in which computers are
spread throughout our everyday environment.
Concept: one person - many computers
Internet of Things - the network of physical objects accessed through the
Internet that contain embedded technology to interact (sense and
communicate) with their internal states or the external environment
(Cisco definition).

INTERNET OF THINGS AND BIG DATA

REAL-TIME DATA, STRUCTURED AND UNSTRUCTURED DATA GENERATED FROM INTERNET OF THINGS

INTERNET OF THINGS - FIELDS OF APPLICATION
- Production - energy savings, lower maintenance costs, prediction of
machine failure, quality control etc.
- Logistics - efficient supply control, optimization of transport,
environmental controls in the warehouse, JIT, lean logistics, better capacity
utilization etc.
- Smart cities & environment - smart parking, traffic congestion, smart
lighting, waste management, urban noise maps, air pollution etc.
- Smart agriculture
- eHealth
- and everything you can imagine...

HADOOP DATA PROCESSING
Input:
- Raw data files
- No metadata
- No schema
Objective:
- Perform analysis, run interactive queries
- Explore, structure and analyze the data
- Real-time processing (Apache Storm)
- Visualization

HIVE
Apache Hive is data warehouse software that facilitates querying and
managing large datasets residing in distributed storage.
Hive provides:
- Tools for ETL processes
- A mechanism for imposing a structure on a variety of data formats
- Access to files stored in HDFS or other storage systems
- Query execution via MapReduce
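
Imposing a structure on raw files at query time is often called schema-on-read. The sketch below illustrates the idea in plain Python, not in Hive itself. The sensor schema and sample data are hypothetical, and skipping malformed rows is a simplification chosen for this sketch.

```python
import csv
import io

# Hypothetical schema for raw IoT readings: (column name, type converter).
SCHEMA = [("sensor_id", str), ("ts", int), ("temperature", float)]


def read_with_schema(raw_text, schema=SCHEMA, delimiter=","):
    """Apply a schema to raw delimited text at read time; the data is not rewritten."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        if len(fields) != len(schema):
            continue  # sketch-level choice: skip rows that do not fit the schema
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


raw = "s1,1414000000,21.5\ns2,1414000060,19.0\nbad line\n"
rows = read_with_schema(raw)
print(rows)
print(sum(r["temperature"] for r in rows) / len(rows))  # average temperature
```

The raw file stays untouched; a different schema could be applied to the same bytes later, which is what makes this approach convenient for data that arrives without metadata.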

HIVE ARCHITECTURE
Data Model:
- Tables
- Partitions
- Buckets
SerDes
Datatypes:
- Common primitive data types (int, boolean, float, double, string, char, date, timestamp, …)
- Complex data types (structs, maps, arrays)
Components:
- UI
- Driver
- Compiler
- Metastore
- Execution engine
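
The partitions and buckets of the data model can be pictured as a directory layout plus a hash rule. The sketch below is a simplified illustration in Python: partitions become key=value subdirectories under the table directory, and a row's bucket is a hash of the bucketing column modulo the bucket count. The CRC32 hash is a stand-in chosen for this sketch, not Hive's own hash function, and the table path and column names are hypothetical.

```python
import zlib


def partition_dir(table_root, partition_values):
    """Return the directory for one partition, e.g. <table>/dt=2014-10-21."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_for(key, num_buckets):
    """Assign a row to a bucket: hash of the bucketing column modulo bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


print(partition_dir("/warehouse/readings", {"dt": "2014-10-21", "city": "belgrade"}))
print(bucket_for("sensor-42", 4))
```

A query that filters on a partition column only needs to read the matching directories, while bucketing keeps rows with the same key in the same file, which helps sampling and joins.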

HIVE.NOW
Hive defines a simple SQL-like query language, called HQL, that enables users
familiar with SQL to query the data.
Scalable and extensible.
Most commonly used for:
- Log analysis
- Statistical analysis
- Document indexing

STINGER INITIATIVE
Stinger is an initiative to reduce query execution time and extend SQL
functionality in Apache Hive. Microsoft and Hortonworks worked actively in the
Apache community towards completing Stinger.
Announced in February 2013
44 companies, 145 developers, 392,000 lines of Java code
Hive 0.13
Speed: Hive on Tez, vectorized query engine & cost-based optimizer
Scale: dynamic partition loads and smaller hash tables
SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
Improved Hive performance by up to 100x.

STINGER.NEXT
Stinger.next is a continuation of the Stinger initiative to further improve speed,
scale and SQL support in Hive within the open Apache Hive community.
Main goals:
- transactions with ACID semantics
- sub-second queries
- SQL:2011 Analytics
- usability improvements
To be delivered in the next 18 months.

STINGER.NEXT
*Photo taken from the official Hortonworks website (www.hortonworks.com)

HIVE ON SPARK
Apache Spark is a fast and general engine for large-scale data processing.
Spark powers a stack of high-level tools including Spark SQL, MLlib for machine
learning, GraphX, and Spark Streaming.
Hive-Spark Machine Learning Integration will allow Hive users to run machine
learning models via Hive.

Q&A
darko@thingsolver.com
djordje@thingsolver.com
milosmilovanovic@outlook.com
hadoop-srbija.com

