Big Data Journey

A general presentation on Big Data architecture and components, delivered by David Pilato and Tugdual Grall at JUG Summer Camp 2015 in La Rochelle, France.

© 2015 MapR Technologies
Big Data Journey
Tug Grall
tug@mapr.com
@tgrall
David Pilato
david@elastic.co
@dadoonet

Copy files in HDFS
hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
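
The target directory in the command above encodes the date of the logs. A minimal sketch of how such a command could be assembled per day, assuming a /logs/YYYY/MM/DD/ layout; the helper name `build_put_command` is hypothetical, and the sketch only builds the argument list without running it:

```python
from datetime import date
from typing import List
import shlex


def build_put_command(local_file: str, day: date, root: str = "/logs") -> List[str]:
    """Build the argv for `hadoop fs -put` into a dated directory (assumed layout)."""
    # date supports strftime-style format specs, giving zero-padded month and day.
    target_dir = f"{root}/{day:%Y/%m/%d}/"
    return ["hadoop", "fs", "-put", local_file, target_dir]


if __name__ == "__main__":
    cmd = build_put_command("dailylogs-log.zip", date(2015, 9, 10))
    # Print the command instead of running it; a cluster is not assumed here.
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` would execute it on a machine where the `hadoop` client is installed.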

Import RDBMS data
sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers --target-dir /incremental_dataset --append
[Diagram: import targets are Files, HBase, Hive]

Import RDBMS data
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "postgres"
    jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from contacts"
  }
}

Streaming: Flume, Kafka, Logstash to the rescue

[Diagram: sources (Log, App Events, Twitter, Sensors, …) connected directly to sinks (HDFS, MapR-FS, Alerts, Elasticsearch, …, DB)]

[Diagram: Producers (Log, App Events, Twitter, Sensors, …) → Broker → Consumers (HDFS, MapR-FS, Alerts, Elasticsearch, …, DB)]
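
The broker layout can be sketched in a few lines. This is an illustration only, assuming in-process queues as a stand-in for a real broker such as Kafka; the function name `run_broker_demo` and the consumer names are hypothetical:

```python
import queue
import threading


def run_broker_demo(events, consumer_names):
    """Fan events from one producer out to several independent consumers."""
    # One queue per consumer plays the role of the broker: the producer
    # never talks to a consumer directly.
    topics = {name: queue.Queue() for name in consumer_names}
    received = {name: [] for name in consumer_names}

    def produce():
        for event in events:
            for q in topics.values():
                q.put(event)
        for q in topics.values():
            q.put(None)  # sentinel: no more events

    def consume(name):
        while True:
            event = topics[name].get()
            if event is None:
                break
            received[name].append(event)

    threads = [threading.Thread(target=produce)]
    threads += [threading.Thread(target=consume, args=(n,)) for n in consumer_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_broker_demo(["log line", "tweet", "sensor reading"], ["hdfs", "elasticsearch"]))
```

Adding a new consumer only requires a new queue; the producer code does not change, which is the decoupling the broker diagram is about.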

Stream data into Hadoop using Flume
[Diagram: several Servers streaming into Files, HBase, Hive]

Streams using Kafka
[Diagram: Producers → Kafka → Consumers, feeding Files, HBase, Hive and an Alert]

How to store your data?
• Files in a distributed file system
• Rows in NoSQL Table
• Index in Search Engine
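
To make the three options concrete, here is a small sketch that stores the same records in each shape. It uses plain Python structures as stand-ins for a distributed file system, a NoSQL table, and a search engine; the sample records and function names are invented for the illustration:

```python
import json
from collections import defaultdict

# Hypothetical sample records.
records = [
    {"id": "m1", "title": "The Big Data Journey"},
    {"id": "m2", "title": "Journey to Spark"},
]


def as_file_lines(rows):
    """File style: one JSON document per line, read by scanning."""
    return [json.dumps(r, sort_keys=True) for r in rows]


def as_table(rows):
    """NoSQL table style: each record is addressed by its row key."""
    return {r["id"]: {k: v for k, v in r.items() if k != "id"} for r in rows}


def as_inverted_index(rows, field="title"):
    """Search engine style: each lowercased term maps to the ids containing it."""
    index = defaultdict(set)
    for r in rows:
        for term in r[field].lower().split():
            index[term].add(r["id"])
    return dict(index)


if __name__ == "__main__":
    print(as_file_lines(records))
    print(as_table(records)["m2"])
    print(sorted(as_inverted_index(records)["journey"]))
```

Files favor full scans, the table favors lookups by key, and the index favors lookups by content.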

Data Processing
• Transform the data
• Enrich the data
• Examples:
– Store data in multiple formats
– Aggregate data
– Build Recommendations
– …
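
The transform, enrich, and aggregate steps can be sketched on a toy log. The log format ("ip status path"), the prefix lookup table, and all function names are assumptions made for this example:

```python
from collections import Counter

# Hypothetical lookup used to enrich events with an origin label.
ORIGIN_BY_IP_PREFIX = {"10.": "internal", "192.168.": "internal"}


def transform(line):
    """Transform: parse a raw 'ip status path' line into a structured event."""
    ip, status, path = line.split()
    return {"ip": ip, "status": int(status), "path": path}


def enrich(event):
    """Enrich: add an 'origin' field derived from the lookup table."""
    origin = next(
        (label for prefix, label in ORIGIN_BY_IP_PREFIX.items()
         if event["ip"].startswith(prefix)),
        "external",
    )
    return {**event, "origin": origin}


def aggregate(events):
    """Aggregate: count events per origin."""
    return Counter(e["origin"] for e in events)


if __name__ == "__main__":
    lines = ["10.0.0.1 200 /home", "8.8.8.8 404 /missing", "192.168.1.5 200 /home"]
    events = [enrich(transform(line)) for line in lines]
    print(aggregate(events))
```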

MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
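
The model above can be sketched as a single-process word count. This is a toy illustration in plain Python, not Hadoop's API: the `shuffle` function stands in for the grouping the framework performs automatically, and all names are hypothetical:

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key (done by the framework in a real job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "big journey"]))
```

Chaining jobs means feeding the output of one `run_job` into the mapper of the next, which is the bookkeeping a higher-level language or DSL takes over.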

Apache Spark: Fast Big Data
– Rich APIs in Java, Scala, Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage

Spark: Unified Platform
[Diagram: Spark SQL, Spark Streaming (Streaming), MLlib (Machine learning) and GraphX (Graph computation) on top of Spark (General execution engine), running on Mesos or Hadoop YARN over a Distributed File System (HDFS, MapR-FS, S3, …)]

[Diagram: Files, HBase, Hive, Index → Discovery/Analytics]

SQL on Hadoop
[Diagram: SQL access to Files, HBase, Hive]
• SQL Shell
• JDBC, ODBC
• BI Tools
• Reporting
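
The point of SQL on Hadoop is that existing SQL skills and tools keep working. The sketch below shows the kind of aggregate query a BI tool might send; it runs against Python's built-in sqlite3 purely as a stand-in (it is not Hive or Drill), and the `logs` table and its columns are hypothetical:

```python
import sqlite3


def top_paths(rows, limit=2):
    """Return the most requested paths among successful requests."""
    conn = sqlite3.connect(":memory:")  # stand-in for a SQL-on-Hadoop engine
    conn.execute("CREATE TABLE logs (path TEXT, status INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    query = """
        SELECT path, COUNT(*) AS hits
        FROM logs
        WHERE status = 200
        GROUP BY path
        ORDER BY hits DESC, path
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("/home", 200), ("/home", 200), ("/about", 200), ("/missing", 404)]
    print(top_paths(sample))
```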

Machine Learning
[Diagram: a MapR Cluster (HBase / MapR DB, MapR-FS) holding the Movie Database; labeled flows: Capture Ratings, Add recommendations to movies, Movies & Recommendations]
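
The "capture ratings, then add recommendations to movies" loop can be illustrated with a simple co-occurrence count. This plain-Python sketch is a swapped-in technique for illustration and is not the algorithm or library used in the talk; the `recommend` function and the sample ratings are hypothetical:

```python
from collections import Counter, defaultdict


def recommend(ratings, top_n=2):
    """Map each movie to the movies most often liked by the same users.

    ratings: iterable of (user, movie) pairs meaning "user liked movie".
    """
    liked = defaultdict(set)
    for user, movie in ratings:
        liked[user].add(movie)

    # Count how often two movies appear in the same user's liked set.
    co_occurrence = defaultdict(Counter)
    for movies in liked.values():
        for a in movies:
            for b in movies:
                if a != b:
                    co_occurrence[a][b] += 1

    # Highest count first; ties broken alphabetically for stable output.
    return {
        movie: [other for other, _ in
                sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for movie, counts in co_occurrence.items()
    }


if __name__ == "__main__":
    sample = [
        ("ann", "Alien"), ("ann", "Blade Runner"),
        ("bob", "Alien"), ("bob", "Blade Runner"), ("bob", "Casablanca"),
    ]
    print(recommend(sample))
```

The resulting lists are what would be written back next to each movie record so that the application reads movies and recommendations together.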

Conclusion
• If possible, use streams: Kafka, Logstash
• Advanced data processing and machine learning: Spark
• Expose your data using SQL for your “BI folks”: Drill
• Aggregation and full-text search: Elasticsearch
• Data visualisation: Kibana

Big Data Journey

2. YARN
4. YARN
5. WHY?
6. https://www.domo.com/
7. Building new applications
9. Can I use my existing tools?
10. (Big) Data Platform, (Big) Data Project
11. Ingest, Store, Process, Consume
12. Ingest Data
13. Copy files in HDFS
14. Import RDBMS data (sqoop)
15. Import RDBMS data (Logstash jdbc input)
16. What’s “wrong”? Batch????
17. Streaming: Flume, Kafka, Logstash to the rescue
18. Diagram: Log, App Events, Twitter, Sensors, … and HDFS, MapR-FS, Alerts, Elasticsearch, …, DB
19. Diagram: the same endpoints with a Broker between Producers and Consumers
20. Stream data into Hadoop using Flume
21. Streams using Kafka
22. Stream data using Logstash
23. Data Storage, Data Format
24. How to store your data?
25. Process Data
26. Data Processing
27. MapReduce Processing Model
28. Apache Spark: Fast Big Data
29. Spark: Unified Platform
30. Elasticsearch / Watcher
32. Query the data
33. Diagram: Files, HBase, Hive, Index, Discovery/Analytics
34. SQL strikes back!
35. SQL on Hadoop
36. Elasticsearch
37. Kibana as a frontend
38. Example: Recommendation Platform
39. Machine Learning
40. Conclusion
