This document summarizes a presentation about using Apache Spark for various data analytics use cases. It discusses how Spark can be used for interactive SQL queries on large datasets, log file enrichment by connecting to data stores like HBase, mixing SQL and machine learning by accessing training and query engines in the same platform, and building recommendation engines by performing ETL, training models with MLlib, and serving recommendations with NoSQL. The presentation argues that Spark helps flatten the adoption curve by providing a unified framework for all these tasks.
My approach to this presentation
Not an API presentation
Great documentation and examples already exist
Lots of presentations of that variety
Architecture presentation – use case study
What unique features facilitate these workloads
What is here now, and what is coming, for new workloads
Ad hoc Queries with Shark
An early use case.
Not usually the first production deployment.
MapR has a few.
In the long run...
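A Shark query is ordinary HiveQL issued interactively. As a minimal sketch of the kind of ad hoc aggregation involved, here is the same idea using Python's stdlib `sqlite3` as a stand-in for a Shark/Hive table (the table and column names are hypothetical, not from the deck):

```python
import sqlite3

# Stand-in for a Shark/Hive table; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 120), ("/docs", 45), ("/home", 30), ("/blog", 10)],
)

# An ad hoc aggregation of the sort you would issue interactively in Shark.
rows = conn.execute(
    "SELECT url, SUM(views) AS total FROM page_views "
    "GROUP BY url ORDER BY total DESC"
).fetchall()
for url, total in rows:
    print(url, total)
```

In Shark the same GROUP BY would run distributed over cached RDD partitions; the query text itself is what the analyst types ad hoc.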
Logfile enrichment
Streaming API
Leveraging near-real-time resolution for enrichment.
Not the same as Storm – it's micro-batch.
Hooks to other messaging tools: ZeroMQ, Kafka, etc.
Sliding Window features
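For intuition, Spark Streaming's windowed operations amount to aggregating over the last N micro-batches. A stdlib sketch (the batch contents and window length are made up for illustration):

```python
from collections import deque

def windowed_counts(batches, window_len):
    """Yield the event count over the last `window_len` micro-batches,
    mimicking the behavior of a windowed count in Spark Streaming."""
    window = deque(maxlen=window_len)  # old batch counts fall off automatically
    for batch in batches:
        window.append(len(batch))
        yield sum(window)

# Each inner list stands in for one micro-batch of log lines.
batches = [["a", "b"], ["c"], ["d", "e", "f"], []]
print(list(windowed_counts(batches, window_len=2)))
```

The sliding window is just a bounded buffer over batch results; Spark Streaming does the same thing distributed, with the window and slide durations as parameters.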
NoSQL capabilities
Access to tables via the HBase API
Access to in-memory RDDs
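The enrichment pattern itself is simple: each streamed log record picks up extra fields from a table lookup. A sketch with a plain dict standing in for an HBase table (the field names are hypothetical):

```python
# Dict standing in for an HBase table keyed by user id.
user_table = {
    "u1": {"country": "US", "plan": "pro"},
    "u2": {"country": "DE", "plan": "free"},
}

def enrich(log_record, table):
    """Join a raw log record with stored user attributes (the enrichment step)."""
    extra = table.get(log_record["user"], {})
    return {**log_record, **extra}

raw = {"user": "u1", "path": "/docs"}
print(enrich(raw, user_table))
```

In the Spark case the lookup side would be an HBase table or a cached RDD rather than a dict, but the per-record join is the same shape.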
SQL Mixing with Machine Learning
“ETL for the math nerd”
R and Python
Current access is via Shark
Spark SQL will drive toward native SQL support
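"ETL for the math nerd" means pulling training rows with SQL and fitting a model in the same program. A minimal stdlib sketch of that pattern, with `sqlite3` in place of Shark and an ordinary least-squares fit in place of MLlib (the table, columns, and numbers are invented for illustration):

```python
import sqlite3

# SQL side: pull training rows from a hypothetical ad-spend vs. revenue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)])
rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()

# ML side: closed-form ordinary least squares, standing in for MLlib training.
n = len(rows)
sx = sum(x for x, _ in rows)
sy = sum(y for _, y in rows)
sxx = sum(x * x for x, _ in rows)
sxy = sum(x * y for x, y in rows)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(round(slope, 2), round(intercept, 2))
```

The point of the platform argument is that both halves run in one process over one dataset; with Shark/Spark the SQL result is an RDD that feeds MLlib directly, with no export step between the query engine and the trainer.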
Recommendation Engine
Spark provides all aspects:
ETL
Vector/matrix generation and model training (MLlib)
Near-real-time recommendation serving (NoSQL)
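The three stages above can be sketched end to end in a few lines. This is a toy item co-occurrence recommender in stdlib Python, standing in for the real pipeline (MLlib training, NoSQL serving); the users, items, and scoring rule are invented for illustration:

```python
from collections import Counter, defaultdict

# ETL: raw (user, item) interaction pairs, here simply hard-coded.
views = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "b")]

# Training: item co-occurrence counts, in place of MLlib model fitting.
by_user = defaultdict(set)
for user, item in views:
    by_user[user].add(item)
cooccur = defaultdict(Counter)
for items in by_user.values():
    for i in items:
        for j in items:
            if i != j:
                cooccur[i][j] += 1

# Serving: score items co-viewed with the user's history, the way a
# NoSQL store would serve precomputed recommendations.
def recommend(user, k=2):
    scores = Counter()
    for item in by_user[user]:
        scores.update(cooccur[item])
    for seen in by_user[user]:
        scores.pop(seen, None)  # never recommend what the user already saw
    return [item for item, _ in scores.most_common(k)]

print(recommend("u1"))
```

In the Spark version, the ETL and training stages are RDD transformations plus an MLlib factorization, and the serving table is written out to HBase or another NoSQL store; the data flow is the same three steps.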
Pharma companies like ADAM; academia is behind.
Few graph use cases in development; none deployed, none on GraphX.
MLlib and Mahout may join forces.
PySpark, according to Databricks, is some of the most active code.
SparkR
BlinkDB: time-limited queries; lives in a separate GitHub repo, to be merged into the main Spark branch.
A couple of OEM vendors exist for Spark; they are covered on the Databricks site.
MapR has a very large and robust, growing ecosystem of partners. This is important for you because you have existing investments and relationships with other technologies which need to work well with MapR, integrate easily, and allow you to create a differentiated set of technologies.
(highlight key partners which are important to your customer)