SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
© 2016 The Apache Software Foundation. Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are trademarks of The Apache Software Foundation.
Better Together: Fast Data with Ignite & Spark
Christos Erotocritou - Spark Summit EU 2016
© 2016 GridGain Systems, Inc.
Agenda
• GridGain & Apache Ignite
Project
• Ignite In-Memory Data Fabric
• Apache Ignite vs. Apache Spark
• Hadoop & Spark Integration
• Q & A
© 2016 GridGain Systems, Inc.
Apache Ignite Project
• 2007: First version of GridGain
• Oct. 2014: GridGain contributes Ignite to
ASF
• Aug. 2015: Ignite is the second fastest
project to graduate after Spark
• Today:
• 82+ contributors and growing rapidly
• Huge development momentum -
Estimated 233 years of effort since the
first commit in February, 2014 [Openhub]
• Mature codebase: 840k+ SLOC & more
than 17k commits
January 2016
© 2016 GridGain Systems, Inc.
• GridGain Enterprise Edition
• Is a binary build of Apache Ignite™ created by GridGain
• Added enterprise features for enterprise deployments
• Earlier features and bug fixes by a few weeks
• Heavily tested
© 2016 GridGain Systems, Inc.
Customer Use Cases
Automated Trading Systems
Real time analysis of trading positions & market risk.
High volume transactions, ultra low latencies.
Financial Services
Fraud Detection, Risk Analysis, Insurance rating and
modelling.
Online & Mobile Advertising
Real time decisions, geo-targeting & retail traffic
information.
Big Data Analytics
Customer 360 view, real-time analysis of KPIs, up-to-the-
second operational BI.
Online Gaming
Real-time back-ends for mobile and massively parallel
games.
SaaS Platforms & Apps
High performance next-generation architectures for
Software as a Service Application vendors.
Travel & E-Commerce
High performance next-generation architectures for
online hotel booking.
© 2016 GridGain Systems, Inc.
What is an IMDF?
High-performance distributed in-memory platform for
computing and transacting on large-scale data sets in
near real-time.
© 2016 GridGain Systems, Inc.
What is an IMDF?
© 2016 GridGain Systems, Inc.
In-Memory Computing Platform
Data Grid
Batch
Data
Compute
Grid
Transactional
& Analytical
Workloads
Transactional & Analytical workloads
Streaming
Data
External
Persistence
External
APIs
Back-end users, third-party clients and
downstream systems
Downstream
Systems
Clients accessing a high-speed
distributed multi-facet service
© 2016 GridGain Systems, Inc.
Scalability & Resilience with Ignite
Data Grid
Batch
Data
Compute
Grid
Transactional
& Analytical
Workloads
Transactional & Analytical workloads
Streaming
Data
External
Persistence
External
APIs
Back-end users, third-party clients and
downstream systems
Downstream
Systems
Clients accessing a high-speed
distributed multi-facet service
© 2016 GridGain Systems, Inc.
Fault Tolerance & Horizontal Scalability
Replicated Cache Partitioned Cache
© 2016 GridGain Systems, Inc.
Local Store & Vertical Scale
• Tiered Memory
• On-Heap -> Off-Heap ->
Disk
• Persistent On-Disk Store
• Fast Recovery
• Local Data Reload
• Eliminate Network and Db impacts
when reloading in-memory store
© 2016 GridGain Systems, Inc.
Storage and Caching using Ignite
Data Grid
Batch
Data
Compute
Grid
Transactional
& Analytical
Workloads
Transactional & Analytical workloads
Streaming
Data
External
Persistency
External
APIs
Back-end users, third-party clients and
downstream systems
Downstream
Systems
Clients accessing a high-speed
distributed multi-facet service
© 2016 GridGain Systems, Inc.
• 100% JCache Compliant (JSR 107)
– Basic Cache Operations
– Concurrent Map APIs
– Collocated Processing (EntryProcessor)
– Events and Metrics
– Pluggable Persistence
• Ignite Data Grid
– Fault Tolerance and Scalability
– Distributed Key-Value Store
– SQL Queries (ANSI 99)
– ACID Transactions
– In-Memory Indexes
– RDBMS / NoSQL Integration
In-Memory Data Grid
© 2016 GridGain Systems, Inc.
Distributed Computing with Ignite
Data Grid
Batch
Data
Compute
Grid
Transactional
& Analytical
Workloads
Transactional & Analytical workloads
Streaming
Data
External
Persistency
External
APIs
Back-end users, third-party clients and
downstream systems
Downstream
Systems
Clients accessing a high-speed
distributed multi-facet service
© 2016 GridGain Systems, Inc.
Client-Server vs. Affinity Colocation
1
2
4
3 Data 1
Job 1
2
3
Data 2
Job 2
Processing
Node 1
Processing
Node 2
Client
Node
Data
Node 1
Data
Node 2
Processing
Node 1
1
3
4
Data 1
Data 2
2
2
1. Initial Request
2. Fetch data from remote nodes
3. Process entire data-set
4. Return to client
1. Initial Request
2. Co-locating processing with data
3. Return partial result
4. Reduce & return to client
© 2016 GridGain Systems, Inc.
• Direct API for MapReduce
• Cron-like Task Scheduling
• State Checkpoints
• Load Balancing
• Round-robin
• Random & weighted
• Automatic Failover
• Per-node Shared State
• Zero Deployment
• Distributed class loading
In-Memory Compute Grid
© 2016 GridGain Systems, Inc.
Hadoop & Spark Integration
© 2016 GridGain Systems, Inc.
Apache Ignite Apache Spark
– Ingests data from HDFS or another
distributed file system
– Inclined towards analytics (OLAP) and
focused on MR-specific payloads
– Requires the creation of RDD and data and
processing operations are governed by it
– Primarily disk-based SQL support
– Strong ML libraries
– Big community
– Data source agnostic
– Fully fledged compute engine and
resilient data storage in-memory for
OLAP & OLTP
– Zero-deployment
– In-Memory SQL support
– Fully ACID transactions across memory
and disk
– Broader in-memory system that is less
focused on Hadoop
– Off-heap memory to avoid GC pauses
– In production since 2007
© 2016 GridGain Systems, Inc.
• IgniteRDD
– Share RDD across jobs on the host
– Share RDD across jobs in the
application
– Share RDD globally
• Faster SQL
– In-Memory Indexes
– SQL on top of Shared RDD
Spark & Ignite Integration
Spark Application
Spark Worker
Spark
Job
Spark
Job
Ignite Node
Yarn Mesos Docker Cloud
Server
Spark Worker
Spark
Job
Spark
Job
Ignite Node
Server
Spark Worker
Spark
Job
Spark
Job
Ignite Node
Server
In-Memory Shared RDDs
© 2016 GridGain Systems, Inc.
• Reading values from Ignite:
• IgniteContext is the main entry point to Spark-Ignite integration:
val igniteContext = new IgniteContext[Integer, Integer]
(sparkContext, () => new IgniteConfiguration())
val cache = igniteContext.fromCache("myRdd")
val result = cache.filter(_._2.contains("Ignite")).collect()
val cacheRdd = igniteContext.fromCache("myRdd")
cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))
• Saving values to Ignite:
• Running SQL queries against Ignite Cache:
val cacheRdd = igniteContext.fromCache("myRdd")
val result = cacheRdd.sql
("select _val from Integer where val > ? and val < ?", 10, 100)
Spark & Ignite Integration: IgniteRDD
© 2016 GridGain Systems, Inc.
Spark Integration: Using Dataframes from IgniteRDDs
// Create an IgniteRDD
val companyCacheIgnite = new IgniteContext[Int, String](sc, () =>
new IgniteConfiguration()).fromCache("CompanyCache")
// Create company DataFrame
val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p =>
Company(p._1, p._2)))
// Register DataFrame as a table
dfCompany.registerTempTable("company")
© 2016 GridGain Systems, Inc.
• Ignite In-Memory File System (IGFS)
– Hadoop-compliant
– Easy to Install
– On-Heap and Off-Heap
– Caching Layer for HDFS
– Write-through and Read-through
HDFS
– Performance Boost
IGFS: In-Memory File System
MR HIVE PIG
In-Memory MapReduce
IGFS
HDFS
IGFS
YARN
Any
Hadoop
Distro
© 2016 GridGain Systems, Inc.
Hadoop Accelerator: Map Reduce
• In-Memory Performance
• Zero Code Change
• Use existing MR code
• Use existing Hive queries
• No Name Node
• No Network Noise
• In-Process Data Colocation
• Eager Push Scheduling
User
Applicatio
n
Hadoop
Client
Ignite
Client
Hadoop
Jobtracke
r
Hadoop
Name
Node
Hadoop
Tasktrack
er
Hadoop
Tasktrack
er
Ignite
Data
Node
(IGFS)
Ignite
Data
Node
(IGFS)
Hadoop
Data
Node
(HDFS)
Hadoop
Data
Node
(HDFS)
Ignite Path
Hadoop Path
© 2016 GridGain Systems, Inc.
• Docker
• Amazon AWS
• Azure Marketplace
• Google Cloud
• Apache JClouds
• Mesos
• YARN
• Apache Karaf (OSGi)
Deployment
© 2016 GridGain Systems, Inc.
Thank You!
www.gridgain.com
@gridgain @ApacheIgnite
#gridgain #ApacheIgnite
Thank you for joining us. Follow the conversation.
Michael Griggs
Michael.Griggs@gridgain.com
github.com/kemiz/SparkIgniteSimpleExample

Contenu connexe

Plus de Big Data Spain

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
 

Plus de Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
 

Dernier

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

  • 1.
  • 2. © 2016 The Apache Software Foundation. Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are trademarks of The Apache Software Foundation. Better Together: Fast Data with Ignite & Spark Christos Erotocritou - Spark Summit EU 2016
  • 3. © 2016 GridGain Systems, Inc. Agenda • GridGain & Apache Ignite Project • Ignite In-Memory Data Fabric • Apache Ignite vs. Apache Spark • Hadoop & Spark Integration • Q & A
  • 4. © 2016 GridGain Systems, Inc. Apache Ignite Project • 2007: First version of GridGain • Oct. 2014: GridGain contributes Ignite to ASF • Aug. 2015: Ignite is the second fastest project to graduate after Spark • Today: • 82+ contributors and growing rapidly • Huge development momentum - Estimated 233 years of effort since the first commit in February, 2014 [Openhub] • Mature codebase: 840k+ SLOC & more than 17k commits January 2016
  • 5. © 2016 GridGain Systems, Inc. • GridGain Enterprise Edition • Is a binary build of Apache Ignite™ created by GridGain • Added enterprise features for enterprise deployments • Earlier features and bug fixes by a few weeks • Heavily tested
  • 6. © 2016 GridGain Systems, Inc. Customer Use Cases Automated Trading Systems Real time analysis of trading positions & market risk. High volume transactions, ultra low latencies. Financial Services Fraud Detection, Risk Analysis, Insurance rating and modelling. Online & Mobile Advertising Real time decisions, geo-targeting & retail traffic information. Big Data Analytics Customer 360 view, real-time analysis of KPIs, up-to-the- second operational BI. Online Gaming Real-time back-ends for mobile and massively parallel games. SaaS Platforms & Apps High performance next-generation architectures for Software as a Service Application vendors. Travel & E-Commerce High performance next-generation architectures for online hotel booking.
  • 7. © 2016 GridGain Systems, Inc. What is an IMDF? High-performance distributed in-memory platform for computing and transacting on large-scale data sets in near real-time.
  • 8. © 2016 GridGain Systems, Inc. What is an IMDF?
  • 9. © 2016 GridGain Systems, Inc. In-Memory Computing Platform Data Grid Batch Data Compute Grid Transactional & Analytical Workloads Transactional & Analytical workloads Streaming Data External Persistence External APIs Back-end users, third-party clients and downstream systems Downstream Systems Clients accessing a high-speed distributed multi-facet service
  • 10. © 2016 GridGain Systems, Inc. Scalability & Resilience with Ignite Data Grid Batch Data Compute Grid Transactional & Analytical Workloads Transactional & Analytical workloads Streaming Data External Persistence External APIs Back-end users, third-party clients and downstream systems Downstream Systems Clients accessing a high-speed distributed multi-facet service
  • 11. © 2016 GridGain Systems, Inc. Fault Tolerance & Horizontal Scalability Replicated Cache Partitioned Cache
  • 12. © 2016 GridGain Systems, Inc. Local Store & Vertical Scale • Tiered Memory • On-Heap -> Off-Heap -> Disk • Persistent On-Disk Store • Fast Recovery • Local Data Reload • Eliminate Network and Db impacts when reloading in-memory store
  • 13. © 2016 GridGain Systems, Inc. Storage and Caching using Ignite Data Grid Batch Data Compute Grid Transactional & Analytical Workloads Transactional & Analytical workloads Streaming Data External Persistency External APIs Back-end users, third-party clients and downstream systems Downstream Systems Clients accessing a high-speed distributed multi-facet service
  • 14. © 2016 GridGain Systems, Inc. • 100% JCache Compliant (JSR 107) – Basic Cache Operations – Concurrent Map APIs – Collocated Processing (EntryProcessor) – Events and Metrics – Pluggable Persistence • Ignite Data Grid – Fault Tolerance and Scalability – Distributed Key-Value Store – SQL Queries (ANSI 99) – ACID Transactions – In-Memory Indexes – RDBMS / NoSQL Integration In-Memory Data Grid
  • 15. © 2016 GridGain Systems, Inc. Distributed Computing with Ignite Data Grid Batch Data Compute Grid Transactional & Analytical Workloads Transactional & Analytical workloads Streaming Data External Persistency External APIs Back-end users, third-party clients and downstream systems Downstream Systems Clients accessing a high-speed distributed multi-facet service
  • 16. © 2016 GridGain Systems, Inc. Client-Server vs. Affinity Colocation 1 2 4 3 Data 1 Job 1 2 3 Data 2 Job 2 Processing Node 1 Processing Node 2 Client Node Data Node 1 Data Node 2 Processing Node 1 1 3 4 Data 1 Data 2 2 2 1. Initial Request 2. Fetch data from remote nodes 3. Process entire data-set 4. Return to client 1. Initial Request 2. Co-locating processing with data 3. Return partial result 4. Reduce & return to client
  • 17. © 2016 GridGain Systems, Inc. • Direct API for MapReduce • Cron-like Task Scheduling • State Checkpoints • Load Balancing • Round-robin • Random & weighted • Automatic Failover • Per-node Shared State • Zero Deployment • Distributed class loading In-Memory Compute Grid
  • 18. © 2016 GridGain Systems, Inc. Hadoop & Spark Integration
  • 19. © 2016 GridGain Systems, Inc. Apache Ignite Apache Spark – Ingests data from HDFS or another distributed file system – Inclined towards analytics (OLAP) and focused on MR-specific payloads – Requires the creation of RDD and data and processing operations are governed by it – Primarily disk-based SQL support – Strong ML libraries – Big community – Data source agnostic – Fully fledged compute engine and resilient data storage in-memory for OLAP & OLTP – Zero-deployment – In-Memory SQL support – Fully ACID transactions across memory and disk – Broader in-memory system that is less focused on Hadoop – Off-heap memory to avoid GC pauses – In production since 2007
  • 20. © 2016 GridGain Systems, Inc. • IgniteRDD – Share RDD across jobs on the host – Share RDD across jobs in the application – Share RDD globally • Faster SQL – In-Memory Indexes – SQL on top of Shared RDD Spark & Ignite Integration Spark Application Spark Worker Spark Job Spark Job Ignite Node Yarn Mesos Docker Cloud Server Spark Worker Spark Job Spark Job Ignite Node Server Spark Worker Spark Job Spark Job Ignite Node Server In-Memory Shared RDDs
  • 21. © 2016 GridGain Systems, Inc. • Reading values from Ignite: • IgniteContext is the main entry point to Spark-Ignite integration: val igniteContext = new IgniteContext[Integer, Integer] (sparkContext, () => new IgniteConfiguration()) val cache = igniteContext.fromCache("myRdd") val result = cache.filter(_._2.contains("Ignite")).collect() val cacheRdd = igniteContext.fromCache("myRdd") cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i))) • Saving values to Ignite: • Running SQL queries against Ignite Cache: val cacheRdd = igniteContext.fromCache("myRdd") val result = cacheRdd.sql ("select _val from Integer where val > ? and val < ?", 10, 100) Spark & Ignite Integration: IgniteRDD
  • 22. © 2016 GridGain Systems, Inc. Spark Integration: Using Dataframes from IgniteRDDs // Create an IgniteRDD val companyCacheIgnite = new IgniteContext[Int, String](sc, () => new IgniteConfiguration()).fromCache("CompanyCache") // Create company DataFrame val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p => Company(p._1, p._2))) // Register DataFrame as a table dfCompany.registerTempTable("company")
  • 23. © 2016 GridGain Systems, Inc. • Ignite In-Memory File System (IGFS) – Hadoop-compliant – Easy to Install – On-Heap and Off-Heap – Caching Layer for HDFS – Write-through and Read-through HDFS – Performance Boost IGFS: In-Memory File System MR HIVE PIG In-Memory MapReduce IGFS HDFS IGFS YARN Any Hadoop Distro
  • 24. © 2016 GridGain Systems, Inc. Hadoop Accelerator: Map Reduce • In-Memory Performance • Zero Code Change • Use existing MR code • Use existing Hive queries • No Name Node • No Network Noise • In-Process Data Colocation • Eager Push Scheduling User Applicatio n Hadoop Client Ignite Client Hadoop Jobtracke r Hadoop Name Node Hadoop Tasktrack er Hadoop Tasktrack er Ignite Data Node (IGFS) Ignite Data Node (IGFS) Hadoop Data Node (HDFS) Hadoop Data Node (HDFS) Ignite Path Hadoop Path
  • 25. © 2016 GridGain Systems, Inc. • Docker • Amazon AWS • Azure Marketplace • Google Cloud • Apache JClouds • Mesos • YARN • Apache Karaf (OSGi) Deployment
  • 26. © 2016 GridGain Systems, Inc. Thank You! www.gridgain.com @gridgain @ApacheIgnite #gridgain #ApacheIgnite Thank you for joining us. Follow the conversation. Michael Griggs Michael.Griggs@gridgain.com github.com/kemiz/SparkIgniteSimpleExample