Improving MapReduce: Scala and Gridgain to the rescue! Nikita Ivanov

•

0 j'aime•1,077 vues

The topic of this presentation is about demonstrating how Scala and GridGain software allows to build a real time, MapReduce application in a record time and how GridGain’s approach is different to the one employed by Hadoop MapReduce framework that is based on offline batch-oriented processing. Most of the presentation will be concentrate on the live “from scratch” coding of the application that takes an ubiquitious Hadoop example of calculating word frequencies in the large body of text and implementing it using GridGain to work in 100% real time context instead. The live coding part of the presentation will be done using Scala programming language that is natively supported by GridGain. The initial 10 minutes of the presentation will be devoted to explaining the MapReduce concept. High level explanation about GridGain and Hadoop approaches to solving it, and its respective tradeoffs. Overall - the presentation is extremely hands-on and requires a significant level of understanding of Java or Scala, and basic level of understanding for distributed programming concepts such as In-Memory Data Grid and Compute Grids.

Technologie

Improving MapReduce:
GridGain and Scala To The Rescue!

Nikita Ivanov, Founder & CEO GridGain Systems
October 2012 www.gridgain.com

#gridgain

Table Of Contents:

> 30% > 70%
> Why Real Time Hadoop/MR? > Live Coding
> GridGain Overview > Real Time & Streaming Word Count
> In-Memory Computing
> Compute & Data Grids

www.gridgain.com Slide 2

Real Time MapReduce/Hadoop:

Why?

www.gridgain.com Slide 3

In-Memory Computing: Why Now?

“In-memory will have an industry impact
comparable to web and cloud. RAM is a new disk,
and disk is a new tape.”

Technology & Cost: Performance & Scalability Matters:
> 64-bit CPU can address 16 exabytes > Citi: 100ms == $1M loss
Entire active data set on the planet is addressable by just 1 CPU. Forex trading

> Disk up to 105 times slower than DRAM > Google: 500ms == 20% trafﬁc drop
Dropping 20% of revenue
SSD drives are up to 103 times slower

> Super effective in-memory parallelization
> SAP sees +206% in proﬁt in Q112
For in-memory SAP HANA products
Enabled by modern multicore CPUs
> Software AG sees 3x revenue in 2012
> DRAM prices drop 30% every 18 months For in-memory Terracota products
1TB RAM & 48 cores cluster ~ $40K (< $20K in 3 years)

www.gridgain.com Slide 4

GridGain: In-Memory Data Platform
> In-Memory Compute Grid + > Full ACID Transactions
In-Memory Data Grid Fully distributed ACID transactions

> Real Time & Streaming MapReduce, CEP > Simplicity and Productivity
Dramatically reduces cost of application development
> Three Editions: Demonstrably faster time-to-market
> In-Memory HPC
HPC market
Example:
> In-Memory Data Grid
Full source code in Scala of world’s shortest real time MapReduce app
Transactional Data Caching market
built with GridGain. Works on one or thousands of computers with no
code changes required.
> In-Memory Hadoop
Real Time Big Data market

> Language support:
Server: Java, Scala, Groovy,
Clients: .NET, PHP, REST, C++

> Mobile platforms support:
iOS/ObjectiveC, Android clients

www.gridgain.com Slide 5

GridGain: In-Memory Compute Grid

> Direct API for split and aggregation
> Pluggable failover, topology and collision resolution
> Distributed task session
> Distributed continuations & recursive split
> Support for Streaming MapReduce
> Support for Complex Event Processing (CEP)
> Node-local cache
> AOP-based, OOP/FP-based execution modes
> Direct closure distribution in Java, Scala and Groovy
> Cron-based scheduling
> Direct redundant mapping support
> Zero deployment with P2P class loading
> Partial asynchronous reduction
> Direct support for weighted and adaptive mapping
> State checkpoints for long running tasks
> Early and late load balancing
> Afﬁnity routing with data grid

www.gridgain.com Slide 6

GridGain: In-Memory Data Grid
> Zero deployment for data
> Local, full replicable and partitioned cache types
> Pluggable expiration policies (LRU, LIRS, random, time-based)
> Read-through and write-through logic with pluggable cache store
> Synchronous and asynchronous cache operations
> MVCC-based concurrency
> Pluggable data overﬂow storage via new swap space SPI
> PESSIMISTIC, OPTIMISTIC transactions
> Standard isolation levels, JTA/JCA integration
> Master/Master data replication/invalidation
> Write-behind cache store support
> Concurrent and transactional data preloading
> Delayed preloading support
> Afﬁnity routing with compute grid
> Partitioned cache with active replicas
> Structured and unstructured data
> Datacenter replication
> JDBC driver for in-memory object data store
> Off-heap memory support
> Pluggable indexing via Indexing SPI
> Tiered storage with on-heap, off-heap, swap space, SQL, and Hadoop
> Distributed in-memory query capability
> SQL, H2, Lucene, predicate-based afﬁnity co-located queries

www.gridgain.com Slide 7

Live Coding: GridGain + Scala

> 100% Live Coding:
> Nothing pre-built
> Every line & character
> Everything from the start

www.gridgain.com Slide 8

Recommandé

GridGain - Java Grid Computing Made SimpleMatthew McCullough

Everything I know about software in spaghetti bolognese: managing complexityJAX London

Devops with the S for Sharing - Patrick DeboisJAX London

Busy Developer's Guide to Windows 8 HTML/JavaScript AppsJAX London

It's code but not as we know: Infrastructure as Code - Patrick DeboisJAX London

Locks? We Don't Need No Stinkin' Locks - Michael BarkerJAX London

Worse is better, for better or for worse - Kevlin HenneyJAX London

Java performance: What's the big deal? - Trisha GeeJAX London

Recommandé

GridGain - Java Grid Computing Made SimpleMatthew McCullough

Everything I know about software in spaghetti bolognese: managing complexityJAX London

Devops with the S for Sharing - Patrick DeboisJAX London

Busy Developer's Guide to Windows 8 HTML/JavaScript AppsJAX London

It's code but not as we know: Infrastructure as Code - Patrick DeboisJAX London

Locks? We Don't Need No Stinkin' Locks - Michael BarkerJAX London

Worse is better, for better or for worse - Kevlin HenneyJAX London

Java performance: What's the big deal? - Trisha GeeJAX London

Clojure made-simple - John StevensonJAX London

HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfJAX London

Play framework 2 : Peter HiltonJAX London

Complexity theory and software development : Tim BerglundJAX London

Why FLOSS is a Java developer's best friend: Dave GruberJAX London

Akka in Action: Heiko SeeburgerJAX London

NoSQL Smackdown 2012 : Tim BerglundJAX London

Closures, the next "Big Thing" in Java: Russel WinderJAX London

Java and the machine - Martijn Verburg and Kirk PepperdineJAX London

Mongo DB on the JVM - Brendan McAdamsJAX London

New opportunities for connected data - Ian RobinsonJAX London

HTML5 Websockets and Java - Arun GuptaJAX London

The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerJAX London

Bluffers guide to elitist jargon - Martijn Verburg, Richard Warburton, James ...JAX London

No Crash Allowed - Patterns for fault tolerance : Uwe FriedrichsenJAX London

Size does matter - Patterns for high scalability: Uwe FriedrichsenJAX London

HBase Advanced - Lars GeorgeJAX London

Scala in Action - Heiko SeeburgerJAX London

Achieving genuine elastic multitenancy with the Waratek Cloud VM for Java : J...JAX London

Choosing the right Agile innovation practices: Scrum vs Kanban vs Lean Startu...JAX London

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes

Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan

Contenu connexe

Plus de JAX London

Clojure made-simple - John StevensonJAX London

HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfJAX London

Play framework 2 : Peter HiltonJAX London

Complexity theory and software development : Tim BerglundJAX London

Why FLOSS is a Java developer's best friend: Dave GruberJAX London

Akka in Action: Heiko SeeburgerJAX London

NoSQL Smackdown 2012 : Tim BerglundJAX London

Closures, the next "Big Thing" in Java: Russel WinderJAX London

Java and the machine - Martijn Verburg and Kirk PepperdineJAX London

Mongo DB on the JVM - Brendan McAdamsJAX London

New opportunities for connected data - Ian RobinsonJAX London

HTML5 Websockets and Java - Arun GuptaJAX London

The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerJAX London

Bluffers guide to elitist jargon - Martijn Verburg, Richard Warburton, James ...JAX London

No Crash Allowed - Patterns for fault tolerance : Uwe FriedrichsenJAX London

Size does matter - Patterns for high scalability: Uwe FriedrichsenJAX London

HBase Advanced - Lars GeorgeJAX London

Scala in Action - Heiko SeeburgerJAX London

Achieving genuine elastic multitenancy with the Waratek Cloud VM for Java : J...JAX London

Choosing the right Agile innovation practices: Scrum vs Kanban vs Lean Startu...JAX London

Plus de JAX London (20)

Clojure made-simple - John Stevenson

HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf

Play framework 2 : Peter Hilton

Complexity theory and software development : Tim Berglund

Why FLOSS is a Java developer's best friend: Dave Gruber

Akka in Action: Heiko Seeburger

NoSQL Smackdown 2012 : Tim Berglund

Closures, the next "Big Thing" in Java: Russel Winder

Java and the machine - Martijn Verburg and Kirk Pepperdine

Mongo DB on the JVM - Brendan McAdams

New opportunities for connected data - Ian Robinson

HTML5 Websockets and Java - Arun Gupta

The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker

Bluffers guide to elitist jargon - Martijn Verburg, Richard Warburton, James ...

No Crash Allowed - Patterns for fault tolerance : Uwe Friedrichsen

Size does matter - Patterns for high scalability: Uwe Friedrichsen

HBase Advanced - Lars George

Scala in Action - Heiko Seeburger

Achieving genuine elastic multitenancy with the Waratek Cloud VM for Java : J...

Choosing the right Agile innovation practices: Scrum vs Kanban vs Lean Startu...

Dernier

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes

Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan

Take control of your SAP testing with UiPath Test SuiteDianaGray10

[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra

The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech

So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda

A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3

Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda

Scale your database traffic with Read & Write split using MySQL RouterMydbops

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal

What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina

How to write a Business Continuity PlanDatabarracks

How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Decarbonising Buildings: Making a net-zero built environment a realityIES VE

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada

UiPath Community: Communication Mining from Zero to HeroUiPathCommunity

Data governance with Unity Catalog PresentationKnoldus Inc.

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3

Dernier (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes

Generative AI for Technical Writer or Information Developers

Take control of your SAP testing with UiPath Test Suite

[Webinar] SpiraTest - Setting New Standards in Quality Assurance

The Ultimate Guide to Choosing WordPress Pros and Cons

So einfach geht modernes Roaming fuer Notes und Nomad.pdf

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx

Potential of AI (Generative AI) in Business: Learnings and Insights

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...

Scale your database traffic with Read & Write split using MySQL Router

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...

What is DBT - The Ultimate Data Build Tool.pdf

How to write a Business Continuity Plan

How AI, OpenAI, and ChatGPT impact business and software.

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Decarbonising Buildings: Making a net-zero built environment a reality

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024

UiPath Community: Communication Mining from Zero to Hero

Data governance with Unity Catalog Presentation

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx

Improving MapReduce: Scala and Gridgain to the rescue! Nikita Ivanov

1. Improving MapReduce: GridGain and Scala To The Rescue! Nikita Ivanov, Founder & CEO GridGain Systems October 2012 www.gridgain.com #gridgain

2. Table Of Contents: > 30% > 70% > Why Real Time Hadoop/MR? > Live Coding > GridGain Overview > Real Time & Streaming Word Count > In-Memory Computing > Compute & Data Grids www.gridgain.com Slide 2

3. Real Time MapReduce/Hadoop: Why? www.gridgain.com Slide 3

4. In-Memory Computing: Why Now? “In-memory will have an industry impact comparable to web and cloud. RAM is a new disk, and disk is a new tape.” Technology & Cost: Performance & Scalability Matters: > 64-bit CPU can address 16 exabytes > Citi: 100ms == $1M loss Entire active data set on the planet is addressable by just 1 CPU. Forex trading > Disk up to 105 times slower than DRAM > Google: 500ms == 20% trafﬁc drop Dropping 20% of revenue SSD drives are up to 103 times slower > Super effective in-memory parallelization > SAP sees +206% in proﬁt in Q112 For in-memory SAP HANA products Enabled by modern multicore CPUs > Software AG sees 3x revenue in 2012 > DRAM prices drop 30% every 18 months For in-memory Terracota products 1TB RAM & 48 cores cluster ~ $40K (< $20K in 3 years) www.gridgain.com Slide 4

5. GridGain: In-Memory Data Platform > In-Memory Compute Grid + > Full ACID Transactions In-Memory Data Grid Fully distributed ACID transactions > Real Time & Streaming MapReduce, CEP > Simplicity and Productivity Dramatically reduces cost of application development > Three Editions: Demonstrably faster time-to-market > In-Memory HPC HPC market Example: > In-Memory Data Grid Full source code in Scala of world’s shortest real time MapReduce app Transactional Data Caching market built with GridGain. Works on one or thousands of computers with no code changes required. > In-Memory Hadoop Real Time Big Data market > Language support: Server: Java, Scala, Groovy, Clients: .NET, PHP, REST, C++ > Mobile platforms support: iOS/ObjectiveC, Android clients www.gridgain.com Slide 5

6. GridGain: In-Memory Compute Grid > Direct API for split and aggregation > Pluggable failover, topology and collision resolution > Distributed task session > Distributed continuations & recursive split > Support for Streaming MapReduce > Support for Complex Event Processing (CEP) > Node-local cache > AOP-based, OOP/FP-based execution modes > Direct closure distribution in Java, Scala and Groovy > Cron-based scheduling > Direct redundant mapping support > Zero deployment with P2P class loading > Partial asynchronous reduction > Direct support for weighted and adaptive mapping > State checkpoints for long running tasks > Early and late load balancing > Afﬁnity routing with data grid www.gridgain.com Slide 6

7. GridGain: In-Memory Data Grid > Zero deployment for data > Local, full replicable and partitioned cache types > Pluggable expiration policies (LRU, LIRS, random, time-based) > Read-through and write-through logic with pluggable cache store > Synchronous and asynchronous cache operations > MVCC-based concurrency > Pluggable data overflow storage via new swap space SPI > PESSIMISTIC, OPTIMISTIC transactions > Standard isolation levels, JTA/JCA integration > Master/Master data replication/invalidation > Write-behind cache store support > Concurrent and transactional data preloading > Delayed preloading support > Affinity routing with compute grid > Partitioned cache with active replicas > Structured and unstructured data > Datacenter replication > JDBC driver for in-memory object data store > Off-heap memory support > Pluggable indexing via Indexing SPI > Tiered storage with on-heap, off-heap, swap space, SQL, and Hadoop > Distributed in-memory query capability > SQL, H2, Lucene, predicate-based affinity co-located queries www.gridgain.com Slide 7

8. Live Coding: GridGain + Scala > 100% Live Coding: > Nothing pre-built > Every line & character > Everything from the start www.gridgain.com Slide 8

9. Thank You! #gridgain