© Copyright 2013 Pivotal. All rights reserved.
Open Source Core GemFire
Introducing Project Geode
Agenda
• Intro
– History, Use Cases, Customers, 2015 Roadmap
– Architecture Overview
• Why OSS, Why Apache
• Southwest experience
• Code walkthrough / deep dive
– Build/source code
– PDX serialization
– Transactions
– Persistence & GII (get initial image)
• Demo
Geode Team Members in the room
Name | Title | Years with Technology
Catherine Johnson | Product Manager | 16 years: GemFire, Coherence
Anthony Baker | Software Engineer | 3 years: GemFire
Roman Shaposhnik | Director of OS Pivotal | 3 years: in-memory grids
Greg Chase | Director of Community | 20 years: Poet, SAP, GemFire
Dan Smith | Software Engineer | 7 years: GemFire
Jens Deppe | Software Engineer | 4 years: GemFire
Swapnil Bawaskar | Software Engineer | 7 years: GemFire
William Markito | Enterprise Architect | 6 years: GemFire, Coherence
Our GemFire Journey Over The Years
Timeline: 2004, 2008, 2014
• Massive increase in data volumes
• Falling margins per transaction
• Increasing cost of IT maintenance
• Need for elasticity in systems
• Financial services providers (every major Wall Street bank)
• Department of Defense
• Real-time response needs
• Time-to-market constraints
• Need for flexible data models across the enterprise
• Distributed development
• Persistence + in-memory
• Global data visibility needs
• Fast ingest needs for data
• Need to allow devices to hook into enterprise data
• Always on
• Largest travel portal
• Airlines
• Trade clearing
• Online gambling
• Largest telcos
• Large manufacturers
• Largest payroll processor
• Auto insurance giants
• Largest rail systems on earth
Hybrid transactional/analytics grids
Big Data Apps at Scale Have Unique Needs
Project Geode is the distributed, NoSQL, in-memory database
for big data apps that need:
1. Scale-out performance
2. Consistent database operations across nodes
3. High availability, resilience, and elasticity
4. Powerful developer features
5. Easy administration of distributed nodes
1. Scale-Out Performance
China Railway Corporation
“The system is operating with solid performance and uptime. Now, we have a reliable, economically sound production system that supports record volumes and has room to grow.”
Dr. Jiansheng Zhu, Vice Director of China Academy of Railway Sciences
• 4.5 million ticket purchases and 20 million users per day
• Spikes of 15,000 tickets sold per minute and 40,000 visits per second
• In-memory storage
• Optimized data distribution
• Elastic, linear scalability
[Chart: Ops/Sec versus number of nodes]
2. Consistent Database Operations Across Globally Distributed Nodes
• Indexing, triggers, event notification
• Performance-optimized persistence
• Configurable consistency: Partitioned, Replicated, Disabled
• Distributed queries & regional functions
“Our global deployment of Geode’s distributed cache gives me a single version of the trade, resolving hard-to-test-for synchronization issues that exist within any globally distributed business application architecture.”
Michael Benillouche, Global Head of Data Management
3. High Availability and Resilience
“We can track and collect money at our 4,000+ kiosks and branches, even without a reliable Internet connection. Geode provides the core data grid and a significant amount of related functionality to help us handle this unreliable network problem.”
Gustavo Valdez, Chief of Architecture and Development
• 19 million payment transactions per month
• 4,000+ points of sale with intermittent Internet connectivity
• Cluster resilience & failover
4. Powerful Developer Features
• Data structures:
– User-defined objects
– Complex object graphs
– Documents (JSON)
• Schema versioning
– Multiple application versions can run simultaneously against the same data nodes
• APIs
– Java: HashMap
– Spring Data GemFire
– Serialization APIs
• Minimal to no code changes:
– Web app session state caching
– Hibernate L2 cache
– Memcached
• Powerful application functions:
– Data-aware functions
– Scatter-gather functions
– Object Query Language (OQL)
– Publish & subscribe and continuous query event framework
– Reliable asynchronous event queues
5. Easy Administration of Distributed Data Grids
• Auto tuning of distributed computing resources to optimize performance
• Cluster monitoring dashboard
– Cluster and node status & performance
• Offline performance statistics analysis tool
– View historical logs and events to diagnose performance and resource bottlenecks
• Command-line tools for easy automation and scripting of administrative tasks
Deployment Flexibility for In-Memory Apps
[Diagram: deployment topologies labeled Embedded; Embedded, Clustered; and Tiered, Clustered. Web servers are shown with an embedded GEO CACHE, with GEO PEER members clustered together, and with GEO CLIENT members connecting to a tier of GEO SERVER nodes.]
Benefits listed under the topologies, accumulating from left to right:
• Flexibility
• Flexibility, Scale
• Flexibility, Scale, Performance
• Flexibility, Scale, Performance, Availability, Localization
Difference between Geode and GemFire
• Native clients beyond Java
– C++
– C#
• WAN connectivity between clusters
• Continuous queries from clients
Geode High Level Architecture
Horizontal Scaling for Geode Reads with Consistent Latency and CPU
• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size
Basic Design patterns
“Low touch” Usage Patterns
HTTP session management
• Simple template for TCServer, TC, App servers
• Shared-nothing persistence, global session state
Hibernate L2 cache plugin
• Set cache in hibernate.cfg.xml
• Support for query and entity caching
Memcached protocol
• Servers understand the memcached wire protocol
• Use any memcached client
Spring Cache Abstraction
<bean id="cacheManager"
class="org.springframework.data.gemfire.support.GemfireCacheManager"
As embedded, clustered Java database
• Just deploy a JAR or WAR into clustered app nodes
• Just like H2 or Derby, except that data can be synced with a DB and is partitioned or replicated across the cluster
• Low cost and easy to manage
As a scalable OLTP data store
• Shared-nothing persistence to disk
• Backup and recovery
• No database to configure and be throttled by
To process app behavior in parallel
Map-reduce but based on simpler RPC
“Write-through” Distributed Caching
• Pre-load using DDLUtils for queries
• Lazily load using “RowLoader” for PK queries
• Configure LRU eviction or expiry for large data
• “Write-through”: participate in the container transaction
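The lazy-load, LRU-eviction and write-through ideas on this slide can be sketched in plain Java. This is a minimal illustration, not Geode code: the class `WriteThroughLruCache` and its methods are hypothetical names, and an ordinary `Map` stands in for the database.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a write-through cache with lazy loading and LRU eviction.
class WriteThroughLruCache<K, V> {
    private final Map<K, V> backingStore;   // stands in for the database
    private final Function<K, V> loader;    // plays the role of a "RowLoader"
    private final LinkedHashMap<K, V> cache;

    WriteThroughLruCache(int maxEntries, Map<K, V> backingStore) {
        this.backingStore = backingStore;
        this.loader = backingStore::get;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;   // evict the LRU entry when over capacity
            }
        };
    }

    // Lazy load: on a cache miss, fetch by primary key and keep the result.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Write-through: update the backing store first, then the cache.
    void put(K key, V value) {
        backingStore.put(key, value);
        cache.put(key, value);
    }

    boolean isCached(K key) {
        return cache.containsKey(key);
    }
}
```

Evicted entries are only dropped from memory; they remain in the backing store and are reloaded on the next `get`.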
Distributed caching with Async writes to DB
• Buffer the DB from a high write rate
• Writes can be enqueued in memory redundantly on multiple nodes
• Or also be persisted to disk on each node
• Batches can be conflated and written to the DB
• Pattern for “high ingest” into a data warehouse
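Conflation of queued writes can be illustrated with a small plain-Java sketch. The class `ConflatingWriteBehindQueue` is a hypothetical name, not a Geode type, and the choice to keep a conflated key at its original queue position is an assumption made for simplicity.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a write-behind queue that conflates writes per key.
class ConflatingWriteBehindQueue<K, V> {
    // Insertion-ordered map: one pending write per key, latest value wins.
    private final LinkedHashMap<K, V> pending = new LinkedHashMap<>();

    // Enqueue a write; a later write to the same key replaces the earlier value
    // but keeps the key's original position in the queue.
    synchronized void enqueue(K key, V value) {
        pending.put(key, value);
    }

    // Remove and return up to batchSize conflated writes, oldest key first.
    synchronized List<Map.Entry<K, V>> drainBatch(int batchSize) {
        List<Map.Entry<K, V>> batch = new ArrayList<>();
        Iterator<Map.Entry<K, V>> it = pending.entrySet().iterator();
        while (it.hasNext() && batch.size() < batchSize) {
            Map.Entry<K, V> e = it.next();
            batch.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            it.remove();
        }
        return batch;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

A background writer would call `drainBatch` periodically and apply each batch to the database in one round trip, so three updates to the same key cost a single database write.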
Real-time Analytics
• Data stored within Geode in a “sliding window”
• Geode map-reduce style in-memory analytics can be performed with data locality
– Ex: Violation of known trading patterns
• Benefit: Early-warning indicators can be identified faster than waiting for analysis on just Pivotal HD
• Benefit: Real-time analytics can better influence what kind of big data analytics need to be performed
[Diagram: Geode holds a sliding window of data, runs real-time analytics and raises alerts; micro-batches flow from Geode to Pivotal HD, where analysis tools run; real-time results influence the big data analysis.]
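A “sliding window” with a simple early-warning check can be sketched as follows. This is only an illustration in plain Java: `SlidingWindowCounter` is a hypothetical name, and the sketch assumes events arrive with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keeps a running total over the most recent time window.
class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<long[]> events = new ArrayDeque<>(); // each element: {timestamp, amount}
    private long total;

    SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Record an event, drop events that fell out of the window,
    // and return the total for the current window.
    long add(long timestampMillis, long amount) {
        events.addLast(new long[] {timestampMillis, amount});
        total += amount;
        long cutoff = timestampMillis - windowMillis;
        while (!events.isEmpty() && events.peekFirst()[0] <= cutoff) {
            total -= events.removeFirst()[1];
        }
        return total;
    }

    // Early-warning check against a threshold chosen by the caller.
    boolean exceeds(long threshold) {
        return total > threshold;
    }
}
```

Because only the window is kept in memory, an alert can fire as soon as the threshold is crossed, while the full history goes to the batch system for deeper analysis.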
What’s Next
Geode Roadmap for 2015
• HDFS integration
• Off-heap storage
• Spark integration
• Lucene indexing
• Distributed transactions
Why OSS, Why Apache?
Why OSS? Why Now? Why Apache?
• Open source software is fundamentally changing buying patterns
– Developers have to endorse product selection (no longer a CIO handshake)
– Community endorsement is key to product visibility
– Open source credentials attract the best developers
– Vendor credibility is directly tied to the street credibility of the product
• Align with the tides of history
– Customers increasingly ask to participate in product development
– Résumé-driven development forces customers to consider OSS products
– Allow product development to happen with full transparency
• Apache is where you go to build open source street cred
– Transparent meritocracy that puts developers in charge
– Roman keeps shouting “Apache!” every few hours
Geode Will Be A Significant Apache Project
• Over 1,000 person-years invested in cutting-edge R&D
• Thousands of production customers in very demanding verticals
• Cutting-edge use cases that have shaped product thinking
• Tens of thousands of distributed, scaled-up tests that can randomize every aspect of the product
• A core technology team that has stayed together since founding
• Performance differentiators that are baked into every aspect of the product
Pivotal Confidential–Internal Use Only
Transactions
Swapnil Bawaskar
Geode Transactions
• Across multiple entries and regions
• Full ACID
• Isolation level: Repeatable Read
• JTA
– Last Resource
– Provider
• Optimistic: conflict detection rather than locks
• Faster than doing individual operations
• Ability to suspend and resume
• Work on colocated data
Usage
• CacheTransactionManager provides methods to begin, commit, rollback, suspend, and resume.
• E.g.
– CacheTransactionManager txMgr = cache.getCacheTransactionManager();
– txMgr.begin();
– region1.put(k1, v1);
– region2.get(k2);
– region2.put(k2, v2);
– txMgr.commit();
• Single-entry operations are supported via ConcurrentMap methods
– putIfAbsent(K, V)
– replace(K, V, V)
– remove(K, V)
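The three single-entry operations behave as compare-and-set primitives. The sketch below shows their semantics using `java.util.concurrent.ConcurrentHashMap` as a stand-in for a region (Geode's `Region` interface extends `ConcurrentMap`); the class name `SingleEntryOps` and the keys and values are illustrative only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: ConcurrentMap semantics with a ConcurrentHashMap standing in for a region.
class SingleEntryOps {
    static ConcurrentMap<String, String> demo() {
        ConcurrentMap<String, String> region = new ConcurrentHashMap<>();

        region.putIfAbsent("k1", "v1");      // creates k1 because it is absent
        region.putIfAbsent("k1", "other");   // no effect: k1 already exists

        region.replace("k1", "v1", "v2");    // succeeds: current value matches "v1"
        region.replace("k1", "v1", "v3");    // no effect: current value is now "v2"

        region.put("k2", "v2");
        region.remove("k2", "stale");        // no effect: value does not match
        region.remove("k2", "v2");           // succeeds: value matches

        return region;                       // only k1 -> v2 remains
    }
}
```

Each call is atomic for its one entry, which is why these operations need no explicit begin and commit.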
Implementation
• Repeatable Read → ThreadLocal
• At commit()
– Grab a d-lock (distributed lock) on the key set. (Transactions with different key sets can still execute concurrently.)
– Conflict detection → reference checks
– Send the commit set to all replicas, with no ack
– Send a commit message
– Recipients apply the commit only on getting the second message, and keep track of the last few transactions
• Failure scenarios
– Replica fails → no problem; it will do a GII (get initial image) operation when it starts up again
– Coordinator fails → replicas gossip to arrive at the outcome of the transaction
– If no member has the commit message, some members may be missing the commit set: abort the transaction
– If at least one member has the commit message, all members have the commit set: apply the transaction
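Optimistic commit with conflict detection by reference checks can be sketched in a single JVM. This is a simplified illustration, not Geode's implementation: `OptimisticTx` is a hypothetical class, a `synchronized` block stands in for the distributed lock on the key set, and only written keys are checked for conflicts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-JVM sketch of an optimistic transaction.
class OptimisticTx<K, V> {
    private final Map<K, V> store;
    private final Map<K, V> seen = new HashMap<>();    // value references observed at first access
    private final Map<K, V> writes = new HashMap<>();  // buffered changes (the "commit set")

    OptimisticTx(Map<K, V> store) {
        this.store = store;
    }

    private void remember(K key) {
        if (!seen.containsKey(key)) {
            seen.put(key, store.get(key));
        }
    }

    // Repeatable read: later reads return what this transaction saw or wrote first.
    V get(K key) {
        if (writes.containsKey(key)) {
            return writes.get(key);
        }
        remember(key);
        return seen.get(key);
    }

    void put(K key, V value) {
        remember(key);
        writes.put(key, value);
    }

    // Returns true if the commit set was applied, false on conflict.
    boolean commit() {
        synchronized (store) {   // stands in for the d-lock on the key set
            for (K key : writes.keySet()) {
                // Reference check: has another commit replaced the value we saw?
                if (store.get(key) != seen.get(key)) {
                    return false;
                }
            }
            store.putAll(writes);
            return true;
        }
    }
}
```

No locks are held while the transaction runs; the cost of a conflict is paid only at commit, by the transaction that loses the race.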
Thanks!
© Copyright 2015 Pivotal. All rights reserved.
Geode Demo
Social Network
People Region (partitioned)
Person
– Name: String
– Description: String
Post Region (partitioned)
Post
– Id: PostId(name, date)
– Text: String
Partition put
[Diagram: a client issues “Put LOL”. The key is hashed to a bucket, #(LOL)=1. Servers 1 to 3 host two copies each of Bucket 1 and Bucket 2.]
Partition put (continued)
[Diagram: the entry LOL is stored in one copy of its bucket and then replicated to the secondary copy on another server. Bucket 2 is unchanged.]
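The routing step in the two diagrams (hash the key to a bucket, write the primary copy, replicate to the secondary) can be sketched like this. The class `BucketRouter` and its round-robin placement are assumptions for illustration; Geode's actual bucket placement is not shown on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of key-to-bucket and bucket-to-server routing.
class BucketRouter {
    // Map a key to a bucket id in [0, totalBuckets).
    static int bucketFor(Object key, int totalBuckets) {
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    // Assumed placement: the primary lives on server (bucketId mod serverCount),
    // redundant copies on the servers that follow. Requires redundantCopies < serverCount.
    static List<Integer> serversFor(int bucketId, int serverCount, int redundantCopies) {
        List<Integer> servers = new ArrayList<>();
        for (int i = 0; i <= redundantCopies; i++) {
            servers.add((bucketId + i) % serverCount);
        }
        return servers;   // index 0 is the primary, the rest are secondaries
    }
}
```

A put then goes to `serversFor(bucketFor(key, n), ...).get(0)` and is forwarded to the remaining servers in the list, so losing one server still leaves a copy of every bucket.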
“User” Use Case – Save Objects
public interface PersonRepository extends CrudRepository<Person, String> {
}
@Autowired
PersonRepository people;
@Autowired
PostRepository posts;
// Instance method rather than a static main, so the injected repositories are in scope.
// The parameter types are assumptions; the slide does not show them.
public void save(String name, Date date, String text) {
    people.save(new Person(name));
    posts.save(new Post(new PostId(name, date), text));
}
Nested Objects, Compound Keys
“User” Use Case – Save Objects (continued): the saved objects are automatically serialized with PDX.
Configuration
<bean id="pdxSerializer"
class="com.gemstone.gemfire.pdx.ReflectionBasedAutoSerializer">
<constructor-arg value="io.pivotal.happysocial.model.*"/>
</bean>
<gfe:cache pdx-serializer-ref="pdxSerializer"/>
<gfe:partitioned-region id="people" copies="1"/>
Data Analyst – Determine Sentiment
• Find all of the posts for a user
• Analyze their content
First try – Just use a Query
public interface PostRepository extends
        GemfireRepository<Post, PostId> {
    @Query("select * from /posts where id.person=$1")
    public Collection<Post> findPosts(String personName);
}
Collection<Post> posts = postRepository.findPosts(personName);
String sentiment = sentimentAnalyzer.analyze(posts);
First try – Just use a Query (continued): the query reaches into nested objects (id.person).
Use an Index
<gfe:index id="postAuthor" expression="id.person" from="/posts"/>
Still could be more efficient
[Diagram: a client queries Servers 1 to 3. Posts are spread across the servers without regard to author: “Joe: LOL!!” (twice), “EJ: arrg”, “Maya: Hii”, “Jess: sup”, “Jess: ok”.]
• Hitting multiple nodes
• Bringing too much data to the client
Colocate the data
[Diagram: a client and Servers 1 to 3. Posts by the same person now sit together: “Joe: LOL!!” and “Joe: LOL!!”; “EJ: arrg” and “Maya: Hii”; “Jess: sup” and “Jess: ok”.]
<gfe:partitioned-region id="posts" copies="1"
colocated-with="people">
<gfe:partition-resolver ref="partitionResolver"/>
</gfe:partitioned-region>
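The job of the partition resolver referenced in this configuration is to route a post by its author rather than by its full compound key, so that a person's posts land in the same bucket as the person. The sketch below shows that idea in plain Java; `PostKey` and `ColocationDemo` are hypothetical names, and the hashing is a simplification of what a real resolver and Geode would do.

```java
import java.util.Objects;

// Hypothetical compound key: who posted and when.
class PostKey {
    final String person;
    final long date;

    PostKey(String person, long date) {
        this.person = person;
        this.date = date;
    }

    // Routing object: only the author decides the bucket, not the date.
    Object routingObject() {
        return person;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PostKey)) {
            return false;
        }
        PostKey other = (PostKey) o;
        return person.equals(other.person) && date == other.date;
    }

    @Override
    public int hashCode() {
        return Objects.hash(person, date);
    }
}

class ColocationDemo {
    static int bucketFor(Object routingObject, int totalBuckets) {
        return Math.floorMod(routingObject.hashCode(), totalBuckets);
    }

    // People are routed by name.
    static int bucketForPerson(String name, int totalBuckets) {
        return bucketFor(name, totalBuckets);
    }

    // Posts are routed by the author's name, so they share the person's bucket.
    static int bucketForPost(PostKey key, int totalBuckets) {
        return bucketFor(key.routingObject(), totalBuckets);
    }
}
```

With this routing, all of Joe's posts hash to Joe's bucket regardless of date, which is what allows a function to run against one server per person.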
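Send behavior to data
[Diagram: the client executes the function getSentiment with the filter “Joe, Jess”. The call is routed only to the servers that host those people's posts: “Execute on Joe” on the server holding “Joe: LOL!!” and “Joe: LOL!!”, and “Execute on Jess” on the server holding “Jess: sup” and “Jess: ok”. The server holding “EJ: arrg” and “Maya: Hii” is not involved.]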
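The scatter-gather pattern in this diagram can be sketched with a thread pool standing in for the servers. This is an illustration only: `ScatterGather` is a hypothetical class, each server is modeled as a map from person to posts, and a post count stands in for the sentiment analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of data-aware function execution with scatter-gather.
class ScatterGather {
    // servers: one map per server, person -> that person's locally stored posts.
    static List<String> countPosts(List<Map<String, List<String>>> servers, Set<String> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (Map<String, List<String>> local : servers) {
                // Scatter: each server evaluates only the filter keys it hosts.
                pending.add(pool.submit(() -> {
                    List<String> partial = new ArrayList<>();
                    for (String person : filter) {
                        List<String> posts = local.get(person);
                        if (posts != null) {
                            partial.add(person + ":" + posts.size());
                        }
                    }
                    return partial;
                }));
            }
            // Gather: merge the small per-server results on the caller's side.
            List<String> results = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                results.addAll(f.get());
            }
            Collections.sort(results);
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("function execution failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the per-person summaries cross the network in this model, not the posts themselves. For simplicity the sketch submits work to every server; the diagram shows the call being routed only to servers that host the filter keys.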
Sample Function – Client Side
@Component
@OnRegion(region = "posts")
public interface FunctionClient {
public List<SentimentResult> getSentiment(@Filter Set<String> people);
}
Sample Function – Server Side
@GemfireFunction(HA=true)
public SentimentResult getSentiment(Region<PostId, Post> localPosts,
        @Filter Set<String> personNames)
        throws Exception {
    // The filter routes the call to the member hosting this person's posts.
    String personName = personNames.iterator().next();
    // Query only the locally hosted data for that person.
    Collection<Post> posts = localPosts.query("id.person='" + personName + "'");
    String sentiment = sentimentAnalyzer.analyze(posts);
    return new SentimentResult(sentiment, personName);
}
Demo
Highly Available Asynchronous Events
[Diagram: puts of “LOL!!” and “sup” are enqueued in a primary queue and mirrored in a secondary queue on another member.]
Colocated, Parallel Delivery
[Diagram: puts of “LOL!!” and “sup” are enqueued in per-partition queues: a primary queue and a secondary queue for Partition 1, and a primary queue for Partition 2.]
Shared Nothing Persistence
[Diagram: “Put k6->v6” is applied on two members (k6->v6, k6->v6), and each member appends to its own operation log. One log reads: Create k1->v1, Create k2->v2, Modify k1->v3, Create k4->v4, Modify k1->v5, Create k6->v6. The other shows: Modify k1->v5, Create k6->v6.]
Operation logs with compaction
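An append-only operation log with compaction can be sketched in a few lines. This is a simplified in-memory model, not Geode's on-disk format: `OperationLog` is a hypothetical class, records are kept in a list instead of a file, and destroy operations are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an append-only operation log with compaction.
class OperationLog {
    private final List<String[]> records = new ArrayList<>(); // each record: {op, key, value}

    // Every create or modify is appended; nothing is updated in place.
    void append(String op, String key, String value) {
        records.add(new String[] {op, key, value});
    }

    // Recovery: replay the log in order; the last record per key wins.
    Map<String, String> recover() {
        Map<String, String> state = new LinkedHashMap<>();
        for (String[] r : records) {
            state.put(r[1], r[2]);
        }
        return state;
    }

    // Compaction: rewrite the log so it holds one record per live key.
    void compact() {
        Map<String, String> live = recover();
        records.clear();
        for (Map.Entry<String, String> e : live.entrySet()) {
            records.add(new String[] {"Create", e.getKey(), e.getValue()});
        }
    }

    int size() {
        return records.size();
    }
}
```

Replaying the six records from the diagram yields k1->v5, k2->v2, k4->v4 and k6->v6, and compaction shrinks the log from six records to four without changing the recovered state.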
GemFire (Geode) 3.5-4.5X Faster Than Cassandra for YCSB

More Related Content

What's hot

Gemfire
GemfireGemfire
Gemfire
FNian
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
DataWorks Summit
 

What's hot (20)

Building Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache GeodeBuilding Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache Geode
 
Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XD
Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XDScale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XD
Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XD
 
An Introduction to Apache Geode (incubating)
An Introduction to Apache Geode (incubating)An Introduction to Apache Geode (incubating)
An Introduction to Apache Geode (incubating)
 
Gemfire
GemfireGemfire
Gemfire
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache Geode
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
 
Apache Geode Meetup, London
Apache Geode Meetup, LondonApache Geode Meetup, London
Apache Geode Meetup, London
 
Apache geode
Apache geodeApache geode
Apache geode
 
Introducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFireIntroducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFire
 
EDB Postgres & Tools in a Smart City Project
EDB Postgres & Tools in a Smart City ProjectEDB Postgres & Tools in a Smart City Project
EDB Postgres & Tools in a Smart City Project
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
Remote DBA Service: Powering your DBA needs
Remote DBA Service: Powering your DBA needsRemote DBA Service: Powering your DBA needs
Remote DBA Service: Powering your DBA needs
 
Best Practices & Lessons Learned from Deployment of PostgreSQL
 Best Practices & Lessons Learned from Deployment of PostgreSQL Best Practices & Lessons Learned from Deployment of PostgreSQL
Best Practices & Lessons Learned from Deployment of PostgreSQL
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
Not all open source is the same
Not all open source is the sameNot all open source is the same
Not all open source is the same
 
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
 
New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 

Viewers also liked

Zettaset Elastic Big Data Security for Greenplum Database
Zettaset Elastic Big Data Security for Greenplum DatabaseZettaset Elastic Big Data Security for Greenplum Database
Zettaset Elastic Big Data Security for Greenplum Database
PivotalOpenSourceHub
 

Viewers also liked (7)

Virtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFireVirtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFire
 
Zettaset Elastic Big Data Security for Greenplum Database
Zettaset Elastic Big Data Security for Greenplum DatabaseZettaset Elastic Big Data Security for Greenplum Database
Zettaset Elastic Big Data Security for Greenplum Database
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Mix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueMix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation Numérique
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 

Similar to Geode Meetup Apachecon

EMC Pivotal overview deck
EMC Pivotal overview deckEMC Pivotal overview deck
EMC Pivotal overview deck
mister_moun
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
EMC
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 

Similar to Geode Meetup Apachecon (20)

Open Sourcing GemFire - Apache Geode
Open Sourcing GemFire - Apache GeodeOpen Sourcing GemFire - Apache Geode
Open Sourcing GemFire - Apache Geode
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
EMC Pivotal overview deck
EMC Pivotal overview deckEMC Pivotal overview deck
EMC Pivotal overview deck
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed ServiceCloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-Service
 
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Best Practices for a Complete Postgres Enterprise Architecture Setup
Best Practices for a Complete Postgres Enterprise Architecture SetupBest Practices for a Complete Postgres Enterprise Architecture Setup
Best Practices for a Complete Postgres Enterprise Architecture Setup
 
InfoSphere BigInsights
InfoSphere BigInsightsInfoSphere BigInsights
InfoSphere BigInsights
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
 
The Modern Database for Enterprise Applications
The Modern Database for Enterprise ApplicationsThe Modern Database for Enterprise Applications
The Modern Database for Enterprise Applications
 
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
206450 primavera gateway
206450 primavera gateway206450 primavera gateway
206450 primavera gateway
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...

Geode Meetup Apachecon

  • 1. © Copyright 2013 Pivotal. All rights reserved. Open Source Core GemFire: Introducing Project Geode
  • 2. Agenda
      - Intro
          - History, Use Cases, Customers, 2015 Roadmap
          - Architecture Overview
      - Why OSS, Why Apache
      - Southwest experience
      - Code walkthrough/deep dive
          - Build/source code
          - PDX serialization
          - Transactions
          - Persistence & GII
      - Demo
  • 3. Geode Team Members in the room
      Name | Title | Years with Technology
      Catherine Johnson | Product Manager | 16 years (GemFire, Coherence)
      Anthony Baker | Software Engineer | 3 years (GemFire)
      Roman Shaposhnik | Director of OS Pivotal | 3 years (in-memory grids)
      Greg Chase | Director of Community | 20 years (Poet, SAP, GemFire)
      Dan Smith | Software Engineer | 7 years (GemFire)
      Jens Deppe | Software Engineer | 4 years (GemFire)
      Swapnil Bawaskar | Software Engineer | 7 years (GemFire)
      William Markito | Enterprise Architect | 6 years (GemFire, Coherence)
  • 4. Our GemFire Journey Over The Years [timeline: 2004, 2008, 2014; Hybrid Transactional/Analytics grids]
      - Massive increase in data volumes
      - Falling margins per transaction
      - Increasing cost of IT maintenance
      - Need for elasticity in systems
      - Financial Services Providers (every major Wall Street bank)
      - Department of Defense
      - Real-time response needs
      - Time to market constraints
      - Need for flexible data models across enterprise
      - Distributed development
      - Persistence + In-memory
      - Global data visibility needs
      - Fast ingest needs for data
      - Need to allow devices to hook into enterprise data
      - Always on
      - Largest travel portal
      - Airlines
      - Trade clearing
      - Online gambling
      - Largest telcos
      - Large manufacturers
      - Largest payroll processor
      - Auto insurance giants
      - Largest rail systems on earth
  • 5. Big Data Apps at Scale Have Unique Needs. Project Geode is the distributed, NoSQL, in-memory database for big data apps that need:
      1. Scale-out performance
      2. Consistent database operations across nodes
      3. High availability, resilience, and elasticity
      4. Powerful developer features
      5. Easy administration of distributed nodes
  • 6. 1. Scale-Out Performance: China Railway Corporation
      "The system is operating with solid performance and uptime. Now, we have a reliable, economically sound production system that supports record volumes and has room to grow." Dr. Jiansheng Zhu, Vice Director of China Academy of Railway Sciences
      - 4.5 million ticket purchases & 20 million users per day
      - Spikes of 15,000 tickets sold per minute, 40,000 visits per second
      - In-memory storage
      - Optimized data distribution
      - Elastic, linear scalability [chart: Ops/Sec vs. Nodes]
  • 7. 2. Consistent Database Operations Across Globally Distributed Nodes
      - Indexing, triggers, event notification
      - Performance-optimized persistence
      - Configurable consistency: Partitioned, Replicated, Disabled
      - Distributed queries & regional functions
      "Our global deployment of Geode's distributed cache gives me a single version of the trade, resolving hard-to-test-for synchronization issues that exist within any globally distributed business application architecture." Michael Benillouche, Global Head of Data Management
  • 8. 3. High Availability and Resilience
      "We can track and collect money at our 4,000+ kiosks and branches, even without a reliable Internet connection. Geode provides the core data grid and a significant amount of related functionality to help us handle this unreliable network problem." Gustavo Valdez, Chief of Architecture and Development
      - 19 million payment transactions per month
      - 4,000+ points of sale with intermittent Internet connectivity
      - Cluster resilience & failover
  • 9. 4. Powerful Developer Features
      - Data structures:
          - User-defined objects
          - Complex object graphs
          - Documents (JSON)
      - Schema versioning
          - Multiple application versions can run simultaneously against the same data nodes
      - APIs
          - Java: HashMap
          - Spring Data GemFire
          - Serialization APIs
      - Minimal to no code changes:
          - Web app session state caching
          - L2 Hibernate
          - Memcached
      - Powerful application functions:
          - Data-aware functions
          - Scatter-gather functions
          - Object Query Language (OQL)
          - Publish & subscribe & continuous query event framework
          - Reliable asynchronous event queues
  • 10. 5. Easy Administration of Distributed Data Grids
      - Auto tuning of distributed computing resources to optimize performance
      - Cluster monitoring dashboard
          - Cluster and node status & performance
      - Offline performance statistics analysis tool
          - View historical logs and events to diagnose performance and resource bottlenecks
      - Command-line tools for easy automation and scripting of administrative tasks
  • 11. Deployment Flexibility for In-Memory Apps [diagram: three topologies. Embedded: a WEB SERVER with a GEO CACHE. Embedded, Clustered: WEB SERVERs with GEO PEERs. Tiered, Clustered: WEB SERVERs with GEO CLIENTs connected to GEO SERVERs. Benefit callouts accumulate across the topologies: Flexibility; Scale; Performance; Availability; Localization]
  • 12. Difference between Geode and GemFire
      - Native clients beyond Java
          - C++
          - C#
      - WAN connectivity between clusters
      - Continuous Queries from clients
  • 13. Geode High Level Architecture
  • 14. Horizontal Scaling for Geode Reads with Consistent Latency and CPU
      - Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
      - Partitioned region with redundancy and 1K data size
  • 15. Basic Design Patterns
  • 16. "Low touch" Usage Patterns
      - HTTP session management: simple template for TCServer, TC, App servers; shared nothing persistence, global session state
      - Hibernate L2 Cache plugin: set the cache in hibernate.cfg.xml; support for query and entity caching
      - Memcached protocol: servers understand the memcached wire protocol; use any memcached client
      - Spring Cache Abstraction: <bean id="cacheManager" class="org.springframework.data.gemfire.support.GemfireCacheManager"
  • 17. As an embedded, clustered Java database
      - Just deploy a JAR or WAR into clustered app nodes
      - Just like H2 or Derby, except data can be synced with a DB and is partitioned or replicated across the cluster
      - Low cost and easy to manage
  • 18. As a scalable OLTP data store
      - Shared nothing persistence to disk
      - Backup and recovery
      - No database to configure and be throttled by
  • 19. To process app behavior in parallel
      - Map-reduce, but based on simpler RPC
  • 20. "Write thru" distributed caching
      - Pre-load using DDLUtils for queries
      - Lazily load using "RowLoader" for PK queries
      - Configure LRU eviction or expiry for large data
      - "Write thru": participate in container transaction
  • 21. Distributed caching with async writes to DB
      - Buffer high write rate from DB
      - Writes can be enqueued in memory redundantly on multiple nodes
      - Or, also be persisted to disk on each node
      - Batches can be conflated and written to DB
      - Pattern for "high ingest" into Data Warehouse
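
The write-behind pattern on slide 21 can be illustrated with a small stand-alone sketch. This is not Geode's asynchronous event queue API; `ConflatingWriteBehindQueue`, `enqueue`, and `drainBatch` are hypothetical names, and the sketch keeps everything in one JVM with no redundancy or disk persistence. It only shows what "conflated batches" means: several pending updates to the same key collapse into the latest value before the batch is written to the database.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy write-behind buffer (hypothetical, not a Geode class): conflates per key, drains in batches. */
final class ConflatingWriteBehindQueue<K, V> {
    // Insertion-ordered map: a later update to a pending key overwrites its value in place.
    private final Map<K, V> pending = new LinkedHashMap<>();

    /** Enqueue an update; a newer value for an already pending key replaces the older one. */
    synchronized void enqueue(K key, V value) {
        pending.put(key, value);
    }

    /** Remove and return up to maxBatch pending updates, e.g. for one batched database write. */
    synchronized List<Map.Entry<K, V>> drainBatch(int maxBatch) {
        List<Map.Entry<K, V>> batch = new ArrayList<>();
        Iterator<Map.Entry<K, V>> it = pending.entrySet().iterator();
        while (it.hasNext() && batch.size() < maxBatch) {
            Map.Entry<K, V> e = it.next();
            // Copy the entry so it stays valid after removal from the map.
            batch.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            it.remove();
        }
        return batch;
    }

    /** Number of distinct keys still waiting to be written. */
    synchronized int size() {
        return pending.size();
    }
}
```

A caller would invoke `enqueue` on every cache write and have a background thread call `drainBatch` periodically; the database then sees one row per key per batch rather than every intermediate value.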
  • 22. Real-time Analytics
      - Data stored within Geode in a "sliding window"
      - Geode map-reduce style in-memory analytics can be performed with data locality
          - Ex: violation of known trading patterns
      - Benefit: early-warning indicators can be identified faster than waiting for analysis on just Pivotal HD
      - Benefit: real-time analytics can better influence what kind of big data analytics need to be performed
      [diagram: Geode holds the sliding window and runs real-time analytics that raise alerts; micro-batches flow to Pivotal HD and analysis tools; real-time results influence the big data analysis]
  • 23. What's Next
  • 24. Geode Roadmap for 2015
      - HDFS Integration
      - Off Heap Storage
      - Spark Integration
      - Lucene Indexing
      - Distributed Transactions
  • 25. Why OSS, Why Apache?
  • 26. Why OSS? Why Now? Why Apache?
      - Open Source Software is fundamentally changing buying patterns
          - Developers have to endorse product selection (no longer CIO handshake)
          - Community endorsement is key to product visibility
          - Open source credentials attract the best developers
          - Vendor credibility directly tied to street credibility of product
      - Align with the tides of history
          - Customers increasingly asking to participate in product development
          - Resume driven development forces customers to consider OSS products
          - Allow product development to happen with full transparency
      - Apache is where you go to build Open Source street cred
          - Transparent meritocracy which puts developers in charge
          - Roman keeps shouting "Apache!" every few hours
  • 27. Geode Will Be A Significant Apache Project
      - Over 1,000 person-years invested into cutting edge R&D
      - Thousands of production customers in very demanding verticals
      - Cutting edge use cases that have shaped product thinking
      - Tens of thousands of distributed, scaled-up tests that can randomize every aspect of the product
      - A core technology team that has stayed together since founding
      - Performance differentiators that are baked into every aspect of the product
  • 28. Pivotal Confidential - Internal Use Only. Transactions (Swapnil Bawaskar)
  • 29. Geode Transactions
      - Across multiple Entries and Regions
      - Full ACID
      - Isolation level: Repeatable Read
      - JTA
          - Last Resource
          - Provider
      - Optimistic, conflict detection rather than locks
      - Faster than doing individual operations
      - Ability to suspend and resume
      - Work on colocated data
  • 30. Usage
      - CacheTransactionManager provides methods to begin, commit, rollback, suspend, resume.
      - E.g.
          CacheTransactionManager txMgr = cache.getCacheTransactionManager();
          txMgr.begin();
          Region1.put(k1, v1);
          Region2.get(k2);
          Region2.put(k2, v2);
          txMgr.commit();
      - Single entry operations supported via ConcurrentMap methods
          - putIfAbsent(K, V)
          - replace(K, V, V)
          - remove(K, V)
  • 31. Implementation
      - Repeatable Read
      - ThreadLocal
      - At commit()
          - Grab a d-lock on the key set (transactions with a different key set can still execute concurrently)
          - Conflict detection
              - Reference checks
          - Send the commit set to all replicas, no ack
          - Send a commit message
          - Recipients apply the commit only on getting the second message and keep track of the last few transactions
      - Failure scenarios
          - Replica fails
              - No problem, it will do a GII operation when it starts up again
          - Coordinator fails
              - Replicas gossip to arrive at the outcome of the transaction
                  - If no member has the commit message, some members may be missing the commit set: abort the transaction
                  - If at least one member has the commit message, all members have the commit set: apply the transaction
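
To make the "optimistic, conflict detection rather than locks" idea from slides 29 to 31 concrete, here is a minimal single-JVM sketch. It is not Geode's implementation: `OptimisticTx` is a hypothetical class, a `synchronized` block stands in for the distributed lock on the key set, and replication, the two-message commit, and failure handling are left out. It shows buffered writes, repeatable reads from a per-transaction snapshot, and a reference check at commit time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy optimistic transaction over one in-memory map (hypothetical, values must be non-null). */
final class OptimisticTx<K, V> {
    private final ConcurrentHashMap<K, V> store;
    // Value reference seen when the transaction first touched a key (null means "absent").
    private final Map<K, V> readSet = new HashMap<>();
    // Writes are buffered here and stay invisible to others until commit.
    private final Map<K, V> writeSet = new HashMap<>();

    OptimisticTx(ConcurrentHashMap<K, V> store) {
        this.store = store;
    }

    private void snapshot(K key) {
        if (!readSet.containsKey(key)) {
            readSet.put(key, store.get(key));
        }
    }

    /** Repeatable read: returns this transaction's own write, else the first value it saw. */
    V get(K key) {
        if (writeSet.containsKey(key)) {
            return writeSet.get(key);
        }
        snapshot(key);
        return readSet.get(key);
    }

    void put(K key, V value) {
        snapshot(key); // remember which version this write is based on
        writeSet.put(key, value);
    }

    /** Returns true if applied; false if another commit changed a touched key (nothing is applied). */
    boolean commit() {
        synchronized (store) { // stand-in for the distributed lock on the key set
            for (Map.Entry<K, V> e : readSet.entrySet()) {
                if (store.get(e.getKey()) != e.getValue()) { // reference check, not equals()
                    return false;
                }
            }
            store.putAll(writeSet);
            return true;
        }
    }
}
```

With two transactions that both write the same key, the first `commit()` succeeds and the second returns `false`, which is the point at which a real application would retry or report the conflict.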
  • 32. Thanks!
  • 33. © Copyright 2015 Pivotal. All rights reserved. Geode Demo
  • 34. Social Network
      - People region (partitioned): Person { Name: String, Description: String }
      - Post region (partitioned): Post { Id: PostId(name, date), Text: String }
  • 35. Partition put [diagram: the client puts LOL; #(LOL)=1, so the entry is routed to Bucket 1; Servers 1 to 3 host copies of Bucket 1 and Bucket 2]
  • 36. Partition put [diagram: LOL is stored in Bucket 1 and replicated to the secondary copy; Servers 1 to 3 also host Bucket 2]
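
The two "Partition put" slides describe hashing a key to a bucket and then replicating the entry to a secondary copy. The sketch below illustrates that routing idea only. `BucketRouter` is a hypothetical class; the modulo hashing and the "next server holds the secondary" placement are assumptions made for the example, not Geode's actual bucket assignment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy key-to-bucket-to-server routing (hypothetical, not Geode's placement algorithm). */
final class BucketRouter {
    private final int numBuckets;
    private final List<String> servers;

    BucketRouter(int numBuckets, List<String> servers) {
        if (numBuckets <= 0 || servers.size() < 2) {
            throw new IllegalArgumentException("need at least 1 bucket and 2 servers");
        }
        this.numBuckets = numBuckets;
        this.servers = new ArrayList<>(servers);
    }

    /** Hash the key into a bucket; floorMod keeps the result non-negative. */
    int bucketFor(Object key) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    /** Assumption for this sketch: buckets are spread round-robin over the servers. */
    String primaryFor(int bucket) {
        return servers.get(bucket % servers.size());
    }

    /** Assumption for this sketch: the redundant copy lives on the next server. */
    String secondaryFor(int bucket) {
        return servers.get((bucket + 1) % servers.size());
    }
}
```

A put would be sent to `primaryFor(bucketFor(key))` and then copied to `secondaryFor(...)`, so that losing one server does not lose the entry.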
  • 37. "User" Use Case - Save Objects (callout: Nested Objects, Compound Keys)
      public interface PersonRepository extends CrudRepository<Person, String> { }

      @Autowired PersonRepository people;
      public static void main(String[] args) {
        people.save(new Person(name));
        posts.save(new Post(new PostId(name, date), text));
      }
  • 38. "User" Use Case - Save Objects (callout: Automatically Serialized With PDX; same code as slide 37)
  • 39. Configuration
      <bean id="pdxSerializer" class="com.gemstone.gemfire.pdx.ReflectionBasedAutoSerializer">
        <constructor-arg value="io.pivotal.happysocial.model.*"/>
      </bean>
      <gfe:cache pdx-serializer-ref="pdxSerializer"/>
      <gfe:partitioned-region id="people" copies="1"/>
  • 40. Data Analyst - Determine Sentiment
      - Find all of the posts for a user
      - Analyze their content
  • 41. First try - Just use a Query
      public interface PostRepository extends GemfireRepository<Post, PostId> {
        @Query("select * from /posts where id.person=$1")
        public Collection<Post> findPosts(String personName);
      }

      Collection<Post> posts = postRepository.findPosts(personName);
      String sentiment = sentimentAnalyzer.analyze(posts);
  • 42. First try - Just use a Query (callout: Query Nested Objects; same code as slide 41)
  • 43. Use an Index
      <gfe:index id="postAuthor" expression="id.person" from="/posts"/>
  • 44. Still could be more efficient [diagram: posts "Joe: LOL!!", "Joe: LOL!!", "EJ: arrg", "Maya: Hii", "Jess: sup", "Jess: ok" are scattered across Servers 1 to 3; the client query is hitting multiple nodes and bringing too much data to the client]
  • 45. Colocate the data [diagram: each person's posts now sit together on one of Servers 1 to 3]
      <gfe:partitioned-region id="posts" copies="1" colocated-with="people">
        <gfe:partition-resolver ref="partitionResolver"/>
      </gfe:partitioned-region>
  • 46. Send behavior to data [diagram: the client executes function getSentiment on Joe and Jess; "Execute on Joe" and "Execute on Jess" each run on the server that holds that person's posts]
  • 47. Sample Function - Client Side
      @Component
      @OnRegion(region = "posts")
      public interface FunctionClient {
        public List<SentimentResult> getSentiment(@Filter Set<String> people);
      }
  • 48. Sample Function - Server Side
      @GemfireFunction(HA=true)
      public SentimentResult getSentiment(Region<PostId, Post> localPosts, @Filter Set<String> personNames) throws Exception {
        String personName = personNames.iterator().next();
        Collection<Post> posts = localPosts.query("id.person='" + personName + "'");
        String sentiment = sentimentAnalyzer.analyze(posts);
        return new SentimentResult(sentiment, personName);
      }
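
The "send behavior to data" slides rely on the grid routing a function to the members that hold the filtered keys and collecting one result per key. The sketch below imitates that scatter-gather flow with plain Java collections. It is not the Geode or Spring Data function execution API: `ScatterGather.execute` is a hypothetical helper, the nested map stands in for servers holding colocated posts, and execution is sequential rather than parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiFunction;

/** Toy data-aware function execution (hypothetical helper, sequential stand-in for a grid). */
final class ScatterGather {
    private ScatterGather() { }

    /**
     * serverData maps a server name to the (person -> posts) entries held locally on that server.
     * The function runs once per filtered person, on the server that holds that person's posts.
     */
    static <R> List<R> execute(Map<String, Map<String, List<String>>> serverData,
                               Set<String> filter,
                               BiFunction<String, List<String>, R> function) {
        List<R> results = new ArrayList<>();
        for (Map<String, List<String>> local : serverData.values()) { // "scatter" to each member
            for (String person : new TreeSet<>(filter)) {             // sorted for stable output
                List<String> posts = local.get(person);
                if (posts != null) {
                    // Only the small result travels back, not the posts themselves.
                    results.add(function.apply(person, posts));
                }
            }
        }
        return results; // "gather"
    }
}
```

Calling it with the filter {Joe, Jess} and a function that counts posts returns one small result per person, which mirrors why the slides prefer this over pulling every post to the client.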
  • 51. Demo
  • 52. Highly Available Asynchronous Events [diagram: a put of "LOL!!" and "sup" is enqueued on a Primary Queue and copied to a Secondary Queue]
  • 53. Colocated, Parallel Delivery [diagram: puts of "LOL!!" and "sup" go to Primary Queue (Partition 1) with Secondary Queue (Partition 1), and to Primary Queue (Partition 2)]
  • 54. Shared Nothing Persistence: operation logs with compaction [diagram: Put k6->v6 is applied in memory on each member and appended to that member's operation log; log entries shown: Create k1->v1, Create k2->v2, Modify k1->v3, Create k4->v4, Modify k1->v5, Create k6->v6]
  • 55. GemFire (Geode) 3.5-4.5X Faster Than Cassandra for YCSB
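
Slide 54's "operation logs with compaction" can be sketched as an append-only list of records that is periodically rewritten to drop superseded entries. This is a simplified illustration, not Geode's disk store format: `OpLog` is a hypothetical class, records live in memory instead of on disk, and deletes (tombstones) are not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only operation log with compaction (hypothetical, in-memory, no deletes). */
final class OpLog {
    private static final class Record {
        final String key;
        final String value;

        Record(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Record> records = new ArrayList<>();

    /** Every create or modify is appended; nothing is updated in place. */
    void append(String key, String value) {
        records.add(new Record(key, value));
    }

    /** Rewrite the log so that only the newest record per key survives. */
    void compact() {
        Map<String, Record> latest = new LinkedHashMap<>();
        for (Record r : records) {
            latest.remove(r.key); // re-insert so the survivor keeps its later log position
            latest.put(r.key, r);
        }
        records.clear();
        records.addAll(latest.values());
    }

    /** Replay the log in order to rebuild the in-memory state, as a member would on restart. */
    Map<String, String> recover() {
        Map<String, String> state = new LinkedHashMap<>();
        for (Record r : records) {
            state.put(r.key, r.value);
        }
        return state;
    }

    int size() {
        return records.size();
    }
}
```

Replaying the six operations from the slide and then compacting leaves four records, and `recover()` returns the same state before and after compaction, which is the property compaction has to preserve.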

Editor's Notes

  1. Explosion of user population, bursty nature of load, at a reasonable cost. Rising expectations of uptime. Disruptive innovators.
  2. For an enterprise perspective on real-time data: https://www.gesoftware.com/blog/can-memory-storage-solve-one-datas-greatest-challenges
     First, let's talk about consumer-driven applications. Applications today are really about serving this market. Historically, enterprise applications were truly the focus, where the interactions were not expected to be that interactive. With consumer-driven applications, you are really pushing the limits of instantaneous information to users, even going so far as to be predictive with what may be useful or appealing to them, to provide "value" to that consumer. What makes you different than your competitors in driving consumers to you is more about what extras you provide than the base service.
     Consumer-driven applications until recently were not so data driven. Data flowed through them, but they didn't provide information back, except in a pull fashion.
     Geode is the distributed NoSQL, in-memory database for big data apps that need:
       - Scale-out performance: as demand goes up and down, due to seasonal demand, flash types of events, or increasing data pipes, Geode can scale up and down with commodity hardware, providing predictable, linear scalability, without downtime.
       - Consistent database operations across globally distributed nodes: Geode focuses on data being consistent. This has been our bent from the beginning. The only thing we will sacrifice performance for is consistency. This is key to our customers that sell things like train tickets and stocks.
       - High availability, resilience, and global scale: Geode is intended for mission critical data and applications. Through a series of innovations, Geode can provide continuous availability of data at a global scale, through code changes, model changes, hardware changes, major version upgrades, and smoking hole disasters.
       - Powerful developer features: even though Geode is pure Java, it is accessible through many APIs. It also provides a rich event framework, allowing applications to subscribe to individual data events in Geode.
       - Easy administration of distributed nodes: Geode relies on you to say how to evenly distribute your data from a statistical perspective and takes care of the runtime implementation.
Geode manages partitions of data between nodes in a dynamic fashion, removing the chore of mapping partitions to specific nodes, slaves, and masters in a cluster. Geode manages that for you, along with recovery of nodes. Additionally, Geode provides tools to help you manage and understand the behavior of the system.    
  3. Please see the case study here. Talk about what they did first. Then go over the key points in this slide: http://blog.pivotal.io/pivotal/case-studies-2/china-railway-corp-for-chinese-new-year-chunyun
     Key stats:
       - Query time reduced from 15 seconds to 0.2 seconds, a 75x improvement (other customers have seen up to 100x)
       - Queries went from 3,500 per second to tens of thousands per second
       - Train tickets to major cities sold out in 20 seconds
       - Growth is about 20% per year
     Emphasize "room to grow" in this quote. Why was this so successful? A few things. CRC used a partitioned data set in memory. What this gave them was:
       - In-memory performance. Data is "persisted" to at least 2 nodes in memory, rather than to disk, saving I/O cost.
       - Ability to scale up and down as needed, maximizing infrastructure usage. They don't have to keep expensive dedicated hardware in place to accommodate peak usage.
       - Optimized data distribution. CRC defines how the data should be partitioned. Geode takes care of the physical implementation of this actively, optimizing the distribution of data as machines are added and removed.
     Geode is a shared nothing, in-memory architecture. Data can be persisted to disk or another system, but it is out of band with the transaction, using the network to "persist" the data. This allows Geode to perform at maximum speeds with consistency. As more nodes are added, Geode can be notified to take advantage of these new nodes based on current resource usage. This gives us predictable linear performance as we scale up and down to meet peaky demands, while keeping data absolutely consistent: we can't sell the same train ticket twice!
     Colocation of related data sets allows us to scale transactions to hundreds of thousands of concurrent transactions running in a single cluster. As we distribute this data, our ability to operate on it is also scaled out. Queries, events, and functions that the grid operates on are all done in a distributed, parallel fashion.
  4. Read case study here: http://www.pivotal.io/sites/default/files/Pivotal_vFabric_CS_Newedge_061213__0.pdf
     Quote: Group CIO, Alain Courbebaisse, says, "We have successfully implemented the most advanced post-trading platform of the clearing industry." As well as the strategic alignment, the NVision program resulted in a number of further, more tangible objectives being met:
       - Support for higher clearing volumes and a platform that is horizontally scalable as more markets are connected to it
       - Reduced time to market for new market adapters
       - Global cache provides a single version of the truth and removes synchronization issues created by latency-sensitive global flows
       - Creation of a globally consistent trade flow across regions and exchanges, serving the derivative and equity business seamlessly
     Significant improvement in client on-boarding time and client service:
       - Faster resolution window (investigate, resolve, and re-submit)
       - Able to resolve reference data issues once and propagate out
       - Replay capability removes manual re-keying
     Think about this: a trading platform, operating in markets all over the world. A few things:
       - These have to be FAST to minimize risk. Trades need to be executed as close to real time as possible in order to make sure that the actual price is as close to the agreed-to price as possible. The longer the delay, the more the possibility of drift between these prices, and the more risk is introduced.
       - The data must be consistent.
     Geode provides:
       - Performance-optimized persistence. "First line persistence" can be to other nodes in memory, making the speed of persistence as fast as the network will allow it to be. Disk can still be used for long-term storage, but your transaction in Geode is ACID compliant in memory. This takes disk I/O out of the critical path of the transaction, but still allows it to take place.
       - Consistency that can be configured to meet your performance and data requirements.
If you care less about accuracy and more about response time, you can configure data sets to those characteristics.
       - Distributed queries and regional functions. As queries and function calls are sent to the grid, their path is optimized as much as possible: they are sent in parallel, and they are routed directly to the nodes that hold the data. When you query the grid or execute functions on it, you can be assured that your client will get a consistent overall view of the data. Queries and logic are distributed for optimal performance, but always lean towards consistency.
       - Indexing, triggers, and event notifications are provided to react in real time to data as it is coming into the system.
       - Data can also be configured to propagate to other geographically dispersed clusters, to allow you to operate on the data as close to the "action" as possible and minimize the distance data has to travel.
     ---------
     We support persistence and make sure updates are consistent. We use replication for higher performance. Geode uses the concept of having redundant copies in memory to make the data more available. What this means is that instead of writing the data to disk to persist it, I will write it to one or more other nodes. If I am doing this on a reasonably fast network, I can "persist" the data in less than 1 ms. Data can be configured to write to local attached disk (shared nothing) or some other data source at the appropriate time via a rich event framework.
  5. Read case study here: http://blog.pivotal.io/pivotal/case-studies-2/how-argentina-pays-its-bills-19-million-cash-transactions-a-month-on-unreliable-networks-with-pivotal-Geode-and-spring (Rapipago is their most well-known brand)
     With over 2,600 branches, 4,000 kiosk-based points of sale, and a huge call center, Rapipago is part of GIRE's business of providing billing, collection, payment, and transaction processing services. To put it simply, consumers visit our locations to pay their bills. Rapipago supports payments between 1,200+ companies and their consumers, around 19 million transactions per month. Rapipago's card, check, and cash-based transactions ultimately collect money on behalf of cell phone companies, automotive, banking, energy, gas, water, insurance, cable, credit cards, schools, municipalities, tourism operators, and more.
     The biggest problem Geode has helped us solve has to do with unreliable network connectivity. This limited our ability to report on the business operations and take certain actions. In our network, we collect money from 6 AM until 9 PM, depending on the region. Before Geode, we had to wait until the next day to have visibility into how our network is collecting money. The previous system would process batch files each night. Our management team could only see transactions after a day's time passed. In addition, many locations have unreliable network connections that make it harder to get a current view of the information.
     With Geode, we can see the information in real time, even if there is an unreliable network. The data is synced 24 x 7 as the network allows. Now, we have a much more accurate view of the cash at each location. From an operations perspective, having a better picture of each payment point throughout the day means we can now decide to do things like send an armed vehicle to take cash off the street when large sums of money start to accumulate.
     In each kiosk, transactions are captured in a local Geode instance.
Geode also places the transactions in a shared branch region. This way, we can share information between kiosks within a branch. In each branch, we have a Geode peer-to-peer topology set up: a branch's kiosks are part of a distributed system. Before Geode, business rules were on the server side, and the system had to be online for the rules to run. With Geode's P2P topology, we have those rules running in each kiosk, and they can be executed when the server is offline. The kiosks also place information on the Geode WAN gateway to synchronize information to the central data center. With the WAN gateway, a returning network connection allows us to synchronize with the data center's master database. When we don't have an internet connection to our central data center, we store all the transactions in the gateway's queue. This is how we get a near real-time view of the entire network in a central place.
     Can you explain more about how Geode handles the WAN synchronization? Yes. This is the key function that allows us to have up-to-date information and deal with lost network connections. We use the WAN gateway to synchronize transactions between kiosks and our data center. Geode's WAN gateway allows us to loosely couple multiple, independent Geode systems. So, each of our kiosks has its own Geode instance, and region data is shared with the central Geode instance via the WAN gateway. If communications between sites fail or become slow, the systems still run independently, and persistent queues operate for messaging between sites.
     Within a cluster, data can be made resilient to failures. This means that nodes can come and go without data loss. Additionally, data can be persisted to local disk as an added measure. If servers fail within a cluster, it is transparent to the client. Data is kept consistent, and the connection is automatically routed to an available node. Cluster-to-cluster connectivity gives the data a way to survive smoking hole failures as well.
The data is queued up to write asynchronously to other clusters. These queues can also be written to local disk to avoid losing in-memory queue data in case a data center loses connectivity. The data is saved in the local cluster safely until network connectivity is restored. GIRE specifically uses this to handle their unreliable network issues.
  6. Geode is written entirely in Java, but it can be accessed from several other languages. Before you say “but Java is slow”, keep in mind that some of the fastest trading platforms in the world use Geode at their core. There are many ways to interact with Geode. Web applications can be made more performant by using Geode as an HTTP SSC or Hibernate L2 cache with no code changes, just configuration. Many customers have seen significant performance improvements just by plugging these features into their existing applications. Geode also supports a Memcache API, which can help applications built on any of the 70 or so memcache clients when memcache just isn’t scaling to what is needed. This can be a nice transitional step towards Geode without major code changes. Geode provides a rich set of application functions that act as more than just a “driver” to Geode. When a client connects to a grid, it can send functions to the grid and have the grid route each function to the appropriate nodes for execution, aggregate the results, and send them back, without the client needing to be aware that it is attached to a distributed grid. It can query the grid via OQL (a standard from the OMG, the Object Management Group). Clients can subscribe to events on the server side, or listen for data matching particular criteria, without polling Geode. Geode’s extensive event framework provides client-side event listeners as well. Clients can rely on data coming from the server, even if the client temporarily disconnects. The queues holding data going to clients can be made just as reliable as data in the grid. Developers of the grid themselves have access to a rich event framework that allows for real-time monitoring of, and reaction to, data as it flows through the system and passes transactional and physical boundaries. There are native clients available for Java, C++, and C#.
Being a k/v store, Geode accommodates user-defined objects that are arbitrarily complex (as long as they are serializable). Additionally, Geode natively supports JSON documents. Geode allows versioning of object schemas in the grid without having to restart the grid or change client applications that use old versions of the schema. Geode detects these changes and manages the different versions for you. Additionally, when you are interacting with Geode, you can use the Java HashMap interface or Spring Data Geode. Serialization APIs are also available for extraordinary performance and memory tuning.
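The "send a function to the grid, run it where the data lives, aggregate the results" pattern from this note can be sketched in plain Java. The sketch below does not use Geode's function execution API: `ScatterGatherSketch`, `execute`, and `countAbove` are hypothetical names, and each "node" is simply a list of integers held in the same process.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical scatter-gather illustration; not Geode's function execution API. */
class ScatterGatherSketch {

    /** Applies fn to each node's local data and sums the partial results. */
    static long execute(List<List<Integer>> nodes, Function<List<Integer>, Long> fn) {
        long total = 0;
        for (List<Integer> localData : nodes) {
            // In a real grid this call would run on the node that owns the data.
            total += fn.apply(localData);
        }
        return total;
    }

    /** Example function: counts the local values strictly above a threshold. */
    static Function<List<Integer>, Long> countAbove(int threshold) {
        return data -> data.stream().filter(v -> v > threshold).count();
    }
}
```

The caller sees only one aggregated number and never needs to know how many nodes contributed to it, which is the point the note makes about clients being unaware of the distribution.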
  7. Deploying data to Geode is done via “configuration” of the data. You specify how the data should be partitioned from a logical perspective. When this is deployed to the grid, Geode manages the specific physical implementation of this configuration. This means that, while keeping the data consistent, it manages where the data lives for you, making the most of the resources you have available in the cluster. A dashboard is provided to monitor the most critical aspects of your grid. Additionally, JMX APIs are available for integration into most enterprise monitoring environments, keeping monitoring costs down. For historical root cause analysis, Geode comes with Visual Statistics Display (VSD), which lets you correlate application statistics down to OS and hardware statistics so you can tune the cluster for optimal throughput. A rich CLI, gfsh, provides automation of the system via scripting.
  8. Geode can be run in a number of different topologies, providing flexibility for a number of deployments. In its most basic form, Geode can be used embedded in a process, like a web server. This provides performant, transactional data; however, it doesn’t provide scale or high availability. Moving to a peer-to-peer model, we can achieve scale across an application. This is generally for a data set that is very application specific, such as frequently used content in a custom web app or HTTP SSC. When our data becomes relevant across more than one application, it makes sense to deploy in a client-server type of architecture. With Geode, data can be deployed to a dedicated, individually scalable and configurable grid. Client applications, such as web apps, dashboards, and alerting, can be clients of the data in the grid. Clients can choose to interact with the grid in a pull fashion, or have the grid send updates in a push fashion, meaning real-time dashboards are updated as new data arrives, not after some polling interval has passed. Independent Geode grids can also be kept in sync with each other via the WAN gateway or async event listeners. Different distributed systems can either have exact copies of data or some other version of the data, such as an aggregate. Which brings us to geo-distributed. The WAN gateway can be used to distribute data reliably across the globe, getting data to where it needs to be for DR, reporting, or simply to be closer to the action (as with trading). Think about stock exchanges. Where the data lives matters. You do not want to have to do a global round trip to make your trade. You are constrained by the speed of light. Another consideration is that you don’t always want certain data to cross international boundaries. We can put these rules in place at the boundaries between clusters.
  9. Reads are completely network-bound in these runs due to the 1 Gbit network used.
  10. There are many ways to partition data in SQLFire. By default, SQLFire will try to distribute data evenly at random across all servers. If that's not good enough, you can exert a lot of control over how data is divided and distributed using list-, range-, or expression-based partitioning.
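To make the three strategies named in this note concrete, here is a minimal sketch of what list, range, and expression partitioning each decide. It is plain Java rather than SQLFire syntax; the class `PartitionSketch`, its method names, and the partition labels are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of partitioning strategies; not SQLFire's implementation. */
class PartitionSketch {

    /** Range partitioning: a partition owns keys from its lower bound up to the next bound. */
    static String byRange(TreeMap<Integer, String> lowerBounds, int key) {
        Map.Entry<Integer, String> owner = lowerBounds.floorEntry(key);
        if (owner == null) {
            throw new IllegalArgumentException("key below the lowest range: " + key);
        }
        return owner.getValue();
    }

    /** List partitioning: an explicit mapping from column value to partition. */
    static String byList(Map<String, String> valueToPartition, String value) {
        String partition = valueToPartition.get(value);
        if (partition == null) {
            throw new IllegalArgumentException("no partition listed for: " + value);
        }
        return partition;
    }

    /** Expression partitioning: the partition is computed from the key (here, key mod n). */
    static int byExpression(int key, int partitions) {
        return Math.floorMod(key, partitions);
    }
}
```

Range and list partitioning give the designer direct control over which rows are colocated, while the expression variant shows that any deterministic function of the key can serve as the placement rule.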
  15. A region on the servers is broken up into buckets, and buckets are assigned to nodes. Each key is hashed to a bucket.
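A minimal sketch of the two-step lookup in this note, key to bucket and then bucket to node, might look like the following. It is an illustration only: `BucketRoutingSketch` is a hypothetical class, the round-robin bucket assignment is an assumption made for simplicity, and the bucket count of 113 in the usage note is just an example value.

```java
import java.util.List;

/** Hypothetical key-to-bucket-to-node routing; not Geode's actual placement logic. */
class BucketRoutingSketch {

    /** Hashes a key to a bucket id in the range [0, totalBuckets). */
    static int bucketFor(Object key, int totalBuckets) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    /** Assigns buckets to nodes round-robin (an assumption made for this sketch). */
    static String nodeFor(int bucketId, List<String> nodes) {
        return nodes.get(bucketId % nodes.size());
    }
}
```

With 113 buckets and three nodes, `nodeFor(bucketFor(key, 113), nodes)` always returns the same node for the same key. Because keys map to buckets rather than directly to nodes, moving data between nodes means reassigning whole buckets, not rehashing every key.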
  16. Replication is synchronous, and the key is locked while it is updated. This is what gives consistency between the nodes.
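The idea in this note, holding a per-key lock while every copy is updated before the write returns, can be sketched in a single process as follows. This is not how Geode implements replication: `SyncReplicationSketch` is a hypothetical class, the copies are in-memory maps rather than remote nodes, and failure handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical single-process model of synchronous replication under a key lock. */
class SyncReplicationSketch {
    private final List<Map<String, String>> copies = new ArrayList<>();
    private final Map<String, ReentrantLock> keyLocks = new ConcurrentHashMap<>();

    SyncReplicationSketch(int copyCount) {
        for (int i = 0; i < copyCount; i++) {
            copies.add(new ConcurrentHashMap<>());
        }
    }

    /** Updates every copy while holding the key's lock; returns only when all are written. */
    void put(String key, String value) {
        ReentrantLock lock = keyLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            for (Map<String, String> copy : copies) {
                copy.put(key, value);
            }
        } finally {
            lock.unlock();
        }
    }

    /** Reads the value held by one particular copy. */
    String read(int copyIndex, String key) {
        return copies.get(copyIndex).get(key);
    }
}
```

Because two writers to the same key are serialized by the lock, every copy applies their updates in the same order, so once both `put` calls have returned all copies hold the same value for that key.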
  17. GemFire is network-bound in this run; otherwise it would be even faster. Cassandra is not network-bound. Latency shows similar ratios, with Cassandra having 3.5-4.5X higher latency. This run used 16 servers and 8 client nodes running 400 clients against 0.5 TB of data (1K object size), with redundancy for a total of 1 TB of data. GemFire was also 2X faster in the load phase, with both GemFire and Cassandra doing batched inserts (500 objects per insert).
  18. Reads are completely network-bound in these runs due to the 1 Gbit network used.
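A rough back-of-the-envelope calculation shows why a 1 Gbit link caps read throughput for the 1K objects used in the benchmark notes. The sketch below assumes "1K" means 1024 bytes and ignores all protocol and serialization overhead, so the real ceiling would be lower; `NetworkBoundSketch` is a hypothetical helper, not part of any benchmark tooling.

```java
/** Hypothetical helper for estimating a per-link throughput ceiling. */
class NetworkBoundSketch {

    /** Upper bound on objects per second over one link, ignoring protocol overhead. */
    static long maxObjectsPerSecond(long linkBitsPerSecond, long objectBytes) {
        long bytesPerSecond = linkBitsPerSecond / 8; // 8 bits per byte
        return bytesPerSecond / objectBytes;         // whole objects only
    }
}
```

For a 1,000,000,000 bit/s link and 1024-byte objects, this gives 125,000,000 bytes/s and therefore at most 122,070 objects per second per link, no matter how fast the servers themselves are.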