B 4 gravty

1 What Is Gravty?
2 The Internals of Gravty
3 Fine-Tuning Gravty
4 Future Plans

A Graph Database Is
“A graph database is a database that uses
graph structures for semantic queries with nodes,
edges and properties to represent and store data.” (Wikipedia)
• Stores objects (vertices) and relationships (edges)
• Provides graph search capabilities

Vertices and Edges in a Graph Database
Friends
Friends
Likes

Use Cases of a Graph Database
• Facebook: Social Graph (social networks)
• Google: PageRank (ranking websites)
• Walmart and eBay: product recommendation

Need for a Large Graph Database System
Social Graph
LINE Timeline
LINE Talk
Ranking
Recommendation
LINE Friends Shop
LINE News
Gravty
7 billion vertices
100 billion edges
200 billion indexes
5 billion writes a day
(create / update / delete)

Gravty Is
A scalable graph database that uses graph search
to find relational information efficiently
in a large pool of data.

Requirements for Gravty
Easy to scale out
• To support ever-increasing data
Easy to develop
• Add, modify, and remove features as necessary
• Tailored to the LINE development environment
• Not dependent on LINE-specific components
Full control over everything!
Easy to use
• Graph query language
• REST API

2 The Internals of Gravty
Technology Stack and Architecture
Data Model

Technology Stack and Architecture
Application
TinkerPop3 Gremlin-Console
TinkerPop3 Graph API
Graph Processing Layer
Storage Layer
MySQL
(config, meta)
HBase
Kafka
Gravty

MySQL
(config, meta)
Kafka
Application
TinkerPop 3.2.0 Graph API
Graph Processing Layer (OLTP only)
HBase
Storage Layer
Gravty

HBase 1.1.x
Local Memory
Kafka 0.10.0.0
Phoenix 4.8.0
Application
TinkerPop3 Graph API
Gravty
Storage Layer (Abstract Interface)
Phoenix Repository
(Default)
Memory Repository
(Standalone)
Graph Processing Layer

Data Model
Flat-Wide Table
• Row key: vertex-id
• Edges are stored in columns
• Disadvantages
  Column scan is slow
  Columns cannot be split
Row         Column
vertex-id1  property property edge edge edge edge edge edge
vertex-id2  …
vertex-id3  …

Data Model
Tall-Narrow Table (Gravty)
• Row key: edge-id (SrcVertexId-Label-TgtVertexId)
• Edges are stored in rows
• Advantages
  More effective edge scan
  Parallel execution
Row                    Column
svtxid1-label-tvtxid2  edge property edge property
svtxid1-label-tvtxid3  …
…

Flat-Wide vs Tall-Narrow
g.V("brown").out("friends").id().limit(3)
(Diagram: Brown is connected by Friends edges to Cony, Moon, and Sally.)
Result: [cony, moon, sally]

Flat-Wide Model
Brown  edge edge edge edge edge edge  ('likes' and 'friends' edges in one row)
(1) Row scan
(2) Column scan
2 operations
Result: [cony, moon, sally]

Tall-Narrow Model (Gravty)
brown-friends-cony
brown-friends-moon
brown-friends-sally
(1) Row scan
1 operation
Result: [cony, moon, sally]
• Can split by rows (region)
• Can isolate hotspot rows
• Can scan in parallel
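The single row scan on this slide can be sketched as a prefix scan over sorted row keys. This is a minimal illustration, not Gravty's code: the "-" separator, the function names, and the in-memory sorted list standing in for an HBase table are all assumptions.

```python
from bisect import bisect_left

# Hypothetical separator; the slides only show the SrcVertexId-Label-TgtVertexId layout.
# It assumes vertex ids and labels never contain "-" themselves.
SEP = "-"

def edge_row_key(src, label, tgt):
    """Build a tall-narrow row key for one edge."""
    return f"{src}{SEP}{label}{SEP}{tgt}"

def prefix_scan(sorted_keys, prefix, limit):
    """Return up to `limit` keys that start with `prefix` from a sorted key list."""
    start = bisect_left(sorted_keys, prefix)  # jump straight to the first candidate row
    hits = []
    for key in sorted_keys[start:]:
        if len(hits) == limit or not key.startswith(prefix):
            break
        hits.append(key)
    return hits

edges = [
    ("brown", "friends", "cony"),
    ("brown", "friends", "moon"),
    ("brown", "friends", "sally"),
    ("brown", "likes", "post1"),
    ("cony", "friends", "brown"),
]
# A sorted list stands in for the lexicographically ordered rows of the edge table.
table = sorted(edge_row_key(*e) for e in edges)

# Equivalent in spirit to g.V("brown").out("friends").id().limit(3)
hits = prefix_scan(table, edge_row_key("brown", "friends", ""), limit=3)
friend_ids = [key.split(SEP)[2] for key in hits]
print(friend_ids)  # ['cony', 'moon', 'sally']
```

Because the label is part of the key, the 'likes' edge is never touched, which is the step the flat-wide model needs a separate column scan for.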

g.V("brown").out("friends").out("friends").id().limit(10)
4 searches in total
• Flat-Wide = 8 operations
• Tall-Narrow (Gravty) = 4 operations
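The counts above follow from the per-search costs on the earlier slides (2 operations per search for flat-wide, 1 for tall-narrow). A small sketch of that arithmetic, assuming the 4 searches are one for Brown plus one for each of his three friends:

```python
# Operations needed to expand one vertex, taken from the two model slides.
OPS_PER_SEARCH = {"flat-wide": 2, "tall-narrow": 1}

def two_hop_searches(first_hop_neighbors):
    """One search for the start vertex plus one per first-hop neighbor."""
    return 1 + first_hop_neighbors

def total_operations(model, searches):
    return OPS_PER_SEARCH[model] * searches

searches = two_hop_searches(first_hop_neighbors=3)  # Brown, then Cony, Moon, Sally
for model in OPS_PER_SEARCH:
    print(model, total_operations(model, searches))
```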

3 Fine-Tuning Gravty
Faster, Compact Querying
Avoiding Hot-Spotting
Efficient Secondary Indexing

Reducing graph traversal steps
g.V(brown).hasLabel("user").out("friends").order().by("name", Order.incr).limit(5)
TinkerPop steps: GraphStep, VertexStep, FilterStep, RangeStep, FilterStep
Gravty steps: GGraphStep, GVertexStep

Querying in parallel and pre-loading vertex properties
g.V(brown)
.outE("friends").limit(5)
.inV().order().by("name", Order.incr)
.properties("name")
inV(): Pipelined iterator from outE()
• TinkerPop: Sequential consuming
• Gravty: Parallel querying + pre-loading vertex property
(Diagram: outE("friends") with limit 5 feeds inV(); the vertices shown carry "name" properties Boss, Edward, Moon, James, Jessica, Cony, and Sally.)
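The difference between sequential consuming and parallel querying with pre-loaded properties can be sketched with a thread pool. This is an illustration only: the dictionaries stand in for the edge and vertex tables, treating Boss as the source vertex is an assumption about the diagram, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the edge table and the vertex table.
EDGES = {"boss": ["edward", "moon", "james", "jessica", "cony", "sally"]}
VERTEX_PROPS = {vid: {"name": vid.capitalize()} for vid in EDGES["boss"]}

def out_edges(src, limit):
    """outE("friends").limit(n): target ids of the first n edges."""
    return EDGES[src][:limit]

def load_vertex(vertex_id):
    """One vertex read; in a real store this would be a remote lookup."""
    return vertex_id, VERTEX_PROPS[vertex_id]

def friend_names_sorted(src, limit):
    targets = out_edges(src, limit)
    # Issue all vertex reads at once instead of one per iterator step,
    # so the "name" property is already loaded when order().by() needs it.
    with ThreadPoolExecutor(max_workers=4) as pool:
        loaded = dict(pool.map(load_vertex, targets))
    return sorted(props["name"] for props in loaded.values())

names = friend_names_sorted("boss", limit=5)
print(names)
```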

Hot-spotting problem with HBase RegionServer
Sequentially ordered row keys may cause RegionServers to suffer from:
• Heavy loads of writes or reads
• Inefficient region splitting
EDGE TABLE
SrcVertexId  Label  TgtVertexId
u000001      1      u000002
u000001      1      u000003
u000002      1      u000001
u000003      1      u000001
u000004      2      u000009

Solutions to the hot-spotting problem
- Pre-splitting regions
- Salting row keys with a hashed prefix (salting tables by Apache Phoenix)
However, salted tables have a scan performance issue with the LIMIT clause:
SELECT * FROM index …
LIMIT 100;

Phoenix Salted Table
• The Phoenix client scans 100 rows in each of the four salt buckets
• Scans a maximum of 400 rows
• Client-side merge sort produces the result
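The cost of LIMIT on a salted table can be sketched as follows. This simulates the behavior described on the slide rather than calling Phoenix: the four buckets, the round-robin bucket assignment, and the function name are assumptions for illustration.

```python
import heapq
from itertools import islice

def salted_limit_scan(buckets, limit):
    """Scan up to `limit` rows per bucket, then merge-sort on the client.

    `buckets` is a list of sorted row lists, one per salt bucket.
    Returns (result_rows, rows_scanned).
    """
    partials = [bucket[:limit] for bucket in buckets]  # every bucket is scanned
    rows_scanned = sum(len(p) for p in partials)
    merged = list(islice(heapq.merge(*partials), limit))  # client-side merge sort
    return merged, rows_scanned

NUM_BUCKETS = 4  # hypothetical; matches the four scans on the slide
rows = [f"row{i:04d}" for i in range(1000)]
# Round-robin assignment stands in for a hash-based salt byte.
buckets = [rows[b::NUM_BUCKETS] for b in range(NUM_BUCKETS)]

result, scanned = salted_limit_scan(buckets, limit=100)
print(len(result), scanned)
```

Only 100 rows are returned, but 400 are read, because the client cannot know in advance which bucket holds the smallest keys.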

Custom Salting + Pre-splitting
• Row key prefix: hash (source-vertex-id)
• The Phoenix client scans 100 rows sequentially to produce the result
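A sketch of why salting by the source vertex id keeps a LIMIT scan sequential: every edge of one vertex gets the same prefix, so the scan stays inside one bucket. The bucket count, the `crc32` hash, the `|` delimiter, and the function names are assumptions; the slides do not say which hash or key encoding Gravty uses.

```python
import zlib
from bisect import bisect_left

NUM_BUCKETS = 4  # hypothetical; one pre-split region per bucket

def salt(src_vertex_id):
    # crc32 is only a stand-in for the actual hash function.
    return zlib.crc32(src_vertex_id.encode("utf-8")) % NUM_BUCKETS

def salted_row_key(src, label, tgt):
    """Prefix the edge key with a salt derived from the source vertex only."""
    return f"{salt(src):02d}|{src}-{label}-{tgt}"

def scan_edges(sorted_keys, src, label, limit):
    """Read at most `limit` consecutive rows for one source vertex and label."""
    prefix = f"{salt(src):02d}|{src}-{label}-"
    start = bisect_left(sorted_keys, prefix)
    rows = []
    for key in sorted_keys[start:start + limit]:
        if not key.startswith(prefix):
            break
        rows.append(key)
    return rows

# Five source vertices with 250 outgoing edges each.
keys = sorted(
    salted_row_key(f"u{s:06d}", "1", f"u{t:06d}")
    for s in range(1, 6)
    for t in range(1, 251)
)
rows = scan_edges(keys, "u000001", "1", limit=100)
print(len(rows), rows[0])
```

Different source vertices still spread across buckets, so writes are distributed, while a single vertex's edges stay contiguous.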

Efficient Secondary Indexing
• Indexed graph view for faster graph search
• Asynchronous index processing using Kafka
• Tools for failure recovery

Default Phoenix IndexCommitter
(Diagram: Put / Delete / Put go through the Phoenix Driver to the HRegions; the Indexer Coprocessor issues index updates between HRegions.)
• Synchronous processing of index update requests
• numConnections = regionServers * regionServers * needConnections
• Too many connections on each RegionServer (network is heavily congested)

Gravty IndexCommitter
(Diagram: Put / Delete / Put go through the Phoenix Driver to the HRegions; the Indexer Coprocessor sends mutations to Kafka, and Indexers apply the index updates.)
• Asynchronous processing using Kafka
• numConnections = indexers * regionServers * needConnections
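The two connection formulas can be compared directly. The cluster sizes below are made-up numbers chosen only to show the arithmetic; they are not figures from the talk.

```python
def default_committer_connections(region_servers, need_connections):
    """Every RegionServer may open connections to every RegionServer."""
    return region_servers * region_servers * need_connections

def gravty_committer_connections(indexers, region_servers, need_connections):
    """Only the indexer processes open connections to RegionServers."""
    return indexers * region_servers * need_connections

# Hypothetical cluster: 20 RegionServers, 4 indexers, 1 connection needed per pair.
region_servers, indexers, need = 20, 4, 1

default_total = default_committer_connections(region_servers, need)
gravty_total = gravty_committer_connections(indexers, region_servers, need)
print(default_total, gravty_total, gravty_total / default_total)
```

The ratio reduces to indexers / regionServers, so the saving grows as the cluster gets larger while the number of indexers stays small.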

Default Phoenix IndexCommitter
1. Phoenix client UPSERT (PUT to the Primary Table)
2. Request HBase mutations for indexes in parallel (PUT / DELETE to INDEX 1 and INDEX 2)
3. Phoenix client returns (RETURN)
(Diagram: the Primary Table, INDEX 1, and INDEX 2 each sit on a Region Server with a Phoenix Coprocessor.)
Gravty IndexCommitter
1. PUT (to the Primary Table)
2. HBase mutations for INDEX 1, 2 (sent to Kafka)
3. RETURN
4. Consume (the Index Consumer reads from Kafka)
5. PUT / DELETE (to INDEX 1 and INDEX 2)
(Diagram: the Primary Table, INDEX 1, and INDEX 2 each sit on a Region Server with a Phoenix Coprocessor.)
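The five steps above can be sketched with an in-process queue standing in for Kafka. This is a toy model, not Gravty's implementation: the class, its fields, and the single "name" index are hypothetical, and a real deployment would use a Kafka producer and consumer instead of `queue.Queue`.

```python
import queue
import threading

class AsyncIndexCommitter:
    """Toy model: the primary write returns before the index is updated."""

    def __init__(self):
        self.primary = {}         # row key -> name (primary table)
        self.index = set()        # (name, row key) pairs (secondary index)
        self.log = queue.Queue()  # stand-in for a Kafka topic

    def put(self, row_key, name):
        old = self.primary.get(row_key)
        self.primary[row_key] = name                    # 1. PUT to the primary table
        if old is not None:
            self.log.put(("DELETE", old, row_key))      # 2. queue index mutations
        self.log.put(("PUT", name, row_key))
        # 3. return to the caller without waiting for the index

    def consume(self):
        while True:
            mutation = self.log.get()                   # 4. consume
            if mutation is None:                        # shutdown marker
                self.log.task_done()
                break
            op, name, row_key = mutation
            if op == "PUT":                             # 5. PUT / DELETE on the index
                self.index.add((name, row_key))
            else:
                self.index.discard((name, row_key))
            self.log.task_done()

committer = AsyncIndexCommitter()
worker = threading.Thread(target=committer.consume, daemon=True)
worker.start()

committer.put("u000001", "Brown")
committer.put("u000001", "Cony")  # rename: old index entry is deleted asynchronously

committer.log.join()     # wait until the consumer has caught up
committer.log.put(None)  # stop the consumer
worker.join()
print(sorted(committer.index))
```

The trade-off is that the index briefly lags the primary table, which is why the failure-recovery tooling later in the deck matters.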

Secondary Indexing Metrics
• Server TPS: 3x
• RegionServer number of connections: 1/8

• Reentrant event processing: every row is versioned in HBase (timestamp)
• Logging failures and replaying failed requests
• Time machine to resume at certain runtime: resetting runtime offset of Kafka consumers
Best-Effort Failover
Fail fast, fix later
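Why timestamped rows make replay safe can be sketched with a latest-timestamp-wins store. This is a simplified model under stated assumptions: a list stands in for a Kafka partition, `from_offset` stands in for a reset consumer offset, and HBase's multi-version cells are reduced to a single (timestamp, value) pair per row.

```python
def apply_event(store, event):
    """Apply one write; an older timestamp never overwrites a newer one."""
    row_key, timestamp, value = event
    current = store.get(row_key)
    if current is None or timestamp >= current[0]:
        store[row_key] = (timestamp, value)

def replay(store, events, from_offset):
    """Re-process the log starting at `from_offset` (a reset consumer offset)."""
    for event in events[from_offset:]:
        apply_event(store, event)

# Hypothetical event log: (row key, timestamp, value).
events = [
    ("u000001", 100, "a"),
    ("u000001", 200, "b"),
    ("u000002", 150, "c"),
]

store = {}
replay(store, events, from_offset=0)
snapshot = dict(store)

# Rewinding and replaying everything again leaves the state unchanged.
replay(store, events, from_offset=0)
print(store == snapshot)
```

Because each event carries its own timestamp, processing it twice is harmless, which is what lets the offset be rewound after a failure.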

Monitoring Tools for Failure Recovery
Setting alerts and displaying metrics
• Prometheus
• Dropwizard metrics
• jvm_exporter
• Grafana
• Ambari

Multiple Graph Clusters
Before: Client / Graph API / Gravty / one HBase Cluster
After: Client / Graph API / Gravty / multiple HBase Clusters

HBase Repository
Storage Layer (Abstract Interface)
• Memory Repository (Standalone): Local Memory
• Phoenix Repository (Default): Phoenix
• HBase Repository: HBase Region Coprocessor

OLAP Functionality
• Graph analytics system
• Graph computation
• TinkerPop Graph Computing API
