Pravega
March Community Meeting
Welcome
• Last call was November 6th, 2020 via Zoom
• Since then:
• Pravega was accepted as a CNCF Sandbox project (Nov. 10th, 2020)
• 0.7.3 was released (Dec. 9th, 2020)
• 0.8.1 was released (Jan. 14th, 2021)
• 0.9.0 was released (Mar. 3rd, 2021)
• This call via Cloud Native Community Groups (aka Bevy.com)
• We will host these monthly
Community Developments
• Pravega Akka connector updated with Key Value Tables support
• https://github.com/akka/alpakka/pull/2566
• Pravega on ARM64 and RISC-V
• RISC-V support issues have been opened in upstream dependencies
• PR for ARM64 support: https://github.com/pravega/pravega/pull/5747
• Documentation improvements coming soon
• New guides to help Dev & Admin roles get started, updates to existing docs
• New website coming soon
• Maintainers group and Steering Committee are forming
• Maintainers group invites have been sent out
• Steering Committee invites will be coming soon
• Join the users mailing list: https://lists.cncf.io/g/cncf-pravega-users
Next call
April 16th – 7 AM Pacific
Topic: Connectors
• Akka connector
• Presto connector
• Flink connector
• Spark connector
• NiFi connector
Open to suggestions, especially if you want to show off your connector.
(And please suggest topics for future monthly calls!)
Suggestions & feedback – send to: Derek.Moore@dell.com
Today’s call – Agenda
• State of Pravega
• Overview of Experimental Features of Pravega
• Schema Registry
• Consumption Based Retention
• Simplified Long-Term Storage (SLTS)
• SLTS Plugin for BookKeeper
• Key Value Tables
• Overview of Performance Evaluation
Time ran short, so the Key Value Tables and Performance Evaluation presentations were rescheduled.
April 16th Community Call will feature:
Key Value Tables (KVT)
Performance Evaluation
Akka connector w/ KVT
State of Pravega
Flavio Junqueira
Pravega
What's Pravega?
• Pravega is about streaming data...
• Data sources: continuously generated data
• Processing applications: visualize, alert, train, infer
• Pravega ingests and stores, providing: consistency, elasticity, durability
• A durable log plus tiered storage, backed by scale-out storage (e.g., an object store)
• Clients: Java; others under development
Where's Pravega used?
Streaming Data Platform
https://www.delltechnologies.com/en-us/blog/episode-two-the-best-ride-of-your-life/
https://www.youtube.com/watch?v=BTh1gkf0kQQ
https://www.youtube.com/watch?v=89IDFI9jry8
Dell Tech Customer Profile - RWTH
Amusement Parks
Construction sites
Industrial IoT
Looking forward to seeing community use cases
Open-source trajectory
Timeline and status
• Open-sourced early in 2017
• First open-source release: 0.1.0 – Dec. 19, 2017
• 0.9.0 is fresh out of the oven
• In 2020, accepted into the CNCF Sandbox
• Transition
• Overall bootstrapping
• Setting up communication channels
• Web site revamp (coming soon!)
• Organizing repositories
• Documenting governance
https://github.com/cncf/toc/issues/560
Source: https://star-history.t9t.io/#pravega/pravega
Repositories
Pravega Core
Total: 43 Repositories
Connectors:
• Apache Flink
• Apache Spark
• Apache NiFi
• Logstash
• Presto (brand new)
Kubernetes Operators:
• Pravega
• Apache BookKeeper
• Apache Zookeeper
Tools:
• Pravega tools
• Flink tools
• Benchmark
Client bindings:
• Rust
• Python
Contributions
• Unique collaborators across repositories: 135
• Vast majority from Dell
• Expect more non-Dell contributions
• Many open issues and opportunities to contribute
• Look for guidance, maybe a mentor, if you want to get involved
• Going forward
• Expect more Pravega features
• Important focus on ecosystem
Get involved!
https://github.com/pravega/pravega/wiki/Contributing
T-Shirts for the best questions
Thank you!
Consumption Based Retention
(CBR)
http://pravega.io
Prajakta Belgundi
12/03/2021
@PravegaIO https://github.com/pravega/pravega
Data Retention for Streams (without CBR)
• A Stream can be configured for:
• No Retention Policy –
• Data is never auto-truncated.
• Manual truncation using an explicit API invocation is possible.
• SIZE/TIME based Retention Policy
• Data is periodically truncated based on size/time limits.
• The policy supports specifying an "at least" (min) value.
• A retention cycle runs periodically on the Controller and truncates Streams that breach the policy limit.
Stream Truncation
• Every time the retention cycle runs, a new Stream-Cut is generated at the tail of the Stream.
• Truncation can happen only at a specific Stream-Cut.
• When a Stream is found to have more data than the configured retention limit, the Controller identifies a Stream-Cut that satisfies the Retention Policy and truncates the Stream at this Stream-Cut.
Size Based Retention
Limitations
• Stream truncation is agnostic of reads by Reader Group(s).
• Unread data could be lost.
• Streams tend to consume more space, as the approach to deletion is conservative.
• No max limit on Stream size.
Why CBR ?
• Some streaming use-cases do not require data to be stored over the long term.
• Once data is “read” by specific Reader-Group(s) it can be deleted.
• Environments may have constrained Storage capacity – e.g.: Edge Gateways.
• Need to cycle/move out data as soon as it is read.
What is CBR ?
• Stream Truncation can happen based on read positions of “specific” Reader Groups reading from
the Stream.
• These Reader Groups need to be created as “subscriber” Reader Groups.
• Read positions of “non-subscriber” Reader Groups do not impact Stream truncation.
• The Stream Retention policy also has a max limit (in addition to the min limit discussed earlier)
Configuring CBR
• The existing Retention policy (SIZE/TIME based) can stay as is.
• To enable CBR, update the Reader-Group(s) configuration on Client to be “subscriber” Reader-
Group(s).
• If all Reader Groups are non-subscribers, the Stream won’t have Consumption Based Retention.
• Optionally, set a max (at-most) limit on the Stream Retention Policy; defaults to LONG_MAX.
How CBR works
• A subscriber Reader Group periodically publishes the Stream-Cut corresponding to its "read" positions in the Stream to the Controller.
• This Stream-Cut is stored on the Controller.
• When the retention cycle on the Controller runs, the Stream-Cuts from all "subscriber" Reader Groups are used to compute a single subscriber-lowerbound-stream-cut.
• The Stream is truncated at this subscriber-lowerbound-stream-cut if it satisfies the min/max limit criteria.
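The lower-bound computation can be sketched as follows. This is a simplified illustration, not Pravega's implementation: it assumes a Stream-Cut can be modeled as a map from segment ID to read offset, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubscriberLowerBound {
    // A stream-cut is modeled as a map from segment ID to read offset.
    // For each segment, the lower bound keeps the smallest offset reported
    // by any subscriber reader group, so no subscriber loses unread data.
    public static Map<Integer, Long> compute(List<Map<Integer, Long>> subscriberCuts) {
        Map<Integer, Long> lowerBound = new HashMap<>();
        for (Map<Integer, Long> cut : subscriberCuts) {
            for (Map.Entry<Integer, Long> e : cut.entrySet()) {
                lowerBound.merge(e.getKey(), e.getValue(), Math::min);
            }
        }
        return lowerBound;
    }
}
```

Truncating at this per-segment minimum guarantees that data still unread by the slowest subscriber is preserved, subject to the min/max limits described next.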
Reader Group Configuration changes …
Retention Type - a new parameter that can be used to enable CBR:
• AUTOMATIC_RELEASE_AT_LAST_CHECKPOINT –
• At every checkpoint completion, the Reader Group automatically emits the Stream-cut
corresponding to the checkpoint as “read” acknowledgement to Controller.
• MANUAL_RELEASE_AT_USER_STREAMCUT –
• No automatic publishing of Stream-Cuts from Client to Controller.
• Users need to create manual Checkpoints and publish the Stream-Cuts corresponding to these checkpoints to the Controller using the updateRetentionStreamCut() API.
• NONE (default) –
• This Reader Group does not participate in Consumption Based Retention.
• Its read positions do not impact Stream truncation.
Min/Max and Subscriber Lower Bound
• If Min < SLB < Max, truncate at SLB
• If SLB < Min, truncate at Min
• If SLB > Max, truncate at Max
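Reading the slide's Min, Max, and SLB as truncation offsets, the three rules above collapse into a single clamp. This is an illustrative sketch with hypothetical names: minOffset and maxOffset stand for the stream-cuts derived from the policy's min and max limits.

```java
public class TruncationRule {
    // minOffset / maxOffset are the offsets corresponding to the policy's
    // min and max retention limits; slb is the subscriber lower bound.
    // The truncation point is the SLB clamped into [minOffset, maxOffset].
    public static long truncationOffset(long slb, long minOffset, long maxOffset) {
        return Math.max(minOffset, Math.min(slb, maxOffset));
    }
}
```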
Questions?
SLTS
Introducing Simplified Long-Term Storage
Sachin Joshi
Sr. Principal Software Engineer
DELL EMC
@PravegaProject https://github.com/pravega/pravega
Why: Pravega is Streaming Storage
• Durability is fundamental
• Once acknowledged, data is never lost
• Performance is critical
• Low latency
• High throughput
• Storage Efficiency is important
• Single Unified API that works for
• Real time data
• Historical Data
• Excellent choice for Kappa Architecture
• Automatic Tiering
• Between low-latency short-term storage and large-capacity external long-term storage
• Space-efficient and performant
• Completely transparent
• Bring your own external storage.
• Cloud Native
• Multi-cloud
• Meet customers where they are already moving (object stores)
• Enable edge.
LTS is an integral part of Pravega Storage
Traditional way vs. Pravega way
Quick Background
• Concepts
• Stream
• Segments
• Append only semantics. Can be sealed.
• Transactions use concatenation
• Scaling up or down
• Mapping of Routing key to segment is maintained by
Controller
• Implementation
• Segment Store
• Multiple containers per segment store.
• Number of containers is fixed for a deployment.
• Mapping from segment to containers is consistent.
• Storages
• Tier-1: in-cluster short-term storage for the Write-Ahead Log
• Tier-2: external long-term storage
• Cache: ephemeral (internal spillover cache)
• Assumptions
• Tier-2 writes are async, not on critical path
• Throughput matters more
• Segment is an opaque sequence of bytes
• Strong assumptions about tier-1 fencing.
• Tier-2 reads can be optimized by prefetching
Goal : Provide Segment Abstraction to upper layers
Requirements
• Segments are dynamic:
• Grow at tail end
• Shrink at head end
• Otherwise immutable
• Segments are everywhere
• Segments form the very foundation on top of which higher
order Pravega features, and data structures are built.
• Almost all the data in Pravega is ultimately stored in such
segments.
• User streams, attribute segments,
• Key Value tables,
• internal Pravega streams,
• client state management etc.
• All of these assume that this segment abstraction
works as expected
Challenges
• How to build segments out of immutable
objects?
• How to enforce single writer pattern?
• How to implement Atomic appends?
• How to deal with eventual consistency?
• How to truncate at the head ?
Problem 1 : Split Brain & Fencing
Problem
• Requirement:
• Data is appended atomically
• Strong single writer pattern
• Split-brain: in case of a network partition, more than one Segment Store may end up writing to the same underlying file, causing data corruption.
• Writes to storage are async, so an older SS may simply be flushing data even if it is not accepting new traffic
• Long GC pauses
• Safe appends are needed:
• NFSv3 locks are problematic
• HDFS appends are atomic, but not concurrency safe –
cannot append at specific offset.
• AWS S3 has eventual consistency
Current Solution
• Fencing
• Use "fencing" to mark ownership on underlying storage object
to prevent older SS from writing
• Implementation depends on guarantees and capabilities
of underlying storage
• How we do it today
• ECS S3 – use offset conditional appends
• HDFS – use atomic renames
• NFS – overwrite data
• Downside
• Each storage binding provides different sets of guarantees
• New fencing solution needs to be provided for each new
binding
• Possible performance degradation
• Hard to reason about correctness and liveness properties.
• Saving grace: SSs are simply applying the changelog from the WAL, so they should produce the same data at the same offset.
Problem 2 : Object Stores
Problem
• Requirement:
• Data is appended atomically
• Strong single writer pattern
• No Append functionality:
• AWS S3 does not provide appends or partial updates
• Entire object must be overwritten
• Eventually consistent:
• Provides a read-after-write guarantee only on new object creation
• All GETs are eventually consistent – you may get an old version
• Listing objects is eventually consistent
• Versions are also eventually consistent
• Cost
• Per operation charges
Current Solution
• Not supported today
• Possible solution : Use multi-part upload
• Create big object out of small parts
• Use CopyPartRequest
• Issues
• Still must deal with eventual consistency
• Managing objects is hard with eventual consistency
• Versioning doesn't help
• No mechanism today to manage multiple objects
What : Defining scope
Goals
• Simplify API contract for storage bindings
• Eliminate need for complex fencing
during failover. (E.g. During partitions)
• Give freedom to optimize append logic
• Leverage storage native
merge/concatenation capability
• Provide extension points for the future
• Additional background services –
Defragmentation, Integrity checks
• Data compression, encryption, erasure encoding
• Access control
• Multiple tiers
Out of scope
• This design does not target the following:
• Reading data written by Pravega
without using Pravega.
• Import or export of pre-existing
data in other formats (e.g. Avro)
• Multiple tiers
How: Architecture Change
Segment Store - current design:
• ReadIndex, DurableLog, SegmentMapper
• StorageReader, StorageWriter, Cache
• AsyncStorage, RollingStorage, SyncStorage
Segment Store - new design:
• ReadIndex, DurableLog, SegmentMapper
• StorageReader, StorageWriter, Cache
• ChunkedSegmentStorage
• Metadata TableStore
• SyncStorage, ChunkStorage
• Storage Optimizer (Defragment, GarbageCollector)
(The original diagram marks each component as unchanged, deleted, added, or changed.)
@PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
http://pravega.io
Unified Chunk Management Layer
• Chunk
• “Unit” of storage: stored on underlying storage as files or
objects.
• Append-only writes:
• Effectively leverage append or concat from the underlying storage wherever we can.
• Otherwise, each write is a separate chunk.
• Strong single writer pattern: Chunks can be written by only one
Segment Store.
• Immutable: once they become inactive, they are considered
“sealed”.
• Names: arbitrary, but must be globally unique.
• Segment
• Made from chunks – conceptually a linked list of chunks
• New chunk is added when
• New segment is written for the first time
• After failover
• Underlying file/object reaches its limit
• Underlying storage does not provide safe append semantics
• Append only writes - There can be only one active chunk per
segment at any time.
ChunkedSegmentStorage
Component in each Segment Store container that manages metadata for the segments it owns
• Conceptually the segment metadata consists of
• A header describing various properties of a segment
• plus a linked list of chunk metadata records describing each chunk.
• Metadata is stored in a BookKeeper-based Table
• Stored as KV pairs
• Table is pinned to a container.
• Metadata updates are atomic (when using multiple records).
• Metadata updates are fenced by tier-1
• Metadata updates are efficient
• Metadata updates are infrequent and updated lazily
• Metadata records are cached using write-through cache for read performance.
• Important optimization – metadata about a chunk can be updated only when it becomes inactive.
• Metadata records offer points of extensibility
• Access Control lists at segment level
• CRC-check sums per chunk
• Import and export of metadata to a file/segment
• Metadata can be stored on
• Any storage that supports “read after writes” consistency
ChunkStorage
Simple contract that each storage provider must implement
ChunkStorage works only at the chunk level; it is not aware of segment layout.
• Required
• Create (C)
• Read (R)
• Open (O) - open a chunk for read or write; returns SegmentHandle
• Write (W) - the write must be appended in a concurrency-safe way (otherwise each write needs to be a separate chunk)
• Delete (D)
• List (L) - list chunks
• Info (I) - get attributes of the chunk (e.g., size)
• Optional
• Merge (M) - concat existing chunks
• Truncate (T) - truncate at the end
• Make Writable / Read-only
• Not required
• Fencing logic
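As a rough illustration of how small the required contract is, here is a toy in-memory binding covering the CRUDL-style operations. All names here are hypothetical; the real ChunkStorage interface lives in the Pravega codebase and differs in signatures and error handling. Real bindings would map these calls onto files or objects.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// A toy chunk store: chunks are append-only byte sequences keyed by name.
class InMemoryChunkStorage {
    private final Map<String, ByteArrayOutputStream> chunks = new HashMap<>();

    public void create(String name) {                         // Create (C)
        chunks.put(name, new ByteArrayOutputStream());
    }

    public void write(String name, byte[] data) {             // Write (W): append-only
        chunks.get(name).write(data, 0, data.length);
    }

    public byte[] read(String name, int offset, int length) { // Read (R)
        byte[] all = chunks.get(name).toByteArray();
        byte[] out = new byte[length];
        System.arraycopy(all, offset, out, 0, length);
        return out;
    }

    public long info(String name) {                           // Info (I): size
        return chunks.get(name).size();
    }

    public void delete(String name) {                         // Delete (D)
        chunks.remove(name);
    }

    public Set<String> list() {                               // List (L)
        return chunks.keySet();
    }
}
```

Note that nothing in this contract involves fencing or segment layout: that is exactly the simplification SLTS is after.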
Advantage of SLTS
Technical
• Separation of Responsibility
• Storage providers must only support simple
CRUDL operations
• Clean implementation of Single Writer
pattern
• Increased concurrency
• higher degree of thread utilization
• higher degree of parallelism
• higher read and write throughput and
lower latency.
• Robust failure handling
Functional
• Plug-in model
• Enables third-party storage adapters for a wide variety of systems.
• SDK and Test Suite (in near future)
• Batteries included
• Pravega comes with built-in adapters for NFS, HDFS and ECS
• S3 (in near future)
• Its own built-in metadata store
• fine-tuned for SLTS usage pattern.
Timeline
• 0.8 – Alpha – initial implementation
• 0.9 – Beta 1 – experimental; initial bindings for File System (NFS), HDFS, and ECS
• 0.9.1 – Beta 2 – experimental; stability improvements and bug fixes
• 0.10 – stable release; additional bindings (e.g., S3); admin and diagnostic tools
• 0.11 – SLTS SDK and test suite; migration tools
• Future – default Pravega LTS
References
• Design documents - PDP 34
• https://github.com/pravega/pravega/wiki/PDP-34:-Simplified-Tier-2
• Slack Channel : pravega-lts
• https://pravega-io.slack.com/archives/C013PPW5WC9
Backup slides
Key Operations: Metadata-only Operations
• Create – a new metadata record is added
• Open – the metadata record is checked for access
• Exists – the metadata record is checked for existence
• Seal – the status field in the metadata record is updated
• Unseal – the status field in the metadata record is updated
• Delete – first marked for delete
Key Operations: Data Operations
• Read – chunk metadata is retrieved based on offset (a variant of binary search); the offset within the chunk is calculated; the read is issued on the underlying storage
• Parallel reads – as above, except multiple offset ranges are read in parallel
• Write – the active chunk's metadata is retrieved; the write is issued; chunk metadata is updated lazily
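The read path (find the chunk covering a stream offset, then compute the offset within it) can be sketched with a binary search over chunk start offsets. This is a simplified model in which segment metadata is reduced to an array of start offsets; the names are hypothetical, not Pravega's.

```java
import java.util.Arrays;

public class SegmentReadPath {
    // startOffsets[i] is the stream offset at which chunk i begins.
    // Returns {chunkIndex, offsetWithinChunk} for a given segment offset,
    // using binary search over the chunk start offsets.
    public static long[] locate(long[] startOffsets, long offset) {
        int i = Arrays.binarySearch(startOffsets, offset);
        if (i < 0) {
            i = -i - 2; // insertion point minus one: the chunk covering the offset
        }
        return new long[] { i, offset - startOffsets[i] };
    }
}
```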
Key Operations: Layout Changes
• Concatenation – when transactions are committed, the two chunk linked lists are concatenated; data is not moved
• Defrag – a big chunk is created by concatenating smaller chunks; multiple chunk metadata records are replaced with a single record for the larger chunk
• Truncate – truncated at the head by deleting unneeded chunks
• Rollover – when the size limit is reached, a new chunk is added
Key Scenarios
• Rolling Storage
• Add new chunk each time chunk size limit is exceeded
• Segment Store Failover
1. New SS records the size of chunk that it sees.
2. New SS seals the chunk at that offset (from previous step)
3. Old SS can keep on writing even after this, but that does not matter, as we will not read data after the recorded offset.
4. Old SS is fenced for tier-1 from making any metadata updates (all table segment updates
go through tier-1)
5. New SS starts a new chunk
6. New SS adds a metadata record for the new chunk
7. New SS replays the Write Ahead Log
8. New SS saves data to new chunk
9. If new SS fails, the process repeats
SLTS for BlobIt! Object Store on BookKeeper
BlobIt.org
github.com/diennea/blobit
Pravega BlobIt ChunkManager
github.com/diegosalvi/pravega-blobit-chunkmanager
Thank you! See you next time!
• Slack Invite – pravega-slack-invite.herokuapp.com
• Slack Workspace – pravega-io.slack.com
• Blog – blog.pravega.io
• April 16th meeting: KVT, Perf & Akka – community.cncf.io/e/m9mdcn
• Feedback – Derek.Moore@dell.com
Implementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdfImplementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdf
Implementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdfssuserf4844f
 

Similaire à 2021 March Pravega Community Meeting (20)

CodeIgniter For Project : Lesson 103 - Introduction to Codeigniter
CodeIgniter For Project : Lesson 103 - Introduction to CodeigniterCodeIgniter For Project : Lesson 103 - Introduction to Codeigniter
CodeIgniter For Project : Lesson 103 - Introduction to Codeigniter
 
7 Apache Process Cloudstack Developer Day
7 Apache Process Cloudstack Developer Day7 Apache Process Cloudstack Developer Day
7 Apache Process Cloudstack Developer Day
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Spring Framework 3.2 - What's New
Spring Framework 3.2 - What's NewSpring Framework 3.2 - What's New
Spring Framework 3.2 - What's New
 
Unicon Nov 2014 IAM Briefing
Unicon Nov 2014 IAM BriefingUnicon Nov 2014 IAM Briefing
Unicon Nov 2014 IAM Briefing
 
Create Great CNCF User-Base from Lessons Learned from Other Open Source Commu...
Create Great CNCF User-Base from Lessons Learned from Other Open Source Commu...Create Great CNCF User-Base from Lessons Learned from Other Open Source Commu...
Create Great CNCF User-Base from Lessons Learned from Other Open Source Commu...
 
Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Create great cncf user base from lessons learned from other open source com...
Create great cncf user base from   lessons learned from other open source com...Create great cncf user base from   lessons learned from other open source com...
Create great cncf user base from lessons learned from other open source com...
 
Unicon June 2014 IAM Briefing
Unicon June 2014 IAM BriefingUnicon June 2014 IAM Briefing
Unicon June 2014 IAM Briefing
 
Trove Updates - Liberty Edition
Trove Updates - Liberty EditionTrove Updates - Liberty Edition
Trove Updates - Liberty Edition
 
Openstack trove-updates
Openstack trove-updatesOpenstack trove-updates
Openstack trove-updates
 
Developing XWiki
Developing XWikiDeveloping XWiki
Developing XWiki
 
IWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache AiravataIWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache Airavata
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk Update
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
 
Implementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdf
Implementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdfImplementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdf
Implementing-SaaS-on-Kubernetes-Michael-Knapp-Andrew-Gao-Capital-One.pdf
 
StarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
StarlingX - A Platform for the Distributed Edge | Ildiko VancsaStarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
StarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
 

Dernier

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 

Dernier (17)

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 

2021 March Pravega Community Meeting

  • 8. What's Pravega? • Pravega is about streaming data... [Diagram: continuously generated data from data sources is ingested and stored by Pravega, providing consistency, elasticity, and durability, backed by scale-out storage (e.g., an object store); processing applications visualize, alert, train, and infer] 8 Pravega Community Meeting - March 2021
  • 9. What's Pravega? [Same diagram, adding Pravega's internals: a Durable Log plus Tiered Storage] 9 Pravega Community Meeting - March 2021
  • 10. What's Pravega? [Same diagram, adding clients: Java, with others under development] 10 Pravega Community Meeting - March 2021
  • 11. Where's Pravega used? 11 Pravega Community Meeting - March 2021
  • 12. Streaming Data Platform 12 Pravega Community Meeting - March 2021
  • 13. Pravega Community Meeting - March 2021 13 [Example deployments: RWTH (Dell Tech customer profile), amusement parks, construction sites, industrial IoT] https://www.delltechnologies.com/en-us/blog/episode-two-the-best-ride-of-your-life/ https://www.youtube.com/watch?v=BTh1gkf0kQQ https://www.youtube.com/watch?v=89IDFI9jry8
  • 14. Looking forward to seeing community use cases 14 Pravega Community Meeting - March 2021
  • 16. Timeline and status • Open-sourced early in 2017 • First open-source release: 0.1.0 – Dec. 19, 2017 • 0.9.0 is fresh out of the oven • In 2020, CNCF sandboxing • Transition • Overall bootstrapping • Setting up communication channels • Web site revamp (coming soon!) • Organizing repositories • Documenting governance https://github.com/cncf/toc/issues/560 Source: https://star-history.t9t.io/#pravega/pravega 16 Pravega Community Meeting - March 2021
  • 17. Repositories 17 Pravega Community Meeting - March 2021 Pravega Core Total: 43 Repositories Connectors: • Apache Flink • Apache Spark • Apache NiFi • Logstash • Presto (brand new) Kubernetes Operators: • Pravega • Apache BookKeeper • Apache Zookeeper Tools: • Pravega tools • Flink tools • Benchmark Client bindings: • Rust • Python
  • 18. Contributions • Unique collaborators across repositories: 135 • Vast majority from Dell • Expect more non-Dell contributions • Many open issues and opportunities to contribute • Look for guidance, maybe a mentor, if you want to get involved • Going forward • Expect more Pravega features • Important focus on ecosystem Pravega Community Meeting - March 2021 18
  • 19. Get involved! 19 Pravega Community Meeting - March 2021 https://github.com/pravega/pravega/wiki/Contributing
  • 20. T-Shirts for the best questions 20 Pravega Community Meeting - March 2021
  • 21. Pravega Community Meeting - March 2021 21 Thank you!
  • 33. @PravegaIO https://github.com/pravega/pravega Data Retention for Streams (without CBR) • A Stream can be configured for: • No Retention Policy • Data is never auto-truncated. • Manual truncation using explicit API invocation is possible. • SIZE/TIME-based Retention Policy • Data is periodically truncated based on size/time limits. • The policy supports specifying an "at least" (min) value. • A Retention Cycle runs periodically on the Controller and truncates Streams that breach the policy limit. http://pravega.io https://pravega-io.slack.com
  • 34. @PravegaIO https://github.com/pravega/pravega Stream Truncation • Every time the retention cycle runs, a new Stream-Cut is generated at the tail of the Stream. • Truncation can happen only at a specific Stream-Cut. • When a Stream is found to have more data than the configured retention limit, the Controller identifies a Stream-Cut that satisfies the Retention Policy and truncates the Stream at this Stream-Cut. http://pravega.io https://pravega-io.slack.com
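To make the mechanics concrete, here is a minimal single-segment sketch of how a retention cycle could pick a truncation Stream-Cut under a size-based policy. The function name and the flat-offset model are illustrative only, not the Pravega implementation:

```python
# Conceptual sketch of size-based retention (not the actual Pravega API).
# A Stream-Cut maps segment -> offset; for simplicity we model one segment,
# so each cut is just an offset.

def pick_truncation_cut(tail_offset, historical_cuts, min_retention_bytes):
    """Return the newest historical Stream-Cut offset whose truncation
    still retains at least min_retention_bytes, or None if no cut does."""
    best = None
    for cut in sorted(historical_cuts):              # oldest -> newest
        if tail_offset - cut >= min_retention_bytes:
            best = cut                               # newest cut keeping >= min
    return best

# Retention cycles recorded Stream-Cuts at offsets 100, 250 and 400;
# the stream tail is at 500 and the policy keeps at least 200 bytes.
print(pick_truncation_cut(500, [100, 250, 400], 200))  # -> 250
```

Truncating at offset 400 would retain only 100 bytes (below the min), so the cycle falls back to the cut at 250.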
  • 35. @PravegaIO https://github.com/pravega/pravega Size Based Retention http://pravega.io https://pravega-io.slack.com
  • 36. @PravegaIO https://github.com/pravega/pravega Limitations • Stream truncation is agnostic of reads by Reader Group(s). • Unread data could be lost. • Streams tend to consume more space, as the approach to deletion is conservative. • No max limit on Stream size. http://pravega.io https://pravega-io.slack.com
  • 37. @PravegaIO https://github.com/pravega/pravega Why CBR? • Some streaming use cases do not require data to be stored over the long term. • Once data is "read" by specific Reader Group(s), it can be deleted. • Environments may have constrained storage capacity, e.g., edge gateways. • Need to cycle/move out data as soon as it is read. http://pravega.io https://pravega-io.slack.com
  • 38. @PravegaIO https://github.com/pravega/pravega What is CBR? • Stream truncation can happen based on the read positions of specific Reader Groups reading from the Stream. • These Reader Groups need to be created as "subscriber" Reader Groups. • Read positions of non-subscriber Reader Groups do not impact Stream truncation. • The Stream Retention Policy also has a max limit (in addition to the min limit discussed earlier). http://pravega.io https://pravega-io.slack.com
  • 39. @PravegaIO https://github.com/pravega/pravega Configuring CBR • The existing Retention Policy (SIZE/TIME based) can stay as is. • To enable CBR, update the Reader Group configuration on the Client to make it a "subscriber" Reader Group. • If all Reader Groups are non-subscribers, the Stream won't have Consumption Based Retention. • Optionally, set a max (at-most) limit on the Stream Retention Policy. Defaults to LONG_MAX. http://pravega.io https://pravega-io.slack.com
  • 40. @PravegaIO https://github.com/pravega/pravega How CBR works • A subscriber Reader Group periodically publishes the Stream-Cut corresponding to its "read" positions in the Stream to the Controller. • This Stream-Cut is stored on the Controller. • When the retention cycle on the Controller runs, the Stream-Cuts from all subscriber Reader Groups are used to compute a single subscriber-lowerbound-stream-cut. • The Stream is truncated at this subscriber-lowerbound-stream-cut if it satisfies the min/max limit criteria. http://pravega.io https://pravega-io.slack.com
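The subscriber lower bound can be thought of as the per-segment minimum over all subscribers' published Stream-Cuts: truncating there never deletes data that any subscriber still has to read. A small illustrative sketch (not the Controller's actual code):

```python
# Conceptual sketch: the subscriber-lowerbound-stream-cut is the
# per-segment minimum of the Stream-Cuts published by all subscriber
# Reader Groups (illustrative model, not the Pravega implementation).

def subscriber_lower_bound(subscriber_cuts):
    """subscriber_cuts: list of {segment_id: offset} dicts, one per
    subscriber Reader Group. Returns the per-segment minimum offsets."""
    slb = {}
    for cut in subscriber_cuts:
        for segment, offset in cut.items():
            slb[segment] = min(offset, slb.get(segment, offset))
    return slb

rg1 = {0: 300, 1: 120}   # read positions of subscriber Reader Group 1
rg2 = {0: 250, 1: 180}   # read positions of subscriber Reader Group 2
print(subscriber_lower_bound([rg1, rg2]))  # -> {0: 250, 1: 120}
```

The slowest subscriber on each segment determines how far the stream can safely be truncated.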
  • 41. @PravegaIO https://github.com/pravega/pravega Reader Group Configuration changes … Retention Type – a new parameter that can be used to enable CBR: • AUTOMATIC_RELEASE_AT_LAST_CHECKPOINT • At every checkpoint completion, the Reader Group automatically emits the Stream-Cut corresponding to the checkpoint as a "read" acknowledgement to the Controller. • MANUAL_RELEASE_AT_USER_STREAMCUT • No automatic publishing of Stream-Cuts from Client to Controller. • Users need to create manual checkpoints and publish the Stream-Cuts corresponding to these checkpoints to the Controller using the updateRetentionStreamCut() API. • NONE (default) • This Reader Group does not participate in Consumption Based Retention. • Its read positions do not impact Stream truncation. http://pravega.io https://pravega-io.slack.com
  • 42. @PravegaIO https://github.com/pravega/pravega Min/Max and Subscriber Lower Bound • If Min < SLB < Max, truncate at SLB • If SLB < Min, truncate at Min • If SLB > Max, truncate at Max http://pravega.io https://pravega-io.slack.com
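The three cases above amount to clamping the subscriber lower bound (SLB) into the [Min, Max] range. A tiny sketch, with Min, Max and SLB modeled as abstract truncation positions (illustrative only):

```python
# The min/max decision from the slide, written as a clamp.
# slb, min_cut and max_cut are abstract truncation positions
# (in Pravega these would be Stream-Cuts; this is an illustrative model).

def truncation_point(slb, min_cut, max_cut):
    if slb < min_cut:
        return min_cut    # SLB below Min: truncate at Min
    if slb > max_cut:
        return max_cut    # SLB beyond Max: truncate at Max
    return slb            # Min < SLB < Max: truncate at SLB

print(truncation_point(50, 20, 80))   # -> 50
print(truncation_point(10, 20, 80))   # -> 20
print(truncation_point(95, 20, 80))   # -> 80
```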
  • 44. SLTS http://pravega.io Introducing Simplified Long Term Storage. Sachin Joshi Sr. Principal Software Engineer DELL EMC
  • 45. @PravegaProject https://github.com/pravega/pravega http://pravega.io https://pravega-io.slack.com Why: Pravega is Streaming Storage • Durability is fundamental • Once acknowledged, data is never lost • Performance is critical • Low latency • High throughput • Storage efficiency is important • Single unified API that works for • Real-time data • Historical data • Excellent choice for Kappa Architecture • Automatic tiering • between low-latency short-term storage and large-capacity external long-term storage • space efficient and performant • completely transparent • Bring your own external storage • Cloud native • Multi-cloud • Meet customers where they are already moving (object stores) • Enable edge.
  • 46. http://pravega.io LTS is an integral part of Pravega storage • Traditional way vs. Pravega way @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 47. http://pravega.io LTS is an integral part of Pravega storage • Traditional way vs. Pravega way (continued) @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 48. http://pravega.io Quick Background • Concepts • Stream • Segments • Append-only semantics; can be sealed. • Transactions use concatenation • Scaling up or down • Mapping of routing key to segment is maintained by the Controller • Implementation • Segment Store • Multiple containers per Segment Store. • The number of containers is fixed for a deployment. • Mapping from segment to containers is consistent. • Storages • Tier-1: in-cluster short-term storage for the Write-Ahead Log • Tier-2: external long-term storage • Cache – ephemeral (internal spillover cache) • Assumptions • Tier-2 writes are async, not on the critical path • Throughput matters more • A segment is an opaque sequence of bytes • Strong assumptions about Tier-1 fencing. • Tier-2 reads can be optimized by prefetching @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 49. http://pravega.io Goal: Provide the Segment Abstraction to upper layers Requirements • Segments are dynamic: • Grow at the tail end • Shrink at the head end • Otherwise immutable • Segments are everywhere • Segments form the very foundation on top of which higher-order Pravega features and data structures are built. • Almost all the data in Pravega is ultimately stored in such segments. • User streams, attribute segments, • Key Value Tables, • internal Pravega streams, • client state management, etc. • All of these assume that this segment abstraction works as expected Challenges • How to build segments out of immutable objects? • How to enforce the single-writer pattern? • How to implement atomic appends? • How to deal with eventual consistency? • How to truncate at the head? @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 50. http://pravega.io Problem 1: Split Brain & Fencing Problem • Requirement: • Data is appended atomically • Strong single-writer pattern • Split-brain: in case of a network partition, more than one Segment Store may end up writing to the same underlying file, causing data corruption. • Writes to storage are async, so an older SS may simply be flushing data even if not accepting new traffic • Long GC pauses • Safe appends are needed: • NFSv3 locks are problematic • HDFS appends are atomic, but not concurrency safe – cannot append at a specific offset. • AWS S3 has eventual consistency Current Solution • Fencing • Use "fencing" to mark ownership on the underlying storage object to prevent an older SS from writing • Implementation depends on the guarantees and capabilities of the underlying storage • How we do it today • ECS S3 – use offset-conditional appends • HDFS – use atomic renames • NFS – overwrite data • Downside • Each storage binding provides a different set of guarantees • A new fencing solution needs to be provided for each new binding • Possible performance degradation • Hard to reason about correctness and liveness properties. • Saving grace: SSs are simply applying the changelog from the WAL, so they should produce the same data at the same offset. @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
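To illustrate the offset-conditional append technique mentioned for ECS S3, here is a toy in-memory model: a write succeeds only if the caller's expected offset matches the object's current length, so a stale Segment Store flushing with an out-of-date offset is rejected. The class and method names are invented for illustration; real fencing involves more than this:

```python
# Toy model of offset-conditional appends: the storage accepts an append
# only at the current end of the object, so a writer holding a stale
# offset (e.g. an old Segment Store after failover) fails fast.

class ConditionalAppendObject:
    def __init__(self):
        self.data = b""

    def append(self, expected_offset, payload):
        if expected_offset != len(self.data):
            raise IOError("conditional append failed: bad offset")
        self.data += payload
        return len(self.data)

obj = ConditionalAppendObject()
obj.append(0, b"abc")        # new owner writes at offset 0
try:
    obj.append(0, b"stale")  # old owner still believes the offset is 0
except IOError as e:
    print("fenced:", e)
obj.append(3, b"def")        # new owner continues at offset 3
print(obj.data)              # -> b'abcdef'
```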
  • 51. http://pravega.io Problem 2: Object Stores Problem • Requirement: • Data is appended atomically • Strong single-writer pattern • No append functionality: • AWS S3 does not provide appends or partial updates • The entire object must be overwritten • Eventually consistent: • Provides only a read-after-write guarantee, and only on new object creation • All GETs are eventually consistent – you may get an old version • Listing objects is eventually consistent • Versions are also eventually consistent • Cost • Per-operation charges Current Solution • Not supported today • Possible solution: use multi-part upload • Create a big object out of small parts • Use CopyPartRequest • Issues • Still must deal with eventual consistency • Managing objects is hard with eventual consistency • Versioning doesn't help • No mechanism today to manage multiple objects @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 52. http://pravega.io What: Defining scope Goals • Simplify the API contract for storage bindings • Eliminate the need for complex fencing during failover (e.g., during partitions) • Give freedom to optimize append logic • Leverage storage-native merge/concatenation capability • Provide extension points for the future • Additional background services – defragmentation, integrity checks • Data compression, encryption, erasure encoding • Access control • Multiple tiers Out of scope • This design does not target the following • Reading data written by Pravega without using Pravega. • Import or export of pre-existing data in other formats (e.g., Avro) • Multiple tiers @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 53. http://pravega.io How: Architecture Change • Segment Store – current: ReadIndex, DurableLog, SegmentMapper, StorageReader, StorageWriter, Cache, AsyncStorage, RollingStorage, SyncStorage • Segment Store – new design: ReadIndex, DurableLog, SegmentMapper, StorageReader, StorageWriter, Cache, ChunkedSegmentStorage, Metadata TableStore, SyncStorage, ChunkStorage, Storage Optimizer (Defragment, GarbageCollector) • Legend: Unchanged / Deleted / Added / Changed @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
  • 54. http://pravega.io Unified Chunk Management Layer • Chunk • "Unit" of storage: stored on the underlying storage as files or objects. • Append-only writes: • Effectively leverage append or concat from the underlying storage wherever we can. • Otherwise each write is a separate chunk • Strong single-writer pattern: chunks can be written by only one Segment Store. • Immutable: once they become inactive, they are considered "sealed". • Names: arbitrary, but must be globally unique. • Segment • Made from chunks – conceptually a linked list of chunks • A new chunk is added when • A new segment is written for the first time • After failover • The underlying file/object reaches its limit • The underlying storage does not provide safe append semantics • Append-only writes – there can be only one active chunk per segment at any time. @PravegaProject https://github.com/pravega/pravega https://pravega-io.slack.com
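The "segment as a linked list of chunks" idea can be sketched in a few lines: reading at a segment offset means walking the chunk list to find the owning chunk and the local offset within it. This is an illustrative model, not the SLTS code:

```python
# Conceptual sketch: a segment is an ordered list of immutable chunks;
# a read at a segment offset resolves to (chunk, offset-within-chunk).
# (Illustrative model only, not the SLTS implementation.)

def locate(chunks, segment_offset):
    """chunks: list of (chunk_name, length) in segment order.
    Returns (chunk_name, offset_within_chunk)."""
    start = 0
    for name, length in chunks:
        if segment_offset < start + length:
            return name, segment_offset - start
        start += length
    raise ValueError("offset beyond segment length")

segment = [("chunk-0", 100), ("chunk-1", 50), ("chunk-2", 200)]
print(locate(segment, 120))  # -> ('chunk-1', 20)
```

Appending only ever extends the last (active) chunk or adds a new one, which is why the earlier chunks can stay immutable.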
  • 55. http://pravega.io ChunkedSegmentStorage Component in each Segment Store container that manages metadata for the segments it owns • Conceptually, the segment metadata consists of • a header describing various properties of a segment • plus a linked list of chunk metadata records describing each chunk. • Metadata is stored in a BookKeeper-based Table • Stored as KV pairs • The Table is pinned to a container. • Metadata updates are atomic (when using multiple records). • Metadata updates are fenced by Tier-1 • Metadata updates are efficient • Metadata updates are infrequent and applied lazily • Metadata records are cached using a write-through cache for read performance. • Important optimization – metadata about a chunk can be updated only when it becomes inactive. • Metadata records offer points of extensibility • Access control lists at the segment level • CRC checksums per chunk • Import and export of metadata to a file/segment • Metadata can be stored on • any storage that supports "read after write" consistency
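The metadata layout described above (one header record per segment plus a linked list of chunk records, all stored as KV pairs) can be sketched like this; the key format and field names are invented for illustration:

```python
# Sketch of the segment metadata layout: a header record plus one record
# per chunk, forming a linked list, all stored as key/value pairs.
# Keys and field names here are illustrative, not the SLTS schema.

def write_segment_metadata(table, segment, length, chunks):
    # Header record: segment-wide properties, including the first chunk.
    table[f"segment/{segment}"] = {
        "length": length,
        "first_chunk": chunks[0]["name"] if chunks else None,
    }
    # Chunk records form a linked list via the "next" field.
    for i, chunk in enumerate(chunks):
        nxt = chunks[i + 1]["name"] if i + 1 < len(chunks) else None
        table[f"chunk/{chunk['name']}"] = {
            "length": chunk["length"],
            "next": nxt,
        }

table = {}
write_segment_metadata(table, "s1", 150,
                       [{"name": "c0", "length": 100},
                        {"name": "c1", "length": 50}])
print(table["segment/s1"]["first_chunk"])  # -> c0
print(table["chunk/c0"]["next"])           # -> c1
```

Because each record is a separate KV pair, updating the active chunk touches only its own record, which is what makes lazy, infrequent metadata updates practical.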
  • 56. http://pravega.io ChunkStorage
A simple contract that each storage provider must implement. ChunkStorage implementations work only at the chunk level; they are not aware of segment layout.
• Required
• Create (C)
• Read (R)
• Open (O): open a chunk for read or write, returns a handle
• Write (W): writes must be appended in a concurrency-safe way (otherwise each write needs to be a separate chunk)
• Delete (D)
• List (L): list chunks
• Info (I): get attributes of the chunk (e.g. size)
• Optional
• Merge (M): concat existing chunks
• Truncate (T): truncate at the end
• Make writable/read-only
• Not required
• Fencing logic
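A contract of this shape, with an in-memory binding, might look like the sketch below. The method names mirror the operations listed above but are illustrative only, not Pravega's actual interface:

```python
# Minimal sketch of the ChunkStorage contract (CRUDL + Info) with an
# in-memory implementation. Names are illustrative, not Pravega's API.
from abc import ABC, abstractmethod

class ChunkStorage(ABC):
    @abstractmethod
    def create(self, name): ...       # C
    @abstractmethod
    def read(self, name, offset, length): ...   # R
    @abstractmethod
    def write(self, name, data): ...  # W: append-only
    @abstractmethod
    def delete(self, name): ...       # D
    @abstractmethod
    def list(self): ...               # L
    @abstractmethod
    def info(self, name): ...         # I: attributes, e.g. size
    # Optional capabilities (merge/concat, truncate) are advertised,
    # not required; fencing logic is never required of the provider.
    def supports_concat(self):
        return False

class InMemoryChunkStorage(ChunkStorage):
    def __init__(self):
        self._chunks = {}

    def create(self, name):
        self._chunks[name] = bytearray()

    def read(self, name, offset, length):
        return bytes(self._chunks[name][offset:offset + length])

    def write(self, name, data):
        self._chunks[name] += data    # append-only

    def delete(self, name):
        del self._chunks[name]

    def list(self):
        return sorted(self._chunks)

    def info(self, name):
        return {"size": len(self._chunks[name])}
```

Because the required surface is just create/read/write/delete/list/info, a binding for a new storage system stays small; everything segment-aware lives above this layer in ChunkedSegmentStorage.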
  • 57. http://pravega.io Advantages of SLTS
Technical
• Separation of responsibility: storage providers must only support simple CRUDL operations
• Clean implementation of the single-writer pattern
• Increased concurrency:
• Higher degree of thread utilization
• Higher degree of parallelism
• Higher read and write throughput, lower latency
• Robust failure handling
Functional
• Plug-in model: enables third-party storage adapters for a wide variety of systems
• SDK and test suite (in the near future)
• Batteries included:
• Pravega comes with built-in adapters for NFS, HDFS and ECS
• S3 (in the near future)
• Its own built-in metadata store, fine-tuned for the SLTS usage pattern
  • 58. http://pravega.io Timeline
Release | Notes
0.8 | Alpha – initial implementation
0.9 | Beta 1 – experimental; initial bindings for file system (NFS), HDFS and ECS
0.9.1 | Beta 2 – experimental; stability improvements and bug fixes
0.10 | Stable release; additional bindings (e.g. S3); admin and diagnostic tools
0.11 | SLTS SDK and test suite; migration tools
Future | Default Pravega LTS
  • 59. http://pravega.io References
• Design document – PDP 34
• https://github.com/pravega/pravega/wiki/PDP-34:-Simplified-Tier-2
• Slack channel: pravega-lts
• https://pravega-io.slack.com/archives/C013PPW5WC9
  • 61. Key Operations: Metadata-only Operations
Operation | How Implemented
Create | A new metadata record is added
Open | The metadata record is checked for access
Exists | The metadata record is checked for existence
Seal | The status field in the metadata record is updated
Unseal | The status field in the metadata record is updated
Delete | The segment is first marked for delete
  • 62. Key Operations: Data Operations
Operation | How Implemented
Read | Chunk metadata is retrieved based on the offset (a variant of binary search); the offset within the chunk is calculated; the read is issued on the underlying storage
Parallel reads | As above, except multiple offset ranges are read in parallel
Write | The active chunk's metadata is retrieved; the write is issued; chunk metadata is updated lazily
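The read path above reduces to an offset lookup: given the chunks' start offsets, binary-search for the chunk covering a segment offset, then compute the offset within that chunk. A minimal sketch (purely illustrative, not Pravega code):

```python
# Sketch of the read path: binary search over chunk start offsets to
# find the chunk covering a segment offset, then the in-chunk offset.
import bisect

def locate(chunk_starts, chunk_names, segment_offset):
    """Return (chunk_name, offset_within_chunk) for a segment offset.

    chunk_starts must be sorted ascending; chunk i covers
    [chunk_starts[i], chunk_starts[i+1]).
    """
    i = bisect.bisect_right(chunk_starts, segment_offset) - 1
    return chunk_names[i], segment_offset - chunk_starts[i]
```

A parallel read would simply call `locate` for each requested offset range and issue the per-chunk reads concurrently against the underlying storage.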
  • 63. Key Operations: Layout Changes
Operation | How Implemented
Concatenation | When transactions are committed, the two linked lists of chunks are concatenated; data is not moved
Defrag | A big chunk is created by concatenating smaller chunks; multiple chunk metadata records are replaced with a single record for the larger chunk
Truncate | The segment is truncated at the head by deleting unneeded chunks
Rollover | When the size limit is reached, a new chunk is added
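The key point of the table above is that concatenation and head truncation are metadata-only list operations, with no data movement. A sketch under that assumption (chunk lists are modeled as `(name, length)` pairs; this is illustrative, not Pravega code):

```python
# Sketch of layout changes as pure metadata operations on chunk lists.
# Each chunk is modeled as a (name, length) pair; no data is moved.

def concat(target_chunks, txn_chunks):
    """Commit a transaction: splice its chunk list onto the target's."""
    return target_chunks + txn_chunks      # list concatenation only

def truncate_head(chunks, truncate_offset):
    """Drop chunks that lie entirely before the truncation offset."""
    kept, start = [], 0
    for name, length in chunks:
        if start + length > truncate_offset:
            kept.append((name, length))    # still holds live data
        start += length
    return kept
```

Defrag is the inverse of concatenation at the metadata level: after the storage provider merges small chunks into one big chunk, their several records collapse into a single record for the merged chunk.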
  • 64. Key Scenarios
• Rolling Storage
• Add a new chunk each time the chunk size limit is exceeded
• Segment Store Failover
1. The new Segment Store (SS) records the size of the chunk that it sees.
2. The new SS seals the chunk at that offset (from the previous step).
3. The old SS can keep writing even after this, but that doesn't matter: data after the recorded offset will never be read.
4. The old SS is fenced by tier-1 from making any metadata updates (all table segment updates go through tier-1).
5. The new SS starts a new chunk.
6. The new SS adds a metadata record for the new chunk.
7. The new SS replays the write-ahead log.
8. The new SS saves data to the new chunk.
9. If the new SS fails, the process repeats.
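The safety argument in the failover steps above can be sketched in miniature: the new owner seals the chunk at the size it observed, and all reads are clamped to that offset, so any late appends by the fenced-out old owner are invisible. This is a toy model of the idea, not Pravega code:

```python
# Toy model of failover sealing: reads never go past the offset the
# new Segment Store observed, so late writes by the old owner are
# harmless even though the storage itself does no fencing.

def failover(observed_size, chunk_data):
    sealed_at = observed_size              # steps 1-2: seal at observed offset

    def read(offset, length):
        # Step 3: data beyond sealed_at (possibly a late append by the
        # old owner) is never returned.
        end = min(offset + length, sealed_at)
        return chunk_data[offset:end]

    return read
```

Metadata fencing (step 4) is separate and comes for free from tier-1, which is why the ChunkStorage contract itself never needs fencing logic.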
  • 65. SLTS for BlobIt! Object Store on BookKeeper
• BlobIt – BlobIt.org – github.com/diennea/blobit
• Pravega BlobIt ChunkManager – github.com/diegosalvi/pravega-blobit-chunkmanager
  • 66. Thank you! See you next time!
• Slack invite – pravega-slack-invite.herokuapp.com
• Slack workspace – pravega-io.slack.com
• Blog – blog.pravega.io
• April 16th meeting: KVT, Perf & Akka – community.cncf.io/e/m9mdcn
• Feedback – Derek.Moore@dell.com
