Using akka-streams to
access S3 objects
Mikhail Girkin
Software Engineer
GILT
HBC Digital
@mike_girkin
Codez? Codez!
https://github.com/gilt/gfc-aws-s3
Initial problem
● Several big (hundreds of MB) database result sets
● All data cached in memory
● Served as JSON files
● The service was constantly OOM-ing, even on a 32 GB instance
Akka-streams
● Library from the akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves
What is a “stream”
● Sequence of objects
● Has an input
● Has an output
● Defined as a sequence of data transformations
● Could be infinite
● Steps could be executed independently
Stream input - Source
● The entry point of the data into the stream
● Has an output channel to feed data into the stream
Example: SQLSource
Stream output - Sink
● The final point of the data in the stream
● Has an input channel to receive the data from the stream
Example: S3 object
Processing - Flow
● The transformation procedure of the stream
● Takes data from the input, applies some computation to it, and passes the resulting data to the output
Example: Serialization
Basic stream operations
● via
  Source via Flow => Source
  Flow via Flow => Flow
● to
  Flow to Sink => Sink
  Source to Sink => Stream (a RunnableGraph)
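The composition rules above can be read directly off the types. A minimal sketch, assuming akka-stream 2.6 on the classpath; the value names are made up for illustration and nothing here runs a stream:

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, RunnableGraph, Sink, Source}

object CompositionTypes {
  val numbers: Source[Int, NotUsed] = Source(1 to 3)
  val addTen: Flow[Int, Int, NotUsed] = Flow[Int].map(_ + 10)
  val asText: Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)

  // Source via Flow => Source
  val shifted: Source[Int, NotUsed] = numbers.via(addTen)

  // Flow via Flow => Flow
  val pipeline: Flow[Int, String, NotUsed] = addTen.via(asText)

  // Flow to Sink => Sink
  val textSink: Sink[Int, NotUsed] = pipeline.to(Sink.ignore)

  // Source to Sink => a closed, runnable description of the stream
  val graph: RunnableGraph[NotUsed] = numbers.to(textSink)
}
```

Each value is only a blueprint; `graph.run()` with a materializer in scope would be needed to move any data.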
Declaration is not execution!
Stream description is just a declaration, so:
val s = Source[Int](Range(1, 100).toList)
  .via(
    Flow[Int].map(x => x + 10)
  ).to(
    Sink.foreach(println)
  )
will not execute until you call
s.run()
The skeleton
Get data -> serialize -> send to S3
def run(): Future[Long] = {
  val cn = getConnection()
  val stream = (cn: Connection) =>
    dataSource.streamList(cn)            // Source[Item] - get data from the DB
      .via(serializeFlow)                // Flow[Item, Byte] - serialize
      .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
  val countFuture = stream(cn).run()
  countFuture.onComplete { r =>
    cn.close()
  }
  countFuture
}
Serialize in the stream
● We are dealing with a single collection
● All items have the same type
val serializeFlow = Flow[Item]
  .map(x => serializeItem(x)) // serializeItem: Item => String
  .intersperse("[", ",", "]") // sort of mkString for streams
  .mapConcat[Byte] { // mapConcat ≈ flatMap
    x => x.getBytes("UTF-8").toIndexedSeq // explicit charset instead of the platform default
  }
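To see what intersperse contributes, the sketch below runs it on a few already-serialized items and collects the emitted elements. It assumes akka-stream 2.6 (where an implicit ActorSystem supplies the materializer); `IntersperseDemo` and `framed` are illustrative names, not part of gfc-aws-s3.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object IntersperseDemo {
  // Emits "[", then the items separated by ",", then "]".
  def framed(items: List[String]): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("intersperse-demo")
    try {
      val result = Source(items)
        .intersperse("[", ",", "]")
        .runWith(Sink.seq)
      Await.result(result, 5.seconds)
    } finally system.terminate()
  }
}
```

Concatenating the result of `IntersperseDemo.framed(List("1", "2", "3"))` gives `[1,2,3]`, i.e. a JSON array assembled one element at a time without ever holding the whole document in memory.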
S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● (!) By default, uploaded chunks have no TTL
Simplified API:
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => etag
3. complete(uploadId, List[etag])
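The three calls can be modelled with a small in-memory stand-in, which makes the protocol easy to experiment with. This is a hypothetical sketch and not the AWS SDK: `InMemoryMultipartStore`, its etag format, and returning the assembled bytes from `complete` are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of the simplified multipart API.
final class InMemoryMultipartStore {
  private val pending = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]
  private var counter = 0

  // 1. initialize(bucket, filename) => uploadId
  def initialize(bucket: String, filename: String): String = {
    counter += 1
    val uploadId = s"$bucket/$filename#$counter"
    pending(uploadId) = mutable.Map.empty[Int, Array[Byte]]
    uploadId
  }

  // 2. uploadChunk(uploadId, partNumber, content) => etag
  def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
    pending(uploadId)(partNumber) = content
    s"etag-$partNumber-${content.length}"
  }

  // 3. complete(uploadId, etags): assembles the parts in part-number order
  def complete(uploadId: String, etags: List[String]): Array[Byte] = {
    val parts = pending.remove(uploadId)
      .getOrElse(sys.error(s"unknown upload: $uploadId"))
    require(etags.size == parts.size, "one etag per uploaded part is expected")
    parts.toSeq.sortBy(_._1).flatMap(_._2).toArray
  }
}
```

Because parts are keyed by number, they can arrive in any order (or in parallel) and still be assembled correctly. Until `complete` is called the parts simply stay in `pending`, which mirrors the "no TTL" warning above.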
Resource access
● Pattern: Open - Do stuff - Close
open: () => TState
onEach: (TState, TItem) => (TState)
close: TState => TResult
● Functional pattern - fold over the state
○ With an additional call at the end
● Akka-streams lacks a Sink of that type
● open is called lazily, on arrival of the first element of the stream
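Outside of akka-streams, the open / onEach / close pattern is an ordinary fold with one extra call at the end. A minimal sketch over a plain Iterator; `FoldResource.run` is an illustrative name, and calling open at close time for an empty input is an assumption of this sketch, not something the slides specify.

```scala
object FoldResource {
  def run[TState, TItem, TResult](
      open: () => TState,
      onEach: (TState, TItem) => TState,
      close: TState => TResult
  )(items: Iterator[TItem]): TResult = {
    // lazy val: open() runs only when the state is first needed
    lazy val initial = open()
    val finalState = items.foldLeft(Option.empty[TState]) { (state, item) =>
      Some(onEach(state.getOrElse(initial), item))
    }
    // Assumption: with no items at all, the resource is opened here so it can be closed.
    close(finalState.getOrElse(initial))
  }
}
```

For example, `FoldResource.run[StringBuilder, String, String](() => new StringBuilder, (sb, s) => sb.append(s), _.toString)(Iterator("a", "b", "c"))` opens the builder on the first element, appends each item, and returns "abc" from the closing call.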
Let's create a new sink!
class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => (TState),
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
Methods to write:
def onPush(): Unit
override def preStart(): Unit
override def onUpstreamFinish(): Unit
override def onUpstreamFailure(ex: Throwable): Unit
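One way the four methods might fit together is sketched below. This is not the gfc-aws-s3 implementation: the class name `FoldResourceSinkSketch`, the `failWith` helper, and the choice to open the resource at completion time for an empty stream are assumptions, and the code targets the akka-stream 2.6 GraphStage API.

```scala
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

class FoldResourceSinkSketch[TState, TItem, Mat](
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  private val in: Inlet[TItem] = Inlet[TItem]("FoldResourceSinkSketch.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(
      inheritedAttributes: Attributes
  ): (GraphStageLogic, Future[Mat]) = {
    val done = Promise[Mat]()

    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[TState] = None

      // Fails both the materialized future and the stage itself.
      private def failWith(ex: Throwable): Unit = {
        done.tryFailure(ex)
        failStage(ex)
      }

      // A sink has to signal demand itself, so request the first element.
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit =
        try {
          val current = state.getOrElse(open()) // lazy open on the first element
          state = Some(onEach(current, grab(in)))
          pull(in)
        } catch { case NonFatal(ex) => failWith(ex) }

      override def onUpstreamFinish(): Unit =
        try {
          done.trySuccess(close(state.getOrElse(open())))
          completeStage()
        } catch { case NonFatal(ex) => failWith(ex) }

      // close is deliberately not called when upstream fails.
      override def onUpstreamFailure(ex: Throwable): Unit = failWith(ex)

      setHandler(in, this)
    }

    (logic, done.future)
  }
}
```

As a usage example, `Source(1 to 5).runWith(new FoldResourceSinkSketch[Int, Int, Int](() => 0, _ + _, identity))` materializes a Future that completes with 15 once upstream finishes.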
S3Sink from FoldResourceSink
● SinkA = Flow to SinkB
S3 upload sink = S3 upload flow to FoldResourceSink
What is TState and TItem?
We need to keep track of uploadId, etags, and the total length uploaded so far
case class S3MultipartUploaderState(
  uploadId: String,
  etags: List[PartETag],
  totalLength: Long
)
And item is:
(ByteString, Int) // (content, chunkNumber)
FoldResourceSink for S3
val sink =
  Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
    () => initUpload(), // Returns state
    { case (state, (chunk, chunkNumber)) =>
      uploadChunk(state, chunk, chunkNumber) },
    completeUpload // Accepts state
  )
Flow[Byte]
  .grouped(chunkSize)
  .map(b => ByteString(b:_*))
  .zip(
    Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
  ).toMat(sink)(Keep.right)
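The grouping and numbering step can be tried in isolation by swapping the upload sink for Sink.seq. A small sketch assuming akka-stream 2.6; `ChunkingDemo` and `numberedChunks` are illustrative names, and the tiny chunk size is only for readability.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object ChunkingDemo {
  // Groups bytes into chunks and pairs each chunk with a 1-based part number.
  def numberedChunks(bytes: List[Byte], chunkSize: Int): Seq[(ByteString, Int)] = {
    implicit val system: ActorSystem = ActorSystem("chunking-demo")
    try {
      val chunks = Source(bytes)
        .grouped(chunkSize)
        .map(group => ByteString(group.toArray))
        .zip(Source.fromIterator(() => Iterator.from(1)))
        .runWith(Sink.seq)
      Await.result(chunks, 5.seconds)
    } finally system.terminate()
  }
}
```

Five bytes with a chunk size of 2 produce three chunks of lengths 2, 2 and 1, numbered 1, 2 and 3. Against real S3 the chunk size would have to be much larger, since S3 requires every part except the last to be at least 5 MiB.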
SQL Source
Anorm provides an akka-stream SQL source
libraryDependencies ++= Seq(
  "com.typesafe.play" %% "anorm-akka" % "version",
  "com.typesafe.akka" %% "akka-stream" % "version")
AkkaStream.source(SQL"SELECT * FROM Test",
  SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
Brings minimal transitive dependencies (!)
Road to production
● Retries in case of S3 errors/failures
○ The S3 client handles this
● Handle possible problems during stream execution (e.g. a failure talking to the DB)
○ When the stream fails, it never calls complete
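Since a failed stream never reaches complete, the already-uploaded parts would stay in S3 unless something aborts the upload. One possible shape for that cleanup is sketched below; `UploadCleanup.abortOnFailure` and its `abort` callback are hypothetical, and in a real service the callback might wrap the AWS SDK's abortMultipartUpload call.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Failure

object UploadCleanup {
  // Runs `abort` for the given upload if the stream's result future fails;
  // the returned future carries the original result or failure unchanged.
  def abortOnFailure[T](uploadId: String, result: Future[T])(abort: String => Unit)(
      implicit ec: ExecutionContext
  ): Future[T] =
    result.andThen { case Failure(_) => abort(uploadId) }
}
```

The future returned by `andThen` completes only after the abort callback has run, so callers awaiting it can rely on the cleanup having been attempted.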
Could we do it the other way round?
● S3 tends to time out and drop the connection when a large file is downloaded slowly
● Ability to process data in a streaming manner
S3 protocol for partial downloads
● By parts (see multipart upload)
○ Uses part numbers
○ Doesn’t work when the upload wasn’t multipart
○ Amazon says it’s faster
● By chunks
○ Chunk is defined by (from, to) byte numbers
○ Works for any file, and any chunk length
○ Amazon says it’s slow
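For the by-chunks variant, the (from, to) pairs can be computed up front from the object length. A minimal sketch; `ChunkRanges` is an illustrative name, and treating `to` as inclusive follows HTTP Range semantics, which is an assumption about how the pairs are consumed.

```scala
object ChunkRanges {
  // Splits [0, length) into consecutive (from, to) ranges with inclusive ends.
  def apply(length: Long, chunkSize: Long): Seq[(Long, Long)] = {
    require(length >= 0, "length must not be negative")
    require(chunkSize > 0, "chunkSize must be positive")
    (0L until length by chunkSize).map { from =>
      (from, math.min(from + chunkSize, length) - 1)
    }
  }
}
```

For a 10-byte object and a chunk size of 4, `ChunkRanges(10, 4)` yields (0, 3), (4, 7), (8, 9); the last range is shorter, and an empty object yields no ranges.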
Basic idea
By parts:
1. Get the part count
2. For each part, create an akka source
3. Combine the individual streams into one
By chunks:
1. Get the file length
2. For each chunk of the file, create an akka source
3. Combine the individual streams into one
Create an akka source from an IO stream:
val stream: InputStream = …
StreamConverters.fromInputStream(() => stream)
Downloading by parts
Source.single(getPartCount(s3Client, bucketName, key)
).flatMapConcat { partCount =>
  Source(
    Range(firstPartIndex, partCount + firstPartIndex)
  )
}.flatMapConcat { partNumber =>
  StreamConverters.fromInputStream(
    () => getS3ObjectContent(partNumber, readMemoryBufferSize)
  )
} // Type - Source[ByteString, NotUsed]
Downloading by parts
Source.single(())
  .map(
    _ => getPartCount(s3Client, bucketName, key) // evaluated only when the stream runs
  ).flatMapConcat { partCount =>
    Source(
      Range(firstPartIndex, partCount + firstPartIndex)
    )
  }.flatMapConcat { partNumber =>
    StreamConverters.fromInputStream(
      () => getS3ObjectContent(partNumber, readMemoryBufferSize)
    )
  } // Type - Source[ByteString, NotUsed]
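The by-chunks variant has the same shape: one small source per (from, to) range, concatenated with flatMapConcat. The sketch below assumes akka-stream 2.6 and uses a byte array as a stand-in for the S3 object; `ChunkedDownloadSketch` and `openRange` are hypothetical, with `openRange` playing the role of a ranged S3 GET.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object ChunkedDownloadSketch {
  // Hypothetical stand-in for a ranged GET; `to` is inclusive.
  def openRange(obj: Array[Byte], from: Int, to: Int): InputStream =
    new ByteArrayInputStream(obj, from, to - from + 1)

  // One source per chunk, read strictly in order.
  def chunkedSource(obj: Array[Byte], chunkSize: Int): Source[ByteString, NotUsed] =
    Source(0 until obj.length by chunkSize)
      .flatMapConcat { from =>
        val to = math.min(from + chunkSize, obj.length) - 1
        StreamConverters.fromInputStream(() => openRange(obj, from, to))
      }

  // Runs the stream and concatenates everything, for demonstration only.
  def download(obj: Array[Byte], chunkSize: Int): ByteString = {
    implicit val system: ActorSystem = ActorSystem("chunked-download")
    try {
      val bytes = chunkedSource(obj, chunkSize).runFold(ByteString.empty)(_ ++ _)
      Await.result(bytes, 5.seconds)
    } finally system.terminate()
  }
}
```

Each range is opened only when the previous one has been fully consumed, so at most one connection is held at a time. In a real pipeline the source would feed a streaming consumer rather than being folded into a single ByteString.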
gfc-aws-s3 https://github.com/gilt/gfc-aws-s3
Open-source project containing the code above (Sources and Sink)
Also includes s3-http as an educational example
Codez!
200 OK

Using akka streams to access s3 objects

  • 1. Using akka-streams to access S3 objects Mikhail Girkin Software Engineer GILT HBC Digital @mike_girkin
  • 3. Initial problem ● Several big (hundreds of MB) database result sets ● All data cached in memory ● Served as JSON files ● The service was constantly OOM-ing, even on a 32 GB instance
  • 4. Akka-streams ● A library from the Akka toolbox ● Built on top of the actor framework ● Handles streams and their specifics without exposing the actors themselves
  • 5. What is a “stream”? ● A sequence of objects ● Has an input ● Has an output ● Defined as a sequence of data transformations ● Can be infinite ● Steps can be executed independently
  • 6. Stream input - Source ● The point where data enters the stream ● Has an output channel that feeds data into the stream ● Example in the diagram: SQLSource
  • 7. Stream output - Sink ● The final point of the data in the stream ● Has an input channel that receives data from the stream ● Example in the diagram: S3 object
  • 8. Processing - Flow ● The transformation step of the stream ● Takes data from the input, applies some computation to it, and passes the resulting data to the output ● Example in the diagram: Serialization
  • 9. Basic stream operations ● via: Source via Flow => Source; Flow via Flow => Flow ● to: Flow to Sink => Sink; Source to Sink => Stream (a runnable graph)
  • 10. Declaration is not execution! Stream description is just a declaration, so:
        val s = Source[Int](Range(1, 100).toList)
          .via(Flow[Int].map(x => x + 10))
          .to(Sink.foreach(println))
    will not execute until you call
        s.run()
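The point of slide 10 can be checked with a counter. The following is a minimal sketch, assuming Akka 2.6 or later (where an implicit ActorSystem is enough to supply the materializer); the object and value names are illustrative and not taken from the talk.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, RunnableGraph, Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.immutable
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object DeclarationDemo {
  // Counts how many elements have actually passed through the map stage.
  val processed = new AtomicInteger(0)

  // Only a description of the stream: nothing runs when this value is built.
  val blueprint: RunnableGraph[Future[immutable.Seq[Int]]] =
    Source(1 to 5)
      .via(Flow[Int].map { x => processed.incrementAndGet(); x + 10 })
      .toMat(Sink.seq[Int])(Keep.right)

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("declaration-demo")
    try {
      println(s"before run: ${processed.get} elements processed")
      // run() materializes the blueprint; only now do elements start flowing.
      val result = Await.result(blueprint.run(), 5.seconds)
      println(s"after run: ${processed.get} elements processed, result = $result")
    } finally system.terminate()
  }
}
```

The counter stays at 0 until `run()` is called and reaches 5 once the materialized future completes.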
  • 11. The skeleton: Get data -> serialize -> send to S3
        def run(): Future[Long] = {
          val cn = getConnection()
          val stream = (cn: Connection) =>
            dataSource.streamList(cn)            // Source[Item] - get data from the DB
              .via(serializeFlow)                // Flow[Item, Byte] - serialize
              .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
          val countFuture = stream(cn).run()
          countFuture.onComplete { r =>
            cn.close()                           // runs on success and on failure
          }
          countFuture
        }
  • 12. Serialize in the stream ● We are dealing with a single collection ● All items have the same type
        val serializeFlow = Flow[Item]
          .map(x => serializeItem(x))  // serializeItem: Item => String
          .intersperse("[", ",", "]")  // sort of mkString for streams
          .mapConcat[Byte] {           // mapConcat ≈ flatMap
            x => x.getBytes().toIndexedSeq
          }
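To see what `intersperse` produces, here is a small sketch assuming Akka 2.6 or later. `Item` and `serializeItem` are hypothetical stand-ins for the talk's types, and the flow emits one `ByteString` per element instead of single bytes, which is a variation on the slide rather than the talk's code.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeDemo {
  // Hypothetical item type and serializer, standing in for Item / serializeItem.
  final case class Item(id: Int)
  def serializeItem(item: Item): String = s"""{"id":${item.id}}"""

  // Wraps the serialized items in "[", ",", "]" so the output is a JSON array.
  val serializeFlow: Flow[Item, ByteString, NotUsed] =
    Flow[Item]
      .map(serializeItem)
      .intersperse("[", ",", "]")
      .map(s => ByteString(s)) // UTF-8 encoded chunk per element

  // Runs the flow over a small list and concatenates the chunks for inspection.
  def render(items: List[Item])(implicit system: ActorSystem): String =
    Await.result(
      Source(items)
        .via(serializeFlow)
        .runWith(Sink.fold[ByteString, ByteString](ByteString.empty)(_ ++ _)),
      5.seconds
    ).utf8String

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("serialize-demo")
    try println(render(List(Item(1), Item(2)))) // [{"id":1},{"id":2}]
    finally system.terminate()
  }
}
```

Emitting `ByteString` chunks avoids pushing every byte through the stream as a separate element; whether that matters depends on the data volume.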
  • 13. S3 multipart upload API ● Allows uploading a file in separate chunks ● Allows uploading chunks in parallel ● (!) By default, uploaded chunks have no TTL
    Simplified API:
      1. initialize(bucket, filename) => uploadId
      2. uploadChunk(uploadId, partNumber, content) => etag
      3. complete(uploadId, List[etag])
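The call order of the simplified API can be illustrated without AWS at all. The sketch below is a hypothetical in-memory stand-in, not the AWS SDK: the class name, the fake etag format and the byte-array return value of `complete` are assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the three simplified calls on the slide.
final class InMemoryMultipartStore {
  // uploadId -> (partNumber -> content) for uploads that are not completed yet
  private val pending = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]
  private var counter = 0

  def initialize(bucket: String, filename: String): String = {
    counter += 1
    val uploadId = s"$bucket/$filename#$counter"
    pending(uploadId) = mutable.Map.empty[Int, Array[Byte]]
    uploadId
  }

  def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
    pending(uploadId)(partNumber) = content
    s"etag-$partNumber-${content.length}" // fake etag, for illustration only
  }

  // Assembles the parts in part-number order and forgets the pending upload.
  def complete(uploadId: String, etags: List[String]): Array[Byte] = {
    val parts = pending.remove(uploadId).getOrElse(sys.error(s"unknown upload $uploadId"))
    require(parts.size == etags.size, "every uploaded part needs an etag")
    Array.concat(parts.toSeq.sortBy(_._1).map(_._2): _*)
  }

  // Uploads that were started but never completed keep their chunks around.
  def pendingUploads: Int = pending.size
}

object MultipartDemo {
  def run(): String = {
    val store = new InMemoryMultipartStore
    val id = store.initialize("my-bucket", "data.json")
    // Parts may arrive in any order; the part number decides the final position.
    val e2 = store.uploadChunk(id, 2, "world".getBytes("UTF-8"))
    val e1 = store.uploadChunk(id, 1, "hello ".getBytes("UTF-8"))
    new String(store.complete(id, List(e1, e2)), "UTF-8")
  }

  def main(args: Array[String]): Unit = println(run()) // hello world
}
```

`pendingUploads` mirrors the TTL warning on the slide: an upload that is initialized but never completed keeps its chunks until something removes them.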
  • 14. Resource access ● Pattern: Open - Do stuff - Close
        open: () => TState
        onEach: (TState, TItem) => (TState)
        close: TState => TResult
    ● Functional pattern: a fold over the state ○ With an additional call at the end ● Akka-streams lacks a Sink of that type ● The new sink calls open lazily, on arrival of the first element of the stream
  • 15. Let's create a new sink!
        class FoldResourceSink[TState, TItem, Mat](
          open: () => TState,
          onEach: (TState, TItem) => (TState),
          close: TState => Mat
        ) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
    Methods to write:
        def onPush(): Unit
        override def preStart(): Unit
        override def onUpstreamFinish(): Unit
        override def onUpstreamFailure(ex: Throwable): Unit
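One way the four methods listed on slide 15 could be filled in is sketched below. This is not the gfc-aws-s3 implementation, only a minimal sketch assuming Akka 2.6 or later. Two choices are assumptions of the sketch: on an empty stream it still opens the resource so that `close` can produce a result, and on upstream failure it does not call `close`.

```scala
import akka.actor.ActorSystem
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.scaladsl.Source
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

class FoldResourceSink[TState, TItem, Mat](
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  private val in: Inlet[TItem] = Inlet("FoldResourceSink.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(
      inheritedAttributes: Attributes): (GraphStageLogic, Future[Mat]) = {
    val promise = Promise[Mat]()
    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[TState] = None // None until the resource is opened

      override def preStart(): Unit = pull(in) // signal demand for the first element

      override def onPush(): Unit =
        try {
          val current = state.getOrElse(open()) // lazy open on the first element
          state = Some(onEach(current, grab(in)))
          pull(in)
        } catch {
          case NonFatal(ex) => promise.tryFailure(ex); failStage(ex)
        }

      override def onUpstreamFinish(): Unit =
        try {
          promise.trySuccess(close(state.getOrElse(open())))
          completeStage()
        } catch {
          case NonFatal(ex) => promise.tryFailure(ex); failStage(ex)
        }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        promise.tryFailure(ex) // close is deliberately not called here
        failStage(ex)
      }

      setHandler(in, this)
    }
    (logic, promise.future)
  }
}

object FoldResourceSinkDemo {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("fold-resource-demo")
    try {
      val sink = new FoldResourceSink[Int, Int, String](
        () => 0,                       // open: start with an empty total
        (total, item) => total + item, // onEach: fold the item into the state
        total => s"sum=$total"         // close: turn the final state into the result
      )
      println(Await.result(Source(1 to 4).runWith(sink), 5.seconds)) // sum=10
    } finally system.terminate()
  }
}
```

The materialized `Future` is completed from inside the stage through a `Promise`, which is why the class extends `GraphStageWithMaterializedValue` rather than plain `GraphStage`.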
  • 16. S3Sink from FoldResourceSink ● SinkA = Flow to SinkB ● Here: S3 upload sink = S3 upload flow to FoldResourceSink
  • 17. What are TState and TItem? We need to keep track of the uploadId, the etags, and the length uploaded so far:
        case class S3MultipartUploaderState(
          uploadId: String,
          etags: List[PartETag],
          totalLength: Long
        )
    And the item is:
        (ByteString, Int) // (content, chunkNumber)
  • 18. FoldResourceSink for S3
        val sink = Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
          () => initUpload(), // returns the state
          { case (state, (chunk, chunkNumber)) =>
            uploadChunk(state, chunk, chunkNumber)
          },
          completeUpload // accepts the state
        )
        Flow[Byte]
          .grouped(chunkSize)
          .map(b => ByteString(b:_*))
          .zip(
            Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
          ).toMat(sink)(Keep.right)
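The chunk-numbering step on slide 18 can be tried in isolation. The sketch below assumes Akka 2.6 or later and uses `zipWithIndex` as an alternative to zipping with an infinite iterator source; the object and method names are illustrative.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object ChunkingDemo {
  // Groups single bytes into chunks and pairs each chunk with a 1-based part number.
  def numberedChunks(chunkSize: Int): Flow[Byte, (ByteString, Int), NotUsed] =
    Flow[Byte]
      .grouped(chunkSize)
      .map(bytes => ByteString(bytes.toArray))
      .zipWithIndex // 0-based Long index
      .map { case (chunk, index) => (chunk, index.toInt + 1) }

  // Runs the flow over an in-memory string and returns (chunk text, part number) pairs.
  def chunk(data: String, chunkSize: Int)(implicit system: ActorSystem): Seq[(String, Int)] =
    Await.result(
      Source(data.getBytes("UTF-8").toList)
        .via(numberedChunks(chunkSize))
        .map { case (bytes, partNumber) => (bytes.utf8String, partNumber) }
        .runWith(Sink.seq[(String, Int)]),
      5.seconds
    )

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("chunking-demo")
    try println(chunk("abcdefg", 3)) // chunks "abc", "def", "g" numbered 1, 2, 3
    finally system.terminate()
  }
}
```

With real S3 the chunk size cannot be this small: the S3 documentation requires every part except the last to be at least 5 MiB, and part numbers run from 1 to 10,000.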
  • 19. SQL Source: Anorm provides an akka-stream SQL source
        libraryDependencies ++= Seq(
          "com.typesafe.play" %% "anorm-akka" % "version",
          "com.typesafe.akka" %% "akka-stream" % "version")
        AkkaStream.source(SQL"SELECT * FROM Test", SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
    Brings minimal transitive dependencies (!)
  • 20. Road to production ● Retries in case of S3 errors/failures ○ The S3 client handles this ● Handling possible problems during stream execution (e.g. a failure talking to the DB) ○ When the stream fails, complete is never called
  • 21. Could we do it the other way round? ● S3 tends to time out and drop the connection on slow downloads of large files ● Ability to process data in a streaming manner
  • 22. S3 protocol for partial downloads ● By parts (see multipart upload) ○ Uses part numbers ○ Doesn’t work when the upload wasn’t multipart ○ Amazon says it’s faster ● By chunks ○ A chunk is defined by (from, to) byte numbers ○ Works for any file and any chunk length ○ Amazon says it’s slow
  • 23. Basic idea
    By parts: 1. Get the part count 2. For each part, create an akka source 3. Combine the individual streams into one
    By chunks: 1. Get the file length 2. For each chunk of the file, create an akka source 3. Combine the individual streams into one
    Create an akka source from an IO stream:
        val stream: InputStream = …
        StreamConverters.fromInputStream(() => stream)
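For the by-chunks variant, the (from, to) pairs can be computed up front from the file length. The helper below is a plain-Scala sketch with illustrative names; it assumes inclusive byte offsets, as used by the HTTP `Range` header.

```scala
object ChunkRanges {
  // Splits [0, length) into inclusive (from, to) byte ranges of at most chunkSize bytes.
  def ranges(length: Long, chunkSize: Long): List[(Long, Long)] = {
    require(length >= 0, "length must not be negative")
    require(chunkSize > 0, "chunkSize must be positive")
    (0L until length by chunkSize)
      .map(from => (from, math.min(from + chunkSize, length) - 1))
      .toList
  }

  // Formats one range as an HTTP Range header value.
  def rangeHeader(range: (Long, Long)): String = s"bytes=${range._1}-${range._2}"

  def main(args: Array[String]): Unit =
    ranges(10, 4).map(rangeHeader).foreach(println)
    // bytes=0-3
    // bytes=4-7
    // bytes=8-9
}
```

Each range would then become one source, and the sources would be concatenated in order, as in step 3 of the slide.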
  • 24. Downloading by parts
        Source.single(
          getPartCount(s3Client, bucketName, key) // evaluated eagerly, when the source is declared
        ).flatMapConcat { partCount =>
          Source(Range(firstPartIndex, partCount + firstPartIndex))
        }.flatMapConcat { partNumber =>
          StreamConverters.fromInputStream(
            () => getS3ObjectContent(partNumber, readMemoryBufferSize)
          )
        } // Type - Source[ByteString, NotUsed]
  • 25. Downloading by parts
        Source.single(())
          .map(_ => getPartCount(s3Client, bucketName, key)) // deferred until the stream runs
          .flatMapConcat { partCount =>
            Source(Range(firstPartIndex, partCount + firstPartIndex))
          }.flatMapConcat { partNumber =>
            StreamConverters.fromInputStream(
              () => getS3ObjectContent(partNumber, readMemoryBufferSize)
            )
          } // Type - Source[ByteString, NotUsed]
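The two "Downloading by parts" slides differ only in when `getPartCount` is called. The sketch below, assuming Akka 2.6 or later, makes the difference visible with a call counter; `getPartCountStub` is a hypothetical stand-in for the real S3 call.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Await
import scala.concurrent.duration._

object LazyVsEagerDemo {
  val calls = new AtomicInteger(0)

  // Hypothetical stand-in for getPartCount(s3Client, bucketName, key).
  def getPartCountStub(): Int = { calls.incrementAndGet(); 3 }

  // Returns the call counts observed after each step.
  def run()(implicit system: ActorSystem): List[Int] = {
    calls.set(0)
    val eager = Source.single(getPartCountStub()) // the stub is called here, at declaration
    val afterEagerDeclared = calls.get

    val deferred = Source.single(()).map(_ => getPartCountStub()) // nothing is called yet
    val afterDeferredDeclared = calls.get

    Await.result(deferred.runWith(Sink.head), 5.seconds) // first run calls the stub
    val afterFirstRun = calls.get
    Await.result(deferred.runWith(Sink.head), 5.seconds) // every run calls it again
    val afterSecondRun = calls.get

    Await.result(eager.runWith(Sink.head), 5.seconds) // reuses the value captured earlier
    List(afterEagerDeclared, afterDeferredDeclared, afterFirstRun, afterSecondRun, calls.get)
  }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("lazy-vs-eager")
    try println(run()) // List(1, 1, 2, 3, 3)
    finally system.terminate()
  }
}
```

The deferred form keeps the S3 call inside the stream, so a reused blueprint asks for a fresh part count on every run and a failing call surfaces as a stream failure rather than an exception at declaration time.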
  • 26. gfc-aws-s3 https://github.com/gilt/gfc-aws-s3 ● Open-source project containing the code above (Sources and Sink) ● Also s3-http as an educational example ● Codez!