3. Initial problem
● Several big (hundreds of MB) database result sets
● All data cached in memory
● Served as JSON files
● The service was constantly OOM-ing, even on a 32 GB instance
4. Akka-streams
● Library from the Akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics, without exposing
the actors themselves
5. What is a “stream”?
● Sequence of objects
● Has an input
● Has an output
● Defined as a sequence of data transformations
● Could be infinite
● Steps could be executed independently
6. Stream input - Source
● The input of the data in the stream
● Has the output channel to feed data into the stream
Example: SQL source
7. Stream output - Sink
● The final point of the data in the stream
● Has the input channel to receive the data from the stream
Example: S3 object
8. Processing - Flow
● The transformation procedure of the stream
● Takes data from the input, applies some computation to it,
and passes the resulting data to the output
Example: serialization
9. Basic stream operations
● via
Source via Flow => Source
Flow via Flow => Flow
● to
Flow to Sink => Sink
Source to Sink => Stream
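The four combinations above can be spelled out with explicit types. This is a minimal sketch assuming Akka 2.6; the names (`numbers`, `addTen`, `render`, `printer`) are made up for illustration. What the slide calls "Stream" has the type `RunnableGraph` in Akka.

```scala
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Flow, RunnableGraph, Sink, Source}
import scala.concurrent.Future

object Composition {
  // Building blocks (illustrative names, not from the talk)
  val numbers: Source[Int, NotUsed] = Source(1 to 3)
  val addTen: Flow[Int, Int, NotUsed] = Flow[Int].map(_ + 10)
  val render: Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)
  val printer: Sink[String, Future[Done]] = Sink.foreach[String](println)

  // Source via Flow => Source
  val shifted: Source[Int, NotUsed] = numbers.via(addTen)
  // Flow via Flow => Flow
  val pipeline: Flow[Int, String, NotUsed] = addTen.via(render)
  // Flow to Sink => Sink
  val renderAndPrint: Sink[Int, NotUsed] = render.to(printer)
  // Source to Sink => a complete, runnable stream description
  val graph: RunnableGraph[NotUsed] = shifted.to(renderAndPrint)
}
```

Nothing here runs yet; every value is only a description of a stream.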
10. Declaration is not execution!
Stream description is just a declaration, so:
val s = Source[Int](Range(1, 100).toList)
  .via(
    Flow[Int].map(x => x + 10)
  ).to(
    Sink.foreach(println)
  )
will not execute until you call
s.run()
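To see the difference between declaring and running, the sketch below counts how many elements the `map` step has processed before and after `run()`. It assumes Akka 2.6, where an implicit `ActorSystem` is enough to materialize a stream; the object name and the counter are illustrative only.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Await
import scala.concurrent.duration._

object DeclarationVsExecution {
  /** Returns (elements processed before run(), sum produced by the stream). */
  def demo(): (Int, Int) = {
    implicit val system: ActorSystem = ActorSystem("declaration-demo")
    try {
      val seen = new AtomicInteger(0)
      // Declaration only: no element flows through map yet
      val graph = Source(1 to 5)
        .via(Flow[Int].map { x => seen.incrementAndGet(); x + 10 })
        .toMat(Sink.fold[Int, Int](0)(_ + _))(Keep.right)
      val before = seen.get()
      // Execution starts here
      val sum = Await.result(graph.run(), 5.seconds)
      (before, sum)
    } finally system.terminate()
  }
}
```

Calling `DeclarationVsExecution.demo()` should report zero processed elements before `run()` and the sum of 11 to 15 afterwards.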
11. The skeleton
Get data -> serialize -> send to S3
def run(): Future[Long] = {
  val cn = getConnection()
  val stream = (cn: Connection) =>
    dataSource.streamList(cn)             // Source[Item] - get data from the DB
      .via(serializeFlow)                 // Flow[Item, Byte] - serialize
      .toMat(s3UploaderSink)(Keep.right)  // Sink[Byte] - upload to S3
  val countFuture = stream(cn).run()
  countFuture.onComplete { _ =>
    cn.close()
  }
  countFuture
}
12. Serialize in the stream
● We are dealing with the single collection
● Type of the items is the same
val serializeFlow = Flow[Item]
  .map(x => serializeItem(x))   // serializeItem: Item => String
  .intersperse("[", ",", "]")   // sort of mkString for the streams
  .mapConcat[Byte] {            // mapConcat ≈ flatMap
    x => x.getBytes("UTF-8").toIndexedSeq  // explicit charset instead of the platform default
  }
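A runnable version of this flow on a tiny input shows what `intersperse` produces. `Item` and `serializeItem` below are hypothetical stand-ins (the talk does not show them), and the sketch assumes Akka 2.6.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import java.nio.charset.StandardCharsets
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeDemo {
  // Hypothetical item type and serializer, for illustration only
  final case class Item(id: Int)
  def serializeItem(x: Item): String = s"""{"id":${x.id}}"""

  val serializeFlow = Flow[Item]
    .map(serializeItem)
    .intersperse("[", ",", "]") // opening bracket, separators, closing bracket
    .mapConcat(s => s.getBytes(StandardCharsets.UTF_8).toIndexedSeq)

  /** Runs the flow over the items and decodes the resulting bytes. */
  def render(items: List[Item]): String = {
    implicit val system: ActorSystem = ActorSystem("serialize-demo")
    try {
      val bytes = Await.result(
        Source(items).via(serializeFlow).runWith(Sink.seq), 5.seconds)
      new String(bytes.toArray, StandardCharsets.UTF_8)
    } finally system.terminate()
  }
}
```

For two items with ids 1 and 2, `render` should return a JSON array with both objects separated by a comma.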
13. S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● (!) By default, uploaded chunks have no TTL
Simplified API:
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => etag
3. complete(uploadId, List[etag])
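The call sequence of the simplified API can be tried against an in-memory fake. `FakeS3` below is a hypothetical stand-in written for this illustration; it is not the AWS SDK and only mirrors the three calls listed above.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the simplified multipart API
final class FakeS3 {
  private var counter = 0
  private val pending =
    mutable.Map.empty[String, (String, mutable.Map[Int, Array[Byte]])]
  val objects: mutable.Map[String, Array[Byte]] = mutable.Map.empty

  // 1. initialize(bucket, filename) => uploadId
  def initialize(bucket: String, filename: String): String = {
    counter += 1
    val uploadId = s"upload-$counter"
    pending(uploadId) = (s"$bucket/$filename", mutable.Map.empty[Int, Array[Byte]])
    uploadId
  }

  // 2. uploadChunk(uploadId, partNumber, content) => etag
  def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
    pending(uploadId)._2.update(partNumber, content)
    s"etag-$partNumber"
  }

  // 3. complete(uploadId, List[etag]): assembles parts in part-number order
  def complete(uploadId: String, etags: List[String]): Unit = {
    val (key, parts) = pending.remove(uploadId).get
    require(etags.size == parts.size, "every uploaded part needs an etag")
    objects(key) = parts.toSeq.sortBy(_._1).flatMap(_._2.toSeq).toArray
  }

  // Uploads that were started but never completed stay here
  def pendingUploads: Int = pending.size
}
```

Parts may arrive in any order, since the part number decides the final position. An upload that is never completed stays in `pending`, which is the problem the "no TTL" bullet points at.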
14. Resource access
● Pattern: Open - Do stuff - Close
open: () => TState
onEach: (TState, TItem) => (TState)
close: TState => TResult
● Functional pattern - fold over the state
○ With an additional call in the end
● Akka Streams lacks a Sink of that type
● open is called lazily, on arrival of the first element of the stream
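The open / onEach / close pattern can be sketched without Akka, over a plain `Iterator`. This is an illustration of the idea only; calling `open` on an empty input so that `close` still has a state to work with is an assumption of this sketch, not something the slides specify.

```scala
object FoldResource {
  /** Folds items over a lazily opened state, then closes it. */
  def foldResource[TState, TItem, TResult](
      open: () => TState,
      onEach: (TState, TItem) => TState,
      close: TState => TResult
  )(items: Iterator[TItem]): TResult = {
    var state: Option[TState] = None
    items.foreach { item =>
      // open() runs only once, when the first item has arrived
      state = Some(onEach(state.getOrElse(open()), item))
    }
    // Assumption: an empty input still opens, so close has a state
    close(state.getOrElse(open()))
  }
}
```

With an iterator that logs each element as it is produced, the log should show the first item before `open`, which is the lazy behaviour described above.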
15. Let's create a new sink!
class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => (TState),
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
Methods to write:
def onPush(): Unit
override def preStart(): Unit
override def onUpstreamFinish(): Unit
override def onUpstreamFailure(ex: Throwable): Unit
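One possible way to fill in these four methods is sketched below. It is not the author's implementation: the elided body is unknown, so the promise handling, the error handling and the choice to call `open` on an empty stream are all assumptions. The Akka classes used (`GraphStageWithMaterializedValue`, `GraphStageLogic`, `InHandler`) are the standard custom-stage API.

```scala
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

class FoldResourceSink[TState, TItem, Mat](
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  private val in: Inlet[TItem] = Inlet[TItem]("FoldResourceSink.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(
      inheritedAttributes: Attributes): (GraphStageLogic, Future[Mat]) = {
    val promise = Promise[Mat]()
    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[TState] = None

      // Ask upstream for the first element
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit =
        try {
          val current = state.getOrElse(open()) // lazy open on first element
          state = Some(onEach(current, grab(in)))
          pull(in)
        } catch {
          case NonFatal(ex) =>
            promise.tryFailure(ex)
            failStage(ex)
        }

      override def onUpstreamFinish(): Unit = {
        // Assumption: an empty stream still opens so close gets a state
        try promise.trySuccess(close(state.getOrElse(open())))
        catch { case NonFatal(ex) => promise.tryFailure(ex) }
        completeStage()
      }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        promise.tryFailure(ex) // close is not called on failure
        failStage(ex)
      }

      setHandler(in, this)
    }
    (logic, promise.future)
  }
}
```

The stage can be wrapped with `Sink.fromGraph(new FoldResourceSink(...))` and used like any other sink; its materialized value is the `Future` completed by `close`.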
17. What is TState and TItem?
We need to keep track of: uploadId, etags and the total length uploaded so far
case class S3MultipartUploaderState(
  uploadId: String,
  etags: List[PartETag],
  totalLength: Long
)
And item is:
(ByteString, Int) // (content, chunkNumber)
18. FoldResourceSink for S3
val sink =
  Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
    () => initUpload(), // Returns state
    { case (state, (chunk, chunkNumber)) =>
      uploadChunk(state, chunk, chunkNumber) },
    completeUpload // Accepts state
  )
Flow[Byte]
  .grouped(chunkSize)
  .map(b => ByteString(b: _*))
  .zip(
    Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
  ).toMat(sink)(Keep.right)
20. Road to production
● Retries in case of S3 errors/failures
○ S3 client handles this
● Handle possible problems during stream execution (e.g. a
failure talking to the DB)
○ When the stream fails, complete is never called
21. Could we do it the other way round?
● S3 tends to time out and drop the connection on slow downloads of large files
● Ability to process data in a streaming manner
22. S3 protocol for partial downloads
● By parts (see multipart upload)
○ Uses part numbers
○ Doesn’t work when the upload wasn’t multipart
○ Amazon says it’s faster
● By chunks
○ Chunk is defined by (from, to) byte numbers
○ Works for any file, and any chunk length
○ Amazon says it’s slow
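For the "by chunks" variant, the (from, to) pairs can be computed from the file length alone. The helper below is a sketch with made-up names; it uses inclusive byte offsets, which is how the HTTP `Range` header counts.

```scala
object ByteRanges {
  /** Inclusive (from, to) byte offsets that cover a file of the given length. */
  def ranges(length: Long, chunkSize: Long): List[(Long, Long)] = {
    require(chunkSize > 0, "chunkSize must be positive")
    (0L until length by chunkSize)
      .map(from => (from, math.min(from + chunkSize, length) - 1))
      .toList
  }

  /** Formats a range as an HTTP Range header value. */
  def rangeHeader(range: (Long, Long)): String =
    s"bytes=${range._1}-${range._2}"
}
```

The last chunk is simply shorter when the length is not a multiple of the chunk size, and an empty file yields no ranges.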
23. Basic idea
By parts:
1. Get part count
2. For each part, create an Akka source
3. Combine the individual streams into one
By chunks:
1. Get file length
2. For each chunk in the file, create an Akka source
3. Combine the individual streams into one
Create an Akka source from an IO stream:
val stream: InputStream = …
StreamConverters.fromInputStream(() => stream)
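The three steps for the chunked variant can be sketched end to end with an in-memory byte array standing in for the S3 object. `openRange` is a hypothetical replacement for a ranged S3 GET; `flatMapConcat` is one way to combine the per-chunk sources, reading them one after another so the bytes stay in order. Assumes Akka 2.6.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source, StreamConverters}
import akka.util.ByteString
import java.io.{ByteArrayInputStream, InputStream}
import scala.concurrent.Future

object ChunkedDownload {
  // Hypothetical stand-in for a ranged S3 GET (inclusive offsets)
  def openRange(data: Array[Byte], from: Int, to: Int): InputStream =
    new ByteArrayInputStream(data, from, to - from + 1)

  def download(data: Array[Byte], chunkSize: Int)(
      implicit system: ActorSystem): Future[ByteString] = {
    // 1. Get file length and split it into (from, to) chunks
    val ranges = (0 until data.length by chunkSize)
      .map(from => (from, math.min(from + chunkSize, data.length) - 1))
    Source(ranges)
      // 2. For each chunk, create a source from its input stream
      // 3. flatMapConcat joins the per-chunk sources in order
      .flatMapConcat { case (from, to) =>
        StreamConverters.fromInputStream(() => openRange(data, from, to))
      }
      .runWith(Sink.fold[ByteString, ByteString](ByteString.empty)(_ ++ _))
  }
}
```

Collecting everything into one `ByteString` is only for the demonstration; in a real pipeline the combined source would feed further processing instead of being buffered in memory.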