SlideShare a Scribd company logo
1 of 54
Download to read offline
Scala-like Distributed Collections:
Dumping Time-Series Data With
Apache Spark
Demi Ben-Ari - CTO @ Panorays
About Me
Demi Ben-Ari, Co-Founder & CTO @ Panorays
●  BS’c Computer Science – Academic College Tel-Aviv Yaffo
●  Co-Founder “Big Things” Big Data Community
In the Past:
●  Sr. Data Engineer - Windward
●  Team Leader & Sr. Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
Agenda
●  Scala and Spark analogies
●  Data flow and Environment
●  What’s our time series data like?
●  Where we started from - where we got to
○  Problems and our decisions
●  Conclusions
Scala and Spark analogies
Scala is...
●  Functional
●  Object Oriented
●  Statically typed
●  Interoperates well with Java and Javascript
○  JVM based
DSLs on top of Scala
SBT
Spiral
Scalaz
Slick
Dispatch
Chisel
Specs
Opti{X}
shapeless
ScalaTest
Squeryl
Scala & Spark (Architecture)
Scala REPL Scala Compiler
Spark Runtime
Scala Runtime
JVM
File System
(eg. HDFS,
Cassandra, S3..)
Cluster Manager
(eg. Yarn, Mesos)
What kind of DSL is Apache Spark
●  Centered around Collections
●  Immutable data sets equipped with functional transformations
●  These are exactly the Scala collection operations
map
flatMap
filter
...
reduce
fold
aggregate
...
union
intersection
...
Spark vs. Scala Collections
●  So, Spark is exactly Scala Collections, but running in a Cluster?
●  Not quite. There are Two main differences:
○  Spark is Lazy, Scala collections are strict
○  Spark has added functionality, eg. PairRDDs.
■  Gives us the power doing lots of operations in the NoSQL distributed
world
Collections Design Choices
Imperative Functional
Strict Lazy
VS
VS
java.util scala.collections.immutable
Scala
OCaml
Spark
C#
Scala Streams, views
Spark is A Multi-Language Platform
●  Why to use Scala instead of
Python?
○  Native to Spark, Can use
everything without
translation
○  Types help
So Bottom Line…
What’s Spark???
United Tools Platform - Single Framework
Batch
InteractiveStreaming
Single Framework
United Tools Platform
Spark Standalone Cluster - Architecture
●  Master
●  History
Server
●  etc
Master
Core 3
Core 4
Core 2
Worker Memory
Core 1Slave
Slave
Slave
Slave
Core 3
Core 4
Core 2
Worker Memory
Core 1
Core 3
Core 4
Core 2
Worker Memory
Core 1
Slave
Core 3
Core 4
Core 2
Worker Memory
Core 1
Core 3
Core 4
Core 2
Worker Memory
Core 1
Slave
Slave
Slave
Data flow and Environment
(Our Use Case)
Structure of the Data
●  Geo Locations + Metadata
●  Arriving over time
●  Different types of messages being reported by sattelites
●  Encoded
●  Might arrive later than acttually transmitted
Data Flow Diagram
Externa
l Data
Source
Analytics
Layers
Data Pipeline
Parsed
Raw
Entity Resolution
Process
Building insights
on top of the entities
Data
Output
Layer
Anomaly
Detection
Trends
Environment Description
Cluster
Dev Testing
Live
Staging
ProductionEnv
OB1K
RESTful Java
Services
Basic Terms
●  Idempotence
is the property of certain operations in mathematics and computer
science, that can be applied multiple times without changing the
result beyond the initial application.
●  Function: Same input => Same output
Basic Terms
●  Missing Parts in Time Series Data
◦  Data arriving from the satellites
⚫  Might be causing delays because of bad transmission
◦  Data vendors delaying the data stream
◦  Calculation in Layers may cause Holes in the Data
●  Calculating the Data layers by time slices
Basic Terms
●  Partitions == Parallelizm
◦  Physical / Logical partitioning
●  Resilient Distributed Datasets (RDDs) == Collections
◦  fault-tolerant collection of elements that can be operated on in
parallel.
◦  Applying immutable transformations and actions over RDDs
So what’s the problem?
The Problem - Receiving
DATA
Beginning state, no data, and the timeline
begins
T = 0
Level 3
Entity
Level 2
Entity
Level 1
Entity
The Problem - Receiving
DATA
T = 10
Level 3
Entity
Level 2
Entity
Level 1
Entity
Computation sliding window size
Level 1 entities data
arrives and gets stored
The Problem - Receiving
DATA
T = 10
Level 3
Entity
Level 2
Entity
Level 1
Entity
Computation sliding window size
Level 3 entities are created
on top of Level 2’s Data
(Decreased amount of data)
Level 2 entities are
created on top of Level 1’s
Data
(Decreased amount of
data)
The Problem - Receiving
DATA
T = 20
Level 3
Entity
Level 2
Entity
Level 1
Entity
Computation sliding window size
Because of the sliding window’s
back size, level 2 and 3 entities
would not be created properly
and there would be “Holes” in the
Data
Level 1 entity's
data arriving late
Solution to the Problem
●  Creating Dependent Micro services forming a data pipeline
◦  Mainly Apache Spark applications
◦  Services are only dependent on the Data - not the previous
service’s run
●  Forming a structure and scheduling of “Back Sliding Window”
◦  Know your data and it’s relevance through time
◦  Don’t try to foresee the future – it might Bias the results
Starting point & Infrastructure
How we started?
●  Spark Standalone – via ec2 scripts
◦  Around 5 nodes (r3.xlarge instances)
◦  Didn’t want to keep a persistent HDFS – Costs a lot
◦  100 GB (per day) => ~150 TB for 4 years
◦  Cost for server per year (r3.xlarge):
●  On demand: ~2900$
●  Reserved: ~1750$
●  Know your costs: http://www.ec2instances.info/
Decision
●  Working with S3 as the persistence layer
◦  Pay extra for
●  Put (0.005 per 1000 requests)
●  Get (0.004 per 10,000 requests)
◦  150TB => ~210$ for 4 years of Data
●  Same format as HDFS (CSV files)
◦  s3n://some-bucket/entity1/201412010000/part-00000
◦  s3n://some-bucket/entity1/201412010000/part-00001
◦  ……
What about the serving?
MongoDB for Serving
Worker 1
Worker 2
….
….
…
…
Worker N
MongoDB
Replica
Set
Spark
Cluster
Master
Write
Read
Spark Slave - Server Specs
●  Instance Type: r3.xlarge
●  CPU’s: 4
●  RAM: 30.5GB
●  Storage: ephemeral
●  Amount: 10+
MongoDB - Server Specs
●  MongoDB version: 2.6.1
●  Instance Type: m3.xlarge (AWS)
●  CPU’s: 4
●  RAM: 15GB
●  Storage: EBS
●  DB Size: ~500GB
●  Collection Indexes: 5 (4 compound)
The Problem
●  Batch jobs
◦  Should run for 5-10 minutes in total
◦  Actual - runs for ~40 minutes
●  Why?
◦  ~20 minutes to write with the Java mongo driver – Async
(Unacknowledged)
◦  ~20 minutes to sync the journal
◦  Total: ~ 40 Minutes of the DB being unavailable
◦  No batch process response and no UI serving
Alternative Solutions
●  Sharded MongoDB (With replica sets)
◦  Pros:
●  Increases Throughput by the amount of shards
●  Increases the availability of the DB
◦  Cons:
●  Very hard to manage DevOps wise (for a small team of
developers)
●  High cost of servers – because each shared need 3 replicas
Workflow with MongoDB
Worker 1
Worker 2
….
….
…
…
Worker N
Spark
Cluster
Master
Write
Read
Master
Our DevOps – After that solution
We had no
DevOps guy at
that time at all
☹
Alternative Solutions
●  Apache Cassandra
◦  Pros:
●  Very large developer community
●  Linearly scalable Database
●  No single master architecture
●  Proven working with distributed engines like Apache Spark
◦  Cons:
●  We had no experience at all with the Database
●  No Geo Spatial Index – Needed to implement by ourselves
The Solution
●  Migration to Apache Cassandra
●  Create easily a Cassandra cluster using DataStax Community AMI
on AWS
◦  First easy step – Using the spark-cassandra-connector
(Easy bootstrap move to Spark ⬄ Cassandra)
◦  Creating a monitoring dashboard to Cassandra
Workflow with Cassandra
Worker 1
Worker 2
….
….
…
…
Worker N
Cassandr
a
Cluster
Spark
Cluster
Write
Read
Result
●  Performance improvement
◦  Batch write parts of the job run in 3 minutes instead of ~ 40
minutes in MongoDB
●  Took 2 weeks to go from “Zero to Hero”, and to ramp up a running
solution that work without glitches
So what’s the problem
(Again)?
Transferring the Heaviest Process
●  Micro service that runs every 10 minutes
●  Writes to Cassandra 30GB per iteration
◦  (Replication factor 3 => 90GB)
●  At first took us 18 minutes to do all of the writes
◦  Not Acceptable in a 10 minute process
Cluster On OpsCenter - Before
Transferring the Heaviest Process
●  Solutions
◦  We chose the i2.xlarge
◦  Optimization of the Cluster
◦  Changing the JDK to Java-8
●  Changing the GC algorithm to G1
◦  Tuning the Operation system
●  Ulimit, removing the swap
◦  Write time went down to ~5 minutes (For 30GB RF=3)
Sounds good right? I don’t think so
Cloud Watch After Tuning
The Solution
●  Taking the same Data Model that we held in Cassandra (All of the
Raw data per 10 minutes) and put it on S3
◦  Write time went down from ~5 minutes to 1.5 minutes
●  Added another process, not dependent on the main one, happens
every 15 minutes
◦  Reads from S3, downscales the amount and Writes them to
Cassandra for serving
How it looks after all?
Parsed
Raw
Static /
Aggregated
Data
Spark Analytics Layers
UI Serving
Downscale
d Data
Heavy
Fusion
Process
Conclusion
●  Always give an estimate to your data
◦  Frequency
◦  Volume
◦  Arrangement of the previous phase
●  There is no “Best” persistence layer
◦  There is the right one for the job
◦  Don’t overload an existing solution
Conclusion
●  Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
● “With great power,
comes lots of partitioning”
◦ Control your work and
data distribution via partitions
●  https://www.pinterest.com/pin/155514993354583499/ (Thanks)
Questions?
Thanks! my contact:
—Demi Ben-Ari
●  LinkedIn
●  Twitter: @demibenari
●  Blog: http://progexc.blogspot.com/
●  Email: demi.benari@gmail.com
●  “Big Things” Community
–Meetup, YouTube, Facebook, Twitter

More Related Content

What's hot

Akka-demy (a.k.a. How to build stateful distributed systems) I/II
 Akka-demy (a.k.a. How to build stateful distributed systems) I/II Akka-demy (a.k.a. How to build stateful distributed systems) I/II
Akka-demy (a.k.a. How to build stateful distributed systems) I/IIPeter Csala
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nycPetr Zapletal
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talkDanny Yuan
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingVasia Kalavri
 
Taskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerTaskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerRaghavendra Prabhu
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015neerajrj
 
Cassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in ProductionCassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in ProductionDataStax Academy
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionDataStax Academy
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and FutureKartik Paramasivam
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in financePeter Lawrey
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itnathanmarz
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDavid van Geest
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintScyllaDB
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
 

What's hot (20)

Akka-demy (a.k.a. How to build stateful distributed systems) I/II
 Akka-demy (a.k.a. How to build stateful distributed systems) I/II Akka-demy (a.k.a. How to build stateful distributed systems) I/II
Akka-demy (a.k.a. How to build stateful distributed systems) I/II
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nyc
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talk
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processing
 
Taskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerTaskerman - a distributed cluster task manager
Taskerman - a distributed cluster task manager
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015
 
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje CrnjakJavantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
 
Cassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in ProductionCassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in Production
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and Future
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in finance
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and Cassandra
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter Footprint
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 

Similar to Scala like distributed collections - dumping time-series data with apache spark

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkDemi Ben-Ari
 
Migrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraMigrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraDemi Ben-Ari
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersDatabricks
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudHostedbyConfluent
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodbPGConf APAC
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Debugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarDebugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarShubham Tagra
 

Similar to Scala like distributed collections - dumping time-series data with apache spark (20)

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using spark
 
Migrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraMigrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to Cassandra
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Netty training
Netty trainingNetty training
Netty training
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Netty training
Netty trainingNetty training
Netty training
 
Debugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarDebugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan Kumar
 

More from Demi Ben-Ari

Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
CTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysCTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysDemi Ben-Ari
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Demi Ben-Ari
 
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Demi Ben-Ari
 
CTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysCTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysDemi Ben-Ari
 
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysAll I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriCommunity, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriDemi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Know the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniKnow the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniDemi Ben-Ari
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Know the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniKnow the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniDemi Ben-Ari
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Bootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriBootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriDemi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniDemi Ben-Ari
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 

More from Demi Ben-Ari (20)

Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
CTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysCTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at Panorays
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
 
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
 
CTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysCTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- Panorays
 
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysAll I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
 
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriCommunity, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
Know the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniKnow the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek Alumni
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Know the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniKnow the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek Alumni
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
Bootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriBootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 

Recently uploaded

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile EnvironmentVictorSzoltysek
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 

Recently uploaded (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Scala like distributed collections - dumping time-series data with apache spark

  • 1. Scala-like Distributed Collections: Dumping Time-Series Data With Apache Spark Demi Ben-Ari - CTO @ Panorays
  • 2. About Me Demi Ben-Ari, Co-Founder & CTO @ Panorays ●  BS’c Computer Science – Academic College Tel-Aviv Yaffo ●  Co-Founder “Big Things” Big Data Community In the Past: ●  Sr. Data Engineer - Windward ●  Team Leader & Sr. Java Software Engineer, Missile defense and Alert System - “Ofek” – IAF Interested in almost every kind of technology – A True Geek
  • 3. Agenda ●  Scala and Spark analogies ●  Data flow and Environment ●  What’s our time series data like? ●  Where we started from - where we got to ○  Problems and our decisions ●  Conclusions
  • 4. Scala and Spark analogies
  • 5. Scala is... ●  Functional ●  Object Oriented ●  Statically typed ●  Interoperates well with Java and Javascript ○  JVM based
  • 6. DSLs on top of Scala SBT Spiral Scalaz Slick Dispatch Chisel Specs Opti{X} shapeless ScalaTest Squeryl
  • 7. Scala & Spark (Architecture) Scala REPL Scala Compiler Spark Runtime Scala Runtime JVM File System (eg. HDFS, Cassandra, S3..) Cluster Manager (eg. Yarn, Mesos)
  • 8. What kind of DSL is Apache Spark ●  Centered around Collections ●  Immutable data sets equipped with functional transformations ●  These are exactly the Scala collection operations map flatMap filter ... reduce fold aggregate ... union intersection ...
  • 9. Spark vs. Scala Collections ●  So, Spark is exactly Scala Collections, but running in a Cluster? ●  Not quite. There are Two main differences: ○  Spark is Lazy, Scala collections are strict ○  Spark has added functionality, eg. PairRDDs. ■  Gives us the power doing lots of operations in the NoSQL distributed world
  • 10. Collections Design Choices Imperative Functional Strict Lazy VS VS java.util scala.collections.immutable Scala OCaml Spark C# Scala Streams, views
  • 11. Spark is A Multi-Language Platform ●  Why to use Scala instead of Python? ○  Native to Spark, Can use everything without translation ○  Types help
  • 13. United Tools Platform - Single Framework Batch InteractiveStreaming Single Framework
  • 15. Spark Standalone Cluster - Architecture ●  Master ●  History Server ●  etc Master Core 3 Core 4 Core 2 Worker Memory Core 1Slave Slave Slave Slave Core 3 Core 4 Core 2 Worker Memory Core 1 Core 3 Core 4 Core 2 Worker Memory Core 1 Slave Core 3 Core 4 Core 2 Worker Memory Core 1 Core 3 Core 4 Core 2 Worker Memory Core 1 Slave Slave Slave
  • 16. Data flow and Environment (Our Use Case)
  • 17. Structure of the Data ●  Geo Locations + Metadata ●  Arriving over time ●  Different types of messages being reported by sattelites ●  Encoded ●  Might arrive later than acttually transmitted
  • 18. Data Flow Diagram Externa l Data Source Analytics Layers Data Pipeline Parsed Raw Entity Resolution Process Building insights on top of the entities Data Output Layer Anomaly Detection Trends
  • 20. Basic Terms ●  Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application. ●  Function: Same input => Same output
  • 21. Basic Terms ●  Missing Parts in Time Series Data ◦  Data arriving from the satellites ⚫  Might be causing delays because of bad transmission ◦  Data vendors delaying the data stream ◦  Calculation in Layers may cause Holes in the Data ●  Calculating the Data layers by time slices
  • 22. Basic Terms ●  Partitions == Parallelizm ◦  Physical / Logical partitioning ●  Resilient Distributed Datasets (RDDs) == Collections ◦  fault-tolerant collection of elements that can be operated on in parallel. ◦  Applying immutable transformations and actions over RDDs
  • 23. So what’s the problem?
  • 24. The Problem - Receiving DATA Beginning state, no data, and the timeline begins T = 0 Level 3 Entity Level 2 Entity Level 1 Entity
  • 25. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 1 entities data arrives and gets stored
  • 26. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 3 entities are created on top of Level 2’s Data (Decreased amount of data) Level 2 entities are created on top of Level 1’s Data (Decreased amount of data)
  • 27. The Problem - Receiving DATA T = 20 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Because of the sliding window’s back size, level 2 and 3 entities would not be created properly and there would be “Holes” in the Data Level 1 entity's data arriving late
  • 28. Solution to the Problem ●  Creating Dependent Micro services forming a data pipeline ◦  Mainly Apache Spark applications ◦  Services are only dependent on the Data - not the previous service’s run ●  Forming a structure and scheduling of “Back Sliding Window” ◦  Know your data and it’s relevance through time ◦  Don’t try to foresee the future – it might Bias the results
  • 29. Starting point & Infrastructure
  • 30. How we started? ●  Spark Standalone – via ec2 scripts ◦  Around 5 nodes (r3.xlarge instances) ◦  Didn’t want to keep a persistent HDFS – Costs a lot ◦  100 GB (per day) => ~150 TB for 4 years ◦  Cost for server per year (r3.xlarge): ●  On demand: ~2900$ ●  Reserved: ~1750$ ●  Know your costs: http://www.ec2instances.info/
  • 31. Decision ●  Working with S3 as the persistence layer ◦  Pay extra for ●  Put (0.005 per 1000 requests) ●  Get (0.004 per 10,000 requests) ◦  150TB => ~210$ for 4 years of Data ●  Same format as HDFS (CSV files) ◦  s3n://some-bucket/entity1/201412010000/part-00000 ◦  s3n://some-bucket/entity1/201412010000/part-00001 ◦  ……
  • 32. What about the serving?
  • 33. MongoDB for Serving Worker 1 Worker 2 …. …. … … Worker N MongoDB Replica Set Spark Cluster Master Write Read
  • 34. Spark Slave - Server Specs ●  Instance Type: r3.xlarge ●  CPU’s: 4 ●  RAM: 30.5GB ●  Storage: ephemeral ●  Amount: 10+
  • 35. MongoDB - Server Specs ●  MongoDB version: 2.6.1 ●  Instance Type: m3.xlarge (AWS) ●  CPU’s: 4 ●  RAM: 15GB ●  Storage: EBS ●  DB Size: ~500GB ●  Collection Indexes: 5 (4 compound)
  • 36. The Problem ●  Batch jobs ◦  Should run for 5-10 minutes in total ◦  Actual - runs for ~40 minutes ●  Why? ◦  ~20 minutes to write with the Java mongo driver – Async (Unacknowledged) ◦  ~20 minutes to sync the journal ◦  Total: ~ 40 Minutes of the DB being unavailable ◦  No batch process response and no UI serving
  • 37. Alternative Solutions ●  Sharded MongoDB (With replica sets) ◦  Pros: ●  Increases Throughput by the amount of shards ●  Increases the availability of the DB ◦  Cons: ●  Very hard to manage DevOps wise (for a small team of developers) ●  High cost of servers – because each shared need 3 replicas
  • 38. Workflow with MongoDB Worker 1 Worker 2 …. …. … … Worker N Spark Cluster Master Write Read Master
  • 39. Our DevOps – After that solution We had no DevOps guy at that time at all ☹
  • 40. Alternative Solutions ●  Apache Cassandra ◦  Pros: ●  Very large developer community ●  Linearly scalable Database ●  No single master architecture ●  Proven working with distributed engines like Apache Spark ◦  Cons: ●  We had no experience at all with the Database ●  No Geo Spatial Index – Needed to implement by ourselves
  • 41. The Solution ●  Migration to Apache Cassandra ●  Create easily a Cassandra cluster using DataStax Community AMI on AWS ◦  First easy step – Using the spark-cassandra-connector (Easy bootstrap move to Spark ⬄ Cassandra) ◦  Creating a monitoring dashboard to Cassandra
  • 42. Workflow with Cassandra Worker 1 Worker 2 …. …. … … Worker N Cassandr a Cluster Spark Cluster Write Read
  • 43. Result ●  Performance improvement ◦  Batch write parts of the job run in 3 minutes instead of ~ 40 minutes in MongoDB ●  Took 2 weeks to go from “Zero to Hero”, and to ramp up a running solution that work without glitches
  • 44. So what’s the problem (Again)?
  • 45. Transferring the Heaviest Process ●  Micro service that runs every 10 minutes ●  Writes to Cassandra 30GB per iteration ◦  (Replication factor 3 => 90GB) ●  At first took us 18 minutes to do all of the writes ◦  Not Acceptable in a 10 minute process
  • 47. Transferring the Heaviest Process ●  Solutions ◦  We chose the i2.xlarge ◦  Optimization of the Cluster ◦  Changing the JDK to Java-8 ●  Changing the GC algorithm to G1 ◦  Tuning the Operation system ●  Ulimit, removing the swap ◦  Write time went down to ~5 minutes (For 30GB RF=3) Sounds good right? I don’t think so
  • 49. The Solution ●  Taking the same Data Model that we held in Cassandra (All of the Raw data per 10 minutes) and put it on S3 ◦  Write time went down from ~5 minutes to 1.5 minutes ●  Added another process, not dependent on the main one, happens every 15 minutes ◦  Reads from S3, downscales the amount and Writes them to Cassandra for serving
  • 50. How it looks after all? Parsed Raw Static / Aggregated Data Spark Analytics Layers UI Serving Downscale d Data Heavy Fusion Process
  • 51. Conclusion ●  Always give an estimate to your data ◦  Frequency ◦  Volume ◦  Arrangement of the previous phase ●  There is no “Best” persistence layer ◦  There is the right one for the job ◦  Don’t overload an existing solution
  • 52. Conclusion ●  Spark is a great framework for distributed collections ◦ Fully functional API ◦ Can perform imperative actions ● “With great power, comes lots of partitioning” ◦ Control your work and data distribution via partitions ●  https://www.pinterest.com/pin/155514993354583499/ (Thanks)
  • 54. Thanks! my contact: —Demi Ben-Ari ●  LinkedIn ●  Twitter: @demibenari ●  Blog: http://progexc.blogspot.com/ ●  Email: demi.benari@gmail.com ●  “Big Things” Community –Meetup, YouTube, Facebook, Twitter