Quark: A Purely-Functional
Scala DSL for Data
Processing & Analytics
John A. De Goes
@jdegoes - http://degoes.net
Apache Spark
Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing.
val textFile = sc.textFile("hdfs://...")
val counts =
  textFile.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Spark Sucks
— Functional-ish
— Exceptions, typecasts
— SparkContext
— Serializable
— Unsafe type-safe programs
— Second-class support for databases
— Dependency hell (>100)
— Painful debugging
— Implementation-dependent performance
Why Does Spark Have to Suck?
Computation
val textFile = sc.textFile("hdfs://...")
val counts =
  textFile.flatMap(line => line.split(" ")) // <---- Where Spark goes wrong
    .map(word => (word, 1))                 // <---- Where Spark goes wrong
    .reduceByKey(_ + _)                     // <---- Where Spark goes wrong
WWFPD?
— Purely functional
— No exceptions, no casts, no nulls
— No global variables
— No serialization
— Safe type-safe programs
— First-class support for databases
— Few dependencies
— Better debugging
— Implementation-independent performance
Rule #1 in Functional
Programming
Don't solve the problem, describe the solution.
AKA the "Do Nothing" rule
=> Don't compute, embed a compiled language into
Scala
Quark
Compilation
Quark is a Scala DSL built on Quasar Analytics, a
general-purpose compiler for translating data processing over
semi-structured data into efficient plans that execute
100% inside the target infrastructure.
val textFile = Dataset.load("...")
val counts =
  textFile.flatMap(line => line.typed[Str].split(" "))
    .map(word => (word, 1))
    .reduceByKey(_.sum)
More Quark
Compilation
val dataset = Dataset.load("/prod/profiles")
val averageAge = dataset.groupBy(_.country[Str]).map(_.age[Int]).reduceBy(_.average)
Quark Targets
One DSL to Rule Them All
— MongoDB
— Couchbase
— MarkLogic
— Hadoop / HDFS
— Add your connector here!
Both Quark and Quasar Analytics are purely-functional,
open source projects written in 100% Scala.
https://github.com/quasar-analytics/
How To DSL
Adding Integers
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr

def int(v: Int): Expr = Integer(v)
def add(l: Expr, r: Expr): Expr = Addition(l, r)

add(add(int(1), int(2)), int(3)) : Expr

def interpret(e: Expr): Int = e match {
  case Integer(v)     => v
  case Addition(l, r) => interpret(l) + interpret(r)
}

def serialize(v: Expr): Json = ???
def deserialize(v: Json): Expr = ???
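One way to see what "describe the solution" buys: the same description can be handed to more than one interpreter. The sketch below repeats the integer DSL from this slide and adds a second interpreter, `show`, which is a hypothetical stand-in for the `serialize` function the slide leaves as `???` (it renders text rather than `Json`). The wrapping object `IntDsl` and the name `program` are also assumptions made for illustration.

```scala
object IntDsl {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr

  def int(v: Int): Expr = Integer(v)
  def add(l: Expr, r: Expr): Expr = Addition(l, r)

  // Interpreter 1: evaluate the description to an Int.
  def interpret(e: Expr): Int = e match {
    case Integer(v)     => v
    case Addition(l, r) => interpret(l) + interpret(r)
  }

  // Interpreter 2 (hypothetical): render the same description as text.
  def show(e: Expr): String = e match {
    case Integer(v)     => v.toString
    case Addition(l, r) => "(" + show(l) + " + " + show(r) + ")"
  }

  // Building this value computes nothing; it only describes a computation.
  val program: Expr = add(add(int(1), int(2)), int(3))
}
```

Here `IntDsl.interpret(IntDsl.program)` evaluates to 6, while `IntDsl.show(IntDsl.program)` produces `((1 + 2) + 3)` from the same value.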
How To DSL
Adding Strings
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr // Uh, oh!
final case class Str(v: String) extends Expr
final case class StringConcat(l: Expr, r: Expr) extends Expr // Uh, oh!
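A short sketch of why the "Uh, oh!" matters, assuming the untyped encoding above: nothing stops a caller from adding a string to an integer, so an interpreter can only report the mistake at run time. The `Either`-returning `interpret`, the error messages, and the names `UntypedDsl` and `nonsense` are illustrative assumptions, not part of the talk.

```scala
object UntypedDsl {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr
  final case class Str(v: String) extends Expr
  final case class StringConcat(l: Expr, r: Expr) extends Expr

  // This compiles, although adding a string to an integer is meaningless.
  val nonsense: Expr = Addition(Str("one"), Integer(2))

  // With no type information on Expr, the result type degrades to Any
  // and ill-typed programs surface as run-time errors.
  def interpret(e: Expr): Either[String, Any] = e match {
    case Integer(v) => Right(v)
    case Str(v)     => Right(v)
    case Addition(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(a: Int), Right(b: Int)) => Right(a + b)
        case (Left(err), _)                 => Left(err)
        case (_, Left(err))                 => Left(err)
        case _                              => Left("Addition expects two integers")
      }
    case StringConcat(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(a: String), Right(b: String)) => Right(a + b)
        case (Left(err), _)                       => Left(err)
        case (_, Left(err))                       => Left(err)
        case _                                    => Left("StringConcat expects two strings")
      }
  }
}
```

The casts hidden in the typed patterns (`a: Int`, `a: String`) are exactly the kind of thing the "no exceptions, no casts" goal is meant to rule out.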
How To DSL
Phantom Type
sealed trait Expr[A]
final case class Integer(v: Int) extends Expr[Int]
final case class Addition(l: Expr[Int], r: Expr[Int]) extends Expr[Int]
final case class Str(v: String) extends Expr[String]
final case class StringConcat(l: Expr[String], r: Expr[String]) extends Expr[String]
def interpret[A](e: Expr[A]): A = e match {
  case Integer(v)         => v
  case Addition(l, r)     => interpret(l) + interpret(r)
  case Str(v)             => v
  case StringConcat(l, r) => interpret(l) ++ interpret(r)
}

def serialize[A](v: Expr[A]): Json = ???
def deserialize(v: Json): Expr[A] forSome { type A } = ???
How To DSL
GADTs in Scala still have bugs
SI-8563, SI-9345, SI-6680
FRIENDS DON'T LET FRIENDS USE GADTS IN SCALA.
How To DSL
Finally Tagless
trait Expr[F[_]] {
  def int(v: Int): F[Int]
  def str(v: String): F[String]
  def add(l: F[Int], r: F[Int]): F[Int]
  def concat(l: F[String], r: F[String]): F[String]
}

trait Dsl[A] {
  def apply[F[_]](implicit F: Expr[F]): F[A]
}

def int(v: Int): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
}
def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
}
// ...
How To DSL
Finally Tagless
type Id[A] = A
def interpret: Expr[Id] = new Expr[Id] {
  def int(v: Int): Id[Int] = v
  def str(v: String): Id[String] = v
  def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
  def concat(l: Id[String], r: Id[String]): Id[String] = l + r
}

add(int(1), int(2)).apply(interpret) // 3, of type Id[Int] (i.e. Int)
final case class Const[A, B](a: A)
def serialize: Expr[Const[Json, ?]] = ???
def deserialize[F[_]: Expr](json: Json): F[A] forSome { type A } = ???
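As a rough illustration of the point of the finally tagless encoding, the sketch below runs one `Dsl[Int]` value through two interpreters: the `Id` evaluator from the slide and a text renderer. The renderer and its carrier type `Rendered` are hypothetical stand-ins for the `Const[Json, ?]` serializer that the slide leaves unimplemented; the object name `TaglessDemo` and the vals `program`, `value` and `text` are likewise assumptions.

```scala
object TaglessDemo {
  trait Expr[F[_]] {
    def int(v: Int): F[Int]
    def str(v: String): F[String]
    def add(l: F[Int], r: F[Int]): F[Int]
    def concat(l: F[String], r: F[String]): F[String]
  }

  trait Dsl[A] {
    def apply[F[_]](implicit F: Expr[F]): F[A]
  }

  def int(v: Int): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
  }
  def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
  }

  // Interpreter 1: direct evaluation, as on the slide.
  type Id[A] = A
  val evaluate: Expr[Id] = new Expr[Id] {
    def int(v: Int): Id[Int] = v
    def str(v: String): Id[String] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def concat(l: Id[String], r: Id[String]): Id[String] = l + r
  }

  // Interpreter 2 (hypothetical): ignore the phantom type A and build text.
  final case class Rendered[A](text: String)
  val render: Expr[Rendered] = new Expr[Rendered] {
    def int(v: Int): Rendered[Int] = Rendered(v.toString)
    def str(v: String): Rendered[String] = Rendered("\"" + v + "\"")
    def add(l: Rendered[Int], r: Rendered[Int]): Rendered[Int] =
      Rendered("(" + l.text + " + " + r.text + ")")
    def concat(l: Rendered[String], r: Rendered[String]): Rendered[String] =
      Rendered("(" + l.text + " ++ " + r.text + ")")
  }

  val program: Dsl[Int] = add(add(int(1), int(2)), int(3))

  val value: Int   = program.apply[Id](evaluate)            // 6
  val text: String = program.apply[Rendered](render).text   // "((1 + 2) + 3)"
}
```

Adding a new interpreter requires no change to `Dsl` or to existing programs, and ill-typed terms such as `add(int(1), str("x"))` cannot be written at all.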
Quark 101
The Building Blocks
— Type. Represents a reified type of an element in a dataset.
— **Dataset[A]**. Represents a dataset, produced by successive
application of set-level operations (SetOps). Describes a
directed acyclic graph.
— **MappingFunc[A, B]**. Represents a function from A to B that is
produced by successive application of mapping-level operations
(MapOps) to the input.
— **ReduceFunc[A, B]**. Represents a reduction from A to B, produced
by application of reduction-level operations (ReduceOps) to the input.
Let's Build Us a Mini-Quark!
Mini-Quark
Type System
sealed trait Type
object Type {
  final case class Unknown() extends Type
  final case class Timestamp() extends Type
  final case class Date() extends Type
  final case class Time() extends Type
  final case class Interval() extends Type
  final case class Int() extends Type
  final case class Dec() extends Type
  final case class Str() extends Type
  final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
  final case class Arr[A <: Type](element: A) extends Type
  final case class Tuple2[A <: Type, B <: Type](_1: A, _2: B) extends Type
  final case class Bool() extends Type
  final case class Null() extends Type

  type UnknownMap = Map[Unknown, Unknown]
  val UnknownMap : UnknownMap = Map(Unknown(), Unknown())

  type UnknownArr = Arr[Unknown]
  val UnknownArr : UnknownArr = Arr(Unknown())

  type Record[A <: Type] = Map[Str, A]
  type UnknownRecord = Record[Unknown]
}
Mini-Quark
Set-Level Operations
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
}
Mini-Quark
Dataset
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
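To make the "Do Nothing" rule concrete for datasets, here is a minimal sketch in which `Dataset.read` only records intent and a separate interpreter decides what that intent means. The `Plan` carrier type, the `planOps` instance, and the `READ ...` text format are hypothetical; a real connector would emit something the target system can execute. The `Type` hierarchy is cut down to `Unknown` to keep the block self-contained, and `SetOps` is left unsealed so instances can be defined freely.

```scala
object ReadOnlyQuark {
  sealed trait Type
  final case class Unknown() extends Type

  trait SetOps[F[_]] {
    def read(path: String): F[Unknown]
  }

  trait Dataset[A] {
    def apply[F[_]](implicit F: SetOps[F]): F[A]
  }
  object Dataset {
    def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
    }
  }

  // Hypothetical target: a textual plan instead of a real query.
  final case class Plan[A](text: String)
  implicit val planOps: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Unknown] = Plan("READ " + path)
  }

  // Nothing is read here; the dataset is only compiled to a plan.
  val plan: String = Dataset.read("/prod/profiles").apply[Plan].text
}
```

`ReadOnlyQuark.plan` is the string `READ /prod/profiles`; swapping in a different `SetOps` instance would target a different backend without touching the dataset definition.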
Mini-Quark
Mapping
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
  def map[A, B](v: F[A], f: ???): F[B] // What goes here?
}
Mini-Quark
Mapping: Attempt #1
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
  def map[A, B](v: F[A], f: F[A] => F[B]): F[B] // Doesn't really work...
}
Mini-Quark
Mapping: Attempt #2
sealed trait MappingFunc[A, B] {
  def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
}
trait MappingOps[F[_]] {
  def str(v: String): F[Type.Str]
  def project[K <: Type, V <: Type](v: F[Type.Map[K, V]], k: F[K]): F[V]
  def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]
  def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]
  // ...
}
object MappingFunc {
  // The identity mapping: returns its input unchanged.
  def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
  }
}
Mini-Quark
Mapping: Attempt #2
trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
  def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B] // Yay!!!
}
Mini-Quark
Dataset: Mapping
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: ???): Dataset[B] = ??? // What goes here???
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
Mini-Quark
Dataset: Mapping Attempt #1
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply[F], f)
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
// dataset.map(_.length) // Cannot ever work!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Cannot ever work!
Mini-Quark
Dataset: Mapping Attempt #2
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  // The caller receives the identity mapping and builds on it.
  def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply[F], f(MappingFunc.id[A]))
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
// dataset.map(_.length) // Works with right methods on MappingFunc!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Works with right methods on MappingFunc!
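The following is a self-contained sketch of what "the right methods on MappingFunc" could look like, under several loudly stated assumptions. Field access uses an explicit `dec("name")` method instead of the `Dynamic`-based `v.profits[Dec]` syntax; `MappingOps.field`, the `Plan` carrier, the `row` placeholder, and the plan text format are all hypothetical; and the type system is reduced to `Unknown` and `Dec`.

```scala
object MiniQuark {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Dec() extends Type
  }
  import Type._

  trait MappingOps[F[_]] {
    def field[A <: Type](v: F[Unknown], name: String): F[A] // hypothetical
    def subtract(l: F[Dec], r: F[Dec]): F[Dec]
  }

  trait MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
  }
  object MappingFunc {
    def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    // Hypothetical syntax: project a decimal field out of an untyped value.
    implicit class UnknownSyntax[A](val f: MappingFunc[A, Unknown]) {
      def dec(name: String): MappingFunc[A, Dec] = new MappingFunc[A, Dec] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Dec] =
          F.field[Dec](f.apply[F](v), name)
      }
    }
    // Binary operators combine two mappings that share the same input.
    implicit class DecSyntax[A](val f: MappingFunc[A, Dec]) {
      def -(that: MappingFunc[A, Dec]): MappingFunc[A, Dec] = new MappingFunc[A, Dec] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Dec] =
          F.subtract(f.apply[F](v), that.apply[F](v))
      }
    }
  }

  trait SetOps[F[_]] {
    def read(path: String): F[Unknown]
    def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B]
  }

  trait Dataset[A] { self =>
    def apply[F[_]](implicit F: SetOps[F]): F[A]
    def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
      def apply[F[_]](implicit F: SetOps[F]): F[B] =
        F.map(self.apply[F], f(MappingFunc.id[A]))
    }
  }
  object Dataset {
    def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
    }
  }

  // Hypothetical interpreter: compile both levels to a textual plan.
  final case class Plan[A](text: String)
  implicit val planMapping: MappingOps[Plan] = new MappingOps[Plan] {
    def field[A <: Type](v: Plan[Unknown], name: String): Plan[A] = Plan(v.text + "." + name)
    def subtract(l: Plan[Dec], r: Plan[Dec]): Plan[Dec] =
      Plan("(" + l.text + " - " + r.text + ")")
  }
  implicit val planSet: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Unknown] = Plan("READ " + path)
    def map[A, B](v: Plan[A], f: MappingFunc[A, B]): Plan[B] =
      Plan(v.text + " |> MAP " + f.apply[Plan](Plan[A]("row")).text)
  }

  val netProfit: Dataset[Dec] =
    Dataset.read("/prod/ledger").map(v => v.dec("profits") - v.dec("losses"))
  val plan: String = netProfit.apply[Plan].text
}
```

The lambda passed to `map` never sees a row of data. It receives the identity `MappingFunc` and returns a larger `MappingFunc`, which the interpreter then turns into the plan `READ /prod/ledger |> MAP (row.profits - row.losses)`.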
Mini-Quark
Dataset: Mapping Binary Operators
val netProfit = dataset.map(v => v.netRevenue[Dec] - v.netCosts[Dec])
Mini-Quark
MappingFuncs Are Arrows!
trait MappingFunc[A <: Type, B <: Type] extends Dynamic { self =>
  import MappingFunc.Case

  def apply[F[_]: MappingOps](v: F[A]): F[B]

  def >>> [C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
    def apply[F[_]: MappingOps](v: F[A]): F[C] = that.apply[F](self.apply[F](v))
  }

  def + (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].add(self(v), that(v))
  }

  def - (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].subtract(self(v), that(v))
  }

  // ...
}
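A small sketch of the arrow-style composition `>>>`, simplified so it can run on its own: it uses plain `Int` in place of the reified `Type` hierarchy, drops `Dynamic` and `NumberLike`, and evaluates with an `Id` interpreter. The helpers `plus` and `times`, the `multiply` operation, and the object name `ArrowDemo` are assumptions introduced for illustration.

```scala
object ArrowDemo {
  trait MappingOps[F[_]] {
    def int(v: Int): F[Int]
    def add(l: F[Int], r: F[Int]): F[Int]
    def multiply(l: F[Int], r: F[Int]): F[Int]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Sequential composition: feed the output of this mapping into `that`.
    def >>>[C](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
        that.apply[F](self.apply[F](v))
    }
  }

  // Hypothetical primitive mappings built from MappingOps.
  def plus(n: Int): MappingFunc[Int, Int] = new MappingFunc[Int, Int] {
    def apply[F[_]](v: F[Int])(implicit F: MappingOps[F]): F[Int] = F.add(v, F.int(n))
  }
  def times(n: Int): MappingFunc[Int, Int] = new MappingFunc[Int, Int] {
    def apply[F[_]](v: F[Int])(implicit F: MappingOps[F]): F[Int] = F.multiply(v, F.int(n))
  }

  // Direct evaluation, used here only to observe the composed mapping.
  type Id[A] = A
  implicit val evaluate: MappingOps[Id] = new MappingOps[Id] {
    def int(v: Int): Id[Int] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def multiply(l: Id[Int], r: Id[Int]): Id[Int] = l * r
  }

  val pipeline: MappingFunc[Int, Int] = plus(1) >>> times(10)
  val result: Int = pipeline.apply[Id](4) // (4 + 1) * 10 = 50
}
```

Because `pipeline` is still just a description, the same value could be compiled to a backend expression by supplying a different `MappingOps` instance.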
Mini-Quark
Applicative Composition
[Diagram: a MappingFunc[A, B] from A to B and a MappingFunc[A, C] from A to C combine into a single MappingFunc[A, B ⊕ C].]
Learn More
— Finally Tagless: http://okmij.org/ftp/tagless-final/
— Quark: https://github.com/quasar-analytics/quark
— Quasar: https://github.com/quasar-analytics/quasar
THANK YOU
@jdegoes - http://degoes.net
