Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and TensorFlow Meetup - San Francisco - May 7 2019
Speaker: Umayah Abdennabi

Agenda

* Intro Grammarly (Umayah Abdennabi, 5 mins)

* Meetup Updates and Announcements (Chris, 5 mins)

* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi

Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.

* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi

TensorFlow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model-building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open-source Keras and tf.keras.

* SQUAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)

SQuAD (Stanford Question Answer Dataset) is an NLP challenge based around answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state-of-the-art approaches in terms of time, complexity, and accuracy.

https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad

Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.

  1. What is Grammarly? A writing assistant that helps make your communication clear and effective, wherever you type.
  2. Works Where You Do: Emails and Messages, Documents and Projects, Social Media
  3. Our Mission
  4. Grammarly’s evolution
  5. Custom Expressions in Spark. Umayah Abdennabi, Software Engineer @ Grammarly
  6. A lot of data is produced every second! How do we understand it?
  7. Gnar: Grammarly’s internal data analytics platform.
  8. Gnar: Grammarly’s internal data analytics platform.
  9. Gnar ● Goal: to understand
  10. Gnar ● Goal: to understand ○ Who are our users
  11. Gnar ● Goal: to understand ○ Who are our users ○ How do they interact with the product
  12. Gnar ● Goal: to understand ○ Who are our users ○ How do they interact with the product ○ How do they sign up, engage, pay, and how long do they stay
  13. Gnar ● Goal: to understand ○ Who are our users ○ How do they interact with the product ○ How do they sign up, engage, pay, and how long do they stay ● Allows us to make data-driven decisions
  14. Gnar
  15. Gnar: segment “eventName” where foo = “bar” by browser time from 2 months ago to today. The user writes a query using GQL in our web application. (GQL stands for Gnar Query Language, a SQL-like language built on top of Spark SQL.)
  16. Gnar: the query is sent to our backend, which runs a Spark job. segment “eventName” where foo = “bar” by browser time from 2 months ago to today
  17. Gnar: results are sent back to the user and displayed. segment “eventName” where foo = “bar” by browser time from 2 months ago to today
  18. Gnar ● When users write queries, they use something called expressions to describe what they want to do
  19. Gnar ● When users write queries, they use something called expressions to describe what they want to do ○ The previous query had two
  20. Gnar ● When users write queries, they use something called expressions to describe what they want to do ○ The previous query had two: segment “eventName” where foo = “bar” by browser time from 2 months ago to today
  21. Gnar ● When users write queries, they use something called expressions to describe what they want to do ● Hundreds of queries are run every day, and all of them use expressions
  22. Expressions
  23. SELECT weight, (price - cost) * sold FROM products WHERE price > 100
  24. SELECT weight, (price - cost) * sold FROM products WHERE price > 100
  25. SELECT weight, (price - cost) * sold FROM products WHERE price > 100. These are expressions
  26-37. Expression tree for (price - cost) * sold: a multiply node whose children are a subtract node (children: price, cost) and the column sold. Given the input row {sold: 10, price: 2.99, cost: 0.50}, the leaf nodes evaluate to their column values, the subtract node evaluates to 2.99 - 0.50 = 2.49, and the multiply node evaluates to 2.49 * 10 = 24.90.
  38. Expressions abstract class Expression extends Tree[Expression] { ... }
  39. Expressions abstract class Expression extends Tree[Expression] { def children: Seq[Expression] ... }
  40. Expressions abstract class Expression extends Tree[Expression] { def children: Seq[Expression] def eval(row: Row): Any ... }
  41. Expressions class Add(left: Expression, right: Expression) extends Expression { ... }
  42. Expressions class Add(left: Expression, right: Expression) extends Expression { def children: Seq[Expression] = Seq(left, right) ... }
  43. Expressions class Add(left: Expression, right: Expression) extends Expression { def children: Seq[Expression] = Seq(left, right) def eval(input: Row): Any = left.eval(input) + right.eval(input) ... }
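The Expression contract on these slides (children plus eval) can be sketched in a few lines of plain Python. This is a toy illustration of the same idea, not Spark's actual classes:

```python
# Toy version of Spark SQL's Expression tree: each node knows its
# children and how to evaluate itself against an input row (a dict).
class Expression:
    def children(self): return []
    def eval(self, row): raise NotImplementedError

class Column(Expression):
    def __init__(self, name): self.name = name
    def eval(self, row): return row[self.name]

class BinaryOp(Expression):
    def __init__(self, left, right): self.left, self.right = left, right
    def children(self): return [self.left, self.right]

class Sub(BinaryOp):
    def eval(self, row): return self.left.eval(row) - self.right.eval(row)

class Mul(BinaryOp):
    def eval(self, row): return self.left.eval(row) * self.right.eval(row)

# (price - cost) * sold, evaluated against one input row
profit = Mul(Sub(Column("price"), Column("cost")), Column("sold"))
row = {"sold": 10, "price": 2.99, "cost": 0.50}
print(round(profit.eval(row), 2))  # 24.9
```

Evaluation is a recursive walk: each interior node asks its children for values and combines them, exactly as in the (price - cost) * sold walkthrough above.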
  44. Expressions ● How to generate a value given input values
  45. Expressions ● How to generate a value given input values ● Used by Spark SQL to build around 300 SQL functions
  46. What if the expression we want isn’t available?
  47. UDF: User Defined Function
  48. UDF ● Easy way to add new expressions
  49. UDF ● Easy way to add new expressions ● Can be written in Scala, Java, Python, or R
  50. UDF val profit = udf((price: BigDecimal, cost: BigDecimal, sold: Int) => (price - cost) * sold)
  51. UDF val profit = udf((price: BigDecimal, cost: BigDecimal, sold: Int) => (price - cost) * sold) SELECT weight, profit(price, cost, sold) FROM products WHERE price > 100
  52. What are the limitations with UDFs?
  53. UDF ● You are using a closure which is opaque to Spark SQL, preventing many optimizations
  54. UDF ● You are using a closure which is opaque to Spark SQL, preventing many optimizations ● No access to input types
  55. UDF ● You are using a closure which is opaque to Spark SQL, preventing many optimizations ● No access to input types ○ i.e., no access to Spark SQL data types
  56. UDF ● You are using a closure which is opaque to Spark SQL, preventing many optimizations ● No access to input types ○ i.e., no access to Spark SQL data types ● Hard to have a stateful implementation
  57. Optimizations: Minimizing IO and computation
  58. Constant Folding hours * (60 * 60 * 1000)
  59. Constant Folding hours * (60 * 60 * 1000) ● Commonly done to get milliseconds in an hour
  60. Constant Folding hours * (60 * 60 * 1000) ● Commonly done to get milliseconds in an hour ● How do we reduce the time we spend computing this largely static operation?
  61-67. Constant Folding, on the tree for hours * (60 * 60 * 1000): the optimizer folds 60 * 1000 into 60,000, then 60 * 60,000 into 3,600,000, leaving the final tree hours * 3.6e6.
  68. Constant Folding: you can make your expression a candidate for constant folding by adding the following to your expression class: def foldable: Boolean = true
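Constant folding is a bottom-up rewrite: fold the children first, then replace any subtree with no column references by a literal holding its precomputed value. A toy sketch of the idea (illustration only, not Spark's implementation):

```python
# Toy constant folding: a subtree is foldable when it contains no
# column references; a foldable subtree is replaced by a literal.
class Lit:
    def __init__(self, v): self.v = v
    def foldable(self): return True
    def eval(self): return self.v

class Col:
    def __init__(self, name): self.name = name
    def foldable(self): return False

class Mul:
    def __init__(self, l, r): self.l, self.r = l, r
    def foldable(self): return self.l.foldable() and self.r.foldable()
    def eval(self): return self.l.eval() * self.r.eval()

def fold(expr):
    # bottom-up: fold children first, then this node if it is foldable
    if isinstance(expr, Mul):
        expr = Mul(fold(expr.l), fold(expr.r))
        if expr.foldable():
            return Lit(expr.eval())
    return expr

# hours * (60 * 60 * 1000)
tree = Mul(Col("hours"), Mul(Lit(60), Mul(Lit(60), Lit(1000))))
folded = fold(tree)
print(folded.r.v)  # 3600000: the constant subtree collapsed to one literal
```

After folding, evaluating the tree per row performs one multiplication instead of three, which is exactly what the benchmark on the next slide measures.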
  69. Constant Folding benchmark, one billion rows on a single-node machine: spark.range(1000000000L).withColumn("m", <expr>).rdd.count. The chart compared variants running 1.6x, 1.68x, 1.96x, and 1.97x slower than the baseline.
  70-72. Optimizations: Catalyst Optimizer
  73-77. Boolean Simplification: (a || (true && true)) simplifies to (a || true), which simplifies to true.
  78. Other Examples ● Pruning Filters
  79. Other Examples ● Pruning Filters ● Predicate Pushdown
  80. Other Examples ● Pruning Filters ● Predicate Pushdown ○ Pushing the predicate within the query plan
  81. Other Examples ● Pruning Filters ● Predicate Pushdown ○ Pushing the predicate within the query plan ○ Pushing the predicate down to the data source
  82. Other Examples ● Pruning Filters ● Predicate Pushdown
  83. Other Examples ● Pruning Filters ● Predicate Pushdown ● Simplifying Casts
  84. Optimizations ● All these optimizations are rules which are implemented with pattern matching
  85. Optimizations ● All these optimizations are rules which are implemented with pattern matching ● If an expression matches the rule, it is applied
  86. Optimizations ● All these optimizations are rules which are implemented with pattern matching ● If an expression matches the rule, it is applied ● UDFs aren’t expressions, so you cannot apply many of these optimizations
  87. Rule Example: Constant Folding object ConstantFolding extends Rule { def apply(plan: Plan): Plan = plan transformExpressions { case l: Literal => l case e if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType) } }
  88. Rule Example: Boolean Simplification object BooleanSimplification extends Rule { def apply(plan: Plan): Plan = plan transformExpressions { case TrueLiteral And e => e case FalseLiteral Or e => e case e Or FalseLiteral => e case FalseLiteral And _ => FalseLiteral case TrueLiteral Or _ => TrueLiteral case Not(TrueLiteral) => FalseLiteral ... } }
  89. Custom Expressions
  90. Custom Expressions ● Don’t have the limitations of UDFs
  91. Custom Expressions ● Don’t have the limitations of UDFs ○ Benefit fully from optimizations
  92. Custom Expressions ● Don’t have the limitations of UDFs ○ Benefit fully from optimizations ○ Access to Spark data types
  93. Custom Expressions ● Don’t have the limitations of UDFs ○ Benefit fully from optimizations ○ Access to Spark data types ○ Easy to maintain state
  94. Custom Expressions ● Don’t have the limitations of UDFs ○ Benefit fully from optimizations ○ Access to Spark data types ○ Easy to maintain state ○ You can specify code generation
  95. State class TimestampToDate(ts: Expression) extends Expression { ... }
  96. State class TimestampToDate(ts: Expression) extends Expression { def inputTypes: DataType = LongType ... }
  97. State class TimestampToDate(ts: Expression) extends Expression { def inputTypes: DataType = LongType def dataType: DataType = DateType ... }
  98. State class TimestampToDate(ts: Expression) extends Expression { def inputTypes: DataType = LongType def dataType: DataType = DateType var date = -1 var nextDayTs = -1 var prevTs = -1 ... }
  99. State class TimestampToDate(ts: Expression) extends Expression { ... var date = -1 var nextDayTs = -1 var prevTs = -1 def eval(input: Row): {..} ... }
  100. State def eval(input: Row) = { val currentTs = ts.eval(input) if (currentTs >= nextDayTs || currentTs < prevTs) { date = DateUtils.millisToSQLDate(currentTs) nextDayTs = DateUtils.nextDayToMillis(currentTs) } prevTs = currentTs date }
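The stateful trick above is: cache the computed date and only redo the conversion when the incoming timestamp leaves the cached day. A self-contained sketch of that idea (a toy variant that caches the day's millisecond bounds; the field names mirror the slide, not Spark internals):

```python
# Toy stateful expression: cache the current day's boundaries so repeated
# timestamps within the same day skip the conversion entirely.
import datetime

MS_PER_DAY = 24 * 60 * 60 * 1000

class TimestampToDate:
    def __init__(self):
        self.date = None
        self.next_day_ts = -1   # exclusive upper bound of the cached day
        self.prev_ts = 0        # inclusive lower bound of the cached day
        self.conversions = 0    # how often we actually recomputed

    def eval(self, ts_ms):
        if ts_ms >= self.next_day_ts or ts_ms < self.prev_ts:
            self.conversions += 1
            day = ts_ms // MS_PER_DAY
            self.date = datetime.date(1970, 1, 1) + datetime.timedelta(days=day)
            self.prev_ts = day * MS_PER_DAY
            self.next_day_ts = self.prev_ts + MS_PER_DAY
        return self.date

e = TimestampToDate()
dates = [e.eval(ms) for ms in (0, 1_000, 50_000, MS_PER_DAY + 5)]
print(dates[-1], e.conversions)  # 1970-01-02 2
```

Four input rows trigger only two real conversions; the two timestamps inside the cached day return the memoized date. A plain UDF cannot do this cleanly because it has no natural place to keep per-partition mutable state.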
  101. Type class ToJsonExpression(json: Expression) extends Expression { ... }
  102. Type class ToJsonExpression(json: Expression) extends Expression { def eval(input: InternalRow): Any = { toJson(json.dataType, json.eval(input)) } ... }
  103. Type def toJson(dataType: DataType, value: Any): JsValue = { if (value == null) JsNull else dataType match { case BooleanType => JsBoolean(value) case LongType => JsNumber(value) case IntegerType => JsNumber(value) case StringType => JsString(value) case structType: StructType => toJsonStruct(structType, value) case arrayType: ArrayType => toJsonArray(arrayType, value) ... } }
  104. Code Generation class Add(left: Expression, right: Expression) extends Expression { ... }
  105. Code Generation class Add(left: Expression, right: Expression) extends Expression { def doGenCode() = { val l = left.doCodeGen() val r = right.doCodeGen() s""" ${l} ${r} ${l.value} + ${r.value} """ } ... }
  106-108. Code Generation, for the tree (price + cost) + sold: interpreted evaluation calls left.eval(input) + right.eval(input) at every node, while generated code flattens the whole tree into straight-line code: Decimal p = price; Decimal c = cost; Decimal res = p + c; Decimal s = sold; Decimal value = s + res;
  109. Code Generation benchmark: one billion rows on a single-node machine
  110. Conclusion ● UDFs are great
  111. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well
  112. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well
  113. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well ● Custom Expressions are great
  114. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well ● Custom Expressions are great ○ Performance matters
  115. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well ● Custom Expressions are great ○ Performance matters ○ Complex operations which require a lower-level API
  116. Conclusion ● UDFs are great ○ Simple to write, compared to very involved expressions, and they generally work well ● Custom Expressions are great ○ Performance matters ○ Complex operations which require a lower-level API ● We use both of them to solve our complex problems
  117. We are hiring! www.grammarly.com/jobs
  118. Questions?
