Genomic analysis applications, libraries, and design patterns for Spark and Scala.
Presented at Scala Symposium 2017: https://conf.researchr.org/event/scala-2017/scala-2017-papers-genomic-data-analysis-in-scala-open-source-talk-
8. magic-rdds
Collection operations implemented for Spark RDDs:
- scans
  - {left,right}
  - {elements, values of tuples}
- .runLengthEncode, group consecutive elements by predicate / Ordering
- .reverse
- reductions: .maxByKey, .minByKey
- sliding/windowed traversals
- .size: "smart" count
  - multiple counts in one job:
    val (count1, count2) = (rdd1, rdd2).size
  - smart partition-tracking: reuse counts for UnionRDDs
- zips
  - lazy partition-count, eager partition-number check
- sameElements, equals
- group/sample by key: first elements or reservoir-sampled
  - HyperGeometric distribution handling Longs: hammerlab/math-utils
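The RDD implementations aren't shown on the slide; as a sketch of what `.runLengthEncode` does to an RDD's elements, here is the same idea written against a plain `Iterator` (the method name matches magic-rdds; the body is illustrative, not the library's code):

```scala
// Plain-Iterator sketch of runLengthEncode: collapse runs of equal
// consecutive elements into (element, runLength) pairs. Illustrative
// only; the real RDD version also merges runs that span Spark
// partition boundaries.
def runLengthEncode[T](xs: Iterator[T]): Iterator[(T, Int)] =
  new Iterator[(T, Int)] {
    private val buf = xs.buffered
    def hasNext: Boolean = buf.hasNext
    def next(): (T, Int) = {
      val head = buf.next()
      var count = 1
      while (buf.hasNext && buf.head == head) {
        buf.next()
        count += 1
      }
      (head, count)
    }
  }
```

e.g. `runLengthEncode("aaabbc".iterator).toList` yields `List(('a',3), ('b',2), ('c',1))`.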
8 / 17
9. hammerlab/iterators
- scans (in terms of cats.Monoid)
- sliding/windowed traversals
- eager drops/takes
  - by number
  - while
  - until
- sorted/range zips
- SimpleBufferedIterator
  - iterator in terms of _advance(): Option[T]
  - hasNext lazily buffers/caches head
- etc.
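The SimpleBufferedIterator pattern can be sketched in a few lines; this is a reconstruction from the slide's description (`_advance(): Option[T]`, lazy caching in `hasNext`), not the library's actual source:

```scala
// Reconstruction of the SimpleBufferedIterator idea: subclasses implement
// only _advance(); hasNext lazily computes the next element and caches it
// until next() consumes it. None from _advance() signals exhaustion.
abstract class SimpleBufferedIterator[T] extends Iterator[T] {
  protected def _advance(): Option[T]
  private var buffered: Option[T] = null  // null ⇒ not yet computed
  def hasNext: Boolean = {
    if (buffered == null) buffered = _advance()
    buffered.nonEmpty
  }
  def next(): T = {
    if (!hasNext) throw new NoSuchElementException("next on empty iterator")
    val elem = buffered.get
    buffered = null  // force a fresh _advance() on the next hasNext
    elem
  }
}

// Example subclass: counts down from `from` to 1
def countdown(from: Int): Iterator[Int] =
  new SimpleBufferedIterator[Int] {
    private var i = from
    protected def _advance(): Option[Int] =
      if (i > 0) { val r = i; i -= 1; Some(r) } else None
  }
```

`countdown(3).toList` yields `List(3, 2, 1)`; note that `_advance()` only runs when `hasNext` is asked, so the iterator stays lazy.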
10. args4j vs. case-app
- statically-checked/typed handlers
- implicit resolution
- inheritance vs. composition
- mutable vs. immutable
- case-app positional-arg support: #58
- spark-commands: command-line interfaces
class Opts {
  @args4j.Option(
    name = "--in-path",
    aliases = Array("-i"),
    handler = classOf[PathOptionHandler],
    usage = "Input path to read from"
  )
  var inPath: Option[Path] = None

  @args4j.Option(
    name = "--out-path",
    aliases = Array("-o"),
    handler = classOf[PathOptionHandler],
    usage = "Output path to write to"
  )
  var outPath: Option[Path] = None

  @args4j.Option(
    name = "--overwrite",
    aliases = Array("-f"),
    usage = "Whether to overwrite an existing output path"
  )
  var overwrite: Boolean = false
}
case class Opts(
  @Opt("-i")
  @Msg("Input path to read from")
  inPath: Option[Path] = None,
  @Opt("-o")
  @Msg("Output path to write to")
  outPath: Option[Path] = None,
  @Opt("-f")
  @Msg("Whether to overwrite an existing output path")
  overwrite: Boolean = false
)
12. Deep case-class hierarchy:
case class A(n: Int)
case class B(s: String)
case class C(a: A, b: B)
case class D(b: Boolean)
case class E(c: C, d: D, a: A, a2: A)
case class F(e: E)

Instances:
val a = A(123)
val b = B("abc")
val c = C(a, b)
val d = D(true)
val e = E(c, d, A(456), A(789))
val f = F(e)

Pull out fields by type and/or name:
f.find('c)       // f.e.c
f.findT[C]       // f.e.c
f.field[C]('c)   // f.e.c
f.field[A]('a2)  // f.e.a2
f.field[B]('b)   // f.e.c.b

As evidence parameters:
def findAandB[T](t: T)(
  implicit
  findA: Find[T, A],
  findB: Find[T, B]
): (A, B) =
  (findA(t), findB(t))

shapeless-utils: "recursive structural types"
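shapeless-utils derives `Find` instances generically; as a sketch of what that evidence amounts to, here are two hand-written instances for the hierarchy above (the `Find` trait shape and the chosen field paths are assumptions for illustration, not the library's code):

```scala
// Hand-written sketch of Find evidence; the real shapeless-utils derives
// such instances by recursing through case-class structure. The paths
// chosen here (f.e.c.a, f.e.c.b — depth-first first match) are assumptions.
trait Find[T, F] { def apply(t: T): F }

case class A(n: Int)
case class B(s: String)
case class C(a: A, b: B)
case class D(b: Boolean)
case class E(c: C, d: D, a: A, a2: A)
case class F(e: E)

implicit val findA: Find[F, A] =
  new Find[F, A] { def apply(f: F): A = f.e.c.a }
implicit val findB: Find[F, B] =
  new Find[F, B] { def apply(f: F): B = f.e.c.b }

// the evidence-parameter signature from the slide, now satisfiable
def findAandB[T](t: T)(
  implicit
  findA: Find[T, A],
  findB: Find[T, B]
): (A, B) =
  (findA(t), findB(t))

val f = F(E(C(A(123), B("abc")), D(true), A(456), A(789)))
```

`findAandB(f)` then returns `(A(123), B("abc"))`.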
13. Nesting/Mixing implicit contexts
Minimal-boilerplate Spark CLI apps:
- input Path
- output Path (or: just a PrintStream)
- SparkContext
- select Broadcast variables
- other argument-input objects
How to make all of these things implicitly available with minimal boilerplate?
14.
def app1() = {
  // call methods that want implicit
  // input Path, SparkContext
}

def app2() = {
  // call methods that want implicit
  // Path, SparkContext, PrintStream
}
Ideally:
15.
def run(
  implicit
  inPath: Path,
  printStream: PrintStream,
  sc: SparkContext,
  ranges: Broadcast[Ranges],
  …
): Unit = {
  // do thing
}

case class Context(
  inPath: Path,
  printStream: PrintStream,
  sc: SparkContext,
  ranges: Broadcast[Ranges],
  …
)

def run(implicit ctx: Context): Unit = {
  implicit val Context(
    inPath, printStream, sc, ranges, …
  ) = ctx
  // do thing
}
16. Nesting/Mixing implicit contexts
How to make many implicits available with minimal boilerplate? ≈

abstract class HasArgs(val args: Array[String])

trait HasInputPath { self: HasArgs ⇒
  implicit val inPath = Path(args(0))
}

trait HasOutputPath { self: HasArgs ⇒
  val outPath = Path(args(1))
}

trait HasPrintStream extends HasOutputPath { self: HasArgs ⇒
  implicit val printStream = new PrintStream(newOutputStream(outPath))
}

trait HasSparkContext {
  implicit val sc: SparkContext = new SparkContext(…)
}

class MinimalApp(args: Array[String])
  extends HasArgs(args)
  with HasInputPath
  with HasPrintStream
  with HasSparkContext

object Main {
  def main(args: Array[String]): Unit =
    new MinimalApp(args) {
      // all the implicits!
    }
}
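The same trait-stacking can be sketched without Spark; the wrapper types `InPath`/`OutPath` here are illustrative stand-ins for the distinct implicit types (`Path`, `SparkContext`, …) the real traits provide:

```scala
// Spark-free sketch of the Has* trait-mixin approach: each trait
// contributes one implicit; mixing them into an app brings all of
// them into scope at once. InPath/OutPath are illustrative stand-ins.
case class InPath(value: String)
case class OutPath(value: String)

abstract class HasArgs(val args: Array[String])

trait HasInputPath { self: HasArgs =>
  implicit val inPath: InPath = InPath(args(0))
}

trait HasOutputPath { self: HasArgs =>
  implicit val outPath: OutPath = OutPath(args(1))
}

// a method that wants both implicits
def plan(implicit in: InPath, out: OutPath): String =
  s"${in.value} -> ${out.value}"

class MinimalApp(args: Array[String])
  extends HasArgs(args)
  with HasInputPath
  with HasOutputPath {
  val result: String = plan  // both implicits in scope via the mixins
}
```

`new MinimalApp(Array("in.bam", "out.bam")).result` evaluates to `"in.bam -> out.bam"` without any explicit wiring in the app body.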
17. {to,from}String: invertible syntax
Miscellaneous tools output "reports":

466202931615 uncompressed positions
156G compressed
Compression ratio: 2.78
1236499892 reads
22489 false positives, 0 false negatives

That comes from a data structure like:

case class Result(
  numPositions     : Long,
  compressedSize   : Bytes,
  compressionRatio : Double,
  numReads         : Long,
  numFalsePositives: Long,
  numFalseNegatives: Long
)

or better yet:

case class Result(
  numPositions    : NumPositions,
  compressedSize  : CompressedSize,
  compressionRatio: CompressionRatio,
  numReads        : NumReads,
  falseCounts     : FalseCounts
)

This is basically toString / the Show type-class
- twist: downstream tools want to parse these reports
  - want to re-hydrate Result instances

implicit val _iso: Iso[FalseCounts] =
  iso"${'numFPs} false positives, ${'numFNs} false negatives"
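The `iso""` interpolator itself isn't shown on the slide; a hand-rolled stand-in for one field type makes the round-trip concrete (the regex-based `render`/`parse` pair is illustrative, not the library's mechanism):

```scala
// Hand-rolled stand-in for the invertible iso"" syntax: render and parse
// are kept in sync by sharing one textual format. Illustrative only; the
// real library derives both directions from a single interpolated template.
case class FalseCounts(numFPs: Long, numFNs: Long)

def render(fc: FalseCounts): String =
  s"${fc.numFPs} false positives, ${fc.numFNs} false negatives"

val Pattern = """(\d+) false positives, (\d+) false negatives""".r

def parse(s: String): Option[FalseCounts] =
  s match {
    case Pattern(fps, fns) => Some(FalseCounts(fps.toLong, fns.toLong))
    case _                 => None
  }
```

`parse(render(FalseCounts(22489, 0)))` round-trips to `Some(FalseCounts(22489, 0))`, which is exactly the re-hydration downstream tools need.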