"Scalable Language"
China Mobile
• 1. Introduction
• 2. Functional Programming (FP)
• 3. Object Orientation (OO)
• 4. Type System
• 5. Monad
Built the current javac
One of the designers of Generic Java
Martin Odersky
-- the designer of Scala
Compiles to Java bytecode
Calls into Java almost seamlessly
Statically typed
A powerful type system
Who?
Lisp
Erlang
Haskell
Java
All languages descend from Lisp,
and Scala's design philosophy is fairly close to Lisp's
What?
JVM
Interoperates with Java
Monad
Parallel computation model
Type system
Scala
Coursera
Spark
Meetup
Linkedin
Gilt
Foursquare
Who is using Scala?
Scala usage is quite diverse: some of it is Spark applications, and many websites also use Scala for their backends
Alibaba middleware team
Spark
Mogujie (蘑菇街)
Kanchufang (看处方)
Qiaobutang (乔布堂)
Vipshop (唯品会)
Who is using Scala?
Large companies are mostly driven by Spark; quite a few also use Scala for middleware, exposing language-agnostic interfaces to the outside
Advantages
1. Multi-paradigm blend; strong expressive power
2. Can call Java packages; strong compatibility
3. Statically and strongly typed; compiles straight to JVM bytecode, with speed on par with Java
Disadvantages
1. The type system is quite complex; the learning curve is steep
Functional Programming
λ Expressions
Expr = Iden
| Iden => Expr
| (Expr) (Expr)
x, y, name, id, person
x => name, x => id, x => x
x(y), y(x), (x => name) y, (x => +(x)(1)) 3
x => y => +(x)(y)
λ Calculus
α-conversion
β-reduction
η-conversion
x => x == y => y, x => +(x)(z) == y => +(y)(z)
(x => +(x)(3)) 2 == +(2)(3)
(x => y => +(x)(y)) 2 3 == +(2)(3)
x => f(x) == f
Church Numerals
Zero = f => x => x
One = f => x => f(x)
Two = f => x => f(f(x))
Succ = n => f => x => f(n(f)(x))
type ChurchNumber[A] = (A => A) => A => A
def zero[A]: ChurchNumber[A] = f => a => a
def succ[A](n: ChurchNumber[A]): ChurchNumber[A] = f => a => f(n(f)(a))
val a1: Int = 0
val f1: Int => Int = x => x + 1
val a2: List[Int] = List()
val f2: List[Int] => List[Int] = list => 1 :: list
val a3: String = ""
val f3: String => String = s => "|" + s
println(zero[Int](f1)(a1)); println(succ(succ(zero[Int]))(f1)(a1))
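A small sketch building on the definitions above (plus, toInt, and the printed values are illustrative assumptions, not from the original slides): Church addition just chains applications of f, and a numeral can be read back by counting applications of (_ + 1).
// Assumed helpers, for illustration only
def plus[A](m: ChurchNumber[A], n: ChurchNumber[A]): ChurchNumber[A] =
  f => a => m(f)(n(f)(a))
def toInt(n: ChurchNumber[Int]): Int = n(_ + 1)(0)
println(toInt(succ(succ(zero[Int]))))                         // 2
println(toInt(plus(succ(zero[Int]), succ(succ(zero[Int])))))  // 3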
Number = Zero
| Succ Number
import scala.annotation.tailrec

type Segment = (List[Int], List[Int], List[Int])
object Split {
def unapply (xs: List[Int]) = {
val pivot = xs(xs.size / 2)
@tailrec
def partition (s: Segment, ys: List[Int]): Segment = {
val (left, mid, right) = s
ys match {
case Nil => s
case head :: tail if head < pivot => partition((head :: left, mid, right), tail)
case head :: tail if head == pivot => partition((left, head :: mid, right), tail)
case head :: tail if head > pivot => partition((left, mid, head :: right), tail)
}
}
Some(partition((Nil, Nil, Nil), xs))
}
}
def qsort(xs: List[Int]): List[Int] = xs match {
case Nil => xs
case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}
Quick Sort
Tail recursion
Extractor
Pattern matching
Guard
Pattern Matching
def sum(list: List[Int]): Int =
if (list.isEmpty) 0
else list.head + sum(list.tail)
def sum(list: List[Int]): Int = list match {
case List() => 0
case head :: tail => head + sum(tail)
}
Tail recursion
def sum(list: List[Int], acc: Int): Int = list match {
case Nil => acc
case head :: tail => sum(tail, acc + head)
}
var list = (1 to 100).toArray
for (int i = 0; i < 100; i++) {
list[i] += 1
}
list = list.map(1 +)
Why functional programming?
var list = (1 to 100).toArray
for (int i = 0; i < 100; i++) {
list[i] += 1
}
list = list.view.map(1 +)
Why functional programming?
var list = (1 to 100).toArray
for (int i = 0; i < 100; i++) {
list[i] += 1
}
list = list.par.map(1 +)
Why functional programming?
6 ^ 6
6 * 6 * 6 * 6 * 6 * 6
def ^(x: Int, y: Int): Int = {
if (y == 0) 1
else if (y % 2 == 0) ^(x * x, y / 2)
else x * ^(x, y - 1)
}
Why functional programming?
5 + 3
Currying
fold(z: Int)(f: (Int, Int) => Int): Int
val list = List(1, 2, 3, 4)
def fold0 = list.foldLeft(0)
def fold1 = list.foldLeft(1)
: Int
5 + : Int => Int
+ : (Int, Int) => Int
fold0((x, y) => x + y)
fold1((x, y) => x * y)
: ((Int, Int) => Int) => Int
: ((Int, Int) => Int) => Int
+ : Int => Int => Int
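A runnable sketch of the two views of + shown above (the value names are illustrative): the uncurried (Int, Int) => Int fits foldLeft's second argument list, while the curried Int => Int => Int supports partial application like 5 +.
val plus: (Int, Int) => Int = (x, y) => x + y         // + : (Int, Int) => Int
val plusCurried: Int => Int => Int = x => y => x + y  // + : Int => Int => Int
val add5: Int => Int = plusCurried(5)                 // 5 + : Int => Int
println(add5(3))                                      // 8
val list = List(1, 2, 3, 4)
println(list.foldLeft(0)(plus))                       // fold0((x, y) => x + y) == 10
println(list.foldLeft(1)(_ * _))                      // fold1((x, y) => x * y) == 24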
Side effects
class Pair[A](var x: A, var y: A) {
def modifyX(x: A) = this.x = x
def modifyY(y: A) = this.y = y
}
var pair = new Pair(1, 2)
var pair1 = new Pair(pair, pair)
var pair2 = new Pair(pair, new Pair(1, 2))
pair.modifyX(3)
Value vs. reference
Side effects
Associativity
var variable = 0
implicit class FooInt(i: Int) {
def |+|(j: Int) = {
variable = (i + j) / 2
i + j + variable
}
}
(1 |+| 2) |+| 3
1 |+| (2 |+| 3)
= 10
= 12
Side effects
Associativity
var variable = 0
implicit class FooInt(i: Int) {
def |+|(j: Int) = {
variable += 1
i + j * variable
}
}
(1 |+| 2) |+| 3
1 |+| (2 |+| 3)
= 9
= 11
map(f: T => U): A[U]
filter(f: T => Boolean): A[T]
flatMap(f: T => A[U]): A[U]
groupBy(f: T => K): A[(K, List[T])]
sortBy(f: T => K): A[T]
NEW
Count: Int
Force: A[T]
Reduce(f: (T, T) => T): T
Higher-Order Functions
Transformation
Action
A
B
Map
[A] -> (A -> B) -> [B]
Higher-order functions
List(1, 2, 3, 4).map(_.toString)
(A -> B) -> ([A] -> [B])
A
A
Filter
?
A
[A] -> (A -> Boolean) -> [A]
Higher-order functions
List(1, 2, 3, 4).filter(_ < 3)
(A -> Boolean) -> ([A] -> [A])
A
B
Fold
Identity element
[A] -> B -> (B -> A -> B) -> B
Higher-order functions
val list = List("one", "two", "three")
list.foldLeft(0)((sum, str) => {
if (str.contains("o")) sum + 1
else sum
})
B -> (B -> A -> B) -> ([A] -> B)
[A]
A
Flatten
[[A]] -> [A]   Higher-order functions
List(List(1, 2), List(3, 5)).flatten
Quick Sort
object Split {
def unapply (xs: List[Int]) = {
val pivot = xs(xs.size / 2)
Some(xs.partitionBy(pivot))
}
}
def qsort(xs: List[Int]): List[Int] = xs match {
case Nil => xs
case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}
Quick Sort
type Segment = (List[Int], List[Int], List[Int])
implicit class ListWithPartition(val list: List[Int]) extends AnyVal {
def partitionBy(p: Int): Segment = {
val idenElem = (List[Int](), List[Int](), List[Int]())
def partition(result: Segment, x: Int): Segment = {
val (left, mid, right) = result
if (x < p) (x :: left, mid, right)
else if (x == p) (left, x :: mid, right)
else (left, mid, x :: right)
}
list.foldLeft(idenElem)(partition)
}
}
Implicit conversion
A
B
Map
[A] -> (A -> B) -> [B]
Par
Higher-order functions
Lazy evaluation
val foo = List(1, 2, 3, 4, 5)
val baz = foo.map(5 +).map(3 +).filter(_ > 10).map(4 *)
baz.take(2)
Yet what we actually get are
foo.map(5 +)
foo.map(5 +).map(3 +)
foo.map(5 +).map(3 +).filter(_ > 10)
three intermediate results
In an imperative language:
for (int i = 0; i < 5; ++i) {
int x = foo[i] + 5 + 3;
if (x > 10)
bar.add(x * 4);
else
continue;
}
At the point where we declare it,
what we want is a promise (a computation),
not the result
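A small sketch of that idea (the values are illustrative): with .view the three intermediate lists above are never materialized; only the elements actually demanded by take(2) are computed.
val foo = List(1, 2, 3, 4, 5)
val bazStrict = foo.map(5 +).map(3 +).filter(_ > 10).map(4 *)       // builds three intermediate lists
val bazLazy   = foo.view.map(5 +).map(3 +).filter(_ > 10).map(4 *)  // only records the computation
println(bazLazy.take(2).toList)                                     // List(44, 48), computed on demand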
A
B
Map
[A] -> (A -> B) -> [B]
View
Higher-order functions
val fibs: Stream[Int] = 0 #:: 1 #:: fibs.zip(fibs.tail).map(n => n._1 + n._2)
Streams and lazy evaluation
Quora
Lazy evaluation
zip = ([A], [B]) => [(A, B)]
Lazy evaluation
lazy val x = 3 + 3
def number = {println("OK"); 3 + 3}
class LazyValue(expr: => Int) {
var evaluated: Boolean = false
var value: Int = -1
def get: Int = {
if (!evaluated) {
value = expr
evaluated = true
}
value
}
}
val lazyValue = new LazyValue(number)
println(lazyValue.get)
println(lazyValue.get)
Thinking in Java
Map can be implemented with the decorator pattern
Call By Name
Object Orientation
Scala is an object-oriented language; its object-oriented purity is at least higher than Java's.
Values such as 1, 2, 1.1 are all objects.
The 1 + 2 we write is actually 1.+(2),
though at compile time primitive types are substituted.
And the function x: Int => x.toString
is a Function1[Int, String].
So you can write map(5 +)
but not map(+ 5)
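A quick illustrative check of the points above (the printed results are assumptions about ordinary Scala behaviour, not from the slides):
println((1).+(2))                  // 3 -- the same call as 1 + 2
val f: Function1[Int, String] = (x: Int) => x.toString
println(f(42))                     // "42"
println(List(1, 2, 3).map(5 + _))  // List(6, 7, 8) -- map(5 +) also works; map(+ 5) does not parse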
Some syntactic sugar
class Sugar(i: Int) {
def unary_- = -i
def apply(expr: => Unit) = for (j <- 1 to i) expr
def +(that: Int) = i + that
def +:(that: Int) = i + that
}
val sugar = new Sugar(2)
-sugar
sugar(println("aha"))
sugar + 5
5 + sugar
Prefix
Infix
Omitting the method name
all letters
|
^
&
< >
= !
: (note: right-associative)
+ -
* / %
other special characters
Right-associative
The goal is to support DSLs
and to stay consistent with functional-programming idioms.
Please use these features sparingly.
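A small sketch of the right-associativity rule for methods ending in ':' (the list values are illustrative):
val xs = 1 :: 2 :: Nil   // parsed right-to-left as Nil.::(2).::(1)
val ys = 0 +: xs         // +: ends in ':', so this is xs.+:(0)
println(xs)              // List(1, 2)
println(ys)              // List(0, 1, 2)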
Mix-ins are a mechanism for multiple inheritance. Like interfaces, they tame the complexity of multiple inheritance by constraining the second (and later) parents, but they can carry default implementations.
1. Ordinary inheritance provides single inheritance
2. The second and any further parents must be traits
3. A trait cannot be instantiated on its own
Traits in Scala can be mixed in at compile time or at instantiation time.
Trait & Mix-in
But obviously a person can also run and sing... and, on top of that, he can code.
Imagine we want to describe a kind of bird that can sing and also run; since it is a bird, it can of course fly.
abstract class Bird(kind: String) {
val name: String
def singMyName = println(s"$name is singing")
val capability: Int
def run = println(s"I can run $capability meters!!!")
def fly = println(s"flying of kind: $kind")
}
(Not that I have anything against birds, but if you meet one that can code, please let me know.)
Inheritance
trait Runnable {
val capability: Int
def run = println(s"I can run $capability meters!!!")
}
trait Singer {
val name: String
def singMyName = println(s"$name is singing")
}
abstract class Bird(kind: String) {
def fly = println(s"flying of kind: $kind")
}
Inheritance
class Nightingale extends Bird("Nightingale") with Singer with Runnable {
val capability = 20
val name = "poly"
}
val myTinyBird = new Nightingale
myTinyBird.fly
myTinyBird.singMyName
myTinyBird.run
class Coder(language: String) {
val capability = 10
val name = "Handemelindo"
def code = println(s"coding in $language")
}
val me = new Coder("Scala") with Runnable with Singer
me.code
me.singMyName
me.run
Inheritance
A companion
object Sugar {
def apply(i: Int) = new Sugar(i)
}
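A usage sketch (assuming the Sugar class from the syntactic-sugar slide): apply lets the companion object act as a factory, so no new is needed.
val sugar = Sugar(3)     // expands to Sugar.apply(3)
sugar(println("aha"))    // prints "aha" three times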
The factory pattern can be implemented here
Companion object
Some companions
sealed trait Tree
case class Leaf(info: String) extends Tree
case class Node(left: Tree, right: Tree) extends Tree
def traverse(tree: Tree): Unit = {
tree match {
case Leaf(info) => println(info)
case Node(left, right) => {
traverse(left)
traverse(right)
}
}
}
val tree: Tree = new Node(new Node(new Leaf("1"), new Leaf("2")), new Leaf("3"))
traverse(tree)
Case Classes and ADTs
Inheritance gives sum types
case classes give product types
Tree = Leaf String
| Node Tree Tree
Type System
If you are a C programmer, a type is:
If you are a Java programmer, a type is:
If you are an R programmer, a type is:
If you are a Ruby programmer, a type is:
And for a Scala programmer, a type is:
an indicator telling the computer how many bytes it needs to store these numbers
a label for where instances live,
so that the compiler can check that your program is consistent
a marker of which statistical computations should be applied to these variables
something you should avoid
what UML is to Java: a guarantee of correctness, a blueprint of the program
Guess what this is: e.g. [(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]
*
Any
Int
1
Pair[Int, Int]
(1, 2)
List[Int]
[1, 2, 3]
List :: * => *    Pair :: * => * => *
Kind
Type
Value
Type constructor
Kind
Subtype
Generics of a Higher Kind - Martin Odersky
=> => =>
Proper Type
type Int :: *
type String :: *
type (Int => String) :: *
type List[Int] :: *
type List :: ?
type Function1 :: ??
Let's do some abstraction exercises
type List :: * => *
type Function1 :: * => * => *    (Function1[-T, +R])
def id(x: Int) = x
type Id[A] = A
type id[A[_], B] = A[B]
def id(f: Int => Int, x: Int) = f(x)
type Pair[K[_], V[_]] = (K[A], V[A]) forSome { type A }
(* -> *) -> (* -> *) -> *
Imagine our program needs to return a result of the form:
(Set(x,x,x,x,x), List(x,x,x,x,x,x,x,x,x,x))
val pair: Pair[Set, List] = (Set("42"), List(52))
val pair: Pair[Set, List] = (Set(42), List(52))
Let's do some abstraction exercises
Recall that type Function1 :: * => * => *
For another example, suppose we have the following function:
def foo[A[_]](bar: A[Int]): A[Int] = bar
We can feed it a (* => *), for example:
val foo1 = foo[List](List(1, 2, 3, 5, 8, 13))
But what if we have:
def baz(x: Int) = println(x)
Type Lambda
What do we do then?
So we need a * => *: fix Function1's second parameter to Unit, giving F[X] = Function1[X, Unit]
val foo2 = foo[ ({type F[X] = Function1[X, Unit]})#F ](baz)
trait Monoid[A]{
val zero: A
def append(x: A, y: A): A
}
object IntNum extends Monoid[Int] {
val zero = 0
def append(x: Int, y: Int) = x + y
}
object DoubleNum extends Monoid[Double] {
val zero = 0d
def append(x: Double, y: Double) = x + y
}
def sum[A](nums: List[A])(tc: Monoid[A]) =
nums.foldLeft(tc.zero)(tc.append)
sum(List(1, 2, 3, 5, 8, 13))(IntNum)
sum(List(3.14, 1.68, 2.72))(DoubleNum)
Abstracting over morphisms
trait Monoid[A]{
val zero: A
def append(x: A, y: A): A
}
object IntNum extends Monoid[Int] {
val zero = 0
def append(x: Int, y: Int) = x + y
}
object DoubleNum extends Monoid[Double] {
val zero = 0d
def append(x: Double, y: Double) = x + y
}
def sum[A](nums: List[A])(implicit tc: Monoid[A]) =
nums.foldLeft(tc.zero)(tc.append)
sum(List(1, 2, 3, 5, 8, 13))
sum(List(3.14, 1.68, 2.72))
implicit
implicit
Type Class
1. Separation of abstraction
2. Composable
3. Overridable
4. Type-safe
val list = List(1,3,234,56,5346,34)
list.sorted    sorted[B >: A](implicit ord: math.Ordering[B])
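A sketch of what that means in practice: sorted resolves an implicit Ordering[B], and we can supply a different type-class instance explicitly (the reversed ordering here is illustrative).
val list = List(1, 3, 234, 56, 5346, 34)
println(list.sorted)                       // default Ordering[Int]: List(1, 3, 34, 56, 234, 5346)
println(list.sorted(Ordering.Int.reverse)) // explicit instance: List(5346, 234, 56, 34, 3, 1)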
Type Class
What type classes are for
List(1, 2, 3, 5) -> "1,2,3,5"
(1, 2) -> "1,2"
(List(1,2,3,5), List(8,13,21)) -> "1,2,3,5,8,13,21"
(List(1,2,3,5), (42.0, List("a", "b"))) -> "1,2,3,5,42.0,a,b"
What type classes are for
trait Writable[A] {
  def write(a: A): String
}
implicit def numericWritable[A: Numeric]: Writable[A] = new Writable[A] {
  def write(a: A): String = a.toString
}
implicit val stringWritable: Writable[String] = new Writable[String] {
  def write(a: String): String = a
}
implicit def listWritable[A: Writable]: Writable[List[A]] = new Writable[List[A]] {
  def write(a: List[A]): String = {
    val writableA = implicitly[Writable[A]]
    a.map(writableA.write).mkString(",")
  }
}
implicit def pairWritable[A: Writable, B: Writable]: Writable[(A, B)] =
  new Writable[(A, B)] {
    def write(p: (A, B)): String = {
      val writableA = implicitly[Writable[A]]
      val writableB = implicitly[Writable[B]]
      writableA.write(p._1) + "," + writableB.write(p._2)
    }
  }
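A usage sketch (the write helper below is a hypothetical convenience, not part of the slides): implicit resolution assembles the nested instances automatically.
def write[A: Writable](a: A): String = implicitly[Writable[A]].write(a)
println(write(List(1, 2, 3, 5)))                           // "1,2,3,5"
println(write((List(1, 2, 3, 5), (42.0, List("a", "b"))))) // "1,2,3,5,42.0,a,b"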
Hermann Weyl -- "The Mathematical Way of Thinking":
Now comes the decisive step of mathematical abstraction: we forget about what the symbols stand for. We should not stop here; there are many operations that can be applied to these symbols without ever considering what they actually represent.
Monad
A monoid in the category of endofunctors
Philip Wadler
(1) Closure: for any a, b ∈ G, a*b ∈ G
(2) Associativity: for any a, b, c ∈ G, (a*b)*c = a*(b*c)
(3) Identity: there is an identity element e such that for any a ∈ G, e*a = a*e = a
(4) Inverse: for any a ∈ G, there is an inverse a⁻¹ such that a⁻¹*a = a*a⁻¹ = e
Group
What is a group (Group)?
What is a semigroup (SemiGroup)?
One that satisfies only (1) and (2)
What is a monoid (Monoid)?
One that satisfies (1), (2), and (3)
Monoid
Enough talk, show me the code
trait SemiGroup[T] {
def append(a: T, b: T): T
}
trait Monoid[T] extends SemiGroup[T] {
def zero: T
}
class ListMonoid[T] extends Monoid[List[T]] {
def zero = Nil
def append(a: List[T], b: List[T]) = a ++ b
}
Functor
What is a functor (Functor)?
Int List[Int]
String List[String]
Functor
Functor
What is a functor (Functor)?
trait Functor[F[_]] {
def map[A, B](f: (A) => B)(a: F[A]): F[B]
}
map[B](f: (A) => B): List[B]
Monad
A monoid on endofunctors
Recall the identity element of a monoid
Recall the fold function
What is the identity element on endofunctors?
What is the associative operation on endofunctors?
unit x >>= f ≡ f x
m >>= unit ≡ m
(m >>= f) >>= g ≡ m >>= (λx. f x >>= g)
Identity: lifts a value into the computational context
Associativity: composes simple computations into complex ones
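A minimal Monad sketch in Scala (illustrative, not a standard-library trait), with the three laws above restated as comments and Option as an example instance:
trait Monad[M[_]] {
  def unit[A](a: A): M[A]                         // lifts a value into the context
  def flatMap[A, B](m: M[A])(f: A => M[B]): M[B]  // >>=
  // Left identity:  flatMap(unit(x))(f)       == f(x)
  // Right identity: flatMap(m)(unit)          == m
  // Associativity:  flatMap(flatMap(m)(f))(g) == flatMap(m)(x => flatMap(f(x))(g))
}
object OptionMonad extends Monad[Option] {
  def unit[A](a: A): Option[A] = Some(a)
  def flatMap[A, B](m: Option[A])(f: A => Option[B]): Option[B] = m.flatMap(f)
}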
Some common Monads
Option
Option (also called Maybe) represents a computation that may fail
Represented by Some(value) or None
Some(x) flatMap (f: A => Option[B]) = f(x)
None flatMap (f: A => Option[B]) = None
unit = Some
val maybe: Option[Int] = Some(4)
val none: Option[Int] = None
def calculate(maybe: Option[Int]):
Option[Int] = for {
value <- maybe
} yield value + 5
calculate(maybe)
calculate(none)
Some common Monads
List
A collection is itself a proper type; it represents nondeterminism
unit = List
val list1 = List(2, 4, 6, 8)
val list2 = List(1, 3, 5, 7)
for {
value1 <- list1.map(1 +)
value2 <- list2
} yield value1 + value2
Future
Future wraps a computation; it represents a result that arrives in the future
unit = Future
val future1 = Future(SomeProcess)
val future2 = Future(AnotherProcess)
for {
value1 <- future1.map(SomeTransformation)
value2 <- future2
} yield value1 + value2
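A runnable sketch with concrete values substituted for the SomeProcess / SomeTransformation placeholders above (the numbers and the Await are illustrative):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val future1 = Future(21)
val future2 = Future(10)
val sum = for {
  value1 <- future1.map(_ * 2) // some transformation
  value2 <- future2
} yield value1 + value2
println(Await.result(sum, 1.second)) // 52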
Some common Monads
for {
(name, date(year, _, day)) <- nameList
if name.length > 3
char <- name
} yield char -> s"$name-$day@$year"
Usage
nameList.flatMap {
case (name, date(year, _, day)) =>
if (name.length > 3) {
name.map { char =>
char -> s"$name-$day@$year"
}
} else Map()
case _ => Map()
}
val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
val nameList = Map(
"haskell" -> "1900-12-12", "godel" -> "1906-04-28",
"church" -> "1903-06-14", "turing" -> "1912/06/23"
)
var map = Map[Char, String]()
var i = 0
val list = nameList.toArray
while (i < list.size) {
val name = list(i)._1
val theDate = list(i)._2
if (theDate.matches("""\d\d\d\d-\d\d-\d\d""")) {
val parts = theDate.split("-")
val year = parts(0)
val day = parts(2)
val charArray = name.toCharArray
var j = 0
while (j < charArray.length) {
val char = charArray(j)
map += char -> (name + "-" + day + "@" + year)
j += 1
}
}
i += 1
}
• 1. Introduction
• 2. MapReduce from an FP perspective
• 3. RDD from an FP perspective
• 4. RDD
• 5. MLlib
Spark
Spark vs. Map Reduce
Ecosystem: the Spark platform itself is largely mature, but MLlib, Spark SQL, etc. are still evolving | MR: very mature, with many applications
Computation model: Monadic-like (not a true Monad), Functor | MR: Map Reduce
Storage: mainly memory | MR: mainly disk
Programming style: collection-oriented | MR: interface-oriented
A general-purpose parallel computation framework
Spark
Map Reduce Monadic
Spark SQL   MLlib   GraphX   Spark Streaming
Spark
Local mode   Standalone mode   YARN   Mesos
HDFS   Amazon S3   Hypertable   HBase   etc.
Advantages
1. Collection-oriented, convenient to develop with
2. Supports more kinds of computation than MR
3. In-memory computation is faster, and data can be persisted for iteration; when the data is not that "big", it can also be "fast"
Disadvantages
1. Memory is consumed quickly; consider serialization libraries such as Kryo
2. Under lazy evaluation, computation time is hard to estimate and optimization is difficult
[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]
Word Count
[Line]
flatMap(_.split("\\s+")).map((_, 1))
groupBy(_._1)
[(Word, 1)] -> [(Word, [1])] -> [(Word, n)]
reduceBy(_._1)(_._2 + _._2)
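A sketch of the same [(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)] pipeline on plain Scala collections (the input lines are made up):
val lines = List("to be or not to be", "to code or not to code")
val counts = lines
  .flatMap(_.split("\\s+")).map((_, 1))                       // [(Word, 1)]
  .groupBy(_._1)                                              // [(Word, [(Word, 1)])]
  .map { case (word, pairs) => word -> pairs.map(_._2).sum }  // [(Word, n)]
println(counts) // e.g. Map(to -> 4, be -> 2, or -> 2, not -> 2, code -> 2), order may vary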
Map Reduce
map(f: T => U)
filter(f: T => Boolean)
flatMap(f: T => Seq[U])
sample(fraction: Float)
groupByKey()
reduceByKey(f: (V, V) => V)
mapValues(f: V => W)
NEW
Count()
Collect()
Reduce(f: (T, T) => T)
Lookup(k: K)
Save(path: String)
take(n: Int)
RDD
Transformation
Action
union()
join()
cogroup()
crossProduct()
sort(c: Comparator[K])
partitionBy(p: Partitioner[K])
[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]
Word Count
val lines = spark.textFile("hdfs://...")
val words = lines.flatMap(_.split("\\s+"))
val wordCounts = words.map((_, 1))
val result = wordCounts.reduceByKey(_ + _)
result.save("hdfs://…")
RDD
What is an RDD?
Characteristics of an RDD
• An immutable, partitioned collection
• Can only be created by reading a file or through a transformation
• Fault-tolerant
• Controllable storage level
• Cacheable
• A coarse-grained model
• Statically typed
new StorageLevel(useDisk, useMemory, deserialized, replication)
the cache() method
recomputed through lineage
What is an RDD?
A lazy, parallel collection of computations
• Lazy: each transformation only records which operation will be applied to the data
• The advantages of laziness: evaluate once, with enough information available to batch operations automatically.
• Parallel: we put the data into a computational context,
and the computational context parallelizes the computation automatically
RDDs are collection-oriented
The RDD implementation
A five-tuple
• Partitions:
atomic pieces of the data, e.g. HDFS blocks; they represent the data itself
• Preferred Locations:
lists where each partition can be accessed most quickly
• Dependencies:
the dependencies on parent RDDs; a child is computed from its parents
• Computation:
the computation which, applied to the parents' data, yields this RDD's data
• Metadata:
metadata such as the RDD's location and partitioning scheme
The RDD implementation
The lazy computations we have seen so far are all linear; they can be represented as
Map(+5) -> Map(*7) -> Filter(_ % 2 == 0) -> Collect
But what about other kinds of computation?
How do we represent a lazy computation?
The RDD implementation
How do we represent a lazy computation?
A DAG
Through topological sorting:
1. Trace back to the source and start computing from there
2. Group data that does not need to be shuffled into the same processing stage
The RDD implementation
Lineage
Represents the relationships between computations:
• Narrow dependencies: cheap
• Wide dependencies: expensive
e.g. map, union:
one or more partitions of the parent RDDs correspond to a single partition of the child RDD;
can be computed locally
e.g. groupBy:
one partition of the parent RDD corresponds to multiple partitions of the child RDD;
requires shuffling
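A sketch of the two kinds of dependencies in code (assuming a SparkContext named sc, in the spirit of the earlier word-count example): map and flatMap are narrow and can be pipelined locally, while groupByKey and reduceByKey are wide and trigger a shuffle.
val pairs   = sc.textFile("hdfs://...").flatMap(_.split("\\s+")).map((_, 1)) // narrow dependencies
val grouped = pairs.groupByKey()        // wide dependency: full shuffle
val counts  = pairs.reduceByKey(_ + _)  // wide dependency, but combines map-side first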
RDD execution
[Diagram: the driver's SparkContext connects to a Cluster Manager, which launches Executors; each Executor holds a Cache and runs Tasks]
RDD execution
1. An RDD is created directly from an external data source (HDFS, local files, etc.)
2. The RDD goes through a series of TRANSFORMATIONs
3. An ACTION is executed; the final RDD is converted and written out to an external data source.
At the same time: partitioning is optimized, closures are shipped, data is shuffled, and load is balanced, all automatically
MLlib
SVM with SGD
LR with SGD or LBFGS
NB
Various decision trees
Random forests
GBT
LabeledPoint(Double, Vector)
Classification
val data = sc.textFile("….")
val parsedData = data.map { line =>
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
}
val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)
val labelAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
MLlib
LabeledPoint(Double <- Vector)
Regression
Linear
Ridge
Lasso
Isotonic
val data = sc.textFile("….")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
}
val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) =>
math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
MLlib
Clustering:
k-means
and its variant k-means++
Gaussian Mixture
LDA
Vector
Clustering
val data = sc.textFile("….")
val parsedData = data.map( _.split(' ').map(_.toDouble))
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters,
numIterations)
val WSSSE = clusters.computeCost(parsedData)
MLlib
Supports explicit and implicit ALS
Rating(Int, Int, Double)
Collaborative Filtering
val data = sc.textFile("….")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01)
val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
val predictions = model.predict(usersProducts).map{
case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map{
case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1 - r2), 2)
}.reduce(_ + _) / ratesAndPreds.count
MLlib
FP-Growth
Array[Item]
Frequent Pattern
val data = sc.textFile("….")
val transactions: RDD[Array[String]] = data.map(_.split(","))
val fpg = new FPGrowth()
.setMinSupport(0.2)
.setNumPartitions(10)
val model = fpg.run(transactions)
model.freqItemsets.collect().foreach { itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
Related Materials
Paper: -- click me
Official documentation: -- click me
Official API: -- click me
Berkeley's Spark course on edX: -- click me
Berkeley's MLlib course on edX: -- click me
Happy!
Hacking!
China Mobile
THANKS
for your attention!