Spark @ Hackathon
Madhukara Phatak
Zinnia Systems
@madhukaraphatak

1
Hackathon

“A hackathon is a programming ritual, usually done at night,
where programmers solve problems overnight
that might otherwise take years.”
-Anonymous

2
Where / When
Dec 6th and 7th, 2013
At the BigData Conclave
Hosted by Flutura
Solving real-world problem(s) in 24 hours
No restriction on tools to be used
Deliverables: results, working code and visualization

3
What?
Predict the global energy demand for the next year using the
energy usage data available for the last four years, in order
to enable utility companies to effectively handle the energy
demand.
Per-minute usage data collected from smart meters
From 2008 to 2012
Around 133 MB uncompressed
275,000 records
Around 25,000 records with missing data

4
Questions to be answered
What would be the energy consumption for the next day?
What would be the week-wise energy consumption for the next
one year?
What would be the household's peak-time load (peak time is
between 7 AM and 10 AM) for the next month?
During weekdays
During weekends
Assuming there was a full day of outage, calculate the
revenue loss for a particular day next year by finding the
Average Revenue Per Day (ARPD) of the household using the
given tariff plan
Can you identify the device usage patterns?
5
Who
A four-member team from Zinnia Systems
Chose Spark/Scala for solving these problems
Everyone except me was new to Scala and Spark
Solved the problems on time
Won first prize at the hackathon

6
Why Spark?
Faster prototyping
Excellent in-memory performance
Uses Akka
Able to run 2.5 million concurrent actors in 1 GB of RAM
Easy to debug
Excellent integration with IntelliJ and Eclipse
Little to code: about 500 lines of code
Something new to try at a hackathon

7
Solution
Uses only core Spark
Uses the Geometric Brownian Motion algorithm for prediction
Complete code available on GitHub:
https://github.com/zinniasystems/spark-energy-prediction
Under the Apache license
Blog series:
http://zinniasystems.com/blog/2013/12/12/
predicting-global-energy-demand-using-spark-part-1/

8
Embrace Scala
Scala is the JVM language in which Spark is implemented.
Though Spark provides Java and Python APIs, Scala feels
more natural.
If you are coming from Pig, you will feel at home with the
Scala API.
The Spark source base is small, so knowing Scala helps you
peek at the Spark source whenever needed.
Excellent REPL support.
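The REPL-friendly, chained style can be tried with plain Scala collections before touching a cluster; the same chain runs unchanged on a Spark RDD in spark-shell. A sketch (not code from the talk):

```scala
// Plain Scala collections share the map/filter/reduce vocabulary of
// Spark RDDs, so idioms can be tested in any Scala REPL first.
val readings = List(1.0, 4.0, 9.0, 16.0)
val result = readings
  .map(math.sqrt)     // transform each reading
  .filter(_ > 1.0)    // drop the uninteresting ones
  .sum                // 2.0 + 3.0 + 4.0
// result == 9.0
```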

9
Go functional
Spark encourages you to use functional idioms over
object-oriented ones.
Some of the functional idioms available are:
Closures
Function chaining
Lazy evaluation
Ex: standard deviation
Sum((xi - Mean) * (xi - Mean)) gives the sum of squared
deviations; divide by the count and take the square root to
get the standard deviation.
Map/Reduce way:
Map calculates (xi - Mean) * (xi - Mean)
Reduce does the sum
10
Spark way
// Functional way: map produces the squared deviations,
// reduce sums them, then finish the standard deviation.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  val sum = inputRDD
    .map(value => (value - mean) * (value - mean))
    .reduce((firstValue, secondValue) => firstValue + secondValue)
  math.sqrt(sum / inputRDD.count())
}

// Imperative anti-pattern: mutating a local variable inside map does
// not work in Spark -- the closure is serialized to the workers, so
// the driver's copy of sum is never updated.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  var sum = 0.0
  inputRDD.map(value => sum += (value - mean) * (value - mean))
  sum
}
The code is available in EnergyUsagePrediction.scala.
11
Use Tuples
Map/Reduce is restricted to key/value pairs
Representing data such as grouped data is difficult with
key/value pairs
Writables are too much work to develop/maintain
There was a Tuple MapReduce effort at some point of time
Spark (Scala) has tuples built in

12
Tuple Example
Aggregating data over an hour:
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]
The resulting RDD has a tuple key which combines the date and
the hour of the day.
These tuples can be passed as input to other functions.
Aggregating data over a day:
def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)]
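The signatures above can be illustrated with plain Scala collections; the Record shape here is made up for illustration, not the talk's actual class. In Spark the same grouping is a map to a (date, hour) tuple key followed by reduceByKey:

```scala
// Hypothetical Record shape, for illustration only.
case class Record(date: String, hour: Long, usage: Double)

// Tuple (date, hour) as the grouping key, mirroring hourlyAggregator.
def hourlyAggregate(records: Seq[Record]): Map[(String, Long), Double] =
  records
    .groupBy(r => (r.date, r.hour))
    .map { case (key, rs) => key -> rs.map(_.usage).sum }

val recs = Seq(
  Record("2008-01-01", 7L, 1.5),
  Record("2008-01-01", 7L, 2.5),
  Record("2008-01-01", 8L, 1.0))
val hourly = hourlyAggregate(recs)
// hourly(("2008-01-01", 7L)) == 4.0
```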

13
Use Lazy evaluation
Map/Reduce does not embrace lazy evaluation: the output of
every job has to be written to HDFS.
HDFS is the only way to share data in Map/Reduce.
Spark differs:
Every operation other than actions is lazily evaluated
Only write critical data to disk; cache other intermediate
data in memory
Be careful when you use actions; try to delay calling
actions as late as possible.
Refer to Main.scala
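The deferred-evaluation behaviour can be seen even without Spark, using Scala's lazy collection views as an analogy: a view records transformations without running them until a terminal operation, the analogue of a Spark action, forces evaluation.

```scala
// A view defers map until a terminal operation runs, much as a Spark
// RDD defers transformations until an action such as reduce or collect.
var mapCalls = 0
val pipeline = (1 to 5).view.map { x => mapCalls += 1; x * x }

val callsBeforeAction = mapCalls   // still 0: nothing evaluated yet
val total = pipeline.sum           // the "action": forces evaluation
// callsBeforeAction == 0, total == 55, mapCalls == 5
```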

14
Thank you

15