SlideShare a Scribd company logo
1 of 38
1© Cloudera, Inc. All rights reserved.
Estimating Financial Risk with
Spark
Sandy Ryza | Senior Data Scientist
2© Cloudera, Inc. All rights reserved.
3© Cloudera, Inc. All rights reserved.
4© Cloudera, Inc. All rights reserved.
In reasonable
circumstances, what’s
the most you can expect
to lose?
5© Cloudera, Inc. All rights reserved.
6© Cloudera, Inc. All rights reserved.
def valueAtRisk(
portfolio,
timePeriod,
pValue
): Double = { ... }
7© Cloudera, Inc. All rights reserved.
def valueAtRisk(
portfolio,
2 weeks,
0.05
) = $1,000,000
8© Cloudera, Inc. All rights reserved.
Probability
density
Portfolio return ($) over the time
period
9© Cloudera, Inc. All rights reserved.
VaR estimation approaches
• Variance-covariance
• Historical
• Monte Carlo
10© Cloudera, Inc. All rights reserved.
11© Cloudera, Inc. All rights reserved.
12© Cloudera, Inc. All rights reserved.
Market Risk Factors
• Indexes (S&P 500, NASDAQ)
• Prices of commodities
• Currency exchange rates
• Treasury bonds
13© Cloudera, Inc. All rights reserved.
Predicting Instrument Returns from Factor Returns
• Train a linear model on the factors for each instrument
14© Cloudera, Inc. All rights reserved.
Fancier
• Add features that are non-linear transformations of the market risk factors
• Decision trees
• For options, use Black-Scholes
15© Cloudera, Inc. All rights reserved.
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression
// Load the instruments and factors
val factorReturns: Array[Array[Double]] = ...
val instrumentReturns: RDD[Array[Double]] = ...
// Fit a model to each instrument
val models: Array[Array[Double]] =
instrumentReturns.map { instrument =>
val regression = new OLSMultipleLinearRegression()
regression.newSampleData(instrument, factorReturns)
regression.estimateRegressionParameters()
}.collect()
16© Cloudera, Inc. All rights reserved.
How to sample factor returns?
• Need to be able to generate sample vectors where each component is a factor
return.
• Factors returns are usually correlated.
17© Cloudera, Inc. All rights reserved.
Distribution of US treasury bond two-week returns
18© Cloudera, Inc. All rights reserved.
Distribution of crude oil two-week returns
19© Cloudera, Inc. All rights reserved.
The Multivariate Normal Distribution
• Probability distribution over vectors of length N
• Given all the variables but one, that variable is distributed according to a univariate
normal distribution
• Models correlations between variables
20© Cloudera, Inc. All rights reserved.
21© Cloudera, Inc. All rights reserved.
import org.apache.commons.math3.stat.correlation.Covariance
// Compute means
val factorMeans: Array[Double] = transpose(factorReturns)
.map(factor => factor.sum / factor.size)
// Compute covariances
val factorCovs: Array[Array[Double]] = new Covariance(factorReturns)
.getCovarianceMatrix().getData()
22© Cloudera, Inc. All rights reserved.
Fancier
• Multivariate normal often a poor choice compared to more sophisticated options
• Fatter tails: Multivariate T Distribution
• Filtered historical simulation
• ARMA
• GARCH
23© Cloudera, Inc. All rights reserved.
Running the simulations
• Create an RDD of seeds
• Use each seed to generate a set of simulations
• Aggregate results
24© Cloudera, Inc. All rights reserved.
def trialReturn(factorDist: MultivariateNormalDistribution, models: Seq[Array[Double]]): Double = {
val trialFactorReturns = factorDist.sample()
var totalReturn = 0.0
for (model <- models) {
// Add the returns from the instrument to the total trial return
for (i <- until trialFactorsReturns.length) {
totalReturn += trialFactorReturns(i) * model(i)
}
}
totalReturn
}
25© Cloudera, Inc. All rights reserved.
// Broadcast the factor return -> instrument return models
val bModels = sc.broadcast(models)
// Generate a seed for each task
val seeds = (baseSeed until baseSeed + parallelism)
val seedRdd = sc.parallelize(seeds, parallelism)
// Create an RDD of trials
val trialReturns: RDD[Double] = seedRdd.flatMap { seed =>
trialReturns(seed, trialsPerTask, bModels.value, factorMeans, factorCovs)
}
26© Cloudera, Inc. All rights reserved.
Executor
Executor
Time
Model
parameters
Model
parameters
27© Cloudera, Inc. All rights reserved.
Executor
Task Task
Trial Trial TrialTrial
Task Task
Trial Trial Trial Trial
Executor
Task Task
Trial Trial TrialTrial
Task Task
Trial Trial Trial Trial
Time
Model
parameters
Model
parameters
28© Cloudera, Inc. All rights reserved.
// Cache for reuse
trialReturns.cache()
val numTrialReturns = trialReturns.count().toInt
// Compute value at risk
val valueAtRisk = trials.takeOrdered(numTrialReturns / 20).last
// Compute expected shortfall
val expectedShortfall =
trials.takeOrdered(numTrialReturns / 20).sum / (numTrialReturns / 20)
29© Cloudera, Inc. All rights reserved.
30© Cloudera, Inc. All rights reserved.
VaR
31© Cloudera, Inc. All rights reserved.
32© Cloudera, Inc. All rights reserved.
So why Spark?
33© Cloudera, Inc. All rights reserved.
Ease of use
• Parallel computing for 5-year olds
• Scala and Python REPLs
34© Cloudera, Inc. All rights reserved.
• Cleaning data
• Fitting models
• Running simulations
• Analyzing results
Single platform for
35© Cloudera, Inc. All rights reserved.
But it’s CPU-bound and we’re using Java?
• Computational bottlenecks are normally in matrix operations, which can be BLAS-
ified
• Can call out to GPUs just like in C++
• Memory access patterns aren’t high-GC inducing
36© Cloudera, Inc. All rights reserved.
Want to do this yourself?
37© Cloudera, Inc. All rights reserved.
spark-timeseries
• https://github.com/cloudera/spark-finance
• Everything here + the fancier stuff
• Patches welcome!
38© Cloudera, Inc. All rights reserved.

More Related Content

Similar to Estimating Financial Risk with Spark

Similar to Estimating Financial Risk with Spark (20)

Estimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache SparkEstimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache Spark
 
Spark etl
Spark etlSpark etl
Spark etl
 
Discovering exoplanets with Deep Leaning
Discovering exoplanets with Deep LeaningDiscovering exoplanets with Deep Leaning
Discovering exoplanets with Deep Leaning
 
MySQL-Performance Schema- What's new in MySQL-5.7 DMRs
MySQL-Performance Schema- What's new in MySQL-5.7 DMRsMySQL-Performance Schema- What's new in MySQL-5.7 DMRs
MySQL-Performance Schema- What's new in MySQL-5.7 DMRs
 
RivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and HistogramsRivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and Histograms
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
MySQL Performance Schema : fossasia
MySQL Performance Schema : fossasiaMySQL Performance Schema : fossasia
MySQL Performance Schema : fossasia
 
Parquet Vectorization in Hive
Parquet Vectorization in HiveParquet Vectorization in Hive
Parquet Vectorization in Hive
 
Achieving Global Consistency Using AWS CloudFormation StackSets - AWS Online ...
Achieving Global Consistency Using AWS CloudFormation StackSets - AWS Online ...Achieving Global Consistency Using AWS CloudFormation StackSets - AWS Online ...
Achieving Global Consistency Using AWS CloudFormation StackSets - AWS Online ...
 
Disaster Recovery Planning using Azure Site Recovery
Disaster Recovery Planning using Azure Site RecoveryDisaster Recovery Planning using Azure Site Recovery
Disaster Recovery Planning using Azure Site Recovery
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Dive into DevOps | March, Building with Terraform, Volodymyr Tsap
Dive into DevOps | March, Building with Terraform, Volodymyr TsapDive into DevOps | March, Building with Terraform, Volodymyr Tsap
Dive into DevOps | March, Building with Terraform, Volodymyr Tsap
 
Kubernetes Disaster Recovery - Los Angeles K8s meetup Dec 10 2019
Kubernetes Disaster Recovery - Los Angeles K8s meetup Dec 10 2019Kubernetes Disaster Recovery - Los Angeles K8s meetup Dec 10 2019
Kubernetes Disaster Recovery - Los Angeles K8s meetup Dec 10 2019
 
Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Avoiding Disasters by Embracing Chaos
Avoiding Disasters by Embracing ChaosAvoiding Disasters by Embracing Chaos
Avoiding Disasters by Embracing Chaos
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Estimating Financial Risk with Spark

  • 1. 1© Cloudera, Inc. All rights reserved. Estimating Financial Risk with Spark Sandy Ryza | Senior Data Scientist
  • 2. 2© Cloudera, Inc. All rights reserved.
  • 3. 3© Cloudera, Inc. All rights reserved.
  • 4. 4© Cloudera, Inc. All rights reserved. In reasonable circumstances, what’s the most you can expect to lose?
  • 5. 5© Cloudera, Inc. All rights reserved.
  • 6. 6© Cloudera, Inc. All rights reserved. def valueAtRisk( portfolio, timePeriod, pValue ): Double = { ... }
  • 7. 7© Cloudera, Inc. All rights reserved. def valueAtRisk( portfolio, 2 weeks, 0.05 ) = $1,000,000
  • 8. 8© Cloudera, Inc. All rights reserved. Probability density Portfolio return ($) over the time period
  • 9. 9© Cloudera, Inc. All rights reserved. VaR estimation approaches • Variance-covariance • Historical • Monte Carlo
  • 10. 10© Cloudera, Inc. All rights reserved.
  • 11. 11© Cloudera, Inc. All rights reserved.
  • 12. 12© Cloudera, Inc. All rights reserved. Market Risk Factors • Indexes (S&P 500, NASDAQ) • Prices of commodities • Currency exchange rates • Treasury bonds
  • 13. 13© Cloudera, Inc. All rights reserved. Predicting Instrument Returns from Factor Returns • Train a linear model on the factors for each instrument
  • 14. 14© Cloudera, Inc. All rights reserved. Fancier • Add features that are non-linear transformations of the market risk factors • Decision trees • For options, use Black-Scholes
  • 15. 15© Cloudera, Inc. All rights reserved. import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression // Load the instruments and factors val factorReturns: Array[Array[Double]] = ... val instrumentReturns: RDD[Array[Double]] = ... // Fit a model to each instrument val models: Array[Array[Double]] = instrumentReturns.map { instrument => val regression = new OLSMultipleLinearRegression() regression.newSampleData(instrument, factorReturns) regression.estimateRegressionParameters() }.collect()
  • 16. 16© Cloudera, Inc. All rights reserved. How to sample factor returns? • Need to be able to generate sample vectors where each component is a factor return. • Factors returns are usually correlated.
  • 17. 17© Cloudera, Inc. All rights reserved. Distribution of US treasury bond two-week returns
  • 18. 18© Cloudera, Inc. All rights reserved. Distribution of crude oil two-week returns
  • 19. 19© Cloudera, Inc. All rights reserved. The Multivariate Normal Distribution • Probability distribution over vectors of length N • Given all the variables but one, that variable is distributed according to a univariate normal distribution • Models correlations between variables
  • 20. 20© Cloudera, Inc. All rights reserved.
  • 21. 21© Cloudera, Inc. All rights reserved. import org.apache.commons.math3.stat.correlation.Covariance // Compute means val factorMeans: Array[Double] = transpose(factorReturns) .map(factor => factor.sum / factor.size) // Compute covariances val factorCovs: Array[Array[Double]] = new Covariance(factorReturns) .getCovarianceMatrix().getData()
  • 22. 22© Cloudera, Inc. All rights reserved. Fancier • Multivariate normal often a poor choice compared to more sophisticated options • Fatter tails: Multivariate T Distribution • Filtered historical simulation • ARMA • GARCH
  • 23. 23© Cloudera, Inc. All rights reserved. Running the simulations • Create an RDD of seeds • Use each seed to generate a set of simulations • Aggregate results
  • 24. 24© Cloudera, Inc. All rights reserved. def trialReturn(factorDist: MultivariateNormalDistribution, models: Seq[Array[Double]]): Double = { val trialFactorReturns = factorDist.sample() var totalReturn = 0.0 for (model <- models) { // Add the returns from the instrument to the total trial return for (i <- until trialFactorsReturns.length) { totalReturn += trialFactorReturns(i) * model(i) } } totalReturn }
  • 25. 25© Cloudera, Inc. All rights reserved. // Broadcast the factor return -> instrument return models val bModels = sc.broadcast(models) // Generate a seed for each task val seeds = (baseSeed until baseSeed + parallelism) val seedRdd = sc.parallelize(seeds, parallelism) // Create an RDD of trials val trialReturns: RDD[Double] = seedRdd.flatMap { seed => trialReturns(seed, trialsPerTask, bModels.value, factorMeans, factorCovs) }
  • 26. 26© Cloudera, Inc. All rights reserved. Executor Executor Time Model parameters Model parameters
  • 27. 27© Cloudera, Inc. All rights reserved. Executor Task Task Trial Trial TrialTrial Task Task Trial Trial Trial Trial Executor Task Task Trial Trial TrialTrial Task Task Trial Trial Trial Trial Time Model parameters Model parameters
  • 28. 28© Cloudera, Inc. All rights reserved. // Cache for reuse trialReturns.cache() val numTrialReturns = trialReturns.count().toInt // Compute value at risk val valueAtRisk = trials.takeOrdered(numTrialReturns / 20).last // Compute expected shortfall val expectedShortfall = trials.takeOrdered(numTrialReturns / 20).sum / (numTrialReturns / 20)
  • 29. 29© Cloudera, Inc. All rights reserved.
  • 30. 30© Cloudera, Inc. All rights reserved. VaR
  • 31. 31© Cloudera, Inc. All rights reserved.
  • 32. 32© Cloudera, Inc. All rights reserved. So why Spark?
  • 33. 33© Cloudera, Inc. All rights reserved. Ease of use • Parallel computing for 5-year olds • Scala and Python REPLs
  • 34. 34© Cloudera, Inc. All rights reserved. • Cleaning data • Fitting models • Running simulations • Analyzing results Single platform for
  • 35. 35© Cloudera, Inc. All rights reserved. But it’s CPU-bound and we’re using Java? • Computational bottlenecks are normally in matrix operations, which can be BLAS- ified • Can call out to GPUs just like in C++ • Memory access patterns aren’t high-GC inducing
  • 36. 36© Cloudera, Inc. All rights reserved. Want to do this yourself?
  • 37. 37© Cloudera, Inc. All rights reserved. spark-timeseries • https://github.com/cloudera/spark-finance • Everything here + the fancier stuff • Patches welcome!
  • 38. 38© Cloudera, Inc. All rights reserved.