Reverse Time Migration via Resilient
Distributed Datasets: Towards In-Memory
Coherence of Seismic-Reflection Wavefields
using Apache Spark
Ian Lumb
HPCS 2015 - Montreal
http://hpcs.ca
Outline
● The challenges and opportunities of RTM
● Refactoring RTM with Spark/RDDs
o Spark’ing coherence between wavefields
● Summary
http://www.acceleware.com/technical-papers
Zhou 2014
Fig. 7.25
Motivation
● RTM is performance-challenged
o Algorithms research remains topical
 GPUs responsible for compelling results
● Revisit RTM as a ‘Big Data problem’
o In-memory analytics has the potential to
 Improve performance of data and wavefield
manipulations in concert with computations
 Introduce new prospects for imaging conditions
Key Performance Challenges
● RTM modeling kernel is compute intensive
o Stable, non-dispersive solution via FDM requires
 Small time steps and small grid intervals
 Higher-order approximations of the spatial
derivatives
● RTM wavefields exceed memory capacity
o Multiple-TB source volumes must be stored to disk
e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
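As a rough illustration of the two challenges above, the following back-of-envelope sketch estimates a stability-limited time step and the storage needed to keep every snapshot of one source wavefield. All inputs (grid size, spacing, peak velocity, record length and the 0.5 stability constant) are illustrative assumptions, not values taken from the talk or from Liu et al.; the true stability limit depends on the stencil order and the number of spatial dimensions.

```python
def stable_time_step(dx_m, v_max_ms, cfl=0.5):
    """Largest stable time step (s) for grid spacing dx_m and peak velocity v_max_ms.

    cfl is an assumed stability constant, not a value from the talk.
    """
    return cfl * dx_m / v_max_ms


def wavefield_bytes(nx, ny, nz, n_steps, bytes_per_sample=4):
    """Bytes needed to keep every time snapshot of one source wavefield."""
    return nx * ny * nz * n_steps * bytes_per_sample


# Hypothetical survey: 500^3 grid, 10 m spacing, 4000 m/s peak velocity, 4 s record.
dt = stable_time_step(dx_m=10.0, v_max_ms=4000.0)
n_steps = round(4.0 / dt)
size_tb = wavefield_bytes(500, 500, 500, n_steps) / 1e12

print(f"dt = {dt * 1e3:.2f} ms, steps = {n_steps}, storage = {size_tb:.1f} TB per shot")
```

Halving the grid interval in this sketch halves the time step and multiplies the number of grid points by eight, so the stored volume grows sixteen-fold; this is why the two challenges compound each other.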
Resilient Distributed Datasets (RDDs)
● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
o Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators
Zaharia et al., NSDI 2012
RTM via RDDs: Implementation using Spark
● Apache Spark is an implementation of RDDs
● Make use of HDFS or alternative FS
o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
● Choose appropriate programming model(s)
o Not limited to MapReduce
o Iterative and/or interactive (including streaming)
● Manage Spark workloads
o Built-in mode or YARN mode, Mesos
o Univa Universal Resource Broker
after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Implementation using Spark (2)
● Deployable on bare metal … clouds
o Monitoring/management via Bright Cluster Manager
● Introduces analytics possibilities for RTM
o Program in Java (C/C++ via JNA), Scala or Python
● Uptake is significant - rapidly growing community
● Results are extremely impressive
o Exploit CPUs and/or GPUs
after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Opportunities
● Apply RDDs to gathers of seismic data
o Partition RDDs optimally for wavefield calculations
● Apply RDDs to source wavefields
o Partition RDDs optimally for cross-correlation of
forward and reverse time wavefields
 Significantly reduce/eliminate disk I/O
● Investigate alternate imaging conditions
o Machine-learning and/or graph-analytics algorithms
in addition to cross-correlation
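The cross-correlation of forward and reverse-time wavefields mentioned above is commonly written as a zero-lag sum, I(x) = Σ_t S(x, t) · R(x, t). The sketch below expresses that sum as a keyed join followed by a map and a reduce, in plain Python so that it runs without a cluster; in PySpark the corresponding RDD operators would be `join`, `mapValues` and `reduce`. The data layout (snapshots keyed by time-step index and flattened to lists) and the function name `zero_lag_image` are assumptions made for illustration, not the implementation prototyped in the talk.

```python
from functools import reduce


def zero_lag_image(source_snaps, receiver_snaps):
    """Zero-lag cross-correlation: I(x) = sum over t of S(x, t) * R(x, t).

    Both arguments map a time-step index to a flattened snapshot
    (a list with one sample per grid point). Hypothetical layout.
    """
    # "join": pair the snapshots that share a time-step key
    shared = sorted(source_snaps.keys() & receiver_snaps.keys())
    paired = [(source_snaps[t], receiver_snaps[t]) for t in shared]
    # "map": point-wise product of each source/receiver pair
    products = [[s * r for s, r in zip(src, rcv)] for src, rcv in paired]
    # "reduce": accumulate the products over time into a single image
    return reduce(lambda a, b: [x + y for x, y in zip(a, b)], products)


# Toy wavefields: 3 time steps on a 4-point grid
source = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0], 2: [0.0, 0.0, 1.0, 0.0]}
receiver = {0: [0.0, 0.0, 0.0, 0.0], 1: [0.0, 2.0, 0.0, 0.0], 2: [0.0, 0.0, 0.5, 1.0]}

image = zero_lag_image(source, receiver)
print(image)
```

If the source snapshots were held as a persisted RDD and partitioned by the same key as the receiver snapshots, the join step could in principle be served from cluster memory rather than from disk, which is the opportunity the slide points to.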
[Figure: Spark workers managed by a Spark master or by YARN]
http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660
http://ipython.org/notebook.html
Thunder: Initial Impressions
● Written in Spark's Python API (PySpark)
o Makes use of scipy, numpy, and scikit-learn
● IPython Notebook serves as interactive GUI
 Runs in a Web browser
 Notebooks can include text and graphics
 Secure, remote access to an in-cluster IPython
Notebook server
● Includes modular functions for time-series analysis
● Can interface with C/C++ from Python
http://thunder-project.org/
Is there a case for migration?
● In-memory computing via RDDs is promising
o Application to gathers and wavefields
● Spark provides analytics upside
o Imaging conditions other than cross-correlation
● Spark may be applicable to modeling kernels
● Spark can be easily incorporated into pre-existing IT
infrastructures
o Complements existing HPC environments
http://rice2015oghpc.rice.edu/technical-program/
Summary
● Is there a case for migration?
o From: RTM via HPC
o To: RTM via Big Data or ( Big Data and HPC )
● Does it make sense to refactor other HPC
problems as ‘Big Data problems’?
Refactoring HPC with Spark/RDDs …
● Could Spark/RDDs replace MPI?
o Spark has primitives for distributed in-memory
parallel computing … including fault tolerance
Acknowledgements
● M. Zaharia et al. for RDDs
● Communities responsible for Spark, Python & Thunder
● M. Lamarca, P. Labropoulos, D. Shestakov & L.
Gibbons at Bright Computing
Questions?
Ian Lumb
ianlumb@yorku.ca
ian.lumb@brightcomputing.com
Resources
● RTM's scientific context
● Spark support in Bright Cluster Manager for
Apache Hadoop


Editor's notes

  1. From HPCS 2015 abstract: “Ultimately, in Reverse Time Seismic Migration (RTM), the coherence between two wavefields is determined across all depth-common gathers (i.e., source-receiver pairings) of seismic-reflection data. Because coherence between the two wavefields minimizes the impact of artifacts in the imaged section (or volume) arising from complex geological structures (e.g., folds, faults, domes, steeply dipping lithological interfaces), seismic-reflection data processed via RTM most-accurately depicts all reflectors in their actual locations in space and time (e.g., Zhou, Practical Seismic Data Analysis, Cambridge University Press, 2014).”
  2. An actual example illustrating the forward and reverse wavefields plus the migrated image. Source of image indicated.
  3. From the abstract for the HPCS 2015 event: “In the classical approach for RTM, forward modeling involving the three-dimensional wave equation (3D-WEM) results in source wavefields that are computed using the Finite Difference Method (FDM), and then stored to disk. In a subsequent step, and on a per-gather basis, source wavefields are read from disk so that they can be cross-correlated with the backwards-propagated (i.e., time-reversed) wavefields corresponding to the receivers - a step that again requires use of the FDM modeling kernel for the 3D-WEM. The inherent requirement for disk I/O involving multiple TB volumes of seismic-reflection data, during the application of the imaging condition (i.e., the cross-correlation step), results in a performance penalty well known to be highly problematical throughout the petroleum-exploration industry.” Flow chart adaptation based on algorithm detailed by Liu et al., Computers & Geosciences 59 (2013) 17–23.
  4. From https://ianlumb.wordpress.com/2015/04/01/possibilities-for-reverse-time-seismic-migration-rtm-using-apache-spark/: “RTM has a storied history of being performance-challenged. Although the method was originally conceived by geophysicists in the 1980s, it was almost two decades before it became computationally tractable. Considered table stakes in terms of seismic processing by today’s standards, algorithms research for RTM remains highly topical – not just at Rice, York and other universities, but also at the multinational corporations whose very livelihood depends upon the effective and efficient processing of seismic-reflection data. And of particular note are the consistent gains being made since the introduction of GPU programmability via CUDA, as innovative algorithms for RTM can exploit this platform for double-digit speedups.” From the HPCS abstract: “Over the past decade or so, General Purpose Graphics Processing Units (GPGPUs) have been employed to significantly reduce the processing burden of disk I/O in executing RTM. Broadly speaking, in applying RTM’s imaging condition, algorithms have made effective and efficient use of both the memory hierarchy as well as parallel-processing capabilities inherent in GPGPUs. Despite the progress that has been made, particularly in the implementation of algorithms using CUDA for programming GPGPUs, the computational performance of RTM remains an active area of research that continues to engage academics as well as industry.”
  5. JNA = Java Native Access
  6. From the HPCS 2015 abstract: “Recent work has already indicated that seismic reflection data in accepted industry formats can be distributed in memory across a cluster using Apache Spark (Yan et al., “A Spark-based Seismic Data Analytics Cloud”, 2015 Rice Oil & Gas Workshop, Houston, TX, http://rice2015.og-hpc.org/technical-program/).” “... attention here focuses on use of RDDs for facilitating the assessment of coherence between seismic-reflection wavefields in memory. More specifically an algorithm that significantly reduces the impact of disk I/O, in the wavefield manipulations required by RTM, is proposed based on RDDs and subsequently implementation-prototyped using Apache Spark.”
  7. From the HPCS 2015 abstract: “The need to cross-correlate two wavefields in the application of RTM’s imaging condition remains one of two fundamental challenges with use of the method in practice (e.g., Liu et al., Computers & Geosciences 59, 17–23, 2013). In a significant departure from previous approaches, this computational challenge is addressed here through the introduction of Resilient Distributed Datasets (RDDs) for RTM’s precomputed source wavefields. RDDs are a relatively recent abstraction for in-memory computing ideally suited to distributed computing environments like clusters (Zaharia et al., NSDI 2012, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). Originally introduced for Big Data Analytics and popularized (e.g., Lumb, “8 Reasons Apache Spark is So Hot”, insideBIGDATA, http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/, 2015) through the open-source implementation known as Apache Spark (https://spark.apache.org/), RDDs also appear promising in recontextualizing RTM’s imaging condition.”
  8. A schematic of the three-tier solution architecture:
     ● Client tier - Interactive analysis is facilitated through use of an IPython Notebook running remotely in a Web browser. Implemented via Spark’s Python API, Thunder provides classes that include the ability to analyse time series.
     ● App-server tier - Spark itself comprises the bulk of this tier - from its core, to analytics apps (including Thunder), finally to interfaces to a number of external data sources.
     ● Worker tier - Spark workers execute tasks generated in interactive analysis/processing sessions involving use of Thunder.
     Overall, workload is managed by the Spark Master (when Spark runs in a standalone mode), or via Hadoop’s YARN (when YARN mode is in effect).
  9. The app-server tier illustrating Apache Spark and its support for various data sources.
  10. A screenshot from Version 7.1 of Bright Cluster Manager for Apache Hadoop. In this screenshot, it is clear that Bright has deployed Apache Spark in tandem with Apache Hadoop - in other words, Spark makes use of HDFS as its file system, and YARN as its resource negotiator for managing workloads. Bright’s capabilities for monitoring and managing Hadoop, Spark and the physical cluster are enabled through the use of roles that include HDFS and its services, ZooKeeper, YARN as well as Spark.
      Spark support in Version 7.1 of Bright Cluster Manager for Apache Hadoop:
      ● Bright deploys the physical cluster, Hadoop & Spark
        o Includes HDFS, YARN and other data-platform components
      ● Bright monitors and manages the physical cluster, Hadoop and Spark
        o 50 metrics specific to Spark plus 650 for Hadoop that complement 160 metrics for the physical cluster
        o 3 Spark-specific management roles with 15 parameters
      ● Bright manages Spark workloads
        o Standalone mode uses Spark’s built-in capability
        o YARN mode uses Hadoop’s capabilities
      ● Bright manages Spark with or without Hadoop
        o Spark can make use of HDFS
        o Bright Computing currently investigating HDFS alternatives – e.g., Ceph and Lustre
      ● Bright supports Hadoop and Spark
        o Includes monthly updates to ensure the platform is maintained
        o Technical support plus product documentation
  11. In the client tier, an IPython Notebook serves as the GUI for time-series analysis via Spark-enabled Thunder.
  12. Notes:
      ● Running an IPython Notebook server - enables remote access via a Web browser to a Spark cluster
      ● Interfacing with C/C++ - see, e.g., https://scipy-lectures.github.io/advanced/interfacing_with_c/interfacing_with_c.html
  13. Thunder’s method for cross-correlation within its class for time-series analysis.
  14. This slide was first presented at the 2015 Rice University Oil and Gas HPC Workshop in March 2015.