Reverse Time Migration via Resilient Distributed Datasets: Towards In-Memory Coherence of Seismic-Reflection Wavefields using Apache Spark

Ian Lumb
HPCS 2015 - Montreal
http://hpcs.ca

Outline
● The challenges and opportunities of RTM
● Refactoring RTM with Spark/RDDs
o Spark’ing coherence between wavefields
● Summary

http://www.acceleware.com/technical-papers

Motivation
● RTM is performance-challenged
o Algorithms research remains topical
▪ GPUs responsible for compelling results
● Revisit RTM as a ‘Big Data problem’
o In-memory analytics has the potential to
▪ Improve performance of data and wavefield
manipulations in concert with computations
▪ Introduce new prospects for imaging conditions

Key Performance Challenges
● RTM modeling kernel is compute intensive
o Stable, non-dispersive solution via FDM requires
▪ Small time steps and small grid intervals
▪ Higher-order approximations of the spatial derivatives
● RTM wavefields exceed memory capacity
o Multiple-TB source volumes must be stored to disk
e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
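
The constraints on this slide can be made concrete with a short calculation. The sketch below is illustrative and not taken from the talk. It assumes an explicit scheme that is second order in time on a uniform grid, the commonly quoted stability bound dt <= 2 dx / (v_max sqrt(D * sum|w|)), where w are the weights of the 1-D second-derivative stencil and D is the number of dimensions, and a rule-of-thumb number of grid points per shortest wavelength. The velocities, frequency, grid dimensions and function names are all hypothetical.

```python
import math

# Weights of the 1-D second-derivative stencils (standard central differences).
SECOND_ORDER = [1.0, -2.0, 1.0]
FOURTH_ORDER = [-1.0 / 12, 4.0 / 3, -5.0 / 2, 4.0 / 3, -1.0 / 12]


def max_grid_interval(v_min, f_max, points_per_wavelength):
    """Largest grid interval (m) that samples the shortest wavelength
    with the requested number of points (a rule of thumb, not a proof)."""
    return v_min / (f_max * points_per_wavelength)


def max_stable_dt(dx, v_max, stencil_weights, ndim=3):
    """Upper bound on the time step (s) for an explicit scheme that is
    second order in time, under the assumed stability bound."""
    weight_sum = ndim * sum(abs(w) for w in stencil_weights)
    return 2.0 * dx / (v_max * math.sqrt(weight_sum))


def wavefield_bytes(nx, ny, nz, nt, bytes_per_sample=4):
    """Storage needed to keep every time step of one source wavefield."""
    return nx * ny * nz * nt * bytes_per_sample


# Hypothetical survey: 1500-4500 m/s velocities, 30 Hz maximum frequency.
dx = max_grid_interval(v_min=1500.0, f_max=30.0, points_per_wavelength=5)
dt = max_stable_dt(dx, v_max=4500.0, stencil_weights=FOURTH_ORDER)
size_tb = wavefield_bytes(nx=1000, ny=1000, nz=500, nt=4000) / 1e12

print("dx = %.1f m, dt <= %.2f ms, wavefield = %.1f TB" % (dx, dt * 1e3, size_tb))
```

With these assumed values the grid interval is 10 m, the time step is limited to roughly a millisecond, and a single source wavefield occupies 8 TB in single precision, which is consistent with the multiple-TB volumes noted on the slide.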

Resilient Distributed Datasets (RDDs)
● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
o Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators
Zaharia et al., NSDI 2012
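
The properties listed on this slide map onto a handful of PySpark calls. A minimal sketch, assuming an existing `SparkContext` is passed in as `sc`; the function names and the trace-energy computation are hypothetical and only stand in for a real per-trace operation.

```python
def trace_energy(trace):
    """Sum of squared amplitudes of one seismic trace (a list of floats)."""
    return sum(a * a for a in trace)


def total_energy(sc, traces, num_partitions=4):
    """Compute the total energy of all traces with RDD operators.

    sc     : an existing pyspark.SparkContext (assumed, supplied by the caller)
    traces : a local list of traces, used here in place of a real data source
    """
    # Partitioned, fault-tolerant dataset spread across the cluster.
    rdd = sc.parallelize(traces, num_partitions)
    # Transformation (lazy); cache() asks Spark to keep the result in memory.
    energies = rdd.map(trace_energy).cache()
    # Action: triggers the computation and returns a value to the driver.
    return energies.reduce(lambda a, b: a + b)
```

Because `trace_energy` is an ordinary Python function, it can be checked locally before it is handed to `map` on a cluster.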

RTM via RDDs: Implementation using Spark
● Apache Spark is an implementation of RDDs
● Make use of HDFS or alternative FS
o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
● Choose appropriate programming model(s)
o Not limited to MapReduce
o Iterative and/or interactive (including streaming)
● Manage Spark workloads
o Built-in (standalone) mode, YARN or Mesos
o Univa Universal Resource Broker
after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
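
As an illustration of reading seismic data from HDFS or an alternative file system, the sketch below loads fixed-length binary records into an RDD with `SparkContext.binaryRecords`. It is not the talk's implementation. It assumes each record is one trace stored as raw big-endian 32-bit IEEE floats with no headers, which real formats such as SEG-Y do not satisfy without preprocessing. The record length, function names and path are hypothetical.

```python
import struct

SAMPLES_PER_TRACE = 4                  # hypothetical; real traces hold thousands of samples
RECORD_BYTES = 4 * SAMPLES_PER_TRACE   # 4 bytes per float32 sample


def decode_trace(record):
    """Unpack one fixed-length record of big-endian float32 samples."""
    n_samples = len(record) // 4
    return list(struct.unpack(">%df" % n_samples, record))


def load_traces(sc, path):
    """Return an RDD of decoded traces.

    sc   : an existing pyspark.SparkContext (assumed, supplied by the caller)
    path : a URI understood by the cluster's file system layer,
           e.g. an hdfs:// location (hypothetical)
    """
    return sc.binaryRecords(path, RECORD_BYTES).map(decode_trace)
```

Only the URI scheme changes when the data sits on a different file system, provided the cluster is configured with the matching connector.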

RTM via RDDs: Implementation using Spark (2)
● Deployable on bare metal … clouds
o Monitoring/management via Bright Cluster Manager
● Introduces analytics possibilities for RTM
o Program in Java (C/C++ via JNA), Scala or Python
● Uptake is significant - rapidly growing community
● Results are extremely impressive
o Exploit CPUs and/or GPUs
after Lumb, insideBIGDATA http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/

RTM via RDDs: Opportunities
● Apply RDDs to gathers of seismic data
o Partition RDDs optimally for wavefield calculations
● Apply RDDs to source wavefields
o Partition RDDs optimally for cross-correlation of
forward and reverse time wavefields
▪ Significantly reduce/eliminate disk I/O
● Investigate alternate imaging conditions
o Machine-learning and/or graph-analytics algorithms
in addition to cross-correlation
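
The cross-correlation opportunity above can be sketched as follows. This is a minimal illustration, assuming the zero-lag cross-correlation imaging condition (the image at each grid point is the sum over time of the product of the source and receiver wavefields) and snapshots stored as flat lists keyed by time step. All function names are hypothetical, and the Spark variant assumes both wavefields are already available as RDDs of `(time_step, snapshot)` pairs.

```python
from functools import reduce


def correlate_step(source_snapshot, receiver_snapshot):
    """Point-wise product of two snapshots taken at the same time step."""
    return [s * r for s, r in zip(source_snapshot, receiver_snapshot)]


def add_images(a, b):
    """Point-wise sum of two partial images."""
    return [x + y for x, y in zip(a, b)]


def image_local(source, receiver):
    """Reference version: source and receiver map time step -> snapshot."""
    products = (correlate_step(source[t], receiver[t]) for t in sorted(source))
    return reduce(add_images, products)


def image_rdd(source_rdd, receiver_rdd):
    """Same computation with RDD operators.

    Both arguments are assumed to be pair RDDs of (time_step, snapshot).
    """
    return (source_rdd.join(receiver_rdd)   # (t, (source, receiver))
            .values()
            .map(lambda pair: correlate_step(pair[0], pair[1]))
            .reduce(add_images))


# Two grid points, two time steps (made-up numbers).
source = {0: [1.0, 2.0], 1: [0.5, 0.0]}
receiver = {0: [2.0, 1.0], 1: [4.0, 3.0]}
print(image_local(source, receiver))  # [4.0, 2.0]
```

If both RDDs are partitioned by time step with the same partitioner and kept in memory, the join can avoid a shuffle, which is the sense in which disk I/O might be reduced.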

[Figure: Spark cluster architecture, showing a Spark (or YARN) master and Spark workers]

http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660

http://ipython.org/notebook.html

Thunder: Initial Impressions
● Written against Spark's Python API (PySpark)
o Makes use of SciPy, NumPy and scikit-learn
● IPython Notebook serves as interactive GUI
▪ Runs in a Web browser
▪ Notebooks can include text and graphics
▪ Secure, remote access to an in-cluster IPython
Notebook server
● Includes modular functions for time-series analysis
● Can interface with C/C++ from Python
http://thunder-project.org/

Is there a case for migration?
● In-memory computing via RDDs is promising
o Application to gathers and wavefields
● Spark provides analytics upside
o Imaging conditions other than cross-correlation
● Spark may be applicable to modeling kernels
● Spark can be easily incorporated into pre-existing IT
infrastructures
o Complements existing HPC environments
http://rice2015oghpc.rice.edu/technical-program/

Summary
● Is there a case for migration?
o From: RTM via HPC
o To: RTM via Big Data or (Big Data and HPC)
● Does it make sense to refactor other HPC
problems as ‘Big Data problems’?

Refactoring HPC with Spark/RDDs …
● Could Spark/RDDs replace MPI?
o Spark has primitives for distributed in-memory
parallel computing … including fault tolerance

Acknowledgements
● M. Zaharia et al. for RDDs
● Communities responsible for Spark, Python & Thunder
● M. Lamarca, P. Labropoulos, D. Shestakov & L.
Gibbons at Bright Computing

Questions?
Ian Lumb
ianlumb@yorku.ca
ian.lumb@brightcomputing.com

Resources
● RTM's scientific context
● Spark support in Bright Cluster Manager for
Apache Hadoop
