1. Rise of the Scientific Database
John A. De Goes, @jdegoes
2. Agenda
• Scientific Computing & Databases
• Blessing / Curse of the RDBMS
• Power of the Array
• Scientific Databases
• Hadoop
• Summary & Conclusions
3. What is Scientific Computing?
"Scientific computing is concerned with
constructing mathematical models and
quantitative analysis techniques and using
computers to analyze and solve scientific
problems."
—Wikipedia
4. [Timeline figure: milestones in scientific computing, 1940's through the 2010's, then "The Future" (marked "???"). The placement of entries by decade was lost in extraction.]
• Methods: Finite differences; Finite difference for PDEs; Finite element methods; Numeric linear algebra; Linear programming; Monte Carlo; Gradient methods; Conjugate gradient; Iterative methods; Stable SVD algorithms; Stable pseudoinverses; FFT; Poisson solvers; Large-scale eigenvalue solvers; Modern numerical linear algebra
• Languages, libraries, and systems: Fortran; APL invented; SAS released; SPSS; LINPACK; LAPACK; MATLAB; Mathematica; J; Python; GNU Octave; SciLab; PDL; NumPy; SciPy; Rasdaman; BrookGPU; CUDA; OpenCL; Hadoop; Mahout; HPCC; MonetDB / SciQL; SciDB; Spark; MLBase; Julia
5. What is a Database?
"A technology that combines the ability to
store data with a high-level, high-
performance means of storing, retrieving,
and manipulating that data without having
to write code or have knowledge of the
mechanisms of implementation."
6. [Timeline figure: database technology, 1960's through the 2010's, then "The Future" (marked "???"). The placement of entries by decade was lost in extraction.]
• Entries: Relational Model; CODASYL; IMS; SABRE; Ingres (QUEL); System R (SEQUEL); SQL/DBS; DBS2; DB2; Oracle; "RDBMS"; SQL wins; DBase; SQL Server; ODBMS; MySQL; PostgreSQL; MongoDB; CouchDB; Riak; Neo4j; Other solutions; Julia; Spark; MLBase; SciDB; MonetDB / SciQL
7. The Relationship between Scientific Computing & Databases
[Diagram labels: Scientific Computing, Scientific Databases, Data Analysis]
9. Relational Algebra
• Projection: π
• Selection: σ
• Rename: ρ
• Natural join: R ⋈ S
• Semijoin: R ⋉ S
• Antijoin: R ▷ S
• Division: R ÷ S
• Theta join: R ⋈θ S
• Left outer join: R ⟕ S
• Right outer join: R ⟖ S
• Full outer join: R ⟗ S
• Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)
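As a rough illustration of a few of these operators, the sketch below models a relation as a list of Python dicts (one dict per tuple). The function names and the sample relations are hypothetical and chosen only for this example; they are not taken from any database engine.

```python
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """Projection: keep only the named attributes, dropping duplicate tuples."""
    seen, out = set(), []
    for t in relation:
        key = tuple((a, t[a]) for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(key))
    return out


def natural_join(r, s):
    """Natural join: pair tuples that agree on every shared attribute."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[a] == u[a] for a in shared):
                out.append({**t, **u})
    return out


# Hypothetical sample relations.
R = [{"name": "Ann", "dept": 1}, {"name": "Bob", "dept": 2}]
S = [{"dept": 1, "title": "Physics"}]

joined = natural_join(R, S)
depts = project(R, ["dept"])
in_dept_2 = select(R, lambda t: t["dept"] == 2)
```

Every operator takes relations and returns a relation, which is what lets queries compose. Nothing here assumes any ordering or adjacency among tuples.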
11. The Curse of RDBMS
[Diagram labels: Sets, Tuples, Arrays; rows, columns]
12. The Power of the Array
• Linear Algebra
• Transforms (Fourier, wavelet, etc.)
• Spatial Analysis
• Temporal Analysis
• Etc.
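Two of the categories above can be sketched in a few lines when data is held in an ordered array. The example is a minimal illustration using only the Python standard library; the function names are hypothetical, and the naive O(n²) transform stands in for a real FFT library.

```python
import cmath


def moving_average(series, window):
    """Temporal analysis: mean of each run of `window` adjacent samples."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def dft(samples):
    """Transform: naive O(n^2) discrete Fourier transform."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))
            for k in range(n)]


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
spectrum = dft([1, 0, 0, 0])  # a unit impulse has a flat spectrum
```

Both functions depend on element position: the moving average reads adjacent indices and the transform weights each sample by its index. An unordered set of tuples does not provide that directly.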
13. Poor Man’s Arrays
SELECT X.row AS row, Y.col AS col,
       SUM(X.value * Y.value) AS value
FROM X, Y
WHERE X.col = Y.row
GROUP BY X.row, Y.col
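To see the query behave as a matrix product, the sketch below loads two small matrices into an in-memory SQLite database as (row, col, value) triples and runs the join with the condition X.col = Y.row. SQLite, the helper names, and the sample matrices are assumptions made for this illustration; the identifier "row" is quoted defensively in case the engine treats it as a keyword, and an ORDER BY is added so the result is deterministic.

```python
import sqlite3


def matmul_sql(x, y):
    """Multiply two matrices stored as (row, col, value) triples using SQL."""
    conn = sqlite3.connect(":memory:")
    for name, matrix in (("X", x), ("Y", y)):
        conn.execute(
            f'CREATE TABLE {name} ("row" INTEGER, col INTEGER, value REAL)')
        triples = [(i, j, v)
                   for i, r in enumerate(matrix)
                   for j, v in enumerate(r)]
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", triples)
    rows = conn.execute(
        'SELECT X."row" AS "row", Y.col AS col, '
        "SUM(X.value * Y.value) AS value "
        'FROM X, Y WHERE X.col = Y."row" '
        'GROUP BY X."row", Y.col '
        'ORDER BY X."row", Y.col'  # ordering added for a stable result
    ).fetchall()
    conn.close()
    return rows


def matmul_array(x, y):
    """The same product written directly against nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
sql_result = matmul_sql(X, Y)
array_result = matmul_array(X, Y)
```

The SQL version has to encode positions as explicit columns and recover the product through a join and a grouped sum. The array version expresses the same computation through positions alone.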
14. Poor Man’s Arrays
SELECT A.name, A.sales, SUM(B.sales) AS running_total
FROM Sales AS A, Sales AS B
WHERE A.sales < B.sales
   OR (A.sales = B.sales AND A.name = B.name)
GROUP BY A.name, A.sales
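The self-join can be compared with an ordered-array formulation. The sketch below runs the query in an in-memory SQLite database and computes the same running total with a sort followed by a cumulative sum. SQLite, the sample rows, and the helper names are assumptions for illustration; the sample sales values are distinct, and an ORDER BY is added only so the two results line up.

```python
import sqlite3
from itertools import accumulate

# Hypothetical sample data with distinct sales values.
SALES = [("John", 10), ("Jennifer", 15), ("Stella", 20),
         ("Sophia", 40), ("Greg", 50)]


def running_total_sql(sales):
    """Running total via a self-join: each row sums all rows ranked at or above it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sales (name TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO Sales VALUES (?, ?)", sales)
    rows = conn.execute(
        "SELECT A.name, A.sales, SUM(B.sales) AS running_total "
        "FROM Sales AS A, Sales AS B "
        "WHERE A.sales < B.sales "
        "   OR (A.sales = B.sales AND A.name = B.name) "
        "GROUP BY A.name, A.sales "
        "ORDER BY A.sales DESC"  # ordering added for comparison
    ).fetchall()
    conn.close()
    return rows


def running_total_array(sales):
    """Running total via an ordered array: sort once, then accumulate."""
    ordered = sorted(sales, key=lambda s: s[1], reverse=True)
    totals = accumulate(value for _, value in ordered)
    return [(name, value, total)
            for (name, value), total in zip(ordered, totals)]


sql_totals = running_total_sql(SALES)
array_totals = running_total_array(SALES)
```

The self-join compares every row with every other row, so its work grows quadratically with the table size unless the engine optimizes it. The array version needs one sort and one pass.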
16. What is a Scientific Database?
• First-class support for multidimensional arrays
    • Creation
    • Manipulation
    • Composition
• Capable of expressing whole analyses, not just snippets
• Tremendous benefits across multiple dimensions
    • Scalability & Performance
    • Expressiveness & Usability
    • Robustness & Accuracy
17. Array Algebra
• Many different approaches (NRCA, SciQL, AFL, ODMG, etc.)
• Possible to define as extensions to relational core (but not necessary)
• Most approaches share common core
    • Array deconstruction
    • Array construction
    • Array reduction
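One possible reading of the three core operations is sketched below, with an array modeled as a Python dict from index tuples to values. The interpretation and the function names are assumptions made for this illustration and do not reproduce the syntax or semantics of NRCA, SciQL, AFL, or ODMG.

```python
from itertools import product


def construct(shape, f):
    """Array construction: tabulate f over every index in the given shape."""
    return {idx: f(*idx) for idx in product(*(range(n) for n in shape))}


def deconstruct(array):
    """Array deconstruction: expose the array as sorted (index, value) pairs."""
    return sorted(array.items())


def reduce_axis(array, axis, combine, initial):
    """Array reduction: collapse one dimension with a binary operator."""
    out = {}
    for idx, value in array.items():
        key = idx[:axis] + idx[axis + 1:]  # drop the reduced dimension
        out[key] = combine(out.get(key, initial), value)
    return out


a = construct((2, 3), lambda i, j: 10 * i + j)
pairs = deconstruct(a)
row_sums = reduce_axis(a, axis=1, combine=lambda x, y: x + y, initial=0)
col_sums = reduce_axis(a, axis=0, combine=lambda x, y: x + y, initial=0)
```

Under this reading, deconstruction yields a set of (index, value) tuples, which is one way an array layer can be defined as an extension to a relational core.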
19. What About Hadoop?
• Commonly used in scientific computing
• No scientific database technology
• But many useful programming libraries
    • Hama
    • Mahout
    • Cascading
• Hadoop doesn’t make it easy
    • YARN should help (Tez?)
    • Balancing needs help
• Not the only game in town anymore (BDAS, MPI-2, HPCC, etc.)
20. Conclusions
• Scientific computing can benefit from a scientific database
• Success of RDBMS was also a curse
• NoSQL and big data are catalysts for disruption
• Still early for scientific databases
• Hadoop loves/hates science
21. Resources
SciDB / Array Functional Language
http://bit.ly/VdXJkA
Rasdaman / rasql
http://en.wikipedia.org/wiki/Rasdaman
MonetDB / SciQL
http://monetdb.org
Precog / Quirrel
http://precog.com
Query Language for Multidimensional Arrays: Design, Implementation, &
Optimization Techniques