SlideShare une entreprise Scribd logo
1  sur  51
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   1	
  
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/nasa-big-data
http://www.infoq.com/presentati
ons/nasa-big-data
http://www.infoq.com/presentati
ons/nasa-big-data
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
A	
  Research	
  Agenda	
  and	
  Vision	
  
for	
  Big	
  Data	
  at	
  NASA	
  
	
  Chris	
  Ma=mann	
  
Chief	
  Architect,	
  Instrument	
  and	
  Science	
  Data	
  Systems,	
  	
  
Jet	
  Propulsion	
  Laboratory,	
  California	
  Ins=tute	
  of	
  Technology	
  
Adjunct	
  Associate	
  Professor,	
  USC	
  
Director,	
  Apache	
  SoBware	
  Founda=on	
  
	
  
	
  
Agenda	
  
•  Big	
  Data	
  –	
  JPL’s	
  IniEaEve	
  	
  
•  Some	
  Big	
  Data	
  Technologies	
  from	
  the	
  Apache	
  
SoJware	
  FoundaEon	
  
•  JPL’s	
  Big	
  Data:	
  ASO,	
  RCMES,	
  SKA,	
  V-­‐FASTR	
  
•  Rapid	
  Algorithm	
  IntegraEon,	
  Smart	
  Data	
  Movement,	
  
Transient	
  Archives,	
  Automated	
  text/metadata	
  extracEon	
  
and	
  MIME	
  idenEficaEon	
  
•  Big	
  Data	
  Vision	
  and	
  Wrapup	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   3	
  
And	
  you	
  are?	
  
	
  
•  Apache	
  Board	
  of	
  Directors	
  involved	
  in	
  
–  OODT	
  (VP,	
  PMC),	
  Tika	
  (PMC),	
  Nutch	
  (PMC),	
  Incubator	
  
(PMC),	
  SIS	
  (PMC),	
  Gora	
  (PMC),	
  Airavata	
  (PMC)	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   4	
  
•  Chief	
  Architect	
  at	
  NASA	
  
JPL	
  in	
  Pasadena,	
  CA	
  USA	
  
•  SoJware	
  Architecture/
Engineering	
  Prof	
  at	
  Univ.	
  
of	
  Southern	
  California	
  	
  
•  One	
  of	
  original	
  PMC	
  
members	
  for	
  Apache	
  
Nutch	
  
–  predecessor	
  to	
  Hadoop	
  
Some	
  “Big	
  Data”	
  Grand	
  Challenges	
  I’m	
  
interested	
  in	
  
•  How	
  do	
  we	
  handle	
  700	
  TB/sec	
  of	
  data	
  coming	
  off	
  the	
  wire	
  when	
  we	
  
actually	
  have	
  to	
  keep	
  it	
  around?	
  
–  Required	
  by	
  the	
  Square	
  Kilometre	
  Array	
  
•  Joe	
  scien=st	
  says	
  I’ve	
  got	
  an	
  IDL	
  or	
  Matlab	
  algorithm	
  that	
  I	
  will	
  not	
  
change	
  and	
  I	
  need	
  to	
  run	
  it	
  on	
  10	
  years	
  of	
  data	
  from	
  the	
  Colorado	
  
River	
  Basin	
  and	
  store	
  and	
  disseminate	
  the	
  output	
  products	
  
–  Required	
  by	
  the	
  Western	
  Snow	
  Hydrology	
  project	
  
•  How	
  do	
  we	
  compare	
  petabytes	
  of	
  climate	
  model	
  output	
  data	
  in	
  a	
  
variety	
  of	
  formats	
  (HDF,	
  NetCDF,	
  Grib,	
  etc.)	
  with	
  petabytes	
  of	
  remote	
  
sensing	
  data	
  to	
  improve	
  climate	
  models	
  for	
  the	
  next	
  IPCC	
  assessment?	
  
–  Required	
  by	
  the	
  5th	
  IPCC	
  assessment	
  and	
  the	
  Earth	
  System	
  Grid	
  and	
  NASA	
  
•  How	
  do	
  we	
  catalog	
  all	
  of	
  NASA’s	
  current	
  planetary	
  science	
  data?	
  
–  Required	
  by	
  the	
  NASA	
  Planetary	
  Data	
  System	
  
Image	
  Credit:	
  h=p://www.jpl.nasa.gov/news/news.cfm?release=2011-­‐295	
  
Copyright	
  2012.	
  Jet	
  Propulsion	
  Laboratory,	
  California	
  InsEtute	
  of	
  Technology.	
  US	
  
Government	
  Sponsorship	
  Acknowledged.	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   5	
  
Task Title PI Section
1 Power Minimization in Signal
Processing for Data-Intensive Science
Larry
D’Addario
335
2 Machine Learning for Smart Triage of
Big Data
Kiri
Wagstaff
388
3 Archiving, Processing and
Dissemination for the Big Date Era
Chris
Mattmann
388
4 Knowledge driven Automated Movie
Production Environment distribution and
Display (AMPED) Pipeline
Eric
De Jong
3223
Big Data Strategic Initiative
•  Apply lower-efficient digital architectures to future
JPL flight instrument developments and proposals.
•  Expand and promote JPL expertise with machine
learning algorithm development for real-time triage.
•  Utilize intelligent anomaly classification algorithms
in other fields, including data-intensive industry.
•  Build on JPL investments in large data archive
systems to capture role in future science facilities.
•  Enhance the efficiency and impact of JPL’s data
visualization and knowledge extraction programs.
Initiative Leader: Dayton Jones
Steering Committee Leader: Robert Preston
Initial Major Milestones for FY13 Date
Report on end-to-end power optimization of instruments Jun 2013
Hierarchical classification method for VAST and ChemCam Jan 2013
Demonstrate smart compression for Hyperion and CRISM Mar 2013
Cloud computing research and scalability experiments Feb 2013
Data formats and text, metadata extraction in big data sys. Aug 2013
Develop AMPED pipeline and install in VIP Center Dec 2012
Future Opportunities: Mission and instrument competitions, data-
intensive industries, LSST, future radio observatories.
JPL Concept: Big data technology for data triage, archiving, etc.
Key Challenges this work enables: Broaden JPL business base
(relevant to 1X, 3X, 4X, 7X, 8X, 9X Directorates)
Initiative Long Term Objectives
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   6	
  
Credit:	
  Dayton	
  Jones	
  
Recent pub highlights
7
•  Nature	
  magazine	
  
piece	
  on	
  “A	
  Vision	
  
for	
  Data	
  Science”	
  in	
  
Jan.	
  24th	
  issue	
  
– Big	
  Data	
  IniEaEve	
  
highlighted	
  
•  Outline	
  algorithm	
  
integra=on	
  (regridding,	
  
metrics);	
  automa=c	
  
understanding	
  of	
  data	
  
metadata	
  formats	
  and	
  
open	
  source	
  as	
  “key	
  issues”	
  	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
  
Data Science/Big Data progress
•  Named to Editorial Board of Springer Journal of Big Data
•  Helping to define USC’s
M.S. in Data Science
program
•  Won/Submitted several Big Data
proposals for direct funding
–  DARPA Open Source Program Office XDATA
–  NSF Major Research Instrumentation (RAPID)
–  NSF Polar Cyberinfrastructure, NSF EarthCube (both via USC)
–  President’s/Director’s Fund for Cosmic Dawn/OVRO
–  National Science Foundation: High Performance Computing System
Acquisition (submitted)
87-­‐Mar-­‐14	
   BigData-­‐QCon	
  
Where	
  do	
  Big	
  Data	
  technologies	
  	
  
fit	
  into	
  this?	
  
U.S.	
  NaEonal	
  Climate	
  Assessment	
  
(pic	
  credit:	
  Dr.	
  Tom	
  Painter)	
  
SKA	
  South	
  Africa:	
  Square	
  Kilometre	
  Array	
  
(pic	
  credit:	
  Dr.	
  Jasper	
  Horrell,	
  Simon	
  Ratcliffe	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   9	
  
The Apache Software
Foundation
•  Largest open source
software development
entity in the world
–  Over 2600+ committers
–  Over 4200+ contributors
–  Over 400+ members
•  100+ Top Level Projects
–  57 Incubating
–  32 Lab Projects
•  12 retired projects in the “Attic”
•  Over 1.2 million revisions
•  501(c)3 non-profit organization incorporated in Delaware
- Over 10M successful requests
served a day across the world
- HTTPD web server used on
100+ million web sites (52+%
of the market)
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   10	
  
Apache	
  Maturity	
  Model	
  
•  Start	
  out	
  
with	
  	
  
IncubaEon	
  
•  Grow	
  	
  
community	
  
•  Make	
  	
  
releases	
  
•  Gain	
  interest	
  
•  Diversify	
  
•  When	
  the	
  project	
  is	
  ready,	
  graduate	
  into	
  
–  Top-­‐Level	
  Project	
  (TLP)	
  
–  Sub-­‐project	
  of	
  TLP	
  
•  Increasingly,	
  Sub-­‐projects	
  are	
  discouraged	
  compared	
  to	
  TLPs	
  
11	
  BigData-­‐QCon	
  7-­‐Mar-­‐14	
  
•  Apache	
  is	
  a	
  meritocracy	
  
–  You	
  earn	
  your	
  keep	
  and	
  your	
  	
  
credenEals	
  
•  Start	
  out	
  as	
  Contributor	
  
–  Patches,	
  mailing	
  list	
  comments,	
  etc.	
  
–  No	
  commit	
  access	
  
•  Move	
  onto	
  Commier	
  
–  Commit	
  access,	
  evolve	
  the	
  code	
  
•  PMC	
  Members	
  
–  Have	
  binding	
  VOTEs	
  on	
  releases/personnel	
  
•  Officer	
  (VP,	
  Project)	
  
–  PMC	
  Chair	
  	
  
•  ASF	
  Member	
  
–  Have	
  binding	
  VOTE	
  in	
  the	
  state	
  of	
  the	
  foundaEon	
  
–  Elect	
  Board	
  of	
  Directors	
  
•  Director	
  
–  Oversight	
  of	
  projects,	
  foundaEon	
  acEviEes	
  
12	
  BigData-­‐QCon	
  7-­‐Mar-­‐14	
  
Apache	
  OrganizaEon	
  
Apache OODT
•  Entered “incubation” at the Apache
Software Foundation in 2010
•  Selected as a top level Apache Software
Foundation project in January 2011
•  Developed by a community of participants
from many companies, universities, and
organizations
•  Used for a diverse set of science data
system activities in planetary science,
earth science, radio astronomy,
biomedicine, astrophysics, and more
OODT Development & user community includes:
http://oodt.apache.org
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   13	
  
Apache	
  OODT:	
  OSS	
  “big	
  data”	
  pla?orm	
  
originally	
  pioneered	
  at	
  NASA	
  
•  OODT is meant to be a set of tools to help build data systems
–  It’s not meant to be “turn key”
–  It attempts to exploit the boundary between bringing in capability vs.
being overly rigid in science
–  Each discipline/project extends
•  Projects	
  that	
  are	
  deploying	
  it	
  operaEonally	
  at	
  
–  Decadal-­‐survey	
  recommended	
  NASA	
  Earth	
  science	
  	
  missions,	
  NIH,	
  and	
  NCI,	
  
CHLA,	
  USC,	
  South	
  African	
  SKA	
  project	
  
•  Why	
  Apache?	
  
–  Less than 100 projects have been promoted to top level (Apache Web
Server, Tomcat, Solr, Hadoop)
–  Differs from other open source communities; it provides a governance
and management structure
Copyright	
  2012.	
  Jet	
  Propulsion	
  Laboratory,	
  California	
  
InsEtute	
  of	
  Technology.	
  US	
  Government	
  Sponsorship	
  
Acknowledged.	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   14	
  
7-­‐Mar-­‐14	
   15	
  
•  All	
  Core	
  components	
  implemented	
  as	
  web	
  services	
  
–  XML-­‐RPC	
  used	
  to	
  communicate	
  between	
  components	
  
–  Servers	
  implemented	
  in	
  Java	
  
–  Clients	
  implemented	
  in	
  Java,	
  scripts,	
  Python,	
  	
  PHP	
  and	
  web-­‐apps	
  
–  Service	
  configuraEon	
  implemented	
  in	
  ASCII	
  and	
  XML	
  files	
  	
  
	
  
	
  
OODT Core Components
BigData-­‐QCon	
  
Why Apache and OODT?
•  OODT is meant to be a set of tools to
help build data systems
–  It’s not meant to be “turn key”
–  It attempts to exploit the boundary
between bringing in capability vs.
being overly rigid in science
–  Each discipline/project extends
•  Apache is the elite open source
community for software developers
–  Less than 100 projects have been
promoted to top level (Apache Web
Server, Tomcat, Solr, Hadoop)
–  Differs from other open source
communities; it provides a
governance and management
structure
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   16	
  
Governance	
  Model+NASA=♥	
  
•  NASA	
  and	
  other	
  government	
  	
  
agencies	
  have	
  tons	
  of	
  process	
  
– They	
  like	
  that	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   17	
  
Apache	
  Open	
  Climate	
  
Workbench..OCW	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   18	
  
The Information Landscape
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   19	
  
ProliferaEon	
  of	
  content	
  types	
  available	
  
•  By some accounts, 16K to 51K content
types*
•  What to do with content types?
–  Parse them
•  How?
•  Extract their text and structure
–  Index their metadata
•  In an indexing technology like Lucene, Solr, or in
Google Appliance
–  Identify what language they belong to
•  Ngrams
*h=p://filext.com/	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   20	
  
is…
•  A	
  content	
  analysis	
  and	
  detecEon	
  toolkit	
  
•  A	
  set	
  of	
  Java	
  APIs	
  providing	
  MIME	
  type	
  detecEon,	
  
language	
  idenEficaEon,	
  integraEon	
  of	
  various	
  
parsing	
  libraries	
  
•  A	
  rich	
  Metadata	
  API	
  for	
  represenEng	
  different	
  
Metadata	
  models	
  
•  A	
  command	
  line	
  interface	
  to	
  the	
  underlying	
  Java	
  
code	
  
•  A	
  GUI	
  interface	
  to	
  the	
  Java	
  code	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   21	
  
Science Data File Formats
•  Hierarchical Data Format (HDF)
–  http://www.hdfgroup.org
–  Versions 4 and 5
–  Lots of NASA data is in 4, newer NASA data in 5
–  Encapsulates
•  Observation (Scalars, Vectors, Matrices, NxMxZ…)
•  Metadata (Summary info, date/time ranges, spatial
ranges)
–  Custom readers/writers/APIs in many languages
•  C/C++, Python, Java
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   22	
  
Science Data File Formats
•  network Common Data Form (netCDF)
–  www.unidata.ucar.edu/software/netcdf/
–  Versions 3 and 4
–  Heavily used in DOE, NOAA, etc.
–  Encapsulates
•  Observation (Scalars, Vectors, Matrices, NxMxZ…)
•  Metadata (Summary info, date/time ranges, spatial
ranges)
–  Custom readers/writers/APIs in many languages
•  C/C++, Python, Java
–  Not Hierarchical representation: all flat
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   23	
  
OODT + Tika integrations
Metadata
Ext: TIKA!	

Metadata
Ext: TIKA!	

MIME
identification:
TIKA!	

MIME
identification:
TIKA!	

7-­‐Mar-­‐14	
   BigData-­‐QCon	
   24	
  
OODT + Tika integrations
Metadata
Ext: TIKA!	

MIME
identification:
TIKA!	

MIME
identification:
TIKA!	

7-­‐Mar-­‐14	
   BigData-­‐QCon	
   25	
  
Now	
  some	
  specific	
  NASA/JPL	
  
project	
  examples	
  
RCMES2.0	
  
(http://rcmes.jpl.nasa.gov)	
  
Raw	
  Data:	
  
Various	
  sources,	
  
formats,	
  
ResoluEons,	
  
Coverage	
  
RCMED	
  
(Regional	
  Climate	
  Model	
  EvaluaEon	
  Database)	
  
A	
  large	
  scalable	
  database	
  to	
  store	
  data	
  from	
  variety	
  
of	
  sources	
  in	
  a	
  common	
  format	
  
RCMET	
  
(Regional	
  Climate	
  Model	
  EvaluaEon	
  Tool)	
  
A	
  library	
  of	
  codes	
  for	
  extracEng	
  data	
  from	
  
RCMED	
  and	
  model	
  and	
  for	
  calculaEng	
  
evaluaEon	
  metrics	
  
Metadata	
  
Data	
  Table	
  
Data	
  Table	
  
Data	
  Table	
  
Data	
  Table	
  
Data	
  Table	
  
Data	
  Table	
  
Common	
  Format,	
  
NaLve	
  grid,	
  
Efficient	
  
architecture	
  
MySQL	
  Extractor	
  
for	
  various	
  
data	
  
formats	
  
TRMM	
  
MODIS	
  
AIRS	
  
CERES	
  
ETC	
  
Soil	
  
moisture	
  
Extract	
  OBS	
  data	
  
Extract	
  model	
  
data	
  
User	
  
input	
  
Regridder	
  
(Put	
  the	
  OBS	
  &	
  model	
  data	
  on	
  the	
  
same	
  Lme/space	
  grid)	
  
Metrics	
  Calculator	
  
(Calculate	
  evaluaLon	
  metrics)	
  
Visualizer	
  
(Plot	
  the	
  metrics)	
  
URL	
  
Use	
  the	
  
re-­‐gridded	
  
data	
  for	
  
user’s	
  
own	
  
analyses	
  
and	
  VIS.	
  
Data	
  extractor	
  
(Binary	
  or	
  netCDF)	
  	
  
Model	
  data	
  Other	
  Data	
  Centers	
  
(ESGF,	
  DAAC,	
  ExArch	
  Network)	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   27	
  
0"
10"
20"
30"
40"
50"
60"
70"
80"
90"
w
rm
_trm
m
_pcp""
w
rm
_airs_SurfAirTem
p_A""
ceres_par""
ceres_cre_lw
_m
on""
ceres_ave_sza""
Second'
Query'"datapoint_id"'column'for'5'years'(2002:2007)'
MySQL""
PostgreSQL"
EvaluaEon	
  of	
  Cloud	
  CompuEng	
  for	
  Storage	
  &	
  
ApplicaEon	
  of	
  NASA	
  ObservaEons	
  
Challenge	
  
–  Regional	
  climate	
  model	
  evaluaEon	
  with	
  daily	
  temporal	
  
resoluEon	
  to	
  assess	
  representaEon	
  of	
  extreme	
  events.	
  
–  More	
  voluminous,	
  requires	
  scalability	
  in	
  web	
  services,	
  system	
  
throughput,	
  and	
  also	
  elasEcity	
  based	
  on	
  study	
  demands	
  
ObjecLve	
  
–  Understand	
  and	
  evaluate	
  popular	
  cloud	
  compuEng	
  
technologies,	
  and	
  provide	
  a	
  framework	
  for	
  selecEng	
  the	
  best	
  
one	
  for	
  supporEng	
  Regional	
  Climate	
  Model	
  EvaluaEon	
  System	
  
(RCMES)	
  &	
  applicaEons	
  such	
  as	
  the	
  NaEonal	
  Climate	
  
Assessment	
  and	
  IPCC’s	
  CORDEX	
  regional	
  model	
  evaluaEons.	
  
Results	
  	
  
–  Conducted	
  evaluaEon	
  demonstraEng	
  44	
  %	
  
avg	
  query	
  Eme	
  speedup	
  of	
  PostGIS	
  over	
  MySQL	
  	
  
for	
  	
  5	
  years	
  of	
  5	
  parameters	
  of	
  obs	
  data	
  in	
  RCMES	
  
–  Will	
  incorporate	
  into	
  RCMES	
  to	
  facilitate	
  NCA	
  
	
  	
  	
  	
  	
  and	
  CORDEX	
  regioal	
  model	
  evaluaEons.	
  
	
  
C.	
  Ma]mann,	
  D.	
  Waliser,	
  J.	
  Kim,	
  C.	
  Goodale,	
  A.	
  Hart,	
  P.	
  Ramirez,	
  D.	
  Crichton,	
  P.	
  Zimdars,	
  M.	
  Boustani,	
  H.	
  Lee,	
  P.	
  Loikith,	
  K.	
  Whitehall,	
  C.	
  Jack,	
  B.	
  
Hewitson.	
  Cloud	
  CompuEng	
  and	
  VirtualizaEon	
  Within	
  the	
  Regional	
  Climate	
  Model	
  and	
  EvaluaEon	
  System.	
  Earth	
  Science	
  Informa=cs,	
  2013.	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   28	
  
Example	
  applicaEon	
  for	
  CORDEX-­‐Africa	
  
Taylor	
  Plot	
  
Kim	
  et	
  al.	
  2013,	
  In	
  press	
  
Model bias (mm/day) from TRMMREF (%) from MODIS
NOTE: The blank areas in the REF (MODIS) data are due to missing values.
Ens	
  Mn	
  
MODIS	
  
ENS
Annual	
  Cloudiness	
  Climatology	
  Against	
  MODIS;	
  2001-­‐2008	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   29	
  
NARCCAP	
  MulE-­‐decadal	
  Hindcast	
  EvaluaEon	
  Result	
  
Considerable biases exist in
surface insolation fields, a
not so typical variable
scrutinized in RCMs.
Kim, J., D.E. Waliser, C.A. Mattmann,
L.O. Mearns, C.E. Goodale, A.F. Hart,
D.J. Crichton, and S. McGinnis, 2013:
Evaluations of the surface air
temperature, precipitation, and
insolation over the conterminous U.S. in
the NARCCAP multi-RCM hindcast
experiments using RCMES. J. Climate,
In press.
Model ID	
   Model Name	
  
M01	
   CRCM (Canadian Regional Climate Model)	
  
M02	
   ECP2 (NCEP Regional Spectral Model)	
  
M03	
   MM5I (MM5 – run by Iowa State Univ.)	
  
M04	
   RCM3	
  
M05	
   WRFG (WRF – run by PNNL)	
  
ENS	
   Model Ensemble (Uniform weighting)	
  
Table 1. The RCMs evaluated in this study.
Figure. RCM biases in surface insolation against CERES
Model	
   Land-­‐mean	
  bias	
  –	
  
PrecipitaLon	
  (mm/d)	
  
Land-­‐mean	
  bias	
  -­‐	
  
InsolaLon	
  (Wm-­‐2)	
  
Bias	
  pa]ern	
  
CorrelaLon	
  
CRCM	
   0.33	
   10.2	
   -­‐0.47	
  
ECP2	
   0.41	
   9.0	
   -­‐0.28	
  
RCM3	
   0.54	
   -­‐29.9	
   -­‐0.50	
  
WRFG	
   -­‐0.08	
   30.4	
   -­‐0.18	
  
ENS	
   0.25	
   4.9	
   -­‐0.62	
  
Table 2. The relationship between precip & insolation biases.
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   30	
  
Snowmelt	
  Runoff	
  ForecasEng	
  
Dozier 2012
In	
  1	
  of	
  5	
  years,	
  forecast	
  errors	
  are	
  greater	
  
than	
  40%.	
  	
  Half	
  the	
  Eme,	
  they	
  are	
  greater	
  than	
  
20%.	
  	
  These	
  come	
  from	
  poor	
  data	
  and	
  poorly	
  
constrained	
  science.	
  
Credit:	
  Tom	
  Painter,	
  JPL	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   31	
  
Credit:	
  Tom	
  Painter,	
  JPL	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   32	
  
250
750
1250
1750
2250
2750
3250
3750
5/15/2013
5/16/2013
5/17/2013
5/18/2013
5/19/2013
5/20/2013
5/21/2013
5/22/2013
5/23/2013
5/24/2013
5/25/2013
5/26/2013
5/27/2013
5/28/2013
5/29/2013
5/30/2013
5/31/2013
6/1/2013
6/2/2013
6/3/2013
6/4/2013
6/5/2013
6/6/2013
6/7/2013
6/8/2013
6/9/2013
6/10/2013
6/11/2013
6/12/2013
6/13/2013
6/14/2013
6/15/2013
6/16/2013
Obs.	
  HH	
  Inflow	
  (cfs)
Raw	
  PRMS	
  basin_cfs
ASO	
  PRMS	
  basin_cfs
The	
  JPL	
  ASO	
  team	
  and	
  California	
  Dept.	
  of	
  Water	
  Resources	
  (DWR)	
  
predicEon	
  of	
  water	
  inflow	
  into	
  the	
  Hetch	
  Hetchy	
  Reservoir	
  in	
  
thousand	
  acre	
  feet	
  (shown	
  in	
  red)	
  was	
  modified	
  on	
  June	
  1,	
  2013	
  
based	
  on	
  snow	
  water	
  equivalent	
  (SWE)	
  data	
  from	
  the	
  NASA/JPL	
  
Airborne	
  Snow	
  Observatory.	
  The	
  new	
  forecast	
  (shown	
  in	
  purple)	
  
provided	
  a	
  factor	
  of	
  2	
  be=er	
  esEmate	
  of	
  the	
  actual	
  inflow	
  (shown	
  
in	
  blue)	
  and	
  enabled	
  water	
  managers	
  to	
  opEmize	
  reservoir	
  
operaEons	
  in	
  its	
  first	
  year.	
  	
  	
  
	
   	
   	
   	
  Tom	
  Painter,	
  JPL	
  
10	
  km	
  
Improved	
  EsEmates	
  for	
  Water	
  
Management	
  in	
  California	
  
With ASO
Actual
Forecast without ASO
ASOupdate
Hetch Hetchy inflow
forecasting
Spring 2013
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   33	
  
How	
  did	
  ASO	
  go	
  from	
  acquired	
  data	
  
to…improving	
  water	
  esEmates?	
  The	
  ASO	
  Compute	
  Team	
  
Twin Otter Lands at Mammoth
LIDAR
data
CASI data
Sierra Nevada Aquatic Research Lab
ASO mobile processing
system (per specs)
Super fast
eSATA or
USB2/3
mount disks
LIDAR
data
CASI data
Snow
Depth
SWE
Water Managers
Operator
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   34	
  
Who	
  is	
  the	
  ASO	
  Compute	
  Team?	
  
ASO PI
Tom Painter
Field Team
Lead: Jeff
Deems, NSIDC
Compute Team
Lead: Chris
Mattmann, JPL
Flight Team
Lead: Cate
Heneghan JPL
Paul	
  Ramirez,	
  Andrew	
  Hart,	
  Cameron	
  Goodale,	
  Felix	
  Seidel,	
  Paul	
  Zimdars,	
  Susan	
  Neely,	
  
Jason	
  Horn,	
  Rishi	
  Verma,	
  Maziyar	
  Boustani,	
  Shakeh	
  Khudikyan,	
  Joseph	
  Boardman,	
  Amy	
  
Trangsrud,	
  Cate	
  Heneghan	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   35	
  
What	
  do	
  we	
  do?	
  
•  Job	
  #1	
  
–  Don’t	
  lose	
  the	
  bits	
  
•  Job	
  #2	
  
–  Rapidly,	
  and	
  automaEcally	
  process	
  algorithms	
  delivered	
  by	
  ASO	
  scienEsts	
  
•  Spectrometer	
  (raw	
  radiance	
  data	
  through	
  basin	
  maps	
  of	
  albedo)	
  
•  LIDAR	
  (raw	
  data	
  through	
  snow	
  depth/SWE)	
  
•  Job	
  #3	
  
–  Ensure	
  that	
  executed	
  algorithms	
  can	
  easily	
  be	
  rerun,	
  and	
  that	
  we	
  catalog	
  and	
  archive	
  
the	
  inputs,	
  and	
  outputs	
  
•  Job	
  #4	
  
–  Deliver	
  the	
  outputs	
  of	
  the	
  algorithms	
  (“move	
  data	
  around”)	
  
•  Job	
  #5	
  
–  Reformat	
  the	
  data,	
  and	
  convert	
  it,	
  and	
  deliver	
  maps,	
  movies,	
  higher	
  level	
  EPO	
  
•  Job	
  #6	
  
–  Entertain	
  the	
  rest	
  of	
  the	
  team	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   36	
  
Weekly	
  Detailed	
  Compute	
  Flow	
  
1
2
3
4
5
6
7
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   37	
  
Rodeo	
  improvements	
  over	
  Eme	
  
(CASI/spectrometer)	
  
•  Earlier,	
  ISSP	
  was	
  dominant	
  
processing	
  Eme	
  in	
  rodeo	
  
–  Eventually	
  Ortho	
  became	
  a	
  
problem	
  too	
  due	
  to	
  issues	
  
like	
  flying	
  off	
  DEM;	
  and/or	
  
discovery	
  of	
  resource	
  
contenEon	
  at	
  alg.	
  Level	
  
•  Radcorr	
  and	
  Rebin	
  
processing	
  Eme	
  were	
  
equated	
  to	
  nil	
  through	
  
paralllelism	
  and	
  automaEon	
  
•  Within	
  a	
  month	
  of	
  near	
  
automaEon,	
  we	
  were	
  
making	
  24	
  hrs	
  on	
  CASI	
  side	
  
•  Updates	
  to	
  algs	
  to	
  make	
  
deadline	
  
0	
  
5	
  
10	
  
15	
  
20	
  
25	
  
30	
  
35	
  
40	
  
45	
  
Radcorr	
  
Rebin	
  
Ortho	
  
ISSP	
  
Total	
  
0	
   10	
   20	
   30	
   40	
   50	
  
10-­‐Apr-­‐13	
  
17-­‐Apr-­‐13	
  
24-­‐Apr-­‐13	
  
1-­‐May-­‐13	
  
8-­‐May-­‐13	
  
15-­‐May-­‐13	
  
Total	
  CASI	
  24	
  rodeo	
  processing	
  Lme	
  
in	
  hours:	
  4/10/2013	
  -­‐	
  05/15/2013	
  
Total	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   38	
  
Owen’s Valley Radio
Observatory
•  LWA Owens Valley Radio Observatory – Probing
for Cosmic Dawn
–  Joe Lazio, JPL co-PI, Gregg Hallinan, Caltech co-PI
–  Larry D’Addario, Chris Mattmann JPL co-Is
•  Will lead the data management for OVRO
39
Offline calibration
(cuWARP)
Online calibration and
imaging
Real Time Calibration and
Imaging Pipeline
1TB/hr
raw
visibility
3 TB/hr
calibrated
all sky
images
File
Manager
Workflow
Manager
Reduction in Time/Frequency
10GB/hr
reduced
sky
images
LWA
OVRO
Portal
Legend:
wrapped
algorithm
Apache
OODT
component
data flow control flow
data
Apache
Solr
7-­‐Mar-­‐14	
   BigData-­‐QCon	
  
Go Where the Science is Best!
Deploy Where there is No Infrastructure
Reconfigure as Needed to Optimize Performance
Simplify by Using Raw Voltage Capture
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   40	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   41	
  
MadrigalMadrigal
DatabaseDatabase
ServerServer
ExistingExisting
ProcessingProcessing
Cloud(s)Cloud(s)
Mark 6 Data LoadMark 6 Data Load
at Haystackat Haystack
(optional)(optional)
Post Experiment Processing Cloud (Haystack / Internet2)
Wireless
Mesh
Gateway
TX of
Opportunity
Beacon SatBeacon Sat RadioRadio
StarStar
Scientists
Science Targets
Solar Imaging
Galactic Synchrotron Emission
Cosmic Ray Air Showers
Ionospheric Irregularities
Ionospheric Scintillation
Go Where the Science is Best!
● Solar Power
● Wireless Mesh Networking
● Per Element Burst Buffers
● Scheduled and Triggered Operations
● Transfer data by collecting field units
● Direct fiber link possible
● Limit ops cycles by available power
● Offline processing (post deployment)
SolarSolar
ImagingImaging
RAPID :RAPID : Radio Array of Portable Interferometric DetectorsRadio Array of Portable Interferometric Detectors
Techniques
Aperture Synthesis Imaging
Absolute Temperature Calibration
Ionospheric TEC and Scintillation
Digital Imaging Riometery
Bi-static Active / Passive Radar
Spectral Monitoring
ApacheApache
OODTOODT
Mesh
Network
1000km
U.S.	
  NaEonal	
  Radio	
  Astronomy	
  
Observatory	
  (NRAO)	
  
•  Explore	
  JPL	
  data	
  system	
  experEse	
  
–  Leverage	
  Apache	
  OODT	
  
–  Leverage	
  architecture	
  experience	
  
–  Build	
  on	
  NRAO	
  Socorro	
  F2F	
  given	
  in	
  April	
  2011	
  and	
  
InnovaEons	
  in	
  Data-­‐Intensive	
  Astronomy	
  meeEng	
  in	
  
May	
  2011	
  
•  Define	
  achievable	
  prototype	
  
–  Focus	
  on	
  EVLA	
  summer	
  school	
  pipeline	
  
•  Heavy	
  focus	
  on	
  CASApy,	
  simple	
  pipelining,	
  metadata	
  
extracEon,	
  archiving	
  of	
  directory-­‐based	
  products	
  
•  Ideal	
  for	
  OODT	
  system	
  
7-­‐Mar-­‐14	
   42	
  BigData-­‐QCon	
  
SKA/NRAO	
  Architecture	
  
7-­‐Mar-­‐14	
   43	
  BigData-­‐QCon	
  
WWW
ska-dc.jpl.nasa.gov
Staging
Area
EVLA
Crawler
FM
rep cat
Curator
Browser
CASData
Services
PCS
Services
Science
Data System
Operator
products,
metadata
system
status
day2_TDEM0003_10s_norx
Legend:
Apache
OODT
day2_TDEM0003_10s_norx
data flow
control flow
data
/metDisk Area
WMCub
evlascube event
W
Monitor
proc
status
Key Science for the SKA
(a.k.a. m- and cm-( astronomy)
Strong-field Tests of Gravity
with Pulsars and Black Holes
Galaxy Evolution, Cosmology, &
Dark Energy
Origin & Evolution of
Cosmic Magnetism
Emerging from the Dark Ages &
the Epoch of Reionization
The Cradle of Life & Astrobiology
Fast Radio Transients
44
• VFASTR	
  Transient	
  Event	
  CollaboraEve	
  Review	
  Portal	
  –	
  collaboraEon	
  with	
  Wagstaff/
Thompson	
  
‣ Web-­‐based	
  pla{orm	
  for	
  easy	
  and	
  Emely	
  review	
  of	
  candidate	
  events	
  	
  
‣ AutomaEc	
  idenEficaEon	
  of	
  interesEng	
  events	
  by	
  a	
  self-­‐trained	
  machine	
  agent	
  	
  
‣ Demonstrates	
  rapid	
  science	
  algorithm	
  integraLon	
  
‣  C.	
  Ma=mann,	
  A.	
  Hart,	
  L.	
  Cinquini,	
  J.	
  Lazio,	
  S.	
  Khudikyan,	
  D.	
  Jones,	
  R.	
  Preston,	
  T.	
  Benne=,	
  B.	
  Butler,	
  D.	
  Harland,	
  K.	
  
Cummings,	
  B.	
  Glendenning,	
  J.	
  Kern,	
  J.	
  Robne=.	
  Scalable	
  Data	
  Mining,	
  Archiving	
  and	
  Big	
  Data	
  Management	
  for	
  the	
  
Next	
  GeneraEon	
  Astronomical	
  Telescopes.	
  Big	
  Data	
  Management,	
  Technologies,	
  and	
  Applica=ons.	
  W.	
  Hu,	
  N.	
  
Kaabouch,	
  eds.	
  IGI	
  Global,	
  2013.	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
  
Data	
  ScienEst	
  Training	
  
•  Supervised	
  4	
  current	
  postdocs	
  (co-­‐supervise	
  with	
  
Waliser	
  and	
  Painter)	
  and	
  1	
  current	
  PhD	
  in	
  
Atmospheric	
  Sciences	
  (Whitehall)	
  
–  What	
  am	
  I	
  doing	
  on	
  her	
  disserta=on	
  commiee?	
  
•  USC	
  and	
  UCLA	
  in-­‐flow	
  and	
  ou{low	
  
–  Hired	
  ~10	
  USC	
  PhD	
  and	
  MS	
  students	
  at	
  JPL	
  
–  A=racEng	
  more	
  all	
  the	
  Eme	
  
–  Fielding	
  them	
  in	
  courses	
  at	
  USC	
  in	
  Search	
  	
  
Engines/Big	
  Data,	
  and	
  in	
  SoJware	
  	
  
Architecture	
  
–  $$$	
  at	
  USC	
  and	
  UCLA	
  from	
  NSF	
  to	
  flow	
  in	
  and	
  out	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   45	
  
NASA:	
  where	
  from	
  here?	
  
•  Agency	
  framework	
  for	
  open	
  source:	
  we	
  can’t	
  do	
  it	
  all	
  on	
  our	
  own	
  
•  Strategic	
  investments/opportuniEes	
  
–  Rapid	
  Science	
  algorithm	
  integraEon	
  
•  Needed	
  by	
  next	
  gen	
  missions	
  (Space/airborne),	
  e.g.,	
  ASO,	
  needed	
  by	
  next	
  gen	
  
astronomical	
  archives	
  e.g.,	
  SKA,	
  and	
  exisEng	
  NRAO	
  work,	
  and	
  collaboraEons	
  
(MIT)	
  
–  Smart	
  data	
  movement	
  
•  Needed	
  by	
  next	
  gen	
  missions,	
  climate	
  science	
  work	
  (RCMES,	
  ESGF),	
  and	
  next	
  
gen	
  astronomy	
  data	
  transfer	
  (S.	
  Africa	
  to	
  US)	
  
–  Transient/Persistent	
  archives	
  (Science	
  CompuEng	
  FaciliEes)	
  
•  How	
  do	
  we	
  get	
  it	
  done	
  for	
  cheaper,	
  tear	
  it	
  down,	
  stand	
  it	
  up	
  quickly	
  
–  AutomaEc	
  text/metadata	
  extracEon	
  from	
  file	
  formats	
  
•  There	
  will	
  never	
  be	
  a	
  “god”	
  format,	
  so	
  we	
  need	
  Babel	
  Fish	
  –	
  needed	
  
by	
  ASO,	
  RCMES,	
  SKA,	
  XDATA	
  
•  More	
  data	
  scienEst	
  training	
  	
  7-­‐Mar-­‐14	
   BigData-­‐QCon	
   46	
  
Thank	
  you!	
  
chris.a.ma=mann@nasa.gov	
  
@chrisma=mann/Twi=er	
  
h=p://sunset.usc.edu/
~ma=mann/	
  	
  
Credit:	
  Vala	
  Afshar,	
  Extreme	
  Networks	
  
7-­‐Mar-­‐14	
   BigData-­‐QCon	
   48	
  
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/nasa-big-
data

Contenu connexe

Plus de C4Media

Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 

Plus de C4Media (20)

Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

A Research Agenda and Vision for Big Data at NASA

  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /nasa-big-data http://www.infoq.com/presentati ons/nasa-big-data http://www.infoq.com/presentati ons/nasa-big-data
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. A  Research  Agenda  and  Vision   for  Big  Data  at  NASA    Chris  Ma=mann   Chief  Architect,  Instrument  and  Science  Data  Systems,     Jet  Propulsion  Laboratory,  California  Ins=tute  of  Technology   Adjunct  Associate  Professor,  USC   Director,  Apache  SoBware  Founda=on      
  • 5. Agenda   •  Big  Data  –  JPL’s  IniEaEve     •  Some  Big  Data  Technologies  from  the  Apache   SoJware  FoundaEon   •  JPL’s  Big  Data:  ASO,  RCMES,  SKA,  V-­‐FASTR   •  Rapid  Algorithm  IntegraEon,  Smart  Data  Movement,   Transient  Archives,  Automated  text/metadata  extracEon   and  MIME  idenEficaEon   •  Big  Data  Vision  and  Wrapup   7-­‐Mar-­‐14   BigData-­‐QCon   3  
  • 6. And  you  are?     •  Apache  Board  of  Directors  involved  in   –  OODT  (VP,  PMC),  Tika  (PMC),  Nutch  (PMC),  Incubator   (PMC),  SIS  (PMC),  Gora  (PMC),  Airavata  (PMC)   7-­‐Mar-­‐14   BigData-­‐QCon   4   •  Chief  Architect  at  NASA   JPL  in  Pasadena,  CA  USA   •  SoJware  Architecture/ Engineering  Prof  at  Univ.   of  Southern  California     •  One  of  original  PMC   members  for  Apache   Nutch   –  predecessor  to  Hadoop  
  • 7. Some  “Big  Data”  Grand  Challenges  I’m   interested  in   •  How  do  we  handle  700  TB/sec  of  data  coming  off  the  wire  when  we   actually  have  to  keep  it  around?   –  Required  by  the  Square  Kilometre  Array   •  Joe  scien=st  says  I’ve  got  an  IDL  or  Matlab  algorithm  that  I  will  not   change  and  I  need  to  run  it  on  10  years  of  data  from  the  Colorado   River  Basin  and  store  and  disseminate  the  output  products   –  Required  by  the  Western  Snow  Hydrology  project   •  How  do  we  compare  petabytes  of  climate  model  output  data  in  a   variety  of  formats  (HDF,  NetCDF,  Grib,  etc.)  with  petabytes  of  remote   sensing  data  to  improve  climate  models  for  the  next  IPCC  assessment?   –  Required  by  the  5th  IPCC  assessment  and  the  Earth  System  Grid  and  NASA   •  How  do  we  catalog  all  of  NASA’s  current  planetary  science  data?   –  Required  by  the  NASA  Planetary  Data  System   Image  Credit:  h=p://www.jpl.nasa.gov/news/news.cfm?release=2011-­‐295   Copyright  2012.  Jet  Propulsion  Laboratory,  California  InsEtute  of  Technology.  US   Government  Sponsorship  Acknowledged.   7-­‐Mar-­‐14   BigData-­‐QCon   5  
  • 8. Task Title PI Section 1 Power Minimization in Signal Processing for Data-Intensive Science Larry D’Addario 335 2 Machine Learning for Smart Triage of Big Data Kiri Wagstaff 388 3 Archiving, Processing and Dissemination for the Big Date Era Chris Mattmann 388 4 Knowledge driven Automated Movie Production Environment distribution and Display (AMPED) Pipeline Eric De Jong 3223 Big Data Strategic Initiative •  Apply lower-efficient digital architectures to future JPL flight instrument developments and proposals. •  Expand and promote JPL expertise with machine learning algorithm development for real-time triage. •  Utilize intelligent anomaly classification algorithms in other fields, including data-intensive industry. •  Build on JPL investments in large data archive systems to capture role in future science facilities. •  Enhance the efficiency and impact of JPL’s data visualization and knowledge extraction programs. Initiative Leader: Dayton Jones Steering Committee Leader: Robert Preston Initial Major Milestones for FY13 Date Report on end-to-end power optimization of instruments Jun 2013 Hierarchical classification method for VAST and ChemCam Jan 2013 Demonstrate smart compression for Hyperion and CRISM Mar 2013 Cloud computing research and scalability experiments Feb 2013 Data formats and text, metadata extraction in big data sys. Aug 2013 Develop AMPED pipeline and install in VIP Center Dec 2012 Future Opportunities: Mission and instrument competitions, data- intensive industries, LSST, future radio observatories. JPL Concept: Big data technology for data triage, archiving, etc. Key Challenges this work enables: Broaden JPL business base (relevant to 1X, 3X, 4X, 7X, 8X, 9X Directorates) Initiative Long Term Objectives 7-­‐Mar-­‐14   BigData-­‐QCon   6   Credit:  Dayton  Jones  
  • 9. Recent pub highlights 7 •  Nature  magazine   piece  on  “A  Vision   for  Data  Science”  in   Jan.  24th  issue   – Big  Data  IniEaEve   highlighted   •  Outline  algorithm   integra=on  (regridding,   metrics);  automa=c   understanding  of  data   metadata  formats  and   open  source  as  “key  issues”     7-­‐Mar-­‐14   BigData-­‐QCon  
  • 10. Data Science/Big Data progress •  Named to Editorial Board of Springer Journal of Big Data •  Helping to define USC’s M.S. in Data Science program •  Won/Submitted several Big Data proposals for direct funding –  DARPA Open Source Program Office XDATA –  NSF Major Research Instrumentation (RAPID) –  NSF Polar Cyberinfrastructure, NSF EarthCube (both via USC) –  President’s/Director’s Fund for Cosmic Dawn/OVRO –  National Science Foundation: High Performance Computing System Acquisition (submitted) 87-­‐Mar-­‐14   BigData-­‐QCon  
  • 11. Where  do  Big  Data  technologies     fit  into  this?   U.S.  NaEonal  Climate  Assessment   (pic  credit:  Dr.  Tom  Painter)   SKA  South  Africa:  Square  Kilometre  Array   (pic  credit:  Dr.  Jasper  Horrell,  Simon  Ratcliffe   7-­‐Mar-­‐14   BigData-­‐QCon   9  
  • 12. The Apache Software Foundation •  Largest open source software development entity in the world –  Over 2600+ committers –  Over 4200+ contributors –  Over 400+ members •  100+ Top Level Projects –  57 Incubating –  32 Lab Projects •  12 retired projects in the “Attic” •  Over 1.2 million revisions •  501(c)3 non-profit organization incorporated in Delaware - Over 10M successful requests served a day across the world - HTTPD web server used on 100+ million web sites (52+% of the market) 7-­‐Mar-­‐14   BigData-­‐QCon   10  
  • 13. Apache  Maturity  Model   •  Start  out   with     IncubaEon   •  Grow     community   •  Make     releases   •  Gain  interest   •  Diversify   •  When  the  project  is  ready,  graduate  into   –  Top-­‐Level  Project  (TLP)   –  Sub-­‐project  of  TLP   •  Increasingly,  Sub-­‐projects  are  discouraged  compared  to  TLPs   11  BigData-­‐QCon  7-­‐Mar-­‐14  
  • 14. •  Apache  is  a  meritocracy   –  You  earn  your  keep  and  your     credenEals   •  Start  out  as  Contributor   –  Patches,  mailing  list  comments,  etc.   –  No  commit  access   •  Move  onto  Commier   –  Commit  access,  evolve  the  code   •  PMC  Members   –  Have  binding  VOTEs  on  releases/personnel   •  Officer  (VP,  Project)   –  PMC  Chair     •  ASF  Member   –  Have  binding  VOTE  in  the  state  of  the  foundaEon   –  Elect  Board  of  Directors   •  Director   –  Oversight  of  projects,  foundaEon  acEviEes   12  BigData-­‐QCon  7-­‐Mar-­‐14   Apache  OrganizaEon  
  • 15. Apache OODT •  Entered “incubation” at the Apache Software Foundation in 2010 •  Selected as a top level Apache Software Foundation project in January 2011 •  Developed by a community of participants from many companies, universities, and organizations •  Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more OODT Development & user community includes: http://oodt.apache.org 7-­‐Mar-­‐14   BigData-­‐QCon   13  
  • 16. Apache  OODT:  OSS  “big  data”  pla?orm   originally  pioneered  at  NASA   •  OODT is meant to be a set of tools to help build data systems –  It’s not meant to be “turn key” –  It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science –  Each discipline/project extends •  Projects  that  are  deploying  it  operaEonally  at   –  Decadal-­‐survey  recommended  NASA  Earth  science    missions,  NIH,  and  NCI,   CHLA,  USC,  South  African  SKA  project   •  Why  Apache?   –  Less than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop) –  Differs from other open source communities; it provides a governance and management structure Copyright  2012.  Jet  Propulsion  Laboratory,  California   InsEtute  of  Technology.  US  Government  Sponsorship   Acknowledged.   7-­‐Mar-­‐14   BigData-­‐QCon   14  
  • 17. 7-­‐Mar-­‐14   15   •  All  Core  components  implemented  as  web  services   –  XML-­‐RPC  used  to  communicate  between  components   –  Servers  implemented  in  Java   –  Clients  implemented  in  Java,  scripts,  Python,    PHP  and  web-­‐apps   –  Service  configuraEon  implemented  in  ASCII  and  XML  files         OODT Core Components BigData-­‐QCon  
  • 18. Why Apache and OODT? •  OODT is meant to be a set of tools to help build data systems –  It’s not meant to be “turn key” –  It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science –  Each discipline/project extends •  Apache is the elite open source community for software developers –  Less than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop) –  Differs from other open source communities; it provides a governance and management structure 7-­‐Mar-­‐14   BigData-­‐QCon   16  
  • 19. Governance  Model+NASA=♥   •  NASA  and  other  government     agencies  have  tons  of  process   – They  like  that   7-­‐Mar-­‐14   BigData-­‐QCon   17  
  • 20. Apache  Open  Climate   Workbench..OCW   7-­‐Mar-­‐14   BigData-­‐QCon   18  
  • 21. The Information Landscape 7-­‐Mar-­‐14   BigData-­‐QCon   19  
  • 22. ProliferaEon  of  content  types  available   •  By some accounts, 16K to 51K content types* •  What to do with content types? –  Parse them •  How? •  Extract their text and structure –  Index their metadata •  In an indexing technology like Lucene, Solr, or in Google Appliance –  Identify what language they belong to •  Ngrams *h=p://filext.com/   7-­‐Mar-­‐14   BigData-­‐QCon   20  
  • 23. is… •  A  content  analysis  and  detecEon  toolkit   •  A  set  of  Java  APIs  providing  MIME  type  detecEon,   language  idenEficaEon,  integraEon  of  various   parsing  libraries   •  A  rich  Metadata  API  for  represenEng  different   Metadata  models   •  A  command  line  interface  to  the  underlying  Java   code   •  A  GUI  interface  to  the  Java  code   7-­‐Mar-­‐14   BigData-­‐QCon   21  
  • 24. Science Data File Formats •  Hierarchical Data Format (HDF) –  http://www.hdfgroup.org –  Versions 4 and 5 –  Lots of NASA data is in 4, newer NASA data in 5 –  Encapsulates •  Observation (Scalars, Vectors, Matrices, NxMxZ…) •  Metadata (Summary info, date/time ranges, spatial ranges) –  Custom readers/writers/APIs in many languages •  C/C++, Python, Java 7-­‐Mar-­‐14   BigData-­‐QCon   22  
  • 25. Science Data File Formats •  network Common Data Form (netCDF) –  www.unidata.ucar.edu/software/netcdf/ –  Versions 3 and 4 –  Heavily used in DOE, NOAA, etc. –  Encapsulates •  Observation (Scalars, Vectors, Matrices, NxMxZ…) •  Metadata (Summary info, date/time ranges, spatial ranges) –  Custom readers/writers/APIs in many languages •  C/C++, Python, Java –  Not Hierarchical representation: all flat 7-­‐Mar-­‐14   BigData-­‐QCon   23  
  • 26. OODT + Tika integrations Metadata Ext: TIKA! Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA! 7-­‐Mar-­‐14   BigData-­‐QCon   24  
  • 27. OODT + Tika integrations Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA! 7-­‐Mar-­‐14   BigData-­‐QCon   25  
  • 28. Now  some  specific  NASA/JPL   project  examples  
  • 29. RCMES2.0   (http://rcmes.jpl.nasa.gov)   Raw  Data:   Various  sources,   formats,   ResoluEons,   Coverage   RCMED   (Regional  Climate  Model  EvaluaEon  Database)   A  large  scalable  database  to  store  data  from  variety   of  sources  in  a  common  format   RCMET   (Regional  Climate  Model  EvaluaEon  Tool)   A  library  of  codes  for  extracEng  data  from   RCMED  and  model  and  for  calculaEng   evaluaEon  metrics   Metadata   Data  Table   Data  Table   Data  Table   Data  Table   Data  Table   Data  Table   Common  Format,   NaLve  grid,   Efficient   architecture   MySQL  Extractor   for  various   data   formats   TRMM   MODIS   AIRS   CERES   ETC   Soil   moisture   Extract  OBS  data   Extract  model   data   User   input   Regridder   (Put  the  OBS  &  model  data  on  the   same  Lme/space  grid)   Metrics  Calculator   (Calculate  evaluaLon  metrics)   Visualizer   (Plot  the  metrics)   URL   Use  the   re-­‐gridded   data  for   user’s   own   analyses   and  VIS.   Data  extractor   (Binary  or  netCDF)     Model  data  Other  Data  Centers   (ESGF,  DAAC,  ExArch  Network)   7-­‐Mar-­‐14   BigData-­‐QCon   27  
  • 30. 0" 10" 20" 30" 40" 50" 60" 70" 80" 90" w rm _trm m _pcp"" w rm _airs_SurfAirTem p_A"" ceres_par"" ceres_cre_lw _m on"" ceres_ave_sza"" Second' Query'"datapoint_id"'column'for'5'years'(2002:2007)' MySQL"" PostgreSQL" EvaluaEon  of  Cloud  CompuEng  for  Storage  &   ApplicaEon  of  NASA  ObservaEons   Challenge   –  Regional  climate  model  evaluaEon  with  daily  temporal   resoluEon  to  assess  representaEon  of  extreme  events.   –  More  voluminous,  requires  scalability  in  web  services,  system   throughput,  and  also  elasEcity  based  on  study  demands   ObjecLve   –  Understand  and  evaluate  popular  cloud  compuEng   technologies,  and  provide  a  framework  for  selecEng  the  best   one  for  supporEng  Regional  Climate  Model  EvaluaEon  System   (RCMES)  &  applicaEons  such  as  the  NaEonal  Climate   Assessment  and  IPCC’s  CORDEX  regional  model  evaluaEons.   Results     –  Conducted  evaluaEon  demonstraEng  44  %   avg  query  Eme  speedup  of  PostGIS  over  MySQL     for    5  years  of  5  parameters  of  obs  data  in  RCMES   –  Will  incorporate  into  RCMES  to  facilitate  NCA            and  CORDEX  regioal  model  evaluaEons.     C.  Ma]mann,  D.  Waliser,  J.  Kim,  C.  Goodale,  A.  Hart,  P.  Ramirez,  D.  Crichton,  P.  Zimdars,  M.  Boustani,  H.  Lee,  P.  Loikith,  K.  Whitehall,  C.  Jack,  B.   Hewitson.  Cloud  CompuEng  and  VirtualizaEon  Within  the  Regional  Climate  Model  and  EvaluaEon  System.  Earth  Science  Informa=cs,  2013.   7-­‐Mar-­‐14   BigData-­‐QCon   28  
  • 31. Example  applicaEon  for  CORDEX-­‐Africa   Taylor  Plot   Kim  et  al.  2013,  In  press   Model bias (mm/day) from TRMMREF (%) from MODIS NOTE: The blank areas in the REF (MODIS) data are due to missing values. Ens  Mn   MODIS   ENS Annual  Cloudiness  Climatology  Against  MODIS;  2001-­‐2008   7-­‐Mar-­‐14   BigData-­‐QCon   29  
  • 32. NARCCAP  MulE-­‐decadal  Hindcast  EvaluaEon  Result   Considerable biases exist in surface insolation fields, a not so typical variable scrutinized in RCMs. Kim, J., D.E. Waliser, C.A. Mattmann, L.O. Mearns, C.E. Goodale, A.F. Hart, D.J. Crichton, and S. McGinnis, 2013: Evaluations of the surface air temperature, precipitation, and insolation over the conterminous U.S. in the NARCCAP multi-RCM hindcast experiments using RCMES. J. Climate, In press. Model ID   Model Name   M01   CRCM (Canadian Regional Climate Model)   M02   ECP2 (NCEP Regional Spectral Model)   M03   MM5I (MM5 – run by Iowa State Univ.)   M04   RCM3   M05   WRFG (WRF – run by PNNL)   ENS   Model Ensemble (Uniform weighting)   Table 1. The RCMs evaluated in this study. Figure. RCM biases in surface insolation against CERES Model   Land-­‐mean  bias  –   PrecipitaLon  (mm/d)   Land-­‐mean  bias  -­‐   InsolaLon  (Wm-­‐2)   Bias  pa]ern   CorrelaLon   CRCM   0.33   10.2   -­‐0.47   ECP2   0.41   9.0   -­‐0.28   RCM3   0.54   -­‐29.9   -­‐0.50   WRFG   -­‐0.08   30.4   -­‐0.18   ENS   0.25   4.9   -­‐0.62   Table 2. The relationship between precip & insolation biases. 7-­‐Mar-­‐14   BigData-­‐QCon   30  
  • 33. Snowmelt  Runoff  ForecasEng   Dozier 2012 In  1  of  5  years,  forecast  errors  are  greater   than  40%.    Half  the  Eme,  they  are  greater  than   20%.    These  come  from  poor  data  and  poorly   constrained  science.   Credit:  Tom  Painter,  JPL   7-­‐Mar-­‐14   BigData-­‐QCon   31  
  • 34. Credit:  Tom  Painter,  JPL   7-­‐Mar-­‐14   BigData-­‐QCon   32  
  • 35. 250 750 1250 1750 2250 2750 3250 3750 5/15/2013 5/16/2013 5/17/2013 5/18/2013 5/19/2013 5/20/2013 5/21/2013 5/22/2013 5/23/2013 5/24/2013 5/25/2013 5/26/2013 5/27/2013 5/28/2013 5/29/2013 5/30/2013 5/31/2013 6/1/2013 6/2/2013 6/3/2013 6/4/2013 6/5/2013 6/6/2013 6/7/2013 6/8/2013 6/9/2013 6/10/2013 6/11/2013 6/12/2013 6/13/2013 6/14/2013 6/15/2013 6/16/2013 Obs.  HH  Inflow  (cfs) Raw  PRMS  basin_cfs ASO  PRMS  basin_cfs The  JPL  ASO  team  and  California  Dept.  of  Water  Resources  (DWR)   predicEon  of  water  inflow  into  the  Hetch  Hetchy  Reservoir  in   thousand  acre  feet  (shown  in  red)  was  modified  on  June  1,  2013   based  on  snow  water  equivalent  (SWE)  data  from  the  NASA/JPL   Airborne  Snow  Observatory.  The  new  forecast  (shown  in  purple)   provided  a  factor  of  2  be=er  esEmate  of  the  actual  inflow  (shown   in  blue)  and  enabled  water  managers  to  opEmize  reservoir   operaEons  in  its  first  year.              Tom  Painter,  JPL   10  km   Improved  EsEmates  for  Water   Management  in  California   With ASO Actual Forecast without ASO ASOupdate Hetch Hetchy inflow forecasting Spring 2013 7-­‐Mar-­‐14   BigData-­‐QCon   33  
  • 36. How  did  ASO  go  from  acquired  data   to…improving  water  esEmates?  The  ASO  Compute  Team   Twin Otter Lands at Mammoth LIDAR data CASI data Sierra Nevada Aquatic Research Lab ASO mobile processing system (per specs) Super fast eSATA or USB2/3 mount disks LIDAR data CASI data Snow Depth SWE Water Managers Operator 7-­‐Mar-­‐14   BigData-­‐QCon   34  
  • 37. Who  is  the  ASO  Compute  Team?   ASO PI Tom Painter Field Team Lead: Jeff Deems, NSIDC Compute Team Lead: Chris Mattmann, JPL Flight Team Lead: Cate Heneghan JPL Paul  Ramirez,  Andrew  Hart,  Cameron  Goodale,  Felix  Seidel,  Paul  Zimdars,  Susan  Neely,   Jason  Horn,  Rishi  Verma,  Maziyar  Boustani,  Shakeh  Khudikyan,  Joseph  Boardman,  Amy   Trangsrud,  Cate  Heneghan   7-­‐Mar-­‐14   BigData-­‐QCon   35  
  • 38. What  do  we  do?   •  Job  #1   –  Don’t  lose  the  bits   •  Job  #2   –  Rapidly,  and  automaEcally  process  algorithms  delivered  by  ASO  scienEsts   •  Spectrometer  (raw  radiance  data  through  basin  maps  of  albedo)   •  LIDAR  (raw  data  through  snow  depth/SWE)   •  Job  #3   –  Ensure  that  executed  algorithms  can  easily  be  rerun,  and  that  we  catalog  and  archive   the  inputs,  and  outputs   •  Job  #4   –  Deliver  the  outputs  of  the  algorithms  (“move  data  around”)   •  Job  #5   –  Reformat  the  data,  and  convert  it,  and  deliver  maps,  movies,  higher  level  EPO   •  Job  #6   –  Entertain  the  rest  of  the  team   7-­‐Mar-­‐14   BigData-­‐QCon   36  
  • 39. Weekly  Detailed  Compute  Flow   1 2 3 4 5 6 7 7-­‐Mar-­‐14   BigData-­‐QCon   37  
  • 40. Rodeo  improvements  over  Eme   (CASI/spectrometer)   •  Earlier,  ISSP  was  dominant   processing  Eme  in  rodeo   –  Eventually  Ortho  became  a   problem  too  due  to  issues   like  flying  off  DEM;  and/or   discovery  of  resource   contenEon  at  alg.  Level   •  Radcorr  and  Rebin   processing  Eme  were   equated  to  nil  through   paralllelism  and  automaEon   •  Within  a  month  of  near   automaEon,  we  were   making  24  hrs  on  CASI  side   •  Updates  to  algs  to  make   deadline   0   5   10   15   20   25   30   35   40   45   Radcorr   Rebin   Ortho   ISSP   Total   0   10   20   30   40   50   10-­‐Apr-­‐13   17-­‐Apr-­‐13   24-­‐Apr-­‐13   1-­‐May-­‐13   8-­‐May-­‐13   15-­‐May-­‐13   Total  CASI  24  rodeo  processing  Lme   in  hours:  4/10/2013  -­‐  05/15/2013   Total   7-­‐Mar-­‐14   BigData-­‐QCon   38  
  • 41. Owen’s Valley Radio Observatory •  LWA Owens Valley Radio Observatory – Probing for Cosmic Dawn –  Joe Lazio, JPL co-PI, Gregg Hallinan, Caltech co-PI –  Larry D’Addario, Chris Mattmann JPL co-Is •  Will lead the data management for OVRO 39 Offline calibration (cuWARP) Online calibration and imaging Real Time Calibration and Imaging Pipeline 1TB/hr raw visibility 3 TB/hr calibrated all sky images File Manager Workflow Manager Reduction in Time/Frequency 10GB/hr reduced sky images LWA OVRO Portal Legend: wrapped algorithm Apache OODT component data flow control flow data Apache Solr 7-­‐Mar-­‐14   BigData-­‐QCon  
  • 42. Go Where the Science is Best! Deploy Where there is No Infrastructure Reconfigure as Needed to Optimize Performance Simplify by Using Raw Voltage Capture 7-­‐Mar-­‐14   BigData-­‐QCon   40  
  • 43. 7-­‐Mar-­‐14   BigData-­‐QCon   41   MadrigalMadrigal DatabaseDatabase ServerServer ExistingExisting ProcessingProcessing Cloud(s)Cloud(s) Mark 6 Data LoadMark 6 Data Load at Haystackat Haystack (optional)(optional) Post Experiment Processing Cloud (Haystack / Internet2) Wireless Mesh Gateway TX of Opportunity Beacon SatBeacon Sat RadioRadio StarStar Scientists Science Targets Solar Imaging Galactic Synchrotron Emission Cosmic Ray Air Showers Ionospheric Irregularities Ionospheric Scintillation Go Where the Science is Best! ● Solar Power ● Wireless Mesh Networking ● Per Element Burst Buffers ● Scheduled and Triggered Operations ● Transfer data by collecting field units ● Direct fiber link possible ● Limit ops cycles by available power ● Offline processing (post deployment) SolarSolar ImagingImaging RAPID :RAPID : Radio Array of Portable Interferometric DetectorsRadio Array of Portable Interferometric Detectors Techniques Aperture Synthesis Imaging Absolute Temperature Calibration Ionospheric TEC and Scintillation Digital Imaging Riometery Bi-static Active / Passive Radar Spectral Monitoring ApacheApache OODTOODT Mesh Network 1000km
  • 44. U.S.  NaEonal  Radio  Astronomy   Observatory  (NRAO)   •  Explore  JPL  data  system  experEse   –  Leverage  Apache  OODT   –  Leverage  architecture  experience   –  Build  on  NRAO  Socorro  F2F  given  in  April  2011  and   InnovaEons  in  Data-­‐Intensive  Astronomy  meeEng  in   May  2011   •  Define  achievable  prototype   –  Focus  on  EVLA  summer  school  pipeline   •  Heavy  focus  on  CASApy,  simple  pipelining,  metadata   extracEon,  archiving  of  directory-­‐based  products   •  Ideal  for  OODT  system   7-­‐Mar-­‐14   42  BigData-­‐QCon  
  • 45. SKA/NRAO  Architecture   7-­‐Mar-­‐14   43  BigData-­‐QCon   WWW ska-dc.jpl.nasa.gov Staging Area EVLA Crawler FM rep cat Curator Browser CASData Services PCS Services Science Data System Operator products, metadata system status day2_TDEM0003_10s_norx Legend: Apache OODT day2_TDEM0003_10s_norx data flow control flow data /metDisk Area WMCub evlascube event W Monitor proc status Key Science for the SKA (a.k.a. m- and cm-( astronomy) Strong-field Tests of Gravity with Pulsars and Black Holes Galaxy Evolution, Cosmology, & Dark Energy Origin & Evolution of Cosmic Magnetism Emerging from the Dark Ages & the Epoch of Reionization The Cradle of Life & Astrobiology
  • 46. Fast Radio Transients 44 • VFASTR  Transient  Event  CollaboraEve  Review  Portal  –  collaboraEon  with  Wagstaff/ Thompson   ‣ Web-­‐based  pla{orm  for  easy  and  Emely  review  of  candidate  events     ‣ AutomaEc  idenEficaEon  of  interesEng  events  by  a  self-­‐trained  machine  agent     ‣ Demonstrates  rapid  science  algorithm  integraLon   ‣  C.  Ma=mann,  A.  Hart,  L.  Cinquini,  J.  Lazio,  S.  Khudikyan,  D.  Jones,  R.  Preston,  T.  Benne=,  B.  Butler,  D.  Harland,  K.   Cummings,  B.  Glendenning,  J.  Kern,  J.  Robne=.  Scalable  Data  Mining,  Archiving  and  Big  Data  Management  for  the   Next  GeneraEon  Astronomical  Telescopes.  Big  Data  Management,  Technologies,  and  Applica=ons.  W.  Hu,  N.   Kaabouch,  eds.  IGI  Global,  2013.   7-­‐Mar-­‐14   BigData-­‐QCon  
  • 47. Data  ScienEst  Training   •  Supervised  4  current  postdocs  (co-­‐supervise  with   Waliser  and  Painter)  and  1  current  PhD  in   Atmospheric  Sciences  (Whitehall)   –  What  am  I  doing  on  her  disserta=on  commiee?   •  USC  and  UCLA  in-­‐flow  and  ou{low   –  Hired  ~10  USC  PhD  and  MS  students  at  JPL   –  A=racEng  more  all  the  Eme   –  Fielding  them  in  courses  at  USC  in  Search     Engines/Big  Data,  and  in  SoJware     Architecture   –  $$$  at  USC  and  UCLA  from  NSF  to  flow  in  and  out   7-­‐Mar-­‐14   BigData-­‐QCon   45  
  • 48. NASA:  where  from  here?   •  Agency  framework  for  open  source:  we  can’t  do  it  all  on  our  own   •  Strategic  investments/opportuniEes   –  Rapid  Science  algorithm  integraEon   •  Needed  by  next  gen  missions  (Space/airborne),  e.g.,  ASO,  needed  by  next  gen   astronomical  archives  e.g.,  SKA,  and  exisEng  NRAO  work,  and  collaboraEons   (MIT)   –  Smart  data  movement   •  Needed  by  next  gen  missions,  climate  science  work  (RCMES,  ESGF),  and  next   gen  astronomy  data  transfer  (S.  Africa  to  US)   –  Transient/Persistent  archives  (Science  CompuEng  FaciliEes)   •  How  do  we  get  it  done  for  cheaper,  tear  it  down,  stand  it  up  quickly   –  AutomaEc  text/metadata  extracEon  from  file  formats   •  There  will  never  be  a  “god”  format,  so  we  need  Babel  Fish  –  needed   by  ASO,  RCMES,  SKA,  XDATA   •  More  data  scienEst  training    7-­‐Mar-­‐14   BigData-­‐QCon   46  
  • 49. Thank  you!   chris.a.ma=mann@nasa.gov   @chrisma=mann/Twi=er   h=p://sunset.usc.edu/ ~ma=mann/     Credit:  Vala  Afshar,  Extreme  Networks  
  • 51. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/nasa-big- data