Parallel Distributed Image Stacking and Mosaicing with Hadoop (Hadoop Summit 2010)
1. Astronomical Image Processing with Hadoop
Keith Wiley*
Survey Science Group
Dept. of Astronomy, Univ. of Washington
* Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Gardner, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus
2. Session Agenda
Astronomical Survey Science
Image Coaddition
Implementing Coaddition within MapReduce
Optimizing the Coaddition Process
Conclusions
Future Work
3. Astronomical Topics of Study
Dark energy
Large-scale structure of the universe
Gravitational lensing
Asteroid detection/tracking
4. What is Astronomical Survey Science?
Dedicated sky surveys, usually from a single calibrated telescope/camera pair.
Run for years at a time.
Gather millions of images, totaling TBs of storage*.
Require high-throughput data reduction pipelines.
Require sophisticated off-line data analysis tools.
* Next-generation surveys will gather PBs of image data.
5. Sky Surveys: Today and Tomorrow
SDSS* (1999-2005):
› Founded in part by UW
› 1/4 of the sky
› 80TBs total data
LSST† (2015-2025):
› 8.4m mirror, 3.2-gigapixel camera
› Half the sky every three nights
› 30TB per night... one SDSS every three nights
› 60PBs total (nonstop ten years)
› 1000s of exposures of each location
* Sloan Digital Sky Survey
† Large Synoptic Survey Telescope
6. FITS (Flexible Image Transport System)
Common astronomical image representation file format.
Metadata tags (like EXIF):
› Most importantly: precise astrometry*
› Other:
• Geolocation (telescope location)
• Sky conditions, image quality, etc.
Bottom line:
› An image format that knows where it is looking.
* Position on sky
7. Image Coaddition
Given multiple partially overlapping images
and a query (color and sky bounds):
› Find images’ intersections with the query bounds.
› Project bitmaps to the bounds.
› Stack and mosaic into a final product.
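The three steps above can be sketched in miniature. This is an illustrative Python toy, not the authors' pipeline: images are flat-valued, axis-aligned rectangles on an integer sky grid, so "projection" reduces to clipping, and only the intersect/project/stack structure carries over.

```python
# Toy coaddition: intersect each image with the query bounds, "project"
# (here: clip to the grid), then stack and mosaic by per-pixel averaging.
# All names and the flat-rectangle image model are illustrative.

def intersect(a, b):
    """Intersection of two (x0, y0, x1, y1) bounds, or None if disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def coadd(images, query):
    """images: (bounds, flat pixel value) pairs. Returns {(x, y): mean}."""
    total, count = {}, {}
    for bounds, value in images:
        clip = intersect(bounds, query)
        if clip is None:
            continue  # no overlap with the query: image is discarded
        for x in range(clip[0], clip[2]):
            for y in range(clip[1], clip[3]):
                total[(x, y)] = total.get((x, y), 0.0) + value
                count[(x, y)] = count.get((x, y), 0) + 1
    return {p: total[p] / count[p] for p in total}  # stack = mean

images = [((0, 0, 2, 2), 1.0), ((1, 1, 3, 3), 3.0), ((5, 5, 6, 6), 9.0)]
result = coadd(images, (0, 0, 3, 3))
print(result[(1, 1)])  # covered by both overlapping images -> 2.0
```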
8. Image Coaddition
Given multiple partially overlapping images
and a query (color and sky bounds):
› Find images’ intersections with the query bounds. (cheap)
› Project bitmaps to the bounds. (expensive)
› Stack and mosaic into a final product. (cheap)
9. Image Stacking (Signal Averaging)
Stacking improves SNR:
› Makes fainter objects visible.
Example (SDSS, Stripe 82):
› Top: Single image, R-band
› Bottom: 79-deep stack
• (~9x SNR improvement)
• Numerous additional detections
Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.
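The √N behavior behind that ~9x figure can be demonstrated in a few lines (a self-contained simulation, not survey data): averaging 79 synthetic exposures shrinks the noise by about √79 ≈ 8.9.

```python
# Signal averaging: stacking N exposures of the same scene cuts the noise
# standard deviation by ~sqrt(N), so a 79-deep stack gives ~9x the SNR.
import random
import statistics

random.seed(42)
SCENE = [10.0] * 5000                 # flat synthetic "sky", known signal

def expose():
    """One exposure: the scene plus independent Gaussian noise."""
    return [p + random.gauss(0.0, 2.0) for p in SCENE]

def stack(exposures):
    """Equal-weight per-pixel mean (the simplest stacking algorithm)."""
    n = len(exposures)
    return [sum(pix) / n for pix in zip(*exposures)]

single = expose()
deep = stack([expose() for _ in range(79)])
noise_1 = statistics.stdev(v - t for v, t in zip(single, SCENE))
noise_79 = statistics.stdev(v - t for v, t in zip(deep, SCENE))
print(round(noise_1 / noise_79, 1))   # close to sqrt(79), i.e. ~8.9
```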
10. Coaddition in Hadoop
Mappers (parallel), one per input FITS image from HDFS:
› Detect intersection with the query bounds.
› Project the bitmap to the query’s coordinate system.
Reducer (serial):
› Stack and mosaic the mapper-projected intersections into the final coadd.
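The parallel/serial split can be sketched with plain Python functions standing in for the authors' Hadoop Mapper and Reducer classes (the names, the single-key shuffle, and the flat-value image model are all illustrative):

```python
# Mappers run in parallel: each tests one image against the query bounds,
# "projects" the overlap (faked as identity here), and emits it under a
# single key. The lone reducer then serially stacks everything it receives.

def overlaps(bounds, query):
    """True if two (x0, y0, x1, y1) sky bounds intersect."""
    return (bounds[0] < query[2] and query[0] < bounds[2] and
            bounds[1] < query[3] and query[1] < bounds[3])

def mapper(image, query):
    bounds, value = image
    if overlaps(bounds, query):
        yield ("coadd", value)        # one key -> one reducer (serial stage)

def reducer(values):
    vals = list(values)
    return sum(vals) / len(vals)      # stack: equal-weight mean

# Simulate the job: run every mapper, shuffle outputs to the reducer.
images = [((0, 0, 2, 2), 1.0), ((1, 1, 3, 3), 3.0), ((8, 8, 9, 9), 9.0)]
shuffled = [v for img in images for (_, v) in mapper(img, (0, 0, 4, 4))]
final_coadd = reducer(shuffled)
print(final_coadd)                    # third image discarded -> 2.0
```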
11. Driver Prefiltering
To assist the process we prefilter the FITS files in the driver.
SDSS camera has 30 CCDs:
› 5 colors
› 6 abutting strips of sky
Prefilter (path glob) by color and sky coverage (single axis):
› Excludes many irrelevant FITS files.
› Sky coverage filter is only single-axis:
• Thus, false positives slip through... to be discarded in the mappers.
(Figure: query bounds over relevant, prefilter-excluded, and false-positive FITS footprints.)
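A minimal sketch of that driver-side path-glob filter, assuming a hypothetical path layout of fits/&lt;color&gt;/&lt;strip&gt;/frame-*.fits (the real SDSS naming differs):

```python
# Driver prefilter: keep only paths matching the queried color and the sky
# strips that can overlap the query bounds (single-axis test, so some false
# positives survive and must be discarded later in the mappers).
from fnmatch import fnmatch

def passes_prefilter(path, color, candidate_strips):
    return any(fnmatch(path, "fits/%s/%s/frame-*.fits" % (color, strip))
               for strip in candidate_strips)

paths = [
    "fits/r/strip2/frame-000001.fits",   # right color, right strip
    "fits/g/strip2/frame-000002.fits",   # wrong color
    "fits/r/strip5/frame-000003.fits",   # wrong strip
]
kept = [p for p in paths if passes_prefilter(p, "r", ["strip1", "strip2"])]
print(kept)  # only the first path survives
```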
12. Performance Analysis
Running time:
› 2 query sizes: 1° sq (3885 FITS) and ¼° sq (465 FITS).
› Run against 1/10th of SDSS (100,058 FITS).
(Chart: running time in minutes vs. query sky bounds extent. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Considering the small dataset, this is too slow!
› Remember 42 minutes for the next slide.
13. Performance Analysis
Breakdown of large query running time by Hadoop stage: Deglob Input Paths, Construct FileSplits, Mapper Done, Reducer Done, runJob(), main().
main() is the sum of:
› Driver
› runJob()
runJob() is the sum of the MapReduce parts.
(Chart: minutes per Hadoop stage, split into Driver, MapReduce, runJob(), and main().)
14. Performance Analysis
Breakdown of large query running time: total job time vs. RPCs from client to HDFS.
Observation:
› Running time is dominated by RPCs from the client to HDFS to process 1000s of FITS file paths.
Conclusion:
› Need to reduce the number of files.
15. Sequence Files
Sequence files group many small files into a few large files.
Just what we need!
Real-time images may not be amenable to logical grouping.
› Therefore, sequence files are filled in an arbitrary manner: a hash function assigns each FITS filename to a sequence file.
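The arbitrary assignment above might look like this (an illustrative sketch; the deck's hashed database used 360 sequence files, but the hash function itself is an assumption):

```python
# Hashed packing: a stable hash of each FITS filename picks one of a fixed
# number of sequence files, with no regard to sky position or color.
import hashlib

NUM_SEQ_FILES = 360   # size of the hashed sequence-file database above

def seq_file_index(filename):
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SEQ_FILES

used = {seq_file_index("frame-%06d.fits" % i) for i in range(10000)}
print(len(used))  # 10,000 filenames spread over essentially all 360 files
```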
16. Performance Analysis
Comparison:
› FITS input vs. unstructured sequence file input*.
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› 5x speedup!
Hmmm... can we do better?
* 360 seq files in hashed seq DB.
17. Performance Analysis
Same comparison, annotated:
› The FITS runs processed only a subset of the database (after prefiltering).
› The sequence file runs processed the entire database, but still ran faster.
Conclusion:
› 5x speedup!
Hmmm... can we do better?
* 360 seq files in hashed seq DB.
18. Structured Sequence Files
Similar to the way we prefiltered FITS files...
SDSS camera has 30 CCDs:
› 5 colors
› 6 abutting strips of sky
› Thus, 30 sequence file types.
Prefilter by color and sky coverage (single axis):
› Exclude irrelevant sequence files.
› Still have false positives.
› Catch them in the mappers as before.
(Figure: query bounds over relevant, prefilter-excluded, and false-positive FITS footprints.)
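A sketch of the 30-way structured layout (strip names are made up; the u, g, r, i, z filter set is SDSS's real one):

```python
# Structured packing: key each sequence file by (color, strip), giving
# 5 colors x 6 strips = 30 sequence-file types; the driver then opens only
# the types that match the queried color and candidate strips.
COLORS = ["u", "g", "r", "i", "z"]
STRIPS = ["strip%d" % i for i in range(1, 7)]

ALL_TYPES = ["%s_%s" % (c, s) for c in COLORS for s in STRIPS]

def prefilter_types(color, candidate_strips):
    """Sequence-file types surviving the color + single-axis sky filter."""
    wanted = {"%s_%s" % (color, s) for s in candidate_strips}
    return [t for t in ALL_TYPES if t in wanted]

print(len(ALL_TYPES))                              # 30
print(prefilter_types("r", ["strip2", "strip3"]))  # ['r_strip2', 'r_strip3']
```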
19. Performance Analysis
Comparison:
› FITS vs. unstructured sequence* vs. structured sequence files†.
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Another 2x speedup for the large query, 1.5x speedup for the small query.
* 360 seq files in hashed seq DB.
† 1080 seq files in structured DB.
20. Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
Prediction:
› Prefiltering should gain performance in the mapper.
Does it?
(Chart: minutes per Hadoop stage.)
21. Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
Prediction:
› Prefiltering should gain performance in the mapper.
Conclusion:
› Yep, just as expected.
(Chart: minutes per Hadoop stage.)
22. Performance Analysis
Experiments were performed on a 100,058-FITS database (1/10th SDSS).
How much of this database is Hadoop churning through?
(Chart: minutes for FITS, hashed sequence, and grouped/prefiltered sequence inputs, 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
23. Performance Analysis
Comparison:
› Number of FITS files considered in mappers vs. number contributing to the coadd.
Annotations for the 1° sq query (3885 contributing FITS):
› FITS input: 13,415 FITS files, 499 mappers‡
› Hashed sequence input: 100,058 FITS files, 360 sequence files*, 2077 mappers‡
› Structured sequence input: 13,335 FITS files, 144 sequence files†, 341 mappers‡
Conclusion:
› Mappers must discard many FITS files due to nonoverlap of query bounds.
* 360 seq files in hashed seq DB.
† 1080 seq files in structured DB.
‡ 800 mapper slots on cluster.
24. Using SQL to Find Intersections
Store all image colors and sky bounds in a database:
› First, query color and intersections via SQL.
› Second, send only relevant images to MapReduce.
Consequence:
› All images processed by mappers contribute to the coadd.
› No time wasted considering irrelevant images.
(Figure: (1) the driver retrieves from the SQL database the FITS filenames that overlap the query bounds; (2) only those images are loaded into MapReduce.)
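The scheme can be sketched with sqlite3 standing in for whatever database the authors used (the schema, filenames, and coordinates here are invented):

```python
# SQL prefilter: store each image's color and sky bounds, then one query
# returns exactly the filenames whose footprints overlap the query bounds.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fits
               (name TEXT, color TEXT, x0 REAL, y0 REAL, x1 REAL, y1 REAL)""")
con.executemany("INSERT INTO fits VALUES (?, ?, ?, ?, ?, ?)", [
    ("a.fits", "r", 0, 0, 2, 2),   # overlaps the query below
    ("b.fits", "r", 5, 5, 7, 7),   # right color, no overlap
    ("c.fits", "g", 0, 0, 2, 2),   # overlaps, wrong color
])

qx0, qy0, qx1, qy1 = 1, 1, 3, 3    # query sky bounds
relevant = [row[0] for row in con.execute(
    """SELECT name FROM fits
       WHERE color = ? AND x0 < ? AND ? < x1 AND y0 < ? AND ? < y1""",
    ("r", qx1, qx0, qy1, qy0))]
print(relevant)  # ['a.fits'] -- only these are sent on to MapReduce
```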
25. Performance Analysis
Comparison:
› nonSQL vs. SQL (hashed and grouped sequence files, each with and without SQL prefiltering).
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries.)
Conclusion:
› Sigh, no major improvement (SQL is not remarkably superior to nonSQL for the given pairs of bars).
26. Performance Analysis
Comparable performance here makes sense:
› In essence, prefiltering and SQL performed similar tasks, albeit with 3.5x different mapper inputs (FITS).
Chart annotations (mapper input records / launched mappers):
› 1° sq: Seq hashed 100,058 / 2077; Seq grouped 13,335 / 341; SQL → Seq hashed 3885 / 1714; SQL → Seq grouped 3885 / 338.
› ¼° sq: Seq hashed 100,058 / 2149; Seq grouped 6674 / 144; SQL → Seq hashed 465 / 499; SQL → Seq grouped 465 / 128.
Conclusion:
› Cost of discarding many images in the nonSQL case was negligible.
27. Performance Analysis
Low improvement for SQL in the hashed case is surprising at first...
› ...especially considering 26x different mapper inputs (100,058 vs. 3885 FITS).
(Chart annotations as on the previous slide.)
28. Performance Analysis
Low improvement for SQL in the hashed case is surprising at first...
› ...especially considering 26x different mapper inputs (100,058 vs. 3885 FITS).
Theory:
› The scattered distribution of relevant FITS files prevented efficient mapper reuse.
› Mapper requirements exceeded cluster capacity (~800 slots). We lost parallelism (mappers were queued)!
29. Results
Just to be clear:
› Prefiltering improved due to reduction of mapper load.
› SQL improved due to data locality and more efficient mapper allocation; the required work was unaffected (3885 FITS).
(Chart annotations as on the previous slides.)
30. Utility of SQL Method
Despite our results
(which show SQL to be equivalent to prefiltering)...
...we predict that SQL should outperform prefiltering on
larger databases.
Why?
› Prefiltering would contend with an increasing number of false positives in the mappers*.
› SQL would incur little additional overhead.
No experiments on this yet.
* A space-filling curve for grouping the data might help.
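The footnote's space-filling-curve idea can be sketched with a Z-order (Morton) key, one common choice (the slide does not name a specific curve, so this is illustrative): interleaving the bits of a sky cell's (x, y) indices gives a 1-D sort order in which nearby cells tend to stay nearby, so grouping sequence files by key ranges would keep spatially adjacent images together.

```python
# Z-order (Morton) key: interleave the bits of x and y so that sorting by
# the key roughly preserves 2-D locality on the sky.

def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return key

cells = [(7, 7), (0, 1), (1, 1), (0, 0), (1, 0)]
print(sorted(cells, key=lambda c: morton_key(*c)))
# the 2x2 block of neighboring cells sorts together; (7, 7) lands last
```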
31. Conclusions
Packing many small files into a few large files is essential.
Structured packing and associated prefiltering offer significant gains (reduced mapper load).
SQL prefiltering of unstructured sequence files yields little
improvement (failure to combine scattered HDFS file-splits
leads to mapper bloat).
SQL prefiltering of structured sequence files performs
comparably to driver prefiltering, but we anticipate superior
performance on larger databases.
On a shared cluster (e.g., the cloud), performance variance is high; this doesn’t bode well for online applications and makes precise performance profiling difficult.
32. Future Work
Parallelize the reducer.
Less conservative CombineFileSplit builder.
Conversion to C++, usage of existing C++ libraries.
Query by time-range.
Increase complexity of projection/interpolation:
› PSF matching
Increase complexity of stacking algorithm:
› Convert straight sum to weighted sum by image quality.
Work toward the larger science goals:
› Study the evolution of galaxies.
› Look for moving objects (asteroids, comets).
› Implement fast parallel machine learning algorithms for
detection/classification of anomalies.