Parallel Distributed Image Stacking and Mosaicing with Hadoop (Hadoop Summit 2010)
1. Astronomical Image Processing with Hadoop
Keith Wiley*
Survey Science Group
Dept. of Astronomy, Univ. of Washington
* Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Gardner, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus
2. Session Agenda
Astronomical Survey Science
Image Coaddition
Implementing Coaddition within MapReduce
Optimizing the Coaddition Process
Conclusions
Future Work
3. Astronomical Topics of Study
Dark energy
Large-scale structure of the universe
Gravitational lensing
Asteroid detection/tracking
4. What is Astronomical Survey Science?
Dedicated sky surveys, usually from a single calibrated telescope/camera pair.
Run for years at a time.
Gather millions of images, totaling TBs of storage*.
Require high-throughput data reduction pipelines.
Require sophisticated off-line data analysis tools.
* Next-generation surveys will gather PBs of image data.
5. Sky Surveys: Today and Tomorrow
SDSS* (1999-2005):
› Founded in part by UW
› 1/4 of the sky
› 80TBs total data
LSST† (2015-2025):
› 8.4m mirror, 3.2-gigapixel camera
› Half the sky every three nights
› 30TB per night... one SDSS every three nights
› 60PBs total (nonstop ten years)
› 1000s of exposures of each location
* Sloan Digital Sky Survey
† Large Synoptic Survey Telescope
6. FITS (Flexible Image Transport System)
Common astronomical image representation file format.
Metadata tags (like EXIF):
› Most importantly: precise astrometry*
› Other:
• Geolocation (telescope location)
• Sky conditions, image quality, etc.
Bottom line:
› An image format that knows where it is looking.
* Position on sky
7. Image Coaddition
Given multiple partially overlapping images
and a query (color and sky bounds):
› Find images’ intersections with the query bounds.
› Project bitmaps to the bounds.
› Stack and mosaic into a final product.
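The three steps above can be sketched in miniature. This is an illustrative Python toy, not the authors' pipeline: images are flat-valued, axis-aligned rectangles on an integer sky grid, so "projection" reduces to clipping, and only the intersect/project/stack structure carries over.

```python
# Toy coaddition: intersect each image with the query bounds, "project"
# (here: clip to the grid), then stack and mosaic by per-pixel averaging.
# All names and the flat-rectangle image model are illustrative.

def intersect(a, b):
    """Intersection of two (x0, y0, x1, y1) bounds, or None if disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def coadd(images, query):
    """images: (bounds, flat pixel value) pairs. Returns {(x, y): mean}."""
    total, count = {}, {}
    for bounds, value in images:
        clip = intersect(bounds, query)
        if clip is None:
            continue  # no overlap with the query: image is discarded
        for x in range(clip[0], clip[2]):
            for y in range(clip[1], clip[3]):
                total[(x, y)] = total.get((x, y), 0.0) + value
                count[(x, y)] = count.get((x, y), 0) + 1
    return {p: total[p] / count[p] for p in total}  # stack = mean

images = [((0, 0, 2, 2), 1.0), ((1, 1, 3, 3), 3.0), ((5, 5, 6, 6), 9.0)]
result = coadd(images, (0, 0, 3, 3))
print(result[(1, 1)])  # covered by both overlapping images -> 2.0
```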
8. Image Coaddition
Given multiple partially overlapping images
and a query (color and sky bounds):
› Find images’ intersections with the query bounds. (cheap)
› Project bitmaps to the bounds. (expensive)
› Stack and mosaic into a final product. (cheap)
9. Image Stacking (Signal Averaging)
Stacking improves SNR:
› Makes fainter objects visible.
Example (SDSS, Stripe 82):
› Top: Single image, R-band
› Bottom: 79-deep stack
• (~9x SNR improvement)
• Numerous additional detections
Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.
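The √N behavior behind that ~9x figure can be demonstrated in a few lines (a self-contained simulation, not survey data): averaging 79 synthetic exposures shrinks the noise by about √79 ≈ 8.9.

```python
# Signal averaging: stacking N exposures of the same scene cuts the noise
# standard deviation by ~sqrt(N), so a 79-deep stack gives ~9x the SNR.
import random
import statistics

random.seed(42)
SCENE = [10.0] * 5000                 # flat synthetic "sky", known signal

def expose():
    """One exposure: the scene plus independent Gaussian noise."""
    return [p + random.gauss(0.0, 2.0) for p in SCENE]

def stack(exposures):
    """Equal-weight per-pixel mean (the simplest stacking algorithm)."""
    n = len(exposures)
    return [sum(pix) / n for pix in zip(*exposures)]

single = expose()
deep = stack([expose() for _ in range(79)])
noise_1 = statistics.stdev(v - t for v, t in zip(single, SCENE))
noise_79 = statistics.stdev(v - t for v, t in zip(deep, SCENE))
print(round(noise_1 / noise_79, 1))   # close to sqrt(79), i.e. ~8.9
```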
10. Coaddition in Hadoop
Mappers (parallel), one per input FITS image from HDFS:
› Detect intersection with the query bounds.
› Project the bitmap to the query’s coordinate system.
Reducer (serial):
› Stack and mosaic the mapper-projected intersections into the final coadd.
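The parallel/serial split can be sketched with plain Python functions standing in for the authors' Hadoop Mapper and Reducer classes (the names, the single-key shuffle, and the flat-value image model are all illustrative):

```python
# Mappers run in parallel: each tests one image against the query bounds,
# "projects" the overlap (faked as identity here), and emits it under a
# single key. The lone reducer then serially stacks everything it receives.

def overlaps(bounds, query):
    """True if two (x0, y0, x1, y1) sky bounds intersect."""
    return (bounds[0] < query[2] and query[0] < bounds[2] and
            bounds[1] < query[3] and query[1] < bounds[3])

def mapper(image, query):
    bounds, value = image
    if overlaps(bounds, query):
        yield ("coadd", value)        # one key -> one reducer (serial stage)

def reducer(values):
    vals = list(values)
    return sum(vals) / len(vals)      # stack: equal-weight mean

# Simulate the job: run every mapper, shuffle outputs to the reducer.
images = [((0, 0, 2, 2), 1.0), ((1, 1, 3, 3), 3.0), ((8, 8, 9, 9), 9.0)]
shuffled = [v for img in images for (_, v) in mapper(img, (0, 0, 4, 4))]
final_coadd = reducer(shuffled)
print(final_coadd)                    # third image discarded -> 2.0
```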
11. Driver Prefiltering
To assist the process we prefilter the FITS files in the driver.
SDSS camera has 30 CCDs:
› 5 colors
› 6 abutting strips of sky
Prefilter (path glob) by color and sky coverage (single axis):
› Excludes many irrelevant FITS files.
› Sky coverage filter is only single-axis:
• Thus, false positives slip through... to be discarded in the mappers.
(Figure: query bounds over relevant, prefilter-excluded, and false-positive FITS footprints.)
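A minimal sketch of that driver-side path-glob filter, assuming a hypothetical path layout of fits/&lt;color&gt;/&lt;strip&gt;/frame-*.fits (the real SDSS naming differs):

```python
# Driver prefilter: keep only paths matching the queried color and the sky
# strips that can overlap the query bounds (single-axis test, so some false
# positives survive and must be discarded later in the mappers).
from fnmatch import fnmatch

def passes_prefilter(path, color, candidate_strips):
    return any(fnmatch(path, "fits/%s/%s/frame-*.fits" % (color, strip))
               for strip in candidate_strips)

paths = [
    "fits/r/strip2/frame-000001.fits",   # right color, right strip
    "fits/g/strip2/frame-000002.fits",   # wrong color
    "fits/r/strip5/frame-000003.fits",   # wrong strip
]
kept = [p for p in paths if passes_prefilter(p, "r", ["strip1", "strip2"])]
print(kept)  # only the first path survives
```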
12. Performance Analysis
Running time:
› 2 query sizes: 1° sq (3885 FITS) and ¼° sq (465 FITS).
› Run against 1/10th of SDSS (100,058 FITS).
(Chart: running time in minutes vs. query sky bounds extent. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Considering the small dataset, this is too slow!
› Remember 42 minutes for the next slide.
13. Performance Analysis
Breakdown of large query running time by Hadoop stage: Deglob Input Paths, Construct FileSplits, Mapper Done, Reducer Done, runJob(), main().
main() is the sum of:
› Driver
› runJob()
runJob() is the sum of the MapReduce parts.
(Chart: minutes per Hadoop stage, split into Driver, MapReduce, runJob(), and main().)
14. Performance Analysis
Breakdown of large query running time: total job time vs. RPCs from client to HDFS.
Observation:
› Running time is dominated by RPCs from the client to HDFS to process 1000s of FITS file paths.
Conclusion:
› Need to reduce the number of files.
15. Sequence Files
Sequence files group many small files into a few large files.
Just what we need!
Real-time images may not be amenable to logical grouping.
› Therefore, sequence files are filled in an arbitrary manner: a hash function assigns each FITS filename to a sequence file.
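The arbitrary assignment above might look like this (an illustrative sketch; the deck's hashed database used 360 sequence files, but the hash function itself is an assumption):

```python
# Hashed packing: a stable hash of each FITS filename picks one of a fixed
# number of sequence files, with no regard to sky position or color.
import hashlib

NUM_SEQ_FILES = 360   # size of the hashed sequence-file database above

def seq_file_index(filename):
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SEQ_FILES

used = {seq_file_index("frame-%06d.fits" % i) for i in range(10000)}
print(len(used))  # 10,000 filenames spread over essentially all 360 files
```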
16. Performance Analysis
Comparison:
› FITS input vs. unstructured sequence file input*.
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› 5x speedup!
Hmmm... can we do better?
* 360 seq files in hashed seq DB.
17. Performance Analysis
Same comparison, annotated:
› The FITS runs processed only a subset of the database (after prefiltering).
› The sequence file runs processed the entire database, but still ran faster.
Conclusion:
› 5x speedup!
Hmmm... can we do better?
* 360 seq files in hashed seq DB.
18. Structured Sequence Files
Similar to the way we prefiltered FITS files...
SDSS camera has 30 CCDs:
› 5 colors
› 6 abutting strips of sky
› Thus, 30 sequence file types.
Prefilter by color and sky coverage (single axis):
› Exclude irrelevant sequence files.
› Still have false positives.
› Catch them in the mappers as before.
(Figure: query bounds over relevant, prefilter-excluded, and false-positive FITS footprints.)
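A sketch of the 30-way structured layout (strip names are made up; the u, g, r, i, z filter set is SDSS's real one):

```python
# Structured packing: key each sequence file by (color, strip), giving
# 5 colors x 6 strips = 30 sequence-file types; the driver then opens only
# the types that match the queried color and candidate strips.
COLORS = ["u", "g", "r", "i", "z"]
STRIPS = ["strip%d" % i for i in range(1, 7)]

ALL_TYPES = ["%s_%s" % (c, s) for c in COLORS for s in STRIPS]

def prefilter_types(color, candidate_strips):
    """Sequence-file types surviving the color + single-axis sky filter."""
    wanted = {"%s_%s" % (color, s) for s in candidate_strips}
    return [t for t in ALL_TYPES if t in wanted]

print(len(ALL_TYPES))                              # 30
print(prefilter_types("r", ["strip2", "strip3"]))  # ['r_strip2', 'r_strip3']
```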
19. Performance Analysis
Comparison:
› FITS vs. unstructured sequence* vs. structured sequence files†.
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Another 2x speedup for the large query, 1.5x speedup for the small query.
* 360 seq files in hashed seq DB.
† 1080 seq files in structured DB.
20. Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
Prediction:
› Prefiltering should gain performance in the mapper.
Does it?
(Chart: minutes per Hadoop stage.)
21. Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
Prediction:
› Prefiltering should gain performance in the mapper.
Conclusion:
› Yep, just as expected.
(Chart: minutes per Hadoop stage.)
22. Performance Analysis
Experiments were performed on a 100,058-FITS database (1/10th SDSS).
How much of this database is Hadoop churning through?
(Chart: minutes for FITS, hashed sequence, and grouped/prefiltered sequence inputs, 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
23. Performance Analysis
Comparison:
› Number of FITS files considered in mappers vs. number contributing to the coadd.
Annotations for the 1° sq query (3885 contributing FITS):
› FITS input: 13,415 FITS files, 499 mappers‡
› Hashed sequence input: 100,058 FITS files, 360 sequence files*, 2077 mappers‡
› Structured sequence input: 13,335 FITS files, 144 sequence files†, 341 mappers‡
Conclusion:
› Mappers must discard many FITS files due to nonoverlap of query bounds.
* 360 seq files in hashed seq DB.
† 1080 seq files in structured DB.
‡ 800 mapper slots on cluster.
24. Using SQL to Find Intersections
Store all image colors and sky bounds in a database:
› First, query color and intersections via SQL.
› Second, send only relevant images to MapReduce.
Consequence:
› All images processed by mappers contribute to the coadd.
› No time wasted considering irrelevant images.
(Figure: (1) the driver retrieves from the SQL database the FITS filenames that overlap the query bounds; (2) only those images are loaded into MapReduce.)
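The scheme can be sketched with sqlite3 standing in for whatever database the authors used (the schema, filenames, and coordinates here are invented):

```python
# SQL prefilter: store each image's color and sky bounds, then one query
# returns exactly the filenames whose footprints overlap the query bounds.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fits
               (name TEXT, color TEXT, x0 REAL, y0 REAL, x1 REAL, y1 REAL)""")
con.executemany("INSERT INTO fits VALUES (?, ?, ?, ?, ?, ?)", [
    ("a.fits", "r", 0, 0, 2, 2),   # overlaps the query below
    ("b.fits", "r", 5, 5, 7, 7),   # right color, no overlap
    ("c.fits", "g", 0, 0, 2, 2),   # overlaps, wrong color
])

qx0, qy0, qx1, qy1 = 1, 1, 3, 3    # query sky bounds
relevant = [row[0] for row in con.execute(
    """SELECT name FROM fits
       WHERE color = ? AND x0 < ? AND ? < x1 AND y0 < ? AND ? < y1""",
    ("r", qx1, qx0, qy1, qy0))]
print(relevant)  # ['a.fits'] -- only these are sent on to MapReduce
```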
25. Performance Analysis
Comparison:
› nonSQL vs. SQL (hashed and grouped sequence files, each with and without SQL prefiltering).
(Chart: minutes for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries.)
Conclusion:
› Sigh, no major improvement (SQL is not remarkably superior to nonSQL for the given pairs of bars).
26. Performance Analysis
Comparable performance here makes sense:
› In essence, prefiltering and SQL performed similar tasks, albeit with 3.5x different mapper inputs (FITS).
Chart annotations (mapper input records / launched mappers):
› 1° sq: Seq hashed 100,058 / 2077; Seq grouped 13,335 / 341; SQL → Seq hashed 3885 / 1714; SQL → Seq grouped 3885 / 338.
› ¼° sq: Seq hashed 100,058 / 2149; Seq grouped 6674 / 144; SQL → Seq hashed 465 / 499; SQL → Seq grouped 465 / 128.
Conclusion:
› Cost of discarding many images in the nonSQL case was negligible.
27. Performance Analysis
Low improvement for SQL in the hashed case is surprising at first...
› ...especially considering 26x different mapper inputs (100,058 vs. 3885 FITS).
(Chart annotations as on the previous slide.)
28. Performance Analysis
Low improvement for SQL in the hashed case is surprising at first...
› ...especially considering 26x different mapper inputs (100,058 vs. 3885 FITS).
Theory:
› The scattered distribution of relevant FITS files prevented efficient mapper reuse.
› Mapper requirements exceeded cluster capacity (~800 slots). We lost parallelism (mappers were queued)!
29. Results
Just to be clear:
› Prefiltering improved due to reduction of mapper load.
› SQL improved due to data locality and more efficient mapper allocation; the required work was unaffected (3885 FITS).
(Chart annotations as on the previous slides.)
30. Utility of SQL Method
Despite our results
(which show SQL to be equivalent to prefiltering)...
...we predict that SQL should outperform prefiltering on
larger databases.
Why?
› Prefiltering would contend with an increasing number of false positives in the mappers*.
› SQL would incur little additional overhead.
No experiments on this yet.
* A space-filling curve for grouping the data might help.
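The footnote's space-filling-curve idea can be sketched with a Z-order (Morton) key, one common choice (the slide does not name a specific curve, so this is illustrative): interleaving the bits of a sky cell's (x, y) indices gives a 1-D sort order in which nearby cells tend to stay nearby, so grouping sequence files by key ranges would keep spatially adjacent images together.

```python
# Z-order (Morton) key: interleave the bits of x and y so that sorting by
# the key roughly preserves 2-D locality on the sky.

def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return key

cells = [(7, 7), (0, 1), (1, 1), (0, 0), (1, 0)]
print(sorted(cells, key=lambda c: morton_key(*c)))
# the 2x2 block of neighboring cells sorts together; (7, 7) lands last
```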
31. Conclusions
Packing many small files into a few large files is essential.
Structured packing and associated prefiltering offer significant gains (reduced mapper load).
SQL prefiltering of unstructured sequence files yields little
improvement (failure to combine scattered HDFS file-splits
leads to mapper bloat).
SQL prefiltering of structured sequence files performs
comparably to driver prefiltering, but we anticipate superior
performance on larger databases.
On a shared cluster (e.g., the cloud), performance variance is high; this doesn’t bode well for online applications and makes precise performance profiling difficult.
32. Future Work
Parallelize the reducer.
Less conservative CombineFileSplit builder.
Conversion to C++, usage of existing C++ libraries.
Query by time-range.
Increase complexity of projection/interpolation:
› PSF matching
Increase complexity of stacking algorithm:
› Convert straight sum to weighted sum by image quality.
Work toward the larger science goals:
› Study the evolution of galaxies.
› Look for moving objects (asteroids, comets).
› Implement fast parallel machine learning algorithms for
detection/classification of anomalies.