© Hortonworks Inc. 2011
Bring Cartography to the Cloud
with Apache Hadoop
Nick Dimiduk
Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23
Beginnings…
Architecting the Future of Big Data
mapbox.com/blog/rendering-the-world/
bmander.com/dotmap/index.html
Definitions
car•tog•ra•phy
|kärˈtägrəfē|
noun

the science or practice of drawing maps.

rendering map tiles from some kind of
geographic data.
cloud
|kloud|
noun

a visible mass of condensed water vapor
floating in the atmosphere, typically high
above the ground.

on demand consumption of
computation and storage resources.
Background
Apache Hadoop in Review
•  Apache Hadoop Distributed Filesystem (HDFS)
–  Distributed, fault-tolerant, throughput-optimized data storage
–  Uses a filesystem analogy, not structured tables
–  The Google File System, 2003, Ghemawat et al.
–  http://research.google.com/archive/gfs.html
•  Apache Hadoop MapReduce (MR)
–  Distributed, fault-tolerant, batch-oriented data processing
–  Line- or record-oriented processing of the entire dataset *
–  “[Application] schema on read”
–  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
–  http://research.google.com/archive/mapreduce.html
* For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
MapReduce in Detail
highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
What we care about
$ map < input | sort | reduce > output
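The one-liner above can be mimicked in a single Python process. This is only a sketch of the idea, not Hadoop's implementation: `run_pipeline`, `toy_mapper` and `toy_reducer` are hypothetical names, and the toy job simply counts points per tile key.

```python
from itertools import groupby


def run_pipeline(records, mapper, reducer):
    """Mimic `map < input | sort | reduce > output` in one process."""
    # Map: each input record may yield any number of (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: bring identical keys next to each other, as the shuffle would.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: hand each key and all of its values to the reducer once.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]


# Hypothetical toy job: count how many points land on each tile key.
def toy_mapper(record):
    tile, point = record
    yield tile, point


def toy_reducer(tile, points):
    return tile, len(points)


result = run_pipeline(
    [("10,22,6", "a"), ("2,5,4", "b"), ("10,22,6", "c")],
    toy_mapper, toy_reducer)
```

Keys are compared as plain strings, so "10,22,6" sorts before "2,5,4". That is fine here: the reducer only needs equal keys to be adjacent, not numerically ordered.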
How Seamlessly?
$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"
$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"
To the Code!
github.com/ndimiduk/tilebrute
Our Tools
•  Python + GIS
–  GDAL
–  Shapely
–  Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob
Prepare the Input
TIGER/Line Shapefiles
www.census.gov/geo/maps-data/data/tiger-line.html
$ tail -n6 bin/00_prepare_input.sh
ogr2ogr `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326 `: reproject the data` \
  -f CSV `: in CSV format` \
  $OUTPUT `: producing output file` \
  $INPUT `: from input file` \
  -lco GEOMETRY=AS_WKT `: including geometries as WKT`
$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
Map: Sample Geometries
[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']
def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)
$ map < input | sort | reduce > output
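The slide names `read_feature` and `sample_geometry` without showing their bodies. A rough standard-library sketch of what they might do follows. It assumes one sampled point per counted person (in the spirit of the dot map linked at the start), handles only the outer ring of a simple POLYGON, and swaps in rejection sampling with a ray-casting test where the real project may well rely on Shapely. `parse_ring` and `point_in_ring` are hypothetical helpers, and the input square is made up.

```python
import csv
import io
import random


def read_feature(lines):
    """Yield (wkt, population) pairs from the ogr2ogr CSV, header included."""
    for row in csv.DictReader(lines):
        yield row["WKT"], int(row["POP10"])


def parse_ring(wkt):
    """Return the outer ring of a 'POLYGON ((x y, ...))' string as floats."""
    body = wkt[wkt.index("((") + 2:wkt.index(")")]
    return [tuple(float(v) for v in pair.split()) for pair in body.split(",")]


def point_in_ring(x, y, ring):
    """Ray casting: count ring edges crossed by a ray going right from (x, y)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def sample_geometry(wkt, count, rng=random):
    """Yield `count` random (lng, lat) points that fall inside the polygon."""
    ring = parse_ring(wkt)
    xs, ys = zip(*ring)
    found = 0
    while found < count:
        # Draw from the bounding box and keep only the hits.
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_ring(x, y, ring):
            found += 1
            yield x, y


# Made-up input: a small square with population 5, not real census data.
sample = io.StringIO(
    "WKT,POP10\n"
    '"POLYGON ((-118.8 47.2,-118.7 47.2,-118.7 47.3,-118.8 47.3,-118.8 47.2))",5\n')
rng = random.Random(42)
points = [pt for wkt, pop in read_feature(sample)
          for pt in sample_geometry(wkt, pop, rng)]
```

Two caveats: a zero-area polygon would make the sampling loop spin forever, and under Hadoop streaming only one input split carries the CSV header, so a real mapper needs a different way to recognize columns.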
Map: Sample Geometries
$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214
$ map < input | sort | reduce > output
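The output above is consistent with Web Mercator meters as values and top-origin (XYZ-style) tile indices as keys, one key per zoom level from 4 through 17. A sketch of a `make_kv` along those lines follows. The function body, the helper names `to_mercator` and `tile_for`, and the exact value format are assumptions, not the project's code.

```python
import math

EARTH_RADIUS = 6378137.0          # WGS84 semi-major axis, in meters
ORIGIN = math.pi * EARTH_RADIUS   # half the Web Mercator extent, in meters


def to_mercator(lng, lat):
    """Project WGS84 degrees (EPSG:4326) to Web Mercator meters."""
    mx = EARTH_RADIUS * math.radians(lng)
    my = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return mx, my


def tile_for(mx, my, zoom):
    """Return the (tx, ty) tile holding a point, counting ty from the top."""
    scale = (1 << zoom) / (2 * ORIGIN)
    return int((mx + ORIGIN) * scale), int((ORIGIN - my) * scale)


def make_kv(lat, lng, zooms=range(4, 18)):
    """Yield one ('tx,ty,z', 'mx my') pair per zoom level for a point."""
    mx, my = to_mercator(lng, lat)
    for zoom in zooms:
        tx, ty = tile_for(mx, my, zoom)
        yield "%d,%d,%d" % (tx, ty, zoom), "%.5f %.5f" % (mx, my)


# The point printed on the slide, checked at its lowest and highest zoom.
low = tile_for(-13224181.65427, 5981084.37214, 4)
high = tile_for(-13224181.65427, 5981084.37214, 17)
```

Emitting the same point once per zoom level multiplies the mapper's output by the number of zoom levels, which is the brute-force trade the project name hints at.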
Sort
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581
$ map < input | sort | reduce > output
Reduce: Draw Tiles
def main():
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        im = mapnik.Image(256, 256)
        mapnik.render(map, im)
        emit(tile, encode_image(im))
$ map < input | sort | reduce > output
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 |
python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
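To experiment with the reduce step without installing Mapnik, a stand-in renderer is enough. The sketch below is not the project's reducer: it swaps Mapnik for a hand-rolled 8-bit grayscale PNG writer, assumes tab-separated `key<TAB>mx my` input lines, and uses hypothetical names (`encode_png`, `draw_tile`, `reduce_lines`).

```python
import base64
import struct
import zlib
from itertools import groupby

ORIGIN = 20037508.342789244  # half the Web Mercator extent, in meters
SIZE = 256                   # tile edge, in pixels


def encode_png(rows):
    """Encode rows of 8-bit grayscale pixels as PNG bytes."""
    def chunk(kind, data):
        crc = zlib.crc32(kind + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)
    # IHDR: width, height, bit depth 8, color type 0 (gray), no interlace.
    header = struct.pack(">IIBBBBB", len(rows[0]), len(rows), 8, 0, 0, 0, 0)
    raw = b"".join(b"\x00" + bytes(row) for row in rows)  # filter 0 per row
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", header)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


def draw_tile(tile, points):
    """Plot Mercator points as black pixels on a white tile; return base64."""
    tx, ty, zoom = (int(part) for part in tile.split(","))
    scale = (1 << zoom) / (2 * ORIGIN)
    rows = [bytearray(b"\xff" * SIZE) for _ in range(SIZE)]
    for mx, my in points:
        px = int(((mx + ORIGIN) * scale - tx) * SIZE)
        py = int(((ORIGIN - my) * scale - ty) * SIZE)
        if 0 <= px < SIZE and 0 <= py < SIZE:
            rows[py][px] = 0
    return base64.b64encode(encode_png(rows)).decode("ascii")


def reduce_lines(sorted_lines):
    """Group sorted input lines by tile key and yield (tile, image) pairs."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for tile, group in groupby(parsed, key=lambda kv: kv[0]):
        points = [tuple(float(v) for v in value.split()) for _, value in group]
        yield tile, draw_tile(tile, points)


lines = ["10,22,6\t-13224414.42332 5983539.01581\n",
         "10,22,6\t-13225723.87449 5981201.60336\n"]
tiles = dict(reduce_lines(lines))
```

Like the original, this depends on `groupby` seeing sorted input: each tile is rendered once only if all of its points arrive together.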
Write Output
public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
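For trying the same zoom/tx/ty.png layout on a laptop, the writer above can be mirrored against the local filesystem. A minimal Python sketch, assuming the same 'tx,ty,z' key and base64 payload; `write_tile` is a hypothetical name and the payload is a placeholder rather than a real image.

```python
import base64
import tempfile
from pathlib import Path


def write_tile(output_dir, tile_id, tile_b64):
    """Decode a base64 tile and store it at <zoom>/<tx>/<ty>.png."""
    tx, ty, zoom = tile_id.split(",")
    tile_path = Path(output_dir) / zoom / tx / (ty + ".png")
    tile_path.parent.mkdir(parents=True, exist_ok=True)
    tile_path.write_bytes(base64.b64decode(tile_b64))
    return tile_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder payload; a real run would pass the reducer's image bytes.
    payload = base64.b64encode(b"not a real image").decode("ascii")
    written = write_tile(tmp, "10,22,6", payload)
    relative = written.relative_to(tmp).as_posix()
    stored = written.read_bytes()
```

The resulting directory tree can be served by any static file host, which is what makes the later S3 step straightforward.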
To the Cloud!
Basic Services: EC2, S3
•  EC2: Elastic Compute Cloud
–  Virtual machines on demand
–  Different “instance types” with different hardware profiles
–  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
•  S3: Simple Storage Service
–  Distributed, replicated storage
–  Native Hadoop integration
–  Also exposed over http(s), easy tile hosting
Add-on Service: EMR
•  EMR: Elastic MapReduce
–  “Hadoop as a Service”
–  On-demand, pre-installed and configured Hadoop clusters
–  +1: standardized provisioning, deployment, monitoring
–  -1: “stable” (old) software
MrJob: Python for EMR
class TileBrute(MRJob):
    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')
github.com/Yelp/mrjob
Results
14z, 2624x, 5722y
How much code?
$ find -f src -f bin | egrep '.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56 T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           4             69            105            299
Bourne Shell                     8             51             85            210
Java                             2             25             16             74
-------------------------------------------------------------------------------
SUM:                            14            145            206            583
-------------------------------------------------------------------------------
Performance
•  1 x m1.large (2 cores)
–  195575 input features (WA state)
–  3 zoom levels (6, 7, 8)
–  1 hour
•  19 x c1.xlarge (152 cores)
–  308745538 input features (all data)
–  3 zoom levels (6, 7, 8)
–  3 hours 15 minutes
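One way to compare the two runs is input features per core-hour. The arithmetic below uses only the figures on this slide. The metric itself is an assumption about what matters, and the runs differ in instance type and in data, so the ratio is indicative rather than a clean scaling measurement.

```python
def features_per_core_hour(features, cores, hours):
    """Average number of input features processed per core per hour."""
    return features / (cores * hours)


# 1 x m1.large: 195575 features, 2 cores, 1 hour.
small = features_per_core_hour(195575, cores=2, hours=1.0)
# 19 x c1.xlarge: 308745538 features, 152 cores, 3 hours 15 minutes.
large = features_per_core_hour(308745538, cores=152, hours=3.25)
ratio = large / small

print("m1.large:  %.0f features per core-hour" % small)
print("c1.xlarge: %.0f features per core-hour" % large)
print("ratio:     %.1fx" % ratio)
```

That works out to roughly 98 thousand features per core-hour on the single m1.large against roughly 625 thousand on the c1.xlarge cluster, a ratio of about 6.4.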
TODOs
•  Macro-level performance optimizations (configuration)
–  Balancing mappers and reducers, memory allocation, &c.
–  On-demand Hadoop means tuning the cluster to the application
•  Micro-level performance optimizations (code)
–  Smarter sampling logic
–  Mapnik API considerations
–  Multi-threaded S3 PUTs
–  https://forums.aws.amazon.com/thread.jspa?threadID=125135
•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?
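For the MBTiles item, a sketch using Python's built-in sqlite3 module shows roughly what the writer would involve. The table and column names follow the MBTiles layout. The helpers `open_mbtiles` and `put_tile` are hypothetical, the payload is a placeholder, and the row flip assumes the job's 'tx,ty,z' keys count rows from the top, which the sample mapper output appears consistent with.

```python
import sqlite3


def open_mbtiles(path):
    """Open a SQLite file and create the tables the MBTiles layout expects."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS metadata (name TEXT, value TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS tiles ("
               "zoom_level INTEGER, tile_column INTEGER, "
               "tile_row INTEGER, tile_data BLOB)")
    db.execute("CREATE UNIQUE INDEX IF NOT EXISTS tile_index "
               "ON tiles (zoom_level, tile_column, tile_row)")
    return db


def put_tile(db, tile_id, png_bytes):
    """Store one tile, flipping ty because MBTiles counts rows from the bottom."""
    tx, ty, zoom = (int(part) for part in tile_id.split(","))
    tms_row = (1 << zoom) - 1 - ty
    db.execute("INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
               (zoom, tx, tms_row, sqlite3.Binary(png_bytes)))


db = open_mbtiles(":memory:")
put_tile(db, "2,5,4", b"placeholder bytes, not a real PNG")
db.commit()
row = db.execute(
    "SELECT zoom_level, tile_column, tile_row FROM tiles").fetchone()
```

A single SQLite file is awkward to write from many reducers at once, so a real job would probably need a final merge step or one file per reducer.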
Thanks!
Manning
Nick Dimiduk, Amandeep Khurana
Foreword by Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com

Contenu connexe

En vedette

HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)Nick Dimiduk
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixNick Dimiduk
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 ReleaseNick Dimiduk
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 

En vedette (13)

HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - Phoenix
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 

Similaire à Bring Cartography to the Cloud

도시건축설계와 오픈소스 기반 GIS
도시건축설계와 오픈소스 기반 GIS도시건축설계와 오픈소스 기반 GIS
도시건축설계와 오픈소스 기반 GISmac999
 
도시 설계와 GIS 기술의 관계
도시 설계와 GIS 기술의 관계도시 설계와 GIS 기술의 관계
도시 설계와 GIS 기술의 관계Tae wook kang
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
Making pig fly optimizing data processing on hadoop presentation
Making pig fly  optimizing data processing on hadoop presentationMaking pig fly  optimizing data processing on hadoop presentation
Making pig fly optimizing data processing on hadoop presentationMd Rasool
 
Best practices for_managing_geospatial_data1
Best practices for_managing_geospatial_data1Best practices for_managing_geospatial_data1
Best practices for_managing_geospatial_data1Leng Kim Leng
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformGoDataDriven
 
State of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open SourceState of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open SourceOSCON Byrum
 
Hadoop: Beyond MapReduce
Hadoop: Beyond MapReduceHadoop: Beyond MapReduce
Hadoop: Beyond MapReduceSteve Loughran
 
XQuery - The GSD (Getting Stuff Done) language
XQuery - The GSD (Getting Stuff Done) languageXQuery - The GSD (Getting Stuff Done) language
XQuery - The GSD (Getting Stuff Done) languagejimfuller2009
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTJames Chittenden
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
NCGIC The Geospatial Revolution
NCGIC The Geospatial RevolutionNCGIC The Geospatial Revolution
NCGIC The Geospatial RevolutionPeter Batty
 
GIS in the Rockies Geospatial Revolution
GIS in the Rockies Geospatial RevolutionGIS in the Rockies Geospatial Revolution
GIS in the Rockies Geospatial RevolutionPeter Batty
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramSkillspeed
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and futureCodemotion
 
Developing Spatial Applications with Google Maps and CARTO
Developing Spatial Applications with Google Maps and CARTODeveloping Spatial Applications with Google Maps and CARTO
Developing Spatial Applications with Google Maps and CARTOCARTO
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Economies of Scaling Software
Economies of Scaling SoftwareEconomies of Scaling Software
Economies of Scaling SoftwareJoshua Long
 

Similaire à Bring Cartography to the Cloud (20)

도시건축설계와 오픈소스 기반 GIS
도시건축설계와 오픈소스 기반 GIS도시건축설계와 오픈소스 기반 GIS
도시건축설계와 오픈소스 기반 GIS
 
도시 설계와 GIS 기술의 관계
도시 설계와 GIS 기술의 관계도시 설계와 GIS 기술의 관계
도시 설계와 GIS 기술의 관계
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
Making pig fly optimizing data processing on hadoop presentation
Making pig fly  optimizing data processing on hadoop presentationMaking pig fly  optimizing data processing on hadoop presentation
Making pig fly optimizing data processing on hadoop presentation
 
Best practices for_managing_geospatial_data1
Best practices for_managing_geospatial_data1Best practices for_managing_geospatial_data1
Best practices for_managing_geospatial_data1
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
State of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open SourceState of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open Source
 
Hadoop: Beyond MapReduce
Hadoop: Beyond MapReduceHadoop: Beyond MapReduce
Hadoop: Beyond MapReduce
 
XQuery - The GSD (Getting Stuff Done) language
XQuery - The GSD (Getting Stuff Done) languageXQuery - The GSD (Getting Stuff Done) language
XQuery - The GSD (Getting Stuff Done) language
 
Web mapswithleaflet
Web mapswithleafletWeb mapswithleaflet
Web mapswithleaflet
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoT
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
NCGIC The Geospatial Revolution
NCGIC The Geospatial RevolutionNCGIC The Geospatial Revolution
NCGIC The Geospatial Revolution
 
GIS in the Rockies Geospatial Revolution
GIS in the Rockies Geospatial RevolutionGIS in the Rockies Geospatial Revolution
GIS in the Rockies Geospatial Revolution
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and future
 
Developing Spatial Applications with Google Maps and CARTO
Developing Spatial Applications with Google Maps and CARTODeveloping Spatial Applications with Google Maps and CARTO
Developing Spatial Applications with Google Maps and CARTO
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Economies of Scaling Software
Economies of Scaling SoftwareEconomies of Scaling Software
Economies of Scaling Software
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

Bring Cartography to the Cloud

  • 1. © Hortonworks Inc. 2011 Bring Cartography to the Cloud with Apache Hadoop Nick Dimiduk Member of Technical Staff, HBase FOSS4G-NA, 2013-05-23 Page 1
  • 2. © Hortonworks Inc. 2011 Beginnings… Page 2 Architecting the Future of Big Data mapbox.com/blog/ rendering-the-world/ bmander.com/dotmap/index.html
  • 3. © Hortonworks Inc. 2011 Definitions Page 3 Architecting the Future of Big Data car•tog•ra•phy |kärˈtägrəәfē| noun the science or practice of drawing maps. rendering map tiles from some kind of geographic data. cloud |kloud| noun a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground. on demand consumption of computation and storage resources.
  • 4. © Hortonworks Inc. 2011 Background Architecting the Future of Big Data Page 4
  • 5. © Hortonworks Inc. 2011 Apache Hadoop in Review •  Apache Hadoop Distributed Filesystem (HDFS) –  Distributed, fault-tolerant, throughput-optimized data storage –  Uses a filesystem analogy, not structured tables –  The Google File System, 2003, Ghemawat et al. –  http://research.google.com/archive/gfs.html •  Apache Hadoop MapReduce (MR) –  Distributed, fault-tolerant, batch-oriented data processing –  Line- or record-oriented processing of the entire dataset * –  “[Application] schema on read” –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat –  http://research.google.com/archive/mapreduce.html Page 5 Architecting the Future of Big Data * For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 6. © Hortonworks Inc. 2011 MapReduce in Detail Page 6 Architecting the Future of Big Data highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 7. © Hortonworks Inc. 2011 MapReduce in Detail Page 7 Architecting the Future of Big Data highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 8. What we care about
      $ map < input | sort | reduce > output
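
The one-line pipeline on slide 8 can be mimicked in a single Python process. The sketch below is only an illustration of the map, sort, reduce contract; `run_pipeline`, `tile_mapper`, and `count_reducer` are hypothetical names, not code from the tilebrute repository, and the reducer just counts points per tile instead of drawing anything.

```python
from itertools import groupby
from operator import itemgetter


def run_pipeline(records, mapper, reducer):
    """Simulate `map < input | sort | reduce > output` in one process."""
    # Map: every input record may emit zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: bring equal keys next to each other, as `sort` or Hadoop's shuffle does.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of that key's values.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def tile_mapper(line):
    # Hypothetical mapper: split "tx,ty,z px py" into a tile key and a point.
    key, _, value = line.partition(" ")
    yield key, value


def count_reducer(key, values):
    # Hypothetical reducer: count the points that landed in each tile.
    return key, len(values)


records = ["10,22,6 -1 2", "5,11,5 -1 2", "10,22,6 -3 4"]
result = run_pipeline(records, tile_mapper, count_reducer)
print(result)
```

Because the reducer sees one key at a time with all of its values, the same mapper and reducer can later run as separate processes with `sort` (or Hadoop) in between.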
  • 9. How Seamlessly?
      $ git show e65731e:bin/10_simulated_hadoop.sh
      gzcat "$INPUT_FILES" \
        | python "${PYTHON_DIR}/sample_shapes.py" \
        | sort \
        | python "${PYTHON_DIR}/draw_tiles.py"
      $ git show e65731e:bin/11_hadoop_local.sh
      hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
        -input /tmp/input.csv \
        -output "$OUTPUT_DIR" \
        -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
        -reducer "python ${PYTHON_DIR}/draw_tiles.py"
  • 10. To the Code! github.com/ndimiduk/tilebrute
  • 11. Our Tools
      • Python + GIS
        – GDAL
        – Shapely
        – Mapnik
      • Java
      • Apache Hadoop
      • Bash
      • MrJob
  • 12. Prepare the Input
      TIGER/Line Shapefiles
      www.census.gov/geo/maps-data/data/tiger-line.html
      $ tail -n6 bin/00_prepare_input.sh
      ogr2ogr `: invoke gdal tool ogr2ogr` \
        -t_srs epsg:4326 `: reproject the data` \
        -f CSV `: in CSV format` \
        $OUTPUT `: producing output file` \
        $INPUT `: from input file` \
        -lco GEOMETRY=AS_WKT `: including geometries as WKT`
      $ head -n2 /tmp/input.csv
      WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
      "POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
  • 13. Prepare the Input (same content as slide 12)
  • 14. Map: Sample Geometries
      [,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']
      def main():
          for geom, population in read_feature(stdin):
              for lng, lat in sample_geometry(geom, population):
                  for key, val in make_kv(lat, lng):
                      emit(key, val)
      $ map < input | sort | reduce > output
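
The mapper on slide 14 starts by reading (geometry, population) pairs from the CSV prepared on slide 12. A minimal sketch of that first step, using only the standard library, could look like the following. The column names `WKT` and `POP10` come from the header row shown on slide 12; the body of `read_feature` is an assumption (the slide shows only its call site), and the sample polygon is a made-up triangle because the slide elides the real coordinates.

```python
import csv
import io


def read_feature(stream):
    """Yield (wkt, population) for each row of the ogr2ogr CSV output."""
    for row in csv.DictReader(stream):
        # WKT and POP10 are the column names from the slide's header row.
        yield row["WKT"], int(row["POP10"])


# Hypothetical input: the real header, plus one row with a stand-in geometry.
sample = io.StringIO(
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((0 0,1 0,1 1,0 0))",53,001,950100,1042,530019501001042,N,1,5\n'
)
features = list(read_feature(sample))
print(features)
```

The `csv` module handles the quoted WKT field, so the commas inside the polygon do not split the row.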
  • 15. Map: Sample Geometries
      $ head -n1 input.csv | python -m tilebrute.sample_shapes
      2,5,4 -13224181.65427 5981084.37214
      5,11,5 -13224181.65427 5981084.37214
      10,22,6 -13224181.65427 5981084.37214
      21,44,7 -13224181.65427 5981084.37214
      43,89,8 -13224181.65427 5981084.37214
      87,179,9 -13224181.65427 5981084.37214
      174,359,10 -13224181.65427 5981084.37214
      348,718,11 -13224181.65427 5981084.37214
      696,1436,12 -13224181.65427 5981084.37214
      1392,2873,13 -13224181.65427 5981084.37214
      2785,5746,14 -13224181.65427 5981084.37214
      5571,11493,15 -13224181.65427 5981084.37214
      11142,22986,16 -13224181.65427 5981084.37214
      22284,45973,17 -13224181.65427 5981084.37214
      $ map < input | sort | reduce > output
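
The keys on slide 15 are consistent with the common XYZ tile scheme (origin at the top-left corner) applied to Web Mercator coordinates in metres. The sketch below reproduces them under that assumption; `lnglat_to_meters`, `tile_key`, and `make_keys` are hypothetical helpers written for illustration, not the repository's `make_kv`, and the zoom range 4 to 17 is simply read off the slide's output.

```python
import math

# Half the width of the Web Mercator square, in metres (pi * Earth radius).
ORIGIN_SHIFT = math.pi * 6378137.0


def lnglat_to_meters(lng, lat):
    """Project WGS84 degrees (EPSG:4326) to Web Mercator metres."""
    x = lng / 180.0 * ORIGIN_SHIFT
    y = math.log(math.tan(math.radians(90.0 + lat) / 2.0)) / math.pi * ORIGIN_SHIFT
    return x, y


def tile_key(x, y, zoom):
    """Return 'tx,ty,z' for a point in metres, counting rows from the top."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    tx = math.floor((x + ORIGIN_SHIFT) / (2 * ORIGIN_SHIFT) * n)
    ty = math.floor((ORIGIN_SHIFT - y) / (2 * ORIGIN_SHIFT) * n)
    return "%d,%d,%d" % (tx, ty, zoom)


def make_keys(x, y, min_zoom=4, max_zoom=17):
    """One key per zoom level, so each point lands in one tile per level."""
    return [tile_key(x, y, z) for z in range(min_zoom, max_zoom + 1)]


# The projected point from the slide's sample output.
keys = make_keys(-13224181.65427, 5981084.37214)
print(keys[0], keys[-1])
```

Emitting the same point once per zoom level is what lets the reducer draw every zoom's tiles independently, at the cost of multiplying the shuffle volume by the number of levels.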
  • 16. Sort
      $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
      10,22,6 -13224414.42332 5983539.01581
      10,22,6 -13225723.87449 5981201.60336
      10,22,6 -13225793.67181 5983127.53706
      10,22,6 -13226046.70101 5983375.66839
      10,22,6 -13226331.90155 5984272.31303
      11138,22981,16 -13226331.90155 5984272.31303
      11139,22983,16 -13225793.67181 5983127.53706
      11139,22983,16 -13226046.70101 5983375.66839
      11139,22986,16 -13225723.87449 5981201.60336
      11141,22982,16 -13224414.42332 5983539.01581
      $ map < input | sort | reduce > output
  • 17. Reduce: Draw Tiles
      def main():
          for tile, points in groupby(read_points(stdin), lambda x: x[0]):
              zoom = get_zoom(tile)
              map = init_map(zoom, points)
              map.zoom_all()
              im = mapnik.Image(256, 256)
              mapnik.render(map, im)
              emit(tile, encode_image(im))
      $ map < input | sort | reduce > output
      $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 | python -m tilebrute.draw_tiles
      10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
  • 18. Write Output
      public void write(Text tileId, Text tile) throws IOException {
        String[] tileIdSplits = tileId.toString().split(",");
        assert tileIdSplits.length == 3;
        String tx = tileIdSplits[0];
        String ty = tileIdSplits[1];
        String zoom = tileIdSplits[2];
        Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
        fs.mkdirs(tilePath.getParent());
        byte[] buf = Base64.decodeBase64(tile.toString());
        final FSDataOutputStream fout = fs.create(tilePath, progress);
        fout.write(buf);
        fout.close();
      }
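
For readers who want to try the output step without a Hadoop cluster, here is a rough Python counterpart to the Java `write` method on slide 18, targeting the local filesystem instead of HDFS or S3. `write_tile` is a hypothetical helper, and the payload is only the 8-byte PNG signature rather than a rendered tile; it is used because its Base64 form starts with the same `iVBORw0KGgo` prefix seen in the reducer output on slide 17.

```python
import base64
import os
import tempfile

# Every PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def write_tile(output_dir, tile_id, tile_b64):
    """Decode a Base64 tile and store it as <zoom>/<tx>/<ty>.png."""
    tx, ty, zoom = tile_id.split(",")
    tile_path = os.path.join(output_dir, zoom, tx, ty + ".png")
    # Create the zoom/tx directories on first use.
    os.makedirs(os.path.dirname(tile_path), exist_ok=True)
    with open(tile_path, "wb") as fout:
        fout.write(base64.b64decode(tile_b64))
    return tile_path


out_dir = tempfile.mkdtemp()
# Stand-in payload: just the PNG signature, not a real rendered tile.
payload = base64.b64encode(PNG_SIGNATURE).decode("ascii")
path = write_tile(out_dir, "10,22,6", payload)
print(payload, path)
```

The `zoom/tx/ty.png` layout matches the path built in the Java code, which is also the layout most web map clients expect when tiles are served over HTTP.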
  • 19. To the Cloud!
  • 20. Basic Services: EC2, S3
      • EC2: Elastic Compute Cloud
        – Virtual machines on demand
        – Different “instance types” with different hardware profiles
        – m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
      • S3: Simple Storage Service
        – Distributed, replicated storage
        – Native Hadoop integration
        – Also exposed over http(s), easy tile hosting
  • 21. Add-on Service: EMR
      • EMR: Elastic MapReduce
        – “Hadoop as a Service”
        – On-demand, pre-installed and configured Hadoop clusters
        – +1: standardized provisioning, deployment, monitoring
        – -1: “stable” (old) software
  • 22. MrJob: Python for EMR
      class TileBrute(MRJob):
          HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

          def mapper_cmd(self):
              return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

          def reducer_cmd(self):
              return bash_wrap('$PYTHON -m tilebrute.draw_tiles')
      github.com/Yelp/mrjob
  • 23. Results
  • 24. (no text on slide)
  • 25. 14z, 2624x, 5722y
  • 26. 14z, 2624x, 5722y
  • 27. How much code?
      $ find -f src -f bin | egrep '.(java|sh|py)$' | grep -v test | xargs cloc --quiet
      http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
      -------------------------------------------------------------------------------
      Language                     files          blank        comment           code
      -------------------------------------------------------------------------------
      Python                           4             69            105            299
      Bourne Shell                     8             51             85            210
      Java                             2             25             16             74
      -------------------------------------------------------------------------------
      SUM:                            14            145            206            583
      -------------------------------------------------------------------------------
  • 28. Performance
      • 1 x m1.large (2 cores)
        – 195575 input features (WA state)
        – 3 zoom levels (6, 7, 8)
        – 1 hour
      • 19 x c1.xlarge (152 cores)
        – 308745538 input features (all data)
        – 3 zoom levels (6, 7, 8)
        – 3 hours 15 minutes
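
As a rough way to compare the two runs on slide 28, the figures can be normalised to input features per core-hour. This is back-of-the-envelope arithmetic only: it assumes the core counts on the slide, treats 3 hours 15 minutes as 3.25 hours, and ignores that the two instance types have different cores and that fixed startup costs weigh more on the short run. `features_per_core_hour` is a name introduced here for illustration.

```python
def features_per_core_hour(features, cores, hours):
    """Input features processed per core per hour of wall-clock time."""
    return features / (cores * hours)


# 1 x m1.large: WA state only, 2 cores, 1 hour.
single = features_per_core_hour(195575, cores=2, hours=1.0)
# 19 x c1.xlarge: all data, 19 * 8 = 152 cores, 3 h 15 min.
cluster = features_per_core_hour(308745538, cores=19 * 8, hours=3.25)

print(round(single), round(cluster), round(cluster / single, 1))
```

On these numbers the cluster run processes several times more features per core-hour than the single-node run, though the slide alone does not say how much of that comes from faster cores, data density, or amortised overhead.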
  • 29. TODOs
      • Macro-level performance optimizations (configuration)
        – Balancing mappers and reducers, memory allocation, &c.
        – On-demand Hadoop means tuning the cluster to the application
      • Micro-level performance optimizations (code)
        – Smarter sampling logic
        – Mapnik API considerations
        – Multi-threaded S3 PUTs
        – https://forums.aws.amazon.com/thread.jspa?threadID=125135
      • Write tiles in MBTiles format
      • Write tiles to HBase
      • Compression!
      • Ogrbrute?
  • 30. Thanks!
      Manning: Nick Dimiduk, Amandeep Khurana; foreword by Michael Stack. hbaseinaction.com
      Nick Dimiduk: github.com/ndimiduk, @xefyr, n10k.com