Analyzing Air Quality Measurements in Macedonia
with Apache Drill
Author: Marjan Sterjev

This article provides an example of JSON data analysis with Apache Drill. The "toy" model is based on publicly available air quality measurement data.

Apache Drill (https://drill.apache.org/) is a schema-free SQL engine for analyzing Big Data coming from disparate data sources in various data formats. Drill can query data stored in HBase, Hive, HDFS, S3, MongoDB, etc. Its engine is especially powerful for analyzing JSON data records:

https://drill.apache.org/docs/json-data-model/

Two key constructs when dealing with JSON records are the functions KVGEN and FLATTEN, which are described in detail at the link above.

This article sketches a procedure for collecting publicly available air quality measurement data and analyzing that data with Drill. The air quality measurement data for Macedonia is available at the following address:

http://airquality.moepp.gov.mk

The data is available for various periods, regions and stations:

http://airquality.moepp.gov.mk/?page_id=4

Collect air quality measurement data

We will collect one week of PM10 air quality data from the measurement stations in the Western Region (Bitola1, Bitola2, Lazaropole, Kicevo, Tetovo), the Eastern Region (Veles1, Veles2, Kocani, Kavadarci, Kumanovo) and 3 stations in Skopje (Centar, Karpos and Lisice). You can obtain the data by copying it from a browser developer tool (Firebug, for example), or you can use curl:

    curl -o air_measurement_east.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=EasternRegion&parameter=PM10&endDate=2015-12-04&timeMode=Week"
    curl -o air_measurement_west.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=WesternRegion&parameter=PM10&endDate=2015-12-04&timeMode=Week"
    curl -o air_measurement_centar.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Centar&parameter=PM10&endDate=2015-12-04&timeMode=Week"
    curl -o air_measurement_karpos.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Karpos&parameter=PM10&endDate=2015-12-04&timeMode=Week"
    curl -o air_measurement_lisice.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Lisice&parameter=PM10&endDate=2015-12-04&timeMode=Week"

The data retrieved has the following format (excerpt):

    {"parameter":"PM10","measurements":{
    "20151128 11": {"Bitola1":"13.26","Bitola2":"47.42","Kicevo":"31.52","Lazaropole":"","Tetovo":"106.59"},
    "20151128 12": {"Bitola1":"8.42","Bitola2":"47.12","Kicevo":"45.28","Lazaropole":"","Tetovo":"106.59"},

If we observe the data format we can see that:

  1. The structure is nested: the individual measurements are located at the third level of the JSON structure.
  2. The field names correspond to the station names, i.e. there is no fixed schema in the structure.

We need to create a view for this data that will allow us to run standard SQL queries for data analysis.

Download and install Apache Drill

Download Drill from https://drill.apache.org/ and unzip the bundle in a folder of your choice.

Start Apache Drill

If you are a Windows user, navigate to the bin directory located in the Drill installation folder and start the engine in embedded mode:

    sqlline -u "jdbc:drill:zk=local"

Linux users can run the following command from the same directory:

    drill-embedded

Once Drill is started, you can access its web console at:

http://localhost:8047

Create Views

Navigate to the Query tab in the Drill UI:

http://localhost:8047/query

For each of the air quality data files collected:

  • air_measurement_west.json
  • air_measurement_east.json
  • air_measurement_centar.json
  • air_measurement_karpos.json
  • air_measurement_lisice.json

we will create a corresponding view:

    CREATE OR REPLACE VIEW dfs.tmp.air_measurement_<<replace_it>> AS
    SELECT
      TO_TIMESTAMP(dmt1.date_hour, 'YYYYMMdd HH') AS `timestamp`,
      dmt1.station_measurement.key AS station,
      CAST(CONCAT('0', dmt1.station_measurement.`value`) AS FLOAT) AS measure
    FROM (
      SELECT
        dmt.dm.key AS date_hour,
        FLATTEN(KVGEN(dmt.dm.`value`)) AS `station_measurement`
      FROM (
        SELECT FLATTEN(KVGEN(aq.measurements)) dm
        FROM dfs.`C:/ml/air_measurement_<<replace_it>>.json` aq
      ) dmt
    ) dmt1

Once the per-file views are created, we create the final union view, which combines all of the data:

    CREATE OR REPLACE VIEW dfs.tmp.air_measurement AS
    SELECT * FROM dfs.tmp.air_measurement_west
    UNION ALL
    SELECT * FROM dfs.tmp.air_measurement_east
    UNION ALL
    SELECT * FROM dfs.tmp.air_measurement_centar
    UNION ALL
    SELECT * FROM dfs.tmp.air_measurement_karpos
    UNION ALL
    SELECT * FROM dfs.tmp.air_measurement_lisice

Note that the created views are persistent and will survive Apache Drill restarts.

Analyze Data

With the final view created, we have the full SQL tool set available for analyzing the air quality measurement data. For example, we can group the data per station and find the average measurement for the data collected:

    SELECT station, AVG(measure) AS avg_measure
    FROM dfs.tmp.air_measurement
    GROUP BY station
    ORDER BY avg_measure DESC
The result is:

Table 1. Average air quality measurement per station

We can query and filter temporal data as well. The average air quality measurement early in the morning is:

    SELECT station, AVG(measure) avg_measure
    FROM dfs.tmp.air_measurement
    WHERE EXTRACT(hour FROM `timestamp`) < 8
    GROUP BY station
    ORDER BY avg_measure DESC
Table 2. Average air quality measurement per station early in the morning

The average air quality measurement in the evening is:

    SELECT station, AVG(measure) avg_measure
    FROM dfs.tmp.air_measurement
    WHERE EXTRACT(hour FROM `timestamp`) >= 18
    GROUP BY station
    ORDER BY avg_measure DESC
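The two hour filters used in the morning and evening queries can be checked outside Drill with a few lines of Python. This is only a sketch: the helper names parse_date_hour, is_early_morning and is_evening are made up for illustration, and the two keys are the ones shown in the sample response earlier in the article.

```python
from datetime import datetime


def parse_date_hour(key):
    """Parse a measurement key such as '20151128 11' into a datetime."""
    return datetime.strptime(key, "%Y%m%d %H")


def is_early_morning(ts):
    # Mirrors: EXTRACT(hour FROM `timestamp`) < 8
    return ts.hour < 8


def is_evening(ts):
    # Mirrors: EXTRACT(hour FROM `timestamp`) >= 18
    return ts.hour >= 18


# Keys taken from the sample response shown in the article.
for key in ("20151128 11", "20151128 12"):
    ts = parse_date_hour(key)
    print(key, "->", ts.isoformat(), is_early_morning(ts), is_evening(ts))
```

Readings taken between 08:00 and 17:59 match neither predicate, so the morning and evening averages are computed over disjoint subsets of the data.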
Table 3. Average air quality measurement per station in the evening

For example, we can conclude that the air quality in Bitola degrades in the evenings compared with the mornings.

The data sets used in this “toy” demonstration were small. However, Drill is designed to work with very large data sets, and you can apply your existing SQL knowledge to those large data sets as well.
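To make the reshaping done by the per-file views more concrete, the following Python sketch reproduces the same idea outside Drill: it turns the nested measurements object into (timestamp, station, measure) rows and averages them per station. It is an illustration rather than a replacement for the Drill views. The function names are made up for this example, and the input is the two-hour excerpt shown earlier in the article, closed off with brackets so that it parses.

```python
import json
from collections import defaultdict
from datetime import datetime

# Excerpt of the sample response from the article, with closing
# brackets added so that it is valid JSON.
SAMPLE = """
{"parameter":"PM10","measurements":{
"20151128 11": {"Bitola1":"13.26","Bitola2":"47.42","Kicevo":"31.52","Lazaropole":"","Tetovo":"106.59"},
"20151128 12": {"Bitola1":"8.42","Bitola2":"47.12","Kicevo":"45.28","Lazaropole":"","Tetovo":"106.59"}
}}
"""


def flatten_measurements(doc):
    """Yield one (timestamp, station, measure) row per station reading."""
    for date_hour, stations in doc["measurements"].items():
        ts = datetime.strptime(date_hour, "%Y%m%d %H")
        for station, value in stations.items():
            # Same trick as CAST(CONCAT('0', value) AS FLOAT) in the view:
            # an empty string becomes "0" and therefore 0.0.
            yield ts, station, float("0" + value)


def average_per_station(rows):
    """Return {station: average measure} for the given rows."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, station, measure in rows:
        totals[station][0] += measure
        totals[station][1] += 1
    return {station: total / count for station, (total, count) in totals.items()}


rows = list(flatten_measurements(json.loads(SAMPLE)))
averages = average_per_station(rows)
for station, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{station}: {avg:.2f}")
```

One consequence of the CONCAT('0', ...) trick is visible here: a station with no reading for a given hour (Lazaropole in the excerpt) contributes 0 to its average instead of being skipped. Whether that is acceptable depends on the analysis, so it is worth keeping in mind when reading the averages.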
