This document describes using Apache Drill to analyze air quality measurement data from Macedonia. It involves collecting PM10 air quality data for various regions and stations in Macedonia through API calls. Views are then created in Drill to structure the JSON data. Finally, SQL queries are run on the views to analyze and summarize the data, such as finding the average air quality measurements by station and at different times of day.
Analyzing Air Quality Measurements in Macedonia with Apache Drill
Author: Marjan Sterjev
Apache Drill (https://drill.apache.org/) is a schema-free SQL engine for analyzing big data coming from
disparate data sources in various data formats. Drill can query data stored in HBase, Hive,
HDFS, S3, MongoDB, etc.
Its engine is especially powerful for analyzing JSON data records:
https://drill.apache.org/docs/json-data-model/
Two key constructs for dealing with JSON records are the functions KVGEN and FLATTEN,
which are described in detail at the link above.
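To make the two functions more concrete, here is a minimal Python sketch that approximates what they do. It is not Drill code: `kvgen` and `flatten` are hypothetical helper names, the record is made-up sample data, and the sketch only mirrors the documented idea (KVGEN turns a map with arbitrary field names into a list of key/value pairs, FLATTEN emits one row per list element).

```python
def kvgen(mapping):
    """Approximate KVGEN: turn a map into a list of {"key", "value"} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows, column):
    """Approximate FLATTEN: emit one output row per element of the list in `column`."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy the other columns unchanged
            new_row[column] = element  # replace the list with a single element
            out.append(new_row)
    return out


# Hypothetical record whose field names ("a", "b") are data, not schema.
record = {"id": 1, "scores": {"a": "1.5", "b": "2.5"}}

rows = [{"id": record["id"], "pair": kvgen(record["scores"])}]
flat = flatten(rows, "pair")

for row in flat:
    print(row["id"], row["pair"]["key"], row["pair"]["value"])
```

After the two steps, the former field names are ordinary column values that can be filtered and grouped.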
This article sketches a procedure for collecting publicly available air quality measurement data
and analyzing that data with Drill.
In particular, the air quality measurement data for Macedonia is available at the following address:
http://airquality.moepp.gov.mk
The data is available for various periods, regions and stations:
http://airquality.moepp.gov.mk/?page_id=4
Collect air quality measurement data
We will collect PM10 air quality data for a period of one week from the air quality
measurement stations in the Western Region (Bitola1, Bitola2, Lazaropole, Kicevo, Tetovo), the Eastern
Region (Veles1, Veles2, Kocani, Kavadarci, Kumanovo) and 3 stations in Skopje (Centar, Karpos and
Lisice). You can obtain the data by copying it from your browser's plugins (Firebug, for example), or you
can use curl:
curl -o air_measurement_east.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=EasternRegion&parameter=PM10&endDate=2015-12-04&timeMode=Week"
curl -o air_measurement_west.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=WesternRegion&parameter=PM10&endDate=2015-12-04&timeMode=Week"
curl -o air_measurement_centar.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Centar&parameter=PM10&endDate=2015-12-04&timeMode=Week"
curl -o air_measurement_karpos.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Karpos&parameter=PM10&endDate=2015-12-04&timeMode=Week"
curl -o air_measurement_lisice.json "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php?graph=StationLineGraph&station=Lisice&parameter=PM10&endDate=2015-12-04&timeMode=Week"
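The five requests differ only in the station parameter and the output file name, so they can also be scripted. The following is a sketch using only the Python standard library; the function names `build_url` and `download_all` are hypothetical, and it assumes the endpoint still accepts the same query parameters as the curl commands above.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

BASE_URL = "http://airquality.moepp.gov.mk/graphs/site/pages/MakeGraph.php"

# Output file suffix -> value of the `station` query parameter.
STATIONS = {
    "east": "EasternRegion",
    "west": "WesternRegion",
    "centar": "Centar",
    "karpos": "Karpos",
    "lisice": "Lisice",
}


def build_url(station, end_date="2015-12-04"):
    """Build the request URL for one station or region (one week of PM10 data)."""
    query = urlencode({
        "graph": "StationLineGraph",
        "station": station,
        "parameter": "PM10",
        "endDate": end_date,
        "timeMode": "Week",
    })
    return f"{BASE_URL}?{query}"


def download_all(target_dir="."):
    """Fetch every data set into air_measurement_<suffix>.json (needs network access)."""
    for suffix, station in STATIONS.items():
        urlretrieve(build_url(station), f"{target_dir}/air_measurement_{suffix}.json")
```

Calling `download_all()` would produce the same five files as the curl commands, provided the service is reachable.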
The data retrieved has the following format:
{"parameter":"PM10","measurements":{
"20151128 11":
{"Bitola1":"13.26","Bitola2":"47.42","Kicevo":"31.52","Lazaropole":"","Tetovo":"106.59"},
"20151128 12":
{"Bitola1":"8.42","Bitola2":"47.12","Kicevo":"45.28","Lazaropole":"","Tetovo":"106.59"},
If we observe the data format we can see that:
1. The structure is nested: the individual measurements are located at the third level of the JSON
structure, below the date-and-hour keys.
2. The field names correspond to the station names, i.e. there is no fixed schema in the structure.
We need to create a view for this data that will allow us to run standard SQL queries for data
analysis.
Download and install Apache Drill
Download Drill from https://drill.apache.org/. Just unpack the bundle into a folder of your choice.
Start Apache Drill
If you are a Windows user, navigate to the bin directory located in the Drill installation folder and start
the engine in embedded mode:
sqlline -u "jdbc:drill:zk=local"
Linux users can run the following command from the Drill installation folder:
bin/drill-embedded
Once Drill is started, you can access its web console at:
http://localhost:8047
Create Views
Navigate to the Query tab in the Drill UI:
http://localhost:8047/query
For each of the air quality data files collected:
• air_measurement_west.json
• air_measurement_east.json
• air_measurement_centar.json
• air_measurement_karpos.json
• air_measurement_lisice.json
we will create a corresponding view:
CREATE OR REPLACE VIEW
  dfs.tmp.air_measurement_<<replace_it>>
AS
SELECT
  TO_TIMESTAMP(dmt1.date_hour,'YYYYMMdd HH') AS `timestamp`,
  dmt1.station_measurement.key AS station,
  CAST(CONCAT('0',dmt1.station_measurement.`value`) AS FLOAT) AS measure
FROM
(
  SELECT
    dmt.dm.key AS date_hour,
    FLATTEN(KVGEN(dmt.dm.`value`)) AS `station_measurement`
  FROM
  (
    SELECT FLATTEN(KVGEN(aq.measurements)) dm FROM
    dfs.`C:/ml/air_measurement_<<replace_it>>.json` aq
  ) dmt
) dmt1
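To see what rows the view above yields, the same transformation can be sketched in plain Python. This is an illustration only, not how Drill executes the query: `to_rows` is a hypothetical helper, and the sample is the excerpt shown earlier with the closing braces added so that it parses.

```python
import json
from datetime import datetime

# The excerpt from the article, closed with "}}" so that it is valid JSON.
SAMPLE = """
{"parameter":"PM10","measurements":{
"20151128 11":
{"Bitola1":"13.26","Bitola2":"47.42","Kicevo":"31.52","Lazaropole":"","Tetovo":"106.59"},
"20151128 12":
{"Bitola1":"8.42","Bitola2":"47.12","Kicevo":"45.28","Lazaropole":"","Tetovo":"106.59"}}}
"""


def to_rows(document):
    """Return (timestamp, station, measure) tuples, one per station and hour."""
    rows = []
    # Outer loop plays the role of the inner FLATTEN(KVGEN(aq.measurements)).
    for date_hour, stations in document["measurements"].items():
        timestamp = datetime.strptime(date_hour, "%Y%m%d %H")
        # Inner loop plays the role of FLATTEN(KVGEN(dmt.dm.`value`)).
        for station, value in stations.items():
            # Same trick as CONCAT('0', value): "" becomes "0" (0.0),
            # while "13.26" becomes "013.26" (13.26).
            measure = float("0" + value)
            rows.append((timestamp, station, measure))
    return rows


rows = to_rows(json.loads(SAMPLE))
for row in rows:
    print(row)
```

Prefixing the value with '0' keeps the cast from failing on empty strings. One consequence worth keeping in mind is that missing readings, such as those of Lazaropole in the sample, enter the view as 0 and therefore count toward any average computed over it.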
Once the per-file views are created, we will create the final union view, which combines all of the
data:
CREATE OR REPLACE VIEW
dfs.tmp.air_measurement
AS
SELECT * FROM dfs.tmp.air_measurement_west
UNION ALL
SELECT * FROM dfs.tmp.air_measurement_east
UNION ALL
SELECT * FROM dfs.tmp.air_measurement_centar
UNION ALL
SELECT * FROM dfs.tmp.air_measurement_karpos
UNION ALL
SELECT * FROM dfs.tmp.air_measurement_lisice
Note that the created views are persistent: Drill stores each view definition as a file in the workspace, so they will survive Apache Drill restarts.
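Because the five per-file view definitions differ only in the file suffix, the statements can be generated instead of edited by hand. The sketch below only builds SQL strings from the statements shown above; `view_sql` and `union_view_sql` are hypothetical helper names, and the `C:/ml` directory is the one used in this article, so adjust it to your own setup.

```python
SUFFIXES = ["west", "east", "centar", "karpos", "lisice"]

VIEW_TEMPLATE = """CREATE OR REPLACE VIEW
  dfs.tmp.air_measurement_{suffix}
AS
SELECT
  TO_TIMESTAMP(dmt1.date_hour,'YYYYMMdd HH') AS `timestamp`,
  dmt1.station_measurement.key AS station,
  CAST(CONCAT('0',dmt1.station_measurement.`value`) AS FLOAT) AS measure
FROM
(
  SELECT
    dmt.dm.key AS date_hour,
    FLATTEN(KVGEN(dmt.dm.`value`)) AS `station_measurement`
  FROM
  (
    SELECT FLATTEN(KVGEN(aq.measurements)) dm FROM
    dfs.`{data_dir}/air_measurement_{suffix}.json` aq
  ) dmt
) dmt1"""


def view_sql(suffix, data_dir="C:/ml"):
    """Fill the placeholder of the per-file view definition for one data file."""
    return VIEW_TEMPLATE.format(suffix=suffix, data_dir=data_dir)


def union_view_sql(suffixes):
    """Build the final view that combines all per-file views with UNION ALL."""
    selects = [f"SELECT * FROM dfs.tmp.air_measurement_{s}" for s in suffixes]
    return ("CREATE OR REPLACE VIEW\n  dfs.tmp.air_measurement\nAS\n"
            + "\nUNION ALL\n".join(selects))


for suffix in SUFFIXES:
    print(view_sql(suffix), end="\n\n")
print(union_view_sql(SUFFIXES))
```

Each printed statement can then be pasted into the Query tab one at a time.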
Analyze Data
With the final view created, we have the full SQL tool set available for air quality measurement data
analysis. For example, we can group the data per station and find the average measurement for the
data collected:
SELECT
station, AVG(measure) as avg_measure
FROM
dfs.tmp.air_measurement
GROUP BY
station
ORDER BY
avg_measure DESC
The result is:
Table 1. Average air quality measurement per station
We can query and filter temporal data as well.
The average air quality measurement early in the morning (before 08:00) is:
SELECT
station, AVG(measure) avg_measure
FROM
dfs.tmp.air_measurement
WHERE
EXTRACT(hour FROM `timestamp`) <8
GROUP BY
station
ORDER BY
avg_measure DESC
Table 2. Average air quality measurement per station early in the morning
The average air quality measurement in the evening (18:00 and later) is:
SELECT
station, AVG(measure) avg_measure
FROM
dfs.tmp.air_measurement
WHERE
EXTRACT(hour FROM `timestamp`) >=18
GROUP BY
station
ORDER BY
avg_measure DESC
Table 3. Average air quality measurement per station in the evening
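The logic of the three queries (group by station, optionally filter by hour, average, sort in descending order) can be mimicked in a few lines of Python. This is a sketch for illustration: `average_by_station` is a hypothetical helper, and the rows are made-up values for two placeholder stations, not measurements from the data set.

```python
from collections import defaultdict
from datetime import datetime


def average_by_station(rows, hour_filter=lambda hour: True):
    """Average `measure` per station over rows whose hour passes `hour_filter`,
    sorted by average in descending order (like ORDER BY avg_measure DESC)."""
    totals = defaultdict(lambda: [0.0, 0])  # station -> [sum, count]
    for timestamp, station, measure in rows:
        if hour_filter(timestamp.hour):     # EXTRACT(hour FROM `timestamp`)
            totals[station][0] += measure
            totals[station][1] += 1
    averages = {s: total / count for s, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up (timestamp, station, measure) rows for two placeholder stations.
rows = [
    (datetime(2015, 11, 28, 6), "A", 10.0),
    (datetime(2015, 11, 28, 7), "A", 20.0),
    (datetime(2015, 11, 28, 19), "A", 40.0),
    (datetime(2015, 11, 28, 6), "B", 30.0),
    (datetime(2015, 11, 28, 20), "B", 50.0),
]

overall = average_by_station(rows)
morning = average_by_station(rows, lambda hour: hour < 8)
evening = average_by_station(rows, lambda hour: hour >= 18)
print(morning)
print(evening)
```

With these made-up rows, the morning averages are 30.0 for B and 15.0 for A, and the evening averages are 50.0 for B and 40.0 for A.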
For example, we can conclude that the air quality in Bitola degrades in the evenings compared with
the mornings.
The data sets used in this “toy” demonstration were small. However, Drill is designed to work with
very large data sets and you can apply your existing SQL knowledge to those large data sets as well.