Open Data and Web API
1. Open Data
and Web API
Sammy Fung
Technology Sharing (28/1/2016)
at Hong Kong Polytechnic University COMP
2. Sammy Fung
• President, Open Source Hong Kong.
• Conference Chair, Hong Kong Open Source Conference.
• Tags: Freelancer, Developer, Open Source, Open Data, Startup.
• Contacts:
• sammy@sammy.hk
• @sammyfung
• https://github.com/sammyfung
• The presentation slides will be made public on SlideShare under a CC license.
• Creative Commons BY-NC-SA (Attribution, Non-Commercial, ShareAlike)
5. 1. What is the current Air
Quality Health Index
(AQHI) of Causeway Bay?
6. AQHI
• Environmental Protection Department (EPD)
• Find it on the AQHI website run by EPD.
• http://www.aqhi.gov.hk/en.html
• What about the details of air quality in Causeway Bay?
Look into the Pollutant Concentration of CWB.
8. How does a software program
read the AQHI and Pollutant
Concentration of CWB (“data”)?
9. Software and Data
• A software program needs data as values.
• an integer or a float: eg. 2016, 6.89
• a character string: eg. “Causeway Bay”
• Retrieve data through an “Interface” for data.
• Application Programming Interface (API)
• Computer/Software/Program Readable Data Format.
• Humans communicate in a common language, eg. English,
Cantonese, Mandarin.
• Data Formats: eg. XML, JSON.
12. Software and Data
• Web Scraping: Retrieve documents from a website.
• Information Extraction & Transformation:
• Extract and transform data from a common data format
into data objects (variables) in a software program.
• eg. JSON -> Float(s)
• A “Clean Data” step is needed for “non-good” data formats.
• eg. HTML -> Float(s)
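The "JSON -> Float(s)" step can be sketched with Python's standard json module. The JSON text and its field names below are made up for illustration and are not taken from any real data feed.

```python
import json

# Hypothetical JSON document; the field names are assumptions for this sketch.
raw = '{"station": "Causeway Bay", "year": 2016, "readings": [6.89, 7.1]}'

# Parse the text into Python objects (dict, list, str, int, float).
record = json.loads(raw)

station = record["station"]                        # character string
year = record["year"]                              # integer
readings = [float(v) for v in record["readings"]]  # list of floats

print(type(station).__name__, type(year).__name__, readings)
```

Once the values are ordinary Python objects, the program can compare, sum or store them like any other variable.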
13. Software and Data
• Programming Language: eg. Python
• Web Scraping library: import scrapy
• JSON library: import json
• Regular Expression library: import re
• Other libraries (eg. database): import mysql
14. Installing Scrapy
• Scrapy is a web scraping framework written in
Python.
• virtualenv ~/env/scrapy
• source ~/env/scrapy/bin/activate
• pip install scrapy
15. Try in Scrapy Shell
• Try in Scrapy Shell:
• scrapy startproject demo1
• scrapy shell http://www.aqhi.gov.hk/epd/ddata/html/out/24aqhi_Eng.xml
• a = response.xpath("//item[contains(.//StationName/text(), 'Causeway Bay')]/aqhi/text()").extract()
• b = a[-1] # latest reading; b is a string
• c = int(b) # c is an integer
• print(repr(b), repr(c)) # show the difference
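The same extraction can be tried without Scrapy or a network connection. This is a minimal sketch using the standard library's xml.etree.ElementTree on a small hand-written sample. The element names (item, StationName, aqhi) follow the XPath above, but the root element, the DateTime values and the readings are assumptions, not the real EPD feed.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the structure used in the XPath above.
# Root element name, timestamps and readings are hypothetical.
SAMPLE_XML = """<AQHI24HrReport>
  <item><StationName>Central</StationName><DateTime>t1</DateTime><aqhi>3</aqhi></item>
  <item><StationName>Causeway Bay</StationName><DateTime>t1</DateTime><aqhi>4</aqhi></item>
  <item><StationName>Causeway Bay</StationName><DateTime>t2</DateTime><aqhi>5</aqhi></item>
</AQHI24HrReport>"""

root = ET.fromstring(SAMPLE_XML)

# Collect the aqhi text of every item whose station is Causeway Bay.
values = [
    item.findtext("aqhi")
    for item in root.iter("item")
    if item.findtext("StationName") == "Causeway Bay"
]

latest_text = values[-1]   # last item in document order, still a string
latest = int(latest_text)  # converted to an integer

print(repr(latest_text), repr(latest))
```

The sketch assumes the last matching item is the most recent one, as the shell session above does.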
16. AQHI
• Phase 1: EPD provided AQHI in XML format.
• Phase 2: EPD provided both AQHI and Pollutant
Concentration in XML format.
17. 2. What is the current
temperature of Shatin ?
18. Weather
• Hong Kong Observatory (HKO)
• http://www.weather.gov.hk
• Top Hobbyist Website: Weather Underground
http://www.weather.org.hk/
19. So, once again a human
has found the “data” by hand.
20. How does a software program
read the current temperature
of Shatin (“data”)?
21. Sorry! You need to subscribe to commercial
paid data feed services provided by HKO.
XD
23. We can do it by scraping
from HTML document
(a harder method)
24. Try in Scrapy
• scrapy shell http://www.weather.gov.hk/wxinfo/ts/text_readings_e.htm
• a = response.xpath("//pre").extract()[0]
• import re
• b = re.split("\n", a)
25. Clean Data with RE
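c = ''
for i in b:
    # keep only the first line that starts with the station name
    if re.search("^Sha Tin", i) and c == '':
        c = re.sub("^Sha Tin *", "", i)  # drop the station name
        c = re.sub(" .*", "", c)         # drop everything after the first value
print(c)         # c is a string
print(float(c))  # float(c) is a float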
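The cleaning steps above can be wrapped in a small function and tried offline. This is a sketch only: the sample text is hand-written to resemble a fixed-width table, the station_reading name is hypothetical, and the real HKO page layout may differ.

```python
import re

# Hand-written sample resembling a fixed-width text table (not real HKO data).
SAMPLE = """King's Park     16.9     78
Sha Tin         17.3     81
Tai Po          17.0     80"""


def station_reading(text, station):
    """Return the first value after the station name as a float, or None."""
    pattern = r"^" + re.escape(station) + r" +(\S+)"
    for line in text.split("\n"):
        match = re.match(pattern, line)
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                # The column held something that is not a number.
                return None
    return None


print(station_reading(SAMPLE, "Sha Tin"))
```

Returning None for a missing or non-numeric reading keeps the caller from crashing when a station reports no data.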
27. Open Data
• Discoverable
• Available and Searchable on Internet.
• Structured
• Open and Machine-readable Format.
• Unconditional
• Legal Framework allows to reproduce and
repurpose the data.
28. 5-star Open Data
Deployment Scheme
• Tim Berners-Lee, the inventor of the Web.
• 5stardata.info
• 1 Star: make your stuff available on the Web (whatever format) under an open
license.
• 2 Star: make it available as structured data
• eg. Excel instead of image scan of a table
• 3 Star: use non-proprietary formats
• eg. CSV instead of Excel
• 4 Star: use URIs to denote things, so that people can point at your stuff
• 5 Star: link your data to other data to provide context.
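To show why a 3-star format such as CSV is easy for software to consume, here is a minimal sketch with Python's standard csv module. The column names and values are invented for illustration.

```python
import csv
import io

# Hypothetical CSV content; column names and readings are assumptions.
SAMPLE_CSV = "station,aqhi\nCauseway Bay,5\nCentral,4\n"

# DictReader uses the header row as keys for each record.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Turn the text cells into a station -> integer mapping.
aqhi_by_station = {row["station"]: int(row["aqhi"]) for row in rows}

print(aqhi_by_station)
```

No proprietary software is needed to read the file, which is the point of the third star.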
29. Open Data in Hong Kong
• OGCIO
• DATA.ONE in 2011.
• data.gov.hk in 2015.
• JSON/XML, RSS, XLS, CSV, JPEG/PNG, …
• Defines the workflow for other government departments to release open
data.
• OGCIO cannot decide which data and formats are
released.
• The decision is made by the data owner in each government department.
30. Open Data in Hong Kong
• LegCo
• http://www.legco.gov.hk
• Voting results of LegCo meetings and some
committee meetings released in XML in Oct 2013.
• API available since fall 2014.
• Not part of DATA.ONE / DATA.GOV.HK.
31. HK Air Quality Data
• AQHI, old API (Air Pollution Index) and Pollutant Concentration
• XML Data for past 24 hours.
• CSV Data for all past records.
• EPD released AQHI and the old API in phase 1 a few
years ago.
• EPD also released Pollutant Concentration data in
machine-readable format in phase 2 one year ago.
32. Weather in DATA.GOV.HK
• I posted a blog entry, 'Progress of Open Government Data in Hong Kong', on 2013/01/17.
• The weather section of Data.One released only 7 datasets.
• All datasets are in RSS (XML) format, which includes items with title and
description only.
• Hourly weather reports, weather forecasts and special reports in 3 languages.
• Examples of missing data:
• Regional weather data updated from stations every 10 minutes.
• One word: Useless.
• The RSS datasets on DATA.GOV.HK are completely different from the HKO paid service
(XML data feed).
34. API
• API = Application Programming Interface
• Retrieve data through “Interface”.
• C API, Python API, Objective-C API, Java
API, …
35. Web API
• API for Web Server or Web Browser/Client.
• Web APIs are usually used for connecting to 3rd party
web services.
• Request and Response messaging interface via Web
(HTTP) defined by service providers.
• Request URI example: https://api.twitter.com/1.1/statuses/user_timeline.json
• Data are exchanged in JSON or XML format.
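A request/response exchange can be sketched with the standard library alone. The host api.example.com, the query parameters and the sample response body are assumptions for illustration. The request is built but not sent, so the sketch runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, shaped like the request URI above.
BASE = "https://api.example.com/1.1/statuses/user_timeline.json"
params = {"screen_name": "sammyfung", "count": 2}

# Build the HTTP request; urlopen(request) would actually send it.
request = Request(BASE + "?" + urlencode(params),
                  headers={"Accept": "application/json"})

# Hand-written body standing in for what a server might return.
sample_body = '[{"id": 1, "text": "hello"}, {"id": 2, "text": "open data"}]'
texts = [status["text"] for status in json.loads(sample_body)]

print(request.get_method(), request.full_url)
print(texts)
```

The service provider defines which URIs and parameters exist, so a real client would follow that provider's documentation rather than these placeholder names.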
36. Web API Examples
• Payments: PayPal, MasterPass,…
• Online Services: Google, GitHub,…
• Social Networks: Twitter, Facebook,…
37. REST
• Representational state transfer
• One of the reference styles of data exchange for Web 2.0.
• Web APIs are usually designed in REST style.
• Systems communicate using HTTP verbs.
• HTTP Verbs: GET, POST, PUT, DELETE,…
• GET: list or retrieve data
• POST: create data
• PUT: update or replace data
• DELETE: delete data
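The mapping from HTTP verbs to data operations can be illustrated with a tiny in-memory store. This is a sketch, not a web server: the StationStore class and its handle method are hypothetical names, and the status codes follow common REST practice.

```python
class StationStore:
    """In-memory resource collection that reacts to HTTP-style verbs."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, verb, item_id=None, data=None):
        """Return a (status code, body) pair for the given verb."""
        if verb == "GET":
            if item_id is None:
                return 200, list(self._items.values())  # list everything
            if item_id in self._items:
                return 200, self._items[item_id]         # retrieve one
            return 404, None
        if verb == "POST":                               # create
            item_id = self._next_id
            self._next_id += 1
            self._items[item_id] = dict(data, id=item_id)
            return 201, self._items[item_id]
        if verb == "PUT":                                # update or replace
            if item_id not in self._items:
                return 404, None
            self._items[item_id] = dict(data, id=item_id)
            return 200, self._items[item_id]
        if verb == "DELETE":                             # delete
            if self._items.pop(item_id, None) is None:
                return 404, None
            return 204, None
        return 405, None                                 # verb not supported


store = StationStore()
print(store.handle("POST", data={"name": "Causeway Bay"}))
print(store.handle("PUT", 1, {"name": "Causeway Bay", "aqhi": 5}))
print(store.handle("GET"))
print(store.handle("DELETE", 1))
```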
38. Communication Flow
with API
• Authorization
• Retrieve a token so that your web/mobile/backend apps can use the 3rd party API
services.
• Redirect users to the 3rd party service for one-time auth (eg. Username,
Password); the token is then used for future access until it expires.
• For Application or Application-User Authorisation.
• eg. OAuth, XAuth.
• Make your web API calls.
• API Rate Limits
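The flow above (obtain a token, use it until it expires, respect rate limits) can be sketched as follows. This assumes an OAuth 2.0 style bearer token sent in an Authorization header; other schemes sign requests differently. The names Token, auth_headers and seconds_to_wait are hypothetical.

```python
import time


class Token:
    """Access token with an expiry time (assumed bearer-style token)."""

    def __init__(self, value, lifetime_seconds, now=None):
        start = time.time() if now is None else now
        self.value = value
        self.expires_at = start + lifetime_seconds

    def is_expired(self, now=None):
        current = time.time() if now is None else now
        return current >= self.expires_at


def auth_headers(token, now=None):
    """Build request headers, refusing to use an expired token."""
    if token.is_expired(now):
        raise RuntimeError("token expired; authorise again")
    return {"Authorization": "Bearer " + token.value}


def seconds_to_wait(remaining_calls, reset_at, now):
    """How long to pause before the next call under a simple rate limit."""
    return 0 if remaining_calls > 0 else max(0, reset_at - now)


token = Token("abc123", lifetime_seconds=3600, now=1000)
print(auth_headers(token, now=1500))
print(seconds_to_wait(remaining_calls=0, reset_at=900, now=600))
```

Passing the current time in as a parameter keeps the sketch deterministic; a real client would read the remaining quota and reset time from whatever the provider reports.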
39. Tweepy
• A 3rd party Twitter library for Python.
• pip install tweepy
• http://tweepy.readthedocs.org
40. Open Data and Web API
• The structure of Open Data and the syntax of Web APIs
are changed by service / data providers from
time to time.
• You should subscribe to the developer blogs of those
API and data services if possible.
• Use existing open source software tools to access
web APIs; otherwise build your own tools (and
consider making them open source).
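Because providers change data structures over time, client code can be written to tolerate more than one layout. This is a minimal sketch: the pick helper and both record layouts are hypothetical.

```python
def pick(record, *paths, default=None):
    """Return the value at the first key path that exists in record."""
    for path in paths:
        value = record
        for key in path:
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                break          # this path does not exist; try the next one
        else:
            return value       # every key in the path was found
    return default


# Two hypothetical layouts of the same reading, before and after a change.
old_layout = {"aqhi": "5"}
new_layout = {"reading": {"aqhi": 5}}

for record in (old_layout, new_layout):
    print(int(pick(record, ("reading", "aqhi"), ("aqhi",))))
```

Listing the newest path first and older paths after it lets one tool keep working across a provider's format change.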
42. Open Source Software
• Open Source = source code is available to the public.
• License: licensed under one of the Open Source licenses.
• Freedom: free to (re-)distribute.
• You can charge for distribution costs, but almost no
one does.
• GitHub: a rich library of open source software.
• Git: a distributed version control software tool.