New Perspectives for Business Intelligence: Library and Research Technologies...
Google Public Data Explorer
1. Digital Enterprise Research Institute www.deri.ie
Aftab Iqbal
Stefan.Decker@deri.org
http://www.StefanDecker.org/
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
3. DSPL Dataset
General information
About the dataset
Concepts
Definitions of "things" that appear in the dataset (e.g., counties, unemployment rate, gender)
Slices
Combinations of concepts for which there are data
Tables
Data for concepts and slices. Concept tables hold enumerations and slice tables hold statistical data
Topics
Organize the concepts of the dataset in a meaningful hierarchy through labeling
4. School Enrollment 2009_2010 *
School_Roll_No Short_Name Level Male Female
00697S ST BRIDGIDS NS Primary 377 447
01170G NAUL NS Primary 40 61
09492W BALSCADDEN NS Primary 98 133
… … … … …
* Snapshot taken from http://data.fingal.ie/ViewDataSets/Details/default.aspx?datasetID=385
5. DSPL – Contd.
General Information
General information about the dataset and its provider
<info>
<name>
<value>School</value>
</name>
<description>
<value>Statistics about Fingal County Schools</value>
</description>
<url>
<value></value>
</url>
</info>
<provider>
<name>
<value>County Fingal School Enrollment Statistics</value>
</name>
<url>
<value>http://data.fingal.ie/ViewDataSets/Details/default.aspx?datasetID=385</value>
</url>
</provider>
6. DSPL – Contd.
Concepts
Type of data that appears in a dataset
<concept id="Schools" extends="geo:location">
<info>
<name>
<value>Schools</value>
</name>
<description>
<value>List of schools for Co. Fingal</value>
</description>
</info>
<type ref="string"/>
<table ref="schools_table"/>
</concept>
<table id="schools_table">
<column id="School" type="string"/>
<column id="School_Roll_No" type="string"/>
<column id="latitude" type="float"/>
<column id="longitude" type="float"/>
<data>
<file format="csv" encoding="utf-8">schools.csv</file>
</data>
</table>
school name latitude longitude
00697S Saint Bridgids National School 53.37514 -6.36221
01170G S N Na H Aille Naul National School 53.57887 -6.28564
09492W Balscadden National School 53.61528 -6.23218
09642P Burrow National School 53.39129 -6.10028
… … … …
7. DSPL – Contd.
Slices
A slice is a combination of concepts for which data exists.
It contains two kinds of concept references: dimensions and metrics.
<slice id="enrolment_slice">
<dimension concept="school"/>
<dimension concept="time:year"/>
<metric concept="M"/>
<metric concept="F"/>
<table ref="enrolment_slice_table"/>
</slice>
<table id="enrolment_slice_table">
<column id="school" type="string"/>
<column id="M" type="integer"/>
<column id="F" type="integer"/>
<column id="year" type="date" format="yyyy"/>
<data>
<file format="csv" encoding="utf-8">school_enrolment_slice.csv</file>
</data>
</table>
8. School Enrollment Slice
Dimensions: School, Year. Metrics: Male, Female.
School Male Female Year
Saint Bridgids National School 377 447 2009
Saint Bridgids National School 475 392 2010
Balscadden National School 98 133 2009
Balscadden National School 126 102 2010
… … … …
9. DSPL – Contd.
Topics
Topics classify concepts hierarchically and are used by applications to help users navigate to your data.
<topic id="Male_indicators">
<info>
<name><value>Male Students Enrollment</value></name>
</info>
</topic>
<topic id="Female_indicators">
<info>
<name><value>Female Students Enrollment</value></name>
</info>
</topic>
10. Data Cleansing
School Enrollment 2009
School_Roll_No Short_Name Level Male Female
00697S ST BRIDGIDS NS Primary 377 447
01170G NAUL NS Primary 40 61
… … … … …
School Enrollment 2010
School_Roll_No Short_Name Level Male Female
00697S ST BRIDGIDS NS Primary 475 392
01170G NAUL NS Primary 58 40
… … … … …
School_Enrollment_Slice.csv
School Male Female Year
00697S 377 447 2009
00697S 475 392 2010
01170G 40 61 2009
01170G 58 40 2010
… … … …
Schools.csv
School Name Latitude Longitude
00697S Saint Bridgids National School 53.37514 -6.36221
01170G S N Na H Aille Naul National School 53.57887 -6.28564
… … … …
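The cleansing step on this slide, which folds the two per-year enrollment tables into one slice table keyed by school and year, could be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the names `ENROLMENT_BY_YEAR`, `build_slice_rows` and `to_csv` are hypothetical, and only the rows visible on the slide are included.

```python
import csv
import io

# Per-year snapshots, limited to the rows shown on the slide.
# The dictionary layout is an assumption made for this sketch.
ENROLMENT_BY_YEAR = {
    2009: [
        {"School_Roll_No": "00697S", "Male": 377, "Female": 447},
        {"School_Roll_No": "01170G", "Male": 40, "Female": 61},
    ],
    2010: [
        {"School_Roll_No": "00697S", "Male": 475, "Female": 392},
        {"School_Roll_No": "01170G", "Male": 58, "Female": 40},
    ],
}


def build_slice_rows(enrolment_by_year):
    """Flatten the per-year tables into one row per (school, year)."""
    rows = []
    for year, records in enrolment_by_year.items():
        for rec in records:
            rows.append({
                "School": rec["School_Roll_No"],
                "Male": rec["Male"],
                "Female": rec["Female"],
                "Year": year,
            })
    # Order by school, then year, as in the slice table on the slide.
    rows.sort(key=lambda r: (r["School"], r["Year"]))
    return rows


def to_csv(rows):
    """Serialize the slice rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["School", "Male", "Female", "Year"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


slice_rows = build_slice_rows(ENROLMENT_BY_YEAR)
print(to_csv(slice_rows))
```

The header names here follow the slice table on the slide; in a real bundle they would presumably need to line up with the column ids declared for the slice table in the XML.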
A DSPL dataset is a bundle that contains an XML file and a set of CSV files. The CSV files are simple tables containing the data of the dataset. The XML file describes the metadata of the dataset, including informational metadata like descriptions of measures, as well as structural metadata like references between tables. The metadata lets non-expert users explore and visualize your data.
The only prerequisite for understanding this tutorial is a good level of understanding of XML. Some understanding of simple database concepts (e.g., tables, primary keys) may help, but it's not required. For reference, the completed XML file and complete dataset bundle associated with this tutorial are also available for review.
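As a rough illustration of the bundle idea, the sketch below packs one XML file and two CSV files into a single archive in memory. Several things here are assumptions and not taken from the notes: the use of a zip archive as the bundle container, the file name `dataset.xml`, the placeholder file contents, and the helper name `build_bundle`. The two CSV file names are the ones referenced in the XML snippets on the slides.

```python
import io
import zipfile


def build_bundle(files):
    """Pack {file name: text content} into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


# Placeholder contents only; a real bundle would carry the full
# metadata file and the complete CSV tables.
files = {
    "dataset.xml": "<dspl/>",  # hypothetical name for the metadata file
    "schools.csv": "placeholder\n",
    "school_enrolment_slice.csv": "placeholder\n",
}

bundle = build_bundle(files)

# Re-open the archive to list what ended up inside it.
with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
    names = sorted(zf.namelist())
print(names)
```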
The <info> element contains general information about the dataset: name, description, and a URL where more information can be found. The <provider> element contains information about the provider of the dataset: its name and a URL where more information can be found (generally the data provider's home page).
Now that we have provided some general information about the dataset, we're ready to start defining its contents. A concept is a definition of a type of data that appears in a dataset. The data values that correspond to a given concept are called instances of that concept.
Every concept must provide an id that uniquely identifies the concept within the dataset. Just as for the dataset and its provider, the <info> element provides textual information about the concept, such as its name and description. The <type> element specifies the data type for the instances of the concept (in other words, its "values").
Concepts that are categorical, such as state, are associated with concept tables, which enumerate all their possible values (California, Arizona, etc.). Concepts may have additional columns for properties such as the name or the country of a state.
Finally, the school concept has a <table> element. This element references a table that enumerates the list of all schools. The schools table specifies the columns of the table and their types, and references a CSV file that contains the data.
Slices define each combination of concepts for which there is statistical data in the dataset. A slice contains dimensions and metrics. The values of metrics vary with the values of dimensions. In the above picture, the dimensions are blue and the metrics are orange. In this example, the slice gender_country_slice has data for the metric population and the dimensions country, year and gender. Another slice, called country_slice, gives total yearly population numbers (metric) for countries.
Just like concepts, slices include a reference to a table that contains the data of the slice. The referenced table must have one column for each dimension and metric of the slice. Just as for concepts, the slice's dimensions and metrics are mapped to the table columns with the same ids.
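The rule that a slice's table needs one column per dimension and metric can be checked mechanically. The sketch below does this for the enrolment slice from the slides, using Python's standard `xml.etree.ElementTree`. It rests on a few assumptions: the fragment is wrapped in a bare `<dspl>` root without the XML namespace a real DSPL file declares, a prefixed reference such as `time:year` is taken to map to the column named after its local part (`year`), and the helper names `local_id` and `missing_columns` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Slice and table definitions as on the slides, wrapped in a bare root
# element. The namespace declaration of a real DSPL file is left out.
DSPL_FRAGMENT = """
<dspl>
  <slices>
    <slice id="enrolment_slice">
      <dimension concept="school"/>
      <dimension concept="time:year"/>
      <metric concept="M"/>
      <metric concept="F"/>
      <table ref="enrolment_slice_table"/>
    </slice>
  </slices>
  <tables>
    <table id="enrolment_slice_table">
      <column id="school" type="string"/>
      <column id="M" type="integer"/>
      <column id="F" type="integer"/>
      <column id="year" type="date" format="yyyy"/>
    </table>
  </tables>
</dspl>
"""


def local_id(concept_ref):
    """Drop an import prefix: 'time:year' becomes 'year' (assumed mapping)."""
    return concept_ref.split(":")[-1]


def missing_columns(root, slice_id):
    """List dimension/metric ids that have no matching table column."""
    slice_el = root.find(f".//slice[@id='{slice_id}']")
    table_ref = slice_el.find("table").get("ref")
    table_el = root.find(f".//tables/table[@id='{table_ref}']")
    columns = {col.get("id") for col in table_el.findall("column")}
    needed = [
        local_id(el.get("concept"))
        for el in slice_el
        if el.tag in ("dimension", "metric")
    ]
    return [name for name in needed if name not in columns]


root = ET.fromstring(DSPL_FRAGMENT)
print(missing_columns(root, "enrolment_slice"))
```

An empty list means every dimension and metric has a column; any names returned are the ones the table would still need.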