2. We are surrounded by
MESSY data 2013-02-06
Toronto Data Science Group
- Multiple standards and formats
  - Structured vs unstructured
  - Field naming and format vary
- Human error (misspellings, errors, etc.)
- Non-normalized inputs (free-text entries, the "other" option)
- Incomplete data (laziness)
- ...
3. Lack of
- Time
- Skills
- Software
4. OpenRefine the
- Swiss army knife for data manipulation!
- Glue step between your IT systems
5. What's OpenRefine?
(formerly Google Refine, formerly Gridworks)
- A cross-platform web application that runs locally
- A community-based project hosted on GitHub
- Available in two distributions, with multiple extensions
- Something between a spreadsheet and SQL
6. Three use cases
1. Data Cleaning
2. ETL (Extract, Transform, Load) Prototyping
3. Data extension (reconciliation & linked data)
7. #1 Data Cleaning
- Graphical interface
- Facet option
- Cluster similar records
- Supports three languages: GREL, Jython, Clojure (+ regex)
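
To make "cluster similar records" concrete, here is a minimal Python sketch of key-collision clustering in the spirit of a fingerprint method: values that reduce to the same normalized key are grouped together. The function names (`fingerprint`, `cluster`) and the sample records are hypothetical, and this is a simplified illustration rather than OpenRefine's actual implementation.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: ASCII-only, lowercase, no punctuation, sorted unique tokens."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and remove ASCII punctuation.
    cleaned = ascii_text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, and sort tokens so word order is ignored.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one distinct spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


# Hypothetical sample data with inconsistent spellings.
records = ["Toronto, ON", "toronto on", "ON Toronto", "Montréal", "Montreal ", "Ottawa"]
clusters = cluster(records)
for key, members in clusters.items():
    print(key, "->", members)
```

Each cluster is a candidate for merging into one canonical spelling; a human still decides which groups are true duplicates.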
8. Facet example
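
The original slide showed a screenshot that is not recoverable here. As a rough stand-in, the sketch below shows what a text facet does conceptually: count the distinct values in a column, then select the rows matching one value. The helper names (`text_facet`, `select`) and the sample rows are hypothetical, not OpenRefine functions.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in a column, most frequent first."""
    counts = Counter(row.get(column, "") for row in rows)
    return counts.most_common()


def select(rows, column, value):
    """Keep only the rows whose column equals the chosen facet value."""
    return [row for row in rows if row.get(column, "") == value]


# Hypothetical sample rows; note the inconsistent casing and the blank entry.
rows = [
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": "Toronto"},
    {"city": "Montreal"},
    {"city": ""},
]

facet = text_facet(rows, "city")
for value, count in facet:
    print(repr(value), count)

blank_rows = select(rows, "city", "")
```

Seeing "Toronto" and "toronto" listed as separate values, alongside a blank entry, is exactly the kind of problem a facet is meant to surface.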
10. #2 ETL Prototyping
(Extract, Transform, Load)
Extract & Load
Supports:
- tabular (csv, xls)
- hierarchical (xml, json)
Transform
- Understand your data
- Test the transformations that need to be done
- Undo / Redo
- Export transformations in JSON format
- Automate using the Python or Ruby extension
11. History and JSON export
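
The original slide showed a screenshot that is not recoverable here. As an illustration of why a JSON export is useful, the sketch below parses an operation history and lists its steps, so they can be reviewed or stored alongside a script. The JSON is a hand-written example: the operation identifiers and field names are recalled from typical exports and should be treated as assumptions, not a specification of the format.

```python
import json

# Hand-written example of an exported operation history (assumed shape).
exported_history = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column city using expression value.trim()",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "city",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-rename",
    "description": "Rename column city to City",
    "oldColumnName": "city",
    "newColumnName": "City"
  }
]
"""

# Parse the history and summarize each step as (operation id, description).
operations = json.loads(exported_history)
summary = [(step["op"], step["description"]) for step in operations]

for number, (op, description) in enumerate(summary, start=1):
    print(f"{number}. [{op}] {description}")
```

Because the history is plain JSON, the same list of steps can be kept under version control and reapplied to a fresh copy of the data.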
12. #3 Extend your Data
(reconciliation & linked data)
- Cross between OpenRefine projects (vlookup)
- Fetch URL and call web services (API)
Reconcile against
- RDF file & local SPARQL endpoints
- Online databases
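
The "vlookup" idea of crossing two projects can be sketched in plain Python: build an index on the key column of one table, then pull a value from it into each row of the other. The function names (`build_index`, `cross`), column names, and sample data are hypothetical; this illustrates the lookup concept only, not how OpenRefine implements it.

```python
def build_index(rows, key_column):
    """Map each key to the first row that carries it."""
    index = {}
    for row in rows:
        index.setdefault(row[key_column], row)
    return index


def cross(rows, other_rows, key_column, other_key_column, fetch_column, new_column):
    """Add new_column to each row by looking up its key in other_rows (None if no match)."""
    index = build_index(other_rows, other_key_column)
    extended = []
    for row in rows:
        match = index.get(row[key_column])
        extended.append({**row, new_column: match[fetch_column] if match else None})
    return extended


# Hypothetical sample tables standing in for two separate projects.
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C2"},
    {"order_id": 3, "customer_id": "C9"},
]
customers = [
    {"id": "C1", "name": "Alice"},
    {"id": "C2", "name": "Bob"},
]

extended_orders = cross(orders, customers, "customer_id", "id", "name", "customer_name")
for row in extended_orders:
    print(row)
```

Rows with no match keep a `None` value, which makes unmatched keys easy to find with a facet afterwards.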
14. Thanks!
Martin Magdinier: martin.magdinier@gmail.com, @magdmartin
OpenRefine: http://openrefine.org, @OpenRefine