Tables presented in spreadsheets can be a source of important information that needs to be loaded into relational databases. However, many of them have complex structures, which prevents populating databases with their information directly. This presentation addresses rule-based information extraction from arbitrary tables presented in spreadsheets and the transformation of that information into a structured canonical form that can be loaded into a database by standard ETL tools. We propose a novel rule language called CRL for table analysis and interpretation. It enables developing simple programs that recover the missing relationships describing table semantics. Particular sets of rules can be designed for different types of tables to provide the extraction and transformation steps in a process of unstructured tabular data integration.
CRL: A Rule Language for Table Analysis and Interpretation
1. CRL: A Rule Language
for Table Analysis and Interpretation*
in Unstructured Tabular Data Integration
Alexey Shigarov, shigarov@icc.ru
Matrosov Institute for System Dynamics and Control Theory of SB RAS
17th International Conference on
Data Analytics and Management in Data Intensive Domains
Obninsk, Russia
October 13-16, 2015
* This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042)
and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)
2. Unstructured vs Structured
Unstructured Tabular Data
Arbitrary tables in ASCII text, spreadsheets, PDF documents, web pages
• For humans to understand
• No explicit semantics
• We can read, write, and edit
Structured Data
Relational databases
• For computers to understand
• Formal data model (semantics)
• We can query (SQL) and analyse (DM, OLAP)
Easy way: from structured data to unstructured tabular data
Hard way: from unstructured tabular data to structured data
3. Hard Way Back to Structured Data World
Table Detection*
Table Recognition*
Table Analysis*
Table Interpretation*
ASCII-text
Untagged PDF Documents
Image
Documents
Spreadsheets
Web Pages
Word Documents
OCR
Databases
Canonical Forms
XML
ETL
* Hurst M. Layout and language: Challenges for table understanding on the web //
Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30
4. Our purpose
Globally
to automate unstructured tabular data integration
Databases
Arbitrary Tables
in Spreadsheets
Currently
to automate table analysis and interpretation
Tables in Canonical Form
5. OK, We Initially Have an Arbitrary Tagged Table
We know
• structure (rows, columns, cells)
• style settings (fonts, colors, alignments, etc.)
• textual content
6. All We Need Is To Recover Semantics
Relationships like
entry-label, label-label, label-category*
* Our terminology is inspired by
X. Wang's abstract table model
[Wang X. Tabular Abstraction, Editing,
and Formatting, PhD Thesis. 1996]
7. When We Know Semantics We Can Generate a Canonical Table
It can be loaded into a database by ETL tools
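As a rough illustration of what a canonical table could look like once entries and their labels are known, here is a minimal Python sketch. The `Entry` class, the `to_canonical` function, and the `DATA` column name are hypothetical and are not taken from the CRL prototype; the sketch only assumes that each entry carries at most one label per category.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    value: str
    # Hypothetical mapping: category name -> label value for this entry
    labels: dict = field(default_factory=dict)


def to_canonical(entries, categories):
    """Build rows with one entry per row and one column per category."""
    header = ["DATA"] + list(categories)
    rows = [header]
    for e in entries:
        rows.append([e.value] + [e.labels.get(c, "") for c in categories])
    return rows


entries = [
    Entry("1", {"A": "a1", "B": "b1"}),
    Entry("2", {"A": "a2", "B": "b1"}),
]
canonical = to_canonical(entries, ["A", "B"])
for row in canonical:
    print(row)
```

A flat table of this shape has a fixed schema, which is what makes it straightforward to hand over to an ETL tool.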
8. Challenges on the Hard Way Back
• Too many layouts to create a table
• Anyone can invent a new one
• Messy data
• No guarantee that your tabular data are clean and standardized
• Natural Language
• Table understanding requires knowledge
9. Our Idea
When
• A table creator (e.g. a company, a government agency, ad-hoc software)
uses a set of rules for table generation
• Tables have similar structure, style, and content
within a set of generating rules
Then
• We can define a set of rules for table analysis and interpretation
• We can use a rule engine to execute these rules
10. Table Analysis and Interpretation Rules
• Rules can be expressed in
• Drools Rule Language* (DRL)
A general-purpose language for expressing production rules in the Drools* rule engine
• Cells Rule Language (CRL)
Our domain-specific language for expressing table analysis and interpretation rules
• Rules can be executed with the Drools* rule engine
*http://drools.org
11. CRL Rules
Rules map known table data to unknown ones
rule
when
Left hand side defines conditions using available facts
(cells, categories)
then
Right hand side defines actions to recover unknown semantics
(entries, labels, categories, entry-label, label-label, label-category)
end
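The when/then structure above can be pictured as a pair of functions: a condition over known facts and an action that produces new facts. The following Python sketch is only an analogy under that assumption; `run_rules` is a hypothetical single-pass loop, not the Drools engine and not the CRL interpreter.

```python
def run_rules(rules, cells):
    """Apply each (when, then) pair to every cell in one pass."""
    recovered = []
    for when, then in rules:
        for cell in cells:
            if when(cell):
                recovered.append(then(cell))
    return recovered


# Known facts: cells with a position and textual content
cells = [
    {"row": 1, "col": 1, "text": "Year"},
    {"row": 2, "col": 1, "text": "2015"},
]

# One rule: a numeric cell below the first row becomes an entry
rules = [
    (lambda c: c["row"] > 1 and c["text"].isdigit(),
     lambda c: ("entry", c["text"])),
]

facts = run_rules(rules, cells)
print(facts)
```

A real production rule engine also re-evaluates conditions when actions add new facts; this sketch deliberately leaves that out.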
12. CRL: Left Hand Side
factType $variable : Java boolean expressions
cell $cell : constraints
entry $entry : constraints
label $label : constraints
category $category : constraints
13. CRL: Right Hand Side
[Figure: merged cells and the resulting split cells]
Cell splitting
To split a cell spanning n tiles into n cells
split $cell
Cell merging
To merge two cells into one
merge $cell1 -> $cell2
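The cell splitting action above can be sketched in Python as follows. The field names `cl`, `cr`, `rt`, `rb` mirror the constraint names used in the CRL examples and are read here as left column, right column, top row, and bottom row; that reading, and the choice to copy the text into every resulting cell, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int  # left column (assumed meaning)
    cr: int  # right column (assumed meaning)
    rt: int  # top row (assumed meaning)
    rb: int  # bottom row (assumed meaning)
    text: str


def split(cell):
    """Split a cell covering n tiles into n single-tile cells."""
    return [
        Cell(c, c, r, r, cell.text)
        for r in range(cell.rt, cell.rb + 1)
        for c in range(cell.cl, cell.cr + 1)
    ]


merged = Cell(cl=2, cr=3, rt=1, rb=1, text="2015")
parts = split(merged)
print(parts)
```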
14. CRL: Right Hand Side
Cell marking
set mark @mark -> $cell
where @mark is a word starting with the @ character
Using marks in conditions
cell $cell : mark == @mark, constraints
Short form
cell@mark $cell : constraints
15. CRL: Right Hand Side
Entry creating
Using a cell value
new entry $cell
Using a specified value
new entry value -> $cell
Label creating
Using a cell value
new label $cell
Using a specified value
new label value -> $cell
16. CRL: Right Hand Side
Label categorizing
To associate a label with a category
set category $category -> $label
Trying to find or create a category with a specified name
set category category_name -> $label
17. CRL: Right Hand Side
Label associating
set parent label $label1 -> $label2
• Labels can be organized in a tree
• We can build hierarchical categories
• We can build compound label values like label1|label2|…|labelN
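To illustrate how parent links could yield compound label values of the form label1|label2|…|labelN, here is a small Python sketch. The `Label` class and its `compound` method are hypothetical, and the root-first ordering of the parts is an assumption.

```python
class Label:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent  # set by an action like "set parent label"

    def compound(self, sep="|"):
        """Join the values on the path from the root label down to this one."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.value)
            node = node.parent
        return sep.join(reversed(parts))


a = Label("a")
c = Label("c", parent=a)
print(c.compound())
```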
18. CRL: Right Hand Side
Label grouping
group $label1 -> $label2
• A label group constitutes an anonymous category
• We can divide labels into categories without knowing categories
• We can entirely categorize a label group
19. CRL: Right Hand Side
Entry associating
To associate an entry with a label
add label $label -> $entry
Trying to find or create a label with a specified value in the category
add label label_value from $category -> $entry
Trying to find or create a category with a specified name
add label label_value from category_name -> $entry
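The "find or create" behaviour described above can be sketched in Python as shown below. `CanonicalModel` and its methods are hypothetical names introduced for illustration; the sketch assumes that a category is identified by its name and a label by its value within that category.

```python
class CanonicalModel:
    """Hypothetical store: category name -> set of label values."""

    def __init__(self):
        self.categories = {}
        self.entry_labels = []  # (entry value, category name, label value)

    def find_or_create_label(self, category_name, label_value):
        # Reuse the category and label if they exist, otherwise create them
        labels = self.categories.setdefault(category_name, set())
        labels.add(label_value)
        return category_name, label_value

    def add_label(self, entry_value, label_value, category_name):
        category, label = self.find_or_create_label(category_name, label_value)
        self.entry_labels.append((entry_value, category, label))


model = CanonicalModel()
model.add_label("42", "2015", "Year")
model.add_label("43", "2015", "Year")
print(model.categories, model.entry_labels)
```

Both entries end up sharing a single label in a single category, which is the point of looking up before creating.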
21. Applying CRL: Critical Cells*
[Figure: example table with labels a-l and entries 1-7]
* Nagy G. Learning the Characteristics of Critical
Cells from Web Tables // In Proc. of the 21st Int.
Conf. on Pattern Recognition, Tsukuba, Japan,
IEEE Comp. Soc., 2012, pp. 1554-1557
when
cell $cc : cl==1, rt==1, blank
cell $ec : cl>$cc.cr, rt>$cc.rb
then
new entry $ec
-> <entries> = {1,2,3,4,5,6,7}
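The rule above can be mimicked in Python to show which cells it selects. This is a sketch under assumptions: `cl`, `cr`, `rt`, `rb` are read as left column, right column, top row, and bottom row, and the small grid below is an invented example, not the table from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int
    cr: int
    rt: int
    rb: int
    text: str = ""

    @property
    def blank(self):
        return self.text.strip() == ""


def find_entries(cells):
    """Cells to the right of and below the blank corner cell become entries."""
    entries = []
    for cc in cells:
        if cc.cl == 1 and cc.rt == 1 and cc.blank:
            entries += [ec.text for ec in cells
                        if ec.cl > cc.cr and ec.rt > cc.rb]
    return entries


cells = [
    Cell(1, 1, 1, 1, ""),   # blank corner cell
    Cell(2, 2, 1, 1, "a"),  # column labels
    Cell(3, 3, 1, 1, "b"),
    Cell(1, 1, 2, 2, "f"),  # row label
    Cell(2, 2, 2, 2, "1"),  # entries
    Cell(3, 3, 2, 2, "2"),
]
entries = find_entries(cells)
print(entries)
```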
22. Applying CRL: Label Groups
when
cell $cc : cl == 1, rt == 1, blank
cell $clc : cl > $cc.cr, rb <= $cc.rb
then
set mark @ColLabel -> $clc
new label $clc
when
cell@ColLabel $c1
cell@ColLabel $c2 : rt == $c1.rt
then
group $c1.label -> $c2.label
[Figure: the example table from the Critical Cells slide (labels a-l, entries 1-7)]
-> <labels>={a,b,c,d,e,f,g,...}
-> <groups>={{a,b},{c,d,e},
{f,g},...}
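The grouping step above puts column labels that start in the same row into one group. A minimal Python sketch of that idea follows; the row positions assigned to the labels are invented for illustration, and `group_by_row` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_row(col_labels):
    """Group label values by the top row of the cell they come from."""
    groups = defaultdict(set)
    for text, rt in col_labels:
        groups[rt].add(text)
    return [sorted(groups[rt]) for rt in sorted(groups)]


# (label value, top row); c and d occur twice, as repeated sub-headers do
col_labels = [("a", 1), ("b", 1),
              ("c", 2), ("d", 2), ("c", 2), ("d", 2), ("e", 2)]
groups = group_by_row(col_labels)
print(groups)
```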
24. Applying CRL: YAML*-Specified Categories
Category YAML specification
# category YEAR
name: Year
description: years from 1982 to 2015
constraints:
- "198[2-9]"
- "200[1-9]"
- "201[0-5]"
when
category $c : name == "Year"
label $l : $c.canHaveLabel(value)
then
set category $c -> $l
Category YAML specification
# category COUNTRY_CODE
name: CountryCode
description: ISO 3166 2-letter country codes
labels:
- AD
- AE
- ...
- ZW
when
category $c : name == "CountryCode"
label $l : $c.hasLabel(value)
then
set category $c -> $l
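The two checks used in the rules above, `canHaveLabel` and `hasLabel`, can be approximated in Python as follows. This is a sketch, not the prototype's implementation: the `Category` class is hypothetical, the constraints are assumed to be regular expressions that must match the whole label value, and the category data is written inline instead of being loaded from YAML.

```python
import re


class Category:
    def __init__(self, name, constraints=(), labels=()):
        self.name = name
        self.patterns = [re.compile(p) for p in constraints]
        self.labels = set(labels)

    def can_have_label(self, value):
        """True if the value matches at least one constraint pattern."""
        return any(p.fullmatch(value) for p in self.patterns)

    def has_label(self, value):
        """True if the value is one of the enumerated labels."""
        return value in self.labels


# Patterns copied from the Year specification above
year = Category("Year", constraints=["198[2-9]", "200[1-9]", "201[0-5]"])
# Only the codes that are spelled out above; the elided ones are omitted
country = Category("CountryCode", labels=["AD", "AE", "ZW"])

print(year.can_have_label("2015"), year.can_have_label("1981"))
print(country.has_label("AE"), country.has_label("XX"))
```

A pattern-based category suits open-ended value ranges, while an enumerated category suits closed vocabularies such as country codes.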
*http://yaml.org
25. Applying CRL: Category Names
when
cell $cc : cl == 1, rt == 1
cell $c : mark == "@ColLabel"
then
set category token($cc, 0) -> $c.label
A
B
a1 a2 a3
b1 1 2 3
b2 4 5 6
-> <categories> = {A,...}
-> <labels> = {a1,a2,a3,...}
-> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...}
26. Applying CRL: Multi-Valued Cells
Bilingual Tables
             α / 阿爾法    β / 公測
γ / 伽馬      1 / 一        2 / 二
δ / 三角洲    3 / 三        4 / 四
Key=Value Cells
C1       C2       C3
a = 1    b = 2    c = 3
d = 4    e = 5    f = 6
g = 7    h = 8    i = 9
when
cell $c : cl==1 || rt==1, !blank
then
new label token($c, 0) -> $c
new label token($c, 1) -> $c
when
cell $c : rt>1
then
new label left($c, '=') -> $c
new entry right($c, '=') -> $c
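The helper functions used in these rules (`token`, `left`, `right`) are not defined on the slide. One plausible reading, sketched in Python below, is whitespace tokenization and splitting at the first separator; the exact semantics in CRL may differ.

```python
def token(text, index):
    """Return the whitespace-separated token at the given position."""
    return text.split()[index]


def left(text, sep):
    """Return the part before the first separator, trimmed."""
    return text.split(sep, 1)[0].strip()


def right(text, sep):
    """Return the part after the first separator, trimmed."""
    return text.split(sep, 1)[1].strip()


# Bilingual cell: two labels from one cell
print(token("γ 伽馬", 0), token("γ 伽馬", 1))
# Key=value cell: a label and an entry from one cell
print(left("a = 1", "="), right("a = 1", "="))
```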
27. Applying CRL: Footnotes
when
cell $footer : onLastRow, $notes : text
entry $e : cell.text matches ".+*+",
$ref : extract(cell.text, "*+")
then
add label between($notes, $ref, 'n')
from "footnotes" -> $e
      a          b
      c    d     c    d
e     1*   2**   3    4
f     5    6     7    8
g     9    10    11   12
* x
** y
-> <labels>={x,y,...}
-> <categories>={"footnotes",...}
-> <entry-label pairs>={(1,x),(2,y),...}
-> <label-category pairs>={(x,"footnotes"), (y,"footnotes"),...}
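The footnote rule above links an entry to the note text that shares its asterisk marker. A Python sketch of the same matching is given below; `extract_ref` and `parse_notes` are hypothetical helpers, and the footer is assumed to hold one note per line in the form "marker text".

```python
import re


def extract_ref(text):
    """Split '2**' into the value '2' and the marker '**' (or None)."""
    m = re.fullmatch(r"(.+?)(\*+)", text)
    return (m.group(1), m.group(2)) if m else (text, None)


def parse_notes(footer):
    """Map each asterisk marker in the footer to its note text."""
    notes = {}
    for line in footer.splitlines():
        m = re.fullmatch(r"(\*+)\s+(.*)", line.strip())
        if m:
            notes[m.group(1)] = m.group(2)
    return notes


footer = "* x\n** y"
notes = parse_notes(footer)

pairs = []
for cell_text in ["1*", "2**", "3"]:
    value, ref = extract_ref(cell_text)
    if ref:
        pairs.append((value, notes[ref]))
print(pairs)
```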
30. Experimental Evaluation
Our purpose is to evaluate the recovery of entries, labels,
and entry-label and label-label relationships
Dataset
• We use the TANGO dataset (http://tango.byu.edu/data)
which
• is a part of the TANGO (Table ANalysis for Generating Ontologies) project
(http://tango.byu.edu)
• is intended for testing table interpretation methods
• has 200 arbitrary tables collected from 10 statistical sites in spreadsheet format in 2009
31. Experimental Evaluation
Layouts of TANGO Tables
[Figure: table regions (category name cells, column label cells, row label cells, entry cells) and the layout types shown: one-row plain, multi-row plain, and multi-row hierarchical layouts; one-column plain, multi-column plain, and one-column hierarchical layouts; one-column & one-row, multi-column & one-row, and multi-column & multi-row layouts. Shares shown in the figure: 47.5%, 47%, 5.5%, 100%, 94.5%, 5.5%, 65.5%, 26%, 8.5%]
We develop two sets of CRL rules to define two table types
• TANGO-200
all tables
• TANGO-SUB
without tables having a hierarchical layout in the leftmost column
32. Experimental Evaluation
Measures
• Recall
• a table is processed successfully, when all entries, labels, entry-label pairs, and label-label pairs which
are implicitly contained in its source form are explicitly included in its canonical form
• Precision
• a table is processed successfully, when all entries, labels, entry-label pairs, and label-label pairs which
are explicitly included in its canonical form are implicitly contained in its source form
Process
• Two experts independently compare the source tables with their automatically generated canonical forms
• They judge whether each table is processed successfully in terms of recall and precision
• When they disagree on a table, the final decision is made by a third expert
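The table-level measures defined above amount to set containment in one direction or the other. The Python sketch below makes that explicit; it assumes that the items of a table (entries, labels, and pairs) are available as sets, whereas in the experiment this comparison is made by human experts. The function names and sample items are illustrative.

```python
def recall_ok(source_items, canonical_items):
    """Everything implied by the source appears in the canonical form."""
    return set(source_items) <= set(canonical_items)


def precision_ok(source_items, canonical_items):
    """Everything in the canonical form is implied by the source."""
    return set(canonical_items) <= set(source_items)


def rate(flags):
    """Share of tables judged successful."""
    return sum(flags) / len(flags)


source = {("entry", "1"), ("label", "a"), ("entry-label", "1", "a")}
canonical = {("entry", "1"), ("label", "a")}  # one pair was not recovered

print(recall_ok(source, canonical), precision_ok(source, canonical))
print(rate([True, True, False, True]))
```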
33. Experimental Evaluation
Results
Rule Set / Table Type    TANGO-200    TANGO-SUB
Tables                   200          105
Cells                    22757        10893
Rules                    16           13
Recall                   87%          95%
Precision                89%          95%
For TANGO-200
• 33 tables are processed with errors
• 85% of errors arise in the leftmost column with one-column hierarchical layout
• Two main causes:
1) ambiguity among style characteristics
2) hierarchical relationships expressed by natural language only
34. Comparison with others
Methods and Tools for Table Analysis and Interpretation
1-5 Fixed Types of Tables Programmable Table Types
Knowledge-based
methods
Douglas, 1995
Tijerino, 2005
Embley, 2005
WangJ, 2012
• Domain ontologies
• Taxonomies like
ProBase, FreeBase
Domain-independent methods
Gatterbauer, 2007
Pivk, 2005, 2006, 2007
Kim, 2008
Chen&Cafarella, 2013, 2014
Embley, 2014
Nagy, 2014
• Spatial, style, and textual data
• Several typical table types
We are here!
2014, 2015
• Rule language (CRL, DRL)
• Relative cell addressing
• Fixed target schema
• Spatial, style,
and textual data
Hung, 2011
• Spreadsheet-like formula
mapping language (TranSheet)
• Absolute cell addressing
• Programmable target schema
• Spatial and textual data
35. Conclusions
• Our methodology is mainly oriented toward unstructured tabular data integration
• We expect it to be useful when data from a large number of tables
belonging to a few table types are required for populating a database
• One set of rules can be suitable for processing a wide range of arbitrary tables
with high accuracy
• The experiment demonstrates that narrowing a table type can simplify the
rules and increase recall and precision in table canonicalization
36. Further Work
• Table Layouts
to develop techniques for widely used table features,
e.g. for recovering a row label hierarchy in the leftmost column
• Messy Tabular Data
to incorporate data cleansing techniques into table understanding
• Natural Language
to add external knowledge such as global taxonomies (e.g. FreeBase, DBpedia)
and domain ontologies
37. Supplementary Materials
CRL language specification
Examples of CRL rules
All details of our experiment
http://cells.icc.ru/pub/crl
Source code of our prototype
licensed under Apache License 2.0
https://github.com/shigarov/cells-ssdc
38. Thanks!
This presentation is available on SlideShare.net
http://www.slideshare.net/shig
Alexey Shigarov
shigarov@icc.ru
http://cells.icc.ru