CRL: A Rule Language
for Table Analysis and Interpretation*
in Unstructured Tabular Data Integration
Alexey Shigarov, shigarov@icc.ru
Matrosov Institute for System Dynamics and Control Theory of SB RAS
17th International Conference on
Data Analytics and Management in Data Intensive Domains
Obninsk, Russia
October 13-16, 2015
* This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042)
and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)
Unstructured vs Structured
Unstructured
Tabular Data
Arbitrary Tables in
ASCII-text,
Spreadsheets,
PDF Documents,
Web-Pages
Structured Data
Relational Databases
Easy Way
Hard Way
For Humans
To Understand
No Explicit
Semantics
We Can Read,
Write, and Edit
For Computers
To Understand
Formal
Data Model
(Semantics)
We Can Query (SQL)
and Analyse (DM, OLAP)
2
Hard Way Back to Structured Data World
Table Detection*
Table Recognition*
Table Analysis*
Table Interpretation*
ASCII-text
Untagged PDF Documents
Image Documents
Spreadsheets
Web Pages
Word Documents
OCR
Databases
Canonical Forms
XML
ETL
* Hurst M. Layout and language: Challenges for table understanding on the web // Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30
3
Our purpose
Globally
to automate unstructured tabular data integration
Databases
Arbitrary Tables
in Spreadsheets
Currently
to automate table analysis and interpretation
Tables in Canonical Form
4
OK, We Initially Have an Arbitrary Tagged Table
We know
• structure (rows, columns, cells)
• style settings (fonts, colors, alignments, etc.)
• textual content
5
All We Need Is To Recover Semantics
Relationships like
entry-label, label-label, label-category*
* Our terminology is inspired by
the X. Wang’s abstract table model
[Wang X. Tabular Abstraction, Editing,
and Formatting, PhD Thesis. 1996]
6
When We Know Semantics We Can Generate a Canonical Table
It can be loaded into a database by ETL tools
7
Challenges on the Hard Way Back
• Too many possible table layouts
• Anyone can invent a new one
• Messy data
• No guarantee that your tabular data are clean and standardized
• Natural language
• Table understanding requires knowledge
8
Our Idea
When
• A table creator (e.g. a company, a government agency, ad-hoc software)
uses a set of rules for table generation
• Tables generated by the same set of rules have similar structure, style, and content
Then
• We can define a set of rules for table analysis and interpretation
• We can use a rule engine to execute these rules
9
Table Analysis and Interpretation Rules
• Rules can be expressed in
• Drools Rule Language* (DRL)
General-purpose language for expressing production rules in Drools* rule engine
• Cells Rule Language (CRL)
Our domain-specific language for expressing table analysis and interpretation rules
• Rules can be executed with Drools* rule engine
*http://drools.org
10
CRL Rules
Rules map known table data to unknown ones
rule
when
Left hand side defines conditions using available facts
(cells, categories)
then
Right hand side defines actions to recover unknown semantics
(entries, labels, categories, entry-label, label-label, label-category)
end
11
CRL: Left Hand Side
factType $variable : Java boolean expressions
cell $cell : constraints
entry $entry : constraints
label $label : constraints
category $category : constraints
12
CRL: Right Hand Side
[Figure: merged cells and split cells]
Cell splitting
To split a cell spanning n tiles into n cells
split $cell
Cell merging
To merge two cells into one
merge $cell1 -> $cell2
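A minimal Python sketch of what `split` and `merge` might do, not the CRL implementation itself. The `Cell` record is hypothetical; it assumes that `cl`, `rt`, `cr`, `rb` (the names used in CRL conditions) mean left column, top row, right column, and bottom row. Copying the text into every split cell and keeping the first cell's text on merge are also assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int  # left column (assumed meaning)
    rt: int  # top row (assumed meaning)
    cr: int  # right column (assumed meaning)
    rb: int  # bottom row (assumed meaning)
    text: str = ""


def split(cell):
    """Split a cell spanning n tiles into n one-tile cells with the same text."""
    return [
        Cell(c, r, c, r, cell.text)
        for r in range(cell.rt, cell.rb + 1)
        for c in range(cell.cl, cell.cr + 1)
    ]


def merge(a, b):
    """Merge two cells into one covering their bounding box; the text of `a` is kept."""
    return Cell(min(a.cl, b.cl), min(a.rt, b.rt),
                max(a.cr, b.cr), max(a.rb, b.rb), a.text)


merged = Cell(cl=1, rt=1, cr=2, rb=2, text="x")
parts = split(merged)  # four one-tile cells, row by row
```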
13
CRL: Right Hand Side
Cell marking
set mark @mark -> $cell
where @mark is a word starting with the @ character
Using marks in conditions
cell $cell : mark == @mark, constraints
Short form
cell@mark $cell : constraints
14
CRL: Right Hand Side
Entry creating
Using a cell value
new entry $cell
Using a specified value
new entry value -> $cell
Label creating
Using a cell value
new label $cell
Using a specified value
new label value -> $cell
15
CRL: Right Hand Side
Label categorizing
To associate a label with a category
set category $category -> $label
Trying to find or create a category with a specified name
set category category_name -> $label
16
CRL: Right Hand Side
Label associating
set parent label $label1 -> $label2
• Labels can be organized in a tree
• We can build hierarchical categories
• We can build compound label values like label1|label2|…|labelN
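As an illustration of the compound label values mentioned above, here is a small Python sketch that walks a label's parent chain and joins the values with `|`. The `parent` mapping and the function name are hypothetical, and the sketch assumes the labels form a tree (no cycles).

```python
def compound_label(label, parent):
    """Join a label with its ancestors, root first, using '|' as the separator."""
    chain = [label]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return "|".join(reversed(chain))


# label-label pairs stored as child -> parent (illustrative values)
parent = {"a11": "a1", "a12": "a1"}
```

With this mapping, `compound_label("a11", parent)` yields `a1|a11`, while a root label such as `a1` is returned unchanged.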
17
CRL: Right Hand Side
Label grouping
group $label1 -> $label2
• A label group constitutes an anonymous category
• We can divide labels into categories without knowing categories
• We can entirely categorize a label group
18
CRL: Right Hand Side
Entry associating
To associate an entry with a label
add label $label -> $entry
Trying to find or create a label in the category with specified value
add label label_value from $category -> $entry
Trying to find or create a category with specified name
add label label_value from category_name -> $entry
19
Canonical Form Generation
<entries>={1,2,3,4,5,6,7,8}
<labels>={a1,a11,a12,a2,a21,a22,b1,b2}
<categories>={A,B}
<entry-label pairs>={(1,a11),(1,b1),(2,a12),
(2,b1),(3,a21),(3,b1),(4,a22),(4,b1),(5,a11),
(5,b2),(6,a12),(6,b2),(7,a21),(7,b2),(8,a22),
(8,b2)}
<label-label pairs>={(a11,a1),(a12,a1),
(a21,a2),(a22,a2)}
<label-category pairs>={(a1,A),(a11,A),
(a12,A),(a2,A),(a21,A),(a22,A),(b1,B),(b2,B)}
Canonical form:
DATA  A         B
1     a1 | a11  b1
2     a1 | a12  b1
3     a2 | a21  b1
4     a2 | a22  b1
5     a1 | a11  b2
6     a1 | a12  b2
7     a2 | a21  b2
8     a2 | a22  b2
Source table (the top-left cell holds the category names A and B):
          a1          a2
       a11   a12   a21   a22
b1      1     2     3     4
b2      5     6     7     8
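The step from the recovered sets to the canonical table can be sketched in Python as follows. The data are the sets listed on this slide; the function names and the row representation are assumptions made for illustration, not the prototype's code.

```python
entry_label_pairs = [
    (1, "a11"), (1, "b1"), (2, "a12"), (2, "b1"), (3, "a21"), (3, "b1"),
    (4, "a22"), (4, "b1"), (5, "a11"), (5, "b2"), (6, "a12"), (6, "b2"),
    (7, "a21"), (7, "b2"), (8, "a22"), (8, "b2"),
]
# label-label pairs as child -> parent
parent = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2"}
# label-category pairs as label -> category
category = {"a1": "A", "a11": "A", "a12": "A", "a2": "A",
            "a21": "A", "a22": "A", "b1": "B", "b2": "B"}


def full_label(label):
    """Compound label value: ancestors first, separated by ' | '."""
    chain = [label]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return " | ".join(reversed(chain))


def canonical_rows(categories):
    """One row per entry: the entry value, then one label per category."""
    rows = {}
    for entry, label in entry_label_pairs:
        row = rows.setdefault(entry, {"DATA": entry})
        row[category[label]] = full_label(label)
    return [[rows[e]["DATA"]] + [rows[e][c] for c in categories]
            for e in sorted(rows)]


table = canonical_rows(["A", "B"])
```

Each row of `table` corresponds to one line of the canonical form, with the columns DATA, A, and B.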
20
Applying CRL: Critical Cells*
[Figure: example table with a blank top-left critical cell, labels a to l, and entries 1 to 7]
* Nagy G. Learning the Characteristics of Critical
Cells from Web Tables // In Proc. of the 21st Int.
Conf. on Pattern Recognition, Tsukuba, Japan,
IEEE Comp. Soc., 2012, pp. 1554-1557
when
cell $cc : cl==1, rt==1, blank
cell $ec : cl>$cc.cr, rt>$cc.rb
then
new entry $ec
-> <entries> = {1,2,3,4,5,6,7}
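The rule above can be read as a filter over cell coordinates. The following Python sketch applies the same two conditions to a small hypothetical table (not the one in the figure); the `Cell` record and the reading of `cl`, `rt`, `cr`, `rb` as left column, top row, right column, and bottom row are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int   # left column (assumed meaning)
    rt: int   # top row (assumed meaning)
    cr: int   # right column (assumed meaning)
    rb: int   # bottom row (assumed meaning)
    text: str


# Hypothetical table: a blank critical cell spanning rows 1-2 of column 1,
# column labels in rows 1-2, row labels in column 1, entries elsewhere.
cells = [
    Cell(1, 1, 1, 2, ""), Cell(2, 1, 3, 1, "a"),
    Cell(2, 2, 2, 2, "c"), Cell(3, 2, 3, 2, "d"),
    Cell(1, 3, 1, 3, "f"), Cell(2, 3, 2, 3, "1"), Cell(3, 3, 3, 3, "2"),
    Cell(1, 4, 1, 4, "g"), Cell(2, 4, 2, 4, "3"), Cell(3, 4, 3, 4, "4"),
]


def find_entries(cells):
    """Values of cells lying to the right of and below a blank top-left cell."""
    entries = []
    for cc in cells:
        if cc.cl == 1 and cc.rt == 1 and cc.text == "":   # cell $cc : cl==1, rt==1, blank
            entries += [ec.text for ec in cells
                        if ec.cl > cc.cr and ec.rt > cc.rb]  # cell $ec : cl>$cc.cr, rt>$cc.rb
    return entries


entries = find_entries(cells)
```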
21
Applying CRL: Label Groups
[Figure: example table with labels a to l and entries 1 to 7]
when
cell $cc : cl == 1, rt == 1, blank
cell $clc : cl > $cc.cr, rb <= $cc.rb
then
set mark @ColLabel -> $clc
new label $clc
when
cell@ColLabel $c1
cell@ColLabel $c2 : rt == $c1.rt
then
group $c1.label -> $c2.label
-> <labels>={a,b,c,d,e,f,g,...}
-> <groups>={{a,b},{c,d,e},{f,g},...}
22
Applying CRL: Row Label Hierarchies
when
cell $c1 : cl==1, $l1 : label
cell $c2 : cl==1, rt>$c1.rt,
indent==$c1.indent+2, $l2 : label
no cells : cl==1, rt>$c1.rt,
rt<$c2.rt, indent==$c1.indent
then
set parent label $c1.label -> $c2.label
-> <label-label pairs> =
{(c1,c),(c11,c1),(c12,c1),(c2,c),
(c21,c2),(d1,d),(d11,d1)}
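A Python sketch of the same indentation logic, assuming the leftmost column is given as a top-to-bottom list of (label, indent) pairs. The indent values 0, 2, and 4 are hypothetical, chosen so that the output reproduces the label-label pairs listed above.

```python
def parent_pairs(rows):
    """Return (child, parent) pairs for rows given as (label, indent), top to bottom.

    A row is a child of an earlier row when its indent is two larger and no row
    with the parent's own indent lies between them.
    """
    pairs = []
    for i, (parent_label, parent_indent) in enumerate(rows):
        for label, indent in rows[i + 1:]:
            if indent == parent_indent:       # "no cells ... indent==$c1.indent"
                break
            if indent == parent_indent + 2:   # "indent==$c1.indent+2"
                pairs.append((label, parent_label))
    return pairs


rows = [("c", 0), ("c1", 2), ("c11", 4), ("c12", 4), ("c2", 2),
        ("c21", 4), ("d", 0), ("d1", 2), ("d11", 4)]
pairs = parent_pairs(rows)
```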
23
Applying CRL: YAML* Specified categories
Category YAML specification
# category YEAR
name: Year
description: years from 1982 to 2015
constraints:
- "198[2-9]"
- "200[1-9]"
- "201[0-5]"
when
category $c : name == "Year"
label $l : $c.canHaveLabel(value)
then
set category $c -> $l
Category YAML specification
# category COUNTRY_CODE
name: CountryCode
description: ISO 3166 2-letter country codes
labels:
- AD
- AE
- ...
- ZW
when
category $c : name == "CountryCode"
label $l : $c.hasLabel(value)
then
set category $c -> $l
*http://yaml.org
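How `canHaveLabel` and `hasLabel` might behave can be sketched in Python. This assumes that each constraint is a regular expression that must match the whole label value and that `labels` is a closed list; both are assumptions about the semantics, and the dictionaries below simply mirror the two YAML specifications (the country list keeps only the codes shown on the slide).

```python
import re

year = {"name": "Year",
        "constraints": ["198[2-9]", "200[1-9]", "201[0-5]"]}
country_code = {"name": "CountryCode",
                "labels": {"AD", "AE", "ZW"}}  # only the codes shown on the slide


def can_have_label(category, value):
    """True if the value fully matches at least one constraint pattern."""
    return any(re.fullmatch(pattern, value)
               for pattern in category.get("constraints", []))


def has_label(category, value):
    """True if the value is one of the category's enumerated labels."""
    return value in category.get("labels", set())
```

Under these assumptions `can_have_label(year, "1985")` holds, while `can_have_label(year, "1995")` does not, because no listed pattern covers that value.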
24
Applying CRL: Category Names
when
cell $cc : cl == 1, rt == 1
cell $c : mark == "@ColLabel"
then
set category token($cc, 0) -> $c.label
A
B
a1 a2 a3
b1 1 2 3
b2 4 5 6
-> <categories> = {A,...}
-> <labels> = {a1,a2,a3,...}
-> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...}
25
Applying CRL: Multi-Valued Cells
Bilingual Tables
α β
阿爾法 公測
γ 1 2
伽馬 一 二
δ 3 4
三角洲 三 四
when
cell $c : cl==1 || rt==1, !blank
then
new label token($c, 0) -> $c
new label token($c, 1) -> $c
Key=Value Cells
C1 C2 C3
a = 1 b = 2 c = 3
d = 4 e = 5 f = 6
g = 7 h = 8 i = 9
when
cell $c : rt>1
then
new label left($c, '=') -> $c
new entry right($c, '=') -> $c
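The helper functions used in these rules can be approximated in Python as below. This is a sketch under assumptions: `token` is taken to split on whitespace (including line breaks inside a cell), and `left`/`right` are taken to split on the first occurrence of the separator and trim spaces. The CRL functions may differ in detail.

```python
def token(text, index):
    """Return the index-th whitespace-separated token of a cell's text."""
    return text.split()[index]


def left(text, sep):
    """Part of the text before the first separator, trimmed."""
    return text.split(sep, 1)[0].strip()


def right(text, sep):
    """Part of the text after the first separator, trimmed."""
    return text.split(sep, 1)[1].strip()


bilingual = "α\n阿爾法"   # one cell holding the same label in two languages
key_value = "a = 1"      # one cell holding a label and an entry
```

Here `token(bilingual, 1)` gives the second-language label, and `left(key_value, "=")` and `right(key_value, "=")` give the label `a` and the entry `1`. Note that `right` raises an `IndexError` when the separator is absent.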
26
Applying CRL: Footnotes
when
cell $footer : onLastRow, $notes : text
entry $e : cell.text matches ".+\\*+",
$ref : extract(cell.text, "\\*+")
then
add label between($notes, $ref, '\n')
from "footnotes" -> $e
a b
c d c d
e 1* 2** 3 4
f 5 6 7 8
g 9 10 11 12
* x
** y
-> <labels>={x,y,...}
-> <categories>={"footnotes",...}
-> <entry-label pairs>={(1,x),(2,y),...}
-> <label-category pairs>={(x,"footnotes"), (y,"footnotes"),...}
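The footnote matching can be sketched in Python with the standard `re` module. The function below is an illustrative reimplementation, not the CRL `extract`/`between` functions: it assumes that the footer cell holds one footnote per line and that a reference is a run of `*` characters at the end of an entry cell.

```python
import re


def footnote_labels(entry_texts, notes):
    """Return (entry value, footnote text) pairs for entries carrying '*' marks."""
    # Index the footer lines by their reference mark, e.g. "* x" -> {"*": "x"}.
    by_ref = {}
    for line in notes.splitlines():
        m = re.fullmatch(r"(\*+)\s*(.+)", line.strip())
        if m:
            by_ref[m.group(1)] = m.group(2)

    pairs = []
    for text in entry_texts:
        m = re.fullmatch(r"(.+?)(\*+)", text)  # value followed by trailing marks
        if m and m.group(2) in by_ref:
            pairs.append((m.group(1), by_ref[m.group(2)]))
    return pairs


pairs = footnote_labels(["1*", "2**", "3", "4"], "* x\n** y")
```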
27
Applying CRL: Colored Tables
when
cell $lc : style.bgColor == "#4f81bd"
cell $ec : style.bgColor == null, rt >= $lc.rt, cl > $lc.cr
no cells : style.bgColor == "#4f81bd", cl > $lc.cr, cr < $ec.cl
then
add label $lc.label -> $ec.entry
[Figure: example colored tables with label cells l1 to l8 and entry cells e1 to e9]
28
Prototype of Spreadsheet Data Extraction and Transformation System
29
Experimental Evaluation
Our purpose is to evaluate the recovery of entries, labels,
entry-label and label-label relationships
Dataset
• We use the TANGO dataset (http://tango.byu.edu/data)
which
• is a part of the TANGO (Table ANalysis for Generating Ontologies) project
(http://tango.byu.edu)
• is intended for testing table interpretation methods
• has 200 arbitrary tables collected from 10 statistical sites in spreadsheet format in 2009
30
Experimental Evaluation
Layouts of TANGO Tables
[Figure: table regions (category name cells, row label cells, column label cells, entry cells) and the layouts found in them. Layouts named in the figure: multi-row hierarchical, multi-column plain, one-column hierarchical, multi-column & multi-row, one-column plain, one-column & one-row, multi-column & one-row, one-row plain, multi-row plain. Shares shown in the figure: 47.5%, 47%, 5.5%, 100%, 94.5%, 5.5%, 65.5%, 26%, 8.5%]
We developed two sets of CRL rules for two table types
• TANGO-200
all tables
• TANGO-SUB
tables without a hierarchical layout in the leftmost column
31
Experimental Evaluation
Measures
• Recall
• a table is processed successfully, when all entries, labels, entry-label pairs, and label-label pairs which
are implicitly contained in its source form are explicitly included in its canonical form
• Precision
• a table is processed successfully, when all entries, labels, entry-label pairs, and label-label pairs which
are explicitly included in its canonical form are implicitly contained in its source form
Process
• Two experts independently compare the source tables with their automatically generated canonical forms
• They judge whether each table is processed successfully in terms of recall and precision
• When they disagree on a table, the final decision is made by a third expert
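The judging process can be illustrated with a short Python sketch. It assumes that each reported measure is the share of tables judged successful for that measure; the function names and the sample decisions are hypothetical.

```python
def final_decision(expert1, expert2, expert3):
    """Two experts decide; a third expert's decision is used only when they disagree."""
    return expert1 if expert1 == expert2 else expert3


def success_rate(decisions):
    """Share of tables judged successful (assumed definition of the reported measure)."""
    return sum(decisions) / len(decisions)


# Hypothetical per-table judgements: (expert 1, expert 2, expert 3)
judgements = [(True, True, False), (True, False, True),
              (False, False, True), (True, True, True)]
decisions = [final_decision(*j) for j in judgements]
rate = success_rate(decisions)
```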
32
Experimental Evaluation
Results
Rule Set / Table Type TANGO-200 TANGO-SUB
Tables 200 105
Cells 22757 10893
Rules 16 13
Recall 87% 95%
Precision 89% 95%
For TANGO-200
• 33 tables are processed with errors
• 85% of errors arise in the leftmost column with a one-column hierarchical layout
• Two main causes:
1) ambiguity among style characteristics
2) hierarchical relationships expressed by natural language only
33
Comparison with others
Methods and Tools for Table Analysis and Interpretation
1-5 Fixed Types of Tables Programmable Table Types
Knowledge-based
methods
Douglas, 1995
Tijerino, 2005
Embley, 2005
WangJ, 2012
• Domain ontologies
• Taxonomies like
ProBase, FreeBase
Domain-independent methods
Gatterbauer, 2007
Pivk, 2005, 2006, 2007
Kim, 2008
Chen&Cafarella, 2013, 2014
Embley, 2014
Nagy, 2014
• Spatial, style, and textual data
• Several typical table types
We are here!
2014, 2015
• Rule language (CRL, DRL)
• Relative cell addressing
• Fixed target schema
• Spatial, style,
and textual data
Hung, 2011
• Spreadsheet-like formula
mapping language (TranSheet)
• Absolute cell addressing
• Programmable target schema
• Spatial and textual data
34
Conclusions
• Our methodology is mainly aimed at unstructured tabular data integration
• We expect it to be useful when data from a large number of tables
belonging to a few table types are required for populating a database
• One set of rules can be suitable for processing a wide range of arbitrary tables
with high accuracy
• The experiment demonstrates that narrowing a table type can simplify the
rules and increase recall and precision in table canonicalization
35
Further Work
• Table Layouts
to develop techniques for widely used table features,
e.g. for recovering a row label hierarchy in the leftmost column
• Messy Tabular Data
to incorporate data cleansing techniques into table understanding
• Natural Language
to add knowledge, global taxonomies (e.g. FreeBase, DBpedia)
and domain ontologies
36
Supplementary Materials
CRL language specification
Examples of CRL rules
All details of our experiment
http://cells.icc.ru/pub/crl
Source code of our prototype
licensed under Apache License 2.0
https://github.com/shigarov/cells-ssdc
37
Thanks!
This presentation is available on SlideShare.net
http://www.slideshare.net/shig
Alexey Shigarov
shigarov@icc.ru
http://cells.icc.ru
38

American Type Culture Collection (ATCC).pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 

CRL: A Rule Language for Table Analysis and Interpretation

  • 1. CRL: A Rule Language for Table Analysis and Interpretation* in Unstructured Tabular Data Integration Alexey Shigarov, shigarov@icc.ru Matrosov Institute for System Dynamics and Control Theory of SB RAS 17th International Conference on Data Analytics and Management in Data Intensive Domains Obninsk, Russia October 13-16, 2015 * This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042) and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)
  • 2. Unstructured vs Structured. Unstructured tabular data (arbitrary tables in ASCII text, spreadsheets, PDF documents, web pages): for humans to understand; no explicit semantics; we can read, write, and edit. Structured data (relational databases): for computers to understand; formal data model (semantics); we can query (SQL) and analyse (DM, OLAP). Easy way: from structured to unstructured data. Hard way: from unstructured to structured data.
  • 3. Hard Way Back to Structured Data World Table Detection* Table Recognition* Table Analysis* Table Interpretation* ASCII-text Untagged PDF Documents Image Documents Spreadsheets Web Pages Word Documents OCR Databases Canonical Forms XML ETL * Hurst M. Layout and language: Challenges for table understanding on the web // Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30
  • 4. Our Purpose. Globally: to automate unstructured tabular data integration (from arbitrary tables in spreadsheets to databases). Currently: to automate table analysis and interpretation (from arbitrary tables in spreadsheets to tables in canonical form).
  • 5. Ok, We Have Initially an Arbitrary Tagged Table We know • structure (rows, columns, cells) • style settings (fonts, colors, alignments, etc.) • textual content 5
  • 6. All We Need Is To Recover Semantics Relationships like entry-label, label-label, label-category* * Our terminology is inspired by X. Wang’s abstract table model [Wang X. Tabular Abstraction, Editing, and Formatting, PhD Thesis. 1996]
  • 7. When We Know Semantics We Can Generate a Canonical Table It can be loaded into a database by ETL tools 7
  • 8. Challenges on the Hard Way Back • Too many layouts to create a table • Anyone can invent a new one • Messy data • No guarantee that your tabular data are clean and standardized • Natural language • Table understanding requires knowledge
  • 9. Our Idea When • A table creator (e.g. a company, a government agency, ad-hoc software) uses a set of rules for table generation • Tables have similar structure, style, and content within a set of generating rules Then • We can define a set of rules for table analysis and interpretation • We can use a rule engine to execute these rules
  • 10. Table Analysis and Interpretation Rules • Rules can be expressed in • Drools Rule Language* (DRL) General-purpose language for expressing production rules in Drools* rule engine • Cells Rule Language (CRL) Our domain-specific language for expressing table analysis and interpretation rules • Rules can be executed with Drools* rule engine *http://drools.org 10
  • 11. CRL Rules Rules map known table data to unknown ones rule when Left hand side defines conditions using available facts (cells, categories) then Right hand side defines actions to recover unknown semantics (entries, labels, categories, entry-label, label-label, label-category) end 11
  • 12. CRL: Left Hand Side factType $variable : Java boolean expressions cell $cell : constraints entry $entry : constraints label $label : constraints category $category : constraints 12
  • 13. CRL: Right Hand Side Cell splitting (merged cells become split cells): to split a cell spanning n tiles into n cells: split $cell Cell merging: to merge two cells into one: merge $cell1 -> $cell2
  • 14. CRL: Right Hand Side Cell marking set mark @mark -> $cell where @mark is a word with @ starting character Using marks in conditions cell $cell : mark == @mark, constraints Short form cell@mark $cell : constraints 14
  • 15. CRL: Right Hand Side Entry creating Using a cell value new entry $cell Using a specified value new entry value -> $cell Label creating Using a cell value new label $cell Using a specified value new label value -> $cell 15
  • 16. CRL: Right Hand Side Label categorizing To associate a label with a category set category $category -> $label Trying to find or create a category with a specified name set category category_name -> $label 16
  • 17. CRL: Right Hand Side Label associating set parent label $label1 -> $label2 • Labels can be organized in a tree • We can build hierarchical categories • We can build compound label values like label1|label2|…|labelN 17
  • 18. CRL: Right Hand Side Label grouping group $label1 -> $label2 • A label group constitutes an anonymous category • We can divide labels into categories without knowing categories • We can entirely categorize a label group 18
  • 19. CRL: Right Hand Side Entry associating To associate an entry with a label add label $label -> $entry Trying to find or create a label in the category with specified value add label label_value from $category -> $entry Trying to find or create a category with specified name add label label_value from category_name -> $entry 19
  • 20. Canonical Form Generation
      <entries> = {1,2,3,4,5,6,7,8}
      <labels> = {a1,a11,a12,a2,a21,a22,b1,b2}
      <categories> = {A,B}
      <entry-label pairs> = {(1,a11),(1,b1),(2,a12),(2,b1),(3,a21),(3,b1),(4,a22),(4,b1),(5,a11),(5,b2),(6,a12),(6,b2),(7,a21),(7,b2),(8,a22),(8,b2)}
      <label-label pairs> = {(a11,a1),(a12,a1),(a21,a2),(a22,a2)}
      <label-category pairs> = {(a1,A),(a11,A),(a12,A),(a2,A),(a21,A),(a22,A),(b1,B),(b2,B)}
      Source table:
        A B   a1         a2
              a11  a12   a21  a22
        b1    1    2     3    4
        b2    5    6     7    8
      Canonical table:
        DATA  A          B
        1     a1 | a11   b1
        2     a1 | a12   b1
        3     a2 | a21   b1
        4     a2 | a22   b1
        5     a1 | a11   b2
        6     a1 | a12   b2
        7     a2 | a21   b2
        8     a2 | a22   b2
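The generation step on slide 20 can be sketched in Python. This is a minimal sketch under stated assumptions, not the prototype's code: the facts are copied from the slide, while the names `compound` and `canonical_rows` and the dictionary-based row format are hypothetical.

```python
# Facts copied from slide 20; all function and variable names are hypothetical.
ENTRIES = [1, 2, 3, 4, 5, 6, 7, 8]
ENTRY_LABEL = [
    (1, "a11"), (1, "b1"), (2, "a12"), (2, "b1"),
    (3, "a21"), (3, "b1"), (4, "a22"), (4, "b1"),
    (5, "a11"), (5, "b2"), (6, "a12"), (6, "b2"),
    (7, "a21"), (7, "b2"), (8, "a22"), (8, "b2"),
]
LABEL_PARENT = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2"}
LABEL_CATEGORY = {
    "a1": "A", "a11": "A", "a12": "A", "a2": "A", "a21": "A", "a22": "A",
    "b1": "B", "b2": "B",
}


def compound(label):
    """Join a label with its ancestors, root first, e.g. 'a1 | a11'."""
    path = [label]
    while path[-1] in LABEL_PARENT:
        path.append(LABEL_PARENT[path[-1]])
    return " | ".join(reversed(path))


def canonical_rows():
    """Build one record per entry, with one field per category."""
    rows = []
    for entry in ENTRIES:
        row = {"DATA": entry}
        for e, label in ENTRY_LABEL:
            if e == entry:
                row[LABEL_CATEGORY[label]] = compound(label)
        rows.append(row)
    return rows
```

Each resulting record has the fields DATA, A, and B, which is the shape an ETL tool could load into a relational table.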
  • 21. Applying CRL: Critical Cells*
      [Figure: example table whose blank top-left cell is the critical cell]
      when
        cell $cc : cl==1, rt==1, blank
        cell $ec : cl>$cc.cr, rt>$cc.rb
      then
        new entry $ec
      -> <entries> = {1,2,3,4,5,6,7}
      * Nagy G. Learning the Characteristics of Critical Cells from Web Tables // In Proc. of the 21st Int. Conf. on Pattern Recognition, Tsukuba, Japan, IEEE Comp. Soc., 2012, pp. 1554-1557
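The critical-cell rule can be read as a plain filter over cells. The following Python sketch assumes a hypothetical `Cell` record whose fields mirror the CRL attributes `cl`, `rt`, `cr`, and `rb` (left column, top row, right column, bottom row); the small table is invented for illustration and is not the table from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int         # left column
    rt: int         # top row
    cr: int         # right column
    rb: int         # bottom row
    text: str = ""

    @property
    def blank(self):
        return self.text.strip() == ""


def find_entries(cells):
    """Cells strictly to the right of and below a blank top-left cell."""
    entries = []
    for cc in cells:
        if cc.cl == 1 and cc.rt == 1 and cc.blank:
            for ec in cells:
                if ec.cl > cc.cr and ec.rt > cc.rb:
                    entries.append(ec.text)
    return entries


# Hypothetical table: a blank corner spanning rows 1-2, a two-level
# column header, two row labels, and four entries.
TABLE = [
    Cell(1, 1, 1, 2, ""),
    Cell(2, 1, 3, 1, "a"),
    Cell(2, 2, 2, 2, "c"),
    Cell(3, 2, 3, 2, "d"),
    Cell(1, 3, 1, 3, "h"),
    Cell(2, 3, 2, 3, "1"),
    Cell(3, 3, 3, 3, "2"),
    Cell(1, 4, 1, 4, "i"),
    Cell(2, 4, 2, 4, "3"),
    Cell(3, 4, 3, 4, "4"),
]
```

A rule engine such as Drools performs this matching declaratively; the nested loop only makes the join between `$cc` and `$ec` explicit.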
  • 22. Applying CRL: Label Groups
      [Figure: the example table from the previous slide]
      when
        cell $cc : cl == 1, rt == 1, blank
        cell $clc : cl > $cc.cr, rb <= $cc.rb
      then
        set mark @ColLabel -> $clc
        new label $clc
      -> <labels> = {a,b,c,d,e,f,g,...}
      when
        cell@ColLabel $c1
        cell@ColLabel $c2 : rt == $c1.rt
      then
        group $c1.label -> $c2.label
      -> <groups> = {{a,b},{c,d,e},{f,g},...}
  • 23. Applying CRL: Row Label Hierarchies
      when
        cell $c1 : cl==1, $l1 : label
        cell $c2 : cl==1, rt>$c1.rt, indent==$c1.indent+2, $l2 : label
        no cells : cl==1, rt>$c1.rt, rt<$c2.rt, indent==$c1.indent
      then
        set parent label $c1.label -> $c2.label
      -> <label-label pairs> = {(c1,c),(c11,c1),(c12,c1),(c2,c),(c21,c2),(d1,d),(d11,d1)}
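The indentation rule can be traced with a short Python sketch. It assumes the leftmost column is given as (label, indent) pairs from top to bottom; the column layout below is an assumption reconstructed from the label-label pairs on the slide, and `parent_pairs` is a hypothetical name.

```python
def parent_pairs(rows):
    """rows: (label, indent) tuples for the leftmost column, top to bottom.

    Returns (child, parent) pairs: a cell is a child of an earlier cell when
    it is indented by exactly 2 more and no cell between them has the same
    indent as the candidate parent.
    """
    pairs = []
    for i, (parent, p_indent) in enumerate(rows):
        for j in range(i + 1, len(rows)):
            child, c_indent = rows[j]
            if c_indent != p_indent + 2:
                continue
            # Mirrors the "no cells" condition of the rule.
            blocked = any(ind == p_indent for _, ind in rows[i + 1:j])
            if not blocked:
                pairs.append((child, parent))
    return pairs


# Assumed layout of the leftmost column (label, indent).
COLUMN = [
    ("c", 0), ("c1", 2), ("c11", 4), ("c12", 4), ("c2", 2), ("c21", 4),
    ("d", 0), ("d1", 2), ("d11", 4),
]
```

On this assumed column the function yields the same seven pairs as the slide, in a different order.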
  • 24. Applying CRL: YAML* Specified Categories
      Category YAML specification:
        # category YEAR
        name: Year
        description: years from 1982 to 2015
        constraints:
          - "198[2-9]"
          - "200[1-9]"
          - "201[0-5]"
      when
        category $c : name == "Year"
        label $l : $c.canHaveLabel(value)
      then
        set category $c -> $l
      Category YAML specification:
        # category COUNTRY_CODE
        name: CountryCode
        description: ISO 3166 2-letter country codes
        labels:
          - AD
          - AE
          - ...
          - ZW
      when
        category $c : name == "CountryCode"
        label $l : $c.hasLabel(value)
      then
        set category $c -> $l
      *http://yaml.org
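How the two category checks might behave can be sketched in Python. This is an assumption-laden illustration: the categories are written as plain dictionaries instead of being parsed from YAML, `can_have_label` is assumed to require a full regular-expression match, and the label list holds only the three country codes visible on the slide. Note that the three patterns as listed do not cover 1990 to 2000.

```python
import re

# Categories as plain dicts (hypothetical stand-ins for the YAML specifications).
YEAR = {"name": "Year", "constraints": ["198[2-9]", "200[1-9]", "201[0-5]"]}
COUNTRY_CODE = {"name": "CountryCode", "labels": {"AD", "AE", "ZW"}}


def can_have_label(category, value):
    """True if the value fully matches one of the category's patterns."""
    return any(re.fullmatch(p, value) for p in category.get("constraints", []))


def has_label(category, value):
    """True if the value is one of the category's enumerated labels."""
    return value in category.get("labels", ())


def categorize(values, categories):
    """Map each label value to the first category that accepts it."""
    result = {}
    for value in values:
        for category in categories:
            if can_have_label(category, value) or has_label(category, value):
                result[value] = category["name"]
                break
    return result
```

For example, `categorize(["1985", "AE", "total"], [YEAR, COUNTRY_CODE])` assigns "1985" to Year and "AE" to CountryCode, and leaves "total" uncategorized.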
  • 25. Applying CRL: Category Names when cell $cc : cl == 1, rt == 1 cell $c : mark == "@ColLabel" then set category token($cc, 0) -> $c.label A B a1 a2 a3 b1 1 2 3 b2 4 5 6 -> <categories> = {A,...} -> <labels> = {a1,a2,a3,...} -> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...} 25
  • 26. Applying CRL: Multi-Valued Cells
      Bilingual tables (cell values: α β 阿爾法 公測 γ 1 2 伽馬 一 二 δ 3 4 三角洲 三 四):
      when
        cell $c : cl==1 || rt==1, !blank
      then
        new label token($c, 0) -> $c
        new label token($c, 1) -> $c
      Key=value cells (columns C1 C2 C3; cell values a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8, i = 9):
      when
        cell $c : rt>1
      then
        new label left($c, '=') -> $c
        new entry right($c, '=') -> $c
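The helper functions used in these rules can be approximated in Python. The semantics below are assumptions, since the slides do not define them: `token` is taken to split a cell's text on whitespace, and `left` and `right` are taken to return the trimmed text before and after the first separator.

```python
def token(text, index):
    """Return the index-th whitespace-separated token of a cell's text."""
    return text.split()[index]


def left(text, sep):
    """Text before the first separator, trimmed."""
    return text.partition(sep)[0].strip()


def right(text, sep):
    """Text after the first separator, trimmed."""
    return text.partition(sep)[2].strip()


def split_key_value(text, sep="="):
    """Split a key=value cell into a (label, entry) pair."""
    return left(text, sep), right(text, sep)
```

Under these assumptions a cell "a = 1" yields the label "a" and the entry "1", and a bilingual header cell "α 阿爾法" yields the two labels "α" and "阿爾法".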
  • 27. Applying CRL: Footnotes
      [Figure: example table with the entries 1* and 2** and the footer notes "* x" and "** y"]
      when
        cell $footer : onLastRow, $notes : text
        entry $e : cell.text matches ".+\*+", $ref : extract(cell.text, "\*+")
      then
        add label between($notes, $ref, '\n') from "footnotes" -> $e
      -> <labels> = {x,y,...}
      -> <categories> = {"footnotes",...}
      -> <entry-label pairs> = {(1,x),(2,y),...}
      -> <label-category pairs> = {(x,"footnotes"),(y,"footnotes"),...}
  • 28. Applying CRL: Colored Tables
      [Figure: example tables in which label cells have the background color #4f81bd and entry cells have no background color]
      when
        cell $lc : style.bgColor == "#4f81bd"
        cell $ec : style.bgColor == null, rt >= $lc.rt, cl > $lc.cr
        no cells : style.bgColor == "#4f81bd", cl > $lc.cr, cr < $ec.cl
      then
        add label $lc.label -> $ec.entry
  • 29. Prototype of Spreadsheet Data Extraction and Transformation System
  • 30. Experimental Evaluation Our purpose is evaluation of recovering entries, labels, entry-label and label-label relationships Dataset • We use the TANGO dataset (http://tango.byu.edu/data) which • is a part of the TANGO (Table ANalysis for Generating Ontologies) project (http://tango.byu.edu) • is intended for testing table interpretation methods • has 200 arbitrary tables collected from 10 statistical sites in spreadsheet format in 2009 30
  • 31. Experimental Evaluation Multi-row hierarchical layout Multi-column plain layout One-column hierarchical layout Multi-column & multi-row layout One-column plain layout Category name cells Row label cells Column label cells Entry cells Table regions One-column & one-row layout Multi-column & one-row layout One-row plain layout Multi-row plain layout 47,5% 47% 5,5% 100% 94,5% 5,5% 65,5% 26% 8,5% 31 We develop two sets of CRL rules to define two table types • TANGO-200 all tables • TANGO-SUB without tables having hierarchical layout in the leftmost column Layouts of TANGO Tables
  • 32. Experimental Evaluation Measures • Recall • a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are implicitly contained in its source form are explicitly included in its canonical form • Precision • a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are explicitly included in its canonical form are implicitly contained in its source form Process • Two experts independently compare the source tables with their automatically generated canonical forms • They judge whether each table is processed successfully in terms of recall and precision • When they make opposite decisions on a table, the final decision is made by a third expert
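The two table-level criteria amount to subset checks between the facts in a source table and the facts in its canonical form. The sketch below is only an illustration: in the experiment the judgments were made by human experts, the fact sets and function names here are hypothetical, and it assumes each reported percentage is the share of tables judged successful.

```python
def recall_ok(source_facts, canonical_facts):
    """Every fact implicit in the source appears in the canonical form."""
    return source_facts <= canonical_facts


def precision_ok(source_facts, canonical_facts):
    """Every fact in the canonical form is really present in the source."""
    return canonical_facts <= source_facts


def success_rate(tables, check):
    """Share of (source_facts, canonical_facts) pairs that pass the check."""
    return sum(check(s, c) for s, c in tables) / len(tables)


# Hypothetical fact sets for three tables: (source, canonical).
TABLES = [
    ({1, 2}, {1, 2}),  # complete and correct
    ({1, 2}, {1}),     # a fact is missing: fails the recall criterion
    ({1}, {1, 3}),     # a spurious fact: fails the precision criterion
]
```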
  • 33. Experimental Evaluation Results
      Rule Set / Table Type   TANGO-200   TANGO-SUB
      Tables                  200         105
      Cells                   22757       10893
      Rules                   16          13
      Recall                  87%         95%
      Precision               89%         95%
      For TANGO-200 • 33 tables are processed with errors • 85% of errors arise in the leftmost column with one-column hierarchical layout • Two main causes: 1) ambiguity among style characteristics 2) hierarchical relationships expressed by natural language only
  • 34. Comparison with others Methods and Tools for Table Analysis and Interpretation 1-5 Fixed Types of Tables Programmable Table Types Knowledge-based methods Douglas, 1995 Tijerino, 2005 Embley, 2005 WangJ, 2012 • Domain ontologies • Taxonomies like ProBase, FreeBase Domain-independent methods Gatterbauer, 2007 Pivk, 2005, 2006, 2007 Kim, 2008 Chen&Cafarella, 2013, 2014 Embley, 2014 Nagy, 2014 • Spatial, style, and textual data • Several typical table types We are here! 2014, 2015 • Rule language (CRL, DRL) • Relative cell addressing • Fixed target schema • Spatial, style, and textual data Hung, 2011 • Spreadsheet-like formula mapping language (TranSheet) • Absolute cell addressing • Programmable target schema • Spatial and textual data 34
  • 35. Conclusions • Our methodology is mainly oriented toward unstructured tabular data integration • We expect it to be useful when data from a large number of tables belonging to a few table types are required for populating a database • One set of rules can be suitable for processing a wide range of arbitrary tables with high accuracy • The experiment demonstrates that narrowing a table type can simplify the rules and increase recall and precision in table canonicalization
  • 36. Further Work • Table Layouts to develop techniques for widely used table features, e.g. for recovering a row label hierarchy in the leftmost column • Messy Tabular Data to incorporate data cleansing techniques into table understanding • Natural Language to add knowledge, global taxonomies (e.g. FreeBase, DBpedia) and domain ontologies 36
  • 37. Supplementary Materials CRL language specification Examples of CRL rules All details of our experiment http://cells.icc.ru/pub/crl Source code of our prototype licensed under Apache License 2.0 https://github.com/shigarov/cells-ssdc 37
  • 38. Thanks! This presentation is available on SlideShare.net http://www.slideshare.net/shig Alexey Shigarov shigarov@icc.ru http://cells.icc.ru 38