Bibliotheca Digitalis
Reconstitution of Early Modern Cultural Networks
From Primary Source to Data
DARIAH / Biblissima Summer School
Le Mans, 4-8 July 2017
From Pixels to Contents
Jean-Yves Ramel
Professor of Computer Science,
Computer Laboratory, University of Tours
1st day, July 4th – Digital sources: theoretical fundamentals
From pixels to content: An overview of the main techniques used in DIA
Jean-Yves RAMEL
IMAGES
CONTENTS
From Pixels to Contents
Introduction
Document Image Analysis
Machine Learning
Content-Based Image Retrieval
Tools to automatically extract information from images / documents
Generation and use of meta-data
From Pixels … to contents
Outline
• From pixels…
  o What is an image?
  o Image (pre-)processing
• … to Text
  o Transcription and layout analysis
  o Segmentation and content extraction
  o An overview of Pattern Recognition
• … but also to non-Text
  o Content characterization and signatures
  o Content retrieval and spotting
• Back to meta-data?
  o From descriptive to perceptual meta-data
  o Are there adequate encoding formats?
• Conclusions and perspectives
From Pixels…
IMAGES
From Pixels…
Digitization → Set of pixels?
• Images come from a grid of microscopic photosensitive cells called PIXELS
• Sampling
  o Continuous position (x,y) → Discrete position (xi,yi) → Pixels
• Quantization
  o Assignment of a numerical value derived from the light energy received by each pixel (grid unit)
  o Determines the range of colors that each pixel can take
[Figure: acquisition system. A source of illumination lights the initial object (real world); the radiant energy forms an analogue image; space partitioning (sampling) and the encoding of color information by the sensors (quantization) give the digitized object, made of pixels.]
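As a minimal sketch of the two steps above, sampling reads a continuous scene at the centre of each grid cell and quantization maps the measured intensity to one of a fixed number of levels. The `scene` function, the grid size and the 256 levels are illustrative assumptions, not values taken from the slides.

```python
def quantize(intensity, levels=256):
    """Map a continuous intensity in [0.0, 1.0] to an integer level in [0, levels - 1]."""
    clipped = min(max(intensity, 0.0), 1.0)
    return min(int(clipped * levels), levels - 1)


def sample_and_quantize(scene, width, height, levels=256):
    """Sample scene(x, y) at the centre of each grid cell, then quantize the value."""
    return [
        [quantize(scene((i + 0.5) / width, (j + 0.5) / height), levels)
         for i in range(width)]
        for j in range(height)
    ]


# Hypothetical continuous scene: brightness grows from left (dark) to right (light).
image = sample_and_quantize(lambda x, y: x, width=4, height=2)
print(image)
```

A coarser grid loses spatial detail, while fewer levels lose tonal detail; the two choices are independent.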
From Pixels…
What is an image?
Image Quantization
Binary images: I(i,j) = 0 (black) or I(i,j) = 1 (white)
Gray-level images (8 bits/pixel):
I(i,j) = 0…255, from the darkest (0) to the lightest (255).
Color images (24 bits/pixel):
3 values of light intensity: Red, Green, Blue
I1(i,j) = 0…255 – I2(i,j) = 0…255 – I3(i,j) = 0…255
Image Representation & Processing
Image = Array(s) of pixels = Matrices of values
1 pixel = a position (i,j) inside the image + 1 color (1 to 3 values)
The value I(i,j) associated with each pixel (i,j) represents its brightness intensity
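The three kinds of image described above can be written down directly as matrices. The sketch below uses plain nested lists to stay dependency-free (a NumPy array would be the usual choice in practice) and assumes that i is the row index and j the column index; the pixel values are made up for illustration.

```python
# Binary image: 0 = black, 1 = white.
binary = [[1, 0, 1],
          [1, 0, 1]]

# Gray-level image: one value in 0..255 per pixel.
gray = [[0, 128, 255],
        [64, 192, 32]]

# Color image: one (R, G, B) triple per pixel, each channel in 0..255.
color = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)],
         [(0, 0, 0), (128, 128, 128), (255, 255, 255)]]


def shape(image):
    """Return (number of rows, number of columns)."""
    return len(image), len(image[0])


def pixel(image, i, j):
    """Return I(i, j): the value stored at row i, column j."""
    return image[i][j]


print(shape(color), pixel(gray, 1, 2), pixel(color, 0, 2))
```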
From Pixels…
What is an image?
• Sampling → Image resolution
• Number of pixels per unit of length
• In dpi (dots per inch) or ppp (points par pouce)
• When the resolution decreases, the precision decreases
• VF Image processing
• A4 page = 21 x 29.7 cm
• 200 dpi: 1650 x 2340 pixels = 3 861 000 pixels
• 300 dpi: 2480 x 3500 pixels = 8 680 000 pixels
• 16M colors, 1 pixel = 3 bytes → 10 to 25 MB/page!
• A trade-off between quality and quantity/time is mandatory
• Fidelity of the digital version
• Storage size – Transmission / Processing time
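The page-size arithmetic above can be reproduced in a few lines. This sketch assumes an uncompressed image with 3 bytes per pixel; the exact pixel counts differ slightly from the rounded figures quoted on the slide.

```python
CM_PER_INCH = 2.54


def page_pixels(width_cm, height_cm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_cm / CM_PER_INCH * dpi),
            round(height_cm / CM_PER_INCH * dpi))


def raw_size_bytes(width_cm, height_cm, dpi, bytes_per_pixel=3):
    """Uncompressed size of the scan, assuming 24-bit color by default."""
    w, h = page_pixels(width_cm, height_cm, dpi)
    return w * h * bytes_per_pixel


for dpi in (200, 300):
    w, h = page_pixels(21.0, 29.7, dpi)
    size_mb = raw_size_bytes(21.0, 29.7, dpi) / 1e6
    print(f"A4 at {dpi} dpi: {w} x {h} pixels, about {size_mb:.1f} MB uncompressed")
```

Moving from 200 to 300 dpi multiplies the storage cost by 2.25, which is why the trade-off between fidelity and volume matters for mass digitization.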
From Pixels…
Why are a few pixels so important?
• Because, even at a high resolution, some isolated patterns may correspond to only a small number of pixels (from 20 to 50 pixels)!
• 20 to 40% of the useful information lies on the boundaries of the shapes and is modified randomly by the digitization process
• A few missing pixels can change the appearance of a character to the point of confusion with another one
• Loss of topological information → Broken characters or touching characters
• Human perception is not sensitive to the resolution
[Figure: a broken shape and touching characters, both leading to OCR errors]
After digitization, the images usually still have a lot of defects
• Curvature and skew due to scanning
• Noise on boundaries, dots, blur, …
From Pixels…
What is Image (pre-)processing?
Curvature and skew correction is possible on text images
From Pixels…
Image (pre-)processing
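The slides do not name a skew-correction method. One classical option for text images, sketched here as an assumption, is the projection profile: try several candidate angles and keep the one for which the ink piles up in the fewest rows. To stay dependency-free, the small rotation is approximated by a vertical shear with wrap-around, and ink pixels are 0 as in the binary convention used earlier.

```python
import math


def shear(image, angle_deg):
    """Approximate a small rotation by shifting each column vertically (with wrap-around)."""
    h, w = len(image), len(image[0])
    slope = math.tan(math.radians(angle_deg))
    out = [[1] * w for _ in range(h)]
    for x in range(w):
        shift = round((x - w // 2) * slope)
        for y in range(h):
            out[(y + shift) % h][x] = image[y][x]
    return out


def profile_variance(image):
    """Variance of the number of ink pixels (value 0) per row."""
    counts = [row.count(0) for row in image]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)


def estimate_correction(image, max_angle=5.0, step=0.5):
    """Return the candidate angle whose shear gives the sharpest row profile."""
    n = int(round(2 * max_angle / step))
    candidates = [-max_angle + k * step for k in range(n + 1)]
    return max(candidates, key=lambda a: profile_variance(shear(image, a)))


# Synthetic page: three text lines drawn as solid ink rows, then skewed by 3 degrees.
ink_rows = {10, 11, 30, 31, 50, 51}
page = [[0] * 60 if y in ink_rows else [1] * 60 for y in range(60)]
skewed = shear(page, 3.0)
correction = estimate_correction(skewed)
deskewed = shear(skewed, correction)
print(correction)
```

This only handles a global skew; page curvature and heterogeneous content need local models, which is why the following slides call the problem more complicated.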
From Pixels…
Image (pre-)processing
The problem is more complicated in case of heterogeneous content
The problem is more complicated in case of heterogeneous content
From Pixels…
Image (pre-)processing
… to text and layout
CONTENTS
IMAGES
1-Offrant appres plusieurs rues sur ce faites
2-au lieu et temps acostume aly livre et
3- accenste la somme et quantite dessoulz
4-estempte du voloir et consentement de
5-plusieurs des borgeys de la dite ville
6-de chastellion ______________IIIIxXX V
flor(ins)
7-receu dudit franceys rotier pour
8-une albaleste dehue a cause de la
9-dite ferme dudit trezein avec la
10-somme et quantite dessus estempte ?
11- ______________II flor(ins)
12-receu de Alexandre ? escoffier al loup ?
13- dudit lieu de chatellion pour la
14-rense et ferme du commun de la dite
15-ville et communaute ? pour ung an commettre ?
16-alafeste nativite fi jehan baptiste lan
17-mil quatre cent et cinquante? A ly par le
18-pris dessouls esempt accense et hure ?
19-comme au plus offrant du voloyr et
20-et consentement de pluseures des bourgeys
21-de la dite ville de chastellion apres
22-pluseures rues ….refaytes es lieus ?
Preservation of the
link between the
Text and the Image
From pixels … to text
Automatic transcription but also layout Analysis
First step: Image segmentation
• Transformation of the image (set of pixels) into higher-level patterns (regions of interest), called Elements of Content (EoC)
• These EoC can be very simple (parts of characters) or more sophisticated (paragraphs, illustrations, …)
• EoC extraction: Background (white) / Foreground (black) separation
• Color image → Grayscale → Binarisation
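The chain color image → grayscale → binarisation can be sketched as follows. The channel average and the fixed threshold of 128 are simplifying assumptions (weighted luminance and an automatically selected threshold are common alternatives); the output follows the convention 0 = black foreground, 1 = white background.

```python
def to_gray(color_image):
    """Convert (R, G, B) triples to one gray level by averaging the channels."""
    return [[round((r + g + b) / 3) for (r, g, b) in row] for row in color_image]


def binarize(gray_image, threshold=128):
    """Return 1 (white) where the pixel is at least `threshold`, else 0 (black)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_image]


# Hypothetical 2 x 2 color image: light paper on the left, dark ink on the right.
color = [[(255, 255, 255), (10, 20, 30)],
         [(200, 180, 190), (0, 0, 0)]]
gray = to_gray(color)
binary = binarize(gray)
print(gray, binary)
```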
From pixels … to text
An overview of OCR mechanisms
Just to illustrate the difficulties…
 Most of the segmentation methods need a binarisation
From pixels … to text
An overview of OCR mechanisms
Threshold selection ?
Just to illustrate the difficulties…
• Most segmentation methods need a binarisation
• Global threshold
• Local thresholds
Niblack: S = m + k·s with k = -0.2
(m: local mean, s: local standard deviation)
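A local threshold in the spirit of Niblack can be sketched as below: for each pixel, the threshold is the mean of a small neighbourhood plus k times its standard deviation, with k = -0.2 as on the slide. The window radius and the border handling (the window is simply clipped at the image edges) are assumptions made for this illustration.

```python
import math


def niblack_threshold(gray, i, j, radius=1, k=-0.2):
    """Local threshold m + k * s computed over the window centred on (i, j)."""
    h, w = len(gray), len(gray[0])
    window = [gray[y][x]
              for y in range(max(0, i - radius), min(h, i + radius + 1))
              for x in range(max(0, j - radius), min(w, j + radius + 1))]
    m = sum(window) / len(window)
    s = math.sqrt(sum((v - m) ** 2 for v in window) / len(window))
    return m + k * s


def niblack_binarize(gray, radius=1, k=-0.2):
    """Return 1 (white) where the pixel is above its local threshold, else 0 (black)."""
    return [[1 if gray[i][j] > niblack_threshold(gray, i, j, radius, k) else 0
             for j in range(len(gray[0]))]
            for i in range(len(gray))]


# A single dark pixel on a light background.
gray = [[200, 200, 200],
        [200, 50, 200],
        [200, 200, 200]]
print(niblack_binarize(gray))
```

Unlike a global threshold, this adapts to uneven lighting or stained paper, at the price of more computation and of noise in large uniform areas where the local deviation is close to zero.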
From pixels … to text
An overview of OCR mechanisms
First step: Image segmentation / Connected components
• Then we can try to group black pixels together to locate and recognize higher-level Elements of Content (EoC)
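Grouping black pixels is usually done by connected-component labelling. The sketch below is a plain breadth-first search with 8-connectivity (a choice made here, 4-connectivity is equally common) and assumes foreground pixels are 0, as in the binary convention above.

```python
from collections import deque


def connected_components(binary, foreground=0):
    """Return one list of (row, col) pixels per 8-connected foreground region."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for i in range(h):
        for j in range(w):
            if binary[i][j] != foreground or seen[i][j]:
                continue
            queue, pixels = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                                and binary[ny][nx] == foreground):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (x_min, y_min, x_max, y_max) of a component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(xs), min(ys), max(xs), max(ys)


page = [[1, 0, 1, 1, 0],
        [1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 1, 1]]
components = connected_components(page)
print([bounding_box(c) for c in components])
```

A single missing pixel can split one component in two, and a single extra pixel can merge two characters, which is exactly the broken and touching character problem mentioned earlier.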
From pixels … to text
An overview of OCR mechanisms
From pixels … to text
An overview of OCR mechanisms
Next step : Layout analysis
• Connected Components → Words → Lines → Paragraphs → Page
• The results have to be saved in an XML format (ALTO, …)
• Choosing how to organize the XML tree (physical / logical) is not so easy…
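One step of this bottom-up chain can be sketched as follows: bounding boxes of components (or words) are merged into text lines when their vertical extents overlap. The overlap rule and the sample boxes are assumptions for illustration; real systems add distance thresholds and handle skewed or multi-column layouts.

```python
def group_into_lines(boxes):
    """Group boxes (x0, y0, x1, y1) whose vertical extents overlap into lines."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):
        for line in lines:
            if box[1] <= line["y1"] and box[3] >= line["y0"]:
                line["boxes"].append(box)
                line["y0"] = min(line["y0"], box[1])
                line["y1"] = max(line["y1"], box[3])
                break
        else:
            # No existing line overlaps this box vertically: start a new line.
            lines.append({"y0": box[1], "y1": box[3], "boxes": [box]})
    # Within each line, order the boxes from left to right.
    return [sorted(line["boxes"]) for line in lines]


boxes = [(30, 2, 40, 12), (0, 0, 10, 10), (15, 1, 25, 11),
         (0, 20, 10, 30), (12, 22, 20, 31)]
lines = group_into_lines(boxes)
print(lines)
```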
Next step : Layout analysis
Two kinds of structures have been identified by researchers in DIA:
• The logical structure → the generic one, corresponding to a priori knowledge about the content of the document
• The physical structure → the analysed instance, corresponding to the EoC extracted from the image, each one associated with descriptive features (size, position, number of sub-patterns, …)
• Layout analysis tries to recognize these 2 structures (EoC identification)
From pixels … to text
An overview of OCR mechanisms
[Figure: physical structure ↔ logical structure (sub-section, paragraph, sentence)]
From pixels … to text
An overview of OCR mechanisms
Next step : Layout analysis
• The analysis / identification of the EoC is usually carried out by a rule-based system, whose rules are either defined through a (static) grammar or defined interactively by the users
European project Meta-e (http://meta-e.aib.uni-linz.ac.at/)
First commercial system for automatic layout analysis
Dutch books of the 18th century
BVH project - Paradiit (https://sites.google.com/site/paradiitproject/)
AGORA: a user-driven system for content extraction in historical printed books
Next step : Pattern / EoC recognition (toward Machine Learning)
How can computers recognize objects?
• We need a large set of (labelled) examples similar to the patterns to be recognized → a training set
• We need a list of stable and discriminative features (shape, color, size, …) used to describe the patterns (the labelled ones and the unknown ones)
From pixels … to text
An overview of OCR mechanisms
[Figure: a training set of labelled character images (A, E, …) and its 2D representation, with the stable and discriminative features x1 and x2 as axes]
Next step : Pattern / EoC recognition (toward Machine Learning)
How can computers recognize objects?
• When an unknown EoC arrives, we compute its features and compare them with the content of the training set (or with the models built from it)
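The comparison with the training set can be illustrated with a k-nearest-neighbour classifier, one of the simplest ways to do it (the slides do not commit to a particular classifier). The two features x1 and x2 and all the numerical values below are hypothetical.

```python
import math
from collections import Counter


def classify(features, training_set, k=3):
    """Return the majority label among the k training examples closest to `features`."""
    nearest = sorted(training_set, key=lambda example: math.dist(features, example[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training set: (x1, x2) feature vectors with their labels.
training_set = [
    ((0.90, 0.20), "A"), ((0.80, 0.30), "A"), ((0.85, 0.25), "A"),
    ((0.20, 0.80), "E"), ((0.30, 0.90), "E"), ((0.25, 0.70), "E"),
]

print(classify((0.80, 0.20), training_set))
print(classify((0.30, 0.80), training_set))
```

The quality of the result depends entirely on the two ingredients listed on the slides: examples that resemble the patterns to be recognized, and features that separate the classes.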
From pixels … to text
An overview of OCR mechanisms
[Figure: 2D representation of the training set (axes x1 and x2), with an unknown pattern "?" falling close to the examples labelled A]
Deep Learning (Convolutional Neural Networks)
From pixels … to text
An overview of OCR mechanisms
From pixels … to text
An overview of OCR mechanisms
• Why do commercial OCR systems not work well on historical documents?
  o Noise and degradations
  o Unusual layouts
  o Unsuited training sets
Fine Reader
From pixels … to text
An overview of OCR mechanisms
• Why do commercial OCR systems not work well on historical documents?
  o Lack of data, knowledge and experience
  o Unusual fonts and characters → training data needs to be created
  o Unusual languages → lexicons, dictionaries and language models need to be created
• Context often changes our understanding of what our senses perceive
  o Until now, we have tried to recognize EoC without using their context
  o The same EoC can be interpreted differently depending on its surrounding context
  o OCR results depend strongly on how well the word dictionary fits the document
• Are there methods that need less a priori knowledge?
• Could the processing of non-textual parts be a good source of inspiration?
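The dependence of OCR results on the word dictionary can be illustrated with a small post-correction sketch: each recognized word is replaced by its closest lexicon entry, if one is similar enough. The lexicon reuses a few words from the transcription shown earlier; the OCR errors and the similarity cutoff are hypothetical, and `difflib` stands in for a real language model.

```python
import difflib


def correct(word, lexicon, cutoff=0.7):
    """Return the closest lexicon entry, or the word unchanged if none is close enough."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Small lexicon built from the period spelling of the transcription.
lexicon = ["ville", "somme", "quantite", "ferme", "consentement"]

# Hypothetical OCR output with typical confusions (m read as rn, e read as c).
ocr_words = ["sornme", "quantitc", "ville", "xyz"]
print([correct(w, lexicon) for w in ocr_words])
```

With a modern French dictionary instead of this period lexicon, correct readings such as "quantite" would be flagged or "corrected" wrongly, which is the adequacy problem raised above.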
… to non-text
CONTENTS
IMAGES
• Ornamental letters (more than 20,000)
• Figures (more than 1,500)
BATYR: http://www.bvh.univ-tours.fr
From pixels … to non-textual contents
Pictorial Content is also of high interest
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
• Computation time is not crucial
• Signatures → Visual features → Perceptual meta-data
[Figure: index database and statistics]
Perceptual meta-data instead of classical meta-data
• Computation of signatures for all the images, or even for sub-parts of the images (EoC)
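A signature is simply a numerical vector computed from the pixels. As a minimal sketch, the function below uses a normalised gray-level histogram; this particular feature and the number of bins are assumptions, and real CBIR systems combine shape, texture and color descriptors.

```python
def histogram_signature(gray, bins=4):
    """Normalised gray-level histogram: a vector of `bins` values that sums to 1."""
    counts = [0] * bins
    total = 0
    for row in gray:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [count / total for count in counts]


# Hypothetical 2 x 4 gray-level image (or sub-part of an image, i.e. an EoC).
gray = [[0, 10, 200, 255],
        [64, 70, 130, 250]]
signature = histogram_signature(gray)
print(signature)
```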
Set of images to index
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
Using images as queries instead of words
[Figure: a search engine with a back-office and a front-office]
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
It is again a question of features…
 We speak about signatures
[Figure: off-line indexing computes a signature for each image of the dataset and stores it in an index database of perceptual signatures; on-line retrieval computes the signature of the query image and returns the similar images found in the representation space]
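The two phases can be sketched with a tiny in-memory index: signatures are stored off-line, and an on-line query returns the images whose signatures are closest to that of the query. The class name, the image names, the signature values and the Euclidean distance are all illustrative assumptions.

```python
import math


class SignatureIndex:
    """Minimal in-memory index mapping image names to signature vectors."""

    def __init__(self):
        self.entries = {}

    def add(self, name, signature):
        """Off-line indexing: store the signature computed for one image."""
        self.entries[name] = signature

    def query(self, signature, top=3):
        """On-line retrieval: names of the `top` images closest to the query signature."""
        ranked = sorted(self.entries.items(),
                        key=lambda item: math.dist(item[1], signature))
        return [name for name, _ in ranked[:top]]


index = SignatureIndex()
index.add("initial_A", [0.7, 0.2, 0.1])
index.add("initial_B", [0.1, 0.2, 0.7])
index.add("figure_1", [0.3, 0.4, 0.3])

results = index.query([0.6, 0.3, 0.1], top=2)
print(results)
```

Because indexing happens off-line, its computation time is not crucial; only the query has to be fast.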
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
 From CBIR to Word spotting
[Figure: off-line indexing of the EoC (signature computation, clustering)]
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
 From CBIR to Word spotting
On-line retrieval
Query images
Results: retrieved images
From pixels … to non-textual contents
An overview of Content Based Image Retrieval
 From CBIR to Word spotting
Is it just a question of meta-data?
CONTENTS
IMAGES
The standard model
Descriptive meta-data + transcription
We usually have
• Descriptive meta-data in standard formats (MARC, EAD, Dublin Core, MODS, …)
  o Edited manually
  o “Semantical” information
• Text transcription associated with additional meta-data (TEI)
  o Semi-automatic transcription or manually edited
  o “Semantical” information
[Figure: the standard model, linking digitization, images, keyboarding, descriptive meta-data (MARC), image analysis, text regions (XML) and text (TEI) as “semantical” meta-data]
The standard model
Perceptual meta-data
• CBIR can apparently help to extract and store supplementary information about image content (EoC) without going as far as the semantic level (recognition)
  o Regions of interest
  o Visual features → Perceptual signatures and index
  o Shapes, positions, colors, textures, … → Numerical values (vectors)
[Figure: the standard model extended with CBIR: non-text regions add “perceptual” meta-data to the descriptive meta-data (MARC), the text regions (XML) and the text (TEI)]
Is it just a problem of Meta-data?
What about the encoding formats?
Structuring data and choosing file formats is a difficult problem
• Data architects are needed
• Some interesting formats linked with previous discussions
o METS – Metadata Encoding and Transmission Standard
o ALTO – Analyzed Layout and Text Object
o TEI – Text Encoding Initiative
[Figure: METS links the images, the ALTO file of each image, TEI, … into enriched documents]
Is it just a problem of Meta-data?
ALTO for OCR
ALTO = Analyzed Layout and Text Object
• XML standard
• Created in 2003 during the METAe project
• Developed by the universities of Graz, Linz and Innsbruck
• Describes the content and the physical layout of one page
• Used by several OCR software packages
• Adapted and used by the BnF and other libraries
• Drawbacks: huge / static
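To give an idea of what such a page description looks like, the sketch below builds a simplified ALTO-like tree with the standard library. The element and attribute names (Layout, Page, PrintSpace, TextBlock, TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow ALTO, but the namespace, the schema version and the other mandatory parts are deliberately omitted; the words and coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def alto_page(lines):
    """Build a simplified ALTO-like tree from lines of (content, (x, y, w, h)) words."""
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="P1")
    print_space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(print_space, "TextBlock", ID="TB1")
    for n, words in enumerate(lines, start=1):
        line = ET.SubElement(block, "TextLine", ID=f"TL{n}")
        for content, (x, y, w, h) in words:
            ET.SubElement(line, "String", CONTENT=content,
                          HPOS=str(x), VPOS=str(y), WIDTH=str(w), HEIGHT=str(h))
    return alto


root = alto_page([[("receu", (10, 20, 50, 12)), ("dudit", (65, 20, 45, 12))]])
print(ET.tostring(root, encoding="unicode"))
```

Every word carries its own coordinates, which preserves the link between text and image but also explains why ALTO files become huge.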
Is it just a problem of Meta-data?
TEI for transcription and enriched contents
• TEI = Text Encoding Initiative - XML standard
• Content tagging and logical structure encoding (full document)
• Widely used by libraries
• Too « open »? → Quite « complex »
METS – Metadata Encoding and Transmission Standard
• Open XML standard created in 2001 by the Digital Library Federation, maintained by the METS Editorial Board
• XSD schema
• Linking between multimedia objects
• Complete description of digitized content (images, texts, audio, sculptures, …)
• Physical / logical structures
• Descriptive meta-data (DC, MODS, MARC, …)
• …
METS – First-level elements
• metsHdr: METS document header (author info, software, …)
• dmdSec: descriptive metadata section (bibliographic notice)
• amdSec: administrative metadata section (copyright, …)
• fileSec: file inventory section (file locations)
• structMap: structural maps (physical and logical)
• structLink: structural map linking (links between structures)
Is it just a problem of Meta-data?
Link between meta data?
METS – Physical Structure
METS
  metsHdr
  dmdSec
  amdSec
  fileSec
    fileGrp ID=Images
      file ID=IMG00001
        FLocat xlink:href="file://./master/DDD_010120914_001.jp2" (Image)
    fileGrp ID=ALTOS
      file ID=ALTO00001
        FLocat xlink:href="file://./alto/DDD_010120914_001.xml" (ALTO)
  structMap ID=Physical
    div ID=Book
      div ID=Page1
        fptr
          par
            area FILEID=IMG00001
            area FILEID=ALTO00001
      div ID=Page2
METS makes it possible to specify the locations of resource files → METS file pointers (fptr)
Possible link to the descriptive meta-data (dmdSec)
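The physical structure described above can be built and queried with the standard library. This is a skeleton under stated assumptions: it reuses the IDs and file locations shown on the slide, leaves metsHdr, dmdSec and amdSec empty, and is therefore not a schema-valid METS document. The helper names `q`, `build_mets` and `files_for_page` are hypothetical.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_mets():
    """Build a skeleton with a file section and a physical structural map."""
    mets = ET.Element(q("mets"))
    for name in ("metsHdr", "dmdSec", "amdSec"):
        ET.SubElement(mets, q(name))
    file_sec = ET.SubElement(mets, q("fileSec"))
    for group_id, file_id, href in (
        ("Images", "IMG00001", "file://./master/DDD_010120914_001.jp2"),
        ("ALTOS", "ALTO00001", "file://./alto/DDD_010120914_001.xml"),
    ):
        group = ET.SubElement(file_sec, q("fileGrp"), ID=group_id)
        file_el = ET.SubElement(group, q("file"), ID=file_id)
        ET.SubElement(file_el, q("FLocat"), {f"{{{XLINK}}}href": href})
    struct_map = ET.SubElement(mets, q("structMap"), ID="Physical")
    book = ET.SubElement(struct_map, q("div"), ID="Book")
    page1 = ET.SubElement(book, q("div"), ID="Page1")
    par = ET.SubElement(ET.SubElement(page1, q("fptr")), q("par"))
    for file_id in ("IMG00001", "ALTO00001"):
        ET.SubElement(par, q("area"), FILEID=file_id)
    ET.SubElement(book, q("div"), ID="Page2")
    return mets


def files_for_page(mets, page_id):
    """Follow the file pointers of one page div to the file locations."""
    hrefs = {f.get("ID"): f.find(q("FLocat")).get(f"{{{XLINK}}}href")
             for f in mets.iter(q("file"))}
    for div in mets.iter(q("div")):
        if div.get("ID") == page_id:
            return [hrefs[area.get("FILEID")] for area in div.iter(q("area"))]
    return []


mets = build_mets()
print(files_for_page(mets, "Page1"))
```

The par element groups files that describe the same page in parallel, here the image and its ALTO description.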
METS – Logical Structure
METS
  metsHdr
  dmdSec
  amdSec
  fileSec
    fileGrp ID=Images
      file ID=ALTO00002
        FLocat xlink:href="file://./master/DDD_010120914_001.jp2"
    fileGrp ID=ALTOS
      file ID=ALTO00001
        FLocat xlink:href="file://./alto/DDD_010120914_001.xml"
  structMap ID=Logical
    div Type=Volume
      div Type=Issue
        div Type=Chapter
        div Type=Chapter
          div Type=Paragraph
            fptr
              seq
                area FILEID=ALTO00001, BEGIN=TXT203 (ALTO text block ID=TXT203 and its image)
                area FILEID=ALTO00002, BEGIN=TXT004 (ALTO text block ID=TXT004 and its image)
  structMap ID=Physical
The logical structMap describes logical blocks that may be spread over several pages and split by other logical blocks (such as footnotes, …).
TEI
Conclusions
CONTENTS
IMAGES
From Pixels to Contents
Conclusions
• Building tools for the valorisation of digitized historical content is a multidisciplinary task
  o Meta-data production → Domain experts
  o Selection and verification of the data → Experts + Data accuratist
  o Structuring of the data and the system → Data / system architect
  o Computer vision, machine learning → Data scientist
• Manual indexing is needed
  o Descriptive meta-data → Semantic meta-data
  o Standard formats for data encoding
  o Could annotations be seen as supplementary meta-data?
• Operational methods and tools are available
  o Acquisition devices
  o Automatic tools: low-level image processing, OCR
  o Perceptual meta-data should be added: CBIR
From Pixels to contents
Conclusions - Perspectives
• The current context: big data and heterogeneous collections
  o Connection between data, mutual enrichment, interoperability
  o Introduction and management of additional knowledge
  o Facing the diversity of content types and usages
  o Quality of the interaction, not only quantity
  o Semantic Web: query reformulation, smart crawlers, automatic categorisation
https://sites.google.com/site/paradiitproject/
Thanks…

Contenu connexe

Similaire à Bibliotheca Digitalis Summer school: From pixels to content - Jean-Yves Ramel

Opticalcharacter recognition
Opticalcharacter recognition Opticalcharacter recognition
Opticalcharacter recognition Shobhit Saxena
 
Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)Farzaneh Rezaei
 
Overview of graphics systems.ppt
Overview of graphics systems.pptOverview of graphics systems.ppt
Overview of graphics systems.pptMalleshBettadapura1
 
Mirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image ProcessingMirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image ProcessingMeetupDataScienceRoma
 
2014 01-ticosa
2014 01-ticosa2014 01-ticosa
2014 01-ticosaPharo
 
TechnicalBackgroundOverview
TechnicalBackgroundOverviewTechnicalBackgroundOverview
TechnicalBackgroundOverviewMotaz El-Saban
 
YCIS_Forensic PArt 1 Digital Image Processing.pptx
YCIS_Forensic PArt 1 Digital Image Processing.pptxYCIS_Forensic PArt 1 Digital Image Processing.pptx
YCIS_Forensic PArt 1 Digital Image Processing.pptxSharmilaMore5
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning AnalyticsXavier Ochoa
 
CG_Unit1_SShah.pptx
CG_Unit1_SShah.pptxCG_Unit1_SShah.pptx
CG_Unit1_SShah.pptxShweta Shah
 
Overview of graphics systems
Overview of  graphics systemsOverview of  graphics systems
Overview of graphics systemsJay Nagar
 
Implementation of Computer Vision Applications using OpenCV in C++
Implementation of Computer Vision Applications using OpenCV in C++Implementation of Computer Vision Applications using OpenCV in C++
Implementation of Computer Vision Applications using OpenCV in C++IRJET Journal
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine VisionNasir Jumani
 
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfHandwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfSachin414679
 
Adaptive membership functions for hand written character recognition by voron...
Adaptive membership functions for hand written character recognition by voron...Adaptive membership functions for hand written character recognition by voron...
Adaptive membership functions for hand written character recognition by voron...JPINFOTECH JAYAPRAKASH
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABIntroduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABRay Phan
 
AN INTEGRATED APPROACH TO CONTENT BASED IMAGE RETRIEVAL by Madhu
AN INTEGRATED APPROACH TO CONTENT BASED IMAGERETRIEVAL by MadhuAN INTEGRATED APPROACH TO CONTENT BASED IMAGERETRIEVAL by Madhu
AN INTEGRATED APPROACH TO CONTENT BASED IMAGE RETRIEVAL by MadhuMadhu Rock
 
Mind the Gap: Another look at the problem of the semantic gap in image retrieval
Mind the Gap: Another look at the problem of the semantic gap in image retrievalMind the Gap: Another look at the problem of the semantic gap in image retrieval
Mind the Gap: Another look at the problem of the semantic gap in image retrievalJonathon Hare
 
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...Europeana
 
Hybrid Image Retrieval in Digital libraries
Hybrid Image Retrieval in Digital librariesHybrid Image Retrieval in Digital libraries
Hybrid Image Retrieval in Digital librariesJean-Philippe Moreux
 

Similaire à Bibliotheca Digitalis Summer school: From pixels to content - Jean-Yves Ramel (20)

Opticalcharacter recognition
Opticalcharacter recognition Opticalcharacter recognition
Opticalcharacter recognition
 
Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)
 
Overview of graphics systems.ppt
Overview of graphics systems.pptOverview of graphics systems.ppt
Overview of graphics systems.ppt
 
Mirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image ProcessingMirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image Processing
 
2014 01-ticosa
2014 01-ticosa2014 01-ticosa
2014 01-ticosa
 
TechnicalBackgroundOverview
TechnicalBackgroundOverviewTechnicalBackgroundOverview
TechnicalBackgroundOverview
 
YCIS_Forensic PArt 1 Digital Image Processing.pptx
YCIS_Forensic PArt 1 Digital Image Processing.pptxYCIS_Forensic PArt 1 Digital Image Processing.pptx
YCIS_Forensic PArt 1 Digital Image Processing.pptx
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning Analytics
 
CG_Unit1_SShah.pptx
CG_Unit1_SShah.pptxCG_Unit1_SShah.pptx
CG_Unit1_SShah.pptx
 
Overview of graphics systems
Overview of  graphics systemsOverview of  graphics systems
Overview of graphics systems
 
Overblik over kunstig intelligens og digital billedanalyse
Overblik over kunstig intelligens og digital billedanalyseOverblik over kunstig intelligens og digital billedanalyse
Overblik over kunstig intelligens og digital billedanalyse
 
Implementation of Computer Vision Applications using OpenCV in C++
Implementation of Computer Vision Applications using OpenCV in C++Implementation of Computer Vision Applications using OpenCV in C++
Implementation of Computer Vision Applications using OpenCV in C++
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
 
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfHandwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
 
Adaptive membership functions for hand written character recognition by voron...
Adaptive membership functions for hand written character recognition by voron...Adaptive membership functions for hand written character recognition by voron...
Adaptive membership functions for hand written character recognition by voron...
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABIntroduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLAB
 
AN INTEGRATED APPROACH TO CONTENT BASED IMAGE RETRIEVAL by Madhu
AN INTEGRATED APPROACH TO CONTENT BASED IMAGERETRIEVAL by MadhuAN INTEGRATED APPROACH TO CONTENT BASED IMAGERETRIEVAL by Madhu
AN INTEGRATED APPROACH TO CONTENT BASED IMAGE RETRIEVAL by Madhu
 
Mind the Gap: Another look at the problem of the semantic gap in image retrieval
Mind the Gap: Another look at the problem of the semantic gap in image retrievalMind the Gap: Another look at the problem of the semantic gap in image retrieval
Mind the Gap: Another look at the problem of the semantic gap in image retrieval
 
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillau...
 
Hybrid Image Retrieval in Digital libraries
Hybrid Image Retrieval in Digital librariesHybrid Image Retrieval in Digital libraries
Hybrid Image Retrieval in Digital libraries
 

Plus de Bibliothèques Virtuelles Humanistes - CESR, Université de Tours, UMR 7323

Plus de Bibliothèques Virtuelles Humanistes - CESR, Université de Tours, UMR 7323 (20)

Montaigne : derniers développements sur les travaux éditoriaux
Montaigne : derniers développements sur les travaux éditoriauxMontaigne : derniers développements sur les travaux éditoriaux
Montaigne : derniers développements sur les travaux éditoriaux
 
Les BVH & l’étude des matériels d’imprimerie anciens
 Les BVH & l’étude des matériels d’imprimerie anciens Les BVH & l’étude des matériels d’imprimerie anciens
Les BVH & l’étude des matériels d’imprimerie anciens
 
Évolutions de l’infrastructure & de la bibliothèque numérique
Évolutions de l’infrastructure & de la bibliothèque numériqueÉvolutions de l’infrastructure & de la bibliothèque numérique
Évolutions de l’infrastructure & de la bibliothèque numérique
 
Les « Bibliotheques françoises » (BibFr) – Avancée de l’indexation de La Croi...
Les « Bibliotheques françoises » (BibFr) – Avancée de l’indexation de La Croi...Les « Bibliotheques françoises » (BibFr) – Avancée de l’indexation de La Croi...
Les « Bibliotheques françoises » (BibFr) – Avancée de l’indexation de La Croi...
 
Édition numérique et valorisation du livre de compte de la reine Marguerite d...
Édition numérique et valorisation du livre de compte de la reine Marguerite d...Édition numérique et valorisation du livre de compte de la reine Marguerite d...
Édition numérique et valorisation du livre de compte de la reine Marguerite d...
 
Catalogues régionaux des Incunables des bibliothèques publiques de France
Catalogues régionaux des Incunables des bibliothèques publiques de FranceCatalogues régionaux des Incunables des bibliothèques publiques de France
Catalogues régionaux des Incunables des bibliothèques publiques de France
 
Une nouvelle base de données, Scripta Manent : le “Facebook” des années 1530-...
Une nouvelle base de données, Scripta Manent : le “Facebook” des années 1530-...Une nouvelle base de données, Scripta Manent : le “Facebook” des années 1530-...
Une nouvelle base de données, Scripta Manent : le “Facebook” des années 1530-...
 
Bilan 2022 & perspectives du programme de recherche BVH
Bilan 2022 & perspectives du programme de recherche BVHBilan 2022 & perspectives du programme de recherche BVH
Bilan 2022 & perspectives du programme de recherche BVH
 
Catalogues régionaux des Incunables des bibliothèques publiques de France : S...
Catalogues régionaux des Incunables des bibliothèques publiques de France : S...Catalogues régionaux des Incunables des bibliothèques publiques de France : S...
Catalogues régionaux des Incunables des bibliothèques publiques de France : S...
 
Architecture de la bibliothèque numérique : Déploiement du protocole IIIF - A...
Architecture de la bibliothèque numérique : Déploiement du protocole IIIF - A...Architecture de la bibliothèque numérique : Déploiement du protocole IIIF - A...
Architecture de la bibliothèque numérique : Déploiement du protocole IIIF - A...
 
Autour du projet BiRayMa : "Bibliothèque de Raymond Marcel" (CollEx-Persée) -...
Autour du projet BiRayMa : "Bibliothèque de Raymond Marcel" (CollEx-Persée) -...Autour du projet BiRayMa : "Bibliothèque de Raymond Marcel" (CollEx-Persée) -...
Autour du projet BiRayMa : "Bibliothèque de Raymond Marcel" (CollEx-Persée) -...
 
Rabelais : Les documents de Berne et l'Almanach d'Alessandria - Assemblée gén...
Rabelais : Les documents de Berne et l'Almanach d'Alessandria - Assemblée gén...Rabelais : Les documents de Berne et l'Almanach d'Alessandria - Assemblée gén...
Rabelais : Les documents de Berne et l'Almanach d'Alessandria - Assemblée gén...
 
Projet Scripta Manent : Une nouvelle base de données : les relations sociales...
Projet Scripta Manent : Une nouvelle base de données : les relations sociales...Projet Scripta Manent : Une nouvelle base de données : les relations sociales...
Projet Scripta Manent : Une nouvelle base de données : les relations sociales...
 
Projet Les Bibliotheques françoises de La Croix du Maine et de Du Verdier - A...
Projet Les Bibliotheques françoises de La Croix du Maine et de Du Verdier - A...Projet Les Bibliotheques françoises de La Croix du Maine et de Du Verdier - A...
Projet Les Bibliotheques françoises de La Croix du Maine et de Du Verdier - A...
 
Architecture de la bibliothèque numérique : Modélisation en XML-TEI - Assembl...
Architecture de la bibliothèque numérique : Modélisation en XML-TEI - Assembl...Architecture de la bibliothèque numérique : Modélisation en XML-TEI - Assembl...
Architecture de la bibliothèque numérique : Modélisation en XML-TEI - Assembl...
 
Architecture de la bibliothèque numérique : Veille fonctionnelle et technique...
Architecture de la bibliothèque numérique : Veille fonctionnelle et technique...Architecture de la bibliothèque numérique : Veille fonctionnelle et technique...
Architecture de la bibliothèque numérique : Veille fonctionnelle et technique...
 
Architecture de la bibliothèque numérique : Modélisation et migrations de don...
Architecture de la bibliothèque numérique : Modélisation et migrations de don...Architecture de la bibliothèque numérique : Modélisation et migrations de don...
Architecture de la bibliothèque numérique : Modélisation et migrations de don...
 
Production BVH : Epistemon (éditions numériques TEI-Renaissance) - Assemblée ...
Production BVH : Epistemon (éditions numériques TEI-Renaissance) - Assemblée ...Production BVH : Epistemon (éditions numériques TEI-Renaissance) - Assemblée ...
Production BVH : Epistemon (éditions numériques TEI-Renaissance) - Assemblée ...
 
Production BVH : Fac-similés (Numérisations) - Assemblée générale 2021, Progr...
Production BVH : Fac-similés (Numérisations) - Assemblée générale 2021, Progr...Production BVH : Fac-similés (Numérisations) - Assemblée générale 2021, Progr...
Production BVH : Fac-similés (Numérisations) - Assemblée générale 2021, Progr...
 
Bilan 2020-2021 & perspectives 2022+ Assemblée générale 2021, Programme de re...
Bilan 2020-2021 & perspectives 2022+ Assemblée générale 2021, Programme de re...Bilan 2020-2021 & perspectives 2022+ Assemblée générale 2021, Programme de re...
Bilan 2020-2021 & perspectives 2022+ Assemblée générale 2021, Programme de re...
 

Bibliotheca Digitalis Summer school: From pixels to content - Jean-Yves Ramel

  • 1. Bibliotheca Digitalis: Reconstitution of Early Modern Cultural Networks. From Primary Source to Data. DARIAH / Biblissima Summer School, Le Mans, 4-8 July 2017. From Pixels to Contents. Jean-Yves Ramel, Professor of Computer Science, Computer Laboratory, University of Tours. 1st day, July 4th – Digital sources: theoretical fundamentals
  • 2. From pixels to content: an overview of the main techniques used in DIA (Document Image Analysis). Jean-Yves RAMEL
  • 3. From Pixels to Contents: Introduction. IMAGES → CONTENTS: Document Image Analysis, Machine Learning, Content-Based Image Retrieval. Tools to automatically extract information from images / documents. Generation and use of meta-data.
  • 4. From Pixels … to contents: Outline ➢ From pixels… ▪ What is an image? ▪ Image (pre-)processing ➢ … to Text ▪ Transcription and layout analysis ▪ Segmentation and content extraction ▪ An overview of Pattern Recognition ➢ … but also to non-Text ▪ Content characterization and signatures ▪ Content retrieval and spotting ➢ Back to meta-data? ▪ From descriptive to perceptual meta-data ▪ Are there adequate encoding formats? ➢ Conclusions and perspectives
  • 6. From Pixels… Digitization → set of pixels? ➢ Images come from a grid of microscopic photosensitive cells called PIXELS ▪ Sampling ▪ Quantization ➢ Assignment of a numerical value drawn from the received lighting energy / pixel (grid unit) ▪ Continuous value (xi,yi) → Discrete value (xi,yi) → Pixels ▪ The range of colors that each pixel can take. [Diagram labels: acquisition system; source of illumination; initial object (real world); radiant energy; analogue image; space partitioning: sampling; quantization: color information encoded by the sensors; digitized object: pixels]
  • 7. From Pixels… What is an image? Image quantization: binary images: I(i,j) = 0 (black) or I(i,j) = 1 (white). Gray-level images (8 bits/pixel): I(i,j) = 0…255, from the darkest (0) to the lightest (255). Color images (24 bits/pixel): 3 lighting-intensity values (Red, Green, Blue): I1(i,j) = 0…255 – I2(i,j) = 0…255 – I3(i,j) = 0…255. Image representation & processing: image = array(s) of pixels = matrices of values. 1 pixel = a position (i,j) inside the image + 1 color (1 to 3 values). The values I(i,j) associated with each pixel (i,j) represent its brightness.
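As a small illustration of "image = matrices of values", the three encodings can be written as nested Python lists. This is only a sketch: the tiny 2 x 3 images and the helper names `to_gray` and `pixel` are made up for the example, the convention assumed is 0 = black and maximum value = white, and averaging the three channels is just one simple way to obtain a gray level.

```python
# A tiny 2 x 3 image in the three encodings named on the slide.
# Assumed convention: 0 = black, maximum value = white.

binary = [
    [0, 1, 1],
    [1, 0, 1],
]  # 1 bit per pixel: 0 black, 1 white

gray = [
    [0, 128, 255],
    [64, 32, 200],
]  # 8 bits per pixel: one value in 0..255

color = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (255, 255, 255), (128, 128, 128)],
]  # 24 bits per pixel: (R, G, B), each in 0..255


def pixel(image, i, j):
    """Return I(i, j): the value stored at row i, column j."""
    return image[i][j]


def to_gray(rgb_image):
    """Convert RGB to gray by averaging the channels (one simple choice)."""
    return [[sum(px) // 3 for px in row] for row in rgb_image]


print(pixel(gray, 0, 2))   # brightest pixel of the gray image
print(to_gray(color))      # one gray value per RGB triple
```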
  • 8. From Pixels… What is an image? • Sampling → image resolution • Number of pixels per length unit • In dpi (dots per inch; in French ppp, points par pouce) • When the resolution decreases, the precision decreases • VF Image processing • A4 page = 21 x 29.7 cm • 200 dpi: 1650 x 2340 pixels = 3 861 000 pixels • 300 dpi: 2480 x 3500 pixels = 8 680 000 pixels • 16M colors, 1 pixel = 3 bytes → 10 to 25 MB/page! • A trade-off between quality and quantity/time is mandatory • Fidelity of the digital version • Storage size – Transmission / Processing time
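The arithmetic behind these figures can be checked with a few lines. This is a sketch with hypothetical helper names; it uses the exact conversion 1 inch = 2.54 cm, so it gives 1654 x 2339 and 2480 x 3508 pixels where the slide rounds to 1650 x 2340 and 2480 x 3500, and it counts uncompressed storage only.

```python
CM_PER_INCH = 2.54


def page_size_pixels(width_cm, height_cm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    width_px = round(width_cm / CM_PER_INCH * dpi)
    height_px = round(height_cm / CM_PER_INCH * dpi)
    return width_px, height_px


def raw_size_megabytes(width_px, height_px, bytes_per_pixel=3):
    """Uncompressed size in MB (3 bytes per pixel for 24-bit color)."""
    return width_px * height_px * bytes_per_pixel / 1_000_000


for dpi in (200, 300):
    w, h = page_size_pixels(21, 29.7, dpi)  # A4 page
    print(dpi, "dpi:", w, "x", h, "pixels,",
          round(raw_size_megabytes(w, h), 1), "MB uncompressed")
```

Going from 200 to 300 dpi multiplies the pixel count, and therefore the raw size, by 2.25, which is the quality versus storage trade-off mentioned above.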
  • 9. From Pixels… Why are a few pixels so important? ➢ Because, even with a high resolution, some isolated patterns may correspond to only a small number of pixels (from 20 to 50 pixels)! ➢ 20 to 40% of the useful information is situated on the boundaries of the shapes and is modified randomly by the digitization process ➢ Missing a few pixels can change the appearance of characters to the point of confusion ▪ Loss of topological information ▪ Broken characters or touching characters ▪ Human perception is not sensitive to the resolution. [Figure labels: broken shape; touching characters; OCR errors]
  • 10. From Pixels… What is image (pre-)processing? After digitization, the images usually still have many defects ➢ Curvature and skew due to scanning ➢ Noise on boundaries, dots, blur, …
  • 11. From Pixels… Image (pre-)processing: curvature and skew correction is possible on text images
  • 12. From Pixels… Image (pre-)processing: the problem is more complicated in the case of heterogeneous content
  • 13. From Pixels… Image (pre-)processing: the problem is more complicated in the case of heterogeneous content
  • 14. … to text and layout CONTENTS IMAGES
  • 15. From pixels … to text: automatic transcription but also layout analysis. Preservation of the link between the text and the image. Transcription example: 1-Offrant appres plusieurs rues sur ce faites 2-au lieu et temps acostume aly livre et 3- accenste la somme et quantite dessoulz 4-estempte du voloir et consentement de 5-plusieurs des borgeys de la dite ville 6-de chastellion ______________IIIIxXX V flor(ins) 7-receu dudit franceys rotier pour 8-une albaleste dehue a cause de la 9-dite ferme dudit trezein avec la 10-somme et quantite dessus estempte ? 11- ______________II flor(ins) 12-receu de Alexandre ? escoffier al loup ? 13- dudit lieu de chatellion pour la 14-rense et ferme du commun de la dite 15-ville et communaute ? pour ung an commettre ? 16-alafeste nativite fi jehan baptiste lan 17-mil quatre cent et cinquante? A ly par le 18-pris dessouls esempt accense et hure ? 19-comme au plus offrant du voloyr et 20-et consentement de pluseures des bourgeys 21-de la dite ville de chastellion apres 22-pluseures rues ….refaytes es lieus ?
  • 16. From pixels … to text: an overview of OCR mechanisms. First step: image segmentation • Transformation of the image (set of pixels) into higher-level patterns (regions of interest), the Elements of Content (EoC) • These EoC can be very simple (parts of characters) or more sophisticated (paragraphs, illustrations, …) • EoC extraction: background (white) / foreground (black) separation • Color image → grayscale → binarisation
  • 17. From pixels … to text: an overview of OCR mechanisms. Just to illustrate the difficulties… ➢ Most segmentation methods need a binarisation. Threshold selection?
  • 18. From pixels … to text: an overview of OCR mechanisms. Just to illustrate the difficulties… ➢ Most segmentation methods need a binarisation ▪ Global threshold ☹ → local thresholds ☺ ▪ Niblack: S = m + k·s with k = -0.2 | m: mean and s: standard deviation (both computed in a local window around the pixel)
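The difference between a global and a local threshold can be sketched in pure Python. This is an illustrative sketch, not the implementation used in the course: it takes the usual statement of Niblack's rule, T = m + k·s with m and s the mean and standard deviation of a small window around each pixel, and the function names, the window `radius`, the global threshold of 128 and the output convention (1 = ink, 0 = background) are all assumptions made for the example.

```python
from statistics import mean, pstdev


def binarize_global(gray, threshold=128):
    """One threshold for the whole page: 1 = ink (dark), 0 = background."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def binarize_niblack(gray, radius=1, k=-0.2):
    """Local threshold T = m + k * s computed in a window around each pixel."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Gray values in the (2*radius+1)^2 window, clipped at the borders.
            window = [gray[y][x]
                      for y in range(max(0, i - radius), min(h, i + radius + 1))
                      for x in range(max(0, j - radius), min(w, j + radius + 1))]
            t = mean(window) + k * pstdev(window)
            out[i][j] = 1 if gray[i][j] < t else 0
    return out


page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]  # one dark pixel on a light background
print(binarize_global(page))
print(binarize_niblack(page))
```

On real pages the window is much larger than 3 x 3, and the local rule is mainly useful when lighting or paper color varies across the page, where no single global threshold separates ink from background.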
  • 19. From pixels … to text: an overview of OCR mechanisms. First step: image segmentation / connected components ➢ Then, we can try to group black pixels together to localize and recognize higher-level Elements of Content (EoC)
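Grouping neighbouring ink pixels is usually done by connected-component labelling. The sketch below is one simple way to do it (breadth-first flood fill); the function name, the choice of 8-connectivity, the encoding 1 = ink and the bounding-box output are assumptions made for the example.

```python
from collections import deque


def connected_components(binary):
    """Group 8-connected ink pixels (value 1).

    Returns one bounding box (top, left, bottom, right) per component,
    in the order the components are met when scanning row by row.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if binary[i][j] != 1 or seen[i][j]:
                continue
            # Flood fill from this unvisited ink pixel.
            queue = deque([(i, j)])
            seen[i][j] = True
            top, left, bottom, right = i, j, i, j
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
print(connected_components(page))
```

Each bounding box is a candidate low-level EoC (a character or part of a character) that later steps can merge into words and lines.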
  • 20. From pixels … to text: an overview of OCR mechanisms. Next step: layout analysis • Connected components → words → lines → paragraphs → page • The results have to be saved in an XML format (ALTO, …) • Choosing how to organize the XML tree (physical / logical) is not so easy…
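One bottom-up step of this chain, from component bounding boxes to text lines, can be sketched as follows. The rule used here (two boxes belong to the same line when their vertical extents overlap) is a deliberately naive assumption for illustration; real layout analysers use more robust criteria, and the function name is hypothetical.

```python
def group_into_lines(boxes):
    """Group (top, left, bottom, right) boxes whose vertical extents overlap.

    Returns a list of lines, top to bottom; boxes inside a line are sorted
    left to right (reading order for left-to-right scripts).
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[0]):
        top, left, bottom, right = box
        for line in lines:
            if top <= line["bottom"] and bottom >= line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], top)
                line["bottom"] = max(line["bottom"], bottom)
                break
        else:
            # No existing line overlaps vertically: start a new one.
            lines.append({"top": top, "bottom": bottom, "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]


boxes = [(0, 10, 8, 14), (1, 0, 9, 4), (20, 0, 28, 5), (21, 8, 27, 12)]
for line in group_into_lines(boxes):
    print(line)
```

The same idea, applied with horizontal gaps instead of vertical overlap, groups characters into words, and applied to lines it yields paragraphs.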
  • 21. From pixels … to text: an overview of OCR mechanisms. Next step: layout analysis. Two kinds of structures have been identified by researchers in DIA: • The logical structure → the generic one, corresponding to a priori knowledge about the content of the document • The physical structure → the analysed instance, corresponding to the EoC extracted from the image, each one associated with descriptive features (size, position, number of sub-patterns, …) • Layout analysis tries to recognize these 2 structures (EoC identification). [Figure: physical structure ↔ logical structure (sub-section, paragraph, sentence)]
  • 22. From pixels … to text: an overview of OCR mechanisms. Next step: layout analysis • The analysis / identification of the EoC is usually achieved with a rule-based system, defined through a (static) grammar or defined interactively by the users. European project Meta-e (http://meta-e.aib.uni-linz.ac.at/ ): first commercial system for automatic layout analysis; Dutch books of the 18th century. BVH project - Paradiit (https://sites.google.com/site/paradiitproject/): AGORA, a user-driven system for content extraction in historical printed books
  • 23. From pixels … to text: an overview of OCR mechanisms. Next step: pattern / EoC recognition (toward Machine Learning). How can computers recognize objects? • We need a large set of (labelled) examples similar to the patterns to be recognized → a training set • We need a list of stable and discriminative features (shape, color, size, …) used to describe the patterns (the labelled ones and the unknown one). [Figure: training set of labelled examples (A, E, …) and its 2D representation in the feature space (x1, x2); stable and discriminative features]
  • 24. From pixels … to text: an overview of OCR mechanisms. Next step: pattern / EoC recognition (toward Machine Learning). How can computers recognize objects? • When an unknown EoC arrives, we compute its features and compare them with the content of the training set (or with the models built from it). [Figure: 2D representation of the training set (axes x1, x2) with an unknown pattern “?” among the A examples]
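Comparing an unknown pattern with a labelled training set can be sketched with a k-nearest-neighbour vote. This is only one possible classifier, chosen here because it matches the 2D picture of the slide; the two features (x1, x2), their values and the labels are invented for the example.

```python
from collections import Counter
from math import dist


def classify(features, training_set, k=3):
    """Label an unknown pattern by majority vote among its k nearest examples.

    training_set is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training_set, key=lambda ex: dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two features per pattern, two classes.
training_set = [
    ((0.90, 0.20), "A"), ((1.00, 0.30), "A"), ((0.80, 0.25), "A"),
    ((0.20, 0.90), "E"), ((0.30, 1.00), "E"), ((0.25, 0.80), "E"),
]

print(classify((0.85, 0.30), training_set))  # close to the A examples
print(classify((0.30, 0.85), training_set))  # close to the E examples
```

The quality of the result depends entirely on the two ingredients named above: a training set that resembles the patterns to recognize, and features that keep the classes apart.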
  • 25. From pixels … to text: an overview of OCR mechanisms. Deep Learning (Convolutional Neural Networks)
  • 26. From pixels … to text: an overview of OCR mechanisms ➢ Why do commercial OCR systems not work well on historical documents? ▪ Noise and degradations ▪ Unusual layout ▪ Unsuited training set. [Figure label: Fine Reader]
  • 27. From pixels … to text: an overview of OCR mechanisms ➢ Why do commercial OCR systems not work well on historical documents? ▪ Lack of data, knowledge and experience ▪ Unusual fonts and characters → training data needs to be created ▪ Unusual languages → lexicons, dictionaries and language models need to be created ➢ Context often modifies our understanding of what is perceived by our senses ▪ Until now, we have tried to recognize EoC without using their context ▪ The same EoC can be interpreted differently according to its surrounding context ▪ OCR results are highly correlated with the adequacy of the word dictionary used ➢ Are there methods that need less a priori knowledge? ➢ Could processing non-textual parts be a good source of inspiration?
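The dependence of OCR results on the word dictionary can be illustrated with a toy post-correction step. This is a sketch of the general idea, not of how any particular OCR engine works: it uses `difflib.get_close_matches` from the Python standard library, and the function name, the similarity cutoff and the two small lexicons are assumptions made for the example (the period words are taken from the transcription shown earlier).

```python
from difflib import get_close_matches


def correct_word(ocr_word, lexicon, cutoff=0.8):
    """Replace an OCR output word by its closest lexicon entry, if any.

    The word is returned unchanged when no entry is similar enough.
    """
    matches = get_close_matches(ocr_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


period_lexicon = ["chastellion", "florins", "bourgeys", "ville"]
modern_lexicon = ["castle", "town"]

# "chaftellion": a long s misread as f, a typical error on early documents.
print(correct_word("chaftellion", period_lexicon))  # a suited lexicon repairs it
print(correct_word("chaftellion", modern_lexicon))  # an unsuited one cannot
```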
  • 29. From pixels … to non-textual contents: pictorial content is also of high interest ✓ Ornamental letters (more than 20000) ✓ Figures (more than 1500). BATYR: http://www.bvh.univ-tours.fr
  • 30. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval. Perceptual meta-data instead of classical meta-data ➢ Computation of signatures for all the images, or even for sub-parts of the images (EoC) • Computation time is not crucial • Signatures → visual features → perceptual meta-data. [Diagram labels: set of images to index; statistics; index database]
  • 31. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval. Using images as queries instead of words. [Diagram labels: search engine; back-office; front-office]
  • 32. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval. It is again a question of features… → We speak about signatures. [Diagram labels: off-line indexing (image dataset; signature computation; index database containing perceptual signatures); on-line retrieval (query image; signature computation; representation space; similar images)]
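The off-line indexing / on-line retrieval split can be sketched with a very simple signature. The choice of a normalised gray-level histogram as signature and of Euclidean distance as similarity measure is an assumption made for the example; real CBIR systems use richer shape, color and texture descriptors, and all names and images below are hypothetical.

```python
from math import dist


def signature(gray, bins=4):
    """Normalised gray-level histogram used as a (very crude) signature."""
    counts = [0] * bins
    total = 0
    for row in gray:
        for v in row:
            counts[min(v * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def build_index(images):
    """Off-line step: compute and store one signature per image."""
    return {name: signature(img) for name, img in images.items()}


def retrieve(query, index, top=3):
    """On-line step: rank indexed images by distance to the query signature."""
    q = signature(query)
    return sorted(index, key=lambda name: dist(q, index[name]))[:top]


images = {
    "dark": [[10, 20], [30, 40]],
    "light": [[250, 240], [230, 220]],
    "mixed": [[10, 250], [20, 240]],
}
index = build_index(images)
print(retrieve([[0, 5], [15, 60]], index))  # most similar image first
```

Because signatures are computed once at indexing time, only the query signature and the distance computations are needed at retrieval time.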
  • 33. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval ➢ From CBIR to word spotting. [Diagram labels: off-line indexing; EoC; signature computation; clustering]
  • 34. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval ➢ From CBIR to word spotting. [Diagram labels: on-line retrieval; query images; results: retrieved images]
  • 35. From pixels … to non-textual contents: an overview of Content-Based Image Retrieval ➢ From CBIR to word spotting
  • 36. Is it just a question of meta-data? CONTENTS IMAGES
  • 37. The standard model: descriptive meta-data + transcription. We usually have: • Descriptive meta-data in standard formats (MARC, EAD, Dublin Core, MODS, …) • Edited manually • “Semantic” information • Text transcription associated with additional meta-data (TEI) • Semi-automatic or manually edited transcription • “Semantic” information. [Diagram labels: images; digitization; keyboarding; image analysis; text regions; XML; text (TEI); descriptive meta-data (MARC); “semantic” meta-data]
  • 38. The standard model: perceptual meta-data ➢ It seems that CBIR can help to extract and save supplementary information about image content (EoC) without going as far as the semantic aspect (recognition) ▪ Regions of interest ▪ Visual features ▪ Perceptual signatures and index ➢ Shapes, positions, colors, textures, … ➢ Numerical values (vectors). [Diagram labels: images; digitization; keyboarding; image analysis; text regions; non-text regions; CBIR; XML; text (TEI); descriptive meta-data (MARC); meta-data + “perceptual” meta-data]
  • 39. Is it just a problem of meta-data? What about the encoding formats? Structuring data and choosing file formats is a difficult problem • Data architects are needed • Some interesting formats linked with the previous discussion o METS – Metadata Encoding and Transmission Standard o ALTO – Analyzed Layout and Text Object o TEI – Text Encoding Initiative. [Diagram labels: images; image; ALTO; TEI; METS; enriched documents; …]
  • 40. Is it just a problem of meta-data? ALTO for OCR. ALTO = Analyzed Layout and Text Object ▪ XML standard ▪ Created in 2003 during the METAe project ▪ Developed by the universities of Graz, Linz and Innsbruck ▪ Description of the content and the physical layout of one page ▪ Used by several OCR software packages ▪ Adapted and used by the BNF and other libraries ▪ Drawbacks: huge / static
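To give an idea of what "content and physical layout of one page" looks like, the sketch below writes a heavily simplified ALTO-like tree with Python's standard `xml.etree.ElementTree`. The element and attribute names (`Layout`, `Page`, `PrintSpace`, `TextBlock`, `TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) follow ALTO, but the namespace, the description header and several required attributes are omitted, so the output is illustrative and would not validate against the ALTO schema. The function name, the IDs and the word coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def words_to_alto(words, page_width, page_height):
    """Serialise one text line of (text, x, y, width, height) words."""
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="P1",
                         WIDTH=str(page_width), HEIGHT=str(page_height))
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock", ID="TB1")
    line = ET.SubElement(block, "TextLine", ID="TL1")
    for n, (text, x, y, w, h) in enumerate(words, start=1):
        # Each word keeps its transcription AND its position in the image.
        ET.SubElement(line, "String", ID=f"S{n}", CONTENT=text,
                      HPOS=str(x), VPOS=str(y), WIDTH=str(w), HEIGHT=str(h))
    return ET.tostring(alto, encoding="unicode")


words = [("receu", 120, 300, 90, 32), ("dudit", 225, 300, 80, 32)]
print(words_to_alto(words, page_width=2480, page_height=3508))
```

Storing one element per word with its coordinates is what preserves the link between text and image, and it is also why ALTO files become large.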
  • 41. Is it just a problem of meta-data? TEI for transcription and enriched contents ▪ TEI = Text Encoding Initiative, an XML standard ▪ Content tagging and logical structure encoding (full document) ▪ Widely used by libraries ▪ Too “open”? ▪ Quite “complex”
  • 42. Is it just a problem of meta-data? Link between meta-data? METS – Metadata Encoding and Transmission Standard • Open XML standard created in 2001 by the Digital Library Federation, maintained by the METS Editorial Board • XSD schema • Linking between multimedia objects • Complete description of digitized content (images, texts, audio, sculptures, …) • Physical / logical structures • Descriptive meta-data (DC, MODS, MARC, …) • … METS first-level elements: metsHdr: METS document header (author, software, … information); dmdSec: descriptive metadata section (bibliographic record); amdSec: administrative metadata section (copyright, …); fileSec: file inventory section (file locations); structMap: structural maps (physical and logical); structLink: structural map linking (links between structures)
  • 44. METS – Logical structure. [Diagram: METS with metsHdr, dmdSec, amdSec, fileSec and structMap. fileSec: fileGrp ID=Images, file ID=ALTO00002, FLocat xlink:href="file://./master/DDD_010120914_001.jp2"; fileGrp ID=ALTOS, file ID=ALTO00001, FLocat xlink:href="file://./alto/DDD_010120914_001.xml". structMap ID=Physical; structMap ID=Logical with div Type=Volume, div Type=Issue, div Type=Chapter (twice), div Type=Paragraph, then fptr / seq / area FILEID=ALTO00001, BEGIN=TXT203 and area FILEID=ALTO00002, BEGIN=TXT004. The areas point to the ALTO text blocks ID=TXT203 and ID=TXT004 and to the corresponding images; TEI.] The logical structMap reflects the enrichment of the different logical blocks, which are located on different pages and can be split by other logical blocks (such as footnotes, …).
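The way METS ties files and structures together can be sketched with `xml.etree.ElementTree`. The element names, the METS namespace `http://www.loc.gov/METS/` and the XLink namespace are those of the standard; the group IDs `Images` and `ALTOS`, the file paths and the target `TXT203` come from the diagram above. Everything else (the function name, the file ID `IMG00001`, the empty header and metadata sections, the `TYPE` values) is an assumption made for the example, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def m(tag):
    """Qualified name of a METS element."""
    return f"{{{METS}}}{tag}"


def build_mets(image_href, alto_href):
    root = ET.Element(m("mets"))
    ET.SubElement(root, m("metsHdr"))            # document header
    ET.SubElement(root, m("dmdSec"), ID="DMD1")  # descriptive metadata (empty here)
    ET.SubElement(root, m("amdSec"), ID="AMD1")  # administrative metadata (empty here)

    # File inventory: one group for the images, one for the ALTO files.
    file_sec = ET.SubElement(root, m("fileSec"))
    for group_id, file_id, href in (("Images", "IMG00001", image_href),
                                    ("ALTOS", "ALTO00001", alto_href)):
        group = ET.SubElement(file_sec, m("fileGrp"), ID=group_id)
        file_el = ET.SubElement(group, m("file"), ID=file_id)
        ET.SubElement(file_el, m("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": href})

    # Physical map: the page points to both files.
    physical = ET.SubElement(root, m("structMap"), TYPE="PHYSICAL")
    page = ET.SubElement(physical, m("div"), TYPE="Page")
    ET.SubElement(page, m("fptr"), FILEID="IMG00001")
    ET.SubElement(page, m("fptr"), FILEID="ALTO00001")

    # Logical map: a paragraph points into one text block of the ALTO file.
    logical = ET.SubElement(root, m("structMap"), TYPE="LOGICAL")
    chapter = ET.SubElement(logical, m("div"), TYPE="Chapter")
    paragraph = ET.SubElement(chapter, m("div"), TYPE="Paragraph")
    area_parent = ET.SubElement(ET.SubElement(paragraph, m("fptr")), m("seq"))
    ET.SubElement(area_parent, m("area"),
                  FILEID="ALTO00001", BETYPE="IDREF", BEGIN="TXT203")
    return root


root = build_mets("file://./master/DDD_010120914_001.jp2",
                  "file://./alto/DDD_010120914_001.xml")
print(ET.tostring(root, encoding="unicode"))
```

The logical `div` never contains text itself: it only references a block ID inside an ALTO file, which in turn carries the coordinates in the image.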
  • 46. From Pixels to Contents: Conclusions ➢ Building tools for the valorisation of digitized historical content is a multidisciplinary task ▪ Meta-data production → domain experts ▪ Selection and verification of the data → experts + data accuratist ▪ Structuring of the data and system → data / system architect ▪ Computer vision, machine learning → data scientist ➢ Manual indexing is needed ▪ Descriptive meta-data ▪ Semantic meta-data ▪ Standard formats for data encoding ▪ Could annotations be seen as supplementary meta-data? ➢ Operational methods and tools are available ▪ Acquisition devices ▪ Automatic tools: low-level image processing, OCR ▪ Perceptual meta-data should be added: CBIR
  • 47. From Pixels to Contents: Conclusions and perspectives ➢ The current context: big data and heterogeneous collections ▪ Connection between data, mutual enrichment, interoperability ▪ Introduction and management of additional knowledge ▪ Facing the diversity of the types of contents and usages ▪ Quality of the interaction instead of only the quantity ▪ Semantic Web: query reformulation, smart crawlers, automatic categorisation