Session5 01.rutger vankoert
Slides of the paper Tribunal Archives as Digital Research Facility (TRIADO): new ways to make archives accessible and useable by Anne Gorter, Edwin Klijn, Rutger Van Koert, Marielle Scherer and Ismee Tames at the 3rd Edition of the DATeCH2019 International Conference
  1. Anne Gorter, Edwin Klijn, Rutger van Koert, Marielle Scherer, Ismee Tames. DATECH 2019. From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and usable
  2. TRIADO: from laboratory to ‘reality check’
     • Partners: National Archives, NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster, Dutch War Collections
     • 2017-2019
     • Sample of 13.8 metres (160,000 pages) from the Central Archive of Special Jurisdiction (CABR)
  3. About the CABR
  4. About the CABR. Source: photo album ‘Centraal Archievendepot Justitie’; National Archives
  5. About the CABR. Source: photo album ‘Centraal Archievendepot Justitie’; National Archives
  6. Research questions
     1. Which digital methods are best suited (in terms of quality, efficiency, etc.) to make large corpora of unstructured, imperfect data, based on analogue collections, usable as a research facility? The focus is on applying proven technology to a sample of digitized documents to work towards four access points: who, what, where and when. (GENERIC part)
     2. Is it possible to answer specific, mainly quantitative statistical research questions on the basis of the digital data created under 1? (SPECIFIC part)
  7. Steps
     • Preparation for digitization
     • Scanning
     • Automatic transcription
     • Data enrichment
     • Web presentation
  8. Diverse quality of materials
  9. A little about OCR
     • ABBYY does a good job without much tweaking
     • Preprocessing helps Tesseract v4, but training helps even more
     • Tesseract has a lot of false positives, ABBYY practically none, but ABBYY can miss stamps etc.
     • Overall best: ABBYY
     • Retraining Tesseract: 150 pages, 80% training, 20% testing
     [Chart: CER and WER (error, 0-20) plotted against Tesseract training, 1 to 100,000]
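The CER and WER figures used to compare the OCR engines are standard edit-distance ratios. A minimal sketch of how they are computed (not the project's evaluation code, just the textbook definitions):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: same distance computed over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)
```

For example, `cer("hallo", "ha8lo")` is 0.2 (one substitution over five characters), and a one-word error in a three-word line gives a WER of about 0.33.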
  10. Named Entities: Who? Where?
     • After OCR: NER using Frog => results not too good
       - Broken OCR
       - Different language use in the period, e.g. capitalization of job titles such as Kapitein Jansen, Veldwachter Jansen
     • Much better: matching lists of names (almost) exactly
       - Can be tweaked with a relative Levenshtein distance
       - relativeDistance(“hallo”, “ha110”) = 0.3
       - relativeDistance(“hallo”, “ha8lo”) = 1.0
     • Fewer false positives, but also misses unknown names
     • Searching with relative Levenshtein increases false positives and is much slower
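The slide does not define relativeDistance, but its example values are consistent with a weighted edit distance in which visually confusable OCR substitutions (l↔1, o↔0, etc.) are much cheaper than arbitrary edits. A sketch under that assumption (the confusable pairs and the 0.1 cost are illustrative choices that reproduce the slide's numbers, not the project's actual parameters):

```python
# Hypothetical set of OCR-confusable character pairs with a reduced cost.
CONFUSABLE = {frozenset(p) for p in [("l", "1"), ("o", "0"), ("i", "1"),
                                     ("s", "5"), ("b", "8")]}

def sub_cost(x, y):
    """Substitution cost: free if equal, cheap if OCR-confusable, else 1."""
    if x == y:
        return 0.0
    return 0.1 if frozenset((x, y)) in CONFUSABLE else 1.0

def relative_distance(a, b):
    """Levenshtein distance with reduced cost for confusable characters."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [float(i)]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1.0,
                            curr[j - 1] + 1.0,
                            prev[j - 1] + sub_cost(x, y)))
        prev = curr
    return prev[-1]

def match_names(token, name_list, threshold=0.5):
    """Return known names whose weighted distance to an OCR token is low."""
    return [n for n in name_list if relative_distance(token, n) <= threshold]
```

With these weights, "ha110" stays close to "hallo" (three confusable substitutions, distance 0.3), while "ha8lo" does not (one ordinary substitution, distance 1.0), matching the slide's examples.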
  11. When? If it looks like a date…
     • 11 = possible day or month
     • Aug = possible month
     • 1945 = possible year
     • 11 + Aug + 1945 = date
     • 35 + Aug + 1945 != date
     • 80% correct; remaining errors are mostly OCR errors and some over-detection
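The rule on this slide (a token is a date only if its parts are a plausible day, month and year) can be sketched as a small heuristic. The month abbreviations and the year range are illustrative assumptions; the project's actual patterns would presumably include Dutch month names:

```python
import re

# Assumed English month abbreviations for illustration only.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def parse_date(text):
    """Return (day, month, year) if the text contains a plausible date."""
    m = re.search(r"\b(\d{1,2})\s+([A-Za-z]+)\.?\s+(\d{4})\b", text)
    if not m:
        return None
    day, year = int(m.group(1)), int(m.group(3))
    month = MONTHS.get(m.group(2).lower()[:3])
    # Each component must be plausible on its own: 1-31, a known month
    # name, and a year in an assumed range covering the archive's period.
    if month is None or not 1 <= day <= 31 or not 1900 <= year <= 1999:
        return None
    return (day, month, year)
```

This accepts "11 Aug 1945" but rejects "35 Aug 1945", mirroring the slide's examples; OCR damage to any component is the main remaining failure mode.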
  12. Finding the right document: the RVL-CDIP dataset
  13. Finding the right document
     • Questionnaire
     • Email
     • Budget
  14. Architecture of a CNN. Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
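The building block of the CNN architecture shown on this slide is the 2-D convolution, which slides a small learned kernel over the page image to detect local patterns. A pure-Python sketch of that single operation (an illustration of the principle, not the project's document classifier):

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation), the core CNN operation."""
    kh, kw = len(kernel), len(kernel[0])
    h = len(image) - kh + 1
    w = len(image[0]) - kw + 1
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Elementwise product of the kernel with one image patch.
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# A [-1, 1] kernel responds where pixel intensity changes left-to-right,
# i.e. it acts as a simple vertical-edge detector.
edges = conv2d([[0, 0, 1, 1],
                [0, 0, 1, 1]], [[-1, 1]])
```

In a real CNN many such kernels are learned from data and stacked in layers, with the final layers mapping the detected patterns to document classes (questionnaire, email, budget, etc.).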
  15. Frontend (anonymized): image removed for privacy reasons
  16. Some conclusions
     • OCR is useful: Word Error Rate (WER) of 15% for reports.
     • Auto-classification and date extraction are promising: error rate of 20%.
     • Training the software (machine learning) and preprocessing before OCR improve the results.
     • Extracting names of locations, organisations and persons using off-the-shelf NER produces lots of errors; using existing lists of names works better.
  17. Next steps
     • Project plan to digitize and provide access to 4 kilometres of archives
     • More linking with external data resources (names, places, organizations, etc.)
     • Extend automatic transcription
  18. www.oorlogsbronnen.nl | @Oorlogsbronnen | info@oorlogsbronnen.nl | www.oorlogsbronnen.nl/final-report-triado-enrichment-phase