What happened?
    Martin Majlis
Outline

    Introduction
    Architecture
    Back-end
          Downloading
          Extraction
    Front-end
          Web application
          iGoogle Gadget
28/01/10                     SWT - Final Project   2
Introduction

    Answers questions such as:
          what happened on 3 January
          what happened on 3 January 1865
          what happened in January 1825
          what happened from January until July 1985
          what happened during the 16th century
          what started in January 1930
          what ended in 1990

Architecture

    Back-end
          Downloading
          Structure conversion
          Parsing
    Front-end
          Web application
          iGoogle Gadget



Build process

    Fully automated
    One target for each phase
    Less error-prone
    GNU Make




Data Source

    Czech Wikipedia
          Documented format
          Dumps regularly generated
          Cleaner than general texts




Downloading / Conversion

    Downloading
          Script from DBpedia
          Added traffic shaping
    Data Conversion
          Recognizing pages/categories
          Building category “hierarchy”




Categories

    Confusing Structure
    Netherlands: 229
          Physics, Planets, Illusions, Psychology, Literature, Organ, Neuroscience, etc.
    Maximum depth: 5
    Median: 31
    Mean: 33.87
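
One plausible reading of these figures (an assumption, the slide does not spell it out) is that each page inherits every category reachable through parent links up to the maximum depth, which is how a page such as Netherlands can end up under Physics. A minimal Python sketch of such a depth-limited closure follows; the function name and the toy category graph are hypothetical.

```python
from collections import deque


def ancestor_categories(start, parents, max_depth=5):
    """Collect every category reachable from `start` within `max_depth` parent links."""
    seen = set()
    queue = deque((cat, 1) for cat in parents.get(start, ()))
    while queue:
        cat, depth = queue.popleft()
        if cat in seen or depth > max_depth:
            continue
        seen.add(cat)
        queue.extend((parent, depth + 1) for parent in parents.get(cat, ()))
    seen.discard(start)  # guard against cycles leading back to the start
    return seen


# Hypothetical toy graph: child -> list of parent categories.
PARENTS = {
    "Netherlands": ["Countries"],
    "Countries": ["Geography"],
    "Geography": ["Earth sciences"],
    "Earth sciences": ["Natural sciences"],
    "Natural sciences": ["Physics"],
}

print(sorted(ancestor_categories("Netherlands", PARENTS, max_depth=5)))
print(sorted(ancestor_categories("Netherlands", PARENTS, max_depth=2)))
```

With the full depth of 5 the unrelated category Physics is pulled in; a smaller depth keeps only the closer categories.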
Date Extraction – Regular Expressions

    Regular expressions aren't for parsing
          Day = (\d+)\.; Month = (Jan|Feb|...); Year = (\d+)
          Date = (Day Month Year | Day Month | Month Year | Year)
          Extract = (“from” Date “until” Date | Date “-” Date | “between” Date “and” Date | “from” Date)
    The day number can appear at 14 positions
    In practice: more than 1000 slots
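
The count of 14 day positions can be reproduced by composing the fragments above into one pattern. The Python sketch below is an illustration only: it assumes English month abbreviations and single spaces between tokens, whereas the real data is Czech.

```python
import re

# Fragments from the slide. English month abbreviations are an assumption.
DAY = r"(\d+)\."
MONTH = r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
YEAR = r"(\d+)"

# Date = Day Month Year | Day Month | Month Year | Year
DATE = rf"(?:{DAY} {MONTH} {YEAR}|{DAY} {MONTH}|{MONTH} {YEAR}|{YEAR})"

# Extract = from Date until Date | Date - Date | between Date and Date | from Date
EXTRACT = (
    rf"(?:from {DATE} until {DATE}|{DATE} - {DATE}"
    rf"|between {DATE} and {DATE}|from {DATE})"
)

pattern = re.compile(EXTRACT)

# Every textual copy of the Day fragment becomes its own capture group.
day_positions = EXTRACT.count(DAY)
total_groups = pattern.groups


def filled_groups(text):
    """Return (group number, value) pairs for the groups that took part in the match."""
    match = pattern.fullmatch(text)
    if match is None:
        return []
    return [(i, g) for i, g in enumerate(match.groups(), start=1) if g is not None]


print(day_positions, total_groups)
print(filled_groups("from 3. Jan 1956 until 2. Feb 1960"))
```

Each Date contains the Day fragment twice and Extract contains Date seven times, so the day can land in any of 14 groups. The caller has to inspect all of them to find the one that matched, which is what makes the pure regular-expression approach unwieldy.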

Date Extraction - Tools

    Standard way:
          GNU Flex / GNU Bison
          Ragel
    Problem with UTF-8 support
          Unicode – almost 100,000 characters
          Big transition tables (100,000 vs 127)




Date Extraction - Mixed

    Lexical Analysis
          Regular Expressions
          Filling Table
    Syntactic Analysis
          In theory: a context-free grammar (CFG)
          In practice: regular expressions again




Date Extraction - Example

    Lexical Analysis
          “From 23 January 1956 until 2 February 1960”
          “From {{DATE_1}} until {{DATE_2}}”
    Syntactic Analysis
          Interval = “From” DATE “until” DATE
          Interval = “Between” DATE “and” DATE
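
The two stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the English month names and the shape of the lookup table are all hypothetical.

```python
import re

MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}
DATE_RE = re.compile(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{1,4})")

# Syntactic rules work on placeholders only, so they stay short.
INTERVAL_RULES = [
    re.compile(r"From \{\{(DATE_\d+)\}\} (?:until|to) \{\{(DATE_\d+)\}\}"),
    re.compile(r"Between \{\{(DATE_\d+)\}\} and \{\{(DATE_\d+)\}\}"),
]


def lex(text):
    """Lexical analysis: replace each date with a placeholder and fill a table."""
    table = {}

    def substitute(match):
        key = f"DATE_{len(table) + 1}"
        table[key] = (int(match.group(1)), MONTHS[match.group(2)], int(match.group(3)))
        return "{{" + key + "}}"

    return DATE_RE.sub(substitute, text), table


def parse_interval(text):
    """Syntactic analysis: match the placeholder string against the interval rules."""
    tokens, table = lex(text)
    for rule in INTERVAL_RULES:
        match = rule.fullmatch(tokens)
        if match:
            return table[match.group(1)], table[match.group(2)]
    return None


print(lex("From 23 January 1956 until 2 February 1960"))
print(parse_interval("Between 3 January 1865 and 4 March 1865"))
```

Because the date details live in the table, each syntactic rule needs only two capture groups instead of one group per possible day, month and year position.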




Date Representation

    Dates from 10,000 BC to 2500 AD
    Not exact: 13th century, June 1689
    No year zero
          2 January - 5 days = 28 December
          2 January 1 AD - 5 days = 28 December 1 BC
    Simple tuples
          (“I”, 23, 1, 1956, 20, 2, 2, 1960, 20)
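
The arithmetic across the missing year zero can be checked with a small day-count conversion. The Python sketch below rests on assumptions that are not from the slides: it uses the proleptic Gregorian calendar and astronomical year numbering (year 0 stands for 1 BC, year -1 for 2 BC) internally, with the well-known days-from-civil conversion. Python's own datetime module cannot be used here because it only supports years 1 to 9999.

```python
def days_from_civil(year, month, day):
    """Day number of a proleptic Gregorian date (day 0 is 1 January 1970)."""
    y = year - 1 if month <= 2 else year         # count Jan/Feb with the previous year
    era = y // 400                               # floor division also works for y < 0
    yoe = y - era * 400                          # year of era, 0..399
    mp = month - 3 if month > 2 else month + 9   # March = 0 ... February = 11
    doy = (153 * mp + 2) // 5 + day - 1          # day of the March-based year
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468


def civil_from_days(z):
    """Inverse of days_from_civil: return (year, month, day)."""
    z += 719468
    era = z // 146097
    doe = z - era * 146097
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


def shift(day, month, year, delta_days):
    """Add delta_days to a date given as (day, month, astronomical year)."""
    y, m, d = civil_from_days(days_from_civil(year, month, day) + delta_days)
    return d, m, y


def label(year):
    """Convert an astronomical year to the BC/AD label used on the slide."""
    return f"{year} AD" if year >= 1 else f"{1 - year} BC"


d, m, y = shift(2, 1, 1, -5)   # 2 January 1 AD minus 5 days
print(d, m, label(y))
```

Keeping the internal year continuous and converting to BC/AD only for display avoids special cases at the boundary.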
Web application

    PHP5 + MySQL
    Nette Framework + Dibi
    http://css.majlis.cz/
          GT: http://jdem.cz/dspw9
    HTML, JSON, XML output




iGoogle Gadget

    iGoogle = Google personalized homepage
    URL: http://jdem.cz/dspx7
    Using JSON
    Tricky development




Future Work

    Improve performance
          20th century events – 28 s – 406,980 (one OR)
          20th century events – 0.0007 s – 392,573 (no OR)
    Improve parser architecture




Questions?




Thank You!

