Project group knowAAN
   Final presentation

         Adrian Wilke
   info[REMOVE]@adrianwilke.de


 Computer Science Education Group
     University of Paderborn


     October 20th 2011
Overview




    Introduction
    System components & Work flow
    Demonstration
    Development process
    Summary & Outlook
    Time for further detailed questions




                   PG knowAAN                    2

Overview: First part



    Goals
    Extraction & Storage (of data)
    Exploration (of data)
    System components & Work flow
    Analysis & Visualization (of data)




Goals



    Explore research networks
    Based on: Artifacts (scientific publications) and metadata
    Combination and analysis of data
    Computation of similarities of full texts
    Support for conference management system Ginkgo
    Data visualization
    Recommendations

              (Source: PG knowAAN project description)
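
The goal "Computation of similarities of full texts" can be illustrated with a minimal sketch. It assumes a cosine similarity over plain term-frequency vectors; the slides do not state which measure the project used, and the function names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample texts standing in for extracted full texts.
paper_a = term_frequencies("Clustering of scientific publications")
paper_b = term_frequencies("Clustering of conference publications")
paper_c = term_frequencies("Routing with ant colonies")

sim_ab = cosine_similarity(paper_a, paper_b)  # three of four terms shared
sim_ac = cosine_similarity(paper_a, paper_c)  # no shared terms
```

Texts that share most of their vocabulary score close to 1, texts without common terms score 0. A real system would likely weight terms (for example by inverse document frequency) before comparing.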





Imagine you are interested in a conference.
You downloaded the papers of 2 or 3 years.
  Now you have nearly 100 publications.
       How do you explore them?




   100 publications. Do you know any tools for this?
Extraction & Storage






           First step: Extract data and store it.




Exploration






               Second step: Explore data.







Exploring a conference








      Which extracted data is available for a publication?
                     → Database schema




Database schema

    Entity tables
        publication    id GUID, lucuid VARCHAR(512), title VARCHAR(512),
                       booktitle VARCHAR(512), normtitle VARCHAR(512),
                       date VARCHAR(512), editor VARCHAR(512),
                       journal VARCHAR(512), note VARCHAR(512),
                       pages VARCHAR(512), publisher VARCHAR(512),
                       tech VARCHAR(512), volume VARCHAR(512),
                       number VARCHAR(512), rawstring VARCHAR(4096),
                       xmlfile VARCHAR(512), pdffile VARCHAR(512),
                       topicfile VARCHAR(512), created BIGINT, modified BIGINT
        author         id GUID, text VARCHAR(512), normtext VARCHAR(512),
                       firstname VARCHAR(512), lastname VARCHAR(512),
                       created BIGINT, modified BIGINT
        affiliation    id GUID, text VARCHAR(512), location_id GUID
        address        id GUID, text VARCHAR(512), location_id GUID
        location       id GUID, latitude DOUBLE, longitude DOUBLE,
                       text VARCHAR(512)
        discipline     id GUID, text VARCHAR(512), parent_id GUID
        keyword        id GUID, text VARCHAR(512)
        concept        id GUID, text VARCHAR(512)
        category       id GUID, text VARCHAR(512)
        event          id GUID, text VARCHAR(512), filepath VARCHAR(512),
                       predecessor_id GUID, successor_id GUID
        eventseries    id GUID, text VARCHAR(512), filepath VARCHAR(512)

    Link tables
        pub_aut        publication_id GUID, author_id GUID
        pub_aff        publication_id GUID, affiliation_id GUID
        pub_add        publication_id GUID, address_id GUID
        pub_dis        publication_id GUID, discipline_id GUID
        pub_key        publication_id GUID, keyword_id GUID, score DOUBLE,
                       source VARCHAR(512)
        pub_con        publication_id GUID, concept_id GUID, score DOUBLE,
                       source VARCHAR(512)
        pub_cat        publication_id GUID, category_id GUID, score DOUBLE,
                       source VARCHAR(512)
        pub_evt        publication_id GUID, event_id GUID
        aut_aff        author_id GUID, affiliation_id GUID
        aut_add        author_id GUID, address_id GUID
        evt_evs        event_id GUID, eventseries_id GUID
        citation       publication1_id GUID, publication2_id GUID

    Further tables (columns not shown on the slide)
        category_count, concept_count, discipline_count, keyword_count,
        evt_pub_aut_count, bib_coupling, co_author, co_citation
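
As a rough illustration of how the publication, author, pub_aut and citation tables from the schema above could be queried, here is a minimal sketch using an in-memory SQLite database. Several things are assumptions made for the example: GUIDs are stored as text, only a few columns are created, all sample rows are invented, and publication1_id is read as the citing and publication2_id as the cited publication. The slides do not show how bib_coupling is actually filled; the second query is just one way to count shared references.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE author      (id TEXT PRIMARY KEY, text TEXT);
CREATE TABLE pub_aut     (publication_id TEXT, author_id TEXT);
CREATE TABLE citation    (publication1_id TEXT, publication2_id TEXT);
""")

# Invented sample rows; GUIDs are generated as text (assumption).
pubs = {title: str(uuid.uuid4()) for title in ("Paper One", "Paper Two", "Paper Three")}
auts = {name: str(uuid.uuid4()) for name in ("Author A", "Author B")}
conn.executemany("INSERT INTO publication VALUES (?, ?)",
                 [(guid, title) for title, guid in pubs.items()])
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [(guid, name) for name, guid in auts.items()])
conn.executemany("INSERT INTO pub_aut VALUES (?, ?)", [
    (pubs["Paper One"], auts["Author A"]),
    (pubs["Paper One"], auts["Author B"]),
    (pubs["Paper Two"], auts["Author A"]),
])
# Assumed direction: publication1 cites publication2.
conn.executemany("INSERT INTO citation VALUES (?, ?)", [
    (pubs["Paper One"], pubs["Paper Three"]),
    (pubs["Paper Two"], pubs["Paper Three"]),
])


def authors_of(title):
    """Resolve the authors of a publication through the pub_aut link table."""
    rows = conn.execute(
        "SELECT a.text FROM author a "
        "JOIN pub_aut pa ON pa.author_id = a.id "
        "JOIN publication p ON p.id = pa.publication_id "
        "WHERE p.title = ? ORDER BY a.text", (title,)).fetchall()
    return [row[0] for row in rows]


def coupling_strength(title_a, title_b):
    """Count references that both publications cite (bibliographic coupling)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM citation c1 "
        "JOIN citation c2 ON c1.publication2_id = c2.publication2_id "
        "JOIN publication pa ON pa.id = c1.publication1_id "
        "JOIN publication pb ON pb.id = c2.publication1_id "
        "WHERE pa.title = ? AND pb.title = ?", (title_a, title_b)).fetchone()
    return row[0]
```

With the sample rows, `authors_of("Paper One")` lists both authors, and `coupling_strength("Paper One", "Paper Two")` counts the one reference the two papers share.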
System components & Work flow






           How is our system structured?
                  → Some examples.







Components

    [Component diagram]
    Components: Backend, Clustering, DB, Roundtrip, PDFToText,
    Recommendation, Parscit, TrendDetection, TF-Component, TopicExtraction,
    xmlBuilder, ParscitTrainer, FrontendReferenceExtraction, DocBrowser,
    DataBase, Solr, FileStorage
    Connection labels: Model, WebServices, JDBC, FileSystem




[Sequence diagram: adding a PDF]

    Participants: DocumentBrowser, RoundTrip, RoundTripExecutor, PDFToText,
    Parscit, Languagedetection, Lemmatizer, NounExtraction, Solr, DB

    Calls a/1 to a/3 (call → returned value)
        a/1) .addPDF
        a/2) .writeToFS → Path
        a/3) .createThread, .submitThread

    Calls b/1 to b/12 (call → returned value)
        b/1)  .run
        b/2)  .getText → Text
        b/3)  .ParseFullText → ParscitXML
        b/4)  .extractBodyAndAstract → BodyAndAbstract
        b/5)  .getLanguage → LanguageString
        b/6)  .lemmatize → LemmatizedText
        b/7)  .extractNouns → NounsList
        b/8)  .lemmatizeNounslist → LemmatizedNouns
        b/9)  .ReduceToTopNouns → TopNouns
        b/10) .writeToFiles → Paths
        b/11) .addTexts → Solrid
        b/12) .addPublication
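
The call sequence above hands each added PDF to a background job (a/3) that, among other steps, reduces the extracted nouns to a short list (b/9). The following sketch illustrates only these two ideas. It assumes that "top nouns" means the most frequent ones and uses a standard-library thread pool; the function names, the limit of three and the sample nouns are hypothetical and not taken from the project code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def reduce_to_top_nouns(lemmatized_nouns, limit=3):
    """Keep the most frequent nouns (assumed meaning of 'top')."""
    counts = Counter(lemmatized_nouns)
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [noun for noun, _ in ranked[:limit]]


def process_document(lemmatized_nouns):
    """Stand-in for the background job; the real job has more steps."""
    return reduce_to_top_nouns(lemmatized_nouns)


# Hypothetical output of the lemmatization step for one document.
sample_nouns = ["network", "paper", "network", "author",
                "paper", "network", "topic"]

# Submit the job to a worker thread so the caller is not blocked,
# then collect the result once it is needed.
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(process_document, sample_nouns)
    top_nouns = future.result()
```

Running the job in a worker thread lets the upload call return while text extraction and parsing, which can be slow for large PDFs, continue in the background.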


Work flow




Analysis & Visualization






           Third step: Analyze and visualize data.







Analysis of authors







Analysis of scientific publications




Demonstration






                            Now: Demo.
           Image: http://www.flickr.com/photos/plaisanter/5525977163/


Development process



Technologies




                            Jersey






Methods of agile software development



     FDD, XP, Scrum








    Weekly meetings
    Sit together (as much as possible)
    Automated building system
    Continuous integration
    Issue tracking


Summary and Outlook



Summary and future work

 Summary
     Integrated processing of scientific papers
     Aggregated visualization of authors, publications and events
     Various analyses computed over the data
     Cleaning functionality for automatically processed data

 Future work
     Parallelized clustering
     Additional graphical visualizations
     Improved extraction of metadata from PDF files



Thank you for your attention




                           Questions?


knowAAN final presentation

  • 1. Project group knowAAN Final presentation Adrian Wilke info[REMOVE]@adrianwilke.de Computer Science Education Group University of Paderborn October 20th 2011
  • 2. Overview: Introduction; System components & Work flow; Demonstration; Development process; Summary & Outlook; Time for further questions of detail
  • 3. Overview, first part: Goals; Extraction & Storage (of data); Exploration (of data); System components & Work flow; Analysis & Visualization (of data)
  • 4. Goals: Explore research networks. Based on: artifacts (scientific publications) and metadata. Combination and analysis of data. Computation of similarities of full texts. Support for conference management system Ginkgo. Data visualization. Recommendations. (Source: PG knowAAN project description)
  • 5. Goals: Imagine you are interested in a conference. You downloaded the papers of 2 or 3 years. Now you have nearly 100 publications. How do you explore them? 100 publications. Do you know tools?
  • 6. Extraction & Storage. First step: Extract data and store it.
  • 8. Exploration. Second step: Explore data.
  • 10. Exploration. Which extracted data is available for a publication? → Database schema
  • 11. Database schema (tables and columns):
      discipline: id GUID, text VARCHAR(512), parent_id GUID
      pub_dis: publication_id GUID, discipline_id GUID
      affiliation: id GUID, text VARCHAR(512), location_id GUID
      pub_aff: publication_id GUID, affiliation_id GUID
      aut_aff: author_id GUID, affiliation_id GUID
      publication: id GUID, lucuid VARCHAR(512), title VARCHAR(512), booktitle VARCHAR(512), normtitle VARCHAR(512), date VARCHAR(512), editor VARCHAR(512), journal VARCHAR(512), note VARCHAR(512), pages VARCHAR(512), publisher VARCHAR(512), tech VARCHAR(512), volume VARCHAR(512), number VARCHAR(512), rawstring VARCHAR(4096), xmlfile VARCHAR(512), pdffile VARCHAR(512), topicfile VARCHAR(512), created BIGINT, modified BIGINT
      author: id GUID, text VARCHAR(512), normtext VARCHAR(512), firstname VARCHAR(512), lastname VARCHAR(512), created BIGINT, modified BIGINT
      pub_aut: publication_id GUID, author_id GUID
      keyword: id GUID, text VARCHAR(512)
      pub_key: publication_id GUID, keyword_id GUID, score DOUBLE, source VARCHAR(512)
      concept: id GUID, text VARCHAR(512)
      pub_con: publication_id GUID, concept_id GUID, score DOUBLE, source VARCHAR(512)
      category: id GUID, text VARCHAR(512)
      pub_cat: publication_id GUID, category_id GUID, score DOUBLE, source VARCHAR(512)
      citation: publication1_id GUID, publication2_id GUID
      location: id GUID, latitude DOUBLE, longitude DOUBLE, text VARCHAR(512)
      address: id GUID, text VARCHAR(512), location_id GUID
      aut_add: author_id GUID, address_id GUID
      pub_add: publication_id GUID, address_id GUID
      event: id GUID, text VARCHAR(512), filepath VARCHAR(512)
      eventseries: id GUID, text VARCHAR(512), filepath VARCHAR(512)
      pub_evt: publication_id GUID, event_id GUID
      evt_evs: event_id GUID, eventseries_id GUID
      (table name not recoverable): predecessor_id GUID, successor_id GUID
      Listed without columns: category_count, discipline_count, concept_count, keyword_count, evt_pub_aut_count, bib_coupling, co_author, co_citation
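To make the schema on slide 11 more concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It recreates only four of the tables (publication, author, pub_aut, citation) with a reduced set of columns, stores GUIDs as plain text, and fills them with invented demo rows. The slide lists co_author and bib_coupling without showing how they are defined, so the two queries below are assumptions about what such relations could compute, not the project's actual definitions. The direction of the citation table (publication1_id cites publication2_id) is also an assumption.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small subset of the slide's schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE author (id TEXT PRIMARY KEY, text TEXT);
        CREATE TABLE pub_aut (publication_id TEXT, author_id TEXT);
        CREATE TABLE citation (publication1_id TEXT, publication2_id TEXT);
    """)
    # Invented demo data, not taken from the project.
    conn.executemany("INSERT INTO publication VALUES (?, ?)",
                     [("p1", "Paper one"), ("p2", "Paper two"), ("p3", "Paper three")])
    conn.executemany("INSERT INTO author VALUES (?, ?)",
                     [("a1", "Ada"), ("a2", "Ben"), ("a3", "Cleo")])
    conn.executemany("INSERT INTO pub_aut VALUES (?, ?)",
                     [("p1", "a1"), ("p1", "a2"),
                      ("p2", "a1"), ("p2", "a2"), ("p2", "a3")])
    # Assumption: publication1_id cites publication2_id.
    conn.executemany("INSERT INTO citation VALUES (?, ?)",
                     [("p1", "p3"), ("p2", "p3")])
    return conn


def co_author_counts(conn):
    """Return (author, author, shared publications) for each co-author pair."""
    return conn.execute("""
        SELECT x.author_id, y.author_id, COUNT(*) AS shared
        FROM pub_aut AS x
        JOIN pub_aut AS y
          ON x.publication_id = y.publication_id
         AND x.author_id < y.author_id
        GROUP BY x.author_id, y.author_id
        ORDER BY shared DESC, x.author_id, y.author_id
    """).fetchall()


def bib_coupling_counts(conn):
    """Return (publication, publication, shared references) for each coupled pair."""
    return conn.execute("""
        SELECT x.publication1_id, y.publication1_id, COUNT(*) AS shared
        FROM citation AS x
        JOIN citation AS y
          ON x.publication2_id = y.publication2_id
         AND x.publication1_id < y.publication1_id
        GROUP BY x.publication1_id, y.publication1_id
        ORDER BY x.publication1_id, y.publication1_id
    """).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(co_author_counts(db))
    print(bib_coupling_counts(db))
```

The self-joins on the link tables are one common way to derive pair relations from many-to-many tables; the original system may have used views, materialized tables, or application code instead.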
  • 12. System components & Work flow. How is our system structured? → Some examples.
  • 13. System components & Work flow: Components Model. Elements named in the diagram: Backend, ParscitTrainer, Parscit, Clustering, WebServices, Frontend, ReferenceExtraction, DB, TrendDetection, DocBrowser, Roundtrip, TF-Component, JDBC, PDFToText, TopicExtraction, DataBase, Recommendation, xmlBuilder, Solr, FileSystem, FileStorage
  • 14. Participants: DocumentBrowser, RoundTrip, RoundTripExecutor, PDFToText, Parscit, Languagedetection, Lemmatizer, NounExtraction, Solr, DB. Calls (→ return value):
      a/1) .addPDF
      a/2) .writeToFS → Path
      a/3) .createThread, .submitThread
      b/1) .run
      b/2) .getText → Text
      b/3) .ParseFullText → ParscitXML
      b/4) .extractBodyAndAstract → BodyAndAbstract
      b/5) .getLanguage → LanguageString
      b/6) .lemmatize → LemmatizedText
      b/7) .extractNouns → NounsList
      b/8) .lemmatizeNounslist → LemmatizedNouns
      b/9) .ReduceToTopNouns → TopNouns
      b/10) .writeToFiles → Paths
      b/11) .addTexts → Solrid
      b/12) .addPublication
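The call sequence on slide 14 (submit a PDF, then run the extraction steps in a background thread) can be sketched roughly as follows. This is a toy illustration in Python, not the project's code: every function here is a hypothetical stand-in, only a few of the steps are modelled, the PDF text is hard-coded, every word is treated as a noun, and "top nouns" is assumed to mean the most frequent ones.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def get_text(pdf_path):
    """Stand-in for PDFToText: returns fixed demo text instead of reading a PDF."""
    return "Networks of authors. Authors cite authors in networks."


def extract_nouns(text):
    """Stand-in for NounExtraction: treats every lowercased word as a noun."""
    return [word.strip(".").lower() for word in text.split()]


def reduce_to_top_nouns(nouns, n=2):
    """Assumed meaning of ReduceToTopNouns: keep the n most frequent nouns."""
    return [word for word, _ in Counter(nouns).most_common(n)]


def round_trip(pdf_path):
    """Run the (reduced) b/ steps for one document and return the result."""
    text = get_text(pdf_path)
    nouns = extract_nouns(text)
    return {"path": pdf_path, "top_nouns": reduce_to_top_nouns(nouns)}


def add_pdf(executor, pdf_path):
    """Mirror of the a/ steps: hand the document to a worker thread."""
    return executor.submit(round_trip, pdf_path)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = add_pdf(executor, "demo.pdf")
        print(future.result())
```

Submitting the work to an executor keeps the caller responsive while a document is processed, which matches the split between the a/ and b/ steps on the slide. The real pipeline additionally parses with Parscit, detects the language, lemmatizes, and writes to Solr and the database.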
  • 15. System components & Work flow: Work flow
  • 16. Analysis & Visualization. Third step: Analyze and visualize data.
  • 17. Analysis & Visualization: Analysis of authors
  • 18. Analysis & Visualization: Analysis of scientific publications
  • 19. Demonstration. Now: Demo. Image: http://www.flickr.com/photos/plaisanter/5525977163/
  • 20. Development process: Technologies. Jersey
  • 21. Development process: Methods of agile software development. FDD, XP, Scrum
  • 22. Development process: Methods of agile software development. Weekly meetings; Sit together (as much as possible); Automated building system; Continuous integration; Issue tracking
  • 23. Summary and Outlook: Summary and future work. Summary: Integrated processing of scientific papers; Aggregated visualization of authors, publications and events; Computation of various analyses over the data; Cleaning functionality for automatically processed data. Future work: Parallelized clustering; Additional graphical visualization; Improved extraction of metadata from PDF files
  • 24. Summary and Outlook: Thank you for your attention. Questions?