AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of experience. Klaus Kater (Copyright Clearance Center, DE)


10 years in the making: how real-world business cases have driven the development of CCC's deep search solutions, leading to the capabilities for web crawling and delivery of targeted intelligence that help R&D-intensive companies gain a competitive advantage.


  1. AI-SDV 2022, Oct. 10/11 2022. Klaus Kater, Director, Research & Development
  2. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
  3. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
     2. 2014: Prototyping a configurable crawler framework
  4. Look and feel 2014 vs. look and feel 2022
  5. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
     2. 2014: Prototyping a configurable crawler framework
     3. 2015: Rollout of Company SEARCHCORPUS
  6. Having started in 2015 with just about 30,000 companies, the company SEARCHCORPUS keeps growing and growing (timeline 2015-2022):
     - 30,000 company websites; duration 2 weeks (crawling, indexing); 50 GB of web data
     - 60,000 company websites; still 2 weeks (crawling, indexing); 500 GB of web data
     - 290,000 company websites; link depth 5; still 2 weeks (crawling, geolocation, classification, indexing); 2 TB of web data
  7. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
     2. 2014: Prototyping a configurable crawler framework
     3. 2015: Rollout of Company SEARCHCORPUS
     4. 2016: Thesaurus management for domain-specific content selection
  8. (no extractable text on this slide)
  9. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
     2. 2014: Prototyping a configurable crawler framework
     3. 2015: Rollout of Company SEARCHCORPUS
     4. 2016: Thesaurus management for domain-specific content selection
     5. 2017: Establishing a process to roll out news trackers and other crawling solutions
  10. Timeline 2015-2017:
      - Preparation of international rollout of domain-specific targeted news trackers and alerting
      - Animal Health Tracker
      - RBB Tracker
      - CRDI Tracker
      - BD&L Tracker
      - Single Sign-On with automatic user provisioning (SAML)
  11. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain-specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
  12. Covered clinical trial registries: Australia / New Zealand, China, Republic of Korea, USA, Germany, European Union, Hong Kong, International (Springer), Japan, Philippines, University Hospital Medical Information Network, WHO
  13. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain-specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
      7. 2018: Machine learning based classification of company websites
  14. 1. To obtain a reasonably sized input vector (remember, we classify a whole website, which may have several hundred MB of content), we convert the data into a vector using a TF-IDF pre-processor trained on a corpus collected for the project.
      2. Support Vector Machines alone are not good enough, therefore pre-processing of all input with a custom thesaurus is necessary.
      3. For all 6 real-world samples we got a >96% average recognition rate. (See the TF-IDF + SVM sketch after the slide list.)
  15. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain-specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
      7. 2018: Machine learning based classification of company websites
      8. 2019: Globally distributed massive parallel crawling
  16. Transferring 1 page from London: 82 ms; 500 pages: 41 seconds; 1,000 servers: 11.5 hours. Transferring 1 page from Tokyo: 1,200 ms; 500 pages: 10 minutes; 1,000 servers: 6 days 23 hours. (A back-of-the-envelope check of these figures follows after the slide list.)
      Background image: NASA Terra/MODIS Earth composite (NASA Goddard Space Flight Center, image by Reto Stöckli, enhancements by Robert Simmon), public domain, via Wikimedia Commons.
  17. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain-specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
      7. 2018: Machine learning based classification of company websites
      8. 2019: Globally distributed massive parallel crawling
      9. 2020/21: Deep learning for automated news rating
  18. News-rating architecture (diagram):
      - Sources: corporate websites, news portals, news feeds, 3rd-party APIs, licensed 3rd-party content
      - Crawlers and feed readers crawl / retrieve news via {APIs} and store extracted news in a news archive of (un)rated news
      - Consumers request a news rating for a selected model via a standard API and receive the news rated with that model
      - Model lifecycle: publish model + metadata; deploy and verify the selected model; model rating; optimization of models and retraining
      (A minimal client-side sketch of such a rating request follows after the slide list.)
  19. 1. 2012-2013: Crawling metasearch results (Yahoo.com / Bing.com)
      2. 2014: Prototyping a configurable crawler framework
      3. 2015: Rollout of Company SEARCHCORPUS
      4. 2016: Thesaurus management for domain-specific content selection
      5. 2017: Establishing a process to roll out news trackers and other crawling solutions
      6. 2018: Clinical Trial Registry Tracker
      7. 2018: Machine learning based classification of company websites
      8. 2019: Globally distributed massive parallel crawling
      9. 2020/21: Deep learning for automated news rating
      10. 2022: Automating regulatory intelligence collection and classification (will be integrated with intranet applications to manage regulatory events)
  20. Klaus Kater, kkater@copyright.com
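The website classification described on slide 14 (TF-IDF vectorization of whole-website text, thesaurus-based pre-processing, then a Support Vector Machine) can be illustrated with a minimal sketch. The training texts, labels, thesaurus entries, and feature count below are hypothetical placeholders, not CCC's actual pipeline; the sketch assumes Python with scikit-learn.

```python
# Minimal sketch of TF-IDF + SVM website classification (scikit-learn).
# Training data, thesaurus entries and parameters are hypothetical.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical custom thesaurus: map domain synonyms to canonical terms
# before vectorization, since SVM alone was reported as not good enough.
THESAURUS = {
    "veterinary medicine": "animal health",
    "contract research organisation": "cro",
}

def apply_thesaurus(text: str) -> str:
    """Normalize domain terminology so different wordings share one token."""
    text = text.lower()
    for synonym, canonical in THESAURUS.items():
        text = text.replace(synonym, canonical)
    return text

# Hypothetical training corpus: one concatenated text blob per website,
# plus the class assigned to that website.
train_texts = [apply_thesaurus(t) for t in [
    "animal health products for livestock and companion animals ...",
    "industrial pumps and valves for chemical plants ...",
]]
train_labels = ["animal_health", "other"]

# TF-IDF keeps the input vector at a fixed, manageable size even though a
# whole website may contain hundreds of MB of text.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000)),
    ("svm", LinearSVC()),
])
model.fit(train_texts, train_labels)

print(model.predict([apply_thesaurus("veterinary medicine for cattle ...")]))
```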
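The figures on slide 16 follow from a simple model: with one sequential crawler, total time is per-page latency times pages per site times number of sites. A minimal check, assuming the slide's 500 pages per site and 1,000 company servers:

```python
# Back-of-the-envelope check of slide 16: sequential crawling time grows
# linearly with round-trip latency, pages per site and number of sites.
def crawl_time_seconds(latency_ms: float, pages_per_site: int, sites: int) -> float:
    """Total time if a single crawler fetches every page one after another."""
    return latency_ms / 1000 * pages_per_site * sites

for city, latency_ms in [("London", 82), ("Tokyo", 1200)]:
    total = crawl_time_seconds(latency_ms, pages_per_site=500, sites=1000)
    print(f"{city}: {total / 3600:.1f} hours ({total / 86400:.1f} days)")
# London: 11.4 hours (0.5 days); Tokyo: 166.7 hours (6.9 days)
```

Running many crawlers in parallel, placed close to the target servers, collapses these times, which is the motivation behind the globally distributed, massively parallel crawling introduced in 2019.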
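Slide 18's standard API for requesting a news rating with a selected model could look roughly like the client sketch below. The endpoint URL, JSON fields, and model identifier are invented placeholders for illustration; only the general request/response pattern is taken from the diagram.

```python
# Illustrative client for a news-rating service; the URL, JSON fields and
# model id are hypothetical placeholders, not the actual CCC API.
import requests

RATING_ENDPOINT = "https://news-rating.example.invalid/api/v1/ratings"  # placeholder

def rate_news(item_text: str, model_id: str) -> dict:
    """Request a rating for one extracted news item using the selected
    (deployed) model and return the rated result."""
    response = requests.post(
        RATING_ENDPOINT,
        json={"model": model_id, "text": item_text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    rated = rate_news(
        "Regulator approves new indication for compound X.",
        model_id="news-rating-2021",  # hypothetical model name
    )
    print(rated)
```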
