Jeremie Charlet
04 08 2015
Trial-and-error experiments on
Taxonomy Applications
Introduction
What is Taxonomy?
To better understand what it is about,
let’s run a search on Discovery!
Taxonomy is simply about classification.
Here it refers to the applications used
to apply categories (or subjects) to the records in Discovery.
The project involved several people from the Taxonomy team
and the Systems Development team.
Solution
Administration interface for taxonomists
Application to categorise everything once
1. To do it for the first time
2. To apply the latest modifications from taxonomists to all documents
Application to categorise documents every day
1. To categorise new documents
2. To re-categorise documents when they are updated
Plan
This presentation is all about how we built this categorisation system
A. Using category queries
1. Get it right
2. Get it fast
a) Evolution of the algorithm
b) Fine tuning
c) Scale out
B. Attempt using machine learning
o Using a training-set-based algorithm
A. Using category queries
How to categorise a document?
Solution (from the former system, Autonomy):
1 category = 1 search query
“air force”: "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
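A minimal sketch of the "1 category = 1 search query" idea, in plain Python rather than Lucene. It treats a category query as an OR over exact phrases; the dictionary name `CATEGORY_QUERIES` and the function `categorise` are illustrative assumptions, and only phrases quoted on the slide are used (the real query is longer).

```python
# Assumption: a category query is an OR over exact phrases.
# Phrases are taken from the slide; the real query continues after "…".
CATEGORY_QUERIES = {
    "Air Force": [
        "Air Force", "air forces", "Air Ministry", "Air Historical Branch",
        "Air Department", "Air Board", "Air Council",
    ],
}

def categorise(text, queries=CATEGORY_QUERIES):
    """Return the sorted names of all categories with at least one matching phrase."""
    return sorted(
        name for name, phrases in queries.items()
        if any(phrase in text for phrase in phrases)
    )
```

This naive version is case-sensitive and matches raw substrings, which is the kind of parameter the next slide is about.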
A.1. Get it right
Many parameters to take into account
• Is case sensitivity important?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Which attributes to use (title, description, …)? Are some more important than others?
• And many others
> Iterative process
How to evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare the results
To do that quickly, we created a command-line interface
[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
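The "categorise again and compare" step can be sketched as follows. The slides do not say which metric the CLI reports, so precision and recall are an assumption here, as are the function name `evaluate` and the sample category names other than "Poverty".

```python
def evaluate(expected, predicted):
    """Compare two categorisations given as {doc_id: set of category names}.

    Returns (precision, recall) over all (document, category) pairs.
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, want in expected.items():
        got = predicted.get(doc_id, set())
        true_pos += len(want & got)    # categories both systems agree on
        false_pos += len(got - want)   # categories only the new system applied
        false_neg += len(want - got)   # categories the new system missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical evaluation data: former system vs. new system.
former = {"d1": {"Air Force"}, "d2": {"Poverty", "Health"}}
new = {"d1": {"Air Force", "Navy"}, "d2": {"Poverty"}}
precision, recall = evaluate(former, new)
```

Running the same comparison after each parameter change is what turns it into a regression and benchmarking tool.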
Findings
1. Automating the evaluation
o saved me a lot of time
o regression tool
o benchmarking tool
2. Using a training-set-based system was not satisfactory
3. Needed to ignore case and punctuation in most cases
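A small sketch of what "ignore case and punctuation" can mean in practice. In Lucene this would be the analyzer's job; the plain-Python helpers `normalise` and `phrase_in` below are assumptions for illustration only.

```python
import re

def normalise(text):
    """Lower-case the text and replace any run of punctuation or spaces with one space."""
    return " ".join(re.findall(r"\w+", text.lower()))

def phrase_in(phrase, text):
    """True if the normalised phrase occurs in the normalised text on word boundaries."""
    return f" {normalise(phrase)} " in f" {normalise(text)} "
```

With this, "AIR-FORCE," in a record matches the query phrase "Air Force", while "Fair forces" does not.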
A.2. Get it fast
How to apply our 140 categories to 22 million records quickly?
How fast do we need our system to be?
• Former system: 10+ days
o clunky
o Have to wait months to do it again
o What if categorisation goes wrong? Start again for 10 days?
• Target: ~1 day, i.e. 1 document categorised in about 4 ms
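The 4 ms figure follows from dividing one day by the number of records. A short worked check (the function name is an assumption):

```python
def per_document_budget_ms(total_documents, days):
    """Milliseconds available per document to finish within the given number of days."""
    ms_available = days * 24 * 60 * 60 * 1000
    return ms_available / total_documents

# 86,400,000 ms in a day divided by 22,000,000 records: just under 4 ms each.
budget = per_document_budget_ms(22_000_000, 1)
```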
A.2.a Evolution of the algorithm
Let’s categorise 1 document at a time
Solutions tried:
• Run queries in parallel
• Run inverted queries
• Run every query against every document, one after another, on the file index
• Run queries against a memory index
• Run queries in memory to find candidates, then run the candidates against the file index
Time to categorise everything, as listed on the slide: Works; A few years; Fewer years; About 10 days; ?; About 10 days (60ms/doc)
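One possible reading of the last solution, sketched in plain Python: a cheap in-memory test narrows the categories down to a few candidates, and only those candidates are run against the expensive file index. Everything here (function names, the term sets, the fake index) is a hypothetical stand-in, not the Lucene code used in the project.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a text, as a set."""
    return set(text.lower().split())

def find_candidates(doc_text, category_terms):
    """Stage 1 (in memory): keep categories that share a term with the document."""
    doc_tokens = tokens(doc_text)
    return [name for name, terms in category_terms.items() if doc_tokens & terms]

def categorise(doc_id, doc_text, category_terms, run_on_file_index):
    """Stage 2: confirm only the candidates with the expensive file-index query."""
    return [name for name in find_candidates(doc_text, category_terms)
            if run_on_file_index(name, doc_id)]

# Hypothetical data: category names and terms are illustrative only.
CATEGORY_TERMS = {
    "Air Force": {"air", "squadron"},
    "Poverty": {"poor", "workhouse"},
}
expensive_calls = []

def fake_file_index(category, doc_id):
    """Stand-in for the real index lookup; records each call and always confirms."""
    expensive_calls.append(category)
    return True

result = categorise("doc-1", "Air Ministry correspondence", CATEGORY_TERMS, fake_file_index)
```

The point of the design is that the expensive lookup runs once per candidate instead of once per category.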
A.2.b Fine tuning
Use the right driver for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries
Use a filter instead of a query to search on only 1 document
+ use the low-level API carefully
Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency
7% of the time was spent creating Query objects for every document:
instead, create them once and store them in memory
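The last tip, building query objects once and reusing them, can be illustrated as below. Compiled regular expressions stand in for Lucene Query objects, which is an assumption of this sketch, as are the function names.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled_query(phrases):
    """Build the query object once per distinct tuple of phrases, then reuse it."""
    pattern = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(pattern, re.IGNORECASE)

def matches(phrases, text):
    """Run the cached query for these phrases against one document."""
    return compiled_query(tuple(phrases)).search(text) is not None
```

Every document after the first reuses the cached object, so the construction cost is paid once per category rather than once per document.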
A.2.c Scale out
Requires a suitable architecture
~Microservice-like rather than a monolithic application
Back to the solution…
GUI for taxonomists (+ backend for the GUI)
• Available at all times
• Runs search queries
• Updates categories
Application to categorise everything once
• Runs once in a while
• Needs a large number of instances to do the job as fast as possible
• Categorises everything
Application to categorise documents every day
• Runs every night
• Receives categorisation requests from another system
On the currently available platform:
2 × 24-core CPU
40 GB RAM
2 × 6 categorisation processes
Categorise 22m documents in 1d 8h
= 5ms to categorise 1 doc
Previously (run queries in memory to find candidates and run the candidates against the file index): about 10 days (60ms/doc)
Progress is linear
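A quick check of the "progress is linear" claim, assuming the 60 ms/doc figure is for one process and the work splits evenly over the 2 × 6 = 12 processes. Both assumptions are mine; the slides only give the numbers.

```python
def effective_ms_per_document(ms_per_document, processes):
    """Average time per document when the work is spread evenly over the processes."""
    return ms_per_document / processes

def wall_clock_hours(total_documents, ms_per_document, processes):
    """Elapsed hours for the whole run under perfectly linear scaling."""
    total_ms = total_documents * ms_per_document
    return total_ms / processes / (1000 * 60 * 60)

effective = effective_ms_per_document(60, 12)   # 60 ms spread over 12 processes
hours = wall_clock_hours(22_000_000, 60, 12)    # ideal duration for 22m documents
```

The ideal figures come out at 5 ms per document and a little over 30 hours, close to the 1d 8h reported above.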
Let’s imagine that we use cloud services
Let’s suppose we already pay for something equivalent on Microsoft Azure: 4 × the instance below

INSTANCE  CORES  RAM    DISK SIZES  PRICE
D3        4      14 GB  200 GB      £0.4179/hr

How much does it cost to use twice that number of servers to be twice as fast (ideally)?
NOTHING (* if you shut down your servers once the process has ended)
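The "NOTHING" answer is simple arithmetic: with hourly billing and ideal scaling, twice the servers for half the time costs the same. In the sketch below the hourly price is the one from the table, while the 32-hour duration is a hypothetical figure borrowed from the on-premises run, not a measurement on Azure.

```python
def run_cost(servers, hours, price_per_hour=0.4179):
    """Cost in pounds of running `servers` instances for `hours` at an hourly price."""
    return servers * hours * price_per_hour

base = run_cost(servers=4, hours=32)      # 4 instances for the full run
doubled = run_cost(servers=8, hours=16)   # twice the instances, half the time
```

This only holds if the servers are shut down as soon as the run ends and if scaling really is linear.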
B. Using machine learning
Research on a training-set-based solution for 2 months
Biggest failure, best learning
1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
o Train the system with the training set
o Evaluate it using the test set
o Iterate until satisfactory
3. Move it to production
o Classify new documents using the trained system
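The workflow above can be sketched end to end with a tiny classifier. This is a plain-Python multinomial naive Bayes with Laplace smoothing, swapped in for illustration; the project used a tool built into Lucene, and this sketch assigns a single category per document, which is simpler than the real task. The sample texts and all function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def split(dataset, test_ratio=0.25, seed=0):
    """Shuffle (text, category) pairs and split them into training and test sets."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_ratio))
    return items[:cut], items[cut:]

def train(training_set):
    """Collect per-category word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in training_set:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary

def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Laplace smoothing: unseen words get a small non-zero probability.
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best

def accuracy(model, test_set):
    """Share of test documents whose predicted category equals the known one."""
    hits = sum(classify(model, text) == category for text, category in test_set)
    return hits / len(test_set)

# Hypothetical, already classified documents.
KNOWN = [
    ("air ministry squadron", "Air Force"),
    ("air council bomber", "Air Force"),
    ("workhouse poor relief", "Poverty"),
    ("poor law union", "Poverty"),
]
model = train(KNOWN)
```

In practice the model would be trained on the training half from `split` and scored with `accuracy` on the test half, iterating until the score is satisfactory.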
Why it did not work
1. Using category queries to create the training set
o Highly dependent on the validity/accuracy of the category queries
2. Nature of our categories
o far too many (136)
o categories too vague or too similar (“Poverty”): do not suit such a system
3. Not the right tool? We used the tool built into Lucene (a search engine)
4. Nature of the data?
Why we should get into it
• Capabilities are impressive (examples)
• Made possible by cloud computing (the computing power needed is all available)
• Machine Learning as a Service
> You can play with it for free (*) and start prototyping

Tna how taxonomy applications were built
