5. Introduction
Taxonomy is about classification. Here it concerns the applications used
to apply categories (or subjects) to the records in Discovery.
The project involved several people from the Taxonomy team
and the Systems Development team.
6. Introduction
Solution:
•An administration interface for taxonomists
•An application to categorise everything at once
1. to do it for the first time
2. to apply the taxonomists' latest modifications to all documents
•An application to categorise documents every day
1. to categorise new documents
2. to re-categorise documents when they are updated
7. Plan
This presentation is about how we built this categorisation system
A. Using category queries
1. Get it right
2. Get it fast
a) Evolution of the algorithm
b) Fine tuning
c) Scale out
B. Attempt using machine learning
o Using a training-set-based algorithm
8. A. Using category queries
How do we categorise a document?
Solution (from the former system, Autonomy):
1 category = 1 search query

Example: the query for the category “air force”:
"Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
9. A.1. Get it right
Many parameters to take into account:
•Is case sensitivity important?
•Use synonyms?
•Ignore stop words (of, the, a, …)?
•Which attributes to use (title, description, …)? Are some more important than others?
•And many others
> Iterative process

How do we evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare the results

To do that quickly, we created a command-line interface:

[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
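A minimal sketch of what such an evaluation step could look like, assuming hypothetical `evaluate` and `categorise` helpers (the names and shapes are illustrative, not the actual CLI internals): compare the categories the new system assigns against those recorded by the former system and report precision/recall over the corpus.

```python
# Hypothetical sketch of the evaluation behind the CLI above: compare the
# categories our system assigns with those from the former system.
# evaluate/categorise are illustrative stand-ins, not the real code.

def evaluate(expected: dict, categorise) -> tuple[float, float]:
    """expected maps document id -> set of categories from the former system;
    categorise maps a document id -> set of newly assigned categories."""
    tp = fp = fn = 0
    for doc_id, old_cats in expected.items():
        new_cats = categorise(doc_id)
        tp += len(new_cats & old_cats)   # categories both systems agree on
        fp += len(new_cats - old_cats)   # newly assigned, absent before
        fn += len(old_cats - new_cats)   # lost compared with the former system
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Toy run: one document matches exactly, one gains a spurious category.
expected = {"C123": {"air force"}, "C456": {"poverty", "health"}}
new = {"C123": {"air force"}, "C456": {"poverty", "health", "navy"}}
precision, recall = evaluate(expected, lambda d: new[d])
print(precision, recall)  # 0.75 1.0
```

Run repeatedly with different flags (synonyms on/off, case folding, …), this doubles as the regression and benchmarking tool mentioned on the next slide.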
10. A.1. Get it right
Findings
1. Automating the evaluation:
o saved me a lot of time
o works as a regression tool
o works as a benchmarking tool
2. Using a training-set-based system was not satisfactory
3. Needed to ignore case and punctuation in most cases
11. A.2. Get it fast
How do we apply our 140 categories to 22 million records quickly?
How fast does our system need to be?
•Former system: 10+ days
o clunky
o have to wait months to do it again
o what if categorisation goes wrong? Start again for 10 days?
•Target: ~1 day
> 1 document categorised in 4 ms
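The 4 ms figure follows directly from the target: a back-of-the-envelope check, dividing one day by 22 million records.

```python
# Back-of-the-envelope check of the 4 ms target: to categorise 22 million
# records within roughly one day, each document gets ~4 ms of wall-clock time.
DOCS = 22_000_000
SECONDS_PER_DAY = 24 * 60 * 60             # 86,400 s
budget_ms = SECONDS_PER_DAY / DOCS * 1000
print(f"{budget_ms:.1f} ms per document")  # ~3.9 ms, rounded up to the 4 ms target
```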
12. A.2.a Evolution of the algorithm
Let's categorise 1 document at a time.

Solution | Time to categorise everything
•Run every query against every document, one after another, on the file index | a few years
•Run queries in parallel | fewer years
•Run inverted queries | about 10 days
•Run queries against a memory index | ?
•Run queries in memory to find candidates and run the candidates against the file index | about 10 days (60 ms/doc)
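The last row is the winning shape of the algorithm. A toy sketch of the idea, not the actual Lucene code: a cheap in-memory pass over the single document screens all category queries down to a few candidates, and only the candidates go through the expensive, precise check (in the real system, a query against the file index). The phrase lists and helper names here are invented for illustration.

```python
# Illustrative sketch of the candidate-filtering approach (not the real code):
# a cheap in-memory pass screens all category queries to a short candidate
# list; only candidates get the expensive, precise check.

CATEGORY_PHRASES = {                      # toy stand-ins for category queries
    "air force": ["air force", "air ministry", "air board"],
    "navy": ["royal navy", "admiralty"],
}

def screen_in_memory(text: str) -> list[str]:
    """Cheap pass: keep a category if all words of any of its phrases occur."""
    words = set(text.lower().split())
    return [cat for cat, phrases in CATEGORY_PHRASES.items()
            if any(set(p.split()) <= words for p in phrases)]

def confirm(text: str, category: str) -> bool:
    """Expensive pass (stands in for the file-index query): exact phrase match."""
    return any(p in text.lower() for p in CATEGORY_PHRASES[category])

doc = "Records of the Air Ministry and the Air Historical Branch"
candidates = screen_in_memory(doc)        # only these reach the precise check
assigned = [c for c in candidates if confirm(doc, c)]
print(assigned)  # ['air force']
```

With 140 category queries but typically only a handful of candidates per document, the expensive pass runs over a tiny fraction of the query set.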
13. A.2.b Fine tuning
Use the right directory implementation for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster search queries

Use a filter instead of a query to search on only 1 document, and use the low-level API carefully

Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency

7% of the time was spent creating Query objects for every document: instead, create them once and keep them in memory
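That last fix is plain memoisation. A language-neutral sketch of the idea (the `parse_query` callable is a hypothetical stand-in for the real query construction):

```python
# Sketch of the 7% fix: instead of rebuilding the (expensive) query objects
# for every document, parse each category query once and reuse it.
# parse_query is a hypothetical stand-in for the real query construction.

class QueryCache:
    def __init__(self, parse_query):
        self._parse = parse_query
        self._cache = {}
        self.parses = 0                    # how many real parses happened

    def get(self, category: str):
        if category not in self._cache:    # built at most once per category
            self._cache[category] = self._parse(category)
            self.parses += 1
        return self._cache[category]

cache = QueryCache(lambda c: f"<query for {c}>")
for _doc in range(1000):                   # 1000 documents, 2 categories
    cache.get("air force")
    cache.get("navy")
print(cache.parses)  # 2 parses instead of 2000
```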
14. A.2.c Scale out
Requires a suitable architecture:
micro-services-like rather than a monolithic application
15. A.2.c Scale out
Back to the solution…
GUI for taxonomists (+ backend for the GUI)
•available at all times
•runs search queries
•updates categories
Application to categorise everything at once
•runs once in a while
•needs a large number of instances to do the job as fast as possible
•categorises everything
Application to categorise documents every day
•runs every night
•receives categorisation requests from another system
16. A.2.c Scale out
Requires a suitable architecture:
micro-services-like rather than a monolithic application
17. A.2.c Scale out
On the currently available platform:
•2 × 24-core CPUs
•40 GB RAM
•2 × 6 categorisation processes
Categorise 22m documents in 1d 8h
= 5 ms to categorise 1 doc
[Chart] Time to categorise, starting from the in-memory candidate solution (about 10 days, 60 ms/doc) as processes are added: progress is linear.
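If scaling really is linear, wall-clock time is just total work divided by the number of categorisation processes. Reproducing the slide's figures under that assumption:

```python
# Linear-scaling check: total work divided by the number of categorisation
# processes, using the 60 ms/doc single-process cost from slide 12.
DOCS = 22_000_000
MS_PER_DOC_SINGLE = 60

def wall_clock_hours(processes: int) -> float:
    return DOCS * MS_PER_DOC_SINGLE / processes / 1000 / 3600

print(f"{wall_clock_hours(12):.0f} h")  # 12 processes -> ~31 h, close to the 1d 8h measured
```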
18. A.2.c Scale out
Let's imagine that we use cloud services, and suppose we already pay for something equivalent to 4 × this on Microsoft Azure:

INSTANCE | CORES | RAM | DISK SIZES | PRICE
D3 | 4 | 14 GB | 200 GB | £0.4179/hr

How much more does it cost to use twice that number of servers to be twice as fast (ideally)?
NOTHING (* if you shut down your servers once the process has ended)
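The cost argument in numbers: doubling the instances halves the runtime, so the total bill is unchanged, provided the extra servers are shut down when the job finishes. The D3 price is the one from the slide's pricing table; the 32 h runtime is taken from the previous slide.

```python
# Cloud cost arithmetic: instances x hours x hourly price.
# Doubling the servers while halving the runtime leaves the bill unchanged.
PRICE_PER_HOUR = 0.4179                   # GBP per D3 instance (slide's table)

def job_cost(instances: int, hours: int) -> float:
    return instances * hours * PRICE_PER_HOUR

baseline = job_cost(4, 32)                # 4 instances for the full run
doubled = job_cost(8, 16)                 # twice the servers, half the time
print(baseline == doubled)  # True
```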
19. Plan
This presentation is about how we built this categorisation system
A. Using category queries
1. Get it right
2. Get it fast
a) Evolution of the algorithm
b) Fine tuning
c) Scale out
B. Attempt using machine learning
o Using a training-set-based algorithm
20. B. Using machine learning
Research on a training-set-based solution for 2 months:
biggest failure, best learning
1. Take a data set of known (already classified) documents
2. Split it into a training set and a test set
o train the system with the training set
o evaluate it using the test set
o iterate until satisfactory
3. Move it to production
o classify new documents using the trained system
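The first two steps above can be sketched in a few lines; this uses a seeded shuffle and toy labelled documents so it stays self-contained (the real attempt used Lucene's classification tooling, not this code).

```python
# Minimal sketch of the train/test split from the workflow above, with toy
# labelled documents. Illustrative only; the real system used Lucene.
import random

def split(dataset, test_ratio=0.2, seed=42):
    """Shuffle the labelled documents and split into training and test sets."""
    docs = list(dataset)
    random.Random(seed).shuffle(docs)     # seeded, so the split is repeatable
    cut = int(len(docs) * (1 - test_ratio))
    return docs[:cut], docs[cut:]

labelled = [(f"doc {i}", "air force" if i % 2 else "navy") for i in range(10)]
train, test = split(labelled)
print(len(train), len(test))  # 8 2
```

Steps 2a to 2c then loop: fit on `train`, score on `test`, adjust, repeat until the score is satisfactory.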
22. B. Using machine learning
Why it did not work
1. Using category queries to create the training set
o highly dependent on the validity/accuracy of the category queries
2. The nature of our categories
o far too many (136)
o categories too vague or too similar (“Poverty”) do not suit such a system
3. Not the right tool? We used the built-in tool of Lucene (a search engine)
4. The nature of the data?
23. B. Using machine learning
Why we should get into it
•Capabilities are impressive (examples)
•Enabled by cloud computing (the computing power needed is all available)
•Machine Learning as a Service
> You can play with it for free (*) and start prototyping