Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Dictionary Based Annotation at
scale with Spark, SolrTextTagger
and OpenNLP
Sujit Pal, Elsevier Labs
Introduction
• About Me
– Work at Elsevier Labs.
– Interested in Search, NLP and Distributed Processing.
– URL: labs.elsev...
Agenda
• Overview and Background
• Features and API
• Scaling out
• Q&A
Overview/Background
Problem Definition
• What is the problem?
– Annotate millions of documents from different corpora.
• 14M docs from Science...
Annotation Pipeline
Dictionary Based NE Annotator (SoDA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from ext...
SoDA Architecture
How does it work (exact/case matching)?
• Uses Aho-Corasick algorithm – treats the dictionary as a FST and streams text ag...
How does it work (fuzzy matching)?
• Pre-normalizes each dictionary entry into various forms
– Original – “Astrocytoma, Su...
Features and API
Feature Overview
• Provides JSON over HTTP interface
– Compose request as JSON document
– HTTP POST document to JSON endpo...
Services
• Status
– index.json – returns a JSON (suitable for health check monitoring)
• Single Lexicon Services
– annot.j...
Annotation Service I/O
Example annotation request
{
“lexicon” : “countries”,
“text” : “Institute of Clean Coal Technology,...
Calling Annotation Service
• Client originally written in Python, using built-in json and requests libraries.
• For Scala ...
Annotation Service – Python Client
Annotation Service – Scala Client
Annotation Service - Outputs
• Each Annotation result provides:
– Entity ID (not shown)
– Begin position in text
– End pos...
Loading Dictionaries
• Dictionary entries represented by:
– Lexicon Name
– Entry ID (unique across lexicons)
– List of pos...
Loading Dictionaries – Scala Client
Scaling Out
SoDA Performance – Expected
• Test: annotate 14M docs in “reasonable time”.
– Approx. 3s/doc with SoDA+Solr on ec2 r3.larg...
Where is the time being spent?
• Majority of time spent in Solr.
• Some time spent in SoDA (decreases slower than Solr as ...
Optimization #1: Combine Paragraphs
• Performance measured using 10K random
articles.
• Time to annotate 1 article: Mean 2...
Optimization #2: Tune Solr GC
• OOB Solr would GC very frequently, slowing
down Spark and causing timeouts.
• Current Inde...
Optimization #3: Larger Spark Cluster
• Running on cluster Master + 4 Workers.
• Each worker has 1 Executor with 4 Cores.
...
Optimization #4: Solr Scaleout
• Upgrade to r3.xlarge (30.5GB RAM, 80GB
SSD, 4vCPU)
– Throughput 7.9 docs/s
• Upgrade to 2...
Performance – Did we meet expectations?
• At 26 docs/sec and 14M documents, it will take our current cluster little over 6...
Future Work
• More Lexicons
• Investigate Lexicon-Centric scale out.
– Allows more lexicons.
– Not limited to single index...
Q&A
Thank you for listening!
• Questions?
• SoDA available on GitHub
– https://github.com/elsevierlabs-os/soda
• Contact me
– ...
Prochain SlideShare
Chargement dans…5
×

Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP

1 813 vues

Publié le

Presented at Spark Summit EU 2015 at Amsterdam. Details the SoDA project, a micro-service accessible via Spark for naive annotation of large volumes of text against very large lexicons.

Publié dans : Données & analyses
  • Soyez le premier à commenter

Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP

  1. 1. Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP Sujit Pal, Elsevier Labs
  2. 2. Introduction • About Me – Work at Elsevier Labs. – Interested in Search, NLP and Distributed Processing. – URL: labs.elsevier.com – Email: sujit.pal@elsevier.com – Twitter: @palsujit • About Elsevier – World’s largest publisher of STM Books and Journals. – Uses Data to inform and enable consumers of STM info. – And like everybody else, we are hiring!
  3. 3. Agenda • Overview and Background • Features and API • Scaling out • Q&A
  4. 4. Overview/Background
  5. 5. Problem Definition • What is the problem? – Annotate millions of documents from different corpora. • 14M docs from Science Direct alone. • More from other corpora, dependency parsing, etc. – Critical step for Machine Reading and Knowledge Graph applications. • Why is this such a big deal? – Takes advantage of existing linked data. – No model training for multiple complex STM domains. – However, simple until done at scale.
  6. 6. Annotation Pipeline
  7. 7. Dictionary Based NE Annotator (SoDA) • Part of Document Annotation Pipeline. • Annotates text with Named Entities from external Dictionaries. • Built with Open Source Components – Apache Solr – Highly reliable, scalable and fault-tolerant search index. – SolrTextTagger – Solr component for text tagging, uses Lucene FST technology. – Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text. – Apache Spark – Lightning fast, large scale data processing. • Uses ideas from other Open Source libraries – FuzzyWuzzy – Fuzzy String Matching like a boss. • Contributed back to Open Source – https://github.com/elsevierlabs-os/soda
  8. 8. SoDA Architecture
  9. 9. How does it work (exact/case matching)? • Uses Aho-Corasick algorithm – treats the dictionary as a FST and streams text against it. Matches all patterns simultaneously. Diagram shows FST for vocabulary {“his”, “he”, “her”, “hers”, “she”}. • Michael McCandless implemented FSTs in Lucene (blog post). • David Smiley built SolrTextTagger to use Lucene FSTs. • SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
  10. 10. How does it work (fuzzy matching)? • Pre-normalizes each dictionary entry into various forms – Original – “Astrocytoma, Subependymal Giant Cell” – Lowercased – “astrocytoma, subependymal giant cell” – Punctuation – “astrocytoma subependymal giant cell” – Sorted – “astrocytoma cell giant subependymal” – Stemmed – “astrocytoma cell giant subependym” • Uses OpenNLP to parse input text into phrases, normalizes each phrase into the desired normalization level and matches against corresponding field. • Caller specifies normalization level.
  11. 11. Features and API
  12. 12. Feature Overview • Provides JSON over HTTP interface – Compose request as JSON document – HTTP POST document to JSON endpoint URL (HTTP GET for URL only requests). – Receive response as JSON document. • Language-Agnostic and Cross-Platform. • API can be used from standalone clients, Spark jobs and Databricks notebooks. • Examples in Scala and Python
  13. 13. Services • Status – index.json – returns a JSON (suitable for health check monitoring) • Single Lexicon Services – annot.json – annotates a block of text in streaming manner. Supports different levels of matching (strict to permissive). – matchphrase.json – annotates short phrases. Supports same matching levels as annot.json. • Multi-Lexicon Services – dicts.json – lists all lexicons available. – coverage.json – returns number of annotations by lexicon found for text across all available lexicons. • Indexing Services – delete.json – deletes entire lexicon from index. – add.json – adds an entry to the specified lexicon.
  14. 14. Annotation Service I/O Example annotation request { “lexicon” : “countries”, “text” : “Institute of Clean Coal Technology, East China University”, “matching” : “exact” } Example annotated response [ { “id” : “http://www.geonames.org/CHN”, “lexicon” : “countries”, “begin” : 41, “end” : 46, “coveredText” : “China”, “confidence” : 1.0 } ]
  15. 15. Calling Annotation Service • Client originally written in Python, using built-in json and requests libraries. • For Scala client, SoDA JAR provides classes to mimic json and requests functionality in Scala. • Input to both our (somewhat contrived) examples are: (pii: String, affStr: String) tuples as shown. • Match against a country lexicon to find country names.
  16. 16. Annotation Service – Python Client
  17. 17. Annotation Service – Scala Client
  18. 18. Annotation Service - Outputs • Each Annotation result provides: – Entity ID (not shown) – Begin position in text – End position in text – Matched Text – Confidence (not shown) • Zero or more Annotations possible per input text.
  19. 19. Loading Dictionaries • Dictionary entries represented by: – Lexicon Name – Entry ID (unique across lexicons) – List of possible synonym terms • JSON Request to add an entry for MeSH dictionary. { “id”: http://id.nlm.nih.gov/mesh/2015/M0021699, “lexicon”: “mesh”, “names”: [“Baby Tooth”, “Dentitions, Primary”, “Milk Tooth”, ...], “commit”: false } • Preferable to commit periodically and after batch.
  20. 20. Loading Dictionaries – Scala Client
  21. 21. Scaling Out
  22. 22. SoDA Performance – Expected • Test: annotate 14M docs in “reasonable time”. – Approx. 3s/doc with SoDA+Solr on ec2 r3.large box (15.5GB RAM, 32GB SSD, 2vCPU). – Total estimated time: 16.2 months! • Questions – Can we make the process faster? – Can we scale out the process?
  23. 23. Where is the time being spent? • Majority of time spent in Solr. • Some time spent in SoDA (decreases slower than Solr as transactions get shorter). • Almost no additional time spent in Spark.
  24. 24. Optimization #1: Combine Paragraphs • Performance measured using 10K random articles. • Time to annotate 1 article: Mean 2.9s, Median 2.1s. • Annotation done per paragraph, 40 paragraphs/article on average. • Reduce HTTP network + parsing overhead by sending full document. • Time to annotate 1 article: Mean 1.4s, Median 0.3s. • 2x - 7x improvement.
  25. 25. Optimization #2: Tune Solr GC • OOB Solr would GC very frequently, slowing down Spark and causing timeouts. • Current Index Size: 2.1 GB • Need to size box so approximately 75% RAM given to OS and remaining 25% allocated to Solr (Uwe Schindler's Blog Post). • Heap size should be 3-4x index size (Internal Guideline). • Current Solr Heap Size = 8 GB • RAM is 30.5 GB • CMS (Concurrent Mark-Sweep) Garbage Collection.
  26. 26. Optimization #3: Larger Spark Cluster • Running on cluster Master + 4 Workers. • Each worker has 1 Executor with 4 Cores. • Number of simultaneous Solr clients = 16 (4 workers * 1 executor * 4 cores) – measured with lsof –p in a loop on Solr server. • Throughput increases with number of partitions till about 2x the number of worker cores. • Best throughput 5 docs/sec with #- partitions=30 for 16 cores.
  27. 27. Optimization #4: Solr Scaleout • Upgrade to r3.xlarge (30.5GB RAM, 80GB SSD, 4vCPU) – Throughput 7.9 docs/s • Upgrade to 2x r3.2xlarge (61GB RAM, 160GB SSD, 8vCPU) with c3.large LB (3.75GB RAM, 32GB Disk, 2vCPU) running HAProxy. #-workers #- requests/serv er Throughput (docs/sec) 4 8 8.62 8 16 17.334 12 24 20.64 16 32 26.845
  28. 28. Performance – Did we meet expectations? • At 26 docs/sec and 14M documents, it will take our current cluster little over 6 days to annotate against our largest dictionary (8M entries). • Throughput scales linearly @ 1.5 docs/sec per additional worker, as long as Solr servers have enough capacity to serve requests. • Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests. • Number of simultaneous requests approximately equal to number of worker cores. • Example: annotate 14M documents in 3 days. – Throughput required: 14M / (3 * 86400) = 54 docs/s – Number of workers: 54 / 1.5 = 36 workers – Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144 – Number of Solr servers: 144 / 32 = 4.5 = 5 servers
  29. 29. Future Work • More Lexicons • Investigate Lexicon-Centric scale out. – Allows more lexicons. – Not limited to single index. • Move to Lucene, eliminate network overhead. – Asynchronous model – Use Kafka topic with multiple partitions – Lucene based tagging consumers – Write output to S3.
  30. 30. Q&A
  31. 31. Thank you for listening! • Questions? • SoDA available on GitHub – https://github.com/elsevierlabs-os/soda • Contact me – sujit.pal@elsevier.com

×