Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19

Elsevier Labs Online Lecture
November 2020

Hi!
Speakers
Aaron Richter
Senior Data Scientist @ Saturn Cloud
aaron@saturncloud.io
@rikturr
Sujit Pal
Technology Research Director @ Elsevier
sujit.pal@elsevier.com
@palsujit

Check out the blog post!
🔗 How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud

Saturn Cloud
Data science with Python

Dask
● Parallel computing for Python people
● Anaconda, ~2015
● Built in Python; Python API
● Mature, scientific computing communities
● Low-level task library
● High-level libraries for DataFrames, arrays, ML
● Integrates with PyData ecosystem
● Runs on laptop, scales to clusters
https://dask.org/

Dask
What does it do?
https://docs.dask.org/en/latest/user-interfaces.html
● Parallel machine learning (scikit-learn)
● Parallel dataframes (pandas)
● Parallel arrays (numpy)
● Parallel anything else

What does it do?
Arrays and Dataframes
https://docs.dask.org/en/latest/array.html https://docs.dask.org/en/latest/dataframe.html

What does it do?
Anything else!
https://docs.dask.org/en/latest/delayed.html
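A minimal sketch of the "anything else" case using `dask.delayed`: ordinary Python functions are wrapped so that calling them builds a task graph instead of running immediately. The `inc` and `add` functions are toy examples chosen for this sketch.

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

a = inc(1)
b = inc(2)          # a and b are independent, so they can run in parallel
c = add(a, b)       # nothing has executed yet; c is a lazy task graph
result = c.compute(scheduler="threads")
```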

What does it do?
Anything else!
https://dask.org/

Getting up to Speed with Dask
https://youtu.be/S_ncqocDcBA

Spark vs. Dask
Spark
● Written in Scala with Python API
● All-in-one tool
○ Requires re-write to migrate from PyData code
● Programming model not suited for complex operations (multi-dim arrays, machine learning)
Dask
● 100% Python
● Built to extend and interact with PyData ecosystem
● High-level interfaces for DataFrames, (multi-dim) Arrays, and ML
● Native integration with RAPIDS for GPU-acceleration

How can I run Dask clusters?
● Manual setup
● SSH
● HPC: MPI, SLURM, SGE, TORQUE, LSF, DRMAA, PBS
● Kubernetes (Docker, Helm)
● Hadoop/YARN
● Cloud provider: AWS or Azure
🔗 https://docs.dask.org/en/latest/setup.html

Saturn Cloud
Bringing together the fastest hardware + OSS
● Fast setup
● Enterprise secure
● Pythonic parallelism
● Rapidly scale PyData
● Multi-GPU computing
● The future of HPC
● Workflow orchestration
● Flow insight and mgmt

Saturn Cloud to the rescue!
Taking the DevOps out of Data Science
● Saturn manages all infrastructure
● Hosted: Run within our cloud
● Enterprise: Run within your AWS account

Saturn Cloud
Core features
● Images
● Jupyter server
● Dask Cluster
● Deployments

Extracting Entities from CORD-19

Genesis
● Based on one of the many COVID-19 initiatives (COVID-KG).
● Original intent: extract entities from CORD-19 dataset for relationship mining.
● CORD-19 dataset open sourced by AllenAI.
● SciSpaCy provided language models trained on biomedical text, and...
● Pre-trained Named Entity Recognition (and linking) models.
● Dask-based distributed computing platform.
● Opportunity to evaluate.

Goals
● Create standoff entity annotations for CORD-19, modeled after CAT (Content
Analytics Toolbench) from Labs.
● Automated entity recognition using pre-trained SciSpaCy models, where each
model recognizes a different subset of entity classes, e.g. DNA, Gene,
Protein, Chemical, Organism, Disease, etc.
● Output is structured as Parquet files, consumable via Dask or Spark.
● Share output dataset with community.

CORD-19 Dataset
● Started in mid-March 2020 with ~40k articles, released weekly.
● By Sept/Oct 2020, ~200k articles, released daily and growing every day.
● Each release contains:
○ Metadata file (CSV)
○ Set of articles (JSON)

SciSpaCy NER(L) models
● Medium English LM for sentence splitting.
● 4 NER models.
● 5 NERL models using LM’s candidate entity generator and trained entity linking models.

Full Pipeline
● Read metadata.csv
● Parse each JSON file into
paragraphs.
● Split paragraphs into sentences.
● Extract entities from sentences using a NER(L) model.
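The first three pipeline steps can be sketched in plain Python. This is an illustration under stated assumptions, not the notebooks' actual code: the JSON field names (`paper_id`, `abstract`, `body_text`, `text`) follow the published CORD-19 article layout as I understand it, the sample article is made up, and a naive regular expression stands in for the SciSpaCy sentence splitter. The entity extraction step is left out because it needs a trained model.

```python
import json
import re

def parse_paragraphs(raw_json):
    """Turn one article (a JSON string) into (paper_id, para_idx, text) rows."""
    doc = json.loads(raw_json)
    blocks = doc.get("abstract", []) + doc.get("body_text", [])
    return [(doc["paper_id"], i, block["text"]) for i, block in enumerate(blocks)]

# Naive stand-in for a real sentence splitter: break after '.', '!' or '?'
# when followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(paragraph_row):
    """Turn one paragraph row into (paper_id, para_idx, sent_idx, sentence) rows."""
    paper_id, para_idx, text = paragraph_row
    parts = [s for s in _SENT_END.split(text) if s]
    return [(paper_id, para_idx, j, s) for j, s in enumerate(parts)]

# Made-up article used only to exercise the functions.
article = json.dumps({
    "paper_id": "demo-0001",
    "abstract": [{"text": "Coronaviruses infect many hosts. Some cause severe disease."}],
    "body_text": [{"text": "We screened 40 samples."}],
})
paragraphs = parse_paragraphs(article)
sentences = [row for p in paragraphs for row in split_sentences(p)]
```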

Embarrassing Parallelism
● Pipeline is embarrassingly parallel.
● Parsing files into paragraphs has no dependencies (i.e., perfectly parallel).
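Because each file is parsed independently, the step maps directly onto a Dask Bag. The sketch below uses made-up in-memory articles and a hypothetical `parse_paragraphs` helper; the real pipeline would read JSON files from storage instead.

```python
import json
import dask.bag as db

def parse_paragraphs(raw_json):
    """Hypothetical parser: one article in, a list of paragraph rows out."""
    doc = json.loads(raw_json)
    return [(doc["paper_id"], i, block["text"])
            for i, block in enumerate(doc.get("body_text", []))]

# Six made-up articles with two paragraphs each.
raw_articles = [
    json.dumps({
        "paper_id": f"demo-{n}",
        "body_text": [{"text": f"Paragraph {k} of article {n}."} for k in range(2)],
    })
    for n in range(6)
]

bag = db.from_sequence(raw_articles, npartitions=3)
paragraphs = bag.map(parse_paragraphs).flatten()   # one row per paragraph
rows = paragraphs.compute(scheduler="threads")     # no task depends on another
```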

● Splitting paragraphs into sentences needs the sentence splitter model, assigned per partition.
● Load only the models that you need.
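One way to read "assigned per partition" is to load the model at the top of a function passed to `map_partitions`, so the load cost is paid once per partition rather than once per paragraph. In this sketch `load_sentence_splitter` is a hypothetical stand-in that returns a regex-based splitter; a real loader would return the SciSpaCy language model.

```python
import re
import dask.bag as db

def load_sentence_splitter():
    """Hypothetical stand-in for an expensive model load."""
    pattern = re.compile(r"(?<=[.!?])\s+")
    return lambda text: [s for s in pattern.split(text) if s]

def split_partition(partition):
    # Runs once per partition: load the model here, reuse it for every row.
    splitter = load_sentence_splitter()
    rows = []
    for para_id, text in partition:
        for sent_idx, sentence in enumerate(splitter(text)):
            rows.append((para_id, sent_idx, sentence))
    return rows

paragraphs = [
    ("p0", "First sentence. Second sentence."),
    ("p1", "Only one sentence here."),
    ("p2", "Is this a question? Yes. It is."),
    ("p3", "Final paragraph."),
]
bag = db.from_sequence(paragraphs, npartitions=2)
sentences = bag.map_partitions(split_partition).compute(scheduler="synchronous")
```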

● Extracting entities from sentences needs an NER model, assigned lazily to the worker per partition.
● Use nlp.pipe and batching to exploit multithreading.
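The lazy-loading idea can be sketched in plain Python: the model is loaded the first time a partition asks for it and is cached for later partitions handled by the same process. Everything here is an assumption made for illustration: `get_model` returns a toy dictionary tagger rather than a SciSpaCy pipeline, and the model name is invented. On a cluster the loader would normally live in an importable module so that each worker process keeps its own cache.

```python
from functools import lru_cache
from itertools import islice

LOAD_LOG = []  # records each simulated model load

@lru_cache(maxsize=None)
def get_model(name):
    """Load a model on first use and cache it for the life of the process."""
    LOAD_LOG.append(name)
    vocab = {"virus": "ORGANISM", "fever": "DISEASE"}  # toy stand-in for a model

    def tag(sentence):
        words = sentence.lower().rstrip(".").split()
        return [(w, vocab[w]) for w in words if w in vocab]

    return tag

def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def extract_partition(sentences, model_name="demo-ner", batch_size=2):
    model = get_model(model_name)  # loads only on the first call in this process
    entities = []
    # With spaCy, nlp.pipe(sentences, batch_size=batch_size) does this batching itself.
    for batch in batches(sentences, batch_size):
        for sentence in batch:
            entities.extend(model(sentence))
    return entities

part_a = extract_partition(["The virus spreads.", "Patients had fever."])
part_b = extract_partition(["No entities here.", "Fever and virus."])
```

The second call reuses the cached model, so `LOAD_LOG` holds a single entry after both partitions are processed.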

● Extracting and linking entities from sentences needs the language model, entity linker, etc., assigned eagerly per worker after cluster creation.
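A possible way to do eager per-worker assignment with `dask.distributed` is `Client.run`, which executes a function once on every worker. When that function has a parameter named `dask_worker`, the worker object is passed in and can hold the loaded model. This is a sketch, not necessarily how the notebooks do it: `build_linker`, its toy knowledge base, and the `DEMO:` identifiers are invented, and a small in-process cluster stands in for a Saturn Cloud cluster.

```python
from dask.distributed import Client, get_worker

def build_linker():
    """Hypothetical stand-in for loading a language model plus entity linker."""
    kb = {"covid-19": "DEMO:0001", "fever": "DEMO:0002"}
    return lambda mention: kb.get(mention.lower())

def load_on_worker(dask_worker):
    # Client.run fills in `dask_worker` with the Worker object itself;
    # the loaded linker is kept on it for later tasks.
    dask_worker.linker = build_linker()
    return True

def link(mention):
    linker = get_worker().linker  # already loaded, no per-task load
    return mention, linker(mention)

# Small in-process cluster used only for this sketch.
client = Client(processes=False, n_workers=2, threads_per_worker=1,
                dashboard_address=None)
loaded = client.run(load_on_worker)  # once per worker, right after cluster creation
linked = client.gather(client.map(link, ["COVID-19", "fever", "cough"]))
client.close()
```

Workers that join after `client.run` has been called would not have the model, so a long-lived cluster that scales up may need a worker plugin instead.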

Parquet Dask / Spark interop
● Output of paragraph, sentence, and entities are in Parquet format.
● Things to keep in mind for Spark interoperability when writing from Dask.
○ Column data types must be declared explicitly on the Dask end.
○ Column names should be specified when saving (“hidden” columns visible in Spark).
○ Explicit re-partitioning may be necessary when saving on Dask.

Incremental Pipeline
● Extracted Entities + new metadata
and JSON files.
● Compute diffs (additions +
deletions)
● Parse added articles to paragraphs,
paragraphs to sentences, and
sentences to entities.
● Remove paragraphs, sentences,
and entities for deleted articles.
● Merge diff and original.
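The diff-and-merge logic can be sketched with plain sets and lists. The article IDs, the row layout, and the `placeholder` entity are invented for illustration; in the real pipeline the new rows would come from running the full pipeline on the added articles only.

```python
def compute_diff(previous_ids, current_ids):
    """Return (added, deleted) article IDs between two releases."""
    previous, current = set(previous_ids), set(current_ids)
    return current - previous, previous - current

def merge(existing_rows, new_rows, deleted_ids):
    """Drop rows for deleted articles, then append rows for added ones."""
    kept = [row for row in existing_rows if row["paper_id"] not in deleted_ids]
    return kept + new_rows

previous_release = ["p1", "p2", "p3"]
current_release = ["p2", "p3", "p4"]
added, deleted = compute_diff(previous_release, current_release)

existing_entities = [
    {"paper_id": "p1", "entity": "fever"},
    {"paper_id": "p2", "entity": "virus"},
]
# Stand-in for processing only the added articles.
new_entities = [{"paper_id": pid, "entity": "placeholder"} for pid in sorted(added)]
merged = merge(existing_entities, new_entities, deleted)
```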

Deliverables
● Code
○ Set of Jupyter notebooks deployed on Saturn Cloud -- sujitpal/saturn-scispacy
● Data
○ Dataframes in Parquet format (approx 70 GB, 35 for Sep 2020, 35 for Oct 2020).
○ Publicly available on s3://els-saturn-scispacy/cord19-scispacy-entities (requester pays).
○ Available within ELS on Databricks under “/mnt/els/covid-19” -- “/mnt/els/covid-19/saturn-scispacy/annotations-pq-full-20200928”

Output formats
1 paragraph dataframe (3.4M paragraphs), 1 sentence dataframe (17.1M sentences),
and 9 entity dataframes (total 805.4M entities).

Utility
● Envisioned usage similar to that for CAT annotations using AnnotationQuery,
but for biomedical entities on biomedical data.
● Query dataset using DataFrame API to create interesting micro-datasets,
potentially spanning different NER(L) models. Example:
○ Human Phenotypes Annotations from HPO co-occurring in same sentence with Disease
Annotations from UMLS or BC5CDR.
○ Gene annotations co-occurring in same sentence with Cancer annotations (both from
BioNLP).
● Annotations can be features for Topic Modeling or Categorization.
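A same-sentence co-occurrence query reduces to an inner join on the paragraph and sentence keys. The sketch below uses pandas with invented column names (`pid`, `sid`) and made-up rows; the published dataset's schema may differ, and Dask's `DataFrame.merge` accepts the same `on` and `how` arguments for the full-size tables.

```python
import pandas as pd

# Made-up entity tables from two different models.
disease = pd.DataFrame({
    "pid": ["a", "a", "b"],
    "sid": [0, 1, 0],
    "disease": ["fever", "pneumonia", "cough"],
})
phenotype = pd.DataFrame({
    "pid": ["a", "b", "b"],
    "sid": [0, 0, 1],
    "phenotype": ["hyperthermia", "dyspnea", "fatigue"],
})

# Keep only the (paragraph, sentence) pairs that appear in both tables.
cooccur = disease.merge(phenotype, on=["pid", "sid"], how="inner")
pairs = set(zip(cooccur["disease"], cooccur["phenotype"]))
```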

Additional Resources
● Labs Blog post -- How Elsevier Accelerated COVID-19 research using Dask
on Saturn Cloud
● Confluence page -- Dataset of Entity Annotations for CORD-19
● Book -- Data Science with Python and Dask by Jesse C. Daniel (also
available on Percipio)
● Dask online documentation
● Saturn Cloud documentation
● Cloud deployments using Dask
● Query interface example -- AnnotationQuery.
