SlideShare une entreprise Scribd logo
1  sur  39
Saturn Cloud
Accelerating NLP with Dask on Saturn
Cloud
Elsevier Labs Online Lecture
November 2020
1
Hi!
Speakers
Aaron Richter
Senior Data Scientist @ Saturn Cloud
aaron@saturncloud.io
@rikturr
Sujit Pal
Technology Research Director @ Elsevier
sujit.pal@elsevier.com
@palsujit
2
Check out the
blog post!
🔗 How Elsevier Accelerated COVID-19 research using Dask on Saturn
Cloud
3
Saturn Cloud
Data science with Python
4
Data science with Python
5
Saturn Cloud
What is Dask?
6
Dask
● Parallel computing for Python people
● Anaconda, ~2015
● Built in Python; Python API
● Mature, scientific computing communities
● Low-level task library
● High-level libraries for DataFrames, arrays, ML
● Integrates with PyData ecosystem
● Runs on laptop, scales to clusters
https://dask.org/
7
Dask
What does it do?
https://docs.dask.org/en/latest/user-interfaces.html
● Parallel machine learning (scikit)
● Parallel dataframes (pandas)
● Parallel arrays (numpy)
● Parallel anything else
8
What does it do?
Arrays and Dataframes
https://docs.dask.org/en/latest/array.html https://docs.dask.org/en/latest/dataframe.html
9
What does it do?
Anything else!
https://docs.dask.org/en/latest/delayed.html 10
What does it do?
Anything else!
https://dask.org/
11
Getting up to
Speed with Dask
https://youtu.be/S_ncqocDcBA
12
Spark vs. Dask
● Written in Scala with Python API
● All-in-one tool
○ Requires re-write to migrate
from PyData code
● Programming model not suited for
complex operations (multi-dim
arrays, machine learning)
● 100% Python
● Built to extend and interact with
PyData ecosystem
● High-level interfaces for
DataFrames, (multi-dim) Arrays,
and ML
● Native integration with RAPIDS for
GPU-acceleration
14
How can I run Dask clusters?
● Manual setup
● SSH
● HPC: MPI, SLURM, SGE, TORQUE, LSF, DRMAA,
PBS
● Kubernetes (Docker, Helm)
● Hadoop/YARN
● Cloud provider: AWS or Azure
🔗 https://docs.dask.org/en/latest/setup.html
15
Saturn Cloud
16
● Fast setup
● Enterprise secure
● Pythonic parallelism
● Rapidly scale
PyData
● Multi-GPU computing
● The future of HPC
● Workflow orchestration
● Flow insight and mgmt
Bringing together the fastest hardware + OSS
Saturn Cloud
● Saturn manages all infrastructure
● Hosted: Run within our cloud
● Enterprise: Run within your AWS account
Saturn Cloud to the rescue!
Taking the DevOps out of Data Science
18
● Images
● Jupyter server
● Dask Cluster
● Deployments
Saturn Cloud
Core features
19
20
Saturn Cloud
Extracting Entities from
CORD-19
21
Genesis
22
Genesis
23
● Based on one of the many
COVID-19 initiatives (COVID-
KG)
● Original intent: extract entities
from CORD-19 dataset for
relationship mining.
Genesis
24
● Based on one of the many
COVID-19 initiatives (COVID-
KG)
● Original intent: extract entities
from CORD-19 dataset for
relationship mining.● CORD-19 dataset open sourced
by AllenAI.
● SciSpacy provided Language
models trained on Biomedical
text, and...
● Pre-trained Named Entity
Recognition (and linking) models.
Genesis
25
● Based on one of the many
COVID-19 initiatives (COVID-
KG)
● Original intent: extract entities
from CORD-19 dataset for
relationship mining.● CORD-19 dataset open sourced
by AllenAI.
● SciSpacy provided Language
models trained on Biomedical
text, and...
● Pre-trained Named Entity
Recognition (and linking) models.
● Dask based distributed
computing platform
● Opportunity to evaluate.
Goals
● Create standoff entity annotations for CORD-19, modeled after CAT (Content
Analytics Toolbench) from Labs.
● Automated entity recognition using pre-trained SciSpaCy models, where each
model recognizes a different subset of entity classes, e.g. DNA, Gene,
Protein, Chemical, Organism, Disease, etc.
● Output is structured as Parquet files, consumable via Dask or Spark.
● Share output dataset with community.
26
CORD-19 Dataset
● Started mid March 2020 with ~40k articles released weekly.
● By Sept/Oct 2020 ~200k articles released daily, growing everyday.
● Each release contains:
○ Metadata file (CSV)
○ Set of articles (JSON)
27
SciSpaCy NER(L) models
● Medium English LM for sentence
splitting.
● 4 NER models
● 5 NERL models using LM’s
candidate entity generator and
trained entity linking models.
28
Full Pipeline
● Read metadata.csv
● Parse each JSON file into
paragraphs.
● Split paragraphs into sentences.
● Extract entities from sentences
using a NER(L) model..
29
Embarrassing Parallelism
● Pipeline is embarrassingly parallel.
● Parse files to paragraphs has no
dependencies (i.e., perfectly
parallel)
● ...
● ...
30
Embarrassing Parallelism
● Pipeline is embarrassingly parallel
● ...
● Split paragraphs to sentences
needs sentence splitter model
assigned per partition.
● Load only models that you need.
● …
● ...
31
Embarrassing Parallelism
● Pipeline is embarrassingly parallel.
● …
● ...
● Extract entities from sentence
needs NER model, assign lazily to
worker per partition.
● Use nlp.pipe and batching to exploit
multithreading.
● ...
32
Embarrassing Parallelism
● Pipeline is embarrassingly parallel.
● …
● ...
● Extract and link entities from
sentence needs Language
Model, Entity Linker, etc. assign
eagerly per worker after cluster
creation.
33
Parquet Dask / Spark interop
● Output of paragraph, sentence, and entities are in Parquet format.
● Things to keep in mind for Spark interoperability when writing from Dask.
○ Column data types must be declared explicitly on the Dask end.
○ Column names should be specified when saving (“hidden” columns visible in Spark).
○ Explicit re-partitioning may be necessary when saving on Dask.
34
Incremental Pipeline
● Extracted Entities + new metadata
and JSON files.
● Compute diffs (additions +
deletions)
● Parse added articles to paragraphs,
paragraphs to sentences, and
sentences to entities.
● Remove paragraphs, sentences,
and entities for deleted articles.
● Merge diff and original.
35
Deliverables
● Code
○ Set of Jupyter notebooks deployed on Saturn Cloud -- sujitpal/saturn-scispacy
● Data
○ Dataframes in Parquet format (approx 70 GB, 35 for Sep 2020, 35 for Oct 2020).
○ Publicly available on s3://els-saturn-scispacy/cord19-scispacy-entities (requester pays).
○ Available within ELS on Databricks under “/mnt/els/covid-19” -- “/mnt/els/covid-19/saturn-
scispacy/annotations-pq-full-20200928”
36
Output formats
1 paragraph dataframe (3.4M paragraphs), 1 sentence dataframe (17.1M sentences),
and 9 entity dataframes (total 805.4M entities).
37
Utility
● Envisioned usage similar to that for CAT annotations using AnnotationQuery,
but for biomedical entities on biomedical data.
● Query dataset using DataFrame API to create interesting micro-datasets
datasets potentially spanning across different NER(L) models. Example:
○ Human Phenotypes Annotations from HPO co-occurring in same sentence with Disease
Annotations from UMLS or BC5CDR.
○ Gene annotations co-occurring in same sentence with Cancer annotations (both from
BioNLP).
● Annotations can be features for Topic Modeling or Categorization.
38
Additional Resources
● Labs Blog post -- How Elsevier Accelerated COVID-19 research using Dask
on Saturn Cloud
● Confluence page -- Dataset of Entity Annotations for CORD-19
● Book -- Data Science with Python and Dask by Jesse C. Daniel (also
available on Percipio)
● Dask online documentation
● Saturn Cloud documentations
● Cloud deployments using Dask
● Query interface example -- AnnotationQuery.
39

Contenu connexe

Tendances

Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization TechniquesJoud Khattab
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...University of California, San Diego
 
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использованияТимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использованияMinsk Linux User Group
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...NETWAYS
 
Semantic web and Drupal: an introduction
Semantic web and Drupal: an introductionSemantic web and Drupal: an introduction
Semantic web and Drupal: an introductionKristof Van Tomme
 
Using MongoDB for Materials Discovery
Using MongoDB for Materials DiscoveryUsing MongoDB for Materials Discovery
Using MongoDB for Materials DiscoveryDan Gunter
 
Rust in TiKV
Rust in TiKVRust in TiKV
Rust in TiKVPingCAP
 
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...mfrancis
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsBalázs Kégl
 
Strings, C# and Unmanaged Memory
Strings, C# and Unmanaged MemoryStrings, C# and Unmanaged Memory
Strings, C# and Unmanaged MemoryMichael Yarichuk
 
Avogadro 2 and Open Chemistry
Avogadro 2 and Open ChemistryAvogadro 2 and Open Chemistry
Avogadro 2 and Open ChemistryMarcus Hanwell
 
[Webinar] Scientific Computation and Data Visualization with Ruby
[Webinar] Scientific Computation and Data Visualization with Ruby [Webinar] Scientific Computation and Data Visualization with Ruby
[Webinar] Scientific Computation and Data Visualization with Ruby Srijan Technologies
 
MongoDB Scalability Best Practices
MongoDB Scalability Best PracticesMongoDB Scalability Best Practices
MongoDB Scalability Best PracticesJason Terpko
 
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and Pitfalls
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and PitfallsSiriusCon 2017 - Integrating Xtext and Sirius: Strategies and Pitfalls
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and PitfallsObeo
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkRKien Dang
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 

Tendances (20)

Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 
The Materials API
The Materials APIThe Materials API
The Materials API
 
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использованияТимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
 
Semantic web and Drupal: an introduction
Semantic web and Drupal: an introductionSemantic web and Drupal: an introduction
Semantic web and Drupal: an introduction
 
Using MongoDB for Materials Discovery
Using MongoDB for Materials DiscoveryUsing MongoDB for Materials Discovery
Using MongoDB for Materials Discovery
 
Rust in TiKV
Rust in TiKVRust in TiKV
Rust in TiKV
 
GraphQL & Ratpack
GraphQL & RatpackGraphQL & Ratpack
GraphQL & Ratpack
 
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...
OSGi Community Event 2010 - OSGi and Terracotta - replication of states for c...
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physics
 
Strings, C# and Unmanaged Memory
Strings, C# and Unmanaged MemoryStrings, C# and Unmanaged Memory
Strings, C# and Unmanaged Memory
 
Avogadro 2 and Open Chemistry
Avogadro 2 and Open ChemistryAvogadro 2 and Open Chemistry
Avogadro 2 and Open Chemistry
 
[Webinar] Scientific Computation and Data Visualization with Ruby
[Webinar] Scientific Computation and Data Visualization with Ruby [Webinar] Scientific Computation and Data Visualization with Ruby
[Webinar] Scientific Computation and Data Visualization with Ruby
 
MongoDB Scalability Best Practices
MongoDB Scalability Best PracticesMongoDB Scalability Best Practices
MongoDB Scalability Best Practices
 
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and Pitfalls
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and PitfallsSiriusCon 2017 - Integrating Xtext and Sirius: Strategies and Pitfalls
SiriusCon 2017 - Integrating Xtext and Sirius: Strategies and Pitfalls
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 

Similaire à Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19

Boosting big data with apache spark
Boosting big data with apache sparkBoosting big data with apache spark
Boosting big data with apache sparkInfoFarm
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...confluent
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark StoriesRylan Halteman
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?ArangoDB Database
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentationehtshamelahi
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8Timothy Spann
 
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26Brian Rosmaita
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Data Integration Solutions Created By Koneksys
Data Integration Solutions Created By KoneksysData Integration Solutions Created By Koneksys
Data Integration Solutions Created By KoneksysKoneksys
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDBArangoDB Database
 

Similaire à Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 (20)

Boosting big data with apache spark
Boosting big data with apache sparkBoosting big data with apache spark
Boosting big data with apache spark
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark Stories
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Session 2
Session 2Session 2
Session 2
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
 
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26
Cinder Update, OpenInfra Meetup Q3 China, 2020-09-26
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Data Integration Solutions Created By Koneksys
Data Integration Solutions Created By KoneksysData Integration Solutions Created By Koneksys
Data Integration Solutions Created By Koneksys
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
 

Plus de Sujit Pal

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSujit Pal
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Sujit Pal
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question AnsweringSujit Pal
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and TestSujit Pal
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Sujit Pal
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringSujit Pal
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop VisualizationSujit Pal
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubSujit Pal
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalSujit Pal
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsSujit Pal
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesSujit Pal
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingSujit Pal
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSujit Pal
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSujit Pal
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchSujit Pal
 
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...Sujit Pal
 

Plus de Sujit Pal (20)

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestring
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based Retrieval
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length Inputs
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
 
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
 

Dernier

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 

Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19

  • 1. Saturn Cloud Accelerating NLP with Dask on Saturn Cloud Elsevier Labs Online Lecture November 2020 1
  • 2. Hi! Speakers Aaron Richter Senior Data Scientist @ Saturn Cloud aaron@saturncloud.io @rikturr Sujit Pal Technology Research Director @ Elsevier sujit.pal@elsevier.com @palsujit 2
  • 3. Check out the blog post! 🔗 How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud 3
  • 5. Data science with Python 5
  • 7. Dask ● Parallel computing for Python people ● Anaconda, ~2015 ● Built in Python; Python API ● Mature, scientific computing communities ● Low-level task library ● High-level libraries for DataFrames, arrays, ML ● Integrates with PyData ecosystem ● Runs on laptop, scales to clusters https://dask.org/ 7
  • 8. Dask What does it do? https://docs.dask.org/en/latest/user-interfaces.html ● Parallel machine learning (scikit) ● Parallel dataframes (pandas) ● Parallel arrays (numpy) ● Parallel anything else 8
  • 9. What does it do? Arrays and Dataframes https://docs.dask.org/en/latest/array.html https://docs.dask.org/en/latest/dataframe.html 9
  • 10. What does it do? Anything else! https://docs.dask.org/en/latest/delayed.html 10
  • 11. What does it do? Anything else! https://dask.org/ 11
  • 12. Getting up to Speed with Dask https://youtu.be/S_ncqocDcBA 12
  • 13. Spark vs. Dask ● Written in Scala with Python API ● All-in-one tool ○ Requires re-write to migrate from PyData code ● Programming model not suited for complex operations (multi-dim arrays, machine learning) ● 100% Python ● Built to extend and interact with PyData ecosystem ● High-level interfaces for DataFrames, (multi-dim) Arrays, and ML ● Native integration with RAPIDS for GPU-acceleration
  • 14. 14
  • 15. How can I run Dask clusters? ● Manual setup ● SSH ● HPC: MPI, SLURM, SGE, TORQUE, LSF, DRMAA, PBS ● Kubernetes (Docker, Helm) ● Hadoop/YARN ● Cloud provider: AWS or Azure 🔗 https://docs.dask.org/en/latest/setup.html 15
  • 17. ● Fast setup ● Enterprise secure ● Pythonic parallelism ● Rapidly scale PyData ● Multi-GPU computing ● The future of HPC ● Workflow orchestration ● Flow insight and mgmt Bringing together the fastest hardware + OSS Saturn Cloud
  • 18. ● Saturn manages all infrastructure ● Hosted: Run within our cloud ● Enterprise: Run within your AWS account Saturn Cloud to the rescue! Taking the DevOps out of Data Science 18
  • 19. ● Images ● Jupyter server ● Dask Cluster ● Deployments Saturn Cloud Core features 19
  • 20. 20
  • 23. Genesis 23 ● Based on one of the many COVID-19 initiatives (COVID- KG) ● Original intent: extract entities from CORD-19 dataset for relationship mining.
  • 24. Genesis 24 ● Based on one of the many COVID-19 initiatives (COVID- KG) ● Original intent: extract entities from CORD-19 dataset for relationship mining.● CORD-19 dataset open sourced by AllenAI. ● SciSpacy provided Language models trained on Biomedical text, and... ● Pre-trained Named Entity Recognition (and linking) models.
  • 25. Genesis 25 ● Based on one of the many COVID-19 initiatives (COVID- KG) ● Original intent: extract entities from CORD-19 dataset for relationship mining.● CORD-19 dataset open sourced by AllenAI. ● SciSpacy provided Language models trained on Biomedical text, and... ● Pre-trained Named Entity Recognition (and linking) models. ● Dask based distributed computing platform ● Opportunity to evaluate.
  • 26. Goals ● Create standoff entity annotations for CORD-19, modeled after CAT (Content Analytics Toolbench) from Labs. ● Automated entity recognition using pre-trained SciSpaCy models, where each model recognizes a different subset of entity classes, e.g. DNA, Gene, Protein, Chemical, Organism, Disease, etc. ● Output is structured as Parquet files, consumable via Dask or Spark. ● Share output dataset with community. 26
  • 27. CORD-19 Dataset ● Started mid March 2020 with ~40k articles released weekly. ● By Sept/Oct 2020 ~200k articles released daily, growing everyday. ● Each release contains: ○ Metadata file (CSV) ○ Set of articles (JSON) 27
  • 28. SciSpaCy NER(L) models ● Medium English LM for sentence splitting. ● 4 NER models ● 5 NERL models using LM’s candidate entity generator and trained entity linking models. 28
  • 29. Full Pipeline ● Read metadata.csv ● Parse each JSON file into paragraphs. ● Split paragraphs into sentences. ● Extract entities from sentences using a NER(L) model.. 29
  • 30. Embarrassing Parallelism ● Pipeline is embarrassingly parallel. ● Parse files to paragraphs has no dependencies (i.e., perfectly parallel) ● ... ● ... 30
  • 31. Embarrassing Parallelism ● Pipeline is embarrassingly parallel ● ... ● Split paragraphs to sentences needs sentence splitter model assigned per partition. ● Load only models that you need. ● … ● ... 31
  • 32. Embarrassing Parallelism ● Pipeline is embarrassingly parallel. ● … ● ... ● Extract entities from sentence needs NER model, assign lazily to worker per partition. ● Use nlp.pipe and batching to exploit multithreading. ● ... 32
  • 33. Embarrassing Parallelism ● Pipeline is embarrassingly parallel. ● … ● ... ● Extract and link entities from sentence needs Language Model, Entity Linker, etc. assign eagerly per worker after cluster creation. 33
  • 34. Parquet Dask / Spark interop ● Output of paragraph, sentence, and entities are in Parquet format. ● Things to keep in mind for Spark interoperability when writing from Dask. ○ Column data types must be declared explicitly on the Dask end. ○ Column names should be specified when saving (“hidden” columns visible in Spark). ○ Explicit re-partitioning may be necessary when saving on Dask. 34
  • 35. Incremental Pipeline ● Extracted Entities + new metadata and JSON files. ● Compute diffs (additions + deletions) ● Parse added articles to paragraphs, paragraphs to sentences, and sentences to entities. ● Remove paragraphs, sentences, and entities for deleted articles. ● Merge diff and original. 35
  • 36. Deliverables ● Code ○ Set of Jupyter notebooks deployed on Saturn Cloud -- sujitpal/saturn-scispacy ● Data ○ Dataframes in Parquet format (approx 70 GB, 35 for Sep 2020, 35 for Oct 2020). ○ Publicly available on s3://els-saturn-scispacy/cord19-scispacy-entities (requester pays). ○ Available within ELS on Databricks under “/mnt/els/covid-19” -- “/mnt/els/covid-19/saturn- scispacy/annotations-pq-full-20200928” 36
  • 37. Output formats 1 paragraph dataframe (3.4M paragraphs), 1 sentence dataframe (17.1M sentences), and 9 entity dataframes (total 805.4M entities). 37
  • 38. Utility ● Envisioned usage similar to that for CAT annotations using AnnotationQuery, but for biomedical entities on biomedical data. ● Query dataset using DataFrame API to create interesting micro-datasets datasets potentially spanning across different NER(L) models. Example: ○ Human Phenotypes Annotations from HPO co-occurring in same sentence with Disease Annotations from UMLS or BC5CDR. ○ Gene annotations co-occurring in same sentence with Cancer annotations (both from BioNLP). ● Annotations can be features for Topic Modeling or Categorization. 38
  • 39. Additional Resources ● Labs Blog post -- How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud ● Confluence page -- Dataset of Entity Annotations for CORD-19 ● Book -- Data Science with Python and Dask by Jesse C. Daniel (also available on Percipio) ● Dask online documentation ● Saturn Cloud documentations ● Cloud deployments using Dask ● Query interface example -- AnnotationQuery. 39

Notes de l'éditeur

  1. Q’s to ask: what other packages do you frequently use? Things that you always import with each notebook/script
  2. Goal: flexible parallel computing for Python ecosystem. Compatible with lots of packages Generally, if you do something in Python, you can probably make it faster pretty easily with Dask (minimal re-write) Click link, scroll to “powered by Dask”
  3. Ok so that was Dask, we got another tool to talk about
  4. Took over big data world post-Hadoop. Still carries Hadoop residue. There is a machine learning library, but is pretty slow (and why re-learn new library, when scikit-learn has more algorithms, plus you already know it) I worked with Spark for years, but always dropped back to PyData after I was done cleaning data Also not great for monitoring/debugging
  5. If you already have a cluster with one of these mangement tools, you’re good to go Cloud provider - AWS can’t get GPUs
  6. Contrary to popular belief, this is not zach galifinakis (its robert redford) Show Saturn UI