An April 2023 presentation to the AMIA working group on natural language processing. The talk focuses on three current trends in NLP and how they apply in healthcare: Large language models, No-code, and Responsible AI.
3. 3
John Snow Labs is the team behind Spark NLP
• 55+ million downloads on PyPI.
• “Most Widely Used NLP Library in the Enterprise.” (O’Reilly Media)
• 59% share of healthcare NLP teams use Spark NLP (Gradient Flow)
4. Case Studies from the NLP Summit
• Accelerating Biomedical Innovation by Combining NLP and Knowledge Graphs
• Extracting what, when, why, and how from Radiology Reports in Real World Data Projects
• Automated Classification and Entity Extraction from Essential Clinical Trial Documents
• Question Answering on Clinical Guidelines
• Identifying opioid-related adverse events from unstructured text
• Adverse Drug Event Detection using Spark NLP
• Lessons Learned De-Identifying 700 Million Patient Notes with Spark NLP
• Understand Patient Experience Journey to Improve Pharma Value Chain
• A Real-time NLP-Based Clinical Decision Support Platform for Psychiatry and Oncology
5. 5
2022 Peer-Reviewed Papers
Deeper Clinical Document Understanding Using Relation Extraction
New state-of-the-art accuracy on:
2019 Phenotype-Gene Relations dataset
2018 n2c2 Posology Relations dataset
2012 Adverse Drug Events Drug-Reaction dataset
2012 i2b2 Clinical Temporal Relations challenge
2010 i2b2 Clinical Relations challenge
Mining Adverse Drug Reactions from Unstructured Mediums at Scale
New state-of-the-art accuracy on:
ADE benchmark
SMM4H benchmark
CADEC entity recognition dataset
CADEC relation extraction dataset
Biomedical Named Entity Recognition in Eight Languages with Zero Code Changes
New state-of-the-art accuracy on:
LivingNER dataset using a single model architecture in English, French, Italian, Portuguese, Galician, Catalan & Romanian
Accurate Clinical and Biomedical Named Entity Recognition at Scale
New state-of-the-art accuracy on:
2018 n2c2 medication extraction
2014 n2c2 de-identification
2010 i2b2/VA clinical concept extraction
8 different Biomedical NLP benchmarks
7. 7
1. Open-Source is Catching Up Fast
State of AI Report, Nathan Benaich & Ian Hogarth, https://www.stateof.ai/
11th October 2022
8. 8
1. Open-Source is Catching Up Fast
A Survey of Large Language Models, Zhao et al., arxiv.org/abs/2303.18223
Submitted on 31 Mar 2023 (v1), last revised 24 Apr 2023 (v6)
9. 9
2. Costs Are Coming Down Fast
At the MIT event, Altman was asked if training GPT-4 cost $100 million;
he replied, “It’s more than that.”
10. 10
2. Costs Are Coming Down Fast
Dolly 2.0 was trained on a human-generated dataset of prompts and
responses. The training methodology is similar to InstructGPT but with
a claimed higher accuracy and lower training costs of less than $30.
11. 11
3. Medical Large Language Models Are Here
Medical Question Answering with BioGPT
Medical Question Answering with BioGPT-JSL
• Faster inference than HF
• Fine-tuned with fresh medical data
The first ever closed-book medical question answering LLM based on BioGPT
12. 12
Medical Specialty: Pediatrics - Neonatal, Sample Name: Chest Closure
Summary:
A newborn with hypoplastic left heart syndrome underwent a delayed primary chest closure under general endotracheal
anesthesia. The chest was prepped and draped in a sterile fashion, and mediastinal cultures were obtained. The mediastinum
and cavities were irrigated and suctioned, and the sternum was closed with stainless steel wires and subcutaneous tissues
with interrupted monofilament stitches. The patient tolerated the procedure well and was transferred to the pediatric intensive
unit in stable condition.
Text:
Description: Delayed primary chest closure. Open chest status post modified stage 1
Norwood operation. The patient is a newborn with diagnosis of hypoplastic left heart
syndrome who 48 hours prior to the current procedure has undergone a modified stage 1
Norwood operation. (Medical Transcription Sample Report)
PROCEDURE: Delayed primary chest closure.
INDICATIONS: The patient is a newborn with diagnosis of hypoplastic left heart syndrome
who 48 hours prior to the current procedure has undergone a modified stage 1 Norwood
operation. Given the magnitude of the operation and the size of the patient (2.5 kg), we have
elected to leave the chest open to facilitate postoperative management. He is now taken back
to the operative room for delayed primary chest closure.
PREOP DX: Open chest status post modified stage 1 Norwood operation.
POSTOP DX: Open chest status post modified stage 1 Norwood operation.
ANESTHESIA: General endotracheal.
COMPLICATIONS: None.
FINDINGS: No evidence of intramediastinal purulence or hematoma. He tolerated the procedure
well.
DETAILS OF PROCEDURE: The patient was brought to the operating room and placed on the
operating table in the supine position. Following general endotracheal anesthesia, the chest was
prepped and draped in the usual sterile fashion. The previously placed AlloDerm membrane was
removed. Mediastinal cultures were obtained, and the mediastinum was then profusely irrigated and
suctioned. Both cavities were also irrigated and suctioned. The drains were flushed and
repositioned. Approximately 30 cubic centimeters of blood were drawn slowly from the right atrial
line. The sternum was then smeared with a vancomycin paste. The proximal aspect of the 5 mm
RV-PA conduit was marked with a small titanium clip at its inferior most aspect and with an
additional one on its rightward inferior side. The sternum was then closed with stainless steel wires
followed by closure of subcutaneous tissues with interrupted monofilament stitches. The skin was
closed with interrupted nylon sutures and a sterile dressing was placed. The peritoneal dialysis
catheter, atrial and ventricular pacing wires were removed. The patient was transferred to the
pediatric intensive unit shortly thereafter in very stable condition. I was the surgical attending
present in the operating room and in charge of the surgical procedure throughout the entire length of
the case.
Summarize Clinical Notes, Biomedical Research, and Patient Messages
3. Medical Large Language Models Are Here
13. 13
Healthcare-Specific LLMs Outperform General-Purpose LLMs
• Clinical note summarization is 30% more accurate than general state-of-the-art LLMs (BART, Flan-T5, Pegasus).
• On clinical entity recognition, John Snow Labs' models make half as many errors as ChatGPT.
• Out-of-the-box de-identification accuracy is 93%, compared to ChatGPT’s 60%, on detecting PHI in clinical notes.
• ICD-10-CM codes are extracted with a 76% success rate, versus 26% for GPT-3.5 and 36% for GPT-4.
www.johnsnowlabs.com/large-language-models-blog
16. 16
The NLP Lab
The Free No-Code NLP Platform:
• Annotate Text & Images
• AI Assisted Annotation
• Train & Tune NLP Models
• Models, Rules, and Prompts Hub
• Manage Projects & Teams
• Enterprise Security & Privacy
This is widely used today, but what comes next?
https://www.johnsnowlabs.com/nlp-lab/
17. 17
Answering Clinical Questions
Which female patients have not started taking beta blockers within a month after a heart attack?
Demographics
Cohort Building
Not, And, Or
Drug Classes
Timeline Common Terms
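The question on this slide combines a demographic filter, a drug class, a negation, and a time window. A minimal sketch of that logic, assuming a hypothetical in-memory `Patient` record; the class, the field names, the small beta-blocker list, and the reading of "within a month" as 30 days are all illustrative assumptions, not the data model of any product mentioned here:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional

# Hypothetical drug-class lookup; a real system would use a curated terminology.
BETA_BLOCKERS = {"metoprolol", "atenolol", "carvedilol", "bisoprolol"}


@dataclass
class Patient:
    patient_id: str
    sex: str
    heart_attack_date: Optional[date] = None
    # Maps drug name -> date the patient started taking it.
    drug_starts: Dict[str, date] = field(default_factory=dict)


def started_beta_blocker_within(patient: Patient, days: int) -> bool:
    """True if any beta blocker was started 0..days after the heart attack."""
    window_end = patient.heart_attack_date + timedelta(days=days)
    return any(
        drug in BETA_BLOCKERS and patient.heart_attack_date <= start <= window_end
        for drug, start in patient.drug_starts.items()
    )


def cohort(patients: List[Patient], days: int = 30) -> List[str]:
    """Female AND had a heart attack AND NOT started a beta blocker in the window."""
    return [
        p.patient_id
        for p in patients
        if p.sex == "F"
        and p.heart_attack_date is not None
        and not started_beta_blocker_within(p, days)
    ]


patients = [
    Patient("p1", "F", date(2023, 1, 10), {"metoprolol": date(2023, 1, 20)}),
    Patient("p2", "F", date(2023, 1, 10), {"metoprolol": date(2023, 3, 15)}),
    Patient("p3", "M", date(2023, 1, 10), {}),
    Patient("p4", "F", date(2023, 2, 1), {"lisinopril": date(2023, 2, 3)}),
    Patient("p5", "F", None, {}),
]
print(cohort(patients))  # ['p2', 'p4']
```

Patient p2 is included because the beta blocker was started outside the window, and p4 because the only drug started is not a beta blocker.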
18. 18
Answering Biomedical Questions
Which multi-center clinical trials assessed the efficacy of vildagliptin as an add-on therapy to metformin for adults with T2DM?
Trial Protocols
Research Outcomes & Metrics
Populations
Study Design Terminologies
19. Towards Regulatory-Grade Chatbots
• No Data Sharing (Airgap Deployment): Run behind your firewall, never send data to 3rd parties
• No BS (Knowledge Base): No hallucinations or unexplained results
• No Test Gaps (NLP Test): Responsible AI: Test for robustness, fairness, bias, toxicity, and data leakage
20. 20
An End-to-end System
Chat & Query Application
Pre-Processing Cluster
Kubernetes Keycloak
Vector Database
Curated datasets & terminologies
Multi-modal patient data
21. 21
An End-to-end System: Capabilities
• Answer ‘noisy’ natural language questions
• Find cohorts by conditions, grouping and/or timeline
• Explain & cite answers
• Maintain session & context
• Analyze multi-modal data
• Near-real-time freshness
• Normalize patient data
• Link patient data over time
• Scale to millions of patients
• Run on commodity hardware
• On-premise, high-compliance, scale-as-you-go
• Strong security, role-based access, single sign-on
Semantic Search
Curated datasets & terminologies
Multi-modal patient data
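"Semantic search" together with "explain & cite answers" suggests retrieval that returns a source identifier with every hit. A rough sketch of that idea, assuming a toy word-count embedding as a stand-in for a learned embedding model and a plain dictionary as a stand-in for a vector database; the function names and document IDs are hypothetical:

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, documents, top_k=2):
    """Return (source_id, score, text) tuples so each answer can cite its source."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text)), text) for doc_id, text in documents.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    # Drop documents with no overlap at all rather than citing irrelevant sources.
    return [item for item in scored[:top_k] if item[1] > 0]


documents = {
    "note-001": "Patient started metoprolol after myocardial infarction.",
    "note-002": "Follow-up visit for type 2 diabetes on metformin.",
    "note-003": "Chest closure after Norwood operation in a newborn.",
}
results = search("metformin for diabetes", documents)
for doc_id, score, text in results:
    print(f"[{doc_id}] {score:.2f} {text}")
```

Returning the identifier next to the score is the design point: an answer built from these hits can always name the notes it came from.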
25. But There’s a Big Gap in Implementation
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Ribeiro et al., 2020
Sentiment analysis services of the top three cloud providers fail:
• 9-16% of the time when replacing neutral words
• 7-20% of the time when changing neutral named entities
• 36-42% of the time on some temporal tests
• Almost 100% of the time on some negation tests.
BBQ: A Hand-Built Bias Benchmark for Question Answering
Parrish et al., 2022
Biases around race, gender, physical appearance, disability, and religion are ingrained in state-of-the-art question answering models – sometimes changing the likely answer more than 80% of the time.
Information Leakage in Embedding Models
Song and Raghunathan, 2020
Data leakage of 50-70% of personal information into popular word & sentence embeddings.
What Do You See in this Patient? Behavioral Testing of Clinical NLP Models
van Aken et al., 2022
Adding any mention of ethnicity to a patient note reduces their predicted risk of mortality – with the most accurate model producing the largest error.
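Several of the findings above are invariance failures: the prediction changes after an edit that should not matter. A minimal sketch of such a check, assuming a hypothetical keyword-based `toy_sentiment` model that contains a deliberately spurious cue so the failure is visible; none of this is the CheckList implementation:

```python
# "chicago" is a deliberately spurious cue, planted so the test has something to find.
NEGATIVE_WORDS = {"delayed", "terrible", "chicago"}


def toy_sentiment(text):
    """Hypothetical model: negative if any known negative word appears."""
    words = text.lower().replace(".", "").split()
    return "negative" if any(w in NEGATIVE_WORDS for w in words) else "positive"


def swap_city(text):
    """Label-preserving perturbation: change one neutral named entity."""
    return text.replace("Boston", "Chicago")


def invariance_failures(predict, texts, perturb):
    """Return the texts whose prediction changes after the perturbation."""
    return [t for t in texts if predict(t) != predict(perturb(t))]


texts = [
    "The flight to Boston was great.",
    "The flight to Boston was delayed.",
]
failed = invariance_failures(toy_sentiment, texts, swap_city)
failure_rate = len(failed) / len(texts)
print(failed, failure_rate)
```

The same `invariance_failures` helper works for any perturbation that should leave the label unchanged, such as adding a mention of ethnicity to a clinical note.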
26. Responsible AI Best Practices
1. Test Your Models!
Why would you expect untested software to work?
2. Don’t Reuse Academic Models in Production
Publishing research ≠ Building reliable systems
3. Test Beyond Accuracy
Robustness, Bias, Fairness, Toxicity, Efficiency, Safety, …
27. Introducing the NLP Test Library
• Simple: Generate & run 50+ test types on popular NLP tasks
• Comprehensive: Test all aspects of model quality before going to production
• Open Source: Open under the Apache 2.0 license and designed for easy extension
29. NLP Test In 3 Lines of Code
from nlptest import Harness
h = Harness(model='dslim/bert-base-NER', hub='huggingface')
h.generate().run().report()
generate(): Generate a set of test cases given a task, model & dataset
run(): Run the test suite, generating a data frame of test results
report(): Generate a summary report stating which tests have passed
30. Write Once, Test Everywhere
from nlptest import Harness
h = Harness(model='ner_dl_bert', hub='johnsnowlabs')
h = Harness(model='dslim/bert-base-NER', hub='huggingface')
h = Harness(model='en_core_web_sm', hub='spacy')
Adding a new library or API?
All test types will generate & run.
Adding a new test type?
It will run on all supported libraries.
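One way to read "write once, test everywhere" is that every library is wrapped behind the same prediction interface, and every test type is a perturbation that knows nothing about the library. A minimal sketch under that assumption; the two toy backends and the registry names are hypothetical and do not reflect the internals of the nlptest library:

```python
import re
from typing import Callable, Dict, List, Tuple

Entities = List[Tuple[str, str]]  # (entity text, label)


def lookup_backend(text: str) -> Entities:
    """Hypothetical backend A: exact-match gazetteer."""
    names = ["Wang Li", "Juan Moreno"]
    return [(name, "Person") for name in names if name in text]


def regex_backend(text: str) -> Entities:
    """Hypothetical backend B: two consecutive capitalized words."""
    return [(m, "Person") for m in re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)]


# Adding a backend means adding one entry here; all test types then run on it.
BACKENDS: Dict[str, Callable[[str], Entities]] = {
    "lookup": lookup_backend,
    "regex": regex_backend,
}

# Adding a test type means adding one entry here; it then runs on all backends.
TEST_TYPES: Dict[str, Callable[[str], str]] = {
    "add_context": lambda t: t + " #careers",
    "lowercase": lambda t: t.lower(),
}


def run_all(sentence: str) -> Dict[Tuple[str, str], bool]:
    """Run every test type on every backend; pass if the predicted labels are unchanged."""
    results = {}
    for backend_name, predict in BACKENDS.items():
        expected = [label for _, label in predict(sentence)]
        for test_name, perturb in TEST_TYPES.items():
            actual = [label for _, label in predict(perturb(sentence))]
            results[(backend_name, test_name)] = actual == expected
    return results


results = run_all("Wang Li is a doctor.")
for key, passed in results.items():
    print(key, "pass" if passed else "fail")
```

Both toy backends survive the added hashtag and both lose the entity when the text is lowercased, which is the kind of cross-library comparison a shared interface makes cheap.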
32. 2. Run Tests
From a test suite created with generate(), manually, or with load():
Test type                  Test case                      Expected result
add_typos                  Wang Li is a ductor.           Wang Li: Person
add_context                Wang Li is a doctor. #careers  Wang Li: Person
replace_to_hispanic_name   Juan Moreno is a doctor.       Juan Moreno: Person
min_gender_representation  Female                         30
min_gender_f1_score        Female                         0.85
Calling run() and then report() produces a summary:
Category        Pass Rate  Minimum Pass Rate  Pass?
Robustness      50%        75%                No
Bias            85%        85%                Yes
Representation  100%       100%               Yes
Fairness        66%        100%               No
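The summary is a per-category pass rate compared against a per-category minimum. A minimal sketch of that aggregation, assuming results arrive as `(category, passed)` pairs and that a category passes when its rate is at least the minimum; the function name and input shape are assumptions, not the library's report format:

```python
from collections import defaultdict


def summarize(results, minimum_pass_rates):
    """Aggregate (category, passed) pairs into one report row per category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        counts[category][0] += int(passed)
        counts[category][1] += 1
    report = []
    for category, (passed, total) in counts.items():
        pass_rate = passed / total
        minimum = minimum_pass_rates[category]
        report.append(
            {
                "category": category,
                "pass_rate": pass_rate,
                "minimum_pass_rate": minimum,
                "pass": pass_rate >= minimum,
            }
        )
    return report


# Counts chosen to reproduce the pass rates shown on the slide.
results = (
    [("robustness", True), ("robustness", False)]
    + [("bias", True)] * 17 + [("bias", False)] * 3
    + [("representation", True)] * 2
    + [("fairness", True)] * 2 + [("fairness", False)]
)
minimums = {"robustness": 0.75, "bias": 0.85, "representation": 1.0, "fairness": 1.0}
report = summarize(results, minimums)
for row in report:
    print(f"{row['category']:<15} {row['pass_rate']:.0%} {row['minimum_pass_rate']:.0%} {row['pass']}")
```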
33. 3. Improve Models With Data Augmentation
h.augment(input_path='training_dataset', output_path='augmented_dataset')
new_model = nlp.load('model').fit('augmented_dataset')
Harness.load(save_dir='testcases', model=new_model, hub='johnsnowlabs').run()
Generate new augmented labeled data for the model’s training (not test!) dataset.
Train a new model using your favorite framework using the augmented training dataset.
Run a regression test: Create a new test harness with the new model and the old test suite.
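Augmentation for robustness can be as simple as adding perturbed copies of training sentences while keeping their labels. A minimal sketch for token-level NER data, assuming typos are injected only into non-entity tokens (label `O`) so the token-to-label alignment is untouched; this is an illustration of the idea, not what `h.augment()` does internally:

```python
import random


def add_typo(token, rng):
    """Swap two adjacent interior characters; short tokens are returned unchanged."""
    if len(token) < 4:
        return token
    i = rng.randrange(1, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(dataset, proportion=0.5, seed=0):
    """dataset: list of (tokens, labels). Returns the originals plus perturbed copies."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for tokens, labels in dataset:
        if rng.random() < proportion:
            # Perturb only non-entity tokens; labels are copied unchanged.
            noisy = [
                add_typo(tok, rng) if label == "O" else tok
                for tok, label in zip(tokens, labels)
            ]
            augmented.append((noisy, list(labels)))
    return augmented


training = [(["Wang", "Li", "is", "a", "doctor"], ["B-PER", "I-PER", "O", "O", "O"])]
augmented = augment(training, proportion=1.0)
for tokens, labels in augmented:
    print(tokens, labels)
```

As the slide notes, this is applied to the training set only; the test suite stays fixed so the regression run is comparable.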
34. Integrate Testing Into CI/CD or MLOps
from metaflow import FlowSpec, step
from nlptest import Harness

class DataScienceWorkFlow(FlowSpec):

    @step
    def train(self):
        # Train a new version of a model
        ...

    @step
    def run_tests(self):
        # Run a regression test against the saved test suite
        harness = Harness.load(model=self.model, save_dir="testsuite")
        self.report = harness.run().report()

    @step
    def deploy(self):
        # Only deploy if the test passed
        if self.report["score"] > self.test_threshold:
            ...
35. Getting Started with NLP Test
TUTORIALS AND EXAMPLES:
https://nlptest.org
CONTRIBUTING:
https://github.com/johnsnowlabs/nlptest
COMMUNITY CHAT:
https://spark-nlp.slack.com @ #nlp-test
Expect Rapid Releases & Long-Term Support from John Snow Labs.