Overview of text data, processing of text data, integration of text data with structured databases, and uses of text data in analytics across a variety of fields. Here's the talk link: https://www.youtube.com/watch?v=wS0X1bSsuUU
SAS Global 2021 Introduction to Natural Language Processing
2. Natural Language Processing—An Introduction
Colleen M. Farrelly, Staticlysm
Brief bio –
Colleen M. Farrelly is a machine learning scientist whose expertise includes supervised learning, unsupervised learning, psychometrics, topological data analysis, and natural language processing. She has an analytics book in review that touches upon the analysis of text data with topological data analysis tools.
4. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Text Data and Applications
• What do all of these have in common?
  • Clinical case notes
  • Chatbot conversations
  • Client email interactions
  • Court case summaries/transcripts
  • Published research articles
  • Tweets
  • Voice recordings
Text Data and Applications
• Commonalities
  • Text data
  • Contain potentially informative features for predicting an outcome or categorizing data
  • May contain information not available in structured datasets
  • Linguistic insight into the speaker/writer
Example
Legal
• Imagine both the witness and the robber in these two examples.
• How might these observations impact the outcome of a police investigation?
• Statement 1:
  • She pulled the gun, took the money, and ran.
• Statement 2:
  • The petite blonde pulled a shotgun on the clerk at station 2, filled a bag with cash from the register, and absconded with the money and a handful of pens.
• How many suspects might the police have to stop to find Bonnie and Clyde? Which witness statement might have more impact on a jury?
• How might differences in clinical case notes by clinicians inform health outcome models? How might they reflect on the individual clinician?
Making Sense of Text Data
• Natural language processing (NLP)
  • Collection of tools to parse human language into something understandable by algorithms
  • What is said
• Computational linguistics
  • Deriving insight about human behavior or traits based on text data
  • How it’s said
Parsing Documents/Sentences
An Example
• Tokens (words or punctuation)
• Punctuation (non-word tokens)
• Stop words (less important words)
• Root words (stemming/lemmatizing)
Bonnie hopped into Clyde’s new car.
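The parsing steps listed on this slide can be sketched with the Python standard library alone. This is a minimal illustration, not what NLTK or spaCy do internally: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix-stripping helper that is far rougher than a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "into", "in", "on", "of", "and", "to"}


def tokenize(text):
    """Split text into word tokens (possessives kept whole) and punctuation tokens."""
    return re.findall(r"[A-Za-z]+(?:['\u2019][A-Za-z]+)?|[^\sA-Za-z]", text)


def is_punctuation(token):
    """A token with no letters or digits is treated as punctuation."""
    return not any(ch.isalnum() for ch in token)


def crude_stem(token):
    """Very rough root-word extraction by suffix stripping (hypothetical helper)."""
    word = re.sub(r"['\u2019]s$", "", token.lower())  # drop a possessive 's
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "hopp" -> "hop".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


sentence = "Bonnie hopped into Clyde's new car."
tokens = tokenize(sentence)
content_words = [
    t for t in tokens if not is_punctuation(t) and t.lower() not in STOP_WORDS
]
roots = [crude_stem(t) for t in content_words]
print(tokens)
print(roots)
```

In practice a library tokenizer and a proper stemmer or lemmatizer would replace these helpers; the sketch only shows how the four bullet points relate to one another.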
Tagging Features
• Parts of speech
• Clauses
• Grammatical relations
• Entity recognition
Bonnie hopped into Clyde’s new car.
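As a toy illustration of part-of-speech tagging and entity recognition, the sketch below tags the example sentence by dictionary lookup. Everything here is an assumption made for illustration: the lexicon is hand-built and covers only this sentence, the tag names follow the Universal POS tag set, and treating unknown capitalized words as proper nouns is a crude heuristic. Real taggers (for instance those in spaCy or Stanford CoreNLP) use trained statistical models, and clause and grammatical-relation tagging is not attempted here.

```python
import re

# Hypothetical hand-built lexicon covering only the example sentence.
POS_LEXICON = {"hopped": "VERB", "into": "ADP", "new": "ADJ", "car": "NOUN"}


def strip_possessive(token):
    """Remove a trailing possessive 's so that "Clyde's" is looked up as "Clyde"."""
    return re.sub(r"['\u2019]s$", "", token)


def tag_tokens(tokens):
    """Assign a tag to each token by lookup, falling back on simple heuristics."""
    tagged = []
    for token in tokens:
        base = strip_possessive(token)
        if not any(ch.isalnum() for ch in token):
            tag = "PUNCT"
        elif base.lower() in POS_LEXICON:
            tag = POS_LEXICON[base.lower()]
        elif base[0].isupper():
            tag = "PROPN"  # heuristic: unknown capitalized word
        else:
            tag = "X"  # unknown
        tagged.append((token, tag))
    return tagged


def candidate_entities(tagged):
    """Treat proper nouns as candidate named entities (a crude stand-in for NER)."""
    return [strip_possessive(tok) for tok, tag in tagged if tag == "PROPN"]


tokens = ["Bonnie", "hopped", "into", "Clyde's", "new", "car", "."]
tagged = tag_tokens(tokens)
entities = candidate_entities(tagged)
print(tagged)
print(entities)
```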
Deriving Sentiment
• Language-dependent
• Sentiment dictionaries
  • Positive/negative/neutral (AFINN, for instance)
  • Emotion groups from psychological models
Bonnie hopped into Clyde’s new car.
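Dictionary-based sentiment scoring can be sketched as a lookup and a sum. The lexicon below is invented for illustration and is not the AFINN word list; it only mimics the AFINN style of assigning each word an integer valence score. The function names are hypothetical.

```python
import re

# Made-up scores in the style of an integer-valence lexicon (not real AFINN values).
TOY_LEXICON = {
    "great": 3,
    "good": 2,
    "happy": 2,
    "bad": -2,
    "angry": -2,
    "terrible": -3,
}


def sentiment_score(text):
    """Sum the lexicon scores of all words; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(token, 0) for token in tokens)


def sentiment_label(score):
    """Map a numeric score to positive/negative/neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in (
    "Bonnie hopped into Clyde's new car.",
    "The getaway was great but the food was terrible and bad.",
):
    score = sentiment_score(text)
    print(text, score, sentiment_label(score))
```

Because the dictionary is tied to one language and one vocabulary, a sentence with no listed words, like the example sentence, scores as neutral.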
Vectorizing/Summarizing Results
• Many options for turning NLP results into usable data in machine learning and statistical tools:
  • Vectorization
  • Word frequency matrices
  • Summary tables
Bonnie hopped into Clyde’s new car.
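A word frequency matrix (one row per document, one column per vocabulary term) can be sketched as follows. The second document and the helper names are assumptions added for illustration, and the tokenizer is deliberately simplistic.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocab] for count in counts]
    return vocab, matrix


docs = [
    "Bonnie hopped into Clyde's new car.",
    "Clyde hopped into the car.",  # hypothetical second document
]
vocab, matrix = term_frequency_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is a numeric vector that can be passed to a statistical or machine learning tool. Note that "clyde" and "clyde's" land in separate columns here, which is one reason stemming or lemmatizing usually comes first.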
Summary Statistics
• Common summary statistic uses
  1. Conversation length (example: engagement metric)
  2. Swear count (example: escalation marker)
  3. Conversation sentiment over time (example: engagement and satisfaction)
  4. Keyword frequency (example: products with most issues)
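The four summary statistics above can be computed from a conversation stored as a list of turns. This is a sketch under stated assumptions: the conversation, the swear list, the sentiment scores, and the product keywords are all invented placeholders, and `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Placeholder word lists (assumptions for illustration only).
SWEARS = {"damn", "hell"}
SENTIMENT = {"thanks": 2, "great": 3, "broken": -2, "useless": -3, "damn": -2}
PRODUCT_KEYWORDS = {"router", "modem"}


def words(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(turns):
    """Compute simple summary statistics for a list of conversation turns."""
    tokens_per_turn = [words(turn) for turn in turns]
    all_tokens = [tok for toks in tokens_per_turn for tok in toks]
    return {
        "n_turns": len(turns),
        "n_words": len(all_tokens),
        "swear_count": sum(tok in SWEARS for tok in all_tokens),
        "sentiment_by_turn": [
            sum(SENTIMENT.get(tok, 0) for tok in toks) for toks in tokens_per_turn
        ],
        "keyword_counts": dict(
            Counter(tok for tok in all_tokens if tok in PRODUCT_KEYWORDS)
        ),
    }


conversation = [
    "My router is broken again.",
    "This damn router is useless.",
    "Thanks, the new modem works great.",
]
summary = summarize(conversation)
print(summary)
```

The per-turn sentiment list is what a "sentiment over time" chart would plot; here it moves from negative to positive as the issue is resolved.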
Use as Machine Learning Features
• Examples combining NLP data with data from structured databases
  1. Clustering (example: types of churn from client feedback and account data)
  2. Predictive modeling (example: patient outcomes from case notes and medical records)
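Combining text-derived features with structured records usually comes down to joining on a shared key and concatenating the columns. The sketch below shows only that assembly step, with no clustering or modeling. The record layout, field names, client IDs, and the negative-term list are all hypothetical.

```python
import re

# Hypothetical structured records keyed by client ID.
accounts = {
    101: {"tenure_months": 24, "monthly_spend": 50.0},
    102: {"tenure_months": 3, "monthly_spend": 20.0},
}

# Hypothetical free-text feedback keyed by the same client ID.
feedback = {
    101: "Great service, thanks for the quick fix.",
    102: "Still broken. I want to cancel.",
}

NEGATIVE_TERMS = {"broken", "cancel", "slow"}  # placeholder list


def text_features(text):
    """Derive a few numeric features from free text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_words": len(tokens),
        "n_negative_terms": sum(tok in NEGATIVE_TERMS for tok in tokens),
    }


def build_feature_rows(accounts, feedback):
    """Join structured fields and text features into one row per client."""
    rows = {}
    for client_id, record in accounts.items():
        features = dict(record)  # copy so the source record is not modified
        features.update(text_features(feedback.get(client_id, "")))
        rows[client_id] = features
    return rows


rows = build_feature_rows(accounts, feedback)
for client_id, features in rows.items():
    print(client_id, features)
```

The resulting rows could then be handed to a clustering or predictive modeling routine of choice.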
Psychometric Applications
• Some published papers:
  1. Personality trait identification in industrial psychology research
  2. Author identification in plagiarism software
  3. Quantification of release risk in justice systems
  4. Quantification of relapse risk in mental health applications
Other Common NLP Applications
• Chatbots
• Personal assistants
• Translation services
• Sentence completion
In General
Main NLP Software Options
• NLTK (Python)
• spaCy (Python)
• Stanford CoreNLP (Java)
• John Snow Labs/Spark NLP (Spark)
Some NLP Literature
• Dunnmon, J. A., Ratner, A. J., Saab, K., Khandwala, N., Markert, M., Sagreiya, H., ... & Ré, C. (2020). Cross-modal data programming enables rapid medical machine learning. Patterns, 100019.
• Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).
• Pennebaker, J. W. (2011). The secret life of pronouns. New Scientist, 211(2828), 42-45.
• Polsley, S., Jhunjhunwala, P., & Huang, R. (2016, December). Casesummarizer: a system for automated summarization of legal texts. In Proceedings of COLING 2016, the 26th international conference on Computational Linguistics: System Demonstrations (pp. 258-262).
• Velupillai, S., Suominen, H., Liakata, M., Roberts, A., Shah, A. D., Morley, K., ... & Chapman, W. (2018). Using clinical Natural Language Processing for health outcomes research: Overview and actionable suggestions for future advances. Journal of biomedical informatics, 88, 11-19.