Text Analysis with Python and NLTK
Abstract
Digital technologies have made vast amounts of text available to researchers, and this same technological moment has
provided us with the capacity to analyze that text faster than humanly possible. The first step in that analysis is to
transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural
Language ToolKit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into
quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.
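As a minimal illustration of what turning a text into numbers can mean, the sketch below uses only the Python standard library. The sample sentence and the function name `text_to_counts` are made up for this example, and NLTK offers richer tools for the same steps.

```python
import re
from collections import Counter


def text_to_counts(text):
    """Lowercase a text, split it into word tokens, and count each token."""
    # A deliberately simple rule: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens, Counter(tokens)


# Hypothetical sample text, not taken from the workshop materials.
sample = "The whale surfaced. The crew watched the whale dive again."
tokens, counts = text_to_counts(sample)

print(len(tokens))            # total number of word tokens
print(counts.most_common(2))  # the two most frequent tokens with their counts
```

Once a text is a list of tokens and a table of counts, it can be sorted, compared, and plotted like any other numeric data.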
Learning Objectives
In this workshop, you will learn how to:
• Prepare texts for computational analysis, including strategies for transforming texts into numbers
• Use NLTK methods such as concordance and similar
• Clean and standardize your data with powerful tools such as stemmers and lemmatizers
• Compare frequency distributions of words in a text to quantify the narrative arc
• Identify stop words and remove them when needed
• Use part-of-speech tagging to gather insights about a text
• Transform any document that you have (or have access to) in .txt format into a text that can be analyzed
computationally
• Tokenize your data and put it in a format compatible with the Natural Language Toolkit
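Several of these objectives can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses NLTK classes that need no extra data downloads (`RegexpTokenizer`, `PorterStemmer`, `FreqDist`, `Text`), a made-up sample text, and a tiny hand-written stop word set in place of NLTK's downloadable stop word lists. Lemmatizing and part-of-speech tagging rely on additional NLTK data packages and are left out here.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text, not taken from the workshop materials.
raw = "The whales were swimming near the ship. A whale swims faster than the ship sails."

# Tokenize: lowercase the text and keep runs of word characters.
tokens = RegexpTokenizer(r"\w+").tokenize(raw.lower())

# Remove stop words using a small illustrative set (an assumption for this
# sketch; NLTK's own stop word lists require a separate data download).
stop_words = {"the", "a", "were", "than", "near"}
content_tokens = [t for t in tokens if t not in stop_words]

# Stem each remaining token so that related word forms are counted together.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_tokens]

# Count how often each stem occurs.
freq = FreqDist(stems)
print(freq.most_common(3))

# Wrap the tokens in an NLTK Text to get methods such as concordance,
# which prints each occurrence of a word with its surrounding context.
text = Text(tokens)
text.concordance("ship")
```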
Estimated time
10 hours
Prerequisites
• Introduction to Python (required) This workshop relies heavily on concepts from the Python workshop. A basic
understanding of the commands discussed there is essential for anyone who wants to learn about text analysis
with Python and NLTK.
• Introduction to the Command Line (recommended) This workshop makes some reference to concepts from the
Command Line workshop. Basic knowledge of how to use the command line will be helpful for anyone who
wants to learn about text analysis with Python and NLTK.
• Short introduction to Jupyter Notebooks (recommended) This workshop uses Jupyter Notebooks to process the
Python commands in a clear and visual way. Anyone who wants to follow along in the workshop on text analysis
with Python and NLTK should read this very short introduction to how to use Notebooks.
• Installing Python (and Anaconda) (required) This workshop uses Python, so you will need a working Python
installation. If you choose to install a different version of Python, make sure it is version 3, as other versions will
not work with this workshop.
• Installing Natural Language Toolkit (required) You will need to install the NLTK package into your Python
environment for this workshop. This guide will help you along the way.
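Before starting, it can help to confirm that the setup described above is in place. The following is a small check using only the standard library. The function name `check_environment` is made up for this sketch, and it only reports whether a package named `nltk` can be found, not whether any NLTK data has been downloaded.

```python
import sys
from importlib.util import find_spec


def check_environment():
    """Report whether Python 3 is running and whether NLTK can be found."""
    python_ok = sys.version_info.major == 3
    # find_spec returns None when no top-level package called "nltk" is installed.
    nltk_found = find_spec("nltk") is not None
    return python_ok, nltk_found


python_ok, nltk_found = check_environment()
print("Python 3:", python_ok)
print("NLTK installed:", nltk_found)
```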
Contexts
Pre-reading suggestions
• A Beginner’s Tutorial to Jupyter Notebooks
• What is text analysis
Projects that use these skills
• Short list of academic Text & Data mining projects
• Building a Simple Chatbot from Scratch in Python
• Classifying personality type by social media posts
Ethical Considerations
• When working with massive amounts of text, it is easy to lose the original context. We must be aware of that and be
careful when analyzing it.
• It is important to constantly question our assumptions and the indexes we are using. Numbers and graphs do not
tell the story; our analysis does. We must be careful not to draw hasty and simplistic conclusions about things that are
complex. Just because we found that author A uses more unique words than author B, does it mean that A is a
better writer than B?
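To make the last point concrete, here is a small sketch of one such index: the share of unique words in a text, often called the type-token ratio. The two "authors" and their sentences are invented for illustration, and the function name `unique_word_ratio` is hypothetical.

```python
import re


def unique_word_ratio(text):
    """Return the number of distinct words divided by the total number of words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words)


# Invented samples: author B's text is author A's text plus one more sentence.
author_a = "The sea was calm and the sky was clear."
author_b = author_a + " The sea was calm again and the crew was glad."

print(round(unique_word_ratio(author_a), 2))
print(round(unique_word_ratio(author_b), 2))
```

In this sketch author B's text contains everything author A wrote and more, yet its ratio is lower, because longer texts repeat common words more often. The number reflects text length as much as vocabulary, which is one reason it should not be read as a measure of writing quality.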
Cheat Sheets
• Jupyter Notebook shortcuts, tips and tricks
Acknowledgements
• Current author: Rafael Davis Portela
• Past contributor: Michelle McSweeney
• Past contributor: Rachel Rakov
• Past contributor: Kalle Westerling
• Past contributor: Patrick Smyth
• Past contributor: Hannah Aizenman
• Past contributor: Kelsey Chatlosh
• Past reviewer: Filipa Calado
• Current editor: Lisa Rhody
• Current editor: Kalle Westerling
