This document outlines an introductory lecture on research methods in natural language processing (NLP). It discusses empirical research methods in computer science, how to choose a good research topic, how to read scientific papers, how to work with an advisor, and how to do research in NLP. The document provides an overview of key aspects of conducting research in NLP, such as identifying problems, developing ideas, conducting experiments, analyzing results, and iterating on the process. It also discusses common NLP problems and applications. Overall, it is an introductory lecture on best practices for conducting NLP research.
The document discusses dependency parsing in natural language processing. It begins by defining dependency as a syntactic or semantic relation between tokens. It then contrasts constituent structure, which groups tokens into phrases bottom-up, with dependency structure, which builds a graph connecting tokens with edges. The document goes on to describe the components of a dependency graph, including vertices, arcs, and relations. It also discusses projectivity, head rules to convert constituent trees to dependency trees, and different approaches to dependency parsing like transition-based and graph-based parsing.
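To make the projectivity notion concrete, here is a minimal sketch (not from the slides; the head-list encoding and the `is_projective` name are illustrative) that detects crossing arcs in a dependency tree:

```python
def is_projective(heads):
    """Check projectivity of a dependency tree given as a head list.

    heads[i - 1] is the index of token i's head (0 = artificial root).
    Tokens are numbered 1..n. A tree is projective iff no two arcs cross.
    """
    # Each arc connects a dependent d to its head h; store as (left, right).
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            # Two arcs cross when one starts strictly inside the other's
            # span and ends strictly outside it.
            if l1 < l2 < r1 < r2:
                return False
    return True
```

For example, `[2, 0, 2]` (both outer tokens attached to the middle root) is projective, while a tree with arcs (1,3) and (2,4) is not.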
Abstractive text summarization is one of the most important research topics in NLP today. However, a deep understanding of what it is and how it works requires a series of foundations that build on one another. For that reason, this presentation gives the audience an overview of sequence-to-sequence models and the various versions of attention developed over the past few years. It then reviews natural language generation (NLG), focusing on decoding techniques and their associated problems, as background to the success of automatic summarization. Finally, abstractive text summarization itself is presented, along with approaches from recent papers for tackling some of its open issues.
Lecture 3 Computer Science Research SEM1 22_23 (1).pptx – NabilaHassan13
The document discusses research in computer science. It defines research as a systematic process of investigating problems to find valid answers supported by evidence. Computer science research derives from mathematics and philosophy. It involves studying computational phenomena, developing models, and investigating properties of abstract objects through formal methods. Research can be theoretical, focusing on creation and analysis of abstract models, or empirical, involving observation and experimentation. The document outlines the basic steps in theoretical and empirical computer science research processes. It also categorizes the scope of computer science research.
Neural Models for Information Retrieval – Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embedding spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd ed. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791.
Every researcher is a cyborg! Academic researchers engage in various sorts of research in vitro (in the glass) and in vivo (in the living body); that is, they carry out experimental laboratory work and analyze data from natural, real-world experiments. In between, many conduct surveys, focus groups, interviews, and other types of research work. In the computer-assisted qualitative data analysis software (CAQDAS) space, NVivo is one of the foremost tools, enabling the creation of manual codebooks, multimedia analysis, and various forms of “auto” or unsupervised machine learning. NVivo works as a “database” for structured and unstructured (multimedia) data. It enables the drawing of content from various social media sites. Technologies augment human analytical capabilities in both the qualitative and quantitative research spaces. This presentation demonstrates some of the capabilities of NVivo. It also addresses how a researcher is changed by the computational capabilities they harness.
Words and sentences are the basic units of text. This lecture covers basic operations on words and sentences, such as tokenization, text normalization, tf-idf, cosine similarity measures, vector space models, and word representations.
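The tf-idf and cosine-similarity operations mentioned above can be sketched in plain Python (a simplified illustration: the whitespace tokenizer, the raw-count tf, and the function names are our assumptions, not the lecture's definitions):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute sparse tf-idf vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A document compared with itself scores 1.0, and documents sharing no terms score 0.0.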
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents.
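ROUGE-N recall, the automatic evaluation mentioned above, can be sketched as follows (a simplified illustration; real ROUGE implementations add stemming, stopword handling, and other options):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that appear in the candidate."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each reference n-gram can be matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For instance, the candidate "the cat sat" recovers 3 of the 6 unigrams of the reference "the cat sat on the mat", giving ROUGE-1 recall of 0.5.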
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... – Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python, and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
A short presentation for beginners introducing machine learning: what it is, how it works, the popular machine learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning), and how they work, with various industry use cases and popular examples.
This presentation provides an overview of boosting approaches for classification problems. It discusses combining classifiers through bagging and boosting to create stronger classifiers. The AdaBoost algorithm is explained in detail, including its training and classification phases. An example is provided to illustrate how AdaBoost works over multiple rounds, increasing the weights of misclassified examples to improve classification accuracy. In conclusion, AdaBoost is highlighted as an approach that produces highly accurate strong classifiers, making it effective for classification problems where misclassification has severe consequences.
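The re-weighting step AdaBoost performs each round can be sketched as follows (a minimal illustration with ±1 labels; the function and variable names are ours, not the presentation's, and we assume the weak learner's error is strictly between 0 and 1):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: compute the weak learner's vote weight (alpha)
    and re-weight the training examples, boosting the misclassified ones.

    weights: current example weights (sum to 1)
    predictions, labels: +1/-1 outputs of the weak learner and true labels
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified examples are multiplied by e^alpha, correct ones by e^-alpha.
    new = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)  # normalizer so the weights again form a distribution
    return alpha, [w / z for w in new]
```

With four equally weighted examples and one mistake, the misclassified example's weight rises from 0.25 to 0.5, so the next weak learner concentrates on it.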
This document discusses how to create effective research questions to guide research. It explains that research questions map out the direction of the research. An effective research question needs information from sources beyond yourself, requires background research, and is neither too broad nor too narrow in scope. There are two types of questions: "thin" questions like who, what, when, where that provide background details, and "thick" questions using how and why that explore broader concepts and changes over time. The document provides examples of each and guides the reader in forming their own thick questions.
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
This document provides an overview of Bayes law, Bayesian networks, and latent Dirichlet allocation (LDA). It begins with an explanation of Bayes law and examples of how it can be used. Next, it defines Bayesian networks as probabilistic graphical models and provides examples. Finally, it introduces LDA as a statistical model for collections of discrete data like text corpora and explains how it can be used for topic modeling. The document includes mathematical notation and diagrams to illustrate key concepts.
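Bayes' law as described can be illustrated with a short sketch (the diagnostic-test framing and the parameter names are illustrative, not taken from the document):

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Posterior P(H | E) via Bayes' law:

    P(H|E) = P(E|H) P(H) / [ P(E|H) P(H) + P(E|~H) P(~H) ]
    """
    num = likelihood * prior
    denom = num + likelihood_given_not * (1 - prior)
    return num / denom
```

With a 1% prior, a 99% true-positive rate, and a 5% false-positive rate, the posterior is only 1/6: most positives come from the much larger negative population.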
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
This document provides an overview of research methods. It defines research and describes the scientific research process. Research is defined as a systematic investigation to discover and develop knowledge. The scientific research process involves four stages: exploration, description, explanation, and prediction. It also outlines key aspects of different types of research methods, including quantitative and qualitative approaches, and discusses challenges in applying scientific methods to social science research. The document emphasizes that research requires a systematic, objective, and rigorous approach.
This document discusses case study research design and methodology. It defines a case study as an empirical inquiry that investigates a contemporary phenomenon in its real-life context. Case studies rely on multiple sources of evidence that must converge to draw conclusions. The key components of case study research design are determining the study's questions, propositions, units of analysis, and linking data to propositions. Data collection involves gathering evidence through documentation, interviews, observations, and artifacts, requiring skills like effective questioning and listening without bias. Data is then organized and reported, with options including linear, comparative, chronological, or unstructured structures.
What is and what isn’t a good research question? Discover how to develop an impactful and significant research question by asking the right questions related to your field and area of study. This is a presentation developed through the Graduate Resource Center at the University of New Mexico.
Introduction to Model-Based Machine Learning – Daniel Emaasit
The field of machine learning has seen the development of thousands of learning algorithms. Typically, scientists choose from these algorithms to solve specific problems, and their choices are often limited by their familiarity with those algorithms. In this classical/traditional framework of machine learning, scientists are constrained to making certain assumptions so as to use an existing algorithm. This is in contrast to the model-based machine learning approach, which seeks to create a bespoke solution tailored to each new problem.
Meaning and introduction to educational research – Qazi GHAFOOR
This document discusses the meaning and introduction of research. It defines research as the formal and systematic application of scientific methods to study a problem. There are two main sources of knowledge: revealed knowledge from religious texts and acquired knowledge from personal experiences, experts, logic, and the scientific method. The scientific method involves recognizing a problem, formulating hypotheses, collecting and analyzing data, and stating conclusions. The document also discusses the need for research when problems exist, different types of research classified by purpose and method, and the importance of ethics in research.
The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
Content analysis is a research technique in which:
1) The presence of certain words or concepts within texts is determined through quantitative analysis.
2) Researchers quantify and analyze the presence, meanings, and relationships of words and concepts to make inferences about messages, writers, audiences, and cultural contexts.
3) A text is coded by breaking it into categories, then occurrences and relationships of concepts are examined through conceptual analysis (counting concepts) or relational analysis (examining relationships among concepts).
This document discusses the K-nearest neighbors (KNN) algorithm, an instance-based learning method used for classification. KNN works by identifying the K training examples nearest to a new data point and assigning the most common class among those K neighbors to the new point. The document covers how KNN calculates distances between data points, chooses the value of K, handles feature normalization, and compares strengths and weaknesses of the approach. It also briefly discusses clustering, an unsupervised learning technique where data is grouped based on similarity.
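The KNN procedure described here can be sketched in a few lines (Euclidean distance and majority vote, as in the summary; the data layout and function name are illustrative):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.

    train: list of (feature_vector, label) pairs; distance is Euclidean.
    """
    # Sort the training set by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A query point near the cluster of one class is assigned that class; note that in practice features should be normalized first, as the document points out, so no single feature dominates the distance.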
This document provides an overview of qualitative, quantitative, and mixed methods research approaches. It discusses the underlying principles, benefits, and limitations of each approach. Qualitative research is based on phenomenology and seeks to understand individual experiences, while quantitative research uses logical positivism to objectively measure variables and test hypotheses. Mixed methods combines both qualitative and quantitative approaches. The document analyzes how each approach could be applied to research on critical thinking in nursing education.
PowerPoint Presentation - Conditional Random Fields - A ... – butest
- Conditional random fields (CRFs) are probabilistic graphical models that can be used for labeling and segmenting sequential data. They generalize hidden Markov models (HMMs) by allowing dependencies between labels.
- CRFs are discriminative models that directly model the conditional probability of labels given observations, rather than the joint probability like generative models. This allows them to avoid problems with independence assumptions.
- Linear-chain CRFs are commonly used for sequential labeling tasks. They incorporate a large number of features without conditional independence assumptions, outperforming HMMs on problems like gene prediction. Parameter estimation is done with maximum likelihood.
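Decoding in a linear-chain model such as a CRF is typically done with the Viterbi algorithm; here is a minimal sketch (using toy per-position and transition scores rather than learned CRF feature weights, which are an assumption on our part):

```python
def viterbi(obs_scores, trans):
    """Best label sequence for a linear-chain model.

    obs_scores: list over positions of {label: score}
    trans: {(prev_label, label): score}, defaulting to 0.0 if absent
    Scores are log-potentials; we maximize their sum over the sequence.
    """
    labels = list(obs_scores[0])
    # best[l] = (score of best path ending in label l, that path)
    best = {l: (obs_scores[0][l], [l]) for l in labels}
    for scores in obs_scores[1:]:
        new = {}
        for l in labels:
            # Extend the best predecessor path with label l.
            prev_score, path = max(
                (best[p][0] + trans.get((p, l), 0.0), best[p][1])
                for p in labels)
            new[l] = (prev_score + scores[l], path + [l])
        best = new
    return max(best.values())[1]
```

With two positions scoring 'A' then 'B' highest and no transition penalties, the decoder returns ['A', 'B'].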
This document discusses Bayesian global optimization as a method for tuning machine learning models. It begins by outlining challenges with traditional tuning methods like grid search and random search. It then introduces Bayesian global optimization, which uses a Gaussian process model and expected improvement criterion to efficiently search the parameter space. The document provides examples of applying Bayesian optimization to deep learning tasks in MXNet and TensorFlow to achieve faster and better performance than traditional methods. It concludes by discussing tools for evaluating optimization strategies and comparing Bayesian optimization to baseline methods.
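The expected-improvement criterion mentioned here has a closed form under a Gaussian surrogate; a sketch (assuming the surrogate's predicted mean and standard deviation at a candidate point are given, for a maximization problem) is:

```python
from statistics import NormalDist

def expected_improvement(mean, std, best_so_far):
    """Expected improvement at a point where the surrogate model
    (e.g. a Gaussian process) predicts `mean` and `std`:

    EI = (mu - f*) * Phi(z) + sigma * phi(z),  z = (mu - f*) / sigma
    """
    if std == 0:
        # No predictive uncertainty: improvement is deterministic.
        return max(0.0, mean - best_so_far)
    z = (mean - best_so_far) / std
    n = NormalDist()
    return (mean - best_so_far) * n.cdf(z) + std * n.pdf(z)
```

Note that EI is positive even where the predicted mean is below the incumbent, as long as the uncertainty is nonzero; this is what drives the exploration behavior that grid and random search lack.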
Research Methods in Natural Language Processing (2018 version) – Minh Pham
Updated version of my lecture slides on "Research Methods in Natural Language Processing" for the course RAW-501 in the Master's program at FPT University.
This document provides guidance on writing projects. It discusses how to plan a project by defining the vision and current reality, and determining action steps. When selecting a topic, one should identify their strengths, consider innovativeness, and identify gaps through critical thinking and research. The document also reviews how to scope problems, choose a title, perform critical reading and analysis, work on the project, and discuss results. In summary, the document offers a comprehensive overview of how to plan, develop and execute a successful project from start to finish.
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, python and MLSpark.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
A short presentation for beginners on Introduction of Machine Learning, What it is, how it works, what all are the popular Machine Learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning) and how they works with various Industry use-cases and popular examples.
This presentation provides an overview of boosting approaches for classification problems. It discusses combining classifiers through bagging and boosting to create stronger classifiers. The AdaBoost algorithm is explained in detail, including its training and classification phases. An example is provided to illustrate how AdaBoost works over multiple rounds, increasing the weights of misclassified examples to improve classification accuracy. In conclusion, AdaBoost is highlighted as an effective approach for classification problems where misclassification has severe consequences by producing highly accurate strong classifiers.
This document discusses how to create effective research questions to guide research. It explains that research questions map out the direction of the research. An effective research question needs information from sources beyond yourself, requires background research, and is neither too broad nor too narrow in scope. There are two types of questions: "thin" questions like who, what, when, where that provide background details, and "thick" questions using how and why that explore broader concepts and changes over time. The document provides examples of each and guides the reader in forming their own thick questions.
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
This document provides an overview of Bayes law, Bayesian networks, and latent Dirichlet allocation (LDA). It begins with an explanation of Bayes law and examples of how it can be used. Next, it defines Bayesian networks as probabilistic graphical models and provides examples. Finally, it introduces LDA as a statistical model for collections of discrete data like text corpora and explains how it can be used for topic modeling. The document includes mathematical notation and diagrams to illustrate key concepts.
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
This document provides an overview of research methods. It defines research and describes the scientific research process. Research is defined as a systematic investigation to discover and develop knowledge. The scientific research process involves four stages: exploration, description, explanation, and prediction. It also outlines key aspects of different types of research methods, including quantitative and qualitative approaches, and discusses challenges in applying scientific methods to social science research. The document emphasizes that research requires a systematic, objective, and rigorous approach.
This document discusses case study research design and methodology. It defines a case study as an empirical inquiry that investigates a contemporary phenomenon in its real-life context. Case studies rely on multiple sources of evidence that must converge to draw conclusions. The key components of case study research design are determining the study's questions, propositions, units of analysis, and linking data to propositions. Data collection involves gathering evidence through documentation, interviews, observations, and artifacts, requiring skills like effective questioning and listening without bias. Data is then organized and reported, with options including linear, comparative, chronological, or unstructured structures.
What is and what isn’t a good research question? Discover how to develop an impactful and significant research question by asking the right questions related to your field and area of study. This is a presentation developed through the Graduate Resource Center at the University of New Mexico.
Introduction to Model-Based Machine LearningDaniel Emaasit
The field of machine learning has seen the development of thousands of learning algorithms. Typically, scientists choose from these algorithms to solve specific problems. Their choices often being limited by their familiarity with these algorithms. In this classical/traditional framework of machine learning, scientists are constrained to making some assumptions so as to use an existing algorithm. This is in contrast to the model-based machine learning approach which seeks to create a bespoke solution tailored to each new problem.
Meaning and introduction to educational researchQazi GHAFOOR
This document discusses the meaning and introduction of research. It defines research as the formal and systematic application of scientific methods to study a problem. There are two main sources of knowledge: revealed knowledge from religious texts and acquired knowledge from personal experiences, experts, logic, and the scientific method. The scientific method involves recognizing a problem, formulating hypotheses, collecting and analyzing data, and stating conclusions. The document also discusses the need for research when problems exist, different types of research classified by purpose and method, and the importance of ethics in research.
The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
Content analysis is a research technique in which:
1) The presence of certain words or concepts within texts is determined through quantitative analysis.
2) Researchers quantify and analyze the presence, meanings, and relationships of words and concepts to make inferences about messages, writers, audiences, and cultural contexts.
3) A text is coded by breaking it into categories, and occurrences and relationships of concepts are then examined through conceptual analysis (counting concepts) or relational analysis (examining relationships among concepts).
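The "counting concepts" step of conceptual analysis can be illustrated with a minimal sketch; the sample text and concept list below are invented for illustration, not taken from the document:

```python
import re
from collections import Counter

def conceptual_analysis(text, concepts):
    """Count how often each concept (here, a single word) appears in the text --
    the 'counting concepts' step of conceptual analysis."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {c: counts[c] for c in concepts}

doc = "The committee debated the policy. Policy reform was urgent, the committee agreed."
print(conceptual_analysis(doc, ["committee", "policy", "reform"]))
# → {'committee': 2, 'policy': 2, 'reform': 1}
```

Relational analysis would go one step further, e.g. counting how often two concepts co-occur within the same sentence or coding window.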
This document discusses the K-nearest neighbors (KNN) algorithm, an instance-based learning method used for classification. KNN works by identifying the K training examples nearest to a new data point and assigning the most common class among those K neighbors to the new point. The document covers how KNN calculates distances between data points, chooses the value of K, handles feature normalization, and compares strengths and weaknesses of the approach. It also briefly discusses clustering, an unsupervised learning technique where data is grouped based on similarity.
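The classification rule described above (find the K nearest training points, assign the majority class among them) can be sketched in a few lines of Python; the toy data and k=3 are illustrative, not from the slides:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training example
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Most common class among the k closest points
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

# Toy 2-D data: class "a" clusters near (0, 0), class "b" near (5, 5)
train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))  # → a
print(knn_predict(train, labels, (5.5, 5.5)))  # → b
```

In practice, as the document notes, features are normalized first (e.g. min-max scaling) so that no single feature dominates the distance computation.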
This document provides an overview of qualitative, quantitative, and mixed methods research approaches. It discusses the underlying principles, benefits, and limitations of each approach. Qualitative research is based on phenomenology and seeks to understand individual experiences, while quantitative research uses logical positivism to objectively measure variables and test hypotheses. Mixed methods combines both qualitative and quantitative approaches. The document analyzes how each approach could be applied to research on critical thinking in nursing education.
PowerPoint Presentation - Conditional Random Fields - A ... — butest
- Conditional random fields (CRFs) are probabilistic graphical models that can be used for labeling and segmenting sequential data. They generalize hidden Markov models (HMMs) by allowing dependencies between labels.
- CRFs are discriminative models that directly model the conditional probability of labels given observations, rather than the joint probability like generative models. This allows them to avoid problems with independence assumptions.
- Linear-chain CRFs are commonly used for sequential labeling tasks. They incorporate a large number of features without conditional independence assumptions, outperforming HMMs on problems like gene prediction. Parameter estimation is done with maximum likelihood.
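As a rough illustration of the discriminative modeling described in the bullets, the sketch below computes log p(y | x) for a linear-chain CRF from given emission and transition scores, using the forward algorithm for the partition function. The random scores are purely illustrative; a real CRF would derive them from weighted feature functions and learn the weights by maximum likelihood:

```python
import itertools
import numpy as np

def crf_log_prob(emit, trans, y):
    """Log p(y | x) for a linear-chain CRF.
    emit[t, s]  -- emission score of label s at position t
                   (in a real CRF: dot product of features and weights);
    trans[s, r] -- transition score from label s to label r."""
    T, S = emit.shape
    # Unnormalized score of the given label sequence
    score = emit[0, y[0]] + sum(trans[y[t - 1], y[t]] + emit[t, y[t]]
                                for t in range(1, T))
    # Forward algorithm for the log partition function log Z(x)
    alpha = emit[0].copy()
    for t in range(1, T):
        m = alpha[:, None] + trans                       # m[s, r]
        mx = m.max(axis=0)
        alpha = emit[t] + mx + np.log(np.exp(m - mx).sum(axis=0))
    return score - (alpha.max() + np.log(np.exp(alpha - alpha.max()).sum()))

# Sanity check on random scores: p(y | x) over all label sequences sums to 1
rng = np.random.default_rng(0)
emit, trans = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
total = sum(np.exp(crf_log_prob(emit, trans, y))
            for y in itertools.product(range(2), repeat=3))
print(round(total, 6))  # → 1.0
```

This is exactly the sense in which a CRF is discriminative: the model normalizes over label sequences given the observations, rather than modeling observations jointly as an HMM does.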
This document discusses Bayesian global optimization as a method for tuning machine learning models. It begins by outlining challenges with traditional tuning methods like grid search and random search. It then introduces Bayesian global optimization, which uses a Gaussian process model and expected improvement criterion to efficiently search the parameter space. The document provides examples of applying Bayesian optimization to deep learning tasks in MXNet and TensorFlow to achieve faster and better performance than traditional methods. It concludes by discussing tools for evaluating optimization strategies and comparing Bayesian optimization to baseline methods.
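A minimal sketch of the Gaussian-process-plus-expected-improvement loop the document describes (not the actual SigOpt/MXNet/TensorFlow setup); the 1-D objective, kernel length-scale, candidate grid, and iteration budget are all illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Gaussian-process posterior mean and std at candidate points Xs."""
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for maximization: E[max(f - best, 0)] under the posterior."""
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

f = lambda x: -(x - 0.7) ** 2              # hidden objective, maximum at 0.7
X = np.array([0.1, 0.5, 0.9]); y = f(X)    # initial design points
cand = np.linspace(0, 1, 201)              # candidate grid
for _ in range(10):                        # BO loop: fit GP, pick argmax EI
    mu, sd = gp_posterior(X, y, cand)
    x_next = cand[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print(X[np.argmax(y)])  # best x found; should be close to the optimum 0.7
```

The contrast with grid/random search is that each new evaluation is chosen where the model's trade-off between predicted value (mu) and uncertainty (sigma) is most promising, so far fewer objective evaluations are needed.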
Research Methods in Natural Language Processing (2018 version) — Minh Pham
Updated version of my lecture slides about "Research Methods in Natural Language Processing" for the course RAW-501 in the Master program of FPT University.
This document provides guidance on writing projects. It discusses how to plan a project by defining the vision and current reality, and determining action steps. When selecting a topic, one should identify their strengths, consider innovativeness, and identify gaps through critical thinking and research. The document also reviews how to scope problems, choose a title, perform critical reading and analysis, work on the project, and discuss results. In summary, the document offers a comprehensive overview of how to plan, develop and execute a successful project from start to finish.
This document provides an overview of a workshop on planning academic papers. It discusses developing an outline for a paper, including typical sections like introduction, background, methodology, results, discussion, and conclusions. Previous sessions covered types of publications and what makes a good paper. This session focuses on paper structure and developing an outline, with tips like choosing a paper type, finding an example paper, and starting with a generic structure to customize. The goal is for participants to understand common paper elements and be able to start developing their own outline by the end of the workshop.
This document provides information on scientific research papers and their structure and purpose. It discusses that research papers present an interpretation or evaluation of an argument based on what is known about a subject. When writing a research paper, authors build upon existing knowledge and survey relevant fields to find the best information. Research can be published in many areas, including science, arts, humanities, religion, and management. For scientific publications specifically, the work must be public, objective, predictive, reproducible, systematic, and cumulative. Key parts of a research paper include the introduction, methods, analysis, results, discussion, and conclusions. The document provides guidance on how to effectively read and evaluate a scientific research paper.
This document provides information on scientific research papers and their structure and purpose. It discusses key parts of a research paper including the introduction, methods, results, and discussion sections. It emphasizes that the goal of a scientific paper is to advance knowledge in a field by presenting a research study and its findings in a clear, objective manner so that other experts can analyze and build upon the work. Overall, the document serves as a guide for writing and reading scientific research papers effectively.
This document provides information on scientific research papers and how to read them effectively. It discusses that research papers are an important part of the scientific community and involve building upon existing knowledge in a field. It also outlines the key sections of a research paper such as the introduction, methods, results and discussion. The document emphasizes that to critically read a research paper, one should understand the problem being studied and evaluated, understand the proposed methodology, and evaluate the assumptions, findings and conclusions presented. It stresses reading research papers actively and constructively in order to gain insights and identify areas for further study.
Systematic Literature Reviews: Concise Overview — youkayaslam
This document provides an overview of a workshop on systematic approaches to literature reviewing led by Dr. Mark Matthews. The workshop explores elements of the systematic review process and how they can be adapted for thesis literature reviews and keeping up with literature through a PhD. It discusses formulating review questions, systematically searching literature databases and other sources, selecting studies, critically appraising research, analyzing and synthesizing findings, and structuring the writing of literature reviews. Challenges of literature reviews and additional resources are also presented.
This document provides a template for presenting a journal club, including guidelines for selecting a paper, structuring the presentation, and evaluating the paper. Some key points include: (1) The paper should be of interest to both the presenter and audience, be recently cited but not just published, and report a novel method or application. (2) The presentation should be 30 minutes with 20-25 spent summarizing the paper and 5+ for discussion. (3) The outline includes introducing the biomedical problem, methods, results, evaluation, presenter's assessment, and conclusions. (4) The assessment considers the paper's informatics and biomedical contributions as well as any limitations.
The document provides an outline for writing a research proposal and report. It discusses the typical elements and structure, including:
1) Elements such as the title page, problem statement, objectives, literature review, methodology, and references.
2) Developing the proposal involves choosing a topic, formulating research questions, outlining literature, deciding on methods, and proposing timelines and resources.
3) Research proposals and reports generally have five chapters: introduction, literature review, methodology, analysis, and conclusions. Each chapter contains standard sections.
Research seminar lecture_7_criteria_good_research — Daria Bogdanova
This document provides an overview and review of key aspects of educational research. It discusses what educational research is and the main types of research. It outlines the typical steps in conducting research, including identifying a research problem, conducting a literature review, developing research questions and hypotheses, identifying needed data, data collection methods, data analysis, findings, discussion, and conclusions. Good research is defined as having a sound rationale, clear aims, a relevant theoretical basis, well-defined research questions, an appropriate methodology, contributions to the field, and consistency between all steps. Typical mistakes include having too much background and too little on the specific current research, as well as weaknesses in feasibility or scope.
RESEARCH METHODOLOGY_ STEP BY STEP RESEARCH METHODOLOGY CHAPTER_.pdf — MATIULLAH JAN
What the methodology chapter is and why it is important.
How to structure and write up the methodology chapter:
- The research design
- The research philosophy
- The research type (e.g. inductive research)
- The research strategy (e.g. experimental research)
- The time horizon
- The sampling strategy
- The data collection method
- The analysis methods and techniques
- The methodological limitations
This document provides guidance on publishing research results in academic journals. It outlines the typical components of a research paper, including an introduction describing the purpose and literature review, methods, results presented in tables and discussed in text, and a discussion and conclusion section. The document also offers tips on selecting an appropriate journal based on its aims and scope, following author instructions carefully, and having others edit the paper before submitting.
This document provides tips and strategies for effectively reading academic papers. It discusses deciding what papers to read based on relevance and credibility. It recommends making best use of academic resources like preprint sites, blogs, and mailing lists to stay updated. It explains the importance of reading for breadth to understand the big picture and reading for depth to critically examine assumptions, methods, statistics and conclusions. The document concludes by discussing how to take notes and think creatively after reading papers to develop new research ideas.
Writing an effective Poster: the point of view of experts, novices and litera... — Elisabetta Cigognini
The document discusses guidelines for effective scientific poster design from experts, novices, and literature. It analyzes posters created by students to identify problematic design elements. Experts agree that posters should have a clear organization, use large readable fonts, select key information, and limit text length and decorative images. While students struggled with these principles, experts note similar issues still appear in some experienced researchers' posters as well.
This document provides guidance on writing reports based on research. It discusses defining objectives for the report based on reader needs. It also covers conducting sound research through understanding context, defining questions precisely, and using credible methods. The document reviews planning a report's organization and structure, as well as drafting, revising, and crafting different report sections like the introduction, methods, results, discussion, conclusions and recommendations. Key elements of each section are outlined. Readers are given a writing assignment to help apply these report writing concepts.
This document discusses key aspects of the scientific research process and publishing findings, including:
1) The typical phases of the scientific method such as developing a research question, conducting background research, forming a hypothesis, designing and conducting experiments, analyzing results, and publishing findings.
2) Guidelines for publishing research including selecting appropriate publication venues based on their prestige, impact factor, and indexing in databases. Conferences, journals, books, and dissertation are discussed as common publication types.
3) Metrics for measuring research impact, including the number of citations, journal impact factor, and the h-index, which provides an indicator of productivity and citation impact. Resources for identifying publications and metrics, like Web of Science, DBLP, and Google Scholar, are also discussed.
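The h-index mentioned in point 3 is straightforward to compute: it is the largest h such that the author has h papers with at least h citations each. A small sketch with made-up citation counts:

```python
def h_index(citations):
    """h-index: the largest h such that h papers have >= h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([10, 8, 5, 4, 3]))  # → 4 (four papers with at least 4 citations)
print(h_index([25, 8, 5, 3, 3]))  # → 3
```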
RES 3024 Presentation 3a Understanding Academic Articles.ppsx — MatthewLewis227954
This document provides an overview of understanding academic articles. It discusses important terms related to academic research like academic journals, peer review, and empirical research. It then describes how to find academic articles using online databases and search tools. The major sections of an empirical study are outlined as abstract, introduction and literature review, methods, results, discussion and conclusions, and references. A 4-step reading strategy is proposed that involves reading the abstract, skimming the introduction and discussion, skimming the methods and results, and then reading the introduction and discussion in depth. Computer-based tools for efficiently finding relevant literature are also briefly mentioned.
This document provides an overview of research and the research process. It discusses that research involves asking questions and finding answers through systematic procedures. Research can be qualitative, involving more subjective methods, or quantitative, using more objective methods. The goal of research is to describe phenomena, determine causes of behavior, predict behavior, and explain behavior. Strong research is theory-driven, testable, replicable, and seeks to minimize bias. The research process involves forming a question or hypothesis, designing a study, collecting and analyzing data, and drawing conclusions. Presentations of research should be clear, well-organized, and visually engaging for audiences.
Similar to Research Methods in Natural Language Processing (20)
Prompt Engineering Tutorial: How to write effective prompts with ChatGPT — Minh Pham
A lecture on using prompt engineering effectively with ChatGPT. After completing the lecture, learners will understand the basic structure of a prompt and know how to design prompts effectively and efficiently.
AimeLaw at ALQAC 2021: Enriching Neural Network Models with Legal-Domain Know... — Minh Pham
Our presentation slide at the 13th IEEE International Conference on Knowledge and Systems Engineering (KSE 2021).
In this paper, we present our participating systems for three Vietnamese legal text processing tasks at the Automated Legal Question Answering Competition (ALQAC 2021). In our systems, we leverage the strengths of traditional information retrieval methods (BM25), pre-trained masked language models (BERT), and legal domain knowledge. Our proposed methods help to overcome the shortage of training data. In particular, for the legal textual entailment task, we propose a novel data augmentation method based on legal domain knowledge. Evaluation results show the effectiveness of our proposed methods.
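As context for the BM25 component mentioned above, here is a minimal sketch of the standard Okapi BM25 scoring formula; the toy corpus and the parameter values (k1=1.5, b=0.75 are common defaults) are illustrative, not the paper's actual configuration:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 relevance of tokenized `doc` to `query`, given corpus `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N       # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

docs = [["court", "ruled", "on", "tax", "appeal"],
        ["tax", "law", "tax", "reform"],
        ["weather", "was", "sunny"]]
query = ["tax", "law"]
scores = [bm25_score(query, d, docs) for d in docs]
print(scores.index(max(scores)))  # → 1 (the document matching both query terms)
```

In a retrieval pipeline like the one described, BM25 typically serves as a fast first-stage ranker whose candidates are then re-scored by a neural model such as BERT.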
A Multimodal Ensemble Model for Detecting Unreliable Information on Vietnames... — Minh Pham
This document proposes a multimodal ensemble model for detecting unreliable information on Vietnamese social media. It uses text, image, and metadata features as inputs to three deep learning models - BERT+CNN, and two variants with additional CNN layers. An attention mechanism is applied to learn which image parts to focus on for each text. The models are ensembled by averaging their prediction probabilities. Evaluation on a private test set shows the ensemble model achieves an AUC of 0.945, outperforming the individual models. Future work could involve comparing posts to external sources to find evidence of fakes.
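The probability-averaging ensemble described above can be sketched in a few lines; the per-model probabilities below are made-up numbers for illustration, not results from the paper:

```python
import numpy as np

# Hypothetical per-model probabilities that each post is unreliable
# (rows: models, columns: posts) -- the numbers are illustrative only.
model_probs = np.array([
    [0.90, 0.20, 0.60],   # e.g. a text-based model
    [0.80, 0.35, 0.40],   # e.g. an image-based model
    [0.85, 0.10, 0.55],   # e.g. a metadata-based model
])
ensemble = model_probs.mean(axis=0)          # average the probabilities
predictions = (ensemble >= 0.5).astype(int)  # threshold at 0.5
print(ensemble.round(3).tolist())  # → [0.85, 0.217, 0.517]
print(predictions.tolist())        # → [1, 0, 1]
```

Averaging probabilities (rather than hard votes) lets a confident model outvote two uncertain ones, which is one reason probability ensembles often beat their individual members on metrics like AUC.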
Research methods for engineering students (v.2020) — Minh Pham
Students who are beginning research may face many difficulties, from choosing a good research topic to start with, to developing new ideas, to implementing models to test their ideas and writing papers. Research is a craft skill: you only learn it by doing. However, it helps to learn some know-how about doing research. In this lecture, I share how-to-do-research information for engineering students, with the hope that it will help students save time in the early stages of doing research.
A document introducing basic knowledge of AIML and how to use it when developing chatbots. To apply it more effectively, readers should consult more detailed materials.
Artificial neural networks and applications in natural language processing — Minh Pham
Slides from a talk at an event hosted by the company rubikAI. The presentation covers the basics of neural networks and their applications in natural language processing.
Slides from a presentation at al+ AI Seminar #4 on the best paper award winner at NAACL 2018:
Peters et al., 2018. Deep Contextualized Word Representations. In NAACL.
Original paper: http://aclweb.org/anthology/N18-1202
ELMo is a context-dependent word representation model learned from a bidirectional language model. ELMo has been applied to many different tasks and achieved state-of-the-art results on many datasets.
A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Ev... — Minh Pham
The presentation of a feature-based model for nested named-entity recognition at VLSP 2018. Our system ranked first among participating systems. There is still a gap between the accuracy on the development set and the test set.
On the attention technique in sequence-to-sequence models at ACL 2017 — Minh Pham
A presentation on the attention technique in sequence-to-sequence models and its applications in NLP research at ACL 2017. We also summarize several other interesting studies from the conference.
Natural language processing problems in chatbot development — Minh Pham
A presentation on the natural language processing problems involved in developing retrieval-based chatbot systems. Neural conversation-generation models (neural chatbots) are also covered.
Introduction to natural language processing — Minh Pham
This document provides an introduction to natural language processing (NLP). It discusses what NLP is, why NLP is a difficult problem, the history of NLP, fundamental NLP tasks like word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis, and applications of NLP like information retrieval, question answering, text summarization and machine translation. The document aims to give readers an overview of the key concepts and challenges in the field of natural language processing.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu... — Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
Mutualism allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship allows the organisms to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
Compound A → utilized by population 1 → Compound B → utilized by population 2 → Compound C → utilized by both populations 1 and 2 → products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the cooperation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations 1 and 2 are then able to carry out metabolic reactions leading to an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal media, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf — Selcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
The cost of acquiring information by natural selection — Carl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Authoring a personal GPT for your research and practice: How we created the Q... — Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S... — Sérgio Sacani
We report the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ± 0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
The binding of cosmological structures by massless topological defects — Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ... — Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at z = 2.9 wi... — Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in the host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9) despite a host galaxy with low extinction and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
The debris of the ‘last major merger’ is dynamically young — Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
Gadgets for management of stored product pests_Dr.UPR.pdf
Research Methods in Natural Language Processing
1. Research Methods in Natural Language Processing
Pham Quang Nhat Minh
FPT Technology Research Institute
FPT University
minhpqn2@fe.edu.vn
April 16, 2017
2. Objectives of the lecture
Introduce research know-how and good practices for doing
research
Focus on the NLP/Machine Learning/Data Science fields
Share my research experiences in the NLP field
Pham Quang Nhat Minh Research Methods in NLP 2/70
3. Table of Contents
1 What are empirical research methods for computer science?
2 How to choose a good research topic?
3 How to read a scientific paper?
4 How to work with your advisor
5 Doing research in NLP field
What is NLP?
What is it like doing research in NLP?
How to do research in NLP?
How to choose NLP papers to read?
6 Coding practices for NLP/Machine Learning research work
7 My research stories
4. Acknowledgements
Much of the content in this lecture comes from the documents in the
references
(Alon, 2009) How To Choose a Good Scientific Problem
(Wilson et al., 2012) Best Practices for Scientific Computing
Paul Cohen: Empirical Methods for AI & CS
Other documents, blogs
6. What does “empirical” mean?
Relying on observations, data, experiments
Empirical work should complement theoretical work
Theories often have holes (e.g., How big is the constant term?)
Theories are suggested by observations
Theories are tested by observations
Conversely, theories direct our empirical attention
In addition, empirical means “wanting to understand
behaviour of complex systems”
In NLP, we may want to understand how features are
correlated
7. Why do we need empirical methods?
Theory-based science need not be all theorems
We do not know how a theory works under different conditions
Different data sets, domains
8. Empirical methods in CS/AI
Data observation
Construct hypotheses
Test with empirical experiments
Refine hypotheses and modelling assumptions
9. Kinds of data analysis
Exploratory (EDA) - looking for patterns in data
Statistical inferences from sample data
Testing hypotheses
Estimating parameters
Building mathematical models of datasets
Machine learning, data mining...
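As a small illustration of statistical inference from sample data, the snippet below runs a two-sample t-test with scipy; the samples are synthetic stand-ins (e.g., imagine scores of two systems over 20 test splits), not real measurements:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: scores of two hypothetical systems on 20 test splits
rng = np.random.default_rng(0)
scores_a = rng.normal(loc=0.80, scale=0.02, size=20)
scores_b = rng.normal(loc=0.83, scale=0.02, size=20)

# Test the null hypothesis that the two systems have the same mean score
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests a real difference
```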
10. Tools for data analysis
R programming language
Python:
numpy
scipy
pandas
matplotlib for data visualization
My biased opinions:
statisticians tend to prefer R; computer scientists often use Python
Python is much easier to learn than R
11. Exercises
Install R: https://www.r-project.org
Download the data file ex1data1.txt from:
http://tinyurl.com/m7bpp8d
The data file has two columns:
First column: the population of a city.
Second column: the profit of a food truck in that city.
In R terminal, try the plot code
df <- read.table("./ex1data1.txt", sep=",", header=FALSE)
plot(df[,1], df[,2], xlab="Population of City in 10,000s",
ylab="Profit in $10,000s")
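For those who prefer Python, a minimal pandas/matplotlib sketch of the same scatter plot; the inline rows are a stand-in for the first lines of ex1data1.txt (in practice, use pd.read_csv on the downloaded file):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend: write to a file instead of a window
import matplotlib.pyplot as plt

# Stand-in for read.table("./ex1data1.txt", sep=","); real code: pd.read_csv(...)
df = pd.DataFrame([[6.1101, 17.592], [5.5277, 9.1302],
                   [8.5186, 13.662], [7.0032, 11.854]],
                  columns=["population", "profit"])

plt.scatter(df["population"], df["profit"])
plt.xlabel("Population of City in 10,000s")
plt.ylabel("Profit in $10,000s")
plt.savefig("ex1_scatter.png")
```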
12. R for data visualization
14. Why do we need to choose a good research topic?
“Garbage in, garbage out” principle
You may work with a research topic for years
1 year for a master thesis
3 years or more for a Ph.D. dissertation
It is painful to work on things you find uninteresting
You lack passion, motivation, and ideas
Much frustration and bitterness
15. What is a good research topic?
(Alon, 2009) Two Dimensions of Problem Choice
Feasibility: whether a problem is hard or easy
We can measure the feasibility as the expected time to
complete the project
Feasibility is a function of the skills of students/researchers
and of the technology in the lab.
Interest: the increase in knowledge expected from the project.
16. Two-dimensional space of Problem Choice (1)
Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
17. Two-dimensional space of Problem Choice (2)
Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
18. What is a good research topic?
Do many people care about the topic?
Research community, your supervisors, industry demands
Are you really interested in the topic?
The topic should be interesting to you rather than to others
Good signs: “ideas and questions that come back again and
again to your mind for months or years.”
19. How to choose a good research topic: step by step
Choose the broad (general) topic
E.g., Machine Translation
Draw a hierarchy of research topics, starting from the broad
topic
Review literature to look for gaps in previous work
Choose the focused topic
E.g., Phrase-based Machine Translation
Find gaps in previous work
Form research questions in the focused topic
From research questions, formulate the research problem
20. Finding a research problem
Take your time to choose a good research topic
(Alon, 2009): Rule for new Ph.D. students and postdocs: “Do
not commit to a problem before 3 months have elapsed”
For master students, take 1-2 months to choose the research
topic before you start the research project.
Join projects in your laboratory
Many thesis ideas come from projects you are involved in
21. Developing your research ideas
Where do research ideas come from?
Observations
Data observations, data analysis, discover patterns in data
Reading papers, attending conferences, listening to talks
Techniques and methods from other disciplines, fields
Imagination
Suggestions from your advisor
22. Reading papers, attending conferences
Choose good and relevant papers. Consider:
Impact factors of the journal.
In the NLP field, choose papers from top conferences, journals
(ACL/NAACL/EMNLP/COLING)
The Top 10 NLP Conferences:
http://www.junglelightspeed.com/the-top-10-nlp-conferences
Reputations of authors and their organizations
Do not just read papers; criticize them and find the gaps
23. Techniques, methods from other fields
Expand your view and your problem-solving methodologies by
regularly reading articles in other fields.
An example is the image captioning task
We need techniques from both computer vision and
NLP.
24. What happens after we choose a problem? (Alon, 2009)
26. Two types of readings
Fast readings
Get and understand the basic ideas of the paper
Know the problem the paper attacks and how it solves it
Put the paper in the “big picture” of the field
Know the differences between the paper and previous
work
We do “fast reading” mostly when we survey the literature and
choose a broad topic
Deep readings
Understand the details of presented methods
Try to understand how the proposed method works
Criticize the paper and find its limitations
If you were the authors, how would you solve the problem?
Propose alternative methods?
We do “deep reading” mostly when we look for a focused topic
27. How to read a scientific paper (1)
Michael J. Hanson. Efficient Reading of Papers in Science and Technology: http://tinyurl.com/qdebynz
28. How to read a scientific paper (2)
Decide what to read
Read title, abstract
Read it, file it, or skip it
Read for breadth
What did they do?
Skim introduction, headings, graphics, definitions, conclusions
and bibliography.
Consider the credibility.
How useful is it?
Decide whether to go on.
29. How to read a scientific paper (3)
Read in depth
How did they do it?
Challenge their arguments.
Examine assumptions.
Examine methods.
Examine statistics.
Examine reasoning and conclusions.
How can I apply their approach to my work?
Take notes
Make notes as you read.
Highlight major points.
Note new terms and definitions.
Summarize tables and graphs.
Write a summary.
30. Homework
Choose one scientific article that you want to read in depth; read it,
take notes, and explain the ideas and methods presented in the paper to
other students in a simple way.
Notes: You should be able to answer the following 3 questions.
What is the problem the paper attacks?
What are the differences between the paper and other existing
papers?
What are the interesting points of the presented methods?
32. Some basic rules
Your advisor is supposed to be very busy, so you should follow
up with him/her
Schedule meetings in advance and ask for them explicitly
Keep regular meetings with your advisor
Usually a weekly meeting
Do not just do what your advisor tells you to do
Rule of thumb: you should finish all your assigned tasks
before working on your own ideas
33. How to write a progress/status report
Michael Ernst. Writing a progress/status report:
http://tinyurl.com/zp7cdvt
Quote the previous week’s plan.
This helps you determine whether you accomplished your goals.
State this week’s progress.
What you have accomplished,
What you learned, what difficulties you overcame, what
difficulties are still blocking you,
Your new ideas for research directions or projects, etc.
Give the next week’s plan.
A good format is a bulleted list
Try to make each goal measurable: there should be no
ambiguity as to whether you were able to finish it.
It’s good to include longer-term goals as well.
34. Communicate with your advisor
Prepare some slides (3-4 slides) to make the discussion
concrete
Send the materials at least 24 hours before the meeting day
Arrange the meeting in advance
Your advisor is not always right
Actually, you know more about your work than he/she does
If you have data, evidence, or proof, do not hesitate to debate
Do not say “I guess” or “I think” when you explain something.
Use data, evidence, and references instead
36. What is Natural Language Processing?
A field of computer science, artificial intelligence, and
computational linguistics
To get computers to perform useful tasks involving human
languages
Human-Machine communication
Improving human-human communication
E.g., Machine Translation
Extracting information from texts
37. Why is NLP interesting?
Languages involve many human activities
Reading, writing, speaking, listening
Voice can be used as a user interface in many applications
Remote controls, virtual assistants like Siri, ...
NLP is used to acquire insights from massive amounts of
textual data
E.g., hypotheses from medical and health reports
NLP has many applications
NLP is hard!
38. NLP problems
Fundamental problems
Word Segmentation
Part-of-speech tagging
Syntactic Analysis
Semantic Analysis
Application problems
Information Retrieval
Information Extraction
Question Answering
Text Summarization
Machine Translation
39. What is it like doing research in NLP?
Empirical methods are widely applied in NLP
Relying on observations, data, experiments
Research consists of many experimental loops
Identify the problem → Create ideas → Test the best idea →
Analyse results → Identify the problem → Create ideas → · · ·
40. What is it like doing research in NLP?
Many ideas do not work
Even so, we need to analyse the results and understand
why they do not work, in order to come up with new ideas.
Try the next idea
Failures occur more often than successes
Try to increase the number of experiments
(No of successes) = (No of experiments) × (Success rate)
41. The typical working day of an NLP researcher
Data observation and data/result analysis (a lot)
Discuss ideas with colleagues
Do experiments (run the program) to test ideas
Read papers to keep up to date with mainstream research
Investigate new NLP/Machine Learning tools, libraries (less
regular)
42. How to learn NLP?
Research starts from learning
Learn/review background about:
Probability and Statistics
Basic math (linear algebra, calculus)
Machine Learning
Programming
Read NLP textbooks
Jurafsky, D., & Martin, J.H. Speech and Language Processing:
an Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition.
Manning, C.D., & Schutze, H. Foundations of statistical
natural language processing.
43. How to learn NLP: Get your hands dirty
Practice with programming exercises:
100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises
NLP Programming Tutorial, by Graham Neubig:
http://www.phontron.com/teaching.php
Compete in Kaggle data science challenges (kaggle.com)
44. Finding an NLP research problem
All the principles in the section “How to choose a good
research topic” apply.
Looking for ideas from related fields
Linguistics
Machine learning: the mainstream in the NLP field is applying
machine learning methods to NLP problems
Computer vision
Looking at data
It is actually my daily task
45. Basic rules to choose NLP papers
READ:
Papers in top conferences and journals in NLP and other
related fields
(ACL/EMNLP/NAACL/EACL/COLING/CoNLL/...)
Workshops that focus on an NLP sub-field
Short papers at top conferences
PhD dissertations from top institutions/advisors
Papers with many citations
Textbooks from leading researchers
For more information, see: The Top 10 NLP Conferences:
http://www.junglelightspeed.com/the-top-10-nlp-conferences/
47. Why is coding important in NLP/ML research?
Most NLP/ML research work consists of empirical studies
Need to do data analysis, run experiments to test our ideas
So, we have to write programs
Even theorists should program, too
“Implementing your own algorithm is a good way of checking
your work. If you aren’t implementing your algorithm,
arguably you’re skipping a key step in checking your results.”
—Michael Mitzenmacher
http://mybiasedcoin.blogspot.com/2008/11/bugs.html
48. Why do we care about coding practices in NLP research?
Bad coding practices cause problems
You find errors in the experimental results right before the
paper submission deadline
You cannot understand your own code after some months
You deleted intermediate results, so you cannot verify the code
You do not know the technique to verify experimental results
You did not test the code, and then use untested code for
experiments
You spend a long time refactoring the code
You cannot get back the version that generated the best
results
...
49. Why do we care about coding practices in NLP research?
Good coding practices speed up our research work
Recall that:
(No of successes) = (No of experiments) × (Success rate)
50. Best Practices for Scientific Computing
(Wilson et al., 2012)
1- Write programs for people, not computers.
Readers of the code do not need to remember too much
Easy to read: names should be consistent, distinctive, and
meaningful
Break down the coding work into one-hour-long tasks
2- Automate repetitive tasks
Scientists should rely on the computer to repeat tasks.
Use a script to run programs!
Use a build tool to automate scientific workflows
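A minimal sketch of automating repeated runs in Python; the script name train.py and its flags are hypothetical placeholders, not a real tool:

```python
import itertools
import sys

# Hypothetical parameter grid; "train.py", "--lr" and "--ngram" are placeholders
learning_rates = [0.1, 0.01]
ngram_orders = [1, 2]

# Build one command line per parameter combination instead of typing each by hand
commands = [
    [sys.executable, "train.py", "--lr", str(lr), "--ngram", str(n)]
    for lr, n in itertools.product(learning_rates, ngram_orders)
]

for cmd in commands:
    print(" ".join(cmd))  # a real runner would call subprocess.run(cmd, check=True)
```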
51. Best Practices for Scientific Computing
3- Use the computer to record history
Unique identifiers and version numbers for raw data records
Unique identifiers and version number for programs and
libraries
The values of parameters used to generate any given output;
The names and version number of programs used to generate
those outputs.
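A minimal, illustrative sketch of recording history in Python (the function name and record fields are assumptions, not a standard API): every output file carries the parameters, interpreter version, and a unique run identifier used to produce it:

```python
import json
import platform
import time
import uuid

def save_run(results: dict, params: dict, out_prefix: str) -> str:
    """Write results together with the provenance needed to reproduce them."""
    run_id = uuid.uuid4().hex[:8]  # unique identifier for this run
    record = {
        "run_id": run_id,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "python_version": platform.python_version(),  # interpreter version
        "params": params,        # parameter values used to generate this output
        "results": results,
    }
    path = f"{out_prefix}_{run_id}.json"
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path

path = save_run({"accuracy": 0.91}, {"model": "naive_bayes", "ngram": 1}, "exp")
```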
4- Make incremental changes
Scientists cannot know what their programs should do next
until the current version has produced some results.
Should work in small steps with frequent feedback and
correction!
52. Best Practices for Scientific Computing
5- Use a version control system: git, mercurial, subversion. Push
code to GitHub or Bitbucket
Everything that has been created manually should be put in
version control
6- Do not repeat yourself (or others)
At small-scale, code should be modularized rather than copied
and pasted.
At large-scale, scientific programmers should re-use code
instead of re-writing it.
53. Best Practices for Scientific Computing
7- Plan for mistakes
Write and run tests
Unit Test: Check the correctness of each single software unit
Integration Test: Check that pieces of unit code work
correctly when combined.
Regression Test: Running pre-existing code tests after changes
to the code in order to make sure that it hasn’t regressed.
Use an off-the-shelf unit testing library
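A minimal sketch of a unit test with Python's off-the-shelf unittest library; the tokenizer is a toy function, present only to show the pattern:

```python
import unittest

def tokenize(text: str) -> list:
    """Toy whitespace tokenizer; stands in for a real software unit."""
    return text.split()

class TestTokenize(unittest.TestCase):
    def test_simple_sentence(self):
        self.assertEqual(tokenize("research starts from learning"),
                         ["research", "starts", "from", "learning"])

    def test_empty_input(self):
        self.assertEqual(tokenize(""), [])

if __name__ == "__main__":
    unittest.main(exit=False)  # run the tests without killing the interpreter
```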
54. Best Practices for Scientific Computing
8- Optimize software only after it works correctly
Use profiler to identify bottlenecks
Write code in the highest-level language possible
Python is a recommended language for research
Only use a low-level programming language when you are sure
that a performance boost is needed.
Use the highest-level programming language for rapid
prototyping.
55. 9- Document design, and purpose, not mechanics
Document interface and reasons, not implementations
Do not do this:
i = i + 1  # Increment the variable 'i' by one.
Refactor the code instead of explaining how it works
Embed the documentation for a piece of software in that
software
Use software to generate documentation.
10- Collaborate
Use pre-merge code reviews
Use an issue tracking tool.
56. Coding practices for NLP/ML research
All general practices apply for NLP/ML research
Separate a process into small processes
Use pipelines in Unix/Linux
Make use of tools in experiments
Linux commands
NLP/ML Tools
Libraries (json, nltk, matplotlib, scikit-learn,...)
Algorithms
E.g., show statistics of the words in a text file
cat file_name.txt | cut -f1 | sort | uniq -c | sort -nr
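The same kind of frequency count can be done in Python with collections.Counter (equivalent in spirit to the Unix pipeline above; the sample lines here are made up):

```python
from collections import Counter

def field_counts(lines):
    """Count first tab-separated fields, like: cut -f1 | sort | uniq -c | sort -nr"""
    return Counter(line.split("\t")[0] for line in lines if line.strip()).most_common()

sample = ["the\tDT", "cat\tNN", "the\tDT", "sat\tVBD", "the\tDT"]
print(field_counts(sample))  # [('the', 3), ('cat', 1), ('sat', 1)]
```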
Visualize experimental results, make demo for your research
results
58. Optimize code only after your ideas work
“Make it work. Make it right. Make it fast.” (Kent Beck)
“Premature optimization is the root of all evil (or at least
most of it) in programming.” (Donald Knuth)
In NLP, always start with a simple and dirty working version
E.g., bag-of-words features and the Naive Bayes algorithm in text
classification tasks
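As an illustrative sketch (toy data, not a real benchmark), such a quick bag-of-words + Naive Bayes baseline takes only a few lines with scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: the point is to get a working end-to-end baseline quickly
texts = ["great match and a late goal", "stocks fell sharply today",
         "the striker scored twice", "markets rallied on earnings"]
labels = ["sport", "finance", "sport", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words + NB
clf.fit(texts, labels)
print(clf.predict(["the goal was scored late"]))
```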
60. My profile
6/2006: B.Sc. in Information Technology from University of
Engineering and Technology, Vietnam National University,
Hanoi
3/2010: M.Sc. in Information Science from Japan Advanced
Institute of Science and Technology
3/2013: Ph.D. in Information Science from Japan Advanced
Institute of Science and Technology
61. Master program at JAIST
JAIST is a public graduate institute in Japan
Homepage: https://www.jaist.ac.jp/english
Three schools
Information Science
Knowledge Science
Material Science
All courses have an English version
You can study entirely in English
62. Master program at JAIST
Two-year full-time master program
First year:
Students are temporarily assigned to a laboratory, and select
the official lab after 3 months
In the first year, mainly taking courses and choosing the
master research topic
Write the research proposal for the master thesis at the end of the
first year
Second year:
Finish all remaining course work
Work on the master research project
Look for jobs (students who do not pursue a Ph.D.)
63. How did I finish my master?
Six months before entering master program
Take Japanese course
Review background
Read NLP Textbooks
First year:
Finish all course work
Join a research project in my laboratory
Choose the research topic
Second year:
Do research
Attend one international conference
Thesis defense
64. How I chose my master thesis topic
I did not even know how to choose a research topic (crying)
You should know how to choose
I was assigned the topic by my co-advisor
The topic was sentence insertion
I proposed a method to improve on the previous results
65. Sentence insertion task
Task: automatically update a Wikipedia article by inserting
new information into it.
I proposed using word clusters to capture the meaning of words
66. Research projects at FPT Technology Research Institute
NLP problems in chatbot development
Intent classification
Named entity recognition
FAQ generation from chat history, manuals
Figure: Source: stanfy.com (http://tinyurl.com/mdfsa6h)
67. Summary
Empirical research methods rely on observations, data,
experiments
Two dimensions of problem choice: Feasibility and Interest
Research starts from learning
Reading is very important in research
NLP research involves much data analysis
Coding practices for NLP/ML research
68. Check-list for your master thesis
1 Is your work reproducible?
Package your code so that it can automatically generate the
results by a single script
Freeze the final version
2 Is your proposed method new?
3 Did you revise your thesis many times?
Ask your advisors and friends for proofreading
4 Did you understand previous work?
5 Do you think you can pass the master thesis defense?
69. Advice for your master thesis
Take time to choose your master research topic
Work on a research problem that you are interested in
Start soon
Follow up with your advisor
Spend time on regular literature review (reading papers)
Commit at least 2-3 hours per day to your master research
Look at your data before you start doing something
Follow “best” coding practices for research
Use version control
Version everything that is manually created
Back up your work in the cloud
70. References
Alon, U. (2009). How to choose a good scientific problem.
Molecular Cell, 35(6), 726-728.
Wilson, G., Aruliah, D.A., Brown, C.T., Chue Hong, N.P., Davis, M.,
Guy, R.T., Haddock, S.H.D., Huff, K.D., Mitchell, I.M., Plumbley, M.D.,
Waugh, B., White, E.P., & Wilson, P. (2014). Best Practices for
Scientific Computing. PLoS Biology, 12(1): e1001745.
Ali Eslami. Patterns for Research in Machine Learning
http://arkitus.com/patterns-for-research-in-machine-learning