This document outlines an introductory lecture on research methods in natural language processing (NLP). It discusses empirical research methods in computer science, how to choose a good research topic, how to read scientific papers, how to work with an advisor, and how to do research in NLP. The document provides an overview of key aspects of conducting research in NLP, such as identifying problems, developing ideas, conducting experiments, analyzing results, and iterating on the process. It also discusses common NLP problems and applications. Overall, it is an introductory lecture on best practices for conducting NLP research.
The document discusses dependency parsing in natural language processing. It begins by defining dependency as a syntactic or semantic relation between tokens. It then contrasts constituent structure, which groups tokens into phrases bottom-up, with dependency structure, which builds a graph connecting tokens with edges. The document goes on to describe the components of a dependency graph, including vertices, arcs, and relations. It also discusses projectivity, head rules to convert constituent trees to dependency trees, and different approaches to dependency parsing like transition-based and graph-based parsing.
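To make the projectivity notion concrete, here is a minimal sketch (not from the slides; the head-list encoding and the `is_projective` name are illustrative) that detects crossing arcs in a dependency tree:

```python
def is_projective(heads):
    """Check projectivity of a dependency tree given as a head list.

    heads[i - 1] is the index of token i's head (0 = artificial root).
    Tokens are numbered 1..n. A tree is projective iff no two arcs cross.
    """
    # Each arc connects a dependent d to its head h; store as (left, right).
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            # Two arcs cross when one starts strictly inside the other's
            # span and ends strictly outside it.
            if l1 < l2 < r1 < r2:
                return False
    return True
```

For example, `[2, 0, 2]` (both outer tokens attached to the middle root) is projective, while a tree with arcs (1,3) and (2,4) is not.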
Abstractive text summarization is one of the most important research topics in NLP today. However, a deep understanding of what it is and how it works requires a series of foundations that build on one another. For that reason, this presentation gives the audience an overview of sequence-to-sequence models and the various versions of attention developed over the past few years. It then reviews natural language generation (NLG), focusing on decoding techniques and their associated problems, as background to the success of automatic summarization. Finally, abstractive text summarization itself is presented, along with approaches from recent papers for tackling some of its open issues.
Lecture 3 Computer Science Research SEM1 22_23 (1).pptx – NabilaHassan13
The document discusses research in computer science. It defines research as a systematic process of investigating problems to find valid answers supported by evidence. Computer science research derives from mathematics and philosophy. It involves studying computational phenomena, developing models, and investigating properties of abstract objects through formal methods. Research can be theoretical, focusing on creation and analysis of abstract models, or empirical, involving observation and experimentation. The document outlines the basic steps in theoretical and empirical computer science research processes. It also categorizes the scope of computer science research.
Neural Models for Information Retrieval – Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embedding spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd ed. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791.
Every researcher is a cyborg! Academic researchers engage in various sorts of research in vitro (in the glass) and in vivo (in the living body); that is, they carry out experimental laboratory work and analyze data from natural, real-world experiments. In between, many conduct surveys, focus groups, interviews, and other types of research work. In the computer-assisted qualitative data analysis software (CAQDAS) space, NVivo is one of the foremost tools, enabling the creation of manual codebooks, multimedia analysis, and various forms of “auto” or unsupervised machine learning. NVivo works as a “database” for structured and unstructured (multimedia) data. It enables the drawing of content from various social media sites. Technologies augment human analytical capabilities in both the qualitative and quantitative research spaces. This presentation demonstrates some of the capabilities of NVivo. It also addresses how a researcher is changed by the computational capabilities they harness.
Words and sentences are the basic units of text. This lecture covers basic operations on words and sentences, such as tokenization, text normalization, tf-idf, cosine similarity measures, vector space models, and word representations.
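The tf-idf and cosine-similarity operations mentioned above can be sketched in plain Python (a simplified illustration: the whitespace tokenizer, the raw-count tf, and the function names are our assumptions, not the lecture's definitions):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute sparse tf-idf vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A document compared with itself scores 1.0, and documents sharing no terms score 0.0.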
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents.
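ROUGE-N recall, the automatic evaluation mentioned above, can be sketched as follows (a simplified illustration; real ROUGE implementations add stemming, stopword handling, and other options):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that appear in the candidate."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each reference n-gram can be matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For instance, the candidate "the cat sat" recovers 3 of the 6 unigrams of the reference "the cat sat on the mat", giving ROUGE-1 recall of 0.5.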
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... – Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python, and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
A short presentation for beginners introducing machine learning: what it is, how it works, the popular machine learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning), and how they work, with various industry use cases and popular examples.
This presentation provides an overview of boosting approaches for classification problems. It discusses combining classifiers through bagging and boosting to create stronger classifiers. The AdaBoost algorithm is explained in detail, including its training and classification phases. An example is provided to illustrate how AdaBoost works over multiple rounds, increasing the weights of misclassified examples to improve classification accuracy. In conclusion, AdaBoost is highlighted as an approach that produces highly accurate strong classifiers, making it effective for classification problems where misclassification has severe consequences.
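The re-weighting step AdaBoost performs each round can be sketched as follows (a minimal illustration with ±1 labels; the function and variable names are ours, not the presentation's, and we assume the weak learner's error is strictly between 0 and 1):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: compute the weak learner's vote weight (alpha)
    and re-weight the training examples, boosting the misclassified ones.

    weights: current example weights (sum to 1)
    predictions, labels: +1/-1 outputs of the weak learner and true labels
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified examples are multiplied by e^alpha, correct ones by e^-alpha.
    new = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)  # normalizer so the weights again form a distribution
    return alpha, [w / z for w in new]
```

With four equally weighted examples and one mistake, the misclassified example's weight rises from 0.25 to 0.5, so the next weak learner concentrates on it.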
This document discusses how to create effective research questions to guide research. It explains that research questions map out the direction of the research. An effective research question needs information from sources beyond yourself, requires background research, and is neither too broad nor too narrow in scope. There are two types of questions: "thin" questions like who, what, when, where that provide background details, and "thick" questions using how and why that explore broader concepts and changes over time. The document provides examples of each and guides the reader in forming their own thick questions.
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
This document provides an overview of Bayes law, Bayesian networks, and latent Dirichlet allocation (LDA). It begins with an explanation of Bayes law and examples of how it can be used. Next, it defines Bayesian networks as probabilistic graphical models and provides examples. Finally, it introduces LDA as a statistical model for collections of discrete data like text corpora and explains how it can be used for topic modeling. The document includes mathematical notation and diagrams to illustrate key concepts.
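Bayes' law as described can be illustrated with a short sketch (the diagnostic-test framing and the parameter names are illustrative, not taken from the document):

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Posterior P(H | E) via Bayes' law:

    P(H|E) = P(E|H) P(H) / [ P(E|H) P(H) + P(E|~H) P(~H) ]
    """
    num = likelihood * prior
    denom = num + likelihood_given_not * (1 - prior)
    return num / denom
```

With a 1% prior, a 99% true-positive rate, and a 5% false-positive rate, the posterior is only 1/6: most positives come from the much larger negative population.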
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
This document provides an overview of research methods. It defines research and describes the scientific research process. Research is defined as a systematic investigation to discover and develop knowledge. The scientific research process involves four stages: exploration, description, explanation, and prediction. It also outlines key aspects of different types of research methods, including quantitative and qualitative approaches, and discusses challenges in applying scientific methods to social science research. The document emphasizes that research requires a systematic, objective, and rigorous approach.
This document discusses case study research design and methodology. It defines a case study as an empirical inquiry that investigates a contemporary phenomenon in its real-life context. Case studies rely on multiple sources of evidence that must converge to draw conclusions. The key components of case study research design are determining the study's questions, propositions, units of analysis, and linking data to propositions. Data collection involves gathering evidence through documentation, interviews, observations, and artifacts, requiring skills like effective questioning and listening without bias. Data is then organized and reported, with options including linear, comparative, chronological, or unstructured structures.
What is and what isn’t a good research question? Discover how to develop an impactful and significant research question by asking the right questions related to your field and area of study. This is a presentation developed through the Graduate Resource Center at the University of New Mexico.
Introduction to Model-Based Machine Learning – Daniel Emaasit
The field of machine learning has seen the development of thousands of learning algorithms. Typically, scientists choose from these algorithms to solve specific problems, and their choices are often limited by their familiarity with those algorithms. In this classical/traditional framework of machine learning, scientists are constrained to making certain assumptions so as to use an existing algorithm. This is in contrast to the model-based machine learning approach, which seeks to create a bespoke solution tailored to each new problem.
Meaning and introduction to educational research – Qazi GHAFOOR
This document discusses the meaning and introduction of research. It defines research as the formal and systematic application of scientific methods to study a problem. There are two main sources of knowledge: revealed knowledge from religious texts and acquired knowledge from personal experiences, experts, logic, and the scientific method. The scientific method involves recognizing a problem, formulating hypotheses, collecting and analyzing data, and stating conclusions. The document also discusses the need for research when problems exist, different types of research classified by purpose and method, and the importance of ethics in research.
The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
Content analysis is a research technique in which:
1) The presence of certain words or concepts within texts is determined through quantitative analysis.
2) Researchers quantify and analyze the presence, meanings, and relationships of words and concepts to make inferences about messages, writers, audiences, and cultural contexts.
3) A text is coded by breaking it into categories, then occurrences and relationships of concepts are examined through conceptual analysis (counting concepts) or relational analysis (examining relationships among concepts).
This document discusses the K-nearest neighbors (KNN) algorithm, an instance-based learning method used for classification. KNN works by identifying the K training examples nearest to a new data point and assigning the most common class among those K neighbors to the new point. The document covers how KNN calculates distances between data points, chooses the value of K, handles feature normalization, and compares strengths and weaknesses of the approach. It also briefly discusses clustering, an unsupervised learning technique where data is grouped based on similarity.
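The KNN procedure described here can be sketched in a few lines (Euclidean distance and majority vote, as in the summary; the data layout and function name are illustrative):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.

    train: list of (feature_vector, label) pairs; distance is Euclidean.
    """
    # Sort the training set by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A query point near the cluster of one class is assigned that class; note that in practice features should be normalized first, as the document points out, so no single feature dominates the distance.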
This document provides an overview of qualitative, quantitative, and mixed methods research approaches. It discusses the underlying principles, benefits, and limitations of each approach. Qualitative research is based on phenomenology and seeks to understand individual experiences, while quantitative research uses logical positivism to objectively measure variables and test hypotheses. Mixed methods combines both qualitative and quantitative approaches. The document analyzes how each approach could be applied to research on critical thinking in nursing education.
PowerPoint Presentation - Conditional Random Fields - A ... – butest
- Conditional random fields (CRFs) are probabilistic graphical models that can be used for labeling and segmenting sequential data. They generalize hidden Markov models (HMMs) by allowing dependencies between labels.
- CRFs are discriminative models that directly model the conditional probability of labels given observations, rather than the joint probability like generative models. This allows them to avoid problems with independence assumptions.
- Linear-chain CRFs are commonly used for sequential labeling tasks. They incorporate a large number of features without conditional independence assumptions, outperforming HMMs on problems like gene prediction. Parameter estimation is done with maximum likelihood.
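Decoding in a linear-chain model such as a CRF is typically done with the Viterbi algorithm; here is a minimal sketch (using toy per-position and transition scores rather than learned CRF feature weights, which are an assumption on our part):

```python
def viterbi(obs_scores, trans):
    """Best label sequence for a linear-chain model.

    obs_scores: list over positions of {label: score}
    trans: {(prev_label, label): score}, defaulting to 0.0 if absent
    Scores are log-potentials; we maximize their sum over the sequence.
    """
    labels = list(obs_scores[0])
    # best[l] = (score of best path ending in label l, that path)
    best = {l: (obs_scores[0][l], [l]) for l in labels}
    for scores in obs_scores[1:]:
        new = {}
        for l in labels:
            # Extend the best predecessor path with label l.
            prev_score, path = max(
                (best[p][0] + trans.get((p, l), 0.0), best[p][1])
                for p in labels)
            new[l] = (prev_score + scores[l], path + [l])
        best = new
    return max(best.values())[1]
```

With two positions scoring 'A' then 'B' highest and no transition penalties, the decoder returns ['A', 'B'].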
This document discusses Bayesian global optimization as a method for tuning machine learning models. It begins by outlining challenges with traditional tuning methods like grid search and random search. It then introduces Bayesian global optimization, which uses a Gaussian process model and expected improvement criterion to efficiently search the parameter space. The document provides examples of applying Bayesian optimization to deep learning tasks in MXNet and TensorFlow to achieve faster and better performance than traditional methods. It concludes by discussing tools for evaluating optimization strategies and comparing Bayesian optimization to baseline methods.
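The expected-improvement criterion mentioned here has a closed form under a Gaussian surrogate; a sketch (assuming the surrogate's predicted mean and standard deviation at a candidate point are given, for a maximization problem) is:

```python
from statistics import NormalDist

def expected_improvement(mean, std, best_so_far):
    """Expected improvement at a point where the surrogate model
    (e.g. a Gaussian process) predicts `mean` and `std`:

    EI = (mu - f*) * Phi(z) + sigma * phi(z),  z = (mu - f*) / sigma
    """
    if std == 0:
        # No predictive uncertainty: improvement is deterministic.
        return max(0.0, mean - best_so_far)
    z = (mean - best_so_far) / std
    n = NormalDist()
    return (mean - best_so_far) * n.cdf(z) + std * n.pdf(z)
```

Note that EI is positive even where the predicted mean is below the incumbent, as long as the uncertainty is nonzero; this is what drives the exploration behavior that grid and random search lack.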
Research Methods in Natural Language Processing (2018 version) – Minh Pham
Updated version of my lecture slides on "Research Methods in Natural Language Processing" for the course RAW-501 in the Master's program at FPT University.
This document provides guidance on writing projects. It discusses how to plan a project by defining the vision and current reality, and determining action steps. When selecting a topic, one should identify their strengths, consider innovativeness, and identify gaps through critical thinking and research. The document also reviews how to scope problems, choose a title, perform critical reading and analysis, work on the project, and discuss results. In summary, the document offers a comprehensive overview of how to plan, develop and execute a successful project from start to finish.
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, python and MLSpark.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
A short presentation for beginners on Introduction of Machine Learning, What it is, how it works, what all are the popular Machine Learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning) and how they works with various Industry use-cases and popular examples.
This presentation provides an overview of boosting approaches for classification problems. It discusses combining classifiers through bagging and boosting to create stronger classifiers. The AdaBoost algorithm is explained in detail, including its training and classification phases. An example is provided to illustrate how AdaBoost works over multiple rounds, increasing the weights of misclassified examples to improve classification accuracy. In conclusion, AdaBoost is highlighted as an effective approach for classification problems where misclassification has severe consequences by producing highly accurate strong classifiers.
This document discusses how to create effective research questions to guide research. It explains that research questions map out the direction of the research. An effective research question needs information from sources beyond yourself, requires background research, and is neither too broad nor too narrow in scope. There are two types of questions: "thin" questions like who, what, when, where that provide background details, and "thick" questions using how and why that explore broader concepts and changes over time. The document provides examples of each and guides the reader in forming their own thick questions.
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
This document provides an overview of Bayes law, Bayesian networks, and latent Dirichlet allocation (LDA). It begins with an explanation of Bayes law and examples of how it can be used. Next, it defines Bayesian networks as probabilistic graphical models and provides examples. Finally, it introduces LDA as a statistical model for collections of discrete data like text corpora and explains how it can be used for topic modeling. The document includes mathematical notation and diagrams to illustrate key concepts.
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
This document provides an overview of research methods. It defines research and describes the scientific research process. Research is defined as a systematic investigation to discover and develop knowledge. The scientific research process involves four stages: exploration, description, explanation, and prediction. It also outlines key aspects of different types of research methods, including quantitative and qualitative approaches, and discusses challenges in applying scientific methods to social science research. The document emphasizes that research requires a systematic, objective, and rigorous approach.
This document discusses case study research design and methodology. It defines a case study as an empirical inquiry that investigates a contemporary phenomenon in its real-life context. Case studies rely on multiple sources of evidence that must converge to draw conclusions. The key components of case study research design are determining the study's questions, propositions, units of analysis, and linking data to propositions. Data collection involves gathering evidence through documentation, interviews, observations, and artifacts, requiring skills like effective questioning and listening without bias. Data is then organized and reported, with options including linear, comparative, chronological, or unstructured structures.
What is and what isn’t a good research question? Discover how to develop an impactful and significant research question by asking the right questions related to your field and area of study. This is a presentation developed through the Graduate Resource Center at the University of New Mexico.
Introduction to Model-Based Machine LearningDaniel Emaasit
The field of machine learning has seen the development of thousands of learning algorithms. Typically, scientists choose from these algorithms to solve specific problems. Their choices often being limited by their familiarity with these algorithms. In this classical/traditional framework of machine learning, scientists are constrained to making some assumptions so as to use an existing algorithm. This is in contrast to the model-based machine learning approach which seeks to create a bespoke solution tailored to each new problem.
Meaning and introduction to educational researchQazi GHAFOOR
This document discusses the meaning and introduction of research. It defines research as the formal and systematic application of scientific methods to study a problem. There are two main sources of knowledge: revealed knowledge from religious texts and acquired knowledge from personal experiences, experts, logic, and the scientific method. The scientific method involves recognizing a problem, formulating hypotheses, collecting and analyzing data, and stating conclusions. The document also discusses the need for research when problems exist, different types of research classified by purpose and method, and the importance of ethics in research.
The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
Content analysis is a research technique in which:
1) The presence of certain words or concepts within texts is determined through quantitative analysis.
2) Researchers quantify and analyze the presence, meanings, and relationships of words and concepts to make inferences about messages, writers, audiences, and cultural contexts.
3) A text is coded by breaking it into categories, and occurrences and relationships of concepts are then examined through conceptual analysis (counting concepts) or relational analysis (examining relationships among concepts).
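The "counting concepts" step of conceptual analysis can be illustrated with a minimal sketch; the sample text and concept list below are invented for illustration, not taken from the document:

```python
import re
from collections import Counter

def conceptual_analysis(text, concepts):
    """Count how often each concept (here, a single word) appears in the text --
    the 'counting concepts' step of conceptual analysis."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {c: counts[c] for c in concepts}

doc = "The committee debated the policy. Policy reform was urgent, the committee agreed."
print(conceptual_analysis(doc, ["committee", "policy", "reform"]))
# → {'committee': 2, 'policy': 2, 'reform': 1}
```

Relational analysis would go one step further, e.g. counting how often two concepts co-occur within the same sentence or coding window.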
This document discusses the K-nearest neighbors (KNN) algorithm, an instance-based learning method used for classification. KNN works by identifying the K training examples nearest to a new data point and assigning the most common class among those K neighbors to the new point. The document covers how KNN calculates distances between data points, chooses the value of K, handles feature normalization, and compares strengths and weaknesses of the approach. It also briefly discusses clustering, an unsupervised learning technique where data is grouped based on similarity.
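The classification rule described above (find the K nearest training points, assign the majority class among them) can be sketched in a few lines of Python; the toy data and k=3 are illustrative, not from the slides:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training example
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Most common class among the k closest points
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

# Toy 2-D data: class "a" clusters near (0, 0), class "b" near (5, 5)
train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))  # → a
print(knn_predict(train, labels, (5.5, 5.5)))  # → b
```

In practice, as the document notes, features are normalized first (e.g. min-max scaling) so that no single feature dominates the distance computation.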
This document provides an overview of qualitative, quantitative, and mixed methods research approaches. It discusses the underlying principles, benefits, and limitations of each approach. Qualitative research is based on phenomenology and seeks to understand individual experiences, while quantitative research uses logical positivism to objectively measure variables and test hypotheses. Mixed methods combines both qualitative and quantitative approaches. The document analyzes how each approach could be applied to research on critical thinking in nursing education.
PowerPoint Presentation - Conditional Random Fields - A ... — butest
- Conditional random fields (CRFs) are probabilistic graphical models that can be used for labeling and segmenting sequential data. They generalize hidden Markov models (HMMs) by allowing dependencies between labels.
- CRFs are discriminative models that directly model the conditional probability of labels given observations, rather than the joint probability like generative models. This allows them to avoid problems with independence assumptions.
- Linear-chain CRFs are commonly used for sequential labeling tasks. They incorporate a large number of features without conditional independence assumptions, outperforming HMMs on problems like gene prediction. Parameter estimation is done with maximum likelihood.
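As a rough illustration of the discriminative modeling described in the bullets, the sketch below computes log p(y | x) for a linear-chain CRF from given emission and transition scores, using the forward algorithm for the partition function. The random scores are purely illustrative; a real CRF would derive them from weighted feature functions and learn the weights by maximum likelihood:

```python
import itertools
import numpy as np

def crf_log_prob(emit, trans, y):
    """Log p(y | x) for a linear-chain CRF.
    emit[t, s]  -- emission score of label s at position t
                   (in a real CRF: dot product of features and weights);
    trans[s, r] -- transition score from label s to label r."""
    T, S = emit.shape
    # Unnormalized score of the given label sequence
    score = emit[0, y[0]] + sum(trans[y[t - 1], y[t]] + emit[t, y[t]]
                                for t in range(1, T))
    # Forward algorithm for the log partition function log Z(x)
    alpha = emit[0].copy()
    for t in range(1, T):
        m = alpha[:, None] + trans                       # m[s, r]
        mx = m.max(axis=0)
        alpha = emit[t] + mx + np.log(np.exp(m - mx).sum(axis=0))
    return score - (alpha.max() + np.log(np.exp(alpha - alpha.max()).sum()))

# Sanity check on random scores: p(y | x) over all label sequences sums to 1
rng = np.random.default_rng(0)
emit, trans = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
total = sum(np.exp(crf_log_prob(emit, trans, y))
            for y in itertools.product(range(2), repeat=3))
print(round(total, 6))  # → 1.0
```

This is exactly the sense in which a CRF is discriminative: the model normalizes over label sequences given the observations, rather than modeling observations jointly as an HMM does.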
This document discusses Bayesian global optimization as a method for tuning machine learning models. It begins by outlining challenges with traditional tuning methods like grid search and random search. It then introduces Bayesian global optimization, which uses a Gaussian process model and expected improvement criterion to efficiently search the parameter space. The document provides examples of applying Bayesian optimization to deep learning tasks in MXNet and TensorFlow to achieve faster and better performance than traditional methods. It concludes by discussing tools for evaluating optimization strategies and comparing Bayesian optimization to baseline methods.
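A minimal sketch of the Gaussian-process-plus-expected-improvement loop the document describes (not the actual SigOpt/MXNet/TensorFlow setup); the 1-D objective, kernel length-scale, candidate grid, and iteration budget are all illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Gaussian-process posterior mean and std at candidate points Xs."""
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for maximization: E[max(f - best, 0)] under the posterior."""
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

f = lambda x: -(x - 0.7) ** 2              # hidden objective, maximum at 0.7
X = np.array([0.1, 0.5, 0.9]); y = f(X)    # initial design points
cand = np.linspace(0, 1, 201)              # candidate grid
for _ in range(10):                        # BO loop: fit GP, pick argmax EI
    mu, sd = gp_posterior(X, y, cand)
    x_next = cand[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print(X[np.argmax(y)])  # best x found; should be close to the optimum 0.7
```

The contrast with grid/random search is that each new evaluation is chosen where the model's trade-off between predicted value (mu) and uncertainty (sigma) is most promising, so far fewer objective evaluations are needed.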
Research Methods in Natural Language Processing (2018 version) — Minh Pham
Updated version of my lecture slides about "Research Methods in Natural Language Processing" for the course RAW-501 in the Master program of FPT University.
This document provides guidance on writing projects. It discusses how to plan a project by defining the vision and current reality, and determining action steps. When selecting a topic, one should identify their strengths, consider innovativeness, and identify gaps through critical thinking and research. The document also reviews how to scope problems, choose a title, perform critical reading and analysis, work on the project, and discuss results. In summary, the document offers a comprehensive overview of how to plan, develop and execute a successful project from start to finish.
This document provides an overview of a workshop on planning academic papers. It discusses developing an outline for a paper, including typical sections like introduction, background, methodology, results, discussion, and conclusions. Previous sessions covered types of publications and what makes a good paper. This session focuses on paper structure and developing an outline, with tips like choosing a paper type, finding an example paper, and starting with a generic structure to customize. The goal is for participants to understand common paper elements and be able to start developing their own outline by the end of the workshop.
This document provides information on scientific research papers and their structure and purpose. It discusses that research papers present an interpretation or evaluation of an argument based on what is known about a subject. When writing a research paper, authors build upon existing knowledge and survey relevant fields to find the best information. Research can be published in many areas, including science, arts, humanities, religion, and management. For scientific publications specifically, the work must be public, objective, predictive, reproducible, systematic, and cumulative. Key parts of a research paper include the introduction, methods, analysis, results, discussion, and conclusions. The document provides guidance on how to effectively read and evaluate a scientific research paper.
This document provides information on scientific research papers and their structure and purpose. It discusses key parts of a research paper including the introduction, methods, results, and discussion sections. It emphasizes that the goal of a scientific paper is to advance knowledge in a field by presenting a research study and its findings in a clear, objective manner so that other experts can analyze and build upon the work. Overall, the document serves as a guide for writing and reading scientific research papers effectively.
This document provides information on scientific research papers and how to read them effectively. It discusses that research papers are an important part of the scientific community and involve building upon existing knowledge in a field. It also outlines the key sections of a research paper such as the introduction, methods, results and discussion. The document emphasizes that to critically read a research paper, one should understand the problem being studied and evaluated, understand the proposed methodology, and evaluate the assumptions, findings and conclusions presented. It stresses reading research papers actively and constructively in order to gain insights and identify areas for further study.
Systematic Literature Reviews: Concise Overview — youkayaslam
This document provides an overview of a workshop on systematic approaches to literature reviewing led by Dr. Mark Matthews. The workshop explores elements of the systematic review process and how they can be adapted for thesis literature reviews and keeping up with literature through a PhD. It discusses formulating review questions, systematically searching literature databases and other sources, selecting studies, critically appraising research, analyzing and synthesizing findings, and structuring the writing of literature reviews. Challenges of literature reviews and additional resources are also presented.
This document provides a template for presenting a journal club, including guidelines for selecting a paper, structuring the presentation, and evaluating the paper. Some key points include: (1) The paper should be of interest to both the presenter and audience, be recently cited but not just published, and report a novel method or application. (2) The presentation should be 30 minutes with 20-25 spent summarizing the paper and 5+ for discussion. (3) The outline includes introducing the biomedical problem, methods, results, evaluation, presenter's assessment, and conclusions. (4) The assessment considers the paper's informatics and biomedical contributions as well as any limitations.
The document provides an outline for writing a research proposal and report. It discusses the typical elements and structure, including:
1) Elements such as the title page, problem statement, objectives, literature review, methodology, and references.
2) Developing the proposal involves choosing a topic, formulating research questions, outlining literature, deciding on methods, and proposing timelines and resources.
3) Research proposals and reports generally have five chapters: introduction, literature review, methodology, analysis, and conclusions. Each chapter contains standard sections.
Research seminar lecture_7_criteria_good_research — Daria Bogdanova
This document provides an overview and review of key aspects of educational research. It discusses what educational research is and the main types of research. It outlines the typical steps in conducting research, including identifying a research problem, conducting a literature review, developing research questions and hypotheses, identifying needed data, data collection methods, data analysis, findings, discussion, and conclusions. Good research is defined as having a sound rationale, clear aims, a relevant theoretical basis, well-defined research questions, an appropriate methodology, contributions to the field, and consistency between all steps. Typical mistakes include having too much background and too little on the specific current research, as well as weaknesses in feasibility or scope.
RESEARCH METHODOLOGY_ STEP BY STEP RESEARCH METHODOLOGY CHAPTER_.pdf — MATIULLAH JAN
What the methodology chapter is and why it is important.
How to structure and write up the methodology chapter:
- The research design
- The research philosophy
- The research type (e.g. inductive research)
- The research strategy (e.g. experimental research)
- The time horizon
- The sampling strategy
- The data collection method
- The analysis methods and techniques
- The methodological limitations
This document provides guidance on publishing research results in academic journals. It outlines the typical components of a research paper, including an introduction describing the purpose and literature review, methods, results presented in tables and discussed in text, and a discussion and conclusion section. The document also offers tips on selecting an appropriate journal based on its aims and scope, following author instructions carefully, and having others edit the paper before submitting.
This document provides tips and strategies for effectively reading academic papers. It discusses deciding what papers to read based on relevance and credibility. It recommends making best use of academic resources like preprint sites, blogs, and mailing lists to stay updated. It explains the importance of reading for breadth to understand the big picture and reading for depth to critically examine assumptions, methods, statistics and conclusions. The document concludes by discussing how to take notes and think creatively after reading papers to develop new research ideas.
Writing an effective Poster: the point of view of experts, novices and litera... — Elisabetta Cigognini
The document discusses guidelines for effective scientific poster design from experts, novices, and literature. It analyzes posters created by students to identify problematic design elements. Experts agree that posters should have a clear organization, use large readable fonts, select key information, and limit text length and decorative images. While students struggled with these principles, experts note similar issues still appear in some experienced researchers' posters as well.
This document provides guidance on writing reports based on research. It discusses defining objectives for the report based on reader needs. It also covers conducting sound research through understanding context, defining questions precisely, and using credible methods. The document reviews planning a report's organization and structure, as well as drafting, revising, and crafting different report sections like the introduction, methods, results, discussion, conclusions and recommendations. Key elements of each section are outlined. Readers are given a writing assignment to help apply these report writing concepts.
This document discusses key aspects of the scientific research process and publishing findings, including:
1) The typical phases of the scientific method such as developing a research question, conducting background research, forming a hypothesis, designing and conducting experiments, analyzing results, and publishing findings.
2) Guidelines for publishing research including selecting appropriate publication venues based on their prestige, impact factor, and indexing in databases. Conferences, journals, books, and dissertation are discussed as common publication types.
3) Metrics for measuring research impact, including the number of citations, journal impact factor, and the h-index, which provides an indicator of productivity and citation impact. Resources for identifying publications and metrics, like Web of Science, DBLP, and Google Scholar, are also discussed.
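The h-index mentioned in point 3 is straightforward to compute: it is the largest h such that the author has h papers with at least h citations each. A small sketch with made-up citation counts:

```python
def h_index(citations):
    """h-index: the largest h such that h papers have >= h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([10, 8, 5, 4, 3]))  # → 4 (four papers with at least 4 citations)
print(h_index([25, 8, 5, 3, 3]))  # → 3
```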
RES 3024 Presentation 3a Understanding Academic Articles.ppsx — MatthewLewis227954
This document provides an overview of understanding academic articles. It discusses important terms related to academic research like academic journals, peer review, and empirical research. It then describes how to find academic articles using online databases and search tools. The major sections of an empirical study are outlined as abstract, introduction and literature review, methods, results, discussion and conclusions, and references. A 4-step reading strategy is proposed that involves reading the abstract, skimming the introduction and discussion, skimming the methods and results, and then reading the introduction and discussion in depth. Computer-based tools for efficiently finding relevant literature are also briefly mentioned.
This document provides an overview of research and the research process. It discusses that research involves asking questions and finding answers through systematic procedures. Research can be qualitative, involving more subjective methods, or quantitative, using more objective methods. The goal of research is to describe phenomena, determine causes of behavior, predict behavior, and explain behavior. Strong research is theory-driven, testable, replicable, and seeks to minimize bias. The research process involves forming a question or hypothesis, designing a study, collecting and analyzing data, and drawing conclusions. Presentations of research should be clear, well-organized, and visually engaging for audiences.
Similar to Research Methods in Natural Language Processing (20)
Prompt Engineering Tutorial: How to write effective prompts with ChatGPT — Minh Pham
A lecture on using prompt engineering effectively with ChatGPT. After completing the lecture, learners will understand the basic structure of a prompt and know how to design prompts effectively and efficiently.
AimeLaw at ALQAC 2021: Enriching Neural Network Models with Legal-Domain Know... — Minh Pham
Our presentation slide at the 13th IEEE International Conference on Knowledge and Systems Engineering (KSE 2021).
In this paper, we present our participating systems for three Vietnamese legal text processing tasks at the Automated Legal Question Answering Competition (ALQAC 2021). In our systems, we leverage the strengths of traditional information retrieval methods (BM25), pre-trained masked language models (BERT), and legal domain knowledge. Our proposed methods help to overcome the shortage of training data. In particular, for the legal textual entailment task, we propose a novel data augmentation method based on legal domain knowledge. Evaluation results show the effectiveness of our proposed methods.
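As context for the BM25 component mentioned above, here is a minimal sketch of the standard Okapi BM25 scoring formula; the toy corpus and the parameter values (k1=1.5, b=0.75 are common defaults) are illustrative, not the paper's actual configuration:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 relevance of tokenized `doc` to `query`, given corpus `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N       # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

docs = [["court", "ruled", "on", "tax", "appeal"],
        ["tax", "law", "tax", "reform"],
        ["weather", "was", "sunny"]]
query = ["tax", "law"]
scores = [bm25_score(query, d, docs) for d in docs]
print(scores.index(max(scores)))  # → 1 (the document matching both query terms)
```

In a retrieval pipeline like the one described, BM25 typically serves as a fast first-stage ranker whose candidates are then re-scored by a neural model such as BERT.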
A Multimodal Ensemble Model for Detecting Unreliable Information on Vietnames... — Minh Pham
This document proposes a multimodal ensemble model for detecting unreliable information on Vietnamese social media. It uses text, image, and metadata features as inputs to three deep learning models - BERT+CNN, and two variants with additional CNN layers. An attention mechanism is applied to learn which image parts to focus on for each text. The models are ensembled by averaging their prediction probabilities. Evaluation on a private test set shows the ensemble model achieves an AUC of 0.945, outperforming the individual models. Future work could involve comparing posts to external sources to find evidence of fakes.
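The probability-averaging ensemble described above can be sketched in a few lines; the per-model probabilities below are made-up numbers for illustration, not results from the paper:

```python
import numpy as np

# Hypothetical per-model probabilities that each post is unreliable
# (rows: models, columns: posts) -- the numbers are illustrative only.
model_probs = np.array([
    [0.90, 0.20, 0.60],   # e.g. a text-based model
    [0.80, 0.35, 0.40],   # e.g. an image-based model
    [0.85, 0.10, 0.55],   # e.g. a metadata-based model
])
ensemble = model_probs.mean(axis=0)          # average the probabilities
predictions = (ensemble >= 0.5).astype(int)  # threshold at 0.5
print(ensemble.round(3).tolist())  # → [0.85, 0.217, 0.517]
print(predictions.tolist())        # → [1, 0, 1]
```

Averaging probabilities (rather than hard votes) lets a confident model outvote two uncertain ones, which is one reason probability ensembles often beat their individual members on metrics like AUC.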
Research methods for engineering students (v.2020) — Minh Pham
Students who are beginning research may face many difficulties, from choosing a good research topic to start with, to developing new ideas, to implementing models to test their ideas and writing papers. Research is a craft skill: you only learn it by doing. However, it helps to learn some know-how about doing research. In this lecture, I share how-to-do-research information for engineering students, with the hope that it will help students save time in the early stages of doing research.
A document introducing basic knowledge of AIML and how to use it when developing chatbots. To apply it more effectively, readers should consult more detailed materials.
Artificial neural networks and applications in natural language processing — Minh Pham
Slides from a talk at an event hosted by the company rubikAI. The presentation covers the basics of neural networks and their applications in natural language processing.
Slides from a presentation at al+ AI Seminar #4 on the best paper award winner at NAACL 2018:
Peters et al., 2018. Deep Contextualized Word Representations. In NAACL.
Original paper: http://aclweb.org/anthology/N18-1202
ELMo is a context-dependent word representation model learned from a bidirectional language model. ELMo has been applied to many different tasks and achieved state-of-the-art results on many datasets.
A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Ev... — Minh Pham
The presentation of a feature-based model for nested named-entity recognition at VLSP 2018. Our system ranked first among participating systems. There is still a gap between the accuracy on the development set and the test set.
On the attention technique in sequence-to-sequence models at ACL 2017 — Minh Pham
A presentation on the attention technique in sequence-to-sequence models and its applications in NLP research at ACL 2017. We also summarize several other interesting studies from the conference.
Natural language processing problems in chatbot development — Minh Pham
A presentation on the natural language processing problems involved in developing retrieval-based chatbot systems. Neural conversation-generation models (neural chatbots) are also covered.
Introduction to natural language processing — Minh Pham
This document provides an introduction to natural language processing (NLP). It discusses what NLP is, why NLP is a difficult problem, the history of NLP, fundamental NLP tasks like word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis, and applications of NLP like information retrieval, question answering, text summarization and machine translation. The document aims to give readers an overview of the key concepts and challenges in the field of natural language processing.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu... — Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
Mutualism allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship allows the organisms to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
Compound A → utilized by population 1 → Compound B → utilized by population 2 → Compound C → utilized by both populations 1 and 2 → products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the cooperation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations 1 and 2 are then able to carry out metabolic reactions leading to an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal media, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf — Selcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
The cost of acquiring information by natural selection — Carl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Authoring a personal GPT for your research and practice: How we created the Q... — Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S... — Sérgio Sacani
We report the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ± 0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
The binding of cosmological structures by massless topological defects — Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ... — Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at z = 2.9 wi... — Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in the host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9) despite a host galaxy with low extinction and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
The debris of the ‘last major merger’ is dynamically young — Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
Gadgets for management of stored product pests_Dr.UPR.pdf
Research Methods in Natural Language Processing
1. Research Methods in Natural Language Processing
Pham Quang Nhat Minh
FPT Technology Research Institute
FPT University
minhpqn2@fe.edu.vn
April 16, 2017
2. Objectives of the lecture
Introduce research know-how and good practices for doing
research
Focus on the NLP/Machine Learning/Data Science fields
Share my research experiences in the NLP field
Pham Quang Nhat Minh Research Methods in NLP 2/70
3. Table of Contents
1 What are empirical research methods for computer science?
2 How to choose a good research topic?
3 How to read a scientific paper?
4 How to work with your advisor
5 Doing research in NLP field
What is NLP?
What is it like doing research in NLP?
How to do research in NLP?
How to choose NLP papers to read?
6 Coding practices for NLP/Machine Learning research work
7 My research stories
4. Acknowledgements
Much of the content in this lecture comes from the documents in the
references
(Alon, 2009) How To Choose a Good Scientific Problem
(Wilson et al., 2012) Best Practices for Scientific Computing
Paul Cohen: Empirical Methods for AI & CS
Other documents, blogs
6. What does “empirical” mean?
Relying on observations, data, experiments
Empirical work should complement theoretical work
Theories often have holes (e.g., How big is the constant term?)
Theories are suggested by observations
Theories are tested by observations
Conversely, theories direct our empirical attention
In addition, empirical means “wanting to understand
behaviour of complex systems”
In NLP, we may want to understand how features are
correlated
7. Why do we need empirical methods?
Theory-based science need not be all theorems
We do not know how a theory works under different conditions
Different data sets, domains
8. Empirical methods in CS/AI
Data observation
Construct hypotheses
Test with empirical experiments
Refine hypotheses and modelling assumptions
9. Kinds of data analysis
Exploratory (EDA) - looking for patterns in data
Statistical inferences from sample data
Testing hypotheses
Estimating parameters
Building mathematical models of datasets
Machine learning, data mining...
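As a small illustration of statistical inference from sample data, the snippet below runs a two-sample t-test with scipy; the samples are synthetic stand-ins (e.g., imagine scores of two systems over 20 test splits), not real measurements:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: scores of two hypothetical systems on 20 test splits
rng = np.random.default_rng(0)
scores_a = rng.normal(loc=0.80, scale=0.02, size=20)
scores_b = rng.normal(loc=0.83, scale=0.02, size=20)

# Test the null hypothesis that the two systems have the same mean score
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests a real difference
```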
10. Tools for data analysis
R programming language
Python:
numpy
scipy
pandas
matplotlib for data visualization
My biased opinions:
statisticians tend to prefer R; computer scientists often use Python
Python is much easier to learn than R
11. Exercises
Install R: https://www.r-project.org
Download the data file ex1data1.txt from:
http://tinyurl.com/m7bpp8d
The data file has two columns:
First column: the population of a city.
Second column: the profit of a food truck in that city.
In R terminal, try the plot code
df <- read.table("./ex1data1.txt", sep=",", header=FALSE)
plot(df[,1], df[,2], xlab="Population of City in 10,000s",
ylab="Profit in $10,000s")
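For those who prefer Python, a minimal pandas/matplotlib sketch of the same scatter plot; the inline rows are a stand-in for the first lines of ex1data1.txt (in practice, use pd.read_csv on the downloaded file):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend: write to a file instead of a window
import matplotlib.pyplot as plt

# Stand-in for read.table("./ex1data1.txt", sep=","); real code: pd.read_csv(...)
df = pd.DataFrame([[6.1101, 17.592], [5.5277, 9.1302],
                   [8.5186, 13.662], [7.0032, 11.854]],
                  columns=["population", "profit"])

plt.scatter(df["population"], df["profit"])
plt.xlabel("Population of City in 10,000s")
plt.ylabel("Profit in $10,000s")
plt.savefig("ex1_scatter.png")
```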
12. R for data visualization
14. Why do we need to choose a good research topic?
“Garbage in, garbage out” principle
You may work with a research topic for years
1 year for a master thesis
3 years or more for a Ph.D. dissertation
It is painful to work on things you find uninteresting
You lack passion, motivation, and ideas
Much frustration and bitterness
15. What is a good research topic?
(Alon, 2009) Two Dimensions of Problem Choice
Feasibility: whether a problem is hard or easy
We can measure the feasibility as the expected time to
complete the project
Feasibility is a function of the skills of students/researchers
and of the technology in the lab.
Interest: the increase in knowledge expected from the project.
16. Two-dimensional space of Problem Choice (1)
Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
17. Two-dimensional space of Problem Choice (2)
Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
18. What is a good research topic?
Do many people care about the topic?
Research community, your supervisors, industry demands
Are you really interested in the topic?
The topic should be interesting to you rather than to others
Good signs: “ideas and questions that come back again and
again to your mind for months or years.”
19. How to choose a good research topic: step by step
Choose the broad (general) topic
E.g., Machine Translation
Draw a hierarchy of research topics, starting from the broad
topic
Review literature to look for gaps in previous work
Choose the focused topic
E.g., Phrase-based Machine Translation
Find gaps in previous work
Form research questions in the focused topic
From research questions, formulate the research problem
20. Finding a research problem
Take your time to choose a good research topic
(Alon, 2009): Rule for new Ph.D. students and postdocs: “Do
not commit to a problem before 3 months have elapsed”
For master students, take 1-2 months to choose the research
topic before you start the research project.
Join projects in your laboratory
Many thesis ideas come from projects you are involved in
21. Developing your research ideas
Where do research ideas come from?
Observations
Data observations, data analysis, discover patterns in data
Reading papers, attending conferences, listening to talks
Techniques and methods from other disciplines, fields
Imagination
Suggestions from your advisor
22. Reading papers, attending conferences
Choose good and relevant papers. Consider:
Impact factors of the journal.
In the NLP field, choose papers from top conferences, journals
(ACL/NAACL/EMNLP/COLING)
The Top 10 NLP Conferences:
http://www.junglelightspeed.com/the-top-10-nlp-conferences
Reputations of authors and their organizations
Do not just read papers; criticize them and find the gaps
23. Techniques, methods from other fields
Expand your view and your problem-solving methodologies by
regularly reading articles in other fields.
An example is the image captioning task
We need techniques from both computer vision and
NLP.
24. What happens after we choose a problem? (Alon, 2009)
26. Two types of readings
Fast readings
Get and understand the basic ideas of the paper
Know the problem the paper attacks and how it solves it
Put the paper in the “big picture” of the field
Know the differences between the paper and previous
work
We do “fast reading” mostly when we survey the literature and
choose a broad topic
Deep readings
Understand the details of presented methods
Try to understand how the proposed method works
Criticize the paper and find its limitations
If you were the authors, how would you solve the problem?
Propose alternative methods?
We do “deep reading” mostly when we look for a focused topic
27. How to read a scientific paper (1)
Michael J. Hanson. Efficient Reading of Papers in Science and Technology: http://tinyurl.com/qdebynz
28. How to read a scientific paper (2)
Decide what to read
Read title, abstract
Read it, file it, or skip it
Read for breadth
What did they do?
Skim introduction, headings, graphics, definitions, conclusions
and bibliography.
Consider the credibility.
How useful is it?
Decide whether to go on.
29. How to read a scientific paper (3)
Read in depth
How did they do it?
Challenge their arguments.
Examine assumptions.
Examine methods.
Examine statistics.
Examine reasoning and conclusions.
How can I apply their approach to my work?
Take notes
Make notes as you read.
Highlight major points.
Note new terms and definitions.
Summarize tables and graphs.
Write a summary.
30. Homework
Choose one scientific article that you want to read in depth; read it,
take notes, and explain the ideas and methods presented in the paper to
other students in a simple way.
Notes: You should be able to answer the following 3 questions.
What is the problem the paper attacks?
What are the differences between the paper and other existing
papers?
What are the interesting points of the presented methods?
32. Some basic rules
Your advisor is supposed to be very busy, so you should follow
up with him/her
Schedule meetings in advance and ask for them explicitly
Keep regular meetings with your advisor
Usually a weekly meeting
Do not just do what your advisor tells you to do
Rule of thumb: you should finish all your assigned tasks
before working on your own ideas
33. How to write a progress/status report
Michael Ernst. Writing a progress/status report:
http://tinyurl.com/zp7cdvt
Quote the previous week’s plan.
This helps you determine whether you accomplished your goals.
State this week’s progress.
What you have accomplished,
What you learned, what difficulties you overcame, what
difficulties are still blocking you,
Your new ideas for research directions or projects, etc.
Give the next week’s plan.
A good format is a bulleted list
Try to make each goal measurable: there should be no
ambiguity as to whether you were able to finish it.
It’s good to include longer-term goals as well.
34. Communicate with your advisor
Prepare some slides (3-4 slides) to make the discussion
concrete
Send the materials at least 24 hours before the meeting day
Arrange the meeting in advance
Your advisor is not always right
Actually, you know more about your work than he/she does
If you have data, evidence, or proof, do not hesitate to debate
Do not say “I guess” or “I think” when you explain something.
Use data, evidence, and references instead
36. What is Natural Language Processing?
A field of computer science, artificial intelligence, and
computational linguistics
To get computers to perform useful tasks involving human
languages
Human-Machine communication
Improving human-human communication
E.g., Machine Translation
Extracting information from texts
37. Why is NLP interesting?
Languages involve many human activities
Reading, writing, speaking, listening
Voice can be used as a user interface in many applications
Remote controls, virtual assistants like Siri, ...
NLP is used to acquire insights from massive amounts of
textual data
E.g., hypotheses from medical and health reports
NLP has many applications
NLP is hard!
38. NLP problems
Fundamental problems
Word Segmentation
Part-of-speech tagging
Syntactic Analysis
Semantic Analysis
Application problems
Information Retrieval
Information Extraction
Question Answering
Text Summarization
Machine Translation
39. What is it like doing research in NLP?
Empirical methods are widely applied in NLP
Relying on observations, data, experiments
Research consists of many experimental loops
Identify the problem → Create ideas → Test the best idea →
Analyse results → Identify the problem → Create ideas → · · ·
40. What is it like doing research in NLP?
Many ideas do not work
Even so, we need to analyse the results and understand
why they do not work, in order to come up with new ideas.
Try the next idea
Failures occur more often than successes
Try to increase the number of experiments
(No of successes) = (No of experiments) × (Success rate)
41. The typical working day of an NLP researcher
Data observation and data/result analysis (a lot)
Discuss ideas with colleagues
Do experiments (run the program) to test ideas
Read papers to keep up to date with mainstream research
Investigate new NLP/Machine Learning tools, libraries (less
regular)
42. How to learn NLP?
Research starts from learning
Learn/review background about:
Probability and Statistics
Basic math (linear algebra, calculus)
Machine Learning
Programming
Read NLP textbooks
Jurafsky, D., & Martin, J.H. Speech and Language Processing:
an Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition.
Manning, C.D., & Schutze, H. Foundations of statistical
natural language processing.
43. How to learn NLP: Get your hands dirty
Practice with programming exercises:
100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises
NLP Programming Tutorial, by Graham Neubig:
http://www.phontron.com/teaching.php
Compete in Kaggle data science challenges (kaggle.com)
44. Finding an NLP research problem
All the principles in the section “How to choose a good
research topic” apply.
Looking for ideas from related fields
Linguistics
Machine learning: the mainstream in the NLP field is applying
machine learning methods to NLP problems
Computer vision
Looking at data
It is actually my daily task
45. Basic rules to choose NLP papers
READ:
Papers in top conferences and journals in NLP and other
related fields
(ACL/EMNLP/NAACL/EACL/COLING/CoNLL/...)
Workshops that focus on an NLP sub-field
Short papers at top conferences
PhD dissertations from top institutions/advisors
Papers with many citations
Textbooks from leading researchers
For more information, see: The Top 10 NLP Conferences:
http://www.junglelightspeed.com/the-top-10-nlp-conferences/
47. Why is coding important in NLP/ML research?
Most NLP/ML research work consists of empirical studies
Need to do data analysis, run experiments to test our ideas
So, we have to write programs
Even theorists should program, too
“Implementing your own algorithm is a good way of checking
your work. If you aren’t implementing your algorithm,
arguably you’re skipping a key step in checking your results.”
—Michael Mitzenmacher
http://mybiasedcoin.blogspot.com/2008/11/bugs.html
48. Why do we care about coding practices in NLP research?
Bad coding practices cause problems
You find errors in the experimental results right before the
paper submission deadline
You cannot understand your own code after some months
You deleted intermediate results, so you cannot verify the code
You do not know the technique to verify experimental results
You did not test the code, and then use untested code for
experiments
You spend a long time refactoring the code
You cannot get back the version that generated the best
results
...
49. Why do we care about coding practices in NLP research?
Good coding practices speed up our research work
Recall that:
(No of successes) = (No of experiments) × (Success rate)
50. Best Practices for Scientific Computing
(Wilson et al., 2012)
1- Write programs for people, not computers.
Readers of the code do not need to remember too much
Easy to read: names should be consistent, distinctive, and
meaningful
Break down the coding work into one-hour-long tasks
2- Automate repetitive tasks
Scientists should rely on the computer to repeat tasks.
Use a script to run programs!
Use a build tool to automate scientific workflows
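A minimal sketch of automating repeated runs in Python; the script name train.py and its flags are hypothetical placeholders, not a real tool:

```python
import itertools
import sys

# Hypothetical parameter grid; "train.py", "--lr" and "--ngram" are placeholders
learning_rates = [0.1, 0.01]
ngram_orders = [1, 2]

# Build one command line per parameter combination instead of typing each by hand
commands = [
    [sys.executable, "train.py", "--lr", str(lr), "--ngram", str(n)]
    for lr, n in itertools.product(learning_rates, ngram_orders)
]

for cmd in commands:
    print(" ".join(cmd))  # a real runner would call subprocess.run(cmd, check=True)
```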
51. Best Practices for Scientific Computing
3- Use the computer to record history
Unique identifiers and version numbers for raw data records
Unique identifiers and version number for programs and
libraries
The values of parameters used to generate any given output;
The names and version number of programs used to generate
those outputs.
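A minimal, illustrative sketch of recording history in Python (the function name and record fields are assumptions, not a standard API): every output file carries the parameters, interpreter version, and a unique run identifier used to produce it:

```python
import json
import platform
import time
import uuid

def save_run(results: dict, params: dict, out_prefix: str) -> str:
    """Write results together with the provenance needed to reproduce them."""
    run_id = uuid.uuid4().hex[:8]  # unique identifier for this run
    record = {
        "run_id": run_id,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "python_version": platform.python_version(),  # interpreter version
        "params": params,        # parameter values used to generate this output
        "results": results,
    }
    path = f"{out_prefix}_{run_id}.json"
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path

path = save_run({"accuracy": 0.91}, {"model": "naive_bayes", "ngram": 1}, "exp")
```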
4- Make incremental changes
Scientists cannot know what their programs should do next
until the current version has produced some results.
Should work in small steps with frequent feedback and
correction!
52. Best Practices for Scientific Computing
5- Use a version control system: git, mercurial, subversion. Push
code to GitHub or Bitbucket
Everything that has been created manually should be put in
version control
6- Do not repeat yourself (or others)
At small-scale, code should be modularized rather than copied
and pasted.
At large-scale, scientific programmers should re-use code
instead of re-writing it.
53. Best Practices for Scientific Computing
7- Plan for mistakes
Write and run tests
Unit Test: Check the correctness of each single software unit
Integration Test: Check that pieces of unit code work
correctly when combined.
Regression Test: Running pre-existing code tests after changes
to the code in order to make sure that it hasn’t regressed.
Use an off-the-shelf unit testing library
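A minimal sketch of a unit test with Python's off-the-shelf unittest library; the tokenizer is a toy function, present only to show the pattern:

```python
import unittest

def tokenize(text: str) -> list:
    """Toy whitespace tokenizer; stands in for a real software unit."""
    return text.split()

class TestTokenize(unittest.TestCase):
    def test_simple_sentence(self):
        self.assertEqual(tokenize("research starts from learning"),
                         ["research", "starts", "from", "learning"])

    def test_empty_input(self):
        self.assertEqual(tokenize(""), [])

if __name__ == "__main__":
    unittest.main(exit=False)  # run the tests without killing the interpreter
```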
54. Best Practices for Scientific Computing
8- Optimize software only after it works correctly
Use profiler to identify bottlenecks
Write code in the highest-level language possible
Python is a recommended language for research
Only use a low-level programming language when you are sure
that a performance boost is needed.
Use the highest-level programming language for rapid
prototyping.
55. 9- Document design, and purpose, not mechanics
Document interface and reasons, not implementations
Do not do this:
i = i + 1  # Increment the variable 'i' by one.
Refactor the code instead of explaining how it works
Embed the documentation for a piece of software in that
software
Use software to generate documentation.
10- Collaborate
Use pre-merge code reviews
Use an issue tracking tool.
56. Coding practices for NLP/ML research
All general practices apply for NLP/ML research
Separate a process into small processes
Use pipelines in Unix/Linux
Make use of tools in experiments
Linux commands
NLP/ML Tools
Libraries (json, nltk, matplotlib, scikit-learn,...)
Algorithms
E.g., show statistics of the words in a text file
cat file_name.txt | cut -f1 | sort | uniq -c | sort -nr
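The same kind of frequency count can be done in Python with collections.Counter (equivalent in spirit to the Unix pipeline above; the sample lines here are made up):

```python
from collections import Counter

def field_counts(lines):
    """Count first tab-separated fields, like: cut -f1 | sort | uniq -c | sort -nr"""
    return Counter(line.split("\t")[0] for line in lines if line.strip()).most_common()

sample = ["the\tDT", "cat\tNN", "the\tDT", "sat\tVBD", "the\tDT"]
print(field_counts(sample))  # [('the', 3), ('cat', 1), ('sat', 1)]
```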
Visualize experimental results, make demo for your research
results
58. Optimize code only after your ideas work
“Make it work. Make it right. Make it fast.” (Kent Beck)
“Premature optimization is the root of all evil (or at least
most of it) in programming.” (Donald Knuth)
In NLP, always start with a simple and dirty working version
E.g., bag-of-words features and the Naive Bayes algorithm in text
classification tasks
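As an illustrative sketch (toy data, not a real benchmark), such a quick bag-of-words + Naive Bayes baseline takes only a few lines with scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: the point is to get a working end-to-end baseline quickly
texts = ["great match and a late goal", "stocks fell sharply today",
         "the striker scored twice", "markets rallied on earnings"]
labels = ["sport", "finance", "sport", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words + NB
clf.fit(texts, labels)
print(clf.predict(["the goal was scored late"]))
```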
60. My profile
6/2006: B.Sc. in Information Technology from University of
Engineering and Technology, Vietnam National University,
Hanoi
3/2010: M.Sc. in Information Science from Japan Advanced
Institute of Science and Technology
3/2013: Ph.D. in Information Science from Japan Advanced
Institute of Science and Technology
61. Master program at JAIST
JAIST is a public graduate institute in Japan
Homepage: https://www.jaist.ac.jp/english
Three schools
Information Science
Knowledge Science
Material Science
All courses have an English version
You can study entirely in English
62. Master program at JAIST
Two-year full-time master program
First year:
Students are temporarily assigned to a laboratory, and select
the official lab after 3 months
In the first year, mainly taking courses and choosing the
master research topic
Write the research proposal for the master thesis at the end of the
first year
Second year:
Finish all remaining course work
Work on the master research project
Look for jobs (students who do not pursue a Ph.D.)
63. How did I finish my master?
Six months before entering master program
Take Japanese course
Review background
Read NLP Textbooks
First year:
Finish all course work
Join a research project in my laboratory
Choose the research topic
Second year:
Do research
Attend one international conference
Thesis defense
64. How I chose my master thesis topic
I did not even know how to choose a research topic (crying)
You should know how to choose
I was assigned the topic by my co-advisor
The topic was sentence insertion
I proposed a method to improve on the previous results
65. Sentence insertion task
Task: automatically update a Wikipedia article by inserting
new information into it.
I proposed using word clusters to capture the meaning of words
66. Research projects at FPT Technology Research Institute
NLP problems in chatbot development
Intent classification
Named entity recognition
FAQ generation from chat history, manuals
Figure: Source: stanfy.com (http://tinyurl.com/mdfsa6h)
67. Summary
Empirical research methods rely on observations, data,
experiments
Two dimensions of problem choice: Feasibility and Interest
Research starts from learning
Reading is very important in research
NLP research involves much data analysis
Coding practices for NLP/ML research
68. Check-list for your master thesis
1 Is your work reproducible?
Package your code so that it can automatically generate the
results by a single script
Freeze the final version
2 Is your proposed method new?
3 Did you revise your thesis many times?
Ask your advisors and friends for proofreading
4 Did you understand previous work?
5 Do you think you can pass the master thesis defense?
69. Advice for your master thesis
Take time to choose your master research topic
Work on a research problem that you are interested in
Start soon
Follow up with your advisor
Spend time on regular literature review (reading papers)
Commit at least 2-3 hours per day to your master research
Look at your data before you start doing something
Follow “best” coding practices for research
Use version control
Version everything that is manually created
Back up your work in the cloud
70. References
Alon, U. (2009). How to choose a good scientific problem.
Molecular Cell, 35(6), 726-728.
Wilson, G., Aruliah, D.A., Brown, C.T., Chue Hong, N.P., Davis, M.,
Guy, R.T., Haddock, S.H.D., Huff, K.D., Mitchell, I.M., Plumbley, M.D.,
Waugh, B., White, E.P., & Wilson, P. (2014). Best Practices for
Scientific Computing. PLoS Biology, 12(1): e1001745.
Ali Eslami. Patterns for Research in Machine Learning
http://arkitus.com/patterns-for-research-in-machine-learning