Regular Expressions
& Regular Languages
slideshare: http://www.slideshare.net/marinasantini1/regular-expressions-and-regular-languages
Mathematics for Language Technology
http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/
Last Updated 6 March 2015
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2015
1
Acknowledgements
 Several slides borrowed from Jurafsky and Martin (2009).
 Practical activities by Mats Dahllöf and Jurafsky and Martin (2009).
2
Reading
 Required Reading:
  E&G (2013): Ch. 9 (pp. 252-256)
  Compendium (3): 7.2, 7.3, 8.2.3
  Mats Dahllöf: Reguljära uttryck
•  http://stp.lingfil.uu.se/~matsd/uv/uv14/mfst/dok/oh6.pdf
 Further Reading:
  Chapter 2 in Jurafsky D. & Martin J. (2009) Speech and Language Processing:
An introduction to natural language processing, computational linguistics, and
speech recognition. Online draft version:
http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
3
Outline
 Regular Expressions
 Regular Languages
 Practical Activities
 (Pumping Lemma)
4
5
Regular Expressions
Definitions
Equivalence to Finite Automata
6
Regular Expressions and Text Searching
 Everybody does it
  Emacs, vi, perl, grep, etc.
 Regular expressions are a compact
textual representation of a set of strings
representing a language.
7
Example
 Find all the instances of the word “the”
in a text.
  /the/
  /[tT]he/
  /\b[tT]he\b/
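The three patterns can be tried out in a minimal Python sketch. It assumes that Python's re module stands in for the /.../ notation, that the last pattern is written with the word-boundary anchor \b, and that the sample sentence (made up for illustration) is representative enough.

```python
import re

# Hypothetical sample text containing "the", "The" and several look-alikes.
text = "The other day the cat sat there, then left the room."

patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]

def find_all(pattern, s):
    """Return every non-overlapping substring of s matched by pattern."""
    return re.findall(pattern, s)

for p in patterns:
    print(p, find_all(p, text))
```

The first pattern misses "The" and also fires inside "other", "there" and "then"; the second recovers "The" but keeps the spurious hits; only the third returns the three whole-word occurrences.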
8
Errors
 The process we just went through was
based on fixing two kinds of errors
  Matching strings that we should not have
matched (there, then, other)
•  False positives (Type I)
  Not matching things that we should have
matched (The)
•  False negatives (Type II)
9
Errors
 Reducing the error rate for an application
often involves two antagonistic efforts:
  Increasing accuracy, or precision, (minimizing
false positives)
  Increasing coverage, or recall, (minimizing
false negatives).
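As a rough illustration of the trade-off, the sketch below computes precision and recall for the three "the" patterns on a made-up sentence. The sample text, the token-level gold standard and the evaluate helper are assumptions introduced here, not part of the slides.

```python
import re

# Hypothetical sample text; the gold standard is every token equal to "the",
# ignoring case.
text = "The other day the cat sat there, then left the room."
tokens = re.findall(r"[A-Za-z]+", text)
gold = {i for i, tok in enumerate(tokens) if tok.lower() == "the"}

def evaluate(pattern):
    """Count a token as retrieved if the pattern matches anywhere inside it."""
    found = {i for i, tok in enumerate(tokens) if re.search(pattern, tok)}
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

for p in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(p, evaluate(p))
```

On this sample, adding [tT] raises recall (fewer false negatives), and adding the word boundaries raises precision (fewer false positives).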
10
REs: What are they?
 Regular expressions describe
languages by an algebra.
Link: https://www.youtube.com/watch?v=eOfMcdeyrMU
11
DFA
12
Converting the regular expression
(a|b)* to a DFA
13
Converting the regular expression
(a*|b*)* to a DFA
14
Converting the regular expression
ab(a|b)* to a DFA
15
Remember the Jeff Ullman video?
16
17
Operations on Languages
 REs use three operations:
  union
  concatenation
  Kleene star (*) [cleany star]
Union ∪ (aka: disjunction, OR, |, +)
 The union of languages is the usual
thing, since languages are sets.
 Example: {01,111,10}∪{00, 01} =
{01,111,10,00}.
18
01 happens to be in both
sets, so it appears only once in the
union
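Since languages are sets, the example can be reproduced with Python's built-in set type. This is only an illustrative sketch; the variable names are not from the slides.

```python
# The two finite languages from the example, as sets of strings.
L = {"01", "111", "10"}
M = {"00", "01"}

# Set union: a string that is in both languages appears only once.
union = L | M
print(sorted(union))
```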
19
Concatenation: represented by juxtaposition (no punctuation)
or middle dot ( · )
 The concatenation of languages
L and M is denoted LM.
 It contains every string wx such
that w is in L and x is in M.
 Example: {01,111,10}{00, 01}
= {0100, 0101, 11100, 11101,
1000, 1001}. In the example, we take 01 from the first language,
and we concatenate it with 00 in the second language.
That gives us 0100.
We then take 01 from the first language again, and we
concatenate it with 01 in the second language, and that
gives us 0101.
Then we take 111 from the first language and we
concatenate it with 00 in the second language, and
this gives us 11100
…. and so on.
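The pairing procedure described above can be sketched as a set comprehension. The function name concat is a hypothetical helper chosen for this illustration.

```python
def concat(L, M):
    """Concatenation of two languages: every string wx with w in L and x in M."""
    return {w + x for w in L for x in M}

L = {"01", "111", "10"}
M = {"00", "01"}
print(sorted(concat(L, M)))
```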
20
Kleene Star: represented by an asterisk
aka star (*)
 If L is a language, then L*, the Kleene
star or just “star,” is the set of strings
formed by concatenating zero or more
strings from L, in any order.
 L* = {ε} ∪ L ∪ LL ∪ LLL ∪ …
 Example: {0,10}* = {ε, 0, 10, 00, 010,
100, 1010,…}
If you take no strings from L, that would give you the empty string.
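L* is usually infinite, so a program can only list part of it. The sketch below enumerates the strings of L* up to a chosen length; star_up_to is a hypothetical helper, and the length bound is an assumption made only to keep the output finite.

```python
def star_up_to(L, max_len):
    """Return the strings of L* whose length is at most max_len."""
    result = {""}      # taking zero strings from L gives the empty string
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more string from L.
        frontier = {w + x for w in frontier for x in L
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

print(sorted(star_up_to({"0", "10"}, 4), key=lambda s: (len(s), s)))
```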
IMPORTANT!
 FROM NOW ON, LET’S STICK TO THE
FOLLOWING CONVENTIONS (OTHERWISE WE
WILL BE CONFUSED):
  Union ∪ (aka: disjunction, OR) represented by: | or +
  Concatenation: represented by juxtaposition (= no
punctuation) or middle dot ( · )
  Kleene Star: represented by *
21
22
Precedence of Operators
 Parentheses may be used wherever
needed to influence the grouping of
operators.
 Order of precedence is * (highest), then
concatenation, then + (lowest).
Remember: + = union/disjunction
23
Examples: REs
1.  L(01) = {01}.
2.  L(01+0) = {01, 0}.
3.  L(0(1+0)) = {01, 00}.
  Note order of precedence of
operators.
4.  L(0*) = {ε, 0, 00, 000,… }.
5.  L((0+10)*(ε+1)) = all strings
of 0s and 1s without two
consecutive 1s.
1) The regular expression 01 represents the
concatenation of the language consisting of one
string, 0 and the language consisting of one string, 1.
The result is the language containing the one string
01.
2) The language of 01+0 is the union of the language
containing only string 01 and the language containing
only string 0.
3) The language of 0 concatenated with 1+0 is the
two strings 01 and 00. Notice that we need
parentheses to force the + to group first. Without
them, since concatenation takes precedence over +,
we get the interpretation in the second example.
4) The language of 0* is the star of the language
containing only the string 0. This is all strings of 0’s,
including the empty string.
5) This example denotes the language with all strings
of 0s and 1s without two consecutive 1s. To see why
this works, in every such string, each 1 is either
followed immediately by a 0, or it comes at the end of
the string. (0+10)* denotes all strings in which every
1 is followed by a 0. These strings are surely in the
language we want. But we also want these strings
followed by a final 1. Thus, we concatenate the
language of (0+10)* with epsilon+1. This
concatenation gives us all the strings where 1s are
followed by 0s, plus all those strings with an
additional 1 at the end.
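The claim in example 5 can be checked by brute force on short strings. The sketch below assumes Python's re syntax, where union is written | and (ε+1) becomes the optional 1?; the length bound of 10 is arbitrary.

```python
import re
from itertools import product

# (0+10)*(ε+1) in the slides' notation, rewritten in Python regex syntax.
pattern = re.compile(r"(0|10)*1?")

def all_binary_strings(max_len):
    """Yield every string over {0,1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def in_language(s):
    """True if the whole string s is matched by the expression."""
    return pattern.fullmatch(s) is not None

# Strings on which the expression and the property "no two consecutive 1s" disagree.
mismatches = [s for s in all_binary_strings(10)
              if in_language(s) != ("11" not in s)]
print(mismatches)
```

An empty list means the expression and the property agree on every string tested; this is a sanity check, not a proof.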
24
Equivalence of REs and Finite
Automata
 For every RE, there is a finite automaton
that accepts the same language.
 And we need to show that for every finite
automaton, there is an RE defining its
language.
25
Summary
Automata and regular expressions define
exactly the same set of languages: the
regular languages.
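To make the equivalence concrete, here is a small sketch that hand-builds a DFA for ab(a|b)* (one of the expressions converted earlier) and compares it with Python's re module on all short strings. The state names and the transition table are assumptions of this sketch, not a reproduction of the slide's diagram.

```python
import re
from itertools import product

# Hand-built DFA for ab(a|b)*; state names are illustrative only.
START, SEEN_A, ACCEPT, DEAD = "q0", "q1", "q2", "dead"
delta = {
    (START, "a"): SEEN_A,  (START, "b"): DEAD,
    (SEEN_A, "a"): DEAD,   (SEEN_A, "b"): ACCEPT,
    (ACCEPT, "a"): ACCEPT, (ACCEPT, "b"): ACCEPT,
    (DEAD, "a"): DEAD,     (DEAD, "b"): DEAD,
}

def dfa_accepts(s):
    """Run the DFA on s and report whether it ends in the accepting state."""
    state = START
    for ch in s:
        state = delta[(state, ch)]
    return state == ACCEPT

regex = re.compile(r"ab(a|b)*")

# Compare automaton and expression on every string over {a,b} up to length 8.
strings = ("".join(p) for n in range(9) for p in product("ab", repeat=n))
disagreements = [s for s in strings
                 if dfa_accepts(s) != bool(regex.fullmatch(s))]
print(disagreements)
```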
REGULAR LANGUAGES
26
27
The Chomsky Hierarchy
Regular
(DFA)
Context-
free
(PDA)
Context-
sensitive
(LBA)
Recursively-
enumerable
(TM)
•  Hierarchy of classes of formal languages
One grammar is of greater generative power or complexity than another if
it can define a language that the other cannot define. Context-free grammars
are more powerful than regular grammars.
28
Regular Languages
 A language L is regular if it is the
language accepted by some DFA.
  Note: the DFA must accept only the strings
in L, no others.
 Some languages are not regular.
Only languages that meet the following criteria
are regular languages:
29
  Regular languages derive their name from the fact that the
strings they recognize are (in a formal computer science sense)
“regular.”
  This implies that there are certain kinds of strings that it will be
very hard, if not impossible, to recognize with regular
expressions, especially nested syntactic structures in natural
language.
30
Formal languages vs regular
languages
 A formal language is a set of strings,
each string composed of symbols from
a finite set called an alphabet.
  Ex: {a, b, !}
 Formal languages are not the same as
regular languages….
31
32
But Many Languages are Regular
 They appear in many contexts and have
many useful properties.
How to tell if a language is not regular
 The most common way to prove that a
language is regular is to build a regular
expression for the language.
33
Pumping Lemma
34
Practical Activity 1
 The language L contains all strings over the
alphabet {a,b} that begin with a and end with b,
ie:
 Write a regular expression that defines
the language L.
35
Practical Activity 1:
Possible Solution
36
Your Solutions
37
In between the concatenation of a
and b there must be 0 or more
unions (disjunctions) of a and b.
Reference: slides 17-22
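One expression that fits the description above is a(a|b)*b. This is an assumed reading, since the solution figure is not reproduced here. The sketch checks it against the plain definition of L on all short strings; in_L and the length bound are illustrative choices.

```python
import re
from itertools import product

# Candidate solution, written in Python regex syntax (assumed, see above).
candidate = re.compile(r"a(a|b)*b")

def in_L(s):
    """Direct definition of L: the string begins with a and ends with b."""
    return s.startswith("a") and s.endswith("b")

# Compare the candidate with the definition on every string up to length 8.
strings = ("".join(p) for n in range(9) for p in product("ab", repeat=n))
counterexamples = [s for s in strings
                   if bool(candidate.fullmatch(s)) != in_L(s)]
print(counterexamples)
```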
Practical Activity 2
 Draw a deterministic finite-state automaton
that accepts the following regular expression:
38
( (ab) | c)*
Alternative notation style:
ie: 0 or more occurrences of
the disjunction ab | c
Test the
automaton with
these strings and
check which are in
the language:
0
abc
a
ab
cccabc
cbacccabababccc
….
Practical Activity 2:
Possible Correct Solution
39
Having the initial state as a final state gives us the empty string as an element in the language.
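A possible automaton for ((ab)|c)* can be sketched as a transition table. The two state names are assumptions of this sketch, and a missing transition is treated as rejection instead of drawing an explicit dead state.

```python
# Sketch of a DFA for ((ab)|c)*: q0 is both initial and final,
# q1 means "an a has been read and a b must follow".
START = "q0"
FINAL = {"q0"}
delta = {
    ("q0", "a"): "q1",
    ("q0", "c"): "q0",
    ("q1", "b"): "q0",
}

def accepts(s):
    """Run the automaton on s; an undefined transition rejects the string."""
    state = START
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in FINAL

for s in ["", "abc", "a", "ab", "cccabc", "cbacccabababccc"]:
    print(repr(s), accepts(s))
```

With this table the empty string, abc, ab and cccabc are accepted, while a and cbacccabababccc are rejected, because a lone a must be followed by b and a b may only follow an a.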
Your solutions (1): when we interpret ”+” as
disjunction, these solutions are wrong because
”c” happens only after ”a” and ”b”…
40
Test
these
automata
with the
string on
slide 35
Your solutions (2): same as
previous slide. In addition, here no
final states are shown…
41
Test
these
automata
with the
string on
slide 35
Practical Activity 3
  Construct a grep regular expression that
matches patterns containing at least one
“ab” followed by any number of bs.
  Construct a grep regular expression that
matches any number between 1000 and
9999.
42
Practical Activity 3:
Possible Solutions
  grep -E '(ab)+b*'
  grep -E '[1-9][0-9]{3}'
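The two patterns can also be tried in Python, whose re syntax is close to extended (grep -E style) regular expressions for these constructs. The word boundaries \b around the number pattern are an addition made in this sketch so that longer digit runs such as 10000 are not partially matched; the test strings are made up.

```python
import re

# At least one "ab" followed by any number of b's.
ab_pattern = re.compile(r"(ab)+b*")

# A number between 1000 and 9999; \b is added here (an assumption of this
# sketch) to avoid matching four digits inside a longer number.
four_digit = re.compile(r"\b[1-9][0-9]{3}\b")

print(ab_pattern.search("xxabbb").group())
print(four_digit.findall("7 999 1000 4321 9999 10000 0123"))
```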
43
Exercises: E&G (2013)
 Exercise (Övning) 9.40
 Optional: as many as you can
 After having completed the exercises,
check out the solutions at the end of the
book.
44
The End
45

Contenu connexe

Tendances

Regular language and Regular expression
Regular language and Regular expressionRegular language and Regular expression
Regular language and Regular expressionAnimesh Chaturvedi
 
TOC 1 | Introduction to Theory of Computation
TOC 1 | Introduction to Theory of ComputationTOC 1 | Introduction to Theory of Computation
TOC 1 | Introduction to Theory of ComputationMohammad Imam Hossain
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computationBipul Roy Bpl
 
Regular Languages
Regular LanguagesRegular Languages
Regular Languagesparmeet834
 
Deterministic Finite Automata (DFA)
Deterministic Finite Automata (DFA)Deterministic Finite Automata (DFA)
Deterministic Finite Automata (DFA)Animesh Chaturvedi
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2shah zeb
 
Introduction TO Finite Automata
Introduction TO Finite AutomataIntroduction TO Finite Automata
Introduction TO Finite AutomataRatnakar Mikkili
 
Context free grammars
Context free grammarsContext free grammars
Context free grammarsRonak Thakkar
 
Theory of Automata
Theory of AutomataTheory of Automata
Theory of AutomataFarooq Mian
 
Theory of Computation
Theory of ComputationTheory of Computation
Theory of ComputationShiraz316
 
Theory of Automata Lesson 02
Theory of Automata Lesson 02Theory of Automata Lesson 02
Theory of Automata Lesson 02hamzamughal39
 
Theory of computing
Theory of computingTheory of computing
Theory of computingRanjan Kumar
 
Theory of Computation Lecture Notes
Theory of Computation Lecture NotesTheory of Computation Lecture Notes
Theory of Computation Lecture NotesFellowBuddy.com
 

Tendances (20)

Regular language and Regular expression
Regular language and Regular expressionRegular language and Regular expression
Regular language and Regular expression
 
Finite Automata
Finite AutomataFinite Automata
Finite Automata
 
TOC 1 | Introduction to Theory of Computation
TOC 1 | Introduction to Theory of ComputationTOC 1 | Introduction to Theory of Computation
TOC 1 | Introduction to Theory of Computation
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computation
 
Regular Languages
Regular LanguagesRegular Languages
Regular Languages
 
Deterministic Finite Automata (DFA)
Deterministic Finite Automata (DFA)Deterministic Finite Automata (DFA)
Deterministic Finite Automata (DFA)
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
Introduction TO Finite Automata
Introduction TO Finite AutomataIntroduction TO Finite Automata
Introduction TO Finite Automata
 
TOC 5 | Regular Expressions
TOC 5 | Regular ExpressionsTOC 5 | Regular Expressions
TOC 5 | Regular Expressions
 
Theory of computation Lec2
Theory of computation Lec2Theory of computation Lec2
Theory of computation Lec2
 
Context free grammars
Context free grammarsContext free grammars
Context free grammars
 
Theory of Automata
Theory of AutomataTheory of Automata
Theory of Automata
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
Lecture: Automata
Lecture: AutomataLecture: Automata
Lecture: Automata
 
Theory of Computation
Theory of ComputationTheory of Computation
Theory of Computation
 
Theory of Automata Lesson 02
Theory of Automata Lesson 02Theory of Automata Lesson 02
Theory of Automata Lesson 02
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
Theory of Computation Lecture Notes
Theory of Computation Lecture NotesTheory of Computation Lecture Notes
Theory of Computation Lecture Notes
 
Lecture 7
Lecture 7Lecture 7
Lecture 7
 

Similaire à Lecture: Regular Expressions and Regular Languages

Theory of Computation - Lectures 4 and 5
Theory of Computation - Lectures 4 and 5Theory of Computation - Lectures 4 and 5
Theory of Computation - Lectures 4 and 5Dr. Maamoun Ahmed
 
hghghghhghghgggggggggggggggggggggggggggggggggg
hghghghhghghgggggggggggggggggggggggggggggggggghghghghhghghgggggggggggggggggggggggggggggggggg
hghghghhghghggggggggggggggggggggggggggggggggggadugnanegero
 
01-Introduction&Languages.pdf
01-Introduction&Languages.pdf01-Introduction&Languages.pdf
01-Introduction&Languages.pdfTariqSaeed80
 
RegularLanguage.pptx
RegularLanguage.pptxRegularLanguage.pptx
RegularLanguage.pptxTapasBhadra1
 
1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.ppt1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.pptMarvin886766
 
RegularExpressions.pdf
RegularExpressions.pdfRegularExpressions.pdf
RegularExpressions.pdfImranBhatti58
 
Chapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdfChapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdfDrIsikoIsaac
 
Automata
AutomataAutomata
AutomataGaditek
 
Automata
AutomataAutomata
AutomataGaditek
 
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurEnd semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurVivekananda Samiti
 

Similaire à Lecture: Regular Expressions and Regular Languages (20)

PART A.doc
PART A.docPART A.doc
PART A.doc
 
Theory of Computation - Lectures 4 and 5
Theory of Computation - Lectures 4 and 5Theory of Computation - Lectures 4 and 5
Theory of Computation - Lectures 4 and 5
 
hghghghhghghgggggggggggggggggggggggggggggggggg
hghghghhghghgggggggggggggggggggggggggggggggggghghghghhghghgggggggggggggggggggggggggggggggggg
hghghghhghghgggggggggggggggggggggggggggggggggg
 
Unit ii
Unit iiUnit ii
Unit ii
 
01-Introduction&Languages.pdf
01-Introduction&Languages.pdf01-Introduction&Languages.pdf
01-Introduction&Languages.pdf
 
RegularLanguage.pptx
RegularLanguage.pptxRegularLanguage.pptx
RegularLanguage.pptx
 
Flat unit 1
Flat unit 1Flat unit 1
Flat unit 1
 
rs1.ppt
rs1.pptrs1.ppt
rs1.ppt
 
1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.ppt1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.ppt
 
RegularExpressions.pdf
RegularExpressions.pdfRegularExpressions.pdf
RegularExpressions.pdf
 
L_2_apl.pptx
L_2_apl.pptxL_2_apl.pptx
L_2_apl.pptx
 
Dfa basics
Dfa basicsDfa basics
Dfa basics
 
Dfa basics
Dfa basicsDfa basics
Dfa basics
 
10651372.ppt
10651372.ppt10651372.ppt
10651372.ppt
 
QB104545.pdf
QB104545.pdfQB104545.pdf
QB104545.pdf
 
Chapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdfChapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdf
 
Automata
AutomataAutomata
Automata
 
Automata
AutomataAutomata
Automata
 
FSM.pdf
FSM.pdfFSM.pdf
FSM.pdf
 
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurEnd semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
End semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
 

Plus de Marina Santini

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Marina Santini
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-Marina Santini
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Marina Santini
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role LabelingMarina Santini
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational SemanticsMarina Santini
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Marina Santini
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Marina Santini
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 

Plus de Marina Santini (20)

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 

Dernier

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 

Dernier (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Sports & Fitness Value Added Course FY..

Lecture: Regular Expressions and Regular Languages

  • 1. Regular Expressions & Regular Languages. slideshare: http://www.slideshare.net/marinasantini1/regular-expressions-and-regular-languages Mathematics for Language Technology: http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/ Last updated 6 March 2015. Marina Santini, santinim@stp.lingfil.uu.se, Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden. Spring 2015.
  • 2. Acknowledgements. Several slides borrowed from Jurafsky and Martin (2009). Practical activities by Mats Dahllöf and Jurafsky and Martin (2009).
  • 3. Reading. Required reading: E&G (2013): Ch. 9 (pp. 252-256); Compendium (3): 7.2, 7.3, 8.2.3; Mats Dahllöf: Reguljära uttryck (Regular Expressions), http://stp.lingfil.uu.se/~matsd/uv/uv14/mfst/dok/oh6.pdf Further reading: Chapter 2 in Jurafsky D. & Martin J. (2009) Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Online draft version: http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
  • 6. Regular Expressions and Text Searching. Everybody does it: Emacs, vi, perl, grep, etc. Regular expressions are a compact textual representation of a set of strings representing a language.
  • 7. Example. Find all the instances of the word “the” in a text: first /the/, then /[tT]he/, then /\b[tT]he\b/.
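The three attempts at matching “the” can be tried out directly. The sketch below uses Python's `re` module rather than the slide's slash notation, and the sample sentence is made up for illustration.

```python
import re

# Made-up sample text; it contains "the" inside other words on purpose.
text = "The other day, the dog ran there and then to the park."

# /the/ : matches the bare substring, also inside "other", "there", "then"
plain = re.findall(r"the", text)

# /[tT]he/ : additionally accepts a capital T, still matches inside words
either_case = re.findall(r"[tT]he", text)

# /\b[tT]he\b/ : word boundaries keep only the standalone word
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), len(whole_word))
```

Only the last pattern returns exactly the three standalone occurrences of the word.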
  • 8. Errors. The process we just went through was based on fixing two kinds of errors: matching strings that we should not have matched (there, then, other), i.e. false positives (Type I); and not matching things that we should have matched (The), i.e. false negatives (Type II).
  • 9. Errors. Reducing the error rate for an application often involves two antagonistic efforts: increasing accuracy, or precision (minimizing false positives), and increasing coverage, or recall (minimizing false negatives).
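To make the trade-off concrete, precision and recall can be computed for each pattern on a small hand-labelled token list. This is only a sketch: the helper `precision_recall`, the tokens, and the gold labels are all invented for illustration.

```python
import re

def precision_recall(pattern, tokens, gold):
    """Score a pattern on tokens; gold[i] is True when tokens[i] should match."""
    predicted = [re.search(pattern, tok) is not None for tok in tokens]
    tp = sum(p and g for p, g in zip(predicted, gold))      # correct matches
    fp = sum(p and not g for p, g in zip(predicted, gold))  # false positives
    fn = sum(g and not p for p, g in zip(predicted, gold))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: only the standalone word "the"/"The" counts as a hit.
tokens = ["The", "the", "there", "then", "other", "the"]
gold = [True, True, False, False, False, True]

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, precision_recall(pattern, tokens, gold))
```

On this toy data, /[tT]he/ raises recall over /the/ by removing the false negative, and the word-boundary version then removes the false positives.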
  • 10. REs: What are they? Regular expressions describe languages by an algebra.
  • 13. Converting the regular expression (a|b)* to a DFA
  • 14. Converting the regular expression (a*|b*)* to a DFA
  • 15. Converting the regular expression ab(a|b)* to a DFA
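The slide diagrams are not part of this transcript, so the following is only a hand-built sketch of a DFA for ab(a|b)*, not necessarily the automaton drawn on the slide. The state names (`q0`, `q1`, `q2`, `dead`) are invented; the DFA is compared against Python's `re` on all strings over {a, b} up to length 6.

```python
import re
from itertools import product

# Transition table: q0 --a--> q1 --b--> q2, and q2 loops on a and b.
# Any missing transition leads to a non-accepting dead state.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "a"): "q2",
    ("q2", "b"): "q2",
}
START, ACCEPTING, DEAD = "q0", {"q2"}, "dead"

def accepts(string):
    """Run the DFA over the string and report whether it ends in an accepting state."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol), DEAD)
    return state in ACCEPTING

# Every string over {a, b} of length 0..6
strings = ["".join(p) for n in range(7) for p in product("ab", repeat=n)]

mismatches = [s for s in strings
              if accepts(s) != (re.fullmatch(r"ab(a|b)*", s) is not None)]
print(mismatches)
```

An empty list of mismatches means the DFA and the regular expression agree on every string tested, which is evidence (not a proof) that they define the same language.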
  • 16. Remember the Jeff Ullman video?
  • 17. Operations on Languages. REs use three operations: union, concatenation, and Kleene star (*) [cleany star].
  • 18. Union ∪ (aka: disjunction, OR, |, +). The union of languages is the usual set union, since languages are sets. Example: {01, 111, 10} ∪ {00, 01} = {01, 111, 10, 00}. 01 happens to be in both sets, so it appears only once in the union.
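Because languages are sets of strings, the union example can be reproduced with Python's built-in `set` type. A minimal sketch; the variable names are arbitrary.

```python
L = {"01", "111", "10"}
M = {"00", "01"}

# Set union: "01" is in both languages but appears once in the result.
union = L | M
print(sorted(union))
```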
  • 19. Concatenation: represented by juxtaposition (no punctuation) or middle dot ( · ). The concatenation of languages L and M is denoted LM. It contains every string wx such that w is in L and x is in M. Example: {01, 111, 10}{00, 01} = {0100, 0101, 11100, 11101, 1000, 1001}. In the example, we take 01 from the first language and concatenate it with 00 from the second language. That gives us 0100. We then take 01 from the first language again and concatenate it with 01 from the second language, which gives us 0101. Then we take 111 from the first language and concatenate it with 00 from the second language, which gives us 11100, and so on.
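The same pairing of every w in L with every x in M can be written as a set comprehension. A small sketch; `concatenate` is a name chosen here for illustration.

```python
def concatenate(L, M):
    """All strings wx with w taken from L and x taken from M."""
    return {w + x for w in L for x in M}

L = {"01", "111", "10"}
M = {"00", "01"}

LM = concatenate(L, M)
print(sorted(LM))
```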
  • 20. Kleene Star: represented by an asterisk, aka star (*). If L is a language, then L*, the Kleene star or just “star,” is the set of strings formed by concatenating zero or more strings from L, in any order. L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … Example: {0, 10}* = {ε, 0, 10, 00, 010, 100, 1010, …}. If you take no strings from L, that gives you the empty string.
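L* is infinite whenever L contains a non-empty string, so a program can only list a finite part of it. The sketch below enumerates the strings of L* up to a length bound; `star_up_to` and its `max_len` parameter are invented for this illustration.

```python
def star_up_to(L, max_len):
    """Strings of L* whose length is at most max_len."""
    result = {""}      # taking zero strings from L gives the empty string
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Append one more string from L to each newly found string.
        frontier = {w + x for w in frontier for x in L
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

L = {"0", "10"}
print(sorted(star_up_to(L, 3), key=lambda s: (len(s), s)))
```

The empty string appears in the output as `''`, playing the role of ε.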
  • 21. IMPORTANT! From now on, let's stick to the following conventions (otherwise we will be confused): Union ∪ (aka: disjunction, OR) is represented by | or +. Concatenation is represented by juxtaposition (= no punctuation) or middle dot ( · ). Kleene star is represented by *.
  • 22. Precedence of Operators. Parentheses may be used wherever needed to influence the grouping of operators. Order of precedence is * (highest), then concatenation, then + (lowest). Remember: + = union/disjunction.
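The precedence rules can be observed with Python's `re` module, which writes union as `|` (in Python, `+` means "one or more", unlike on these slides). The helper `language` and the candidate list are made up for illustration.

```python
import re

def language(pattern, candidates):
    """The candidates that the pattern matches in full."""
    return {s for s in candidates if re.fullmatch(pattern, s)}

candidates = ["", "0", "1", "00", "01", "0101", "011"]

union_last = language(r"01|0", candidates)       # groups as (01)|(0)
union_first = language(r"0(1|0)", candidates)    # parentheses force the union first
star_tight = language(r"01*", candidates)        # groups as 0(1*)
star_grouped = language(r"(01)*", candidates)    # parentheses make the star cover 01

print(union_last, union_first, star_tight, star_grouped)
```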
  • 23. Examples: REs. 1. L(01) = {01}. 2. L(01+0) = {01, 0}. 3. L(0(1+0)) = {01, 00}. Note order of precedence of operators. 4. L(0*) = {ε, 0, 00, 000, …}. 5. L((0+10)*(ε+1)) = all strings of 0s and 1s without two consecutive 1s. Explanations: 1) The regular expression 01 represents the concatenation of the language consisting of the one string 0 and the language consisting of the one string 1. The result is the language containing the one string 01. 2) The language of 01+0 is the union of the language containing only the string 01 and the language containing only the string 0. 3) The language of 0 concatenated with 1+0 consists of the two strings 01 and 00. Notice that we need parentheses to force the + to group first. Without them, since concatenation takes precedence over +, we get the interpretation in the second example. 4) The language of 0* is the star of the language containing only the string 0. This is all strings of 0s, including the empty string. 5) This example denotes the language of all strings of 0s and 1s without two consecutive 1s. To see why this works, note that in every such string, each 1 is either followed immediately by a 0 or comes at the end of the string. (0+10)* denotes all strings in which every 1 is followed by a 0. These strings are surely in the language we want. But we also want these strings followed by a final 1. Thus, we concatenate the language of (0+10)* with ε+1. This concatenation gives us all the strings where 1s are followed by 0s, plus all those strings with an additional 1 at the end.
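The claim in item 5, that (0+10)*(ε+1) denotes exactly the strings without two consecutive 1s, can be checked by brute force on short strings. In this sketch the expression is translated to Python syntax as `(0|10)*(1)?`; the length bound of 10 is an arbitrary choice, so this is a sanity check rather than a proof.

```python
import re
from itertools import product

# (0+10)*(ε+1) in the slides' notation; "(1)?" stands for (ε+1).
PATTERN = re.compile(r"(0|10)*(1)?")

# Every string over {0, 1} of length 0..10
strings = ["".join(p) for n in range(11) for p in product("01", repeat=n)]

# A counterexample would be a string where the two descriptions disagree.
counterexamples = [s for s in strings
                   if (PATTERN.fullmatch(s) is not None) != ("11" not in s)]
print(counterexamples)
```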
  • 24. Equivalence of REs and Finite Automata. For every RE, there is a finite automaton that accepts the same language. And we need to show that for every finite automaton, there is an RE defining its language.
  • 25. Summary. Automata and regular expressions define exactly the same set of languages: the regular languages.
  • 27. The Chomsky Hierarchy. Hierarchy of classes of formal languages: Regular (DFA), Context-free (PDA), Context-sensitive (LBA), Recursively enumerable (TM). One grammar is of greater generative power or complexity than another if it can define a language that the other cannot define. Context-free grammars are more powerful than regular grammars.
  • 28. Regular Languages. A language L is regular if it is the language accepted by some DFA. Note: the DFA must accept only the strings in L, no others. Some languages are not regular.
  • 29. Only languages that meet the following criteria are regular languages:
  • 30. Regular languages derive their name from the fact that the strings they recognize are (in a formal computer science sense) “regular.” This implies that there are certain kinds of strings that it will be very hard, if not impossible, to recognize with regular expressions, especially nested syntactic structures in natural language.
  • 31. Formal languages vs regular languages. A formal language is a set of strings, each string composed of symbols from a finite set called an alphabet. Ex: {a, b, !}. Formal languages are not the same as regular languages.
  • 32. But Many Languages are Regular. They appear in many contexts and have many useful properties.
  • 33. How to tell if a language is not regular. The most common way to prove that a language is regular is to build a regular expression for the language.
  • 35. Practical Activity 1. The language L contains all strings over the alphabet {a,b} that begin with a and end with b. Write a regular expression that defines the language L.
  • 37. Your Solutions. In between the concatenation of a and b there must be 0 or more unions (disjunctions) of a and b. Reference: slides 17-22.
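The slide's own formula is not in this transcript. Reading the description above as a(a|b)*b (an assumption), the expression can be compared with the plain-language definition of L on all short strings. The helper `in_L` and the length bound are invented for illustration.

```python
import re
from itertools import product

# Assumed solution: a, then any mix of a and b, then b.
PATTERN = re.compile(r"a(a|b)*b")

def in_L(s):
    """Direct statement of the language: begins with a and ends with b."""
    return s.startswith("a") and s.endswith("b")

# Every string over {a, b} of length 0..8
strings = ["".join(p) for n in range(9) for p in product("ab", repeat=n)]

mismatches = [s for s in strings
              if (PATTERN.fullmatch(s) is not None) != in_L(s)]
print(mismatches)
```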
  • 38. Practical Activity 2. Draw a deterministic finite-state automaton that accepts the language of the following regular expression: ( (ab) | c)* Alternative notation style: i.e. 0 or more occurrences of the disjunction ab | c. Test the automaton with these strings (of these, a and cbacccabababccc are not in the language and should be rejected): 0 abc a ab cccabc cbacccabababccc ….
  • 39. Practical Activity 2: Possible Correct Solution. Having the initial state as a final state gives us the empty string as an element in the language.
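The solution diagram is not included in this transcript. As a sketch of one possible DFA for ((ab)|c)*, the table below uses two invented state names: `s0` (initial and final) and `s1` (an a has been read and a b must follow). Running it on the test strings listed for the activity shows which ones it accepts.

```python
# s0 --c--> s0, s0 --a--> s1, s1 --b--> s0; anything else is rejected.
TRANSITIONS = {
    ("s0", "c"): "s0",
    ("s0", "a"): "s1",
    ("s1", "b"): "s0",
}
START, ACCEPTING, DEAD = "s0", {"s0"}, "dead"

def accepts(string):
    """Run the DFA; missing transitions fall into a non-accepting dead state."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol), DEAD)
    return state in ACCEPTING

# "" stands for the empty string.
for s in ["", "abc", "a", "ab", "cccabc", "cbacccabababccc"]:
    print(repr(s), accepts(s))
```

Because the initial state is also final, the empty string is accepted without reading any symbol.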
  • 40. Your solutions (1): when we interpret “+” as disjunction, these solutions are wrong because “c” happens only after “a” and “b”. Test these automata with the string on slide 35.
  • 41. Your solutions (2): same as previous slide. In addition, here no final states are shown. Test these automata with the string on slide 35.
  • 42. Practical Activity 3. Construct a grep regular expression that matches patterns containing at least one “ab” followed by any number of bs. Construct a grep regular expression that matches any number between 1000 and 9999.
  • 43. Practical Activity 3: Possible Solutions. grep -E '(ab)+b*' and grep -E '[1-9][0-9]{3}' (the -E option selects extended regular expressions, where + and {3} act as operators).
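The two expressions can also be tried in Python, whose `re` syntax agrees with extended grep syntax for these operators. This is a sketch with made-up test strings. Note one difference: `fullmatch` requires the whole string to match, whereas grep reports any line containing a match.

```python
import re

AB_THEN_BS = re.compile(r"(ab)+b*")        # at least one "ab", then any number of bs
FOUR_DIGIT = re.compile(r"[1-9][0-9]{3}")  # 1000 to 9999: no leading zero, four digits

for s in ["ab", "abab", "ababbbb", "b", "ba", ""]:
    print(repr(s), AB_THEN_BS.fullmatch(s) is not None)

for s in ["1000", "9999", "0999", "999", "10000"]:
    print(repr(s), FOUR_DIGIT.fullmatch(s) is not None)

# Searching instead of full matching finds "1000" inside "10000",
# which is how grep would treat a line containing 10000.
print(FOUR_DIGIT.search("10000") is not None)
```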
  • 44. Exercises: E&G (2013). Exercise (Övning) 9.40. Optional: as many as you can. After having completed the exercises, check out the solutions at the end of the book.