Chapter 4
                        Lexical analysis

CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Scanner

         • Main task: identify tokens
                 – Basic building blocks of programs
                 – E.g. keywords, identifiers, numbers, punctuation marks
         • Desk calculator language example:
                         read A
                         sum := A + 3.45e-3
                         write sum
                         write sum / 2
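
As a rough illustration of what "identify tokens" means for the desk calculator example, here is a minimal hand-written sketch in C. The names `TokenKind` and `next_token`, and the exact token categories, are assumptions made for this example and do not come from the slides.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for the desk calculator language (names are illustrative). */
enum TokenKind { T_READ, T_WRITE, T_ID, T_NUMBER, T_ASSIGN, T_OP, T_ERROR, T_EOF };

/* Scan one token starting at *src, copy its text into buf (capacity n >= 1),
   advance *src past it, and return its kind. */
enum TokenKind next_token(const char **src, char *buf, size_t n)
{
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;            /* skip whitespace */
    const char *start = p;
    enum TokenKind kind;

    if (*p == '\0') {
        kind = T_EOF;
    } else if (isalpha((unsigned char)*p)) {           /* keyword or identifier */
        while (isalnum((unsigned char)*p)) p++;
        kind = T_ID;
    } else if (isdigit((unsigned char)*p)) {           /* number such as 3.45e-3 */
        while (isdigit((unsigned char)*p)) p++;
        if (p[0] == '.' && isdigit((unsigned char)p[1])) {
            p++;
            while (isdigit((unsigned char)*p)) p++;
        }
        if (p[0] == 'e') {                             /* optional exponent */
            const char *q = p + 1;
            if (*q == '+' || *q == '-') q++;
            if (isdigit((unsigned char)*q)) {
                while (isdigit((unsigned char)*q)) q++;
                p = q;
            }
        }
        kind = T_NUMBER;
    } else if (p[0] == ':' && p[1] == '=') {
        p += 2;
        kind = T_ASSIGN;
    } else if (strchr("+-*/", *p)) {
        p++;
        kind = T_OP;
    } else {
        p++;                                           /* unknown character */
        kind = T_ERROR;
    }

    size_t len = (size_t)(p - start);
    if (len >= n) len = n - 1;                         /* truncate long tokens */
    memcpy(buf, start, len);
    buf[len] = '\0';
    if (kind == T_ID && strcmp(buf, "read") == 0) kind = T_READ;
    if (kind == T_ID && strcmp(buf, "write") == 0) kind = T_WRITE;
    *src = p;
    return kind;
}
```

Calling `next_token` repeatedly on `sum := A + 3.45e-3` would yield an identifier, the assignment symbol, another identifier, an operator, and a number, followed by `T_EOF`.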




Formal definition of tokens

         • A set of tokens is a set of strings over an alphabet
            – {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
         • A set of tokens is a regular set that can be defined by
           comprehension using a regular expression
         • For every regular set, there is a deterministic finite
           automaton (DFA) that can recognize it
            – (Aka deterministic Finite State Machine (FSM))
            – i.e. determine whether a string belongs to the set or not
            – Scanners extract tokens from source code in the same
              way DFAs determine membership

Regular Expressions
         A regular expression (RE) is:
         • A single character
         • The empty string, ε
         • The concatenation of two regular expressions
                 – Notation: RE1 RE2 (i.e. RE1 followed by RE2)
         • The union of two regular expressions
                 – Notation: RE1 | RE2
         • The closure of a regular expression
                 – Notation: RE*
                 – * is known as the Kleene star
                 – * represents the concatenation of 0 or more strings
         • Caution: notations for regular expressions vary
                 – Learn the basic concepts and the rest is just syntactic sugar
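
To experiment with concatenation, union, and closure, one option is the POSIX regular expression library. The sketch below assumes a POSIX system that provides `<regex.h>`; the helper name `re_matches` is hypothetical. POSIX extended syntax is just one of the varying notations mentioned above, and the caller anchors the pattern with `^` and `$` to test the whole string.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s matches the POSIX extended RE pattern, 0 if it does not,
   and -1 if the pattern fails to compile. */
int re_matches(const char *pattern, const char *s)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, s, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

For example, `re_matches("^ab$", "ab")` exercises concatenation, `re_matches("^(a|b)$", "b")` union, and `re_matches("^a*$", "")` closure with zero repetitions.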
Token Definition Example
     • Numeric literals in Pascal, e.g.
        1, 123, 3.1415, 10e-3, 3.14e4

     • Definition of token unsignedNum

        DIG → 0|1|2|3|4|5|6|7|8|9
        unsignedInt → DIG DIG*
        unsignedNum →
            unsignedInt
            (( . unsignedInt) | ε)
            ((e ( + | - | ε) unsignedInt) | ε)

     [Figure: finite automaton for unsignedNum, with transitions labeled DIG, ".", e, + and -]

     • Notes:
        – Recursion is not allowed!
        – Parentheses used to avoid ambiguity
        – It’s always possible to rewrite to remove epsilons
     • FAs with epsilons are nondeterministic.
     • NFAs are much harder to implement (use backtracking)
     • Every NFA can be rewritten as a DFA (gets larger, though)
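
The definition of unsignedNum can be turned into a small hand-coded recognizer. The following is a minimal sketch in C, not the slides' own code; the function names `scan_unsigned_int` and `is_unsigned_num` are hypothetical. Because it checks a whole string for membership, each optional part is either present and well formed or the string is rejected, so no backtracking is needed.

```c
#include <ctype.h>
#include <stddef.h>

/* Consume one or more digits (unsignedInt -> DIG DIG*).
   Returns a pointer just past the digits, or NULL if there is no digit. */
static const char *scan_unsigned_int(const char *p)
{
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

/* Return 1 if s is exactly one unsignedNum, else 0. */
int is_unsigned_num(const char *s)
{
    const char *p = scan_unsigned_int(s);      /* unsignedInt */
    if (p == NULL) return 0;
    if (*p == '.') {                           /* ( . unsignedInt ) | empty */
        p = scan_unsigned_int(p + 1);
        if (p == NULL) return 0;
    }
    if (*p == 'e') {                           /* ( e (+|-|empty) unsignedInt ) | empty */
        p++;
        if (*p == '+' || *p == '-') p++;
        p = scan_unsigned_int(p);
        if (p == NULL) return 0;
    }
    return *p == '\0';                         /* nothing may follow */
}
```

With this sketch, the literals listed on the slide (1, 123, 3.1415, 10e-3, 3.14e4) are accepted, while strings such as `3.`, `.5`, or `1e` are rejected because the definition requires at least one digit in each unsignedInt.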
Simple Problem
          • Write a C program which reads in a character string, consisting
            of a’s and b’s, one character at a time. If the string contains
            two consecutive a’s (aa), print "string accepted"; otherwise print
            "string rejected".
          • An abstract solution to this can be expressed as a DFA

            [Figure: DFA with three states. State 1 is the start state and state 3 is the accepting state. State 1 goes to 2 on a and loops on b; state 2 goes to 3 on a and back to 1 on b; state 3 loops on a and b.]

          • The state transitions of a DFA can be encoded as a table
            which specifies the new state for a given current state and input

                                      input
               current state       a        b
                     1             2        1
                     2             3        1
                     3             3        3
An approach in C

#include <stdio.h>

int main(void)
{
    enum State {S1, S2, S3};
    enum State currentState = S1;        /* S1 is the start state */
    int c = getchar();
    while (c != EOF) {
        switch (currentState) {
        case S1: if (c == 'a') currentState = S2;
                 else if (c == 'b') currentState = S1;
                 break;
        case S2: if (c == 'a') currentState = S3;
                 else if (c == 'b') currentState = S1;
                 break;
        case S3: break;                  /* accepting state: stay here */
        }
        c = getchar();
    }
    if (currentState == S3) printf("string accepted\n");
    else printf("string rejected\n");
    return 0;
}
Using a table simplifies the program

#include <stdio.h>

int main(void)
{
    enum State {S1, S2, S3};
    enum Label {A, B};
    enum State currentState = S1;
    /* table[state][label] gives the next state */
    enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}};
    int c;
    while ((c = getchar()) != EOF) {
        enum Label label;
        if (c == 'a') label = A;
        else if (c == 'b') label = B;
        else continue;                   /* skip other input, e.g. the newline */
        currentState = table[currentState][label];
    }
    if (currentState == S3) printf("string accepted\n");
    else printf("string rejected\n");
    return 0;
}


Lex
         • Lexical analyzer generator
            – It writes a lexical analyzer
         • Assumption
            – each token matches a regular expression
         • Needs
            – set of regular expressions
            – for each expression an action
         • Produces
            – A C program
         • Automatically handles many tricky problems
         • flex is the GNU version of the venerable Unix tool lex.
            – Produces highly optimized code

Scanner Generators
          • E.g. lex, flex
          • These programs take a table as their input and return a program
            (i.e. a scanner) that can extract tokens from a stream of characters
          • A very useful programming utility, especially when coupled with
            a parser generator (e.g., yacc)
          • Standard in Unix
Lex example

[Figure: foo.l → lex → foolex.c → cc → foolex; input → foolex → tokens]

> flex -ofoolex.c foo.l
> cc -ofoolex foolex.c -lfl

> more input
begin
 if size>10
   then size * -3.1415
end

> foolex < input
Keyword: begin
Keyword: if
Identifier: size
Operator: >
Integer: 10 (10)
Keyword: then
Identifier: size
Operator: *
Operator: -
Float: 3.1415 (3.1415)
Keyword: end
A Lex Program

General form:

… definitions …
%%
… rules …
%%
… subroutines …

Example:

DIG [0-9]
ID [a-z][a-z0-9]*
%%
{DIG}+            printf("Integer\n");
{DIG}+"."{DIG}*   printf("Float\n");
{ID}              printf("Identifier\n");
[ \t\n]+          /* skip whitespace */
.                 printf("Huh?\n");
%%
int main(void) { yylex(); return 0; }




Simplest Example


%%
.|\n    ECHO;
%%

int main(void)
{
    yylex();
    return 0;
}




Strings containing aa

%%
(a|b)*aa(a|b)*    {printf("Accept %s\n", yytext);}

[ab]+             {printf("Reject %s\n", yytext);}

.|\n              ECHO;
%%
int main(void) { yylex(); return 0; }




Rules
              • Each rule has a pattern and an action.
              • Patterns are regular expressions
              • Only one action is performed
                 – The action corresponding to the pattern matched
                   is performed.
                 – If several patterns match the input, the one
                   corresponding to the longest sequence is chosen.
                 – Among the rules whose patterns match the same
                   number of characters, the rule given first is
                   preferred.
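
The two disambiguation rules (longest match, then earliest rule) can be sketched in plain C. This is only an illustration under stated assumptions, not how lex is implemented: lex compiles all patterns into a single automaton, whereas this sketch tries each pattern separately. The three matchers and the name `choose_rule` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each matcher returns the length of the longest prefix of s that its
   pattern matches, or 0 if the pattern does not match at all. */
static size_t match_keyword(const char *s)      /* if|then|begin|end */
{
    static const char *kw[] = {"if", "then", "begin", "end"};
    size_t best = 0;
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++) {
        size_t n = strlen(kw[i]);
        if (strncmp(s, kw[i], n) == 0 && n > best) best = n;
    }
    return best;
}

static size_t match_identifier(const char *s)   /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0])) return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}

static size_t match_integer(const char *s)      /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Rules in the order they would be listed in the lex file. */
static size_t (*const rules[])(const char *) = {
    match_keyword, match_identifier, match_integer
};

/* Return the index of the winning rule (or -1 if none matches) and store
   the length of its match in *len. */
int choose_rule(const char *s, size_t *len)
{
    int winner = -1;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](s);
        if (n > *len) {          /* strictly longer only: earlier rules win ties */
            *len = n;
            winner = (int)i;
        }
    }
    return winner;
}
```

On the input `if`, both the keyword and identifier patterns match two characters, so the keyword rule (listed first) wins. On `ifx`, the identifier pattern matches three characters and wins by the longest-match rule.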

Flex’s RE syntax

      x           character 'x'
      .           any character except newline
      [xyz]       character class, in this case, matches either an 'x', a 'y', or a 'z'
      [abj-oZ]    character class with a range in it; matches 'a', 'b', any letter
                  from 'j' through 'o', or 'Z'
      [^A-Z]      negated character class, i.e., any character but those in the
                  class, e.g. any character except an uppercase letter.
      [^A-Z\n]    any character EXCEPT an uppercase letter or a newline
      r*          zero or more r's, where r is any regular expression
      r+          one or more r's
      r?          zero or one r's (i.e., an optional r)
      {name}      expansion of the "name" definition (see above)
      "[xy]\"foo" the literal string: '[xy]"foo' (note escaped ")
      \x          if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C
                  interpretation of \x. Otherwise, a literal 'x' (i.e., an escape)
      rs          RE r followed by RE s (i.e., concatenation)
      r|s         either an r or an s
      <<EOF>>     end-of-file
/* scanner for a toy Pascal-like language */
%{
#include <stdlib.h> /* needed for calls to atoi() and atof() */
%}
DIG [0-9]
ID [a-z][a-z0-9]*
%%
{DIG}+              printf("Integer: %s (%d)\n", yytext, atoi(yytext));

{DIG}+"."{DIG}*     printf("Float: %s (%g)\n", yytext, atof(yytext));

if|then|begin|end   printf("Keyword: %s\n", yytext);

{ID}                printf("Identifier: %s\n", yytext);

"+"|"-"|"*"|"/"     printf("Operator: %s\n", yytext);

"{"[^}\n]*"}"       /* skip one-line comments */
[ \t\n]+            /* skip whitespace */
Editor's notes

  1. Makes parsing much easier (the parser looks for relations between tokens). Other tasks: remove comments.
  2. Automatic generation of scanners