SlideShare a Scribd company logo
1 of 45
November 4, 2014
The Quest for the OneTrue Parser
Terence Parr
The ANTLR guy
University of San Francisco
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/parsing-history
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Am I behind the times?
Buzzword Compliance
✤ What’s a PEG and do I need one? Am I a packrat?
✤ Should I care about context-sensitive parsing?
✤ Do we still need the distinction between the tokenizer and the parser?
✤ Parser Combinators do what, exactly?
✤ Should I be using Generalized-LR (GLR)?
✤ Can I parse tree data structures not just token streams?
✤ How is ANTLR 4’s ALL(*) like the Honey Badger?
Parr-t Like It’s 1989
✤ 25 years ago, LALR/yacc/bison
reigned supreme in tools
✤ In ~1989 you either used yacc or
you built parsers by hand
✤ I didn’t grok yacc with its state
machines and shift/reduce
conflicts
✤ I set out to create a parser generator that generated
what I wrote by hand: recursive-descent parsers
✤ The quest eventually led to some useful innovations
LL(1)
LALR(1)
The Players
✤ Warring religious factions
✤ LL-based, “top-down,” recursive-descent,
“hand built”, LL(1)
✤ LR-based, “bottom-up,” yacc, LALR(1)
✤ Two other camps; researchers working on:
✤ Increasing efficiency of general algorithms
Earley, GLL, GLR, Elkhound, …
✤ Increasing power of top-down parsers
LL(k), predicates, PEG, LL(*), GLL, ALL(*)
Some “Lex Education”
Grammars, parsers, and trees oh my!
Parser Information Flow
✤ The parser feeds off of tokens from the lexer, which feeds off of a char
stream, to check syntax (language membership)
✤ We often want to build a parse tree that records how input matched
stat
assign
expr
100
sp ;=
chars
LEXER
tokens
PARSERsp = 100 ;sp = 100;
parse tree
Language recognizer
Grammar Meta-languages (DSLs)
✤ A grammar is a set of rules that describe a language
✤ A language is just a set of valid sentences
✤ Each rule has a name and set of one or more alternative productions
✤ Most tools use a DSL that looks like this:
stat: ‘if’ expr ‘then’ stat (‘else’ stat)?
| ID ‘=’ expr
| ‘return’ expr
;
expr: expr ‘*’ expr
| expr ‘+’ expr
| ID
| INT
;
Grammar Conditions
✤ Left-recursive grammars have rules that reference rules already in flight; LR
loves this, LL hates this!
✤ Recursive-descent parsers get infinite recursive loops
✤ Ambiguous grammars can match an input phrase in more than one way. In C,
i*j could be an expression or declaration like FILE *f.
✤ GLR was designed for ambiguous grammars; “Police help dog bite victim.”
LL, LR resolve ambiguities at parse-time, picking one path
expr : expr ‘*’ expr
| INT
;
Recursive-descent functions
Top-down, LL(k) for k ≥ 1
assign
: ID ‘=’ expr ‘;’
;
void assign() {
match(ID);
match(‘=’);
expr();
match(‘;’);
}
void expr() {
switch ( curtoken.getType() ) {
case INT :
match(INT);
break;
case STRING :
match(STRING);
break;
default : error;
}
}
expr
: INT | STRING
;
Bottom-up, LR(k)
✤ Yacc is LR-based: LALR(1)
✤ LR recognizers are bottom-up recognizers; they match from leaves of
parse tree towards starting rule at the root whereas LL starts at the
root (top-down) and is goal oriented.
✤ LR consumes tokens until it recognizes an alternative, one that will
ultimately lead to a successful parse of the input. At input int x;
parser reduces the input to a decl ; and then reduces to stat
stat : decl ‘;’ ;
decl : ‘int’ ID
| ‘int’ ID ‘=’ expr
;
Should I be using GLR?
Generalized LR (GLR)
✤ Accepts all grammars since designed to handle ambiguous languages
✤ “Forks” subparsers to pursue all possible paths emanating from LR
states with conflicts
✤ Merges all successful parses into parse “forest”
stat : decl ‘;’
| decl ‘=’ 0 ‘;’
decl : ‘int’ ID
| ‘int’ ID ‘=’ expr
;
Matches int x = 0; via
alternative 1 or 2 of decl
int x = 0; ➞ decl = 0 ; ➞ stat
int x = 0; ➞ decl ; ➞ statOr,
Issues with GLR
✤ Must disambiguate parse forests even for if-then-else ambiguity,
requiring extra time, space, machinery
✤ GLR performance is unpredictable in time and space
✤ Grammar actions don't mix with parser speculation/ambiguities
✤ Semantic predicates less useful w/o side-effecting user actions
✤ Without side effects, actions must buffer data for all interpretations in
immutable data structures or provide undo actions
What’s a PEG and do I need one?
Am I a packrat?
Parser Expression Grammars
(PEGs)
✤ PEGs are grammars based upon LL with explicitly-ordered
alternatives
✤ Attempt alternatives in order specified; first alternative to match,
wins
✤ Unambiguous by definition and PEGs accept all non-left recursive
grammars; T(i) in C++ matches as decl not function call in stat
✤ Packrat parsers record partial results to avoid reparsing subphrases
stat : decl / expr ;
decl : ‘int’ ID
/ ‘int’ ID ‘=’ expr
;
Square PEG, round hole?
✤ PEGs might not do what you want; 2nd alternative of decl never
matches!
✤ PEGs are not great at error reporting and recovery; errors detected at the
end of file
✤ Can’t execute arbitrary actions during the parse; always speculating
✤ Without of side-effecting actions, predicates aren’t as useful
✤ Hard to debug nested backtracking
decl : ‘int’ ID
/ ‘int’ ID ‘=’ expr
;
<== Yes, I’m deadcode
Parser Combinators do what,
exactly?
Higher-order functions as
building blocks
✤ Use programming language itself rather than separate grammar DSL,
avoiding a build step to generate code from grammar
✤ Alternation b|c|d becomes Parsers.or(b, c, d)
✤ Sequence bcd becomes Parsers.sequence(b,c,d)
✤ Has higher-order rules; can
pass rules to rules
✤ Essentially equivalent to an inline PEG or packrat parser, with same
issues
✤ ANTLR also has an interpreter, btw, to avoid build step
list[el] : <el> (‘,’ <el>)* ;
Do we still need the distinction
between the tokenizer and the parser?
Scannerless Parsing
GLR and PEGs are typically scannerless
✤ Tokenizing is natural; we do it. “Humuhumunukunukuapua'a have a diamond-
shaped body with armor-like scales.”
✤ Tokenizing is efficient and processing tokens is convenient
✤ But... scannerless parsing supports mixed languages like C+SQL:
✤ That’s pretty cool and supports modular grammars since we can combine
grammar pieces w/o fear that combined input won’t tokenize properly
✤ Easy to fake if parser is strong enough: just make each char a token!
int next = select ID from users where name='Raj'+1;
int from = 1, select = 2;
int x = select * from;
Scannerless Grammars are Quirky
✤ Must test for white space explicitly and frequently
✤ Distinguishing between keywords and identifiers is messy;
e.g., int or int[ versus interest or int9
prog: ws? (var|func)+ EOF ;
plus: '+' ws? ;
kint: {notLetterOrDigit(4)}? 'i' 'n' 't' ws? ;
id : letter+ {!keywords.contains($text)}? ws? ;
Should I care about context-
sensitive parsing?
Predicated Parsing
✤ Context-sensitive rules are viable per a runtime test called a semantic
predicate; the expression language depends on the tool
✤ Disambiguating a(i) and f(x) in Fortran requires symbol table
information about a,f
✤ Or, can build a “parse forest” and disambiguate after the parse, but
that can be inefficient in time and space
expr: array
| call
;
array : {isArray(token)}? ID ‘(‘ expr ‘)’ ;
call : {isFunc(token)}? ID ‘(‘ expr ‘)’ ;
Can I parse data structures like trees?
(Are youTRIE-curious?)
WALKER
Rest of
Application
APIs
stat
assign
expr
100
sp ;=
visitTerminal(TerminalNode)
enterStat(StatContext)
exitStat(StatContext)
enterAssign(AssignContext)
exitAssign(AssignContext)
enterExpr(ExprContext)
exitExpr(ExprContext)
visitTerminal(TerminalNode)
visitTerminal(TerminalNode)
visitTerminal(TerminalNode)
StatContext
AssignContext
ExprContext
100
TerminalNode
sp
TerminalNode
;
TerminalNode
=
TerminalNode
visitX()
MyVisitor
visitTerminal(TerminalNode)
visitStat(StatContext)
visitAssign(AssignContext)
visitExpr(ExprContext)
Rest of
Application
APIs
Yes, But First...
Imperative processing of parse trees
Visitor
Listener
ANTLR XPath, pattern matching
Declarative+imperative
//	
  “Find	
  all	
  initialized	
  int	
  local	
  variables	
  (Java)”
ParserRuleContext	
  tree	
  =	
  parser.compilationUnit();	
  //	
  parse
String	
  xpath	
  =	
  "//blockStatement/*";	
  //	
  get	
  children	
  of	
  blockStatement
String	
  treePattern	
  =	
  "int	
  <Identifier>	
  =	
  <expression>;";
ParseTreePattern	
  p	
  =
	
  	
  	
  	
  parser.compileParseTreePattern(treePattern,	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  ExprParser.RULE_localVariableDeclarationStatement);
List<ParseTreeMatch>	
  matches	
  =	
  p.findAll(tree,	
  xpath);
matches.get(0).get(“expression”);	
  //	
  get	
  1st	
  init	
  expr	
  subtree
System.out.println(matches);
/prog/func/'def' Find all def literal kids of func kid of prog
Find all funcs under prog at root/prog/func
Tree Grammars
✤ Specify structure of
tree declaratively with
grammar DSL;
translated to parser
✤ SORCERER in 1994
then into ANTLR
directly
✤ ANTLR 3 could
rewrite trees, invoke
StringTemplates
expr returns [int value]
: ^('+' a=expr b=expr) {$value = a+b;}
| ^('-' a=expr b=expr) {$value = a-b;}
| ^('*' a=expr b=expr) {$value = a*b;}
| INT
{$value = Integer.parseInt($INT.text);}
;
expr
| ^('+' expr expr)
| ^('*' expr expr)
| INT
;
OMeta
✤ “A programming language whose control structure is based on PEGs”
with pattern matching like Haskell/LISP
✤ Can process streams of arbitrary datatypes and nested lists (trees)
✤ Higher-order rules; can pass rule arg to another rule (coolness alert)
✤ Actions written in host language not OMeta itself (coder must watch
out for action side-effects during speculation/backtracking)
✤ Other pattern matching, transformation DSLs: TXL, Stratego/XT,
TOM (Java superset), Rascal Metaprogramming Language, ...
Well, should I use tree grammars?
✤ Tree grammars caused lots of code dup; copy grammar for each tree
pass, just with different actions
✤ Change to tree structure requires change to each tree grammar
✤ Actions cause trouble for grammar inheritance; fragile base-class
problem
✤ Ended up being a maintenance hassle so ANTLR 4 dropped them in
favor of listeners/visitors/xpath/tree-patterns
imperative+declarative style
It takes any grammar you give it. “It just doesn’t give a shit.”
Yes, even left-recursive rules! (except indirect left-recursive rules)
ANTLR 4’s ALL(*) Parser
The Nasty-Ass Honey Badger
What is ALL(*)?
✤ ALL(*) combines simplicity, efficiency, and predictability of top-down
LL(k) parsers with power of GLR-like mechanism to make decisions
✤ Key innovation: shift grammar analysis to parse-time, caching results
✤ Launch subparsers that scan ahead to determine which alternative
(path) will ultimately lead to successful parse; warms up like a JIT
✤ All but one subparser die off, yielding unique prediction
(unless ambiguous phrase or syntax error)
✤ Analogy: ticket_line : PEOPLE+ BORAT
| PEOPLE+ THE_BODY_GUARD
;
void stat() { // parse according to rule stat
switch ( adaptivePredict("stat", call stack) ) {
case 1 : // predict production 1
expr(); match(’=’); expr(); match(’;’);
break;
case 2 : // predict production 2
expr(); match(’;’);
break;
}
}
// ALL(*) but non-LL(k), non-LL(*) grammar
stat: expr ’=’ expr ’;’
| expr ’;’
;
ATN
Recursive-descent Parser
Adaptive LL(*):ALL(*)
Incremental DFA construction
Multiple threads
running >1 parser
instance build the
DFA faster
Parsing 132M, 12920 Java files
Summary
✤ You want the most general parser you can get, as long as you don’t pay for it
with poor performance; ALL(*) is the sweet spot
✤ GLR is too unpredictable in performance; general parsers like GLL/GLR that
support all input interpretations will never compete with more deterministic
strategies
✤ Semantic predicates are rarely needed but a godsend when you need them!
✤ Declarative tree matching is good, but just for searching; tree grammars with
embedded actions, not so good. I like mixed declarative+imperative style
✤ Scannerless grammars are cool as hell; useful for modular grammars and mixed
languages but less convenient to build grammars and process artifacts
✤ PEGs are great for smaller languages; at a disadvantage for action execution,
semantic predicates, error handling. OMeta has same issues
Extras in case...
Parsing InnovationTimeline
(Not exhaustive, obviously)
1985 GLR
1992 GLR fixed to be completely general (looped on cyclic grammars)
1993 Linear approximate lookahead makes LL(k) more attractive
1994 Semantic and syntactic predicates for LL(k)
1997 SGLR scannerless parsing
2002 Packrat parsing
2004 Elkhound: novel combination of GLR and LR parsing
2004 PEGs (formalized speculative, ordered-alternative grammars)
2010 GLL (No parser generator available; Rascal uses GLL variant)
2011 LL(*) (2007 tool released)
2014 ALL(*) (2012 tool released)
Frustration LedTo...
decl: ‘int’ ID ‘;’
| ‘int’ ID ‘=’ expr ‘;’
decl: modifier* type ID ‘;’
| modifier* type ID ‘=’ expr ‘;’
stat: expr ’=’ expr ’;’
| expr ’;’
expr: ID ‘(‘ expr ‘)’ // array index
| ID ‘(‘ expr ‘)’ // func call
First this pissed me off:
Led to LL(k) for k>1
Then this:
semantic predicates
Syntactic predicates
LL(*)
stat: decl | expr
ALL(*)
Intellij Plug-in
Answers how was this input tokenized?
Visualizes parse tree live via parser interpreter
How was that phrase recognized?
Critical feature for large input/grammars
Identify ambiguities,
lookahead depth to optimize
AST vs Parse-Tree
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/parsing-
history

More Related Content

More from C4Media

Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 

More from C4Media (20)

Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 

Recently uploaded

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

The Quest for the One True Parser

  • 1. November 4, 2014 The Quest for the OneTrue Parser Terence Parr The ANTLR guy University of San Francisco
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /parsing-history
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Am I behind the times? Buzzword Compliance ✤ What’s a PEG and do I need one? Am I a packrat? ✤ Should I care about context-sensitive parsing? ✤ Do we still need the distinction between the tokenizer and the parser? ✤ Parser Combinators do what, exactly? ✤ Should I be using Generalized-LR (GLR)? ✤ Can I parse tree data structures not just token streams? ✤ How is ANTLR 4’s ALL(*) like the Honey Badger?
  • 5. Parr-t Like It’s 1989 ✤ 25 years ago, LALR/yacc/bison reigned supreme in tools ✤ In ~1989 you either used yacc or you built parsers by hand ✤ I didn’t grok yacc with its state machines and shift/reduce conflicts ✤ I set out to create a parser generator that generated what I wrote by hand: recursive-descent parsers ✤ The quest eventually led to some useful innovations LL(1) LALR(1)
  • 6. The Players ✤ Warring religious factions ✤ LL-based, “top-down,” recursive-descent, “hand built”, LL(1) ✤ LR-based, “bottom-up,” yacc, LALR(1) ✤ Two other camps; researchers working on: ✤ Increasing efficiency of general algorithms Earley, GLL, GLR, Elkhound, … ✤ Increasing power of top-down parsers LL(k), predicates, PEG, LL(*), GLL, ALL(*)
  • 7. Some “Lex Education” Grammars, parsers, and trees oh my!
  • 8. Parser Information Flow ✤ The parser feeds off of tokens from the lexer, which feeds off of a char stream, to check syntax (language membership) ✤ We often want to build a parse tree that records how input matched stat assign expr 100 sp ;= chars LEXER tokens PARSERsp = 100 ;sp = 100; parse tree Language recognizer
  • 9. Grammar Meta-languages (DSLs) ✤ A grammar is a set of rules that describe a language ✤ A language is just a set of valid sentences ✤ Each rule has a name and set of one or more alternative productions ✤ Most tools use a DSL that looks like this: stat: ‘if’ expr ‘then’ stat (‘else’ stat)? | ID ‘=’ expr | ‘return’ expr ; expr: expr ‘*’ expr | expr ‘+’ expr | ID | INT ;
  • 10. Grammar Conditions ✤ Left-recursive grammars have rules that reference rules already in flight; LR loves this, LL hates this! ✤ Recursive-descent parsers get infinite recursive loops ✤ Ambiguous grammars can match an input phrase in more than one way. In C, i*j could be an expression or declaration like FILE *f. ✤ GLR was designed for ambiguous grammars; “Police help dog bite victim.” LL, LR resolve ambiguities at parse-time, picking one path expr : expr ‘*’ expr | INT ;
  • 11. Recursive-descent functions Top-down, LL(k) for k ≥ 1 assign : ID ‘=’ expr ‘;’ ; void assign() { match(ID); match(‘=’); expr(); match(‘;’); } void expr() { switch ( curtoken.getType() ) { case INT : match(INT); break; case STRING : match(STRING); break; default : error; } } expr : INT | STRING ;
  • 12. Bottom-up, LR(k) ✤ Yacc is LR-based: LALR(1) ✤ LR recognizers are bottom-up recognizers; they match from leaves of parse tree towards starting rule at the root whereas LL starts at the root (top-down) and is goal oriented. ✤ LR consumes tokens until it recognizes an alternative, one that will ultimately lead to a successful parse of the input. At input int x; parser reduces the input to a decl ; and then reduces to stat stat : decl ‘;’ ; decl : ‘int’ ID | ‘int’ ID ‘=’ expr ;
  • 13. Should I be using GLR?
  • 14. Generalized LR (GLR) ✤ Accepts all grammars since designed to handle ambiguous languages ✤ “Forks” subparsers to pursue all possible paths emanating from LR states with conflicts ✤ Merges all successful parses into parse “forest” stat : decl ‘;’ | decl ‘=’ 0 ‘;’ decl : ‘int’ ID | ‘int’ ID ‘=’ expr ; Matches int x = 0; via alternative 1 or 2 of decl int x = 0; ➞ decl = 0 ; ➞ stat int x = 0; ➞ decl ; ➞ statOr,
  • 15. Issues with GLR ✤ Must disambiguate parse forests even for if-then-else ambiguity, requiring extra time, space, machinery ✤ GLR performance is unpredictable in time and space ✤ Grammar actions don't mix with parser speculation/ambiguities ✤ Semantic predicates less useful w/o side-effecting user actions ✤ Without side effects, actions must buffer data for all interpretations in immutable data structures or provide undo actions
  • 16. What’s a PEG and do I need one? Am I a packrat?
  • 17. Parser Expression Grammars (PEGs) ✤ PEGs are grammars based upon LL with explicitly-ordered alternatives ✤ Attempt alternatives in order specified; first alternative to match, wins ✤ Unambiguous by definition and PEGs accept all non-left recursive grammars; T(i) in C++ matches as decl not function call in stat ✤ Packrat parsers record partial results to avoid reparsing subphrases stat : decl / expr ; decl : ‘int’ ID / ‘int’ ID ‘=’ expr ;
  • 18. Square PEG, round hole? ✤ PEGs might not do what you want; 2nd alternative of decl never matches! ✤ PEGs are not great at error reporting and recovery; errors detected at the end of file ✤ Can’t execute arbitrary actions during the parse; always speculating ✤ Without of side-effecting actions, predicates aren’t as useful ✤ Hard to debug nested backtracking decl : ‘int’ ID / ‘int’ ID ‘=’ expr ; <== Yes, I’m deadcode
  • 19. Parser Combinators do what, exactly?
  • 20. Higher-order functions as building blocks ✤ Use programming language itself rather than separate grammar DSL, avoiding a build step to generate code from grammar ✤ Alternation b|c|d becomes Parsers.or(b, c, d) ✤ Sequence bcd becomes Parsers.sequence(b,c,d) ✤ Has higher-order rules; can pass rules to rules ✤ Essentially equivalent to an inline PEG or packrat parser, with same issues ✤ ANTLR also has an interpreter, btw, to avoid build step list[el] : <el> (‘,’ <el>)* ;
  • 21. Do we still need the distinction between the tokenizer and the parser?
  • 22. Scannerless Parsing GLR and PEGs are typically scannerless ✤ Tokenizing is natural; we do it. “Humuhumunukunukuapua'a have a diamond- shaped body with armor-like scales.” ✤ Tokenizing is efficient and processing tokens is convenient ✤ But... scannerless parsing supports mixed languages like C+SQL: ✤ That’s pretty cool and supports modular grammars since we can combine grammar pieces w/o fear that combined input won’t tokenize properly ✤ Easy to fake if parser is strong enough: just make each char a token! int next = select ID from users where name='Raj'+1; int from = 1, select = 2; int x = select * from;
  • 23. Scannerless Grammars are Quirky ✤ Must test for white space explicitly and frequently ✤ Distinguishing between keywords and identifiers is messy; e.g., int or int[ versus interest or int9 prog: ws? (var|func)+ EOF ; plus: '+' ws? ; kint: {notLetterOrDigit(4)}? 'i' 'n' 't' ws? ; id : letter+ {!keywords.contains($text)}? ws? ;
  • 24. Should I care about context- sensitive parsing?
  • 25. Predicated Parsing ✤ Context-sensitive rules are viable per a runtime test called a semantic predicate; the expression language depends on the tool ✤ Disambiguating a(i) and f(x) in Fortran requires symbol table information about a,f ✤ Or, can build a “parse forest” and disambiguate after the parse, but that can be inefficient in time and space expr: array | call ; array : {isArray(token)}? ID ‘(‘ expr ‘)’ ; call : {isFunc(token)}? ID ‘(‘ expr ‘)’ ;
  • 26. Can I parse data structures like trees? (Are youTRIE-curious?)
  • 28. ANTLR XPath, pattern matching Declarative+imperative //  “Find  all  initialized  int  local  variables  (Java)” ParserRuleContext  tree  =  parser.compilationUnit();  //  parse String  xpath  =  "//blockStatement/*";  //  get  children  of  blockStatement String  treePattern  =  "int  <Identifier>  =  <expression>;"; ParseTreePattern  p  =        parser.compileParseTreePattern(treePattern,                      ExprParser.RULE_localVariableDeclarationStatement); List<ParseTreeMatch>  matches  =  p.findAll(tree,  xpath); matches.get(0).get(“expression”);  //  get  1st  init  expr  subtree System.out.println(matches); /prog/func/'def' Find all def literal kids of func kid of prog Find all funcs under prog at root/prog/func
  • 29. Tree Grammars ✤ Specify structure of tree declaratively with grammar DSL; translated to parser ✤ SORCERER in 1994 then into ANTLR directly ✤ ANTLR 3 could rewrite trees, invoke StringTemplates expr returns [int value] : ^('+' a=expr b=expr) {$value = a+b;} | ^('-' a=expr b=expr) {$value = a-b;} | ^('*' a=expr b=expr) {$value = a*b;} | INT {$value = Integer.parseInt($INT.text);} ; expr | ^('+' expr expr) | ^('*' expr expr) | INT ;
  • 30. OMeta ✤ “A programming language whose control structure is based on PEGs” with pattern matching like Haskell/LISP ✤ Can process streams of arbitrary datatypes and nested lists (trees) ✤ Higher-order rules; can pass rule arg to another rule (coolness alert) ✤ Actions written in host language not OMeta itself (coder must watch out for action side-effects during speculation/backtracking) ✤ Other pattern matching, transformation DSLs: TXL, Stratego/XT, TOM (Java superset), Rascal Metaprogramming Language, ...
  • 31. Well, should I use tree grammars? ✤ Tree grammars caused lots of code dup; copy grammar for each tree pass, just with different actions ✤ Change to tree structure requires change to each tree grammar ✤ Actions cause trouble for grammar inheritance; fragile base-class problem ✤ Ended up being a maintenance hassle so ANTLR 4 dropped them in favor of listeners/visitors/xpath/tree-patterns imperative+declarative style
  • 32. It takes any grammar you give it. “It just doesn’t give a shit.” Yes, even left-recursive rules! (except indirect left-recursive rules) ANTLR 4’s ALL(*) Parser The Nasty-Ass Honey Badger
  • 33. What is ALL(*)? ✤ ALL(*) combines simplicity, efficiency, and predictability of top-down LL(k) parsers with power of GLR-like mechanism to make decisions ✤ Key innovation: shift grammar analysis to parse-time, caching results ✤ Launch subparsers that scan ahead to determine which alternative (path) will ultimately lead to successful parse; warms up like a JIT ✤ All but one subparser die off, yielding unique prediction (unless ambiguous phrase or syntax error) ✤ Analogy: ticket_line : PEOPLE+ BORAT | PEOPLE+ THE_BODY_GUARD ;
  • 34. void stat() { // parse according to rule stat switch ( adaptivePredict("stat", call stack) ) { case 1 : // predict production 1 expr(); match(’=’); expr(); match(’;’); break; case 2 : // predict production 2 expr(); match(’;’); break; } } // ALL(*) but non-LL(k), non-LL(*) grammar stat: expr ’=’ expr ’;’ | expr ’;’ ; ATN Recursive-descent Parser Adaptive LL(*):ALL(*)
  • 35. Incremental DFA construction Multiple threads running >1 parser instance build the DFA faster
  • 36. Parsing 132M, 12920 Java files
  • 37. Summary ✤ You want the most general parser you can get, as long as you don’t pay for it with poor performance; ALL(*) is the sweet spot ✤ GLR is too unpredictable in performance; general parsers like GLL/GLR that support all input interpretations will never compete with more deterministic strategies ✤ Semantic predicates are rarely needed but a godsend when you need them! ✤ Declarative tree matching is good, but just for searching; tree grammars with embedded actions, not so good. I like mixed declarative+imperative style ✤ Scannerless grammars are cool as hell; useful for modular grammars and mixed languages but less convenient to build grammars and process artifacts ✤ PEGs are great for smaller languages; at a disadvantage for action execution, semantic predicates, error handling. OMeta has same issues
  • 39. Parsing InnovationTimeline (Not exhaustive, obviously) 1985 GLR 1992 GLR fixed to be completely general (looped on cyclic grammars) 1993 Linear approximate lookahead makes LL(k) more attractive 1994 Semantic and syntactic predicates for LL(k) 1997 SGLR scannerless parsing 2002 Packrat parsing 2004 Elkhound: novel combination of GLR and LR parsing 2004 PEGs (formalized speculative, ordered-alternative grammars) 2010 GLL (No parser generator available; Rascal uses GLL variant) 2011 LL(*) (2007 tool released) 2014 ALL(*) (2012 tool released)
  • 40. Frustration LedTo... decl: ‘int’ ID ‘;’ | ‘int’ ID ‘=’ expr ‘;’ decl: modifier* type ID ‘;’ | modifier* type ID ‘=’ expr ‘;’ stat: expr ’=’ expr ’;’ | expr ’;’ expr: ID ‘(‘ expr ‘)’ // array index | ID ‘(‘ expr ‘)’ // func call First this pissed me off: Led to LL(k) for k>1 Then this: semantic predicates Syntactic predicates LL(*) stat: decl | expr ALL(*)
  • 41. Intellij Plug-in Answers how was this input tokenized? Visualizes parse tree live via parser interpreter
  • 42. How was that phrase recognized? Critical feature for large input/grammars
  • 45. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/parsing- history