Extensible domain-specific programming for the sciences
The notion of scientists as programmers begs the question of what sort of programming language would be a good fit. The common answer seems to be both none of them and all of them. Many scientific applications are a combination of general-purpose and domain-specific languages: R for statistical elements, MATLAB for matrix-based computations, Perl-based regular expressions for string matching, C or FORTRAN for high performance parallel computations, and scripting languages such as Python to glue them all together. This clumsy situation demonstrates the need for different domain-specific language features.
Our hypothesis is that programming could be made easier, less error-prone and result in higher-quality code if languages could be easily extended, by the programmer, with the domain-specific features that a programmer or scientists needs for their particular task at hand. This talk demonstrates the meta-language processing tools that support this composition of programmer-selected language features, with several extensions chosen from the previously mentioned list of features.
talk at Virginia Bioinformatics Institute, December 5, 2013
1. Extensible domain-specific programming
for the sciences
Eric Van Wyk
University of Minnesota
VBI, December 5, 2013
slides available at http:www.cs.umn.edu/~evw
1 / 45
2. Current trends / topics in PL
Formal verification
CompCert - http://compcert.inria.fr/
Astr´e - http://www.astree.ens.fr/
e
Hoare logic (1960’s)
{P} code {Q}
Proof assistants: Coq, Abella, Isabelle, ...
use required in some PL publishing venues
2 / 45
5. Current trends / topics in PL
Parallel programming - multiple cores, everywhere.
“no more free lunch”
need
new abstractions: e.g. Cilk, MapReduce, FP
new semantics: e.g. deterministic parallel Java
5 / 45
6. Current trends / topics in PL
Expressive and safe static typing
extending richer static types, e.g.
append ::
( [a], [a] ) -> [a]
to dependent types
append ::
( [a|n], [a|m] ) -> [a|n+m]
turns array out-of-bounds and null-pointer bugs into
static type errors
6 / 45
7. Extensible languages
Allow programmers select the features to be used in their
programming languages.
new syntax / notations
new semantic analyses / error-checking
Why would anyone want to do that?
7 / 45
8. Programming language features
General purpose features
assignment statements, loops, if-then-else statements
functions (perhaps higher-order) and procedures
I/O facilities
modules
data: integer, strings, arrays, records
Domain-specific features
matrix operations (MATLAB)
regular expression matching (Perl, Python)
statistics functions (R)
computational geometry operations (LN)
parallel computing (SISAL, X10, NESL, etc.)
Many similarities, needless differences.
Working with multiple (domain-specific) languages is a
headache.
8 / 45
9. Extensible languages
Allow programmers select the features to be used in their
programming languages.
new syntax / notations
new semantic analyses / error-checking
Pick a general purpose host language (e.g. ANSI C),
extend with domain-specific features.
myProgram.xc =⇒ myProgram.c
9 / 45
10. Regular expressions
# include " stdio . h "
# include " regex . h "
int main ( int argc , char * argv []) {
char * text = readFileContents ( " X . data " ) ;
// eukaryotic messenger RNA sequences
regex foo = /^ ATG [ ATGC ]{3 ,10} A {5 ,10} $ / ;
if ( text =~ foo )
printf ( " Matches ... n " ) ;
else
printf ( " Doesn ’t match ... n " ) ;
}
10 / 45
11. Mining Climate Data - Ocean Eddies
Spinning pools of water
Transport heat, salt, and
nutrients
Learning about their
behavior is difficult
11 / 45
13. main ( int argc , char ** argv ) {
Matrix float <3 > data
= readMatrix ( " ssh . data " ) ;
Matrix float <3 > scores
= matrixMap ( scoreTS , data , [2]) ;
writeMatrix ( " temporalScores . data " ,
scores ) ;
}
13 / 45
14. Matrix float <1 > scoreTS ( Matrix float <1 > ts )
{
int i = 0 , beginning , n = dimSize ( ts , 0) ;
Matrix float <1 > scores
= init ( Matrix float <1 > , dimSize ( ts , 0) ) ;
while ( ts [ i ] < ts [ i +1]) { i = i +1 ; }
Matrix float [0] trough ;
while ( i < n -1) {
( trough , beginning , i )
= getTrough ( ts , i ) ;
scores [ beginning :: i ]
= computeArea ( trough ) ;
}
return scores ;
}
14 / 45
15. Matrix float <1 > computeArea
( Matrix float <1 > areaOfInterest )
{
float y1 = areaOfInterest [0];
float y2 = areaOfInterest [ end ];
int x1 = 0;
int x2 = dimSize ( areaOfInterest ,0) -1;
float m = ( y1 - y2 ) / (( float ) ( x1 - x2 ) ) ;
float b = y1 - m * x1 ;
Matrix float <1 > Line = ( x1 :: x2 ) * m + b ;
float area
= with ( x1 <= i < x2 )
fold (+ , 0.0 , line - areaOfInterest ) ;
return
with ( 0 <= i < dimSize ( Line ,0) )
genarray ([ dimSize ( Line , 0) ] , area ) ;
}
15 / 45
16. ( Matrix float <1 > , int , int ) getTrough
( Matrix float <1 > ts , int i )
{
int beginning = i ;
int n = dimSize ( ts , 0) ;
while ( i +1 < n && ts [ i ] >= ts [ i +1])
i = i +1;
while ( i +1 < n && ts [ i ] < ts [ i +1])
i = i +1;
return ( ts [ beginning :: i ] , beginning , i ) ;
}
16 / 45
17. Matrix extensions
several features from MATLAB
with, fold, and genarray from Single Assignment C
all translated down to expected C code
straightforward parallel implementations of matrixMap,
with, fold, and genarray.
17 / 45
19. # include " stdio . h "
int main ( int
int meter x
int meter y
int meter ^2
argc , char * argv []) {
= 3.4 ;
= 5.6 ;
area = x * y ;
printf ( " % d n " , x + y ) ;
printf ( " % d n " , x + z ) ;
// OK
// Error
}
19 / 45
20. # include " stdio . h "
int main ( int
int meter x
int meter y
int meter ^2
argc , char * argv []) {
= 3.4 ;
= 5.6 ;
area = x * y ;
printf ( " % d n " , x + y ) ; // OK
// printf ("% d n " , x + z ) ; // Error
}
20 / 45
21. # include " stdio . h "
int main ( int
int
x
int
y
int
argc , char * argv []) {
= 3.4 ;
= 5.6 ;
area = x * y ;
printf ( " % d n " , x + y ) ;
// OK
}
Extensions of this form find errors, but otherwise are “erased”
during translation.
21 / 45
22. Extension composition
Programmers can select the extensions that they want.
May want to use multiple extensions in the same program.
Distinguish between
1. extension user
has no knowledge of language design or implementations
2. extension developer
must know about language design and implementation
Tools build a custom .xc =⇒ .c translator for them
How can that be done?
22 / 45
23. Building translators from composable extensible
languages
Two primary challenges:
1. composable syntax — enables building a scanner, parser
context-aware scanning [GPCE’07]
modular determinism analysis [PLDI’09]
Copper
2. composable semantics — analysis and translations
attribute grammars with forwarding, collections and
higher-order attributes
set union of specification components
sets of productions, non-terminals, attributes
sets of attribute defining equations, on a production
sets of equations contributing values to a single attribute
modular well-definedness analysis [SLE’12a]
modular termination analysis [SLE’12b, Krishnan-PhD]
Silver
23 / 45
24. Generating parsers and scanners from grammars
and regular expressions
nonterminals: Stmt, Expr
terminals: Id
/[a-zA-Z][a-zA-Z0-9]*/
Num /[0-9]+/
Eq
’=’
Semi ’;’
Plus ’+’
Mult ’*’
Stmt ::= Stmt Semi Stmt
Stmt ::= Id Eq Expr
Expr ::= Expr Plus Expr
Expr ::= Expr Mult Expr
Expr ::= Id
24 / 45
26. Attribute Grammars
add semantics — meaning — to context free grammars
nodes (non-terminals) have attributes
that is, semantic values
Expr may be attributed with
type - the type of the expression
errors - list of error messages
env - mapping variable names to their types
Stmt may be attributed with errors and env
26 / 45
28. Attribute grammar specifications
Equations associated with productions define attribute values.
abstract production addition
e : : Expr : : = l : : Expr ’+ ’ r : : Expr
{
e . e r r o r s := l . e r r o r ++ r . e r r o r s ++
. . . c h e c k t h a t l and r a r e i n t e g e r s
...
e . type = i n t ;
l . env = e . env ;
r . env = e . env ;
}
28 / 45
29. Modern attribute grammars
higher-order attributes
reference attributes
collection attributes
forwarding
module systems
separate compilation
etc.
29 / 45
30. for-loop as an extension
abstract production for
s : : Stmt : : = i : : Name l o w e r : : Expr u p p e r : : Expr
body : : Stmt
{
s . e r r o r s := l o w e r . e r r o r ++ u p p e r . e r r o r s ++
body . e r r o r s ++
. . . c h e c k t h a t i i s an i n t e g e r . . .
forwards to
// i=l o w e r ; w h i l e ( i <= u p p e r ) { body ; i=i +1;}
seq ( assignment ( varRef ( i ) , lower ) ,
while (
l t e ( varRef ( i ) , upper ) ,
b l o c k ( s e q ( body ,
a s s i g n m e n t ( v a r R e f ( i ) , add ( v a r R e f ( i ) ,
i n t L i t ( ”1” ) ) ) ) ) ) ) ;
}
30 / 45
31. Building an attribute grammar evaluator from composed
specifications.
... AG H ∪∗ {AG E1 , ..., AG En }
∀i ∈ [1, n].modComplete(AG H , AG Ei )
E
E
⇒ ⇒ complete(AG H ∪ {AG1 , ..., AGn })
Monolithic analysis - not too hard, but not too useful.
Modular analysis - harder, but required [SLE’12a].
31 / 45
32. Challenges in scanning
Keywords in embedded languages may be identifiers in host
language:
int SELECT ;
...
rs = using c query { SELECT last name
FROM person WHERE ...
32 / 45
33. Challenges in scanning
Different extensions use same keyword
connection c "jdbc:derby:./derby/db/testdb"
with table person [ person id INTEGER,
first name VARCHAR ];
...
b = table ( c1 : T F ,
c2 : F * ) ;
33 / 45
34. Challenges in scanning
Operators with different precedence specifications:
x = 3 + y * z ;
...
str = /[a-z][a-z0-9]*.java/
34 / 45
36. Need for context
Traditionally, parser and scanner are disjoint.
Scanner → Parser → Semantic Analysis
In context aware scanning, they communicate
Scanner
Parser → Semantic Analysis
36 / 45
37. Context aware scanning
Scanner recognizes only tokens valid for current “context”
keeps embedded sub-languages, in a sense, separate
Consider:
chan in, out;
for i in a { a[i] = i*i ; }
Two terminal symbols that match “in”.
terminal IN ’in’ ;
terminal ID /[a-zA-Z ][a-zA-Z 0-9]*/
submits to {keyword };
terminal FOR ’for’ lexer class {keyword };
example is part of AbleP [SPIN’11]
37 / 45
38. Parsing C as an extension to Promela
c_decl {
typedef struct Coord {
int x, y; } Coord;
c_state "Coord pt" "Global"
int z = 3;
}
/* goes in state vector */
/* standard global decl */
active proctype example()
{ c_code { now.pt.x = now.pt.y = 0; };
do :: c_expr { now.pt.x == now.pt.y }
-> c_code { now.pt.y++; }
:: else -> break
od;
c_code { printf("values %d: %d, %d,%dn",
Pexample->_pid, now.z, now.pt.x, now.pt.y);
38 / 45
39. Context aware scanning
This scanning algorithm subordinates the
disambiguation principle of maximal munch
to the principle of
disambiguation by context.
It will return a shorter valid match before a longer invalid
match.
In List<List<Integer>> before “>”,
“>” in valid lookahead but “>>” is not.
A context aware scanner is essentially an implicitly-moded
scanner.
There is no explicit specification of valid look ahead.
It is generated from standard grammars and terminal
regexs.
39 / 45
40. With a smarter scanner, LALR(1) is not so brittle.
We can build syntactically composable language
extensions.
Context aware scanning makes composable syntax “more
likely”
But it does not give a guarantee of composability.
40 / 45
41. Building a parser from composed specifications.
... CFG H ∪∗ {CFG E1 , ..., CFG En }
∀i ∈ [1, n].isComposable(CFG H , CFG Ei )∧
conflictFree(CFG H ∪ CFG Ei )
⇒ ⇒ conflictFree(CFG H ∪ {CFG E1 , ..., CFG En })
Monolithic analysis - not too hard, but not too useful.
Modular analysis - harder, but required [PLDI’09].
Non-commutative composition of restricted LALR(1)
grammars.
41 / 45
43. Expressiveness versus safe composition
Compare to
other parser generators
libraries
The modular compositionality analysis does not require
context aware scanning.
But, context aware scanning makes it practical.
43 / 45
44. Future Work
ableC - extensible C11 specification
builds on lessons learned from extensible specifications of
Java [ECOOP’07], Lustre [FASE’07], Modelica,
Promela [SPIN’11].
incorporate existing language extensions
composition of language extensions are compile-time
language specific analysis
new applications of AGs
44 / 45
45. Thanks for your attention.
Questions?
http://melt.cs.umn.edu
evw@cs.umn.edu
45 / 45
46. Eric Van Wyk and August Schwerdfeger.
Context-aware scanning for parsing extensible languages.
In Intl. Conf. on Generative Programming and Component
Engineering, (GPCE), pages 63–72. ACM, 2007.
Eric Van Wyk, Derek Bodin, Jimin Gao, and Lijesh
Krishnan.
Silver: an extensible attribute grammar system.
Science of Computer Programming, 75(1–2):39–54,
January 2010.
August Schwerdfeger and Eric Van Wyk.
Verifiable composition of deterministic grammars.
In Proc. of Conf. on Programming Language Design and
Implementation (PLDI), pages 199–210. ACM, June 2009.
45 / 45
47. Ted Kaminski and Eric Van Wyk.
Modular well-definedness analysis for attribute grammars.
In Proc. of Intl. Conf. on Software Language Engineering
(SLE), volume 7745 of LNCS, pages 352–371.
Springer-Verlag, September 2012.
Lijesh Krishnan and Eric Van Wyk.
Termination analysis for higher-order attribute grammars.
In Proceedings of the 5th International Conference on
Software Language Engineering (SLE 2012), volume 7745
of LNCS, pages 44–63. Springer-Verlag, September 2012.
Lijesh Krishnan.
Composable Semantics Using Higher-Order Attribute
Grammars.
PhD thesis, University of Minnesota, Department of
Computer Science and Engineering, 2012.
http://purl.umn.edu/144010
45 / 45
48. Yogesh Mali and Eric Van Wyk.
Building extensible specifications and implementations of
Promela with AbleP.
In Proc. of Intl. SPIN Workshop on Model Checking of
Software, volume 6823 of LNCS, pages 108–125.
Springer-Verlag, July 2011.
Eric Van Wyk, Lijesh Krishnan, August Schwerdfeger, and
Derek Bodin.
Attribute grammar-based language extensions for Java.
In Proc. of European Conf. on Object Oriented Prog.
(ECOOP), volume 4609 of LNCS, pages 575–599.
Springer-Verlag, 2007.
45 / 45
49. Jimin Gao, Mats Heimdahl, and Eric Van Wyk.
Flexible and extensible notations for modeling languages.
In Fundamental Approaches to Software Engineering,
FASE 2007, volume 4422 of LNCS, pages 102–116.
Springer-Verlag, March 2007.
45 / 45