This is an extension and partially a compilation of what has been discussed here
We would like to add diagrams displaying syntax structure by parsing formal grammar definitions.
Alternatives and tools
The idea is not new, there are many different tools and alternatives
Kind | Source code or link to the page | Live |
---|---|---|
Visualizer | https://github.com/jacquev6/DrawGrammar | https://jacquev6.github.io/DrawGrammar/ |
Visualizer | https://github.com/GuntherRademacher/rr | https://bottlecaps.de/rr/ui |
Visualizer | http://dotnet.jku.at/applications/visualizer/ | |
Parser | https://www.icosaedro.it/bnf_chk/ | |
Renderer | https://www.npmjs.com/package/railroad-diagrams |
Which standard to use
We have a couple of standards and different notations out there, so we should probably support as many as we can at once
Limitations
DrawGrammar
It does not perform left-reduction
RR
In RR 'a'+?
and 'a'?+
are effectively reduced to 'a'*
, and ('a'|)+
reduced to ('a'|)+
but not to the 'a'*
, although they are effectively the same. It seems that RR needs like 2 steps for that reduction.
It also cannot compress a|a*
although it obviously equals to a+
Concepts
AST
Instead of binary concatenation operator
graph
c1[,] --> A
c2[,] --> B
c2[,] --> C
c1 --> c2
Lets use multiple operands
graph
c1[,]
c1 --> A
c1 --> B
c1 --> C
Normalizing and compressing diagram
RR approach
Author of RR suggest to reduce AST down to two operators.
- alternative
- one or more
See the longer example:
rule ::= 'a'? 'b'+ 'c'* 'd'?+ 'e'?* 'f'+* ('g'|) ('h'|)+ ('i'|)* ('w'|)+* 'x' y 'z'
becomes
[rule]::= 'a'? 'b'+ 'c'* 'd'* 'e'* 'f'* 'g'? 'h'?+ 'i'?* 'w'?* 'x' [y] 'z'
How to represent repetitions
I have not come to a conclusion yet, there are some cases in this implementation that are really tricky, but some, that are relatively simple are not processed.
Currently I am trying to figure out, should "one or many" be a node or a "transition" just like empty transition.
May be it'll make something simpler, may be not. In that case we probably could use only alternative, and definition representing "list of consequent terms and non terms".
%e
- epsilon, empty transition from the start
state to the end
state
%r
- repetition, empty transition from the end
to the start
state
x
or 1
- only one
x?
or x|%e
- zero or one
x+
or x|%r
- one or many
x*
or x|%e|%r
- zero or many
Compression Rules
uncompressed => compressed
We can represent quantifiers as alternatives, or as alternatives with repetitions
a? => a | %e
a+ => a | %r
a* => a | %e | %r
Two quantifiers one after another does not meed being 'greedy' (like in regexp)
(a?)? => a | %e | %e => a?
(a?)+ => a | %e | %r) => a*
(a?)* => a | %e | %e | %r => a?
(a+)? => a | %r | %e => a*
(a+)+ => a | %r | %r => a+
(a+)* => a | %r | %r | %e => a*
// because a* contains %e and %r it always produces a*
a*? => a*
a*+ => a*
a** => a*
Alternatives can unite its elements
a | a? => a | a | %e => a?
a | a+ => a | a | %r => a+
a | a* => a | a | %e | %r => a*
Concatenation can unite its elements single element + quantifier
a a? a => a a? a
a a* a => a a+
a a+ a => a a a+
quantifier + quantifier
a? a? => a{0,2} # we do not support this
a? a? => a? a?
a? a+ => a? a+
a? a* => (a|%e) (a|%e)* => a*
a* a* => (a|%e)* (a|%e)* => a*
a+ a+ => a a* a a* => a a+
a+ a* => a+
a* a+ => a+
Unwrap and unite alternatives
(a) => a
a|(b|c) => a|b|c
Common prefixes
TODO
Recursion and reduction
We need to
- eliminate recursion if possible
- apply left-reduction, search for common prefixes and suffixes in long rules e.g.
a b c d | a b
becomesa b (c d)?
rule: 'a' rule | %e => rule: a*
rule ::= 'a' rule | 'a' => rule: a+
Examples
alt ::= a(b a)*
alt ::= a b c d | b | c d
footer