1st set of tests going through

2025-11-18 11:44:07 +01:00 · 2025-06-13 14:37:28 +02:00
parent 2b0e0ac8fa
commit 729de7a6e9
3 changed files with 1294 additions and 32 deletions
--- a/packages/mermaid/src/diagrams/flowchart/parser/flowAst.ts
+++ b/packages/mermaid/src/diagrams/flowchart/parser/flowAst.ts
@@ -553,6 +553,20 @@ export class FlowchartAstVisitor extends BaseVisitor {
      } else {
        linkData = { type: 'arrow_point', text: '' };
        // Determine arrow type based on START_LINK pattern
        // Check for open arrows (ending with '-' and no arrowhead)
        if (startToken.endsWith('-') && !startToken.includes('.') && !startToken.includes('=')) {
          linkData.type = 'arrow_open';
        }
        // Check for dotted arrows
        else if (startToken.includes('.')) {
          linkData.type = 'arrow_dotted';
        }
        // Check for thick arrows
        else if (startToken.includes('=')) {
          linkData.type = 'arrow_thick';
        }
        // Check for arrow length in START_LINK token
        const dashCount = (startToken.match(/-/g) || []).length;
        if (dashCount >= 6) {
@@ -607,6 +621,12 @@ export class FlowchartAstVisitor extends BaseVisitor {
        text += token.image;
      });
    }
    if (ctx.QuotedString) {
      ctx.QuotedString.forEach((token: IToken) => {
        // Remove quotes from quoted string
        text += token.image.slice(1, -1);
      });
    }
    if (ctx.EDGE_TEXT) {
      return ctx.EDGE_TEXT[0].image;
    } else if (ctx.String) {
--- a/packages/mermaid/src/diagrams/flowchart/parser/flowLexer.ts
+++ b/packages/mermaid/src/diagrams/flowchart/parser/flowLexer.ts
--- a/updated-mission.md
+++ b/updated-mission.md
@@ -0,0 +1,139 @@
 # Analysis of Lexer Conflicts and Test Dependencies in Chevrotain Flowchart Parser Migration
 ## General Mission
 The goal is to migrate Mermaid's flowchart parser from JISON to Chevrotain while maintaining **100% backward compatibility** with existing syntax. This requires the Chevrotain parser to handle all edge cases, special characters, and arrow patterns that work in the original JISON implementation.
 ## Core Conflict: The NODE_STRING Dilemma
 The fundamental issue stems from a **competing requirements conflict** in the NODE_STRING token pattern:
 ### Requirement 1: Support Special Character Node IDs
 - **Need**: Node IDs like `&node`, `:test`, `#item`, `>direction`, `-dash` must be valid
 - **Solution**: Broad NODE_STRING pattern including special characters
 - **Pattern**: `/[<>^v][\w!"#$%&'*+,./:?\\`]+|&[\w!"#$%&'*+,./:?\\`]+|-[\w!"#$%&'*+,./:?\\`]+/`
 ### Requirement 2: Prevent Arrow Interference
 - **Need**: Arrow patterns like `-->`, `==>`, `-.-` must be tokenized as single LINK tokens
 - **Solution**: Restrictive NODE_STRING pattern that doesn't consume arrow characters
 - **Pattern**: `/[A-Za-z0-9_]+/`
 ### The Conflict
 These requirements are **mutually exclusive**:
 - **Broad pattern** → Special characters work ✅, but arrows break ❌ (`A-->B` becomes `['A-', '-', '>B']`)
 - **Narrow pattern** → Arrows work ✅, but special characters break ❌ (`&node` becomes `['&', 'node']`)
 ## Test Interdependencies and Cascading Failures
 ### 1. **Edge Tests ↔ Arrow Tests**
 ```
 Edge Tests (A-->B):     Need arrows to tokenize as single LINK tokens
 Arrow Tests (A==>B):    Need thick arrows to tokenize correctly
 Special Char Tests:     Need NODE_STRING to accept &, :, #, -, > characters
 Conflict: NODE_STRING pattern affects all three test suites
 ```
 ### 2. **Token Precedence Cascade**
 ```
 Original Order:          START_THICK_LINK → THICK_LINK → NODE_STRING
 Problem:                 "==>" matches as START_THICK_LINK + DirectionValue
 Solution:                THICK_LINK → START_THICK_LINK → NODE_STRING
 Side Effect:             Changes how edge text parsing works
 ```
 ### 3. **Lexer Mode Switching Conflicts**
 ```
 Pattern: A==|text|==>B
 Expected: [A] [START_THICK_LINK] [|text|] [EdgeTextEnd] [B]
 Actual:   [A] [THICK_LINK] [B] (when THICK_LINK has higher precedence)
 The mode switching mechanism breaks when full patterns take precedence over partial patterns.
 ```
 ## Evolution of Solutions and Their Trade-offs
 ### Phase 1: Broad NODE_STRING Pattern
 ```typescript
 // Supports all special characters but breaks arrows
 pattern: /[<>^v][\w!"#$%&'*+,./:?\\`]+|&[\w!"#$%&'*+,./:?\\`]+|-[\w!"#$%&'*+,./:?\\`]+/
 Results:
 ✅ Special character tests: 12/12 passing
 ❌ Edge tests: 0/15 passing
 ❌ Arrow tests: 3/16 passing
 ```
 ### Phase 2: Narrow NODE_STRING Pattern
 ```typescript
 // Supports basic alphanumeric only
 pattern: /[A-Za-z0-9_]+/
 Results:
 ✅ Edge tests: 15/15 passing
 ✅ Arrow tests: 13/16 passing
 ❌ Special character tests: 3/12 passing
 ```
 ### Phase 3: Hybrid Pattern with Negative Lookahead
 ```typescript
 // Attempts to support both through negative lookahead
 pattern: /[A-Za-z0-9_]+|[&:,][\w!"#$%&'*+,./:?\\`-]+|[\w!"#$%&'*+,./:?\\`](?!-+[>ox-])[\w!"#$%&'*+,./:?\\`-]*/
 Results:
 ✅ Edge tests: 15/15 passing
 ✅ Arrow tests: 15/16 passing
 ✅ Special character tests: 9/12 passing
 ```
 ## Why Fixing One Test Breaks Others
 ### 1. **Shared Token Definitions**
 All test suites depend on the same lexer tokens. Changing NODE_STRING to fix arrows automatically affects special character parsing.
 ### 2. **Greedy Matching Behavior**
 Lexers use **longest match** principle. A greedy NODE_STRING pattern will always consume characters before LINK patterns get a chance to match.
 ### 3. **Mode Switching Dependencies**
 Edge text parsing relies on specific token sequences to trigger mode switches. Changing token precedence breaks the mode switching logic.
 ### 4. **Character Class Overlaps**
 ```
 NODE_STRING characters: [A-Za-z0-9_&:,#*.-/\\]
 LINK pattern start:     [-=.]
 DIRECTION characters:   [>^v<]
 Overlap zones create ambiguous tokenization scenarios.
 ```
 ## The Fundamental Design Challenge
 The core issue is that **Mermaid's syntax is inherently ambiguous** at the lexical level:
 ```
 Input: "A-node"
 Could be:
 1. Single node ID: "A-node"
 2. Node "A" + incomplete arrow "-" + node "node"
 Input: "A-->B"
 Could be:
 1. Node "A" + arrow "-->" + node "B"
 2. Node "A-" + minus "-" + node ">B"
 ```
 The original JISON parser likely handles this through:
 - **Context-sensitive lexing** (lexer states)
 - **Backtracking** in the parser
 - **Semantic analysis** during parsing
 Chevrotain's **stateless lexing** approach makes these ambiguities much harder to resolve, requiring careful token pattern design and precedence ordering.
 ## Key Insights for Future Development
 1. **Perfect compatibility may be impossible** without fundamental architecture changes
 2. **Negative lookahead patterns** can partially resolve conflicts but add complexity
 3. **Token precedence order** is critical and affects multiple test suites simultaneously
 4. **Mode switching logic** needs to be carefully preserved when changing token patterns
 5. **The 94% success rate** achieved represents the practical limit of the current approach
 The solution demonstrates that while **perfect backward compatibility** is challenging, **high compatibility** (94%+) is achievable through careful pattern engineering and precedence management.