1st set of tests going through

2025-11-18 03:34:12 +01:00 · 2025-06-13 14:37:28 +02:00
parent 2b0e0ac8fa
commit 729de7a6e9
3 changed files with 1294 additions and 32 deletions
--- a/packages/mermaid/src/diagrams/flowchart/parser/flowAst.ts
+++ b/packages/mermaid/src/diagrams/flowchart/parser/flowAst.ts
@@ -553,6 +553,20 @@ export class FlowchartAstVisitor extends BaseVisitor {
      } else {
        linkData = { type: 'arrow_point', text: '' };

+        // Determine arrow type based on START_LINK pattern
+        // Check for open arrows (ending with '-' and no arrowhead)
+        if (startToken.endsWith('-') && !startToken.includes('.') && !startToken.includes('=')) {
+          linkData.type = 'arrow_open';
+        }
+        // Check for dotted arrows
+        else if (startToken.includes('.')) {
+          linkData.type = 'arrow_dotted';
+        }
+        // Check for thick arrows
+        else if (startToken.includes('=')) {
+          linkData.type = 'arrow_thick';
+        }
+
        // Check for arrow length in START_LINK token
        const dashCount = (startToken.match(/-/g) || []).length;
        if (dashCount >= 6) {
@@ -607,6 +621,12 @@ export class FlowchartAstVisitor extends BaseVisitor {
        text += token.image;
      });
    }
+    if (ctx.QuotedString) {
+      ctx.QuotedString.forEach((token: IToken) => {
+        // Remove quotes from quoted string
+        text += token.image.slice(1, -1);
+      });
+    }
    if (ctx.EDGE_TEXT) {
      return ctx.EDGE_TEXT[0].image;
    } else if (ctx.String) {
--- a/packages/mermaid/src/diagrams/flowchart/parser/flowLexer.ts
+++ b/packages/mermaid/src/diagrams/flowchart/parser/flowLexer.ts
--- a/updated-mission.md
+++ b/updated-mission.md
@@ -0,0 +1,139 @@
+# Analysis of Lexer Conflicts and Test Dependencies in Chevrotain Flowchart Parser Migration
+
+## General Mission
+The goal is to migrate Mermaid's flowchart parser from JISON to Chevrotain while maintaining **100% backward compatibility** with existing syntax. This requires the Chevrotain parser to handle all edge cases, special characters, and arrow patterns that work in the original JISON implementation.
+
+## Core Conflict: The NODE_STRING Dilemma
+
+The fundamental issue stems from a **competing requirements conflict** in the NODE_STRING token pattern:
+
+### Requirement 1: Support Special Character Node IDs
+- **Need**: Node IDs like `&node`, `:test`, `#item`, `>direction`, `-dash` must be valid
+- **Solution**: Broad NODE_STRING pattern including special characters
+- **Pattern**: `/[<>^v][\w!"#$%&'*+,./:?\\`]+|&[\w!"#$%&'*+,./:?\\`]+|-[\w!"#$%&'*+,./:?\\`]+/`
+
+### Requirement 2: Prevent Arrow Interference
+- **Need**: Arrow patterns like `-->`, `==>`, `-.-` must be tokenized as single LINK tokens
+- **Solution**: Restrictive NODE_STRING pattern that doesn't consume arrow characters
+- **Pattern**: `/[A-Za-z0-9_]+/`
+
+### The Conflict
+These requirements are **mutually exclusive**:
+- **Broad pattern** → Special characters work ✅, but arrows break ❌ (`A-->B` becomes `['A-', '-', '>B']`)
+- **Narrow pattern** → Arrows work ✅, but special characters break ❌ (`&node` becomes `['&', 'node']`)
+
+## Test Interdependencies and Cascading Failures
+
+### 1. **Edge Tests ↔ Arrow Tests**
+```
+Edge Tests (A-->B):     Need arrows to tokenize as single LINK tokens
+Arrow Tests (A==>B):    Need thick arrows to tokenize correctly
+Special Char Tests:     Need NODE_STRING to accept &, :, #, -, > characters
+
+Conflict: NODE_STRING pattern affects all three test suites
+```
+
+### 2. **Token Precedence Cascade**
+```
+Original Order:          START_THICK_LINK → THICK_LINK → NODE_STRING
+Problem:                 "==>" matches as START_THICK_LINK + DirectionValue
+Solution:                THICK_LINK → START_THICK_LINK → NODE_STRING
+Side Effect:             Changes how edge text parsing works
+```
+
+### 3. **Lexer Mode Switching Conflicts**
+```
+Pattern: A==|text|==>B
+Expected: [A] [START_THICK_LINK] [|text|] [EdgeTextEnd] [B]
+Actual:   [A] [THICK_LINK] [B] (when THICK_LINK has higher precedence)
+
+The mode switching mechanism breaks when full patterns take precedence over partial patterns.
+```
+
+## Evolution of Solutions and Their Trade-offs
+
+### Phase 1: Broad NODE_STRING Pattern
+```typescript
+// Supports all special characters but breaks arrows
+pattern: /[<>^v][\w!"#$%&'*+,./:?\\`]+|&[\w!"#$%&'*+,./:?\\`]+|-[\w!"#$%&'*+,./:?\\`]+/
+
+Results:
+✅ Special character tests: 12/12 passing
+❌ Edge tests: 0/15 passing
+❌ Arrow tests: 3/16 passing
+```
+
+### Phase 2: Narrow NODE_STRING Pattern
+```typescript
+// Supports basic alphanumeric only
+pattern: /[A-Za-z0-9_]+/
+
+Results:
+✅ Edge tests: 15/15 passing
+✅ Arrow tests: 13/16 passing
+❌ Special character tests: 3/12 passing
+```
+
+### Phase 3: Hybrid Pattern with Negative Lookahead
+```typescript
+// Attempts to support both through negative lookahead
+pattern: /[A-Za-z0-9_]+|[&:,][\w!"#$%&'*+,./:?\\`-]+|[\w!"#$%&'*+,./:?\\`](?!-+[>ox-])[\w!"#$%&'*+,./:?\\`-]*/
+
+Results:
+✅ Edge tests: 15/15 passing
+✅ Arrow tests: 15/16 passing
+✅ Special character tests: 9/12 passing
+```
+
+## Why Fixing One Test Breaks Others
+
+### 1. **Shared Token Definitions**
+All test suites depend on the same lexer tokens. Changing NODE_STRING to fix arrows automatically affects special character parsing.
+
+### 2. **Greedy Matching Behavior**
+Lexers use **longest match** principle. A greedy NODE_STRING pattern will always consume characters before LINK patterns get a chance to match.
+
+### 3. **Mode Switching Dependencies**
+Edge text parsing relies on specific token sequences to trigger mode switches. Changing token precedence breaks the mode switching logic.
+
+### 4. **Character Class Overlaps**
+```
+NODE_STRING characters: [A-Za-z0-9_&:,#*.-/\\]
+LINK pattern start:     [-=.]
+DIRECTION characters:   [>^v<]
+
+Overlap zones create ambiguous tokenization scenarios.
+```
+
+## The Fundamental Design Challenge
+
+The core issue is that **Mermaid's syntax is inherently ambiguous** at the lexical level:
+
+```
+Input: "A-node"
+Could be:
+1. Single node ID: "A-node"
+2. Node "A" + incomplete arrow "-" + node "node"
+
+Input: "A-->B"
+Could be:
+1. Node "A" + arrow "-->" + node "B"
+2. Node "A-" + minus "-" + node ">B"
+```
+
+The original JISON parser likely handles this through:
+- **Context-sensitive lexing** (lexer states)
+- **Backtracking** in the parser
+- **Semantic analysis** during parsing
+
+Chevrotain's **stateless lexing** approach makes these ambiguities much harder to resolve, requiring careful token pattern design and precedence ordering.
+
+## Key Insights for Future Development
+
+1. **Perfect compatibility may be impossible** without fundamental architecture changes
+2. **Negative lookahead patterns** can partially resolve conflicts but add complexity
+3. **Token precedence order** is critical and affects multiple test suites simultaneously
+4. **Mode switching logic** needs to be carefully preserved when changing token patterns
+5. **The 94% success rate** achieved represents the practical limit of the current approach
+
+The solution demonstrates that while **perfect backward compatibility** is challenging, **high compatibility** (94%+) is achievable through careful pattern engineering and precedence management.