konard commented Nov 30, 2025

Summary

This PR implements automatic tokenization of punctuation and math symbols during parsing, as requested in issue #148. The parser now separates these symbols from adjacent characters, making them individual references in Links Notation.

Key Features

  • Punctuation tokenization: Symbols like `,`, `.`, `;`, `!`, `?` are separated from words

    • `1, 2 and 3` → `["1", ",", "2", "and", "3"]`
    • `hello, world` → `["hello", ",", "world"]`
  • Math symbol tokenization: Symbols like `+`, `-`, `*`, `/`, `=` are separated only when between digits

    • `1+1` → `["1", "+", "1"]`
    • `10-20` → `["10", "-", "20"]`
  • Hyphenated words preserved: Words with hyphens between letters are kept intact

    • `Jean-Luc Picard` → `["Jean-Luc", "Picard"]`
    • `conan-center-index` → `["conan-center-index"]`
  • Quoted strings preserved: Content inside quotes is not tokenized

    • `"1,2,3"` → `["1,2,3"]`
    • `"hello, world"` → `["hello, world"]`
  • Base64 preserved: Strings like `bmFtZQ==` are kept intact

  • Compact formatting: New option to restore human-readable output without spaces around symbols
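
The rules listed above can be pictured with a small standalone sketch. It is not the code from this PR: the function name `tokenize` and the exact symbol sets are assumptions made for illustration, and the real parser handles full Links Notation syntax rather than a flat string.

```python
# Hypothetical sketch of the tokenization rules; not the PR's implementation.
PUNCTUATION = set(",.;!?")   # assumed set, taken from the examples in the description
MATH_SYMBOLS = set("+-*/=")  # assumed set, taken from the examples in the description


def tokenize(text: str) -> list[str]:
    tokens: list[str] = []
    current = ""  # characters of the token being built

    def flush() -> None:
        nonlocal current
        if current:
            tokens.append(current)
            current = ""

    i = 0
    while i < len(text):
        ch = text[i]
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == '"':
            # Quoted string: copy everything up to the closing quote verbatim.
            end = text.find('"', i + 1)
            if end == -1:
                end = len(text)  # unterminated quote: take the rest of the input
            flush()
            tokens.append(text[i + 1:end])
            i = end + 1
            continue
        if ch.isspace():
            flush()
        elif ch in PUNCTUATION and prev.isalnum():
            # Punctuation directly after a letter or digit becomes its own token.
            flush()
            tokens.append(ch)
        elif ch in MATH_SYMBOLS and prev.isdigit() and nxt.isdigit():
            # Math symbols split only when a digit sits on both sides.
            flush()
            tokens.append(ch)
        else:
            # Everything else (letters, hyphens in words, Base64 padding) stays put.
            current += ch
        i += 1
    flush()
    return tokens


print(tokenize("1, 2 and 3"))
print(tokenize("Jean-Luc Picard"))
print(tokenize("bmFtZQ=="))
```

Hyphenated words and Base64 padding need no special case in this sketch: they are preserved simply because `-` and `=` are only split when both neighbours are digits.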

Backward Compatibility

Set `tokenizeSymbols: false` (JS) or `tokenize_symbols=False` (Python), or use `parse_lino_raw()` (Rust), to disable tokenization.
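
As a rough illustration of what the opt-out means, the sketch below uses a hypothetical helper, `parse_values`, with a `tokenize_symbols` keyword. It is not the library's `Parser` class, and it ignores quoting and nesting; it only shows how the same input yields different values with the flag on and off.

```python
import re

# Punctuation after an alphanumeric character, or a math symbol between two digits.
# The pattern is an assumption based on the rules described in this PR.
_SYMBOL = re.compile(r"(?<=[A-Za-z0-9])[,.;!?]|(?<=\d)[+\-*/=](?=\d)")


def parse_values(text: str, tokenize_symbols: bool = True) -> list[str]:
    """Hypothetical helper: split text into values, optionally isolating symbols."""
    if tokenize_symbols:
        # Surround each matching symbol with spaces so split() isolates it.
        text = _SYMBOL.sub(lambda m: f" {m.group(0)} ", text)
    # With the flag off, only whitespace separates values (the pre-0.13.0 behaviour
    # as described in this PR).
    return text.split()


print(parse_values("1+1"))
print(parse_values("1+1", tokenize_symbols=False))
```

Callers that depend on values such as `1+1` or `hello,` staying in one piece would pass the flag as `False`.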

Test Plan

  • All 163 JS tests pass (including 23 new tokenization tests)
  • All 110 Rust tests pass (including 16 new tokenization tests)
  • All 161 Python tests pass (including 23 new tokenization tests)

Version Bump

Updated all implementations to version 0.13.0.

Fixes #148

🤖 Generated with Claude Code

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: #148
@konard konard self-assigned this Nov 30, 2025
This change adds automatic tokenization of punctuation and math symbols
during parsing, making them separate references in Links Notation.

Key features:
- Punctuation (`,`, `.`, `;`, `!`, `?`) is tokenized when following alphanumeric characters
- Math symbols (`+`, `-`, `*`, `/`, `=`, etc.) are tokenized only when between digits
- Hyphenated words (e.g., "Jean-Luc", "conan-center-index") are preserved
- Quoted strings preserve their content
- Base64 strings with `=` padding are preserved
- Backward compatible: use `tokenizeSymbols: false` to disable

New APIs:
- JS: `Parser({ tokenizeSymbols: boolean })`, `FormatOptions({ compactSymbols: boolean })`
- Rust: `Tokenizer`, `parse_lino_raw()`, `format_links_compact()`
- Python: `Parser(tokenize_symbols=bool)`, `Tokenizer` class
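
The compact formatting option can be thought of as the reverse step: joining values back together without spaces around symbol tokens. The sketch below is an assumption about that behaviour, not the code behind `compactSymbols` or `format_links_compact()`; the name `format_compact` and the symbol set are hypothetical.

```python
# Hypothetical symbol set; the real formatter may recognise a different one.
SYMBOLS = set(",.;!?+-*/=")


def format_compact(tokens: list[str]) -> str:
    """Join tokens with single spaces, except directly before or after a symbol."""
    out = ""
    prev_is_symbol = False
    for token in tokens:
        is_symbol = token in SYMBOLS
        # Add a separating space only between two non-symbol tokens.
        if out and not is_symbol and not prev_is_symbol:
            out += " "
        out += token
        prev_is_symbol = is_symbol
    return out


print(format_compact(["1", "+", "1"]))
print(format_compact(["Jean-Luc", "Picard"]))
```

Under this assumption the round trip is lossy for inputs such as `1, 2 and 3`, which would come back as `1,2 and 3`; the actual formatter may treat spacing after punctuation differently.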

Example:
- Input: `1,2,3` → Values: `["1", ",", "2", ",", "3"]`
- Input: `1+1` → Values: `["1", "+", "1"]`
- Input: `Jean-Luc` → Values: `["Jean-Luc"]` (preserved)
- Input: `"1,2,3"` → Values: `["1,2,3"]` (quoted preserved)

Fixes #148

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
konard changed the title from "[WIP] Better punctuation and math symbols support" to "feat: Add punctuation and math symbol tokenization" Nov 30, 2025
konard marked this pull request as ready for review November 30, 2025 20:42
konard commented Nov 30, 2025

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $12.209440 USD
  • Calculated by Anthropic: unknown
  • Difference: unknown
    📎 Log file uploaded as GitHub Gist (1145KB)
    🔗 View complete solution draft log

The working session has now ended; feel free to review and add any feedback on the solution draft.
