feat: Add punctuation and math symbol tokenization #170

konard · 2025-11-30T20:28:12Z

Summary

This PR implements automatic tokenization of punctuation and math symbols during parsing, as requested in issue #148. The parser now separates these symbols from adjacent characters, making them individual references in Links Notation.

Key Features

- Punctuation tokenization: Symbols like `,`, `.`, `;`, `!`, `?` are separated from words
  - `1, 2 and 3` → `["1", ",", "2", "and", "3"]`
  - `hello, world` → `["hello", ",", "world"]`
- Math symbol tokenization: Symbols like `+`, `-`, `*`, `/`, `=` are separated only when between digits
  - `1+1` → `["1", "+", "1"]`
  - `10-20` → `["10", "-", "20"]`
- Hyphenated words preserved: Words with hyphens between letters are kept intact
  - `Jean-Luc Picard` → `["Jean-Luc", "Picard"]`
  - `conan-center-index` → `["conan-center-index"]`
- Quoted strings preserved: Content inside quotes is not tokenized
  - `"1,2,3"` → `["1,2,3"]`
  - `"hello, world"` → `["hello, world"]`
- Base64 preserved: Strings like `bmFtZQ==` are kept intact
- Compact formatting: New option to restore human-readable output without spaces around symbols
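The rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the code from this PR: the names `tokenize`, `split_word`, `PUNCTUATION`, and `MATH_SYMBOLS` are hypothetical, and the sketch assumes double-quoted strings and whitespace-separated words.

```python
# Illustrative sketch only; names and structure are assumptions, not the PR's implementation.
PUNCTUATION = set(",.;!?")
MATH_SYMBOLS = set("+-*/=")


def split_word(word):
    """Split one unquoted word into tokens using the rules listed above."""
    tokens = []
    current = ""
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        # Punctuation is split off when it follows an alphanumeric character.
        is_punct = ch in PUNCTUATION and prev_ch.isalnum()
        # Math symbols are split off only when they sit between two digits,
        # which keeps "Jean-Luc" and base64 padding such as "==" intact.
        is_math = ch in MATH_SYMBOLS and prev_ch.isdigit() and next_ch.isdigit()
        if is_punct or is_math:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def tokenize(text):
    """Tokenize text; double-quoted content is kept as a single token."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            # Take everything up to the closing quote (or end of input) verbatim.
            end = text.find('"', i + 1)
            if end == -1:
                end = n
            tokens.append(text[i + 1:end])
            i = end + 1
        else:
            j = i
            while j < n and not text[j].isspace() and text[j] != '"':
                j += 1
            tokens.extend(split_word(text[i:j]))
            i = j
    return tokens


print(tokenize("1, 2 and 3"))       # ['1', ',', '2', 'and', '3']
print(tokenize("Jean-Luc Picard"))  # ['Jean-Luc', 'Picard']
print(tokenize('"hello, world"'))   # ['hello, world']
```

Checking the neighbors of each symbol, rather than splitting on every occurrence, is what lets hyphenated words and base64 padding pass through unchanged in this sketch.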

Backward Compatibility

Set `tokenizeSymbols: false` (JS) or `tokenize_symbols=False` (Python), or use `parse_lino_raw()` (Rust), to disable tokenization.

Test Plan

All 163 JS tests pass (including 23 new tokenization tests)
All 110 Rust tests pass (including 16 new tokenization tests)
All 161 Python tests pass (including 23 new tokenization tests)

Version Bump

Updated all implementations to version 0.13.0.

Fixes #148

🤖 Generated with Claude Code

This change adds automatic tokenization of punctuation and math symbols during parsing, making them separate references in Links Notation.

Key features:
- Punctuation (`,`, `.`, `;`, `!`, `?`) is tokenized when following alphanumeric characters
- Math symbols (`+`, `-`, `*`, `/`, `=`, etc.) are tokenized only when between digits
- Hyphenated words (e.g., "Jean-Luc", "conan-center-index") are preserved
- Quoted strings preserve their content
- Base64 strings with `=` padding are preserved
- Backward compatible: use `tokenizeSymbols: false` to disable

New APIs:
- JS: `Parser({ tokenizeSymbols: boolean })`, `FormatOptions({ compactSymbols: boolean })`
- Rust: `Tokenizer`, `parse_lino_raw()`, `format_links_compact()`
- Python: `Parser(tokenize_symbols=bool)`, `Tokenizer` class

Example:
- Input: `1,2,3` → Values: `["1", ",", "2", ",", "3"]`
- Input: `1+1` → Values: `["1", "+", "1"]`
- Input: `Jean-Luc` → Values: `["Jean-Luc"]` (preserved)
- Input: `"1,2,3"` → Values: `["1,2,3"]` (quoted preserved)

Fixes #148

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
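The compact formatting option (`compactSymbols` in JS, `format_links_compact()` in Rust) is described only as producing output without spaces around symbols. A minimal Python sketch of that idea, assuming a flat list of tokens as input: `format_compact` and `SYMBOLS` are hypothetical names chosen for illustration, and the real formatter may behave differently.

```python
# Illustrative sketch only; not the formatter shipped in this PR.
SYMBOLS = set(",.;!?+-*/=")


def format_compact(tokens):
    """Join tokens with single spaces, but put no space next to a symbol token."""
    parts = []
    for index, token in enumerate(tokens):
        # A space is emitted only between two non-symbol tokens.
        if index > 0 and token not in SYMBOLS and tokens[index - 1] not in SYMBOLS:
            parts.append(" ")
        parts.append(token)
    return "".join(parts)


print(format_compact(["1", "+", "1"]))            # 1+1
print(format_compact(["1", ",", "2", ",", "3"]))  # 1,2,3
print(format_compact(["Jean-Luc", "Picard"]))     # Jean-Luc Picard
```

Under these assumptions the round trip is not always exact: `["hello", ",", "world"]` comes back as `hello,world`, without the space that followed the comma in the original input.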

konard · 2025-11-30T20:44:10Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $12.209440 USD
Calculated by Anthropic: unknown
Difference: unknown
📎 Log file uploaded as GitHub Gist (1145KB)
🔗 View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

Initial commit with task details for issue #148

d64d4c3

Adding CLAUDE.md with task information for AI processing. This file will be removed when the task is complete. Issue: #148

konard self-assigned this Nov 30, 2025

konard changed the title ~~[WIP] Better punctuation and math symbols support~~ feat: Add punctuation and math symbol tokenization Nov 30, 2025

konard marked this pull request as ready for review November 30, 2025 20:42

Revert "Initial commit with task details for issue #148"

f2bcb2a

This reverts commit d64d4c3.
