Commit c8efbfd

Update docs for improved tokenizing semantics.
1 parent: fd44c10

1 file changed: lib/Text/WordDiff.pm (36 additions, 27 deletions)
@@ -226,37 +226,46 @@ C</\b/> didn't work so well. What I really wanted to do was to split on the
 I<beginning> of every word. Fortunately, _Mastering Regular Expressions_ has a
 recipe for that: C<< /(?<!\w)(?=\w)/ >>. I've borrowed this regular expression
 for use in Perls before 5.6.x, but go for the Unicode variant in 5.6.0 and
-newer: C<< /(?<!\p{IsWord})(?=\p{IsWord})/ >>. With either of these regular
-expressions, this sentence, for example, would be split up into the following
-tokens:
+newer: C<< /(?<!\p{IsWord})(?=\p{IsWord})/ >>. Adding some additional controls
+for punctuation and control characters, this sentence, for example, would be
+split up into the following tokens:

   my @words = (
-      'With ',
-      'either ',
-      'of ',
-      'these ',
-      'regular ',
-      "expressions,\n",
-      'this ',
-      'sentence, ',
-      'for ',
-      'example, ',
-      'would ',
-      'be ',
-      'split ',
-      'up ',
-      'into ',
-      'the ',
-      'following ',
-      'tokens:'
+      "Adding ",
+      "some ",
+      "additional ",
+      "controls",
+      "\n",
+      "for ",
+      "punctuation ",
+      "and ",
+      "control ",
+      "characters",
+      ", ",
+      "this ",
+      "sentence",
+      ", ",
+      "for ",
+      "example",
+      ", ",
+      "would ",
+      "be",
+      "\n",
+      "split ",
+      "up ",
+      "into ",
+      "the ",
+      "following ",
+      "tokens",
+      ":",
   );

-Note that this allows the tokens to include any spacing or punctuation after
-each word. So it's not just comparing words, but word-like tokens. This makes
-sense to me, at least, as the diff is between these tokens, and thus leads to
-a nice word-and-space-and-punctation type diff. It's not unlike what a word
-processor might do (although a lot of them are character-based, but that
-seemed a bit extreme--feel free to dupe this module into Text::CharDiff!).
+So it's not just comparing words, but word-like tokens and control/punctuation
+tokens. This makes sense to me, at least, as the diff is between these tokens,
+and thus leads to a nice word-and-space-and-punctuation type diff. It's not
+unlike what a word processor might do (although a lot of them are
+character-based, but that seemed a bit extreme--feel free to dupe this module
+into Text::CharDiff!).

 Now, I acknowledge that there are localization issues with this approach. In
 particular, it will fail with Chinese, Japanese, and Korean text, as these
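A note on the tokenizing semantics documented above: the following is a
minimal, runnable Perl sketch, not the module's actual code. The first split
uses the documented recipe C<< /(?<!\p{IsWord})(?=\p{IsWord})/ >> and yields
the old word-plus-trailing-spacing tokens. The second split adds two
hypothetical alternations, one severing a word from trailing punctuation and
one isolating newlines, chosen only because they reproduce the new token list
shown in the diff; the module's real punctuation/control handling may differ.

    use strict;
    use warnings;

    my $text = "Adding some additional controls\nfor punctuation and "
             . "control characters, this sentence, for example, would "
             . "be\nsplit up into the following tokens:";

    # Old semantics (documented above): split at the beginning of every
    # word, so each token keeps the spacing and punctuation that follow it.
    my @old_tokens = split /(?<!\p{IsWord})(?=\p{IsWord})/, $text;

    # New semantics (hypothetical approximation): also split a word off
    # from trailing punctuation, and give each newline its own token.
    # This reproduces the example token list in the diff, but it is an
    # assumption, not necessarily Text::WordDiff's actual regex.
    my @new_tokens = split /
        (?<!\p{IsWord}) (?=\p{IsWord})      # before the start of a word
      | (?<=\p{IsWord}) (?=[^\p{IsWord}\s]) # between a word and punctuation
      | (?=\n)                              # before a newline
    /x, $text;

    print join('|', @old_tokens), "\n\n";
    print join('|', @new_tokens), "\n";

Running it prints both token streams with C<|> separators, which makes it
easy to compare the old and new segmentation side by side.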
