C</\b/> didn't work so well. What I really wanted to do was to split on the
I<beginning> of every word. Fortunately, I<Mastering Regular Expressions> has a
recipe for that: C<< /(?<!\w)(?=\w)/ >>. I've borrowed this regular expression
for use in Perls before 5.6.x, but go for the Unicode variant in 5.6.0 and
newer: C<< /(?<!\p{IsWord})(?=\p{IsWord})/ >>. Adding some additional controls
for punctuation and control characters, this sentence, for example, would be
split up into the following tokens:

    my @words = (
        "Adding ",
        "some ",
        "additional ",
        "controls",
        "\n",
        "for ",
        "punctuation ",
        "and ",
        "control ",
        "characters",
        ", ",
        "this ",
        "sentence",
        ", ",
        "for ",
        "example",
        ", ",
        "would ",
        "be",
        "\n",
        "split ",
        "up ",
        "into ",
        "the ",
        "following ",
        "tokens",
        ":",
    );
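
The zero-width split described above can be sketched briefly. This is a rough
Python analogue for illustration only (the module itself is Perl, and its real
tokenizer adds the extra handling for punctuation and control characters); the
C<word_tokens> helper name is mine, not part of the module:

```python
import re

def word_tokens(text):
    # Split at each transition into a word character: the lookbehind
    # (?<!\w) and lookahead (?=\w) match the zero-width position at the
    # beginning of every word, so trailing spaces and punctuation stay
    # attached to the preceding token. Python 3.7+ permits splitting on
    # zero-width matches.
    tokens = re.split(r'(?<!\w)(?=\w)', text)
    # Unlike Perl's split, a zero-width match at position 0 produces a
    # leading empty string here, so filter it out.
    return [t for t in tokens if t]

print(word_tokens("this sentence, for example"))
```

Each token carries its following whitespace and punctuation, which is what
makes the later diff read as a word-by-word comparison.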

So it's not just comparing words, but word-like tokens and control/punctuation
tokens. This makes sense to me, at least, as the diff is between these tokens,
and thus leads to a nice word-and-space-and-punctuation type diff. It's not
unlike what a word processor might do (although a lot of them are
character-based, but that seemed a bit extreme--feel free to dupe this module
into Text::CharDiff!).

Now, I acknowledge that there are localization issues with this approach. In
particular, it will fail with Chinese, Japanese, and Korean text, as these