Dot — The Regex Barbapapa

Remember Barbapapa, from Annette Tison’s and Talus Taylor’s children’s books and films from the 1970s? The hero was a pink, pear-shaped guy with the ability to take on almost any shape whatsoever. The equivalent in Regex is the dot: .

Dot is a character class — a generic character. Instead of using literal characters, like 2, a or #, you can use dot to specify that you accept almost any character.

Ruby> 'mama 2 ##'.gsub /a|2|#/, '¤' #=> "m¤m¤ ¤ ¤¤"
Ruby> 'mama 2 ##'.gsub /./, '¤' #=> "¤¤¤¤¤¤¤¤¤"

There are two cultural problems with the dot that are important to be aware of:

  1. The character class dot . and the closure operator * are, together and separately, the most abused features of Regex. If you use them perfunctorily, you’ll often end up with regexes that are too general, sometimes even incorrect. Every time you intend to write ., * or even .* you should consider whether you really mean something more specific.
  2. A majority of Regex books, including the most popular one, are unclear or even entirely incorrect, as they claim that “dot matches any character.” In most cases dot matches “any character except line breaks.” It’s a very, very important difference.
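
To make the first point concrete, here is a minimal sketch of what "too general" can look like. The input string and the two method names are made up for illustration. The task is to pull quoted words out of a line, once with .* and once with a more specific negated character class.

```ruby
# Hypothetical helpers, only for illustration.

# Too general: .* is greedy and runs from the first quote to the last one.
def greedy_quotes(text)
  text.scan(/".*"/)
end

# More specific: between the quotes, accept anything except a quote.
def specific_quotes(text)
  text.scan(/"[^"]*"/)
end

text = 'say "hello" and "goodbye" now'
p greedy_quotes(text)    #=> ["\"hello\" and \"goodbye\""]
p specific_quotes(text)  #=> ["\"hello\"", "\"goodbye\""]
```

The general version is not wrong syntactically, it just answers a different question than the one you probably meant to ask.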

Why doesn’t dot normally match line breaks?
The original implementations of Regex operated line by line. Programs like grep handle one line at a time. Trailing line breaks are filtered out before processing. Hence, there are no line breaks. NASA engineer Larry Wall created Perl in the 1980s, the programming language that evangelized Regex more than anything else. The original purpose was to make report processing easier. What would then be more natural than to continue on the path of line-oriented work? Another argument is that the idiom .* would change meaning if dot matched line breaks. Perl set the agenda and now, a few decades later, we can only accept that dot typically doesn’t match line breaks, no matter what you and I believe is logical.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/
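#=> ["grey", "gr y", "gray", "gr\ry"]

Note that in Ruby, dot excludes only the line feed \n. The carriage return \r is matched like any other character, which is why "gr\ry" shows up in the result.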
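
The line-oriented heritage is easy to see with the idiom .* itself. The following is a small sketch with a made-up two-line report and hypothetical method names. Without any flag, .* stops at the first line feed, so a scan naturally yields one match per line.

```ruby
# Hypothetical helpers, only for illustration.

# Without a flag, .* ends where the first line ends.
def first_line(text)
  text[/.*/]
end

# .+ matches each non-empty line in turn.
def all_lines(text)
  text.scan(/.+/)
end

report = "total: 42\nerrors: 0\n"
p first_line(report)  #=> "total: 42"
p all_lines(report)   #=> ["total: 42", "errors: 0"]
```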

How can you force dot to match line breaks?
You set a flag. Unfortunately, this flag has different names in different Regex dialects. In Perl, it’s called single-line mode. Imagine what happens if dot matches all characters, including line breaks: the input becomes one long line where the line break is a character like any other, hence the name. Single-line mode should not be confused with what Perl calls multi-line mode. Multi-line mode affects the anchors $ and ^, and it’s orthogonal to single-line mode. To add to the confusion, Ruby uses the term multi-line mode for what Perl calls single-line mode. And the real multi-line mode is always on in Ruby; there is no flag for it. The best approach to this mess is for you and me to call the flag Dot match all, no matter how it is written syntactically in different dialects. By the way, in Ruby you add m after the Regex literal when you want the dot to match any character.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/
#=> ["grey", "gr y", "gray", "gr\ry"]
Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/m
#=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]
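
Since the two modes are so easy to mix up, here is a hedged sketch of how they show up in Ruby. The method names are hypothetical. The anchors ^ and $ work per line whether or not you add m, while the m flag (or the inline form (?m)) only changes what dot matches.

```ruby
# Hypothetical helpers, only for illustration.

# ^ and $ always match at line boundaries in Ruby, with or without /m.
def words_on_own_line(text)
  text.scan(/^\w+$/)
end

# \A and \z anchor to the whole string, so dot must cross the line feed.
def whole_string_without_flag?(text)
  text.match?(/\A.*\z/)
end

def whole_string_with_flag?(text)
  text.match?(/\A.*\z/m)
end

# The same flag can be switched on inside the pattern with (?m).
def whole_string_with_inline_flag?(text)
  text.match?(/\A(?m).*\z/)
end

two_lines = "first\nsecond"
p words_on_own_line(two_lines)              #=> ["first", "second"]
p whole_string_without_flag?(two_lines)     #=> false
p whole_string_with_flag?(two_lines)        #=> true
p whole_string_with_inline_flag?(two_lines) #=> true
```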

And if there is no flag?
In some Regex dialects there’s no flag for dot match all. The most notable example is JavaScript before ES2018, which introduced the s flag. A workaround is to replace the dot with the idiom [\s\S]. This idiom matches exactly one character: either whitespace or anything that is not whitespace. Together, these two classes cover 100% of all characters, including line breaks.

JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr.y/g);
[ 'grey', 'gr y', 'gray' ]
JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr[\s\S]y/g);
[ 'grey',
'gr y',
'gr\ny',
'gray',
'gr\ry' ]
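
The idiom is not specific to JavaScript. As a quick cross-check, this sketch (with hypothetical method names) runs the same sample through Ruby, once with the m flag and once with [\s\S] and no flag at all. Both approaches should agree.

```ruby
# Hypothetical helpers, only for illustration.

# Dot match all via Ruby's m flag.
def scan_with_flag(text)
  text.scan(/gr.y/m)
end

# Same effect without any flag: whitespace or non-whitespace.
def scan_with_idiom(text)
  text.scan(/gr[\s\S]y/)
end

sample = "grey gr y gr\ny gray gr\ry"
p scan_with_idiom(sample)                           #=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]
p scan_with_idiom(sample) == scan_with_flag(sample) #=> true
```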

Is dot too general?
I also argued above that the dot is often abused in our community. What does that mean? Imagine that you want to find all time strings in a text. You’ve got the following specification:

  • Time always includes hours and minutes, sometimes even seconds.
  • Hours, minutes and seconds are always written with two digits.
  • You don’t have to ignore impossible numbers, such as minute 61.
  • In between hours, minutes and seconds you’ll find one of the separators dot . or colon :.

The results of a simple regex like \d\d.\d\d(.\d\d)? might surprise you:

Ruby> "12:34 09.00 24.56.33".scan /\d\d.\d\d(?:.\d\d)?/
#=> ["12:34 09", "00 24.56"]

That’s not what you wished for. Dot matches space! If you replace the dot with the more specific character class [.:], you get closer to the target. Remember that a dot inside a character class is literal: it matches only the dot character itself.

Ruby> "12:34 09.00 24.56.33".scan /\d\d[.:]\d\d(?:[.:]\d\d)?/
#=> ["12:34", "09.00", "24.56.33"]
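
If you want to reuse the more specific pattern, it could be wrapped up roughly like this. It is a sketch that follows the specification above; the names TIME_PATTERN, extract_times and split_time are hypothetical and not part of any library.

```ruby
# Hypothetical names, only for illustration.

# Two digits, a separator (dot or colon), two digits,
# and optionally a separator plus two more digits for seconds.
TIME_PATTERN = /\d\d[.:]\d\d(?:[.:]\d\d)?/

# Returns every time string found in the text.
def extract_times(text)
  text.scan(TIME_PATTERN)
end

# Splits one time string into integers: [hours, minutes] or
# [hours, minutes, seconds]. Impossible values are not rejected.
def split_time(time)
  time.split(/[.:]/).map(&:to_i)
end

times = extract_times("12:34 09.00 24.56.33")
p times                            #=> ["12:34", "09.00", "24.56.33"]
p times.map { |t| split_time(t) }  #=> [[12, 34], [9, 0], [24, 56, 33]]
```

The group is written as (?:...) so that scan returns the whole match as a plain string instead of an array of captures.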
