Archive for the 'Regexp' Category

Dot — The Regex Barbapapa

Remember Barbapapa — Annette Tison’s and Talus Taylor’s children’s books and films from the 1970s? The hero was a pink, pear-shaped guy with the ability to take on almost any shape whatsoever. The equivalent in Regex is dot ..

Dot is a character class — a generic character. Instead of using literal characters, like 2, a or #, you can use dot to specify that you accept almost any character.

Ruby> 'mama 2 ##'.gsub /a|2|#/, '¤' #=> "m¤m¤ ¤ ¤¤"
Ruby> 'mama 2 ##'.gsub /./, '¤' #=> "¤¤¤¤¤¤¤¤¤"

There are two cultural problems with the dot, that is important to be aware of:

  1. The character class dot . and the closure function * are together and separately the most abused features of Regex. If you use them perfunctory, you’ll often end up with too general regexes — sometimes even incorrect. Every time you intend to write ., * or even .* You should consider if you really mean something more specific.
  2. A majority of Regex books, including the most popular one, are unclear or even entirely incorrect, as they claim that “dot matches any character.” In most cases dot matches “any character except line breaks.” It’s a very, very important difference.

Why doesn’t dot normally match line breaks?
The original implementations of Regex operated line by line. Programs like grep, handles one line at a time. Trailing line breaks are filtered out before processing. Hence, there are no line breaks. NASA engineer Larry Wall created Perl In the 1980s — the programming language that evangelized Regex more than anyhting else. The original purpose was to make report processing easier. What would then be more natural than to continue on the path of line-oriented work? Another argument is that the idiom .* would change meaning if dot matches line breaks. Perl set the agenda and now, a few decades later, we can only accept that dot typically don’t match line breaks, no matter what you and I believe is logical.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/
#=> ["grey", "gr y", "gray"]

How can you force dot to match line breaks?
You set a flag. Unfortunately, this flag has different names in different Regex dialects. In Perl, it’s called single-line mode. Imagine what happens if dot matches all characters, including line breaks. Input data becomes a long line where the line break is a character like any other — hence the name. Single-line mode should not be confused with what in Perl is called multi-line mode. Multi-line mode affects the anchors $ and ^ and it’s orthogonal with single-line mode. To add more confusion, Ruby use the term multi-line line, when they mean Perl’s single-line mode. And the real multi-line mode is mandatory in Ruby — no flag available there. The best approach to this mess is if you and I call the flag Dot match all, no matter how it is written syntactically in different dialects. By the way, in Ruby you add m next to the Regex literal when we want the dot to match any character.

Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/
#=> ["grey", "gr y", "gray"]
Ruby> "grey gr y gr\ny gray gr\ry".scan /gr.y/m
#=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]

And if there is no flag?
In some Regex dialects, most notably JavaScript, there’s no flag for dot match all. An workaround is to replace the dot with the idiom [\ s\ S]. This idiom matches exactly one character — either white space or anything that is not whitespace. These two classes are of course 100% of all the characters — including line breaks.

JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr.y/g);
[ 'grey', 'gr y', 'gray' ]
JavaScript> 'grey gr y gr\ny gray gr\ry'.match(/gr[\s\S]y/g);
[ 'grey',
'gr y',
'gr\ny',
'gray',
'gr\ry' ]

Is dot to general?
I also argued above that the dot is often abused in our community. What does that mean? Imagine that you want to find all time strings in a text. You’ve got the following specification:

  • Time always includes hours and minutes, sometimes even seconds.
  • Hours, minutes and seconds are always written with two digits
  • You don’t have to ignore impossible numbers, such as minute 61.
  • Inbetween hours, minutes and seconds you’ll find one of the separators dot . or colon :.

The results of a simple regex like \d\d.\d\d(.\d\d)? might surprise you:

Ruby> "12:34 09.00 24.56.33".scan /(\d\d.\d\d(.\d\d)?)/
#=> "12:34 09", "00 24.56"

That’s not what you wished for. Dot matches space! If you replace the item with the more specific character class [.:] you aim closer to target. You mustn’t forget that dot inside a character class means that you literally want to match the dot character ..

Ruby> "12:34 09.00 24.56.33".scan /(\d\d[.:]\d\d([.:]\d\d)?)/
#=> "12:34", "09.00", "24.56.33"

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Advertisements

Regular Expressions – a brief history

Regular Expressions is a programming language with which we can specify a set of strings. With the help of two operators and one function, we can be more concise than if we would have to list all the strings that are included in the set. Where does Regular Expressions come from? Why is it called Regular and how does it differ from Regex?

The story begins with a neuroscientist and a logician who together tried to understand how the human brain could produce complex patterns using simple cells that are bound together. In 1943, Warren McCulloch and Walter Pitts published “A logical calculus of the ideas immanent in nervous activity”, in the Bulletin of Mathematical Biophysics 5:115-133. Although it was not the objective, it turned out this paper had a great influence on computer science in our time. In 1956, mathematician Stephen Kleene developed this model one step further. In the paper “Representation of events in nerve nets and finite automata” he presents a simple algebra. Somewhere at this point the terms Regular Sets and Regular Expressions were born. As mentioned above, Kleene’s algebra had only two operations and one function.

In 1968, the Unix pioneer Ken Thompson published the article “Regular Expression Search Algorithm” in Communications of the ACM (CACM), Volume 11. With code and prose he described a Regular Expression compiler that creates IBM 7094 object code. Thompson’s efforts did not end there. He also implemented Kleene’s notation in the editor QED. The aim was that the user could do advanced pattern matching in text files. The same feature appeared later on in the editor ed.

To search for a Regular Expression in ed you wrote g/<regular expression>/p The letter g meant global search and p meant print the result. The command — g/re/p — resulted in the standalone program grep, released in the fourth edition of Unix 1973. However, grep didn’t have a complete implementation of regular expressions, and it was not until 1979, in the seventh edition of Unix that we were blessed with Alfred Aho’s egrepextended grep. Now the circle was closed. The program egrep translated any regular expressions to a corresponding DFA.

Larry Wall’s Perl programming language from the late 80’s helped Regular Expressions to become mainstream. Regular Expressions are seamlessly integrated in Perl, even with its own literals. New features were also added to Regular Expressions. The language was extended with abstractions and syntactic sugar, and also brand new features that may not even be possible to implement in Finite Automata. This raises the question if modern Regular Expressions can be called Regular Expressions. I don’t think so. The term Regex denotes not only Kleene’s algebra but also the additions made by Perl, Java, Ruby, and other implementations.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Give feedback on my Regular Expressions book prototype

I’ve published a prototype of an upcoming illustrated Regular Expressions book here: http://www.staffannoteberg.com/regexbook

Any feedback is very very appreciated. What would make this book more useful for you?

Best Regards // Staffan

HTML5 Form Validation With Regex

Client side validation has always been a potential headache for front-end programmers. Embedded blocks with a mixture of imperative JavaScript and declarative regex can be a mess. HTML5 has ambition to add abstraction layers that would make this a bit easier. As I’ll explain below, theres’ still a long way to go before it’s rock solid.

There are two ideas that enters the scene now:

  1. The <input> tag has new type attribute values like url, email, date, telephone number, and color.
  2. The <input> tag has the new attribute pattern where you can describe allowed input with a regex.

Note that it’s only validation. It would have been nice to have filtering (e.g. remove spaces in a credit card number) or even replacing (euro is sent to server, whether the user enters euro or ).

In case (1) as well as (2), a nice red-green feedback lets the user know if the user entered text is correct. The tool-tip of the input widget can also have a descriptive message of what the system expects from the user. You just set a value of the title attribute. More on that below.

1. New values for the type attribute of the <input> tag

To use the type attribute is simple. Here’s an example with the new value email:

<input type="email" required />

This made me curious. I guess that email is implemented with a regex under the hood. What does it look like? I don’t know, but it’s not correct. As a matter of fact the spec for the email attribute value is incorrect. It looks like this:

A valid e-mail address is a string that matches the ABNF production 1*( atext / “.” ) “@” ldh-str *( “.” ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.

So currently, the HTML5 browsers accepts the email -@- and doesn’t accept "staffan nöteberg"@rekursiv.se — I tried. It should be the other way around. (Yes, spaces and diaeresis makes sense to the left of the @ sign, as it’s a local mailbox routing that might involve a not so SMTP:ish system. For the record I tried…

echo 'hello!' |
  /usr/lib/sendmail '"staffan nöteberg"@rekursiv.se'

…and it works!).

However, even though it’s already implemented in many browsers, W3C makes it clear that it’s only a working draft. For the moment there’s a note in the document that they are aware of this error:

NOTE: This requirement is a willful violation of RFC 5322, which defines a syntax for e-mail addresses that is simultaneously too strict (before the “@” character), too vague (after the “@” character), and too lax (allowing comments, white space characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.

My recommendation is to NOT use the email attribute until it has a better implementation.

2. New attribute pattern of the <input> tag

The input tag has several new attributes to specify constraints: autocomplete, min, max, multiple, pattern, and step. I’m particularly interested in the pattern attribute. It’s more generic than the new values of the type attribute mentioned above.

The pattern value is a regex. In what regex dialect? Yes, you guessed it: JavaScript according to ECMA-262 Edition 5. This is a major drawback, since the regex support in JavaScript is modest (e.g. there’s even no meta class to match a letter — many other regex engines support the Unicode \p{L}). The whole user input must be matched by the regex, not only a fraction. You can look at it as if your regex is prefixed with ^(?: and suffixed with )$.

Here are three pragmatic (but not globally perfect) examples I created:

  • Strong password: <input title="at least eight symbols containing at least one number, one lower, and one upper letter" type="text" pattern="(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}" required />
  • Email address: <input type="text" title="email" required pattern="[^@]+@[^@]+\.[a-zA-Z]{2,6}" />
  • Phone number: <input type="text" required pattern="(\+?\d[- .]*){7,13}" title="international, national or local phone number"/>

I leave it as a reader exercise to interpret these regexes. And you can try them too! They are online in this test page:

If you combine type="email" and pattern then both constraints must be fulfilled.

Summary

HTML5 form validation is a good idea. The pattern tag is very generic, albeit its rather limited regex dialect. Be careful with the new values of the type attribute, as they are only in prototype status currently.

Finally: What about browser support. I’m in deep water here, but I understand it as there’s support for this kind of validation in IE 10+, Firefox 8+, Chrome 16+, Opera 11.6+, and Opera Mobile 10+. There’s partial or none support in Safari and Android.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

The Scent of Regex Requisite

Did you know that the famous quote “Some people, when confronted with a problem, think: I know, I’ll use regular expressions. Now they have two problems.” dates all the way back to 1997? However, most programmers agree that regex has its time and place. But, how can we know when to use regex? It’s really simple. We must use our nose and feel the scent of regex requisite. Below is a list of five scents that puts the R-word in our working memory.

Text to type

A text sequence is also a kind of data type. You may have read it from a file or perhaps a user entered it into your system. But you don’t confine yourself to text. You want to transform it into a bunch of structured data records. You read record after record from the text hank. Each record consists of a series of comma delimited fields and each record is terminated with a semicolon. Regex loves to parse text hanks.

Non recursive

A recursively-defined data type may be instantiated with values that contains values of the very same type. Think of fir branches. Each branch consists of one stem, zero or more sub branches, and many needles. The sub branches are fir branches as well. A branch may have a sub branch, which may have a sub branch, which may have a sub branch etc. In theory there is no limit to how many levels we can have. As soon as you want to translate text into recursive data — then regex is usually not the best tool. To parse an entire HTML document with nested div tags is an example of recursive data.

Not lucid

If the input is small, regex often doesn’t add anything. But, when you do search-and-replace in 2000 files and what you want to replace has a variable appearance — then a neat little regex is the generalized solution. You can capture different versions and replace them with something that actually depends on the input data. It is quality — no mistakes — and quantity — no misses. In a small input, you can modify by hand. You can easily see what should be changed and to what.

Emerging

Suddenly it happens: you have input from a user, from the network, from another system or from a file. You can not predict what will come, more than that it’s text. It may be a lot and it may be a tiny little piece. Yes, it can even be an empty string. This very uncertainty makes the generalized description of the input data characteristics, useful. You describe a pattern, not a specific entity. Regex is a superhero when it comes to describing generalized patterns.

Complex logic

I’ve described before how 20+ lines of Java code could be transformed into one small regex. This is not a general law. Regex is a limited programming language suited to solve a very specific class of problems. However, in this case the imperative Java code had a lot of nested as well as consecutive conditional statements; if-else-if-if-else — i.e. complex logic. Regex is a declarative language. You describe what you want, and not how to get there. Thus, you don’t have to state all these scrubby paths.

Some of these five scents partly overlap, but each of them are well worth to remember. Facing a programming problem, if you can’t feel any of them, then you can be pretty sure that there are better tools than regex.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

From Regular Expression to Finite Automaton

For each regular expression — and I mean the three operators and the six recursive rules style — there is a finite automaton that accepts exactly the same strings. Since this is not a university book in mathematics, I’ll show you an inductive reasoning about this and not a formal proof.

The hypothesis is thus that for an arbitrary regular expression p, we can create a finite automaton that has exactly one start state, no paths into the start state, no paths out from the acceptance state and that accepts exactly the same strings that are matched by p.

  • The empty string ε is a regular expression corresponding to a finite automaton with a start state, a path that accepts the empty string ε and leads from the start state to an acceptance state. We’ll call this an //ε-path//.
  • The empty set Ø is the set equivalent to a regular expression that can’t match any single string — not even the empty string ε. It is the same as a two-state automaton, with no single path. One state is start and the other one acceptance. But, they are not linked.
  • A regular expression that only matches the symbol b corresponds to a finite automaton with two states: start and acceptance. There’s a path from start t acceptance, and it only accepts the symbol b.

All three finite automata above have two states. One is start and the other one is acceptance. The difference is that the first one has an ε-path from start to acceptance, the second one has no path, and the third one has a b-path. Now we’ll continue. Imagine that we have two regular expressions p and q corresponding to finite automata s and t respectively.

  • Concatenation of two regular expressions p and q means that we first match a string with p, directly followed by a string that’s matched by q. To create this finite automaton we first add ε-paths from every acceptance state in s to the start state of t. Then we deprive all acceptance states in s their acceptance status and we’ll also withdraw the start status of the start state in t.
  • Alternation of two regular expressions p and q, i.e. p|q is like a finite automaton with a new start state that has ε-paths to all start staes of s and t. The new finite automaton also has a new acceptance state that is reached with ε-paths from all acceptance states of s and t. The start and acceptance states of s and t are thus not start and acceptance state in our new automaton.
  • Kleene star is the concatenation closure. Assume that p = q*. Then s is the finite automaton we get if we take t and add two states and four paths as follows: One new state is the start state and the other one is an acceptance state. All acceptance states of ´s´ loses that status in s, but instead gets an ε-path to the new acceptance state. We add two ε-paths from the new initial state — one to the old start state and one to the new acceptance state. In addition to that, we insert one ε-path from each of the old acceptance states to the old start state.

Look at the pictures above. Then take a deep breath and feel if you can translate an arbitrary regular expression to a finite automaton. Finally assess the last picture where the regular expression (w|bb)* is depicted as a graph using the method described above. Does it feel reasonable?

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Simple Regular Expression Examples

Now, we have three operators and a small framework. After all this theory, you might wonder if it’s possible for us to solve any problems. Yes, of course we can. Here are some examples:

All binary strings with no more than one zero:

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">

All binary strings with at least one pair of consecutive zeroes:

'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">

All binary strings that have no pair of consecutive zeros:

'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">

All binary strings ending in 01:

'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">

All binary strings not ending in 01:

'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">

All binary strings that have every pair of consecutive zeroes before every pair of consecutive ones:

'0110101'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0011">

See if you can find even better regular expressions that solve these problems. Remember that there’re an infinite number of synonyms to each regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC