Regular Expression Concatenation

Using Rule 2 and Rule 4, we can create regular expressions that consist of any sequence of symbols from our alphabet. Rule 2 says that if the symbol a is in the alphabet, then a is a regular expression. Rule 4 says that if p and q are two regular expressions, then the concatenation pq is a regular expression as well. The concatenation operator itself is invisible; we just write the two regular expressions right after each other:

'moda'[/m/] #=> "m" – we found the substring m in the string "moda"
'moda'[/o/] #=> "o"
'moda'[/mo/] #=> "mo" - /mo/ is /m/ concatenated with /o/
'moda'[/da/] #=> "da"
'moda'[/moda/] #=> "moda" - /moda/ is /mo/ concatenated with /da/
'moda'[/mado/] #=> nil – no match, since the order was changed

There are some handy terms we usually use for parts of strings:

  • Prefix: A prefix is the substring we have left if we remove zero or more symbols from the end of a string. The strings m, mo, mod, and moda are all prefixes of the string moda. Even the empty string ε is a prefix of moda.
  • Suffix: A suffix is the substring we have left if we remove zero or more symbols from the beginning of a string. The strings moda, oda, da, a, and ε are all suffixes of the string moda.
  • Substring: A substring is what we have left if we remove a prefix and a suffix from a string. Note that the prefix and/or the suffix can be ε. Substrings must still be consecutive in the original string. The strings od and moda, but not mda, are substrings of moda.
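The three definitions above can be sketched directly in Ruby. The helper names prefixes, suffixes, and substrings are hypothetical and chosen only for this illustration; they are not part of Ruby's standard library.

```ruby
# Hypothetical helpers for illustration only.

# Every prefix: keep the first n symbols, for n = 0..length.
def prefixes(s)
  (0..s.length).map { |n| s[0, n] }
end

# Every suffix: drop the first n symbols, for n = 0..length.
def suffixes(s)
  (0..s.length).map { |n| s[n..-1] }
end

# Every substring: remove a suffix (leaving a prefix), then remove a prefix
# from what is left (leaving a suffix of that prefix).
def substrings(s)
  prefixes(s).flat_map { |pre| suffixes(pre) }.uniq
end

prefixes('moda')                  #=> ["", "m", "mo", "mod", "moda"]
suffixes('moda')                  #=> ["moda", "oda", "da", "a", ""]
substrings('moda').include?('od')  #=> true
substrings('moda').include?('mda') #=> false
```

Here the empty string "" plays the role of ε.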

For any regular expression p, it’s true that εp = pε = p, thus we say that the empty string ε is the identity under concatenation. There is no annihilator under concatenation, i.e., there’s no regular expression 0 so that for any regular expression p it holds that 0p = p0 = 0. Concatenation is not commutative, since pq is in general not equal to qp, but it is associative, since for any regular expressions p, q, and r it’s true that p(qr) = (pq)r.
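These laws can be spot-checked in Ruby on a few example strings. This is only a sketch: concat is a hypothetical helper, not a built-in method, and it relies on Regexp#to_s wrapping each pattern in its own group so that the pieces can be glued together safely. The empty regular expression // stands in for ε. Matching a handful of strings illustrates the laws but does not prove them.

```ruby
# Hypothetical helper: concatenate regular expressions by joining their
# string forms (Regexp#to_s wraps each one in a (?-mix:...) group).
def concat(*regexps)
  Regexp.new(regexps.join)
end

epsilon = //   # matches the empty string
p_re = /mo/
q_re = /da/
r_re = /s/

# Identity: εp and pε behave like p.
'moda'[concat(epsilon, p_re)] #=> "mo"
'moda'[concat(p_re, epsilon)] #=> "mo"

# Not commutative: pq matches, qp does not.
'moda'[concat(p_re, q_re)] #=> "moda"
'moda'[concat(q_re, p_re)] #=> nil

# Associative: p(qr) and (pq)r give the same match.
'modas'[concat(p_re, concat(q_re, r_re))] #=> "modas"
'modas'[concat(concat(p_re, q_re), r_re)] #=> "modas"
```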

If we think of concatenation as a product, then regular expressions also support exponentiation. We write the exponent enclosed in braces to the right of the regular expression:

'aaa'[/aaa/] #=> "aaa"
'aaa'[/a{3}/] #=> "aaa" – yes, the string includes 3 concatenated a
'aaa'[/a{4}/] #=> nil – no, the string doesn't include 4 a

This is just syntactic sugar: every regular expression that we can write using the exponent operator can also be unfolded into explicit concatenations. There are more shortcuts for finite repeated concatenations:

'aa'[/a?/] #=> "a" – the optional operator written as question mark
'b'[/a?/] #=> "" – zero repeats of a matches the empty string
'aa'[/a{,2}/] #=> "aa" – at most two a
'aa'[/a{1,2}/] #=> "aa" – at least one a and at most two a
'a'[/a{1,2}/] #=> "a"
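To illustrate that these bounded repetitions are only shorthand, here is a minimal sketch that unfolds them by hand. The helpers power and repeat_between are hypothetical names introduced for this example. The sketch assumes that listing the longer alternatives first in Regexp.union is enough to mimic the greedy behavior of the built-in operators, which holds for simple cases like these.

```ruby
# Hypothetical helper: unfold p{n} into n explicit copies of p.
def power(re, n)
  Regexp.new(re.to_s * n)
end

# Hypothetical helper: unfold p{min,max} into a choice between
# p^max, ..., p^min, longest first.
def repeat_between(re, min, max)
  Regexp.union(max.downto(min).map { |n| power(re, n) })
end

'aaa'[power(/a/, 3)]             #=> "aaa", like /a{3}/
'aaa'[power(/a/, 4)]             #=> nil, like /a{4}/
'aa'[repeat_between(/a/, 0, 1)]  #=> "a", like /a?/
'b'[repeat_between(/a/, 0, 1)]   #=> "", like /a?/
'aa'[repeat_between(/a/, 1, 2)]  #=> "aa", like /a{1,2}/
'a'[repeat_between(/a/, 1, 2)]   #=> "a", like /a{1,2}/
```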

We will soon see that the concatenation of two regular expressions is not the same as the concatenation of two strings. Remember that a regular expression corresponds to a set of strings. For example, if p = {a, b} and q = {c, d}, then pq = {ac, ad, bc, bd}.
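The set view of concatenation can be sketched with plain Ruby arrays standing in for the sets p and q. The variable names p_set, q_set, and pq are chosen for this illustration only.

```ruby
p_set = %w[a b]
q_set = %w[c d]

# Pair every string in p with every string in q, then join each pair.
pq = p_set.product(q_set).map(&:join)
pq #=> ["ac", "ad", "bc", "bd"]
```

Every string in pq is one string from p followed by one string from q, so a set of two strings concatenated with a set of two strings yields four strings here.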
