Alphabet, string, language…

An alphabet is a finite, nonempty set of symbols. Here are some examples of alphabets:

  • The binary alphabet {0, 1} contains only two symbols, namely 1 and 0.
  • The American Standard Code for Information Interchange, abbreviated ASCII, contains 128 symbols. Only 95 of these are printable. The others, such as Backspace (ASCII 8), are also symbols in our definition.
  • The set of the 95 printable symbols in ASCII is also an alphabet.
  • Unicode contains over 100,000 symbols – Arabic, Chinese, Latin and many other types. But even if it’s a very large number, it is still a finite number of symbols. Unicode is therefore an alphabet.
  • The set of the Cyrillic symbols in Unicode is an alphabet.
  • We can construct an alphabet consisting of a blue and a white dot. That is an alphabet consisting of two symbols, just like the binary alphabet.

Did you by any chance know that the word alphabet comes from alpha and beta, the first two letters of the 3000 year old Phoenician alphabet, and that alpha meant ox and beta meant house? However, the order of symbols does not make an alphabet unique. Thus, {a, b} is the same alphabet as {b, a}. The empty set — the set composed of zero symbols — is by definition not an alphabet.

A string is a finite (ordered) sequence of symbols chosen from an alphabet. Here are some examples of strings:

  • 0110 and 1001 are two strings from the binary alphabet.
  • The strings kadikoy are uskudar are constructed with symbols from the ASCII alphabet.
  • футбол is a string from the Cyrillic alphabet.
  • اللوز is a string from the Arabic alphabet.
  • From any alphabet, it’s possible to create an empty string, i.e. a string consisting of zero symbols. By convention, we denote that string with the character ε.

The very same symbol can occur multiple times in a string. The order is significant; gof is not the same string as fog.

A language is a countable set of strings from a fixed alphabet. The restriction is that the alphabet is finite. The language may well include an infinite number of strings. Here are some examples of languages:

  • All binary numbers ending with 0. It includes 10, 100, 110, etc.
  • All palindromes — strings that are the same forwards and backwards – constructed from the symbols of ASCII.
  • All strings with an even number of symbols from Unicode.
  • The finite set {dog, cat, bird}.
  • The empty set — the language with no strings. This alphabet is by convention denoted Ø.
  • The language {ε} consisting solely of the empty string ε.

Our focus here is on regular expressions. A regular expression is an algebraic description of an entire language. All languages can’t be described by regular expressions. No regular expression can for example describe the language consisting of all palindromes created with symbols from ASCII. I will try to convince you that regular expressions are ridiculously simple and amazingly powerful. Regular expressions are based on two basic rules:

  1. The empty string ε is a regular expression. It describes a language that has one single string, namely the empty string.
  2. If a symbol a is a member of an alphabet, then a is a regular expression. It describes a language that has the string a as its only member.

That was easy, wasn’t it? In addition to the two basic rules, we have three operators that we’ll catch in four more rules. Out of pure laziness, we have conventions for precedence between the three operators, but more on that in an upcoming blog post. First, you should try the two basic rules in IRB. We write regular expressions between two dashes. You can e.g. write the regulars expression a as /a/. IRB can help us to figure out if a string is in a language described by a particular regular expression:

'a'[/a/] #=> "a" – the string a is matched by the regular expression /a/
''[/a/] #=> nil – the empty string is not matched by the regular expression /a/
'b'[/a/] #=> nil – the string b is not matched by the regular expression /a/
'a'[//] #=> "" – the string a is not matched by the empty regular expression / /

Only when IRB returns the entire string, it’s matched by the regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC


%d bloggers like this: