Archive for the 'Programming' Category



Regex memorizing — here’s the pushdown automaton

Pushdown automaton that matches all strings with the same number of white and blue dots.

Pushdown automaton that matches all strings with
the same number of white and blue dots.

Formal regular expressions can be described by a finite automaton, but modern regex engines support un-regular operators. The problem with finite automata is that they don’t have any memory. Once they are in a state, they have no idea, how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why do I then describe the regex engines as finite automata when pushdown automata is a superset of finite automata?

The finite automaton reads a tape serially. Every new symbol read from the tape, initiates a transition in the automaton from the current state to a new state (the new state may be the same state as the current state). If we are in an accept state when we have read all the tape, then we have a match. In pushdown automata, we add a stack. For every iteration, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can now initiate a state transition, push a symbol to the stack, or both. If we end up in a accept state when all the tape is read and the stack is at that time empty then we have a match.

The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represents that a dot is popped or pushed from the stack. Suppose that the input is blue-white-white-blue:

  1. Read blue dot and make a transition from the start state, through the third stack icon from the top – i.e. a blue dot is pushed to the stack – and back to the start state.
  2. Read white dot and make a transition from the start state, through the first stack icon – i.e. a blue dot is popped from the stack – and back to the start state. Now the stack is empty.
  3. Read white dot and make a transition from the start state, through the second stack icon from the top – i.e. a white dot is pushed to the stack – and back to the start state.
  4. Read blue dot and make a transition from the start state, through the fourth stack icon – i.e. a white dot is popped from the stack – and to the accept state. Now the stack is empty again and all the input is read. It’s a match.

You can probably feel that the pushdown automaton gives us a tool for all kind of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand e.g. what backtracking and greediness does to performance and what input that will be matched. You don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automaton in this book and at the same time remember that many operators in modern regex engines rely on a stack.

By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?

Dynamic typing inside the Java Virtual Machine

Take a look at the following Java program. Albeit it’s dynamic typing style, it’s accepted by the Java compiler javac:

public class Dynamic {
  public static void main(String[] args) {
	Object o;
	o = System.out;
	o = System.in;
	o = 1;
	o = "Hello World";
  }
}

The method body part of the compiled JVM code looks something like this:

   0:   getstatic       #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   3:   astore_1
   4:   getstatic       #3; //Field java/lang/System.in:Ljava/io/InputStream;
   7:   astore_1
   8:   iconst_1
   9:   invokestatic    #4; //Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   12:  astore_1
   13:  ldc     #5; //String Hello World
   15:  astore_1

At line 0, 4, 8, 9 and 13 something is pushed on the operand stack. Line 3, 7, 12 and 15 pop the top record from the operand stack and assign it to local variable #1. Line 8-9 is called boxing. Since o is a java.lang.Object and Java doesn’t allow java.lang.Object to take a primitive value, like the int 1, this value is boxed in a java.lang.Integer object at line 9. The Java compiler javac created line 9 because Java is based on static typing.

Dynamic typing means that values have types but variables don’t. In static typing, variables have types and are restricted to not refer to any random type of value. The local variable o has type java.lang.Object and can only refer to values of type java.lang.Object or its subtypes.

Unlike Java, local JVM variables are dynamically typed — NOT statically typed. We can safely remove line 9 (the conversion from int to java.lang.Integer) from the JVM code and it will still be perfectly valid JVM code. Local variable #1 will then in turn be assigned a PrintStream, an InputStream, an int and a String. This is possible because local variable #1 – like all local variables in the JVM – has no type.

Complicated logic calls for regex

When is regex the proper tool? One use case is when you try to verify a text, and the verification rules have complicated logic. I found the following Java method in real production code. It verifies that a time zone designator adheres to the ISO 8601 standard:

public boolean isTimeDesignator(String designator) {
  if (designator.equals("Z")) {
    return true;
  } else if (designator.startsWith(" ") && designator.trim().equals("UTC")) {
    return true;
  } else if (designator.startsWith("+") || designator.startsWith("-")) {
    try {
      int hour = Integer.parseInt(designator.substring(1,3));
      int minute;
      if (3 == designator.length()) {
        minute = 0;
      } else if (5 == designator.length()) {
        minute = Integer.parseInt(designator.substring(3,5));
      } else if (6 == designator.length() && ':' == designator.charAt(3)) {
        minute = Integer.parseInt(designator.substring(4,6));
      } else {
        return false;
      }
      return 0 <= hour && 24 >= hour && 0 <= minute && 59 >= minute;
    } catch (NumberFormatException e) {
      return false;
    }
  }
    
  return false;
}

Only counting the if and else, there are 7 ways the control flow can traverse this code. Adding the NumberFormatException there’s another 3. These ten routes need to be covered by tests.

In ISO 8601, global time is always represented as an offset to Coordinated Universal Time, abbreviated UTC. UTC is informally equivalent to the Greenwich Mean Time, also known as GMT. The designator is an offset in hours and minutes to UTC. Hours are mandatory, but minutes and a colon in-between minutes and hours are optional. Offset zero can be represented as a space and then ‘UCT’ or as ‘Z’. Complicated?

In regex — still in Java — it’s rather crisp to verify:

public boolean isTimeDesignator(String designator) {
  return designator.matches("Z| +UTC|[+-]([01]\\d|2[0-4])(:?[0-5]\\d)?");
}

The latter solution may seem cryptic, for people without experience in regex. But if you think about it, the first solution is cryptic for anyone not proficient in imperative coding. And if you ask me, the first solution is hard to follow for anybody. To understand a regex, you need to understand regex. That’s a tautology. With regex in your repertoire, some problem classes can be solved much more smoothly and the solutions are easier to maintain.

Regex syntax and semantics varies

Regex engines do differ in syntax and semantics. This is one reason why you can’t just find an expression with Google and use it in your code – without fully understand it.

Try for example the following in JIRB, the interactive JRuby tool:

irb(main):001:0> require 'java'
=> true
irb(main):002:0> java.util.regex.Pattern.compile("a$").matcher("a\nb").find
=> false
irb(main):003:0> "a\nb"[/a$/]
=> "a"

What happened?

The regex /a$/ matches the letter a just before something checked by a dollar sign assertion. The assert criteria is, by default, not the same in Ruby and Java. In Java, by default, the dollar sign matches at the end of the whole text and before any final line breaks. In Ruby, the dollar sign match at the end of every single line.

Here’s what the three lines of code above means:

  1. First, we need to include the Java libraries by writing require 'java'. This might not be necessary, depending on your setup.
  2. We compile a Java regex and test if part of the string of ‘a’ and ‘b’ with newline in-between can be matched. It can’t.
  3. We compile the same regex in Ruby and test if part of the string of ‘a’ and ‘b’ with newline in-between can be matched. It can.

This is just one of many examples. If you are going to use regexes in your program, you need to understand them. It’s as simple as that.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regex slides

Slides from my regex session at Turku Agile Day are now available below and at Slideshare:

If you can’t see the whole picture due to Slideshare’s embed limits, click ‘zoom’ -> ‘zoom to page’ in the Slideshare tool bar above.

My solution to Citerus/JFokus programming contest

It’s JFokus – the number one annual Java echo system event in Sweden – right now. I’ll present TDD with Regex in a session tomorrow Wednesday. However, Uppsala based consultancy company Citerus kicked off a programming contest during this conference.

The problem: You get one source string and a lot of short acronym strings. Delete acronyms from the source string until you have the shortest possible string. Use whatever programming language you like as long as it is executable on the JVM.

The example says delete the following acronyms:

  • TDD, DDD, DI, DO, OO, UI, ANT, CV, IOC, LOC, SU, VO

…from the source string:

  • VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU

The first idea you probably get is to delete them in a double loop like this:

import java.util.*;
public class Stripper {
  final String [] acronyms;
  public Stripper(final String... acronyms) {
    this.acronyms = acronyms;
  }
  public String strip(final String source) {
    String result = source;
    for (int i = 0; i < source.length(); i++) {
      final String sufix = source.substring(i);
      for (final String acronym : acronyms) {
        if (sufix.startsWith(acronym)) {
          result = shortest(result, strip(source.substring(0, i)
              + sufix.substring(acronym.length())));
        }
      }
    }
    return result;
  }
  private static String shortest(final String s1, final String s2) {
    return s1.length() < s2.length() ? s1 : s2;
  }
  public static void main(String args[]) {
    final long start = System.currentTimeMillis();
    System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
        "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU",
        "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
            System.currentTimeMillis() - start);
  }
}

Unfortunately, it’s slow. It took my Linux box 20 minutes to run the example. But, you may have noticed that the recursive call invokes the same query over and over again. Let’s add a cache:

import java.util.*;

public class Stripper {
  final private Map<String, String> stripCache = new HashMap<String, String>();
final String [] acronyms;

public Stripper(final String... acronyms) {
  this.acronyms = acronyms;
}
public String strip(final String source) {
  if (stripCache.containsKey(source)) return stripCache.get(source);
  String result = source;
  for (int i = 0; i < source.length(); i++) {
    final String sufix = source.substring(i);
    for (final String acronym : acronyms) {
      if (sufix.startsWith(acronym)) {
        result = shortest(result, strip(source.substring(0, i)
            + sufix.substring(acronym.length())));
      }
    }
  }
  stripCache.put(source, result);
  return result;
}
private static String shortest(final String s1, final String s2) {
  return s1.length() < s2.length() ? s1 : s2;
}
public static void main(String args[]) {
  final long start = System.currentTimeMillis();
  System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
      "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU", 
      "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
          System.currentTimeMillis() - start);
  }
}

…and suddenly the example is solved in less than 50 ms. An improvement by 30.000 times!

But, it didn’t stop there. If you swap the loops, you can jump over some hopeless cases:

import java.util.*;
public class Stripper {
  final private Map<String, String> stripCache = new HashMap<String, String>();
  final String [] acronyms;
  public Stripper(final String... acronyms) {
    this.acronyms = acronyms;
  }
  public String strip(final String source) {
    if (stripCache.containsKey(source)) return stripCache.get(source);
    String result = source;
    for (final String acronym : acronyms) {
      for (int i = source.indexOf(acronym); 0 <= i && source.length() > i;
            i = source.indexOf(acronym, i + 1)) {
        result = shortest(result, strip(source.substring(0, i) 
              + source.substring(i + acronym.length())));
      }
    }
    stripCache.put(source, result);
    return result;
  }
  private static String shortest(final String s1, final String s2) {
    return s1.length() < s2.length() ? s1 : s2;
  }
  public static void main(String args[]) {
    final long start = System.currentTimeMillis();
  System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
      "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU", 
      "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
          System.currentTimeMillis() - start);
  }
}

And that made it twice as fast. Now it takes less than 25 ms to solve it.

Regex is not Regular

Regular expressions span exactly the same languages as the mathematical construction called Finite atomata. With the Pumping lemma, we can prove that it’s impossible to create regular expressions for many problems, e.g.:

  1. All palindromes
  2. All text strings that doesn’t consist of a prime number of identical characters

Since Regular expressions have become popular as extensions to imperative programming languages, new features have been added. Languages from Context-free grammars and Pushdown automata (essentially a finite automata with an attached stack), can be matched with these features. Regular expressions are Type-3 and Context-free grammars are type-2 in the Chomsky hierarchy.

To not be confused we call these built-in languages that are more than Type-3, Regex. Regexes are more powerful than Regular expressions.

Non regular prime number finder

With the (non regular) feature back-reference, it’s possible to create a regex that only match strings with a prime number of identical characters:

irb(main):001:0> r = /^.?$|^((.)\2+?)\1+$/
=> /^.?$|^((.)\2+?)\1+$/
irb(main):002:0> r.match "22"
=> nil
irb(main):003:0> r.match "333"
=> nil
irb(main):004:0> r.match "4444"
=> #<MatchData "4444" 1:"44" 2:"4">
irb(main):005:0> r.match "55555"
=> nil
irb(main):006:0> r.match "tttttt"
=> #<MatchData "tttttt" 1:"tt" 2:"t">

Non regular palindrome finder

If we also add the (non regular) feature recursion, we can match all palindromes:

irb(main):001:0> p =
/\A(?<palindrome>|.|(?:(?<prefix>.)\g<palindrome>\k<prefix+0>))\z/
=> /\A(?<palindrome>|.|(?:(?<prefix>.)\g<palindrome>\k<prefix+0>))\z/
irb(main):002:0> p.match "otto"
=> #<MatchData "otto" palindrome:"otto" prefix:"t">
irb(main):003:0> p.match "dallas sallad"
=> #<MatchData "dallas sallad" palindrome:"dallas sallad" prefix:"s">
irb(main):004:0> p.match "rais air"
=> nil
irb(main):005:0> p.match "1010"
=> nil

Recursive Regexes

Oniguruma supports recursive regular expressions. They can e.g. be used for matching a generative grammar like the following:

  1. expression -> expression + term
  2. expression -> expression - term
  3. expression -> term
  4. term -> ( expression )
  5. term -> digit

I wrote the following regex that corresponds to the grammar above:

  • /^(?<expression>(?<term>\d|\(\g<expression>\))([-+]\g<expression>)?)$/

Note that \g<expression> is invoked recursively – like rule number 4 in the grammar. I chose to only allow single digit numbers to make the expression crisper as example. This could of course easily be changed to general integers, floats or imaginary numbers.

Here goes some testing:

irb(main):001:0> r = /^(?<expression>(?<term>\d|\(\g<expression>\))([-+]\g<expression>)?)$/
=> /^(?<expression>(?<term>\d|\(\g<expression>\))([-+]\g<expression>)?)$/
irb(main):002:0> "2+4"[r]
=> "2+4"
irb(main):003:0> "2+45"[r]
=> nil
irb(main):004:0> "2-3(4+3)"[r]
=> nil
irb(main):005:0> "2-3(+3)"[r]
=> nil
irb(main):006:0> "2-31(+3)"[r]
=> nil
irb(main):007:0> "2-3+(+3)"[r]
=> nil
irb(main):008:0> "2-3+(4+3)"[r]
=> "2-3+(4+3)"

Expressions that return nil are obviously not correct.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Samurai and Rock Star

To be invited to speak at conferences around the world has many advantages. One is that you meet other authors. Ed Burns and Jonathan Rasmusson are two of them.

I Agile Smuraimet Jonathan Rasmusson in Chicago last summer at Agile2009. He was then struggling with his upcoming book. I asked curiously when he expected the book to be released. Hopefully within a few months, he answered. A few months happened to be almost a year, but now it’s released. The title is “The Agile Samurai — How Agile Masters Deliver Great Software” which describes exactly what it is. There are many Agile books out there, but this one is different. Firstly, even though it’s published by Pragmatic Bookshelf it has a style similar to O’Reilly’s Head First series . Pictures and text are mixed in a way that make learning easy. Secondly, this is more of an in-his-own-words book than the usual Agile book. The book describes Agile ideas instead of defining the Scrum terminology. If you’re interested in Agile, on whatever level, you should read this book.

Last Rock Star Programmersmonth I met Ed Burns in Poland at the GeeCON2010 Java conference. He recently released a new version of his JSF book. But, what caught my attention was another book from 2008. During one of his sessions at GeeCON he replayed interviews with people like Rod Johnson, James Gosling and Andy Hunt. Those interviews were originally recorded for the book “Riding the Crest — Secrets of the Rockstar Programmers.” It’s a book were thought leaders from our own industry shares what they think about entrepreneurs, what makes them productive and how their career affected their private life? The answers go in all directions.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regex extension — here’s the Integer Class

The IP address problem is well known in the regex community. Here I present an extended regex syntax that would make it possible to match IP addresses in a less chatty way.

Ip-address problem

Let’s say I want to validate an IP address. It is four integers with interleaved dots such as 123.125.126.107. The naïve regex would be:

  • \d*\.\d*\.\d*\.\d*

But it is so imprecise that it matches both 9999.9999.9999.9999 and . Since the four integers in a IP address must be in the range 0-255, I can be certain that they consist of one, two, or three digits. With the Limiting Repetition operator, I can formulate this requirement:

  • \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Now 0.1.2.3 and 101.0.202.1 matches, while 9999.9999.9999.9999 doesn’t match because there are too many digits in the integers. Unfortunately 999.888.777.666 matches as well. That’s not an acceptable IP address. I told you that the integers can be up to 255. But, don’t give up. In Friedl’s seminal book “Mastering Regular Expressions”, he describes two ways to shrink the match set. Both are, however, very chatty. The first way is to list all authorized integers in a super chubby alternation:

  • (0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20
    |21|22|23|24|25|26|27|28|29|30|31|32|33|34|35
    |36|37|38|39|40|41|42|43|44|45|46|47|48|49|50
    |51|52|53|54|55|56|57|58|59|60|61|62|63|64|65
    |66|67|68|69|70|71|72|73|74|75|76|77|78|79|80
    |81|82|83|84|85|86|87|88|89|90|91|92|93|94|95
    |96|97|98|99|100|101|102|103|104|105|106|107
    |108|109|110|111|112|113|114|115|116|117|118
    |119|120|121|122|123|124|125|126|127|128|129
    |130|131|132|133|134|135|136|137|138|139|140
    |141|142|143|144|145|146|147|148|149|150|151
    |152|153|154|155|156|157|158|159|160|161|162
    |163|164|165|166|167|168|169|170|171|172|173
    |174|175|176|177|178|179|180|181|182|183|184
    |185|186|187|188|189|190|191|192|193|194|195
    |196|197|198|199|200|201|202|203|204|205|206
    |207|208|209|210|211|212|213|214|215|216|217
    |218|219|220|221|222|223|224|225|226|227|228
    |229|230|231|232|233|234|235|236|237|238|239
    |240|241|242|243|244|245|246|247|248|249|250
    |251|252|253|254|255)

Note that the above is just one integer in the IP address. I have to write the super chubby alternation four times with interleaved dots to get the correct regex.

The second way to specify an integer in the range 0-255 is to decompose the problem into sub problems based on the initial character. If the first digit is 0 or 1, then all integers consisting of 1-3 digits are acceptable – I also allow leading zeros as in e.g. 054. When the initial digit is 2 and the second number is in the range 0-4, then I accept digits in the ranges 20-24 and 200-249. Finally, if the integer starts with 25, then I only tolerate it if it’s followed by a digit in the range 0-5. Like this:

  • [01]?\d\d?|2[0-4]\d|25[0-5]

(Does this regex really match 25? The third part of this alternation only matches integers in the range 250-255. Yes, it does. The first part matches any integer consisting of one or two digits.)

Again I must rewrite my regex four times with interleaved dots to match an entire IP address.

Integer Class – a new operator suggested

So far I’ve talked about how regex works right now. Let’s imagine now that I can add a new operator. I call the new operator Integer Class.

Both regexes above solve the problem, but they demand so many characters that it reminds me of chatter from a group of monkeys (i.e. not easy to understand). My proposal is to extend the regex syntax with a new operator: Integer Class. It matches integers of any length, if they are in a specific range. For IP numbers — which should be in the range 0-255 — it would look like this:

  • [0..255]

Square brackets surrounds two integers – a lower and an upper limit. There are double dots in-between the integers. Integer Class would work almost as a syntactic sugar for the super chubby alternation above — but not quite. Here are some details:

  • Backwards compability: most regex interpreters permits Character Classes with repeated characters. A regex [0..255] is syntactically correct already today. But now it means something entirely different than what I want. The regex interpreter doesn’t care about the double fives and the double dots. Right now, [0..255] is a redundant way of writing [0.25], i.e. it matches exactly one character and it must be either 0, dot, 2 or 5. With my syntax and semantics, the regex interpreter would notice the double dots and say “Hey, this isn’t a Character Class, because it’s an Integer Class.” How cool is that?
  • Leading Zeroes: An Integer Class matches leading zeros in the candidate, but only if you specify it. How? Well, by writing a zero in front of the lower limit. For example: [00..255] means that even sequences like 012, 0004 and 00255 are appropriate integers.
  • Negative integers: Of course, negative integers are permitted. For example, [-1..1] matches -1, 0 or 1. Sidenote: a leading dash in a Character Class means that dashes are allowed. It may sound trivial, but then you should know that in a Character Class [A-Z], the dash means from/to – the range A to Z. A dash in a Character Class has a different meaning depending of if it’s in the beginning or the middle of an expression. Anyway, if there is a double point, it is an Integer Class and then dash means minus.
  • Greedy: Like many other regex operators – such as repetition – the Integer Class is greedy of type longest-leftmost-match, but charity obedient. This means that [0..255] rather match 255 than 2 in candidate 255. But the regex engine is prepared to release the fives if it means the whole expression match. This differs from the super chubby alternation above. At least a Traditional NFA engine often matches the first one it finds, i.e. 2 rather than 255.
  • Meta Characters inside Integer class: There’s no need for rules to escape characters since only digits, double dots and dashes are permitted in an Integer class. Backslash is not allowed to reside in an Integer Class.
  • Negated Integer Class: A caret ^ after the right square bracket will negate the Integer Class. Any integer except those stated in the Integer Class will be matched. E.g. [^5..7] match any integer except 5, 6 and 7. Note that a negated Integer Class still must match an integer. Match exactly one integer, but not 5, 6 or 7.
  • Not accepting nothing: A Character Class doesn’t match nothing – the empty string. The same is true for an Integer Class. There must be an integer matched in the range. The exception is, of course, when an Integer Class is followed by * or ? — for example: [0..255]*

With the suggested Integer Class, an IP address can be matched with this regex:

  • ([0..255]\.){3}[0..255]

Relevant questions:

  • Is this syntax possible or would it be ambiguous?2
  • There are many possible operators to add to the regex syntax – is Integer Class the most needed?
  • Should the Integer Class be even more capable, e.g. be able to match floats?

« Previous PageNext Page »



Follow

Get every new post delivered to your Inbox.