Regex syntax and semantics vary

Regex engines differ in syntax and semantics. This is one reason why you can’t just find an expression with Google and use it in your code without fully understanding it.

Try, for example, the following in JIRB, the interactive JRuby tool:

irb(main):001:0> require 'java'
=> true
irb(main):002:0> java.util.regex.Pattern.compile("a$").matcher("a\nb").find
=> false
irb(main):003:0> "a\nb"[/a$/]
=> "a"

What happened?

The regex /a$/ matches the letter a immediately before a position where the dollar sign assertion holds. By default, that assertion does not mean the same thing in Ruby and Java. In Java, by default, the dollar sign matches only at the end of the whole text, or just before a final line break at the end of the text. In Ruby, the dollar sign matches at the end of every single line. In the string above, the a ends the first line but not the whole text, so Ruby finds a match and Java doesn’t.
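To see the difference from the Ruby side only, here is a minimal sketch that tries Ruby’s three end anchors on the same kind of input. The constant and method names are made up for illustration. As far as I know, Ruby’s \Z is the closest counterpart to Java’s default dollar sign, although Java recognizes more kinds of line terminators than just "\n".

```ruby
# Illustrative names only; nothing here comes from a library.
SAMPLE = "a\nb"

# $   matches at the end of every line
# \Z  matches at the end of the string, or just before a final newline
# \z  matches only at the very end of the string
def end_anchor_results(text)
  {
    dollar:  text[/a$/],
    big_z:   text[/a\Z/],
    small_z: text[/a\z/]
  }
end

# For "a\nb" only the dollar sign version finds the a.
p end_anchor_results(SAMPLE)

# For "a\n" the \Z version finds it as well, because the newline is final.
p end_anchor_results("a\n")
```

If you want a Ruby regex that only matches at the end of the text, \Z or \z is probably what you are after, not the dollar sign.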

Here’s what the three lines of code above mean:

  1. First, we need to include the Java libraries by writing require 'java'. This might not be necessary, depending on your setup.
  2. We compile a Java regex and test whether any part of the string, an ‘a’ and a ‘b’ with a newline in between, can be matched. It can’t.
  3. We use the same regex in Ruby and test whether any part of the same string can be matched. It can.

This is just one of many examples. If you are going to use regexes in your program, you need to understand them. It’s as simple as that.
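Here is a small sketch of how this can bite in practice. Assume you found a "digits only" pattern written with ^ and $ and pasted it into Ruby. Both method names below are hypothetical, chosen just for this example.

```ruby
# Hypothetical validators, for illustration only.

# Uses line anchors: in Ruby, ^ and $ match at the start and end of any line.
def digits_only_line_anchors?(input)
  /^\d+$/.match?(input)
end

# Uses string anchors: \A and \z match only at the start and end of the text.
def digits_only_string_anchors?(input)
  /\A\d+\z/.match?(input)
end

input = "123\nabc"

# The first line consists of digits, so the line-anchored version accepts it.
puts digits_only_line_anchors?(input)

# The string-anchored version rejects it, since the whole text is not digits.
puts digits_only_string_anchors?(input)
```

The same pattern text is accepted by both engines, but it answers a different question in each. That is exactly the kind of thing you only notice if you know your engine’s rules.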


