Regex extension — here’s the Integer Class

The IP address problem is well known in the regex community. Here I present an extended regex syntax that would make it possible to match IP addresses in a less chatty way.

Ip-address problem

Let’s say I want to validate an IP address. It is four integers with interleaved dots such as 123.125.126.107. The naïve regex would be:

  • \d*\.\d*\.\d*\.\d*

But it is so imprecise that it matches both 9999.9999.9999.9999 and . Since the four integers in a IP address must be in the range 0-255, I can be certain that they consist of one, two, or three digits. With the Limiting Repetition operator, I can formulate this requirement:

  • \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Now 0.1.2.3 and 101.0.202.1 matches, while 9999.9999.9999.9999 doesn’t match because there are too many digits in the integers. Unfortunately 999.888.777.666 matches as well. That’s not an acceptable IP address. I told you that the integers can be up to 255. But, don’t give up. In Friedl’s seminal book “Mastering Regular Expressions”, he describes two ways to shrink the match set. Both are, however, very chatty. The first way is to list all authorized integers in a super chubby alternation:

  • (0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20
    |21|22|23|24|25|26|27|28|29|30|31|32|33|34|35
    |36|37|38|39|40|41|42|43|44|45|46|47|48|49|50
    |51|52|53|54|55|56|57|58|59|60|61|62|63|64|65
    |66|67|68|69|70|71|72|73|74|75|76|77|78|79|80
    |81|82|83|84|85|86|87|88|89|90|91|92|93|94|95
    |96|97|98|99|100|101|102|103|104|105|106|107
    |108|109|110|111|112|113|114|115|116|117|118
    |119|120|121|122|123|124|125|126|127|128|129
    |130|131|132|133|134|135|136|137|138|139|140
    |141|142|143|144|145|146|147|148|149|150|151
    |152|153|154|155|156|157|158|159|160|161|162
    |163|164|165|166|167|168|169|170|171|172|173
    |174|175|176|177|178|179|180|181|182|183|184
    |185|186|187|188|189|190|191|192|193|194|195
    |196|197|198|199|200|201|202|203|204|205|206
    |207|208|209|210|211|212|213|214|215|216|217
    |218|219|220|221|222|223|224|225|226|227|228
    |229|230|231|232|233|234|235|236|237|238|239
    |240|241|242|243|244|245|246|247|248|249|250
    |251|252|253|254|255)

Note that the above is just one integer in the IP address. I have to write the super chubby alternation four times with interleaved dots to get the correct regex.

The second way to specify an integer in the range 0-255 is to decompose the problem into sub problems based on the initial character. If the first digit is 0 or 1, then all integers consisting of 1-3 digits are acceptable – I also allow leading zeros as in e.g. 054. When the initial digit is 2 and the second number is in the range 0-4, then I accept digits in the ranges 20-24 and 200-249. Finally, if the integer starts with 25, then I only tolerate it if it’s followed by a digit in the range 0-5. Like this:

  • [01]?\d\d?|2[0-4]\d|25[0-5]

(Does this regex really match 25? The third part of this alternation only matches integers in the range 250-255. Yes, it does. The first part matches any integer consisting of one or two digits.)

Again I must rewrite my regex four times with interleaved dots to match an entire IP address.

Integer Class – a new operator suggested

So far I’ve talked about how regex works right now. Let’s imagine now that I can add a new operator. I call the new operator Integer Class.

Both regexes above solve the problem, but they demand so many characters that it reminds me of chatter from a group of monkeys (i.e. not easy to understand). My proposal is to extend the regex syntax with a new operator: Integer Class. It matches integers of any length, if they are in a specific range. For IP numbers — which should be in the range 0-255 — it would look like this:

  • [0..255]

Square brackets surrounds two integers – a lower and an upper limit. There are double dots in-between the integers. Integer Class would work almost as a syntactic sugar for the super chubby alternation above — but not quite. Here are some details:

  • Backwards compability: most regex interpreters permits Character Classes with repeated characters. A regex [0..255] is syntactically correct already today. But now it means something entirely different than what I want. The regex interpreter doesn’t care about the double fives and the double dots. Right now, [0..255] is a redundant way of writing [0.25], i.e. it matches exactly one character and it must be either 0, dot, 2 or 5. With my syntax and semantics, the regex interpreter would notice the double dots and say “Hey, this isn’t a Character Class, because it’s an Integer Class.” How cool is that?
  • Leading Zeroes: An Integer Class matches leading zeros in the candidate, but only if you specify it. How? Well, by writing a zero in front of the lower limit. For example: [00..255] means that even sequences like 012, 0004 and 00255 are appropriate integers.
  • Negative integers: Of course, negative integers are permitted. For example, [-1..1] matches -1, 0 or 1. Sidenote: a leading dash in a Character Class means that dashes are allowed. It may sound trivial, but then you should know that in a Character Class [A-Z], the dash means from/to – the range A to Z. A dash in a Character Class has a different meaning depending of if it’s in the beginning or the middle of an expression. Anyway, if there is a double point, it is an Integer Class and then dash means minus.
  • Greedy: Like many other regex operators – such as repetition – the Integer Class is greedy of type longest-leftmost-match, but charity obedient. This means that [0..255] rather match 255 than 2 in candidate 255. But the regex engine is prepared to release the fives if it means the whole expression match. This differs from the super chubby alternation above. At least a Traditional NFA engine often matches the first one it finds, i.e. 2 rather than 255.
  • Meta Characters inside Integer class: There’s no need for rules to escape characters since only digits, double dots and dashes are permitted in an Integer class. Backslash is not allowed to reside in an Integer Class.
  • Negated Integer Class: A caret ^ after the right square bracket will negate the Integer Class. Any integer except those stated in the Integer Class will be matched. E.g. [^5..7] match any integer except 5, 6 and 7. Note that a negated Integer Class still must match an integer. Match exactly one integer, but not 5, 6 or 7.
  • Not accepting nothing: A Character Class doesn’t match nothing – the empty string. The same is true for an Integer Class. There must be an integer matched in the range. The exception is, of course, when an Integer Class is followed by * or ? — for example: [0..255]*

With the suggested Integer Class, an IP address can be matched with this regex:

  • ([0..255]\.){3}[0..255]

Relevant questions:

  • Is this syntax possible or would it be ambiguous?2
  • There are many possible operators to add to the regex syntax – is Integer Class the most needed?
  • Should the Integer Class be even more capable, e.g. be able to match floats?
Advertisements

8 Responses to “Regex extension — here’s the Integer Class”


  1. 1 Örjan 2010-05-30 at 09.39

    Hmm… there’s no place like ::1

    C:\Users\orjan>ping ::1

    Pinging ::1 with 32 bytes of data:
    Reply from ::1: time<1ms
    Reply from ::1: time<1ms
    Reply from ::1: time<1ms
    Reply from ::1: time<1ms

    Ping statistics for ::1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

    And also IP4 maps ::ffff:123.125.126.107

    :)

  2. 2 Johan Lindberg 2010-06-3 at 18.52

    Hi Staffan,

    as I mentioned on Twitter I’ve been playing around with regexes for various purposes from time to time.

    And I think that one of my little experiments might be of interest to you. Unfortunately, I couldn’t find the original code but I rewrote enough earlier today to show off the gist of it. You can find the code at http://gist.github.com/422771

    I have implemented the syntax you propose for the integer class and an interesting thing with regards to ambiguity turned up. I think it might be good to clarify the syntax to make it clear what should happen in the following situations:

    match(‘[1..10]’, ’10’) => ‘1’ or ’10’?
    match(‘[1..10]’, ‘100’) => ‘1’, ’10’ or No match at all?

    BR
    Johan Lindberg
    twitter: @rplaca

  3. 3 Staffan Nöteberg 2010-06-3 at 19.50

    Brilliant follow-up post at github, Johan!

    > match(‘[1..10]‘, ’10′) => ’1′ or ’10′?

    Integer Class is greedy: “[0..255] rather match 255 than 2 in candidate 255“. In your example it will match 10. That means, your pre-engine must lay out the alternation in reverse direction: (10|9|8|7|6|5|4|3|2|1). A regex engine that creates Traditional NFA (e.g. Python), usually matches the first in an alternation. Posix NFA and DFA recognizes the leftmost longest match – it will match 10 regardless your order.

    > match(‘[1..10]‘, ’100′) => ’1′, ’10′ or No match at all?

    It’s greedy here as well:
    ~>echo 100|ruby -pe 'puts $1 if $_[/(10|9|8|7|6|5|4|3|2|1)/]; next'
    10

    And here, as well as in the first example, the pre-engine alternation must be in reverse order if the regexp engine creates Traditional NFA – like Python.

  4. 4 Staffan Nöteberg 2010-06-3 at 19.54

    Right, Örjan! :-)

    There’s a more extensive regex in Goyavaerts/Levithan for IPv(4|6). However, that recipe would have been more crisp if it could use the Integer Class, IMO.

  5. 5 Johan Lindberg 2010-06-3 at 19.58

    Thanks Staffan,

    about a minute after I posted the comment I re-read the post and realized that you had already specified this quite thoroughly.

    Anyway, I’ve updated the code at github so it behaves better now. I’ll try to clean up the implementation in the coming days. Who knows, It might come in handy as a little prototyping tool.

    /Johan

  6. 6 Staffan Nöteberg 2010-06-3 at 20.11

    Actually, this may be a trait: [1..10] prefers to match smallest integer, but [10..1] prefers the biggest. Or maybe this would only make the syntax more complicated, without solving a realistic demand.

  7. 7 Johan Lindberg 2010-06-4 at 09.23

    Well, it won’t cost much to try it out but, as you say, I wonder if it solves a problem that will turn up much.

    I’m actually more concerned about the case with match(‘[1..10]’, ‘100’) matching ’10’ instead of no match. I think that implies that there’s a discrepancy between types in the pattern and the source string.

    We’re talking about integers in the pattern but, in effect, when we allow it to match ’10’ we’re talking about text in the string. ’10’ is a subset of ‘100’ but 10 has nothing (IMHO) to do with 100.

    It makes the whole thing a lot more complicated and strays from the rest of regex thinking, but I can think of several situations where I’d like to have that capability parsing large data files.

    /Johan


  1. 1 jogos do ben 10 Trackback on 2011-10-19 at 22.53
Comments are currently closed.




%d bloggers like this: