Regex process: Copy-Paste-Generalize

Regular expressions are a flexible tool for matching strings of text. They can be really crisp and elegant, but how do you design them? Below is a design process.

Name

Copy-Paste-Generalize

Intent

Go from an idea to a flexible and generalized regular expression.

Applicability

You have a text example and you know what you want to extract. But, of course, you want your regular expression to be generalized enough to match other candidates.

Consequences

The Copy-Paste-Generalize pattern makes it easy to get started, and from there you can develop your regular expression in an iterative, structured way.

Mechanics

  1. Copy a text example
  2. Paste it as your initial regular expression
  3. Generalize the expression step by step until it matches any possible candidate
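
The three steps can be sketched in Ruby on a toy case. The sample text, the other candidates, and every name below are invented for illustration only.

```ruby
# Hypothetical sample text and candidates, invented for this sketch.
TOY_SAMPLE = 'Invoice no. 4711 issued'
TOY_CANDIDATES = ['Invoice no. 52 issued', 'Invoice #900417 (draft)'].freeze

# Steps 1 and 2: copy the sample and paste it as the regex.
# The parentheses mark the part to extract.
TOY_PASTED = /Invoice no\. (4711) issued/

# Step 3a: generalize the number itself.
TOY_DIGITS = /Invoice no\. (\d+) issued/

# Step 3b: generalize the text around the number.
TOY_GENERAL = /Invoice\D*(\d+)/

# String#[] with a regex and a group index returns the capture, or nil.
def toy_extract(regex, text)
  text[regex, 1]
end

p toy_extract(TOY_PASTED, TOY_SAMPLE)          # => "4711"
p toy_extract(TOY_PASTED, TOY_CANDIDATES[0])   # => nil
p toy_extract(TOY_DIGITS, TOY_CANDIDATES[0])   # => "52"
p toy_extract(TOY_GENERAL, TOY_CANDIDATES[1])  # => "900417"
```

The pasted expression only matches the text it was copied from; each generalization step widens the set of candidates it accepts.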

Other Names

This process has many names. For example, Mehran Habibi calls something similar “The Pull Technique” in his book Java Regular Expressions.

Example

Pomodoro Technique Illustrated info page at Amazon.com

Amazon.com presents a sales rank for all books. The list is updated frequently, and every book’s current rank can be found on the book’s info page. Suppose I want to match the current rank for my book Pomodoro Technique Illustrated. First I download an example text: the current page at Amazon:

I use Lynx (btw: by far the greatest web browser the world has ever seen) in CLI mode. The result is a rain of lines from Pomodoro Technique Illustrated’s info page at Amazon. I’d better grep something to make my example text smaller:

The result is:

  • * Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in

Great! This includes what I want to match. What I’ve done so far is the Copy part of this process. Next goes the Paste. I add an extremely simple Regular Expression and put it in a Ruby one-liner:

As a matter of fact, the expression that resides between the slashes is just a Paste of what I got from the Lynx dump. And the result of this line is:

  • 23,032

…and that’s because of the parentheses, which capture only the rank.
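
A minimal sketch of the Paste step in Ruby, assuming the grepped line is held in a string constant. The names `PASTE_LINE`, `PASTED_RANK` and `pasted_rank` are made up for this sketch, and it is not necessarily the same one-liner as the one used above.

```ruby
# The line from the Lynx dump, as shown earlier in the text.
PASTE_LINE = '* Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in'

# Paste step: the regex is a literal copy of part of that line.
# The parentheses form a capture group around the rank.
PASTED_RANK = /Amazon Bestsellers Rank: #(23,032) in Books/

# String#[] with a regex and a group index returns the captured text, or nil.
def pasted_rank(line)
  line[PASTED_RANK, 1]
end

puts pasted_rank(PASTE_LINE)  # prints 23,032
```

Without the parentheses the whole matched text would be the only thing available; the group is what narrows the result down to the number.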

Copy done. Paste done. Let’s start to generalize. Next time I run this, I might get a rank other than 23,032. Digits can be captured with the meta sequence \d. This leads to the next iteration:
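
A minimal sketch of this iteration in Ruby, assuming every digit of the pasted rank is simply swapped for \d. The constant and method names, and the second rank used to try it out, are made up for illustration.

```ruby
# The line from the Lynx dump, as shown earlier in the text.
CASCADE_LINE = '* Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in'

# Each literal digit of 23,032 is replaced by \d; the comma stays literal.
CASCADED_RANK = /Amazon Bestsellers Rank: #(\d\d,\d\d\d) in Books/

def cascaded_rank(line)
  line[CASCADED_RANK, 1]
end

p cascaded_rank(CASCADE_LINE)                                 # => "23,032"
# A hypothetical later rank with the same number of digits also matches.
p cascaded_rank('Amazon Bestsellers Rank: #47,911 in Books')  # => "47,911"
# A rank with a different number of digits does not.
p cascaded_rank('Amazon Bestsellers Rank: #7,911 in Books')   # => nil
```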

Instead of cascading the \d, I can use the limiting repetition operator:
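
A minimal sketch of this iteration in Ruby. The bounds `{1,3}` and `{3}` are an assumption, chosen because they fit the remark further down about ranks between 1,000 and 999,999; all names and the extra test ranks are made up for illustration.

```ruby
# The line from the Lynx dump, as shown earlier in the text.
REPEAT_LINE = '* Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in'

# \d{1,3} means one to three digits, \d{3} means exactly three digits.
REPEATED_RANK = /Amazon Bestsellers Rank: #(\d{1,3},\d{3}) in Books/

def repeated_rank(line)
  line[REPEATED_RANK, 1]
end

p repeated_rank(REPEAT_LINE)                                  # => "23,032"
p repeated_rank('Amazon Bestsellers Rank: #1,000 in Books')   # => "1,000"
p repeated_rank('Amazon Bestsellers Rank: #999,999 in Books') # => "999,999"
# A rank below 1,000 has no comma, so it is not matched.
p repeated_rank('Amazon Bestsellers Rank: #512 in Books')     # => nil
```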

The text in between “Rank” and the number may change. It would be more robust to describe it as non-digits:
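
A minimal sketch of this iteration in Ruby, assuming the number part is still written with the repetition bounds `{1,3}` and `{3}`. The reworded line is an invented variation, not real Amazon output, and the names are made up for illustration.

```ruby
# The line from the Lynx dump, as shown earlier in the text.
NONDIGIT_LINE = '* Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in'

# [^\d]* skips any run of non-digit characters between "Rank" and the number,
# and nothing after the number is required any more.
NONDIGIT_RANK = /Bestsellers Rank[^\d]*(\d{1,3},\d{3})/

def nondigit_rank(line)
  line[NONDIGIT_RANK, 1]
end

p nondigit_rank(NONDIGIT_LINE)                                   # => "23,032"
# Hypothetical rewording of the text between "Rank" and the number.
p nondigit_rank('Amazon Bestsellers Rank (Books): No. 7,001 overall')  # => "7,001"
```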

This expression will only work when the rank is between 1,000 and 999,999. Just in case this book gets extremely popular, let’s generalize the number part:

The expression has become pretty compact and robust. I stop here:

  • Bestsellers Rank[^\d]*([\d,]*)
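
A minimal sketch that tries the final expression in Ruby. The constant and method names are made up for this sketch, and every input other than the original Lynx line is a hypothetical variation.

```ruby
# The line from the Lynx dump, as shown earlier in the text.
FINAL_LINE = '* Amazon Bestsellers Rank: #23,032 in Books ([89]See Top 100 in'

# The final expression from the text: skip non-digits after "Rank",
# then capture any run of digits and commas.
FINAL_RANK = /Bestsellers Rank[^\d]*([\d,]*)/

def final_rank(line)
  line[FINAL_RANK, 1]
end

p final_rank(FINAL_LINE)                                # => "23,032"
p final_rank('Bestsellers Rank: #7 in Books')           # => "7"
p final_rank('Bestsellers Rank: #1,234,567 in Books')   # => "1,234,567"

# Because [\d,]* may match nothing, a line without any number
# still matches and yields an empty string rather than nil.
p final_rank('Bestsellers Rank: unavailable')           # => ""

# Turning the captured text into an integer.
p final_rank(FINAL_LINE).delete(',').to_i               # => 23032
```

If an empty capture is unwanted, replacing the last `*` with `+` would make the match fail when no digits follow; that is a possible tightening, not part of the expression above.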

Challenge

Even though the above is only an example, you may know how to make the Regular Expression or the Ruby/Bash code even more crisp. If you do, feel free to append a comment below.
