
RegEx: Non-greedy operation without non-greedy operators

In the world of RegEx (Regular Expressions), not all engines support non-greedy (lazy) matching. Lazy quantifiers were introduced in Perl, so any Regex engine that implements PCRE (Perl Compatible Regular Expressions) supports them out of the box.

If you're on an engine that does not support non-greedy matching, you can use a simple trick to get the same result.


Note: I will be using GNU grep (2.25) in a shell, but you can use any tool or the Regex library of your language of choice; they should all behave similarly, except for some engine-specific tokens (which I won't be covering here).


Let's start!

In the example below, we have the string foo_bar_spam in the variable var, and our target is to get foo_ out of it using Regex.

% var='foo_bar_spam'

Now, let's see what we get with the usual greedy Regex pattern .*:

% grep -o '^.*_' <<<"$var"
foo_bar_

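We got foo_bar_ as expected: the greedy .* consumes as much as it can, so the match runs up to the last _ rather than the first.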


Note, for GNU/Linux/shell users:

The grep option used:

  • -o gets only the matched portion.

The shell token <<< is known as a here-string; it is a special form of here-document. Here, using <<<"$var", the expansion of the variable var is passed to the standard input of grep. It is similar to doing:

% echo "$var" | grep -o '^.*_'

except that there is one process fewer (echo) and no pipe (|), which would be created in kernel space (pipefs).


Now, how can we get our desired portion?

One way would be to use the non-greedy quantifier .*?, which is available through grep's -P option (-P enables the PCRE engine in grep):

% grep -Po '^.*?_' <<<"$var"
foo_

But we are assuming an engine that does not have this support.

The way to get the same result with any basic Regex engine is to use the pattern ^[^_]+_:

% grep -o '^[^_]\+_' <<<"$var"
foo_

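Note: Here, we needed to escape + because grep uses BRE (Basic RegEx) by default, where an unescaped + is a literal character (\+ as a quantifier is a GNU extension). Alternatively, we can use -E to enable ERE (Extended RegEx), where + works without escaping: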

% grep -Eo '^[^_]+_' <<<"$var"
foo_

In ERE, the tokens +, ?, {}, () and | are special without escaping. In BRE, which grep uses by default, they are literal characters unless preceded by a backslash.

This is specific to grep; most Regex engines support + out of the box, without escaping.


In ^[^_]+_:

  • ^ matches the start of the line/string
  • [^_]+ matches one or more characters (+) that are not _, i.e. everything up to the next _
  • _ matches a literal _

There you go! This trick can be used in any similar scenario where the match should stop at the first occurrence of a single known character.
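
To illustrate reusing the trick with other delimiters, here is a minimal Python sketch. The helper name prefix_through_delimiter is hypothetical (it is not part of any library or of the examples above), and it assumes a single-character delimiter, since a negated character class can only exclude individual characters.

```python
import re

def prefix_through_delimiter(text, delimiter):
    """Return the prefix of text up to and including the first delimiter.

    Hypothetical helper for illustration only. Assumes a single-character
    delimiter, because [^x] excludes characters, not substrings.
    Returns None when there is no match.
    """
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    d = re.escape(delimiter)  # keep characters such as ] or ^ literal
    # Same shape as ^[^_]+_ with the delimiter substituted in.
    match = re.search(rf'^[^{d}]+{d}', text)
    return match.group() if match else None

print(prefix_through_delimiter('foo_bar_spam', '_'))
print(prefix_through_delimiter('a,b,c', ','))
print(prefix_through_delimiter('no-delimiter-here', '_'))
```

The first two calls return 'foo_' and 'a,'; the last returns None because the string contains no _.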

As mentioned earlier, the Regex pattern is generic and should be reproducible on any Regex engine.

Here it is with Python's built-in re (RegEx) module:

>>> var = 'foo_bar_spam'

>>> import re

>>> re.search(r'^.*_', var).group() #Greedy
'foo_bar_'

>>> re.search(r'^.*?_', var).group() #Non-greedy with `.*?`
'foo_'

>>> re.search(r'^[^_]+_', var).group() #Non-greedy with `[^_]`
'foo_'
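
One caveat worth checking on your own inputs: the lazy pattern and the character-class pattern agree on foo_bar_spam, but not on a string that starts with _. The sketch below compares them; the variant with * instead of + is my own addition, as is the helper name matched, and it is only meant to show how to allow an empty prefix when that is what you want.

```python
import re

lazy = re.compile(r'^.*?_')    # lazy form, needs engine support
plus = re.compile(r'^[^_]+_')  # character-class form used in this post
star = re.compile(r'^[^_]*_')  # variant (assumption): allows an empty prefix

def matched(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None

for text in ('foo_bar_spam', '_bar_spam'):
    # Print the input followed by the result of each pattern.
    print(repr(text), matched(lazy, text), matched(plus, text), matched(star, text))
```

For '_bar_spam', the lazy pattern and the * variant both match just '_', while the + version finds no match because it requires at least one non-underscore character first.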
