RegEx: Non-greedy operation without non-greedy operators
In the world of RegEx (Regular Expressions), not every engine supports non-greedy (lazy) matching. Lazy quantifiers were introduced in Perl, so any Regex engine that implements PCRE (Perl Compatible Regular Expressions) supports lazy matching out of the box.
If you're on an engine that does not support non-greedy matching, you can use a simple trick to achieve the same effect.
Note: I will be using GNU grep (2.25) in a shell. You can use any tool or any Regex library in your language of choice; they should all behave similarly except for some engine-specific tokens (which I won't be covering here).
Let's start!
In the example below, we have the string foo_bar_spam in the variable var, and our target is to get foo_ out of it using Regex.
% var='foo_bar_spam'
Now, let's see what we get with the usual greedy Regex pattern .*:
% grep -o '^.*_' <<<"$var"
foo_bar_
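We got foo_bar_ as expected, because .* is greedy: it consumes as much as it can, so the match runs up to the last _ on the line.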
Note, for GNU/Linux/shell users:
The grep option used:
- -o prints only the matched portion of the line.
The shell token <<< is known as a here-string; it is a special form of here-document. Here, <<<"$var" passes the expansion of the variable var to the standard input of grep. It is similar to doing:
% echo "$var" | grep -o '^.*_'
except that there is one less process (echo) and no pipe (|), which is created in kernel space (pipefs).
Now, how can we get our desired portion?
One way would be to use the non-greedy quantifier .*?, which is available with grep's -P option (-P enables the PCRE engine in grep):
% grep -Po '^.*?_' <<<"$var"
foo_
But we are assuming an engine that does not have this support.
The way to get the same result with any basic Regex engine is to use a negated character class, as in the pattern ^[^_]+_:
% grep -o '^[^_]\+_' <<<"$var"
foo_
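Note: Here, we needed to escape + because grep uses BRE (Basic RegEx) by default, where a bare + is a literal character; \+ is GNU grep's way of writing the ERE (Extended RegEx) token + in BRE. Alternatively, we can use -E to enable ERE: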
% grep -Eo '^[^_]+_' <<<"$var"
foo_
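In ERE, the tokens +, ?, {}, and () are special without escaping. In BRE, which grep uses by default, they are literal characters unless preceded by a backslash (\+, \?, \{\}, \(\)); \+ and \? are GNU extensions to BRE.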
This is specific to grep; your engine should support + out of the box, without escaping.
In ^[^_]+_:
- ^ matches the start of the line/string
- [^_]+ matches one or more (+) characters other than _, i.e. everything up to the next _
- _ matches a literal _
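One detail in the breakdown is worth checking: because + requires at least one character, ^[^_]+_ and ^.*?_ only differ when the string starts with the delimiter, and ^[^_]*_ is the closer equivalent in that case. The following is a small sketch in Python to compare them; the helper name match_or_none and the second test string are made up for illustration.

```python
import re

# The three patterns being compared.
LAZY = r'^.*?_'     # lazy quantifier (needs engine support)
PLUS = r'^[^_]+_'   # negated class, one or more characters
STAR = r'^[^_]*_'   # negated class, zero or more characters


def match_or_none(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None


for text in ('foo_bar_spam', '_bar_spam'):
    results = [match_or_none(p, text) for p in (LAZY, PLUS, STAR)]
    print(repr(text), results)
```

For foo_bar_spam all three return foo_. For _bar_spam, the lazy and * versions return _, while the + version finds no match.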
There you go! This trick could be used in any similar scenario.
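As one illustration of reusing the trick, the pattern can be built for any single-character delimiter. This is only a minimal sketch: the helper name prefix_through is hypothetical, and it assumes the delimiter is exactly one character, since a negated character class cannot express "anything except this multi-character sequence".

```python
import re


def prefix_through(text, delim):
    """Return the prefix of text up to and including the first delim.

    Returns None when delim does not occur or text starts with delim,
    mirroring the behaviour of the ^[^_]+_ pattern.
    """
    if len(delim) != 1:
        raise ValueError("delim must be a single character")
    d = re.escape(delim)  # keep characters such as . or ] literal
    pattern = '^[^' + d + ']+' + d
    m = re.search(pattern, text)
    return m.group() if m else None


print(prefix_through('foo_bar_spam', '_'))
print(prefix_through('a.b.c', '.'))
print(prefix_through('no delimiter here', '_'))
```

The three calls print foo_, a. and None respectively.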
As mentioned earlier, the Regex pattern is generic and should be reproducible on any Regex engine.
Here it is with Python's built-in re (RegEx) module:
>>> var = 'foo_bar_spam'
>>> import re
>>> re.search(r'^.*_', var).group()  # Greedy
'foo_bar_'
>>> re.search(r'^.*?_', var).group()  # Non-greedy with `.*?`
'foo_'
>>> re.search(r'^[^_]+_', var).group()  # Non-greedy with `[^_]`
'foo_'
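The same negated-class idea also helps when a pattern has to match several times in one string, for example when picking out short bracketed tokens. Below is a rough sketch using re.findall; the sample string and variable names are made up for illustration.

```python
import re

# Made-up sample input with four angle-bracketed tokens.
text = '<b>bold</b> and <i>it</i>'

# Greedy: .+ runs from the first < to the last >, giving one big match.
greedy = re.findall(r'<.+>', text)

# Negated class: [^>]+ cannot cross a >, so each token is matched separately.
negated = re.findall(r'<[^>]+>', text)

print(greedy)
print(negated)
```

The greedy pattern returns the whole string as a single match, while the negated-class pattern returns the four tokens <b>, </b>, <i> and </i>, without needing a lazy quantifier.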