
Lookahead Regex Failing To Find The Same Overlapping Matches

Is it possible to find overlapping matches with a regular expression when searching repeatedly for the same pattern? I want to be able to find a match that occurs three times. For examp

Solution 1:

One trick you can use here is to match only ba(?=bab). The lookahead asserts that bab follows but consumes only ba, so the regex engine resumes its search just two characters later, inside what would otherwise have been the rest of the full match:

import re

matches = re.findall(r'ba(?=bab)', "babababab")
# Re-attach the tail that the lookahead asserted but did not consume
matches = [m + 'bab' for m in matches]
print(matches)

This prints:

['babab', 'babab', 'babab']

Note that I concatenate the tail bab to each match, which is fine because we know the full logical match was babab.
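
To see what the lookahead buys you, here is a small sketch comparing a plain search for babab with the lookahead version on the same input. It uses re.finditer to report where each overlapping match starts. The variable names are only illustrative.

```python
import re

text = "babababab"

# A plain pattern consumes all five characters, so the search resumes
# after the first match and the overlapping occurrences are skipped.
plain = re.findall(r'babab', text)

# The lookahead version consumes only "ba"; finditer exposes each start index.
starts = [m.start() for m in re.finditer(r'ba(?=bab)', text)]
overlapping = [text[i:i + 5] for i in starts]

print(plain)        # ['babab']
print(starts)       # [0, 2, 4]
print(overlapping)  # ['babab', 'babab', 'babab']
```

Slicing the original string by start index is an alternative to concatenating the known tail, and it only works here because the full match has a fixed length of five characters.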

Solution 2:

We can generalize the solution to any regex.

Suppose you have a valid regex pattern and want to find its overlapping matches.

To get overlapping matches, we must avoid consuming characters in each match and rely on the engine's bump-along mechanism to evaluate the regex at every position in the string. This is achieved by wrapping the whole regex in a lookahead, (?=<pattern>), and nesting a capturing group inside it to capture the matched text: (?=(<pattern>)).

This technique works with Python's re engine because, after finding an empty match, it simply bumps along to the next position. It does not re-evaluate the regex at the same position looking for a non-empty match on a second try, as the PCRE engine does.

Sample code:

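import re

inp = '10.5.20.52.48.10'
# findall returns tuples when the pattern has more than one group; keep only group 1
matches = [m[0] if isinstance(m, tuple) else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]
print(matches)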

Output:

['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']
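
The zero-width nature of these matches can be checked with re.finditer. The sketch below reuses the same input and pattern and shows that the overall match at each position is empty, while group 1 carries the text. The names starts and all_empty are illustrative.

```python
import re

inp = '10.5.20.52.48.10'
pattern = r'(?=(\d+(\.\d+){2}))'

for m in re.finditer(pattern, inp):
    # m.span() is the zero-width lookahead match; m.span(1) covers the captured text.
    print(m.span(), m.span(1), m.group(1))

starts = [m.start() for m in re.finditer(pattern, inp)]
all_empty = all(m.group(0) == '' for m in re.finditer(pattern, inp))
print(starts)     # [0, 1, 3, 5, 6, 8, 9]
print(all_empty)  # True
```

Because every overall match is empty, the engine advances one character at a time, which is why there is at most one result per starting index.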

If the original pattern doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.

However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.
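
As a rough sketch of both cases: find_overlapping below is a hypothetical helper (not part of the re module) that builds the wrapped pattern by string concatenation. The second part shows a backreference being renumbered by hand. It assumes the pattern contains no numbered backreferences when passed to the helper.

```python
import re

def find_overlapping(pattern, text):
    """Hypothetical helper: wrap `pattern` in (?=(...)) and return group 1 of each match."""
    wrapped = '(?=(' + pattern + '))'
    return [m.group(1) for m in re.finditer(wrapped, text)]

# No backreferences: plain concatenation is enough.
dotted = find_overlapping(r'\d+(?:\.\d+){2}', '10.5.20.52')
print(dotted)  # ['10.5.20', '0.5.20', '5.20.52']

# With a numbered backreference, the extra outer group shifts the numbering:
# (\w)\1 must become (\w)\2 once it sits inside the new group 1.
original = r'(\w)\1'
shifted = r'(?=((\w)\2))'

non_overlapping = [m.group(0) for m in re.finditer(original, 'aaab')]
overlapping = [m.group(1) for m in re.finditer(shifted, 'aaab')]
print(non_overlapping)  # ['aa']
print(overlapping)      # ['aa', 'aa']
```

Using finditer with group(1) also sidesteps the tuple handling that findall needs when the pattern has several groups.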

Note that this method does not give you overlapping matches that start at the same index. For example, searching for a+ in aaa gives 3 matches (aaa, aa and a, one per starting position) rather than all 6 matching substrings. Finding overlapping matches that start at the same index is not possible in most regex flavors and libraries; Perl is the exception.
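
To illustrate the difference between 3 and 6, here is a small sketch that contrasts the lookahead approach with a brute-force check of every substring using re.fullmatch. The brute-force part is not a regex engine feature, only a workaround for short inputs, and the variable names are illustrative.

```python
import re

text = 'aaa'

# Lookahead approach: at most one (greedy) match per starting index.
per_start = [m.group(1) for m in re.finditer(r'(?=(a+))', text)]
print(per_start)  # ['aaa', 'aa', 'a']

# Brute force: test every non-empty substring against the whole pattern.
compiled = re.compile(r'a+')
every = [text[i:j]
         for i in range(len(text))
         for j in range(i + 1, len(text) + 1)
         if compiled.fullmatch(text[i:j])]
print(every)  # ['a', 'aa', 'aaa', 'a', 'aa', 'a']
```

This tries a number of substrings that grows quadratically with the input length, and because each slice is matched in isolation, anchors and lookarounds in the pattern see only the slice rather than the surrounding text.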
