Lookahead Regex Failing To Find The Same Overlapping Matches
Solution 1:
One trick you can use here is to match on ba(?=bab) instead, which consumes only ba, so the regex engine resumes the search right after those two characters rather than after the whole babab:
import re
matches = re.findall(r'ba(?=bab)', "babababab")
# Each match is only the consumed head 'ba'; re-attach the looked-ahead tail.
matches = [i + 'bab' for i in matches]
print(matches)
This prints:
['babab', 'babab', 'babab']
Note that I concatenate the tail bab to each match, which is fine, because we know the actual logical match was babab.
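If you also need the position of each match, the same idea can be sketched with re.finditer, slicing the full match out of the input instead of concatenating the tail. This is only an illustration: the names text, target, head, tail and found are made up for the example, and the split point (two characters) is the one used above.

```python
import re

text = "babababab"
target = "babab"

# Split the literal into a consumed head and a looked-ahead tail
# (assumption: same split as above, 'ba' + 'bab').
head, tail = target[:2], target[2:]
pattern = re.escape(head) + '(?=' + re.escape(tail) + ')'

# Record (start index, full match) by slicing the original string.
found = [(m.start(), text[m.start():m.start() + len(target)])
         for m in re.finditer(pattern, text)]
print(found)  # [(0, 'babab'), (2, 'babab'), (4, 'babab')]
```

A two-character head is safe here because babab cannot start on an a. For other literals, consuming only the first character is the conservative choice, since the engine then retries at every following position.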
Solution 2:
We can generalize the solution to any regex.
Let's say we have a valid regex <pattern> for which you want to find overlapping matches.
In order to get overlapping matches, we need to avoid consuming characters in each match, relying on the bump-along mechanism to evaluate the regex at every position of the string. This can be achieved by wrapping the whole regex in a look-ahead, (?=<pattern>), and nesting a capturing group inside it to capture the match: (?=(<pattern>)).
This technique works with Python's re engine because, after it finds an empty match, it simply bumps along to the next position. It does not retry the regex at the same position looking for a non-empty match, as the PCRE engine does.
Sample code:
import re
inp = '10.5.20.52.48.10'
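# findall returns tuples when the pattern has more than one group; group 1 is the whole match.
matches = [m[0] if isinstance(m, tuple) else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]
print(matches)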
Output:
['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']
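The tuple check can be avoided by using re.finditer and always reading group 1. A minimal sketch, assuming the pattern has no numbered backreferences and no leading inline flags; overlapping_matches is a hypothetical helper name, not something the re module provides:

```python
import re

def overlapping_matches(pattern, text):
    """Return every match of `pattern` in `text`, allowing overlaps.

    Hypothetical helper for illustration; not part of the `re` module.
    """
    # Wrap the pattern in a look-ahead with one extra capturing group.
    # Group 1 always holds the whole match, however many groups the
    # original pattern contains.
    wrapped = re.compile('(?=(' + pattern + '))')
    return [m.group(1) for m in wrapped.finditer(text)]

print(overlapping_matches(r'\d+(\.\d+){2}', '10.5.20.52.48.10'))
```

On the sample input this produces the same list as above.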
If the original <pattern> doesn't have numbered backreferences, then we can build the overlapping version of the regex with simple string concatenation.
However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.
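To illustrate the shift, here is a small sketch with a made-up pattern, (\d)\d\1, which matches a digit, any digit, and then the first digit again. The variable names are only for the example. Using a named group instead is one way to keep plain concatenation working, since named references are not affected by renumbering.

```python
import re

text = '12121'

# Original pattern: group 1 is the first digit, \1 refers back to it.
original = r'(\d)\d\1'

# Manually shifted version: the outer capturing group is now group 1,
# so the inner group becomes group 2 and the backreference must be \2.
shifted = r'(?=((\d)\d\2))'
shifted_result = [m.group(1) for m in re.finditer(shifted, text)]
print(shifted_result)  # ['121', '212', '121']

# With a named group, the pattern can be wrapped by concatenation as-is.
named = r'(?P<d>\d)\d(?P=d)'
wrapped = '(?=(' + named + '))'
named_result = [m.group(1) for m in re.finditer(wrapped, text)]
print(named_result)  # ['121', '212', '121']
```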
Do note that this method doesn't give you overlapping matches starting at the same index (e.g. looking for a+ in aaa will give you 3 matches instead of 6). Most regex flavors and libraries, Perl being the exception, cannot produce overlapping matches that start at the same index.