Python Regex To Match And Remove The Indels In Pileup Format

My question is similar to the post "Mpileup regex command to remove indels", but I need a solution in Python.

INPUT: chr8 30 T 6 ...,$.$.$A,..A...,,,.,,...+5AGGC...-8GTCG

Solution 1:

This one worked finally

sequence = re.sub(r"\+\d+[ACGT]+", "", sequence)

sequence = re.sub(r"-\d+[ACGT]+", "", sequence)

This might be helpful for anyone looking for a regex to remove indels from a pileup file.


Solution 2:

This one worked finally

sequence = re.sub(r"\+\d+[ACGT]+", "", sequence)

Except, of course, it's wrong. Consider:

.....+5AGGCTA.....

The [ACGT]+ is greedy and will eat all the bases that follow, not just the five that the pileup notation says belong to the indel. You can verify this if you have the quality-score string: after removing indels and other artifacts, the lengths of the two strings won't agree. Conceptually, the pattern we want is:

r"[+-](\d+)[ACTG]{\1}"

But the regex syntax doesn't allow a variable as the count in the repetition operator: {5} is fine, but the back reference {\1} is not.
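To make the problem concrete, here is a small sketch (my own illustration, using the example string above) that runs the greedy pattern and then tries the conceptual pattern. As far as I know, CPython's re module treats a brace expression that is not a valid count as literal text, so the second pattern compiles but simply fails to match.

```python
import re

pileup = ".....+5AGGCTA....."

# Greedy version: [ACGT]+ swallows all six letters, including the real base "A".
greedy = re.sub(r"[+-]\d+[ACGT]+", "", pileup)
print(greedy)       # ..........
print(len(greedy))  # 10, although 11 reads cover this position

# Conceptual version: "{\1}" is not a valid repetition count, so re reads the
# braces literally and looks for the text "{5}" after one base. Nothing matches.
conceptual = re.search(r"[+-](\d+)[ACTG]{\1}", pileup)
print(conceptual)   # None
```

The missing character in the greedy result is exactly the mismatch you would see against the quality-score string.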

There are several ways to go about this, mostly involving two steps: first, match the initial part through the count; second, use that count to finish the job. Here's an example:

import re

pileup = '...,$.$.$A,..A...,,,.,,...+5AGGCTA..-8GTCGGAAAT......,a,^F,^].^F,'

while True:
    match = re.search(r"[+-](\d+)", pileup)

    if match is None:
        break

    pileup = pileup[:match.start()] + pileup[match.end() + int(match.group(1)):]

print(pileup)

Match the sign and the count, extract the count. Then cut the match itself plus count characters out of the string. Repeat until you don't find any more indels.

OUTPUT

...,$.$.$A,..A...,,,.,,...A..T......,a,^F,^].^F,

Another approach is to use the results of the first pattern match to dynamically create a second pattern that you can pass to re.sub() to remove each indel in turn.
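A rough sketch of that second approach might look like the following. The function name strip_indels, the base alphabet (ACGTN in either case), and the guard against a malformed indel are my own assumptions, not something the pileup format or the question dictates.

```python
import re

def strip_indels(pileup):
    """Remove each indel: a sign, a count N, then exactly N bases."""
    while True:
        match = re.search(r"[+-](\d+)", pileup)
        if match is None:
            return pileup
        digits = match.group(1)
        # Second pattern, built from the count just found. The base alphabet
        # used here is an assumption about the input.
        indel = re.compile(r"[+-]%s[ACGTNacgtn]{%d}" % (digits, int(digits)))
        stripped = indel.sub("", pileup, count=1)
        if stripped == pileup:
            # Nothing was removed, so stop instead of looping forever.
            raise ValueError("indel at offset %d is shorter than its count"
                             % match.start())
        pileup = stripped

pileup = '...,$.$.$A,..A...,,,.,,...+5AGGCTA..-8GTCGGAAAT......,a,^F,^].^F,'
print(strip_indels(pileup))
# ...,$.$.$A,..A...,,,.,,...A..T......,a,^F,^].^F,
```

Passing count=1 to sub() removes one indel per iteration, which keeps the behavior close to the slicing loop above.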


Solution 3:

You can just use re.compile(r'[-+]\d+[ACGTacgtNn]+') to replace all the indels:

>>> import re
>>> REOBJ_RM_INDEL = re.compile(r'[-+]\d+[ACGTacgtNn]+')
>>> bases = "...,$.$.$A,..A...,,,.,,...+4AGGC...-5GTCGG......,a,^F,^].^F,"
>>> REOBJ_RM_INDEL.sub('', bases)
'...,$.$.$A,..A...,,,.,,............,a,^F,^].^F,'
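Note that this pattern is greedy in the same way Solution 2 describes: it gives the right answer here only because each indel is followed by "." rather than by a letter. One possible way to keep a single re.sub() call while still honoring the count is to pass a replacement function. This is only a sketch, and the names INDEL, _keep_tail and remove_indels are made up for illustration.

```python
import re

# Sign, count, then every base letter that follows (possibly too many).
INDEL = re.compile(r"[-+](\d+)([ACGTacgtNn]+)")

def _keep_tail(match):
    """Drop the first N letters (the indel) and keep any letters after them."""
    count = int(match.group(1))
    return match.group(2)[count:]

def remove_indels(bases):
    return INDEL.sub(_keep_tail, bases)

print(remove_indels(".....+5AGGCTA....."))
# .....A.....
```

The letters that the greedy match over-consumes are handed back by the function, so the real base after the indel survives.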
