
Multi-line Pattern Matching In Python

A periodic computer-generated message (simplified):

Hello user123,

- (604)7080900
- 152
- minutes

Regards

Using Python, how can I extract '(604)7080900', '152', 'minutes' (i.e.

Solution 1:

>>> import re
>>> x = """Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+", x)
[('(604)7080900', '152', 'minutes')]
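
The pattern above packs several steps into one string. As a reading aid, here is a minimal sketch of the same pattern spelled out with `re.VERBOSE`, assuming the message always contains exactly three dash-prefixed lines and that the last one is `minutes`. The names `MESSAGE_RE` and `extract_fields` are made up for this illustration and are not part of the original answer.

```python
import re

# Hypothetical names: MESSAGE_RE and extract_fields are illustrative only.
MESSAGE_RE = re.compile(
    r"""
    \n\n+                 # at least one blank line before the data
    -\s*(.*)\n            # first field (the phone number in the example)
    -\s*(.*)\n            # second field (the count in the example)
    -\s*(minutes)\s*\n    # third field, the literal unit
    \n+                   # at least one blank line after the data
    """,
    re.VERBOSE,
)


def extract_fields(text):
    """Return a list of (number, count, unit) tuples found in text."""
    return MESSAGE_RE.findall(text)


message = "Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n"
print(extract_fields(message))
```

Because the pattern requires the surrounding blank lines, text without such a block simply yields an empty list.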

Solution 2:

The simplest approach is to go over these lines (assuming you have a list of lines, or a file, or split the string into a list of lines) until you see an empty line (just '\n' when reading from a file). Then check that each following line starts with '- ' (using the startswith string method), slice that prefix off, and store the result, until you find another empty line. For example:

# If you have a single string, split it into lines.
L = s.splitlines()
# If you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)

# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break

# Now starts data.
data = []
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Slice off the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # Malformed data?
        raise ValueError("malformed line %r" % (line,))

Edited: Since you elaborated on what you want to do, here's an updated version of the loops. It no longer loops twice, but instead collects data until it encounters a 'bad' line, and either saves or discards the collected lines when it encounters a block separator. It doesn't need an explicit iterator, because it doesn't restart iteration, so you can just pass it a list (or any iterable) of lines:

def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = True
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = False
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = True
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data

And here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]
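
Since the function only iterates over its argument, it should also work on a file object. The sketch below illustrates that under a few assumptions: `io.StringIO` stands in for a real file, and `getblocks` is restated in condensed form (same behaviour as far as I can tell, but not the answer's exact code) so that the block runs on its own.

```python
import io


def getblocks(lines):
    """Condensed restatement of getblocks so this sketch is self-contained."""
    data, block, bad = [], [], True
    for line in lines:
        if not line.rstrip():
            # Block separator: keep the block only if no bad line was seen,
            # then start a fresh, not-yet-bad block.
            if not bad:
                data.append(block)
            bad, block = False, []
        elif not bad and line.startswith('- '):
            # Good line: store it without the '- ' prefix.
            block.append(line[2:].rstrip())
        else:
            # Any other line invalidates the current block.
            bad = True
    if not bad and block:
        data.append(block)
    return data


# io.StringIO stands in for a file opened with open(); iterating over it
# yields lines that still end in "\n", which rstrip() removes.
fake_file = io.StringIO(
    "hello\n\n- x1\n- x2\n- x3\n\n- x4\n\n- x6\nmorning\n- x7\n\nworld\n"
)
blocks = getblocks(fake_file)
print(blocks)  # [['x1', 'x2', 'x3'], ['x4']]
```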

Solution 3:

>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
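
Note that this pattern looks at each line in isolation, so with several messages in one string it returns one flat list. The following sketch shows that behaviour and one possible way to regroup the values, assuming every message carries exactly three fields. The two-message input and the name `records` are made up for illustration.

```python
import re

# Assumed input: two messages in one string, each with exactly three fields.
s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards

Hello user124,

- (604)8576576
- 345
- minutes

Regards
"""

# re.MULTILINE (re.M) makes ^ match at the start of every line, not only at
# the start of the whole string.
fields = re.findall(r'^- (.*)', s, re.MULTILINE)
print(fields)

# Regroup the flat list into tuples of three (only valid under the
# three-fields-per-message assumption stated above).
records = [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]
print(records)
```

If the number of fields varies between messages, a block-aware approach such as the loop in Solution 2 is probably a better fit.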

Solution 4:

s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

and have this:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
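
One caveat, offered as an observation rather than as part of the original answer: splitting on '-' also cuts any value that itself contains a hyphen. A line-based variant avoids that. The sketch below assumes each data line starts with '- ' and each message ends at a line starting with 'Regards'. The helper name `split_messages` and the hyphenated phone numbers are hypothetical.

```python
def split_messages(text, terminator='Regards'):
    """Group '- ' prefixed lines into one list per message.

    Hypothetical helper: a message is assumed to end at a line that starts
    with the terminator word.
    """
    result, current = [], []
    for line in text.splitlines():
        if line.startswith('- '):
            # Keep everything after the '- ' prefix, hyphens included.
            current.append(line[2:].strip())
        elif line.startswith(terminator):
            # End of one message: store its fields and start a new list.
            result.append(current)
            current = []
    return result


# Made-up sample data with hyphens inside the values.
text = """Hello user123,

- 604-708-0900
- 152
- minutes

Regards

Hello user124,

- 604-857-6576
- 345
- minutes
- seconds

Regards"""

grouped = split_messages(text)
print(grouped)
```

With this input, the hyphenated numbers stay in one piece, and no trailing empty list needs to be deleted.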
