
Split String On Punctuation Or Number In Python

I'm trying to split a string whenever I encounter a punctuation mark or a number, for example:

toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^

Solution 1:

Use re.split with a capture group, so that the delimiters are kept in the result:

import re
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result

Output:

['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']

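The empty string appears because the two adjacent 2s are treated as separate delimiters with nothing between them. To treat a run of consecutive digits or punctuation marks as a single delimiter, add + after the character class: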

result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result

Output:

['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
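
As a small sketch of how this could be wrapped for reuse: the helper name split_keep_delims below is hypothetical, not part of the answer. It uses the same character class with + and additionally drops the empty strings that re.split returns when the input starts or ends with a delimiter.

```python
import re

# Same character class as in the answer above; "+" groups runs of delimiters.
DELIMS = re.compile(r'([0-9,.?:;~!@#$%^&*()]+)')

def split_keep_delims(text):
    """Split text on digits/punctuation, keeping the delimiters as tokens."""
    # re.split yields '' at the edges when text starts or ends with a delimiter,
    # so filter those out.
    return [part for part in DELIMS.split(text) if part]

print(split_keep_delims('I2eat!Apples22becauseilike?Them'))
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
print(split_keep_delims('42apples!'))
# ['42', 'apples', '!']
```

Note that with +, a mixed run such as 2! comes back as one token, because digits and punctuation share the same character class.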

Solution 2:

You can tokenize a string like this into runs of digits, runs of letters, and runs of other characters that are neither whitespace, letters, nor digits, using:

re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)

Here,

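  • \d+ - one or more digits
  • (?:[^\w\s]|_)+ - one or more characters that are either an underscore or neither word nor whitespace characters (that is, punctuation and symbols)
  • [^\W\d_]+ - one or more Unicode letters (word characters excluding digits and the underscore).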
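
A minimal runnable sketch of the pattern above, applied to the string from the question. The names TOKEN_RE and to_split, and the second sample string, are chosen here for illustration only.

```python
import re

# Three alternatives: digit runs, punctuation/underscore runs, letter runs.
TOKEN_RE = re.compile(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+')

to_split = 'I2eat!Apples22becauseilike?Them'
tokens = TOKEN_RE.findall(to_split)
print(tokens)
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']

# Whitespace matches none of the alternatives, so it is simply skipped.
# Non-ASCII letters are matched because str patterns are Unicode-aware in Python 3.
unicode_tokens = TOKEN_RE.findall('héllo wörld_42')
print(unicode_tokens)
# ['héllo', 'wörld', '_', '42']
```

Because every group in the pattern is non-capturing, re.findall returns the full matches rather than group contents.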


A matching approach is more flexible than splitting because it also lets you tokenize more complex structures. For example, to keep decimal numbers such as 3.14 together as a single token, use \d+(?:\.\d+)? instead of \d+:

re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit) 
             ^^^^^^^^^^^^^
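
To make the difference concrete, here is a short comparison of the two patterns. The names INT_TOKENS and DEC_TOKENS and the input string are hypothetical, chosen only for this sketch.

```python
import re

# Original pattern: a decimal point is treated as ordinary punctuation.
INT_TOKENS = re.compile(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+')
# Variant: an optional fractional part keeps "3.50" in one token.
DEC_TOKENS = re.compile(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+')

sample = 'pay3.50now!'  # hypothetical input

int_result = INT_TOKENS.findall(sample)
dec_result = DEC_TOKENS.findall(sample)

print(int_result)
# ['pay', '3', '.', '50', 'now', '!']
print(dec_result)
# ['pay', '3.50', 'now', '!']
```

The numeric alternative has to come first in the pattern: alternatives are tried left to right, and the fractional part only binds to the number when a digit follows the dot.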


Solution 3:

Use re.split with a capture group that matches runs of ASCII letters, then discard the empty strings at both ends of the result:

>>> import re
>>> toSplit = 'I2eat!Apples22becauseilike?Them'
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
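
The join-then-split step removes the empty strings, but it would also break tokens apart at any whitespace in the input. As an alternative sketch (not part of the original answer), a list comprehension can filter out the empty strings directly:

```python
import re

to_split = 'I2eat!Apples22becauseilike?Them'

# Capture ASCII letter runs; the text between them comes back as separate items.
parts = re.split(r'([A-Za-z]+)', to_split)

# Drop the empty strings produced at the start and end of the result.
tokens = [p for p in parts if p]
print(tokens)
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']

# Everything between two letter runs stays together as one token.
mixed = [p for p in re.split(r'([A-Za-z]+)', 'a2!b') if p]
print(mixed)
# ['a', '2!', 'b']
```

Since the pattern only recognizes A-Z and a-z, accented or non-Latin letters would end up inside the non-letter tokens.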
