Split String On Punctuation Or Number In Python
I'm trying to split a string every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^
Solution 1:
Use re.split with a capture group, so that the separators are kept in the result:
import re
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want runs of consecutive digits or punctuation kept together as a single token, add +:
result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
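Typing the punctuation characters by hand is easy to get wrong. As a minimal sketch, assuming you want every ASCII punctuation character (string.punctuation, which is a broader set than the one in the answer above) and want empty strings dropped, the character class can be built programmatically. The helper name split_on_punct_or_digits is hypothetical:

```python
import re
import string

# Character class: digits plus every ASCII punctuation character.
# re.escape protects characters such as ']', '^', '-' and '\' inside the class.
PATTERN = re.compile('([0-9' + re.escape(string.punctuation) + ']+)')

def split_on_punct_or_digits(text):
    """Split text on runs of digits/punctuation, keeping the separators."""
    # re.split can return empty strings at the edges; filter them out.
    return [part for part in PATTERN.split(text) if part]

print(split_on_punct_or_digits('I2eat!Apples22becauseilike?Them'))
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
```

Note that, as with the + version above, a digit directly followed by punctuation (for example '2!') ends up in one token, because both belong to the same character class.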
Solution 2:
You can tokenize a string like yours into runs of digits, runs of letters, and runs of other characters (anything that is not whitespace, a letter, or a digit) using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars other than word and whitespace chars, or _
[^\W\d_]+ - 1+ Unicode letters.
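To see the three alternatives at work, here is a small sketch that applies the pattern to the string from the question. The names TOKEN_RE and to_split, and the second input string, are made up for illustration:

```python
import re

# digits | punctuation (including underscore) | letters
TOKEN_RE = re.compile(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+')

to_split = 'I2eat!Apples22becauseilike?Them'
tokens = TOKEN_RE.findall(to_split)
print(tokens)
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']

# Underscores count as punctuation here, and whitespace matches none of
# the alternatives, so it is simply skipped.
underscore_tokens = TOKEN_RE.findall('snake_case 9')
print(underscore_tokens)
# ['snake', '_', 'case', '9']
```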
A matching approach is more flexible than splitting because it also lets you tokenize more complex structures. Say you also want to keep decimal numbers (floats, doubles, ...) as single tokens. You just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
             ^^^^^^^^^^^^^
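The string from the question contains no decimal numbers, so the difference does not show up there. A short comparison on a hypothetical input (the sample string is an assumption, not from the question) might look like this:

```python
import re

sample = 'pi3.14is!close2'  # hypothetical input containing a decimal number

# Original pattern: the dot is treated as punctuation and splits the number.
basic = re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', sample)
print(basic)
# ['pi', '3', '.', '14', 'is', '!', 'close', '2']

# Extended pattern: an optional fractional part keeps '3.14' together.
decimal = re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', sample)
print(decimal)
# ['pi', '3.14', 'is', '!', 'close', '2']
```

The group (?:\.\d+)? is non-capturing, so re.findall still returns whole matches rather than group contents.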
Solution 3:
Use re.split to split wherever a run of ASCII letters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
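The join/split trick removes the empty strings, but it also re-splits on any whitespace inside the tokens. If the input may contain spaces, filtering the empty strings directly is one possible alternative. A sketch, using a hypothetical input with a space in it:

```python
import re

sample = 'a1 2b'  # hypothetical input containing a space

parts = re.split(r'([A-Za-z]+)', sample)
print(parts)
# ['', 'a', '1 2', 'b', '']

# Filtering keeps each non-letter run exactly as it appeared.
filtered = [p for p in parts if p]
print(filtered)
# ['a', '1 2', 'b']

# join/split also breaks the '1 2' run apart at the space.
rejoined = ' '.join(parts).split()
print(rejoined)
# ['a', '1', '2', 'b']
```

Which result is preferable depends on whether whitespace should act as an additional separator.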