How To Treat Numbers With Decimals Or Commas As One Word In CountVectorizer
Solution 1:
The default regex the tokenizer uses for the token_pattern parameter is:
token_pattern='(?u)\\b\\w\\w+\\b'
So a word is defined by a word boundary \b at the beginning and at the end, with \w\w+ in between: one alphanumeric character followed by one or more alphanumeric characters. Inside a regular (non-raw) Python string, the backslashes have to be escaped as \\.
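To see why this default pattern is the problem, you can run it through Python's re module directly (a minimal illustration; the pattern string is copied from the scikit-learn default):

```python
import re

# scikit-learn's default token_pattern for CountVectorizer
default_pattern = r"(?u)\b\w\w+\b"

text = "lightning strike 2.5 release 10,000x bet"
tokens = re.findall(default_pattern, text)
print(tokens)
# → ['lightning', 'strike', 'release', '10', '000x', 'bet']
```

"2.5" disappears entirely (each single digit is too short to match \w\w+), and "10,000x" is split at the comma into "10" and "000x".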
So you could change the token pattern to:
token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'
Explanation: [\\.,]? allows for an optional . or , inside the token. The regex for the first alphanumeric character, \w, has to be extended to \w+ so that numbers with more than one digit before the punctuation are matched as well.
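You can check the adjusted pattern against the same sample text with re alone before handing it to CountVectorizer:

```python
import re

# the adjusted token_pattern proposed above
custom_pattern = r"\b(\w+[\.,]?\w+)\b"

text = "lightning strike 2.5 release 10,000x bet"
print(re.findall(custom_pattern, text))
# → ['lightning', 'strike', '2.5', 'release', '10,000x', 'bet']
```

Now "2.5" and "10,000x" each survive as a single token. Note that a token must still be at least two characters long, matching the spirit of the default pattern.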
For your slightly adjusted example:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
print(pd.DataFrame(result, columns=cols))
Output:
   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1
Alternatively, you could modify the input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])  # 2.5 -> 2_5
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])    # 10,000 -> 10000

vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
print(pd.DataFrame(result, columns=cols))
Output:
   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1
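The same substitutions can also be wired into the vectorizer itself through CountVectorizer's preprocessor parameter, so the corpus does not have to be mutated in place. A minimal sketch, assuming a helper named normalize_numbers (the name is just illustrative); note that a custom preprocessor replaces the default preprocessing, so lowercasing has to be done explicitly:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def normalize_numbers(text):
    """Replace the decimal point with '_' and drop commas between digits."""
    text = re.sub(r"(\d+)\.(\d+)", r"\1_\2", text)
    text = re.sub(r"(\d+),(\d+)", r"\1\2", text)
    # a custom preprocessor overrides CountVectorizer's default lowercasing
    return text.lower()

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
vectorizer = CountVectorizer(preprocessor=normalize_numbers)
vectorizer.fit(corpus)
print(sorted(vectorizer.vocabulary_))
# → ['10000x', '2_5', 'am', 'bet', 'in', 'lightning', 'many', 'na',
#    're', 'release', 'spins', 'strike', 'there', 'userna']
```

This keeps the cleanup step attached to the vectorizer, which is convenient when the vectorizer is embedded in a scikit-learn Pipeline.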