
Unicode Category For Commas And Quotation Marks

I have this helper function that gets rid of control characters in XML text:

def remove_control_characters(s):
    # Remove control characters in XML text
    t = ''
    for ch in s:

Solution 1:

In Unicode, the general category of control characters is 'Cc', even if they have no name. unicodedata.category() returns the general category, as you can test for yourself in the Python console:

>>> unicodedata.category(u'\x00')
'Cc'

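For opening and closing typographic quotation marks (such as “ and ”), the categories are Pi and Pf; the ASCII comma and straight quotation mark are both in Po. Your example only tests the first character of the returned code, so try comparing the full code instead: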

cat = unicodedata.category(ch)
if cat == "Cc" or cat == "Pi" or cat == "Pf":
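For reference, here is a minimal sketch of the full-category test as a complete Python 3 function. The name strip_categories and its default tuple are hypothetical, and it assumes that unwanted characters should simply be dropped, which the truncated question does not confirm.

```python
import unicodedata

def strip_categories(s, unwanted=("Cc", "Pi", "Pf")):
    """Drop every character whose full general category is in `unwanted`.

    Hypothetical helper: the name and the default categories are
    assumptions made for illustration, not part of the original question.
    """
    return "".join(ch for ch in s if unicodedata.category(ch) not in unwanted)

# NUL is Cc, the curly quotes are Pi/Pf; the comma and straight quotes are Po.
sample = "a\x00b \u201cquoted\u201d, \"plain\""
cleaned = strip_categories(sample)
print(repr(cleaned))
```

Because the comma and the straight quotation mark are in Po, they pass through this filter untouched; add "Po" to the tuple only if you really want to drop all "other" punctuation.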

Solution 2:

According to the latest Unicode data file, UnicodeData.txt, the comma and the quotation mark are both in the Punctuation, Other category (Po):

002C;COMMA;Po;0;CS;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;

So, based on your question, your code should be something like this:

o = [c if unicodedata.category(c) != 'Cc' else ' '
     for c in xml if unicodedata.category(c) != 'Po']

return "".join(o)
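The fragment above can be wrapped into a runnable function. This is only a sketch under the same assumptions as the answer (control characters become a space, Po characters are dropped); the function name clean_xml_text is hypothetical.

```python
import unicodedata

def clean_xml_text(xml):
    """Replace control characters (Cc) with a space and drop Po punctuation.

    Hypothetical name; the behavior mirrors the list comprehension in the answer.
    """
    out = []
    for c in xml:
        cat = unicodedata.category(c)
        if cat == "Po":
            continue                      # drops , " . ! ? : ; and similar
        out.append(" " if cat == "Cc" else c)
    return "".join(out)

result = clean_xml_text('a,b\t"c" d.')
print(repr(result))
```

Two things to keep in mind: Po is broad and also covers the full stop, exclamation mark, colon and others, and tab and newline are themselves Cc, so they are turned into spaces. If only the comma and the quotation mark should go, testing against an explicit set of characters may be a safer choice.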

If you want to find the category of any other Unicode character and do not want to deal with the UnicodeData.txt file, you can just print it with print(c, unicodedata.category(c)).
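As a quick illustration, the loop below prints the category of a few characters relevant to this question. The selection of sample characters is my own; unicodedata.name() is given a default because control characters have no name.

```python
import unicodedata

# Comma, straight quote, curly quotes, guillemets, and NUL.
samples = [",", '"', "\u201c", "\u201d", "\u00ab", "\u00bb", "\x00"]

categories = {}
for c in samples:
    code_point = f"U+{ord(c):04X}"
    categories[code_point] = unicodedata.category(c)
    # name() raises ValueError for unnamed characters unless a default is given.
    print(code_point, categories[code_point], unicodedata.name(c, "<unnamed>"))
```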
