Does Python Re (regex) Have An Alternative To \u Unicode Escape Sequences?
Python treats \uxxxx as a unicode character escape inside a string literal (e.g. u'\u2014' gets interpreted as Unicode character U+2014). But I just discovered (Python 2.7) that st
Solution 1:
Use the unichr()
function to create unicode characters from a codepoint:
pattern = u"%s$" % unichr(codepoint)
Solution 2:
One possibility is, rather than call re methods directly, wrap them in something that can understand \u escapes on their behalf. Something like this:
defmy_re_search(pattern, s):
return re.search(unicode_unescape(pattern), s)
defunicode_unescape(s):
"""
Turn \uxxxx escapes into actual unicode characters
"""defunescape_one_match(matchObj):
escape_seq = matchObj.group(0)
return escape_seq.decode('unicode_escape')
return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)
Example of it working:
pat = r"C:\\.*\u20ac"# U+20ac is the euro sign>>> print pat
C:\\.*\u20ac
path = ur"C:\reports\twenty\u20acplan.txt">>> print path
C:\reports\twenty€plan.txt
# Underlying re.search method fails to find a match>>> re.search(pat, path) != NoneFalse# Vs this:>>> my_re_search(pat, path) != NoneTrue
Thanks to Process escape sequences in a string in Python for pointing out the decode("unicode_escape") idea.
But note that you can't just throw your whole pattern through decode("unicode_escape"). It will work some of the time (because most regex special characters don't change their meaning when you put a backslash in front), but it won't work in general. For example, here using decode("unicode_escape") alters the meaning of the regex:
pat = r"C:\\.*\u20ac"# U+20ac is the euro sign>>> print pat
C:\\.*\u20ac # Asks for a literal backslash
pat_revised = pat.decode("unicode_escape")
>>> print pat_revised
C:\.*€ # Asks for a literal period (without a backslash)
Post a Comment for "Does Python Re (regex) Have An Alternative To \u Unicode Escape Sequences?"