
Does Python Re (regex) Have An Alternative To \u Unicode Escape Sequences?

Python treats \uxxxx as a unicode character escape inside a string literal (e.g. u'\u2014' gets interpreted as Unicode character U+2014). But I just discovered (Python 2.7) that st

Solution 1:

Use the unichr() function to create unicode characters from a codepoint:

pattern = u"%s$" % unichr(codepoint)
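As a rough sketch of the same idea on Python 3, where unichr() was folded into chr(), the pattern can be built as below. The codepoint value and the sample strings are assumptions chosen for illustration, and re.escape() is added only as a precaution in case the character happens to be a regex metacharacter.

```python
import re

codepoint = 0x20AC  # euro sign; an illustrative value, not from the question

# chr() is the Python 3 spelling of Python 2's unichr().
# re.escape() guards against codepoints that are regex metacharacters.
pattern = "%s$" % re.escape(chr(codepoint))

ends_with_euro = re.search(pattern, "price in \u20ac") is not None
ends_with_dollar = re.search(pattern, "price in $") is not None
print(ends_with_euro, ends_with_dollar)
```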

Solution 2:

One possibility is, rather than call re methods directly, wrap them in something that can understand \u escapes on their behalf. Something like this:

import re

def my_re_search(pattern, s):
    # Expand \uxxxx escapes first, then hand the pattern to re.search
    return re.search(unicode_unescape(pattern), s)

def unicode_unescape(s):
    """
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        # Decode only the matched \uxxxx sequence, nothing else in the pattern
        escape_seq = matchObj.group(0)
        return escape_seq.decode('unicode_escape')
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)

Example of it working:

>>> pat = r"C:\\.*\u20ac"  # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac

>>> path = ur"C:\reports\twenty\u20acplan.txt"
>>> print path
C:\reports\twenty€plan.txt

# Underlying re.search method fails to find a match
>>> re.search(pat, path) != None
False
# Vs this:
>>> my_re_search(pat, path) != None
True
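For readers on Python 3, where str has no decode() method, the wrapper might be sketched as follows. This is an adaptation under that assumption rather than the answer's own code: it converts each escape with chr(int(..., 16)) instead of the unicode_escape codec, and the compiled-pattern name _U_ESCAPE is hypothetical.

```python
import re

# Matches a backslash, a "u", and exactly four hex digits (captured)
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unicode_unescape(s):
    # Replace each \uXXXX escape with the character it names; leave the rest alone
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

def my_re_search(pattern, s):
    return re.search(unicode_unescape(pattern), s)

pat = r"C:\\.*\u20ac"                        # escaped backslash, any chars, euro sign
path = "C:\\reports\\twenty\u20acplan.txt"   # C:\reports\twenty€plan.txt

expanded = unicode_unescape(pat)
found = my_re_search(pat, path) is not None
print(expanded, found)
```

On Python 3.3 and later the re module recognizes \uXXXX in patterns by itself, so re.search(pat, path) already matches there and a wrapper like this matters mainly on Python 2. Like the original, this sketch does not check whether the backslash before the u is itself escaped.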

Thanks to the question "Process escape sequences in a string in Python" for pointing out the decode("unicode_escape") idea.

But note that you can't just throw your whole pattern through decode("unicode_escape"). It will work some of the time (because the codec leaves backslash sequences it doesn't recognize, such as \. or \d, untouched), but it won't work in general, since it also rewrites the sequences it does recognize: \\ collapses to a single backslash, for instance. For example, here using decode("unicode_escape") alters the meaning of the regex:

>>> pat = r"C:\\.*\u20ac"  # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac     # Asks for a literal backslash

>>> pat_revised = pat.decode("unicode_escape")
>>> print pat_revised
C:\.*€     # Asks for a literal period (without a backslash)
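The change in meaning can be checked by running both versions of the pattern against sample input. The sketch below uses Python 3 spelling, encoding the ASCII-only pattern to bytes before decoding because str has no decode() there; the sample strings path and dots are assumptions made up for the comparison.

```python
import re

pat = r"C:\\.*\u20ac"

# Whole-pattern decode: \\ collapses to a single backslash, which then escapes the dot.
# (Assumes the pattern is pure ASCII, so encode("ascii") is safe.)
pat_revised = pat.encode("ascii").decode("unicode_escape")

# Selective conversion: only the \uXXXX escape is replaced
pat_selective = re.sub(r"\\u([0-9a-fA-F]{4})",
                       lambda m: chr(int(m.group(1), 16)), pat)

path = "C:\\reports\\twenty\u20acplan.txt"   # C:\reports\twenty€plan.txt
dots = "C:...\u20ac"                         # C: followed by literal periods

revised_hits_path = re.search(pat_revised, path) is not None
revised_hits_dots = re.search(pat_revised, dots) is not None
selective_hits_path = re.search(pat_selective, path) is not None
print(revised_hits_path, revised_hits_dots, selective_hits_path)
```

The fully decoded pattern no longer matches the Windows path and instead matches a run of literal periods, while the selectively converted pattern still matches the path.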
