
How To Decode Unicode String That Is Read From A File In Python?

I have a file containing UTF-16 strings. When I read it, extra quotes are added and the string looks like "b'\\xff\\xfeA\\x00'". The inbuilt .decode function

Solution 1:

Try this:

str.encode().decode()

Solution 2:

It looks like the file has been created by writing bytes literals to it, something like this:

some_bytes = b'Hello world'
with open('myfile.txt', 'w') as f:
    f.write(str(some_bytes))

This gets around the fact that attempting to write bytes to a file opened in text mode raises an error, but at the cost that the file now contains "b'Hello world'" (note the 'b' inside the quotes).
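For a file that has already been written this way and cannot be regenerated, the text can usually be recovered by parsing the literal back into bytes and then decoding it. The following is a minimal sketch, assuming the file holds exactly one bytes literal encoded as UTF-16; the helper name recover_text is hypothetical and only for illustration.

```python
import ast


def recover_text(literal: str, encoding: str = "utf-16") -> str:
    """Parse a stringified bytes literal and decode the resulting bytes."""
    # ast.literal_eval only evaluates literals, so it is safe on untrusted text.
    raw = ast.literal_eval(literal.strip())
    if not isinstance(raw, bytes):
        raise ValueError("expected a bytes literal")
    return raw.decode(encoding)


# Simulate what the broken file contains: the str() of some UTF-16 bytes.
file_contents = str("A".encode("utf-16"))
recovered = recover_text(file_contents)
print(recovered)
```

This only repairs the symptom; fixing the code that writes the file is still the better option.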

The solution is to decode the bytes to str before writing:

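some_bytes = 'Hello world'.encode('utf-16')  # example bytes that really are UTF-16
my_str = some_bytes.decode('utf-16') # or whatever the encoding of the bytes might be
with open('myfile.txt', 'w', encoding='utf-16') as f:  # write with an explicit encoding
    f.write(my_str)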

Or open the file in binary mode and write the bytes directly:

some_bytes = b'Hello world'
with open('myfile.txt', 'wb') as f:
    f.write(some_bytes)

Note that you will need to provide the correct encoding whenever you open the file in text mode, for example when reading it back:

with open('myfile.txt', encoding='utf-16') as f:  # Be sure to use the correct encoding
    text = f.read()
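To check that the encoding matches on both sides, a round trip can be sketched as follows. This is an illustrative example only: it writes to a temporary directory rather than myfile.txt, and the variable names are assumptions.

```python
import os
import tempfile

text = "Hello world"
# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# Write in text mode with an explicit encoding...
with open(path, "w", encoding="utf-16") as f:
    f.write(text)

# ...and read it back in text mode with the same encoding.
with open(path, encoding="utf-16") as f:
    round_tripped = f.read()

# Binary mode shows the raw bytes, which start with a byte order mark.
with open(path, "rb") as f:
    raw = f.read()

print(round_tripped == text)
print(raw[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

If the two encodings differ, the read typically either raises UnicodeDecodeError or silently returns garbled text, which is why the encoding should be stated explicitly on both sides.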

Consider running Python with the -b or -bb flag to detect attempts to stringify bytes: -b issues a warning and -bb raises an exception whenever str() is called on a bytes object.
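The effect of the flag can be observed by running the same one-line snippet in a child interpreter with and without -bb. This is a rough sketch, assuming a Python 3 interpreter that supports these flags; the variable names are illustrative.

```python
import subprocess
import sys

snippet = "str(b'Hello world')"

# Without the flag, stringifying bytes succeeds silently.
plain = subprocess.run(
    [sys.executable, "-c", snippet], capture_output=True, text=True
)

# With -bb, the same call raises BytesWarning as an exception.
strict = subprocess.run(
    [sys.executable, "-bb", "-c", snippet], capture_output=True, text=True
)

print(plain.returncode)
print(strict.returncode)
print("BytesWarning" in strict.stderr)
```

Running a test suite under -bb is a cheap way to find places where bytes are accidentally passed to str() or written to a text-mode file.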

