
Pandas Read_csv Fix Columns To Read Data With Newline Characters In Data

Using pandas to read in a large tab-delimited file:

df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, na_values='')

The problem is that th

Solution 1:

The idea is to use a regular expression to find every run of text that contains a fixed number of tabs and ends in a newline (or at the end of the file). Each match is one row, so splitting the matches on tabs gives the data for a DataFrame.

import pandas as pd
import re

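def wonky_parser(fn):
    # Read the whole file so the regex can match across line breaks
    with open(fn) as f:
        txt = f.read()
    # A row is 8 tabs' worth of fields (9 columns) ending in a newline or end of file.
    # Change the 8 below to match the number of tabs per row in your file.
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the full match; drop its trailing newline, then split into fields
    parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
    return pd.DataFrame(parsed)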

Pass a filename to the function and get your dataframe back.
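If hard-coding the tab count is inconvenient, one possible variation is to read it from the header line. The sketch below is not the answer's code: `split_rows` and the sample text are made up for illustration, and it assumes the header line itself contains no embedded newlines. It works on a string and returns plain lists so the row splitting can be checked on its own.

```python
import re

def split_rows(text):
    """Split tab-delimited text into rows, tolerating newlines inside fields.

    Assumes the header line itself contains no embedded newlines.
    """
    # Count the tabs in the header to learn how many tabs each row must have.
    n_tabs = text.split('\n', 1)[0].count('\t')
    # A row is n_tabs "field, tab, field" steps, ended by a newline or end of text.
    pattern = re.compile(r'((?:[^\t]*\t[^\t]*){%d})(?:\n|\Z)' % n_tabs)
    # Strip the newline the last row may keep, then split each row into fields.
    return [m.group(1).rstrip('\n').split('\t') for m in pattern.finditer(text)]

# Hypothetical sample: the "note" field of the first data row contains a newline.
sample = "id\tnote\tcity\n1\tfirst line\nsecond line\tParis\n2\tplain\tRome\n"
rows = split_rows(sample)
header, records = rows[0], rows[1:]
```

From there, `pd.DataFrame(records, columns=header)` should give a frame with named columns. Like the original, this approach cannot tell an embedded newline from a row break when it falls in the first or last field of a row.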

Solution 2:

Name your third column:

df.columns.values[2] = "some_name"

Then use the converters argument to apply a cleaning function to that column as the file is read:

pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})

Any function that takes the raw field value and returns the cleaned value can be used in place of the lambda.
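As a rough illustration of the converters approach, the sketch below reads an in-memory sample instead of a file. The sample data and column names are invented, and it assumes the field containing the newline is quoted, so the parser keeps it in one field. Converters run after the parser has already split the text into rows, so they only clean newlines that survive inside a single field. The `dtype` here is given per column because pandas warns when a column has both a converter and a dtype.

```python
import io
import pandas as pd

# Hypothetical sample: the quoted "some_name" field in row 1 contains a newline.
data = 'id\tsome_name\tcity\n1\t"first\nsecond"\tParis\n2\tplain\tRome\n'

df = pd.read_csv(
    io.StringIO(data),
    sep='\t',
    dtype={'id': str, 'city': str},  # keep the other columns as strings
    keep_default_na=False,
    # Replace the embedded newline with a space as the column is parsed
    converters={'some_name': lambda x: x.replace('\n', ' ')},
)
```

Here `df['some_name']` ends up holding `first second` and `plain`, with the row count unchanged at two.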
