Pandas Read_csv Fix Columns To Read Data With Newline Characters In Data
Using pandas to read in a large tab-delimited file:
df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, na_values='')
The problem is that th
Solution 1:
The idea is to use a regex to find every run of text that contains a given number of tabs and ends in a newline (or at the end of the file), then build a dataframe from those matches.
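import re
import pandas as pd

def wonky_parser(fn):
    with open(fn) as f:
        txt = f.read()
    # {8} is the number of tabs per record (9 columns); change it to match your file.
    # [^\t]* also matches newlines, so a newline inside a field stays in that field.
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the whole record; drop its trailing newline before splitting on tabs.
    parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
    return pd.DataFrame(parsed)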
Pass a filename to the function and get your dataframe back.
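To see the tab-counting idea on something small, here is a minimal sketch that works on an in-memory string and assumes 2 tabs per record instead of 8. The helper name `split_records` and the sample data are made up for illustration; the first record is treated as the header, which is an assumption about the file layout.

```python
import re
import pandas as pd

def split_records(text, n_tabs):
    """Return one list of fields per record, where a record has exactly n_tabs tabs."""
    # [^\t]* can span newlines, so a newline inside a field does not end the record.
    pattern = r'(?:[^\t]*\t[^\t]*){%d}(?:\n|\Z)' % n_tabs
    return [m.rstrip('\n').split('\t') for m in re.findall(pattern, text)]

# Hypothetical sample: the "note" field of the first data row contains a newline.
sample = "id\tnote\tcity\n1\tline one\nline two\tParis\n2\tplain\tRome\n"

header, *rows = split_records(sample, n_tabs=2)
df = pd.DataFrame(rows, columns=header)
print(df)
```

The approach relies on every record having the same number of tabs and on no field containing a tab itself; if either assumption fails, records will be split in the wrong places.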
Solution 2:
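Give your third column a name:
df.columns.values[2] = "some_name"
then use converters to pass a cleaning function for that column. The key must match the column's name in the file header (an integer column position also works as a key).
pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})
You can put any string-manipulating function that works for your data in place of the lambda. pandas may warn that the converter takes precedence over dtype for that column.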
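As a rough illustration of the converters approach, the sketch below reads a small in-memory sample through `io.StringIO`. The sample data is made up, and it assumes the embedded newline sits inside a quoted field: converters run only after the parser has split the rows, so they can clean a field that was read intact but cannot repair rows that an unquoted newline has already broken apart. Replacing the newline with a space instead of an empty string is just a choice for readability.

```python
import io
import pandas as pd

# Hypothetical sample: the newline in "some_name" is protected by double quotes.
raw = 'id\tsome_name\tcity\n1\t"line one\nline two"\tParis\n2\tplain\tRome\n'

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',
    keep_default_na=False,
    # The converter receives each raw string value of the named column.
    converters={'some_name': lambda x: x.replace('\n', ' ')},
)
print(df)
```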