Joining all rows of a csv file that have the same 1st column value
Solution 1:
import csv
from itertools import izip_longest  # Python 2 only; Python 3 renamed it zip_longest

def merge_rows(a, b):
    # take the first non-empty value in each column position
    return [x or y for x, y in izip_longest(a, b, fillvalue='')]

def main():
    data = {}
    with open("infile.csv", "rb") as inf:
        incsv = csv.reader(inf, delimiter=";")
        header = next(incsv, [])
        for row in incsv:
            label = row[0]
            try:
                data[label] = merge_rows(data[label], row)
            except KeyError:
                data[label] = row

    # write data in sorted order by label
    # (iterating a dict yields its keys, so this works in Python 2 and 3)
    keys = sorted(data, key=lambda k: int(k))
    with open("outfile.csv", "wb") as outf:
        outcsv = csv.writer(outf, delimiter=";")
        outcsv.writerow(header)
        outcsv.writerows(data[key] for key in keys)

if __name__=="__main__":
    main()
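The code above is written for Python 2. On Python 3 a few things change: izip_longest is named zip_longest, and the csv module expects files opened in text mode with newline="" rather than "rb"/"wb". A minimal sketch of a port, under those assumptions, might look like this. The merge_csv name, its parameters, and the blank-line guard are illustrative additions, not part of the code above.

```python
import csv
from itertools import zip_longest  # Python 3 name for izip_longest


def merge_rows(a, b):
    # Keep the first non-empty value in each column position.
    return [x or y for x, y in zip_longest(a, b, fillvalue="")]


def merge_csv(in_path, out_path, delimiter=";"):
    # Hypothetical wrapper: same logic as main(), with the paths as parameters.
    data = {}
    # Python 3: open in text mode with newline="" for the csv module.
    with open(in_path, newline="") as inf:
        incsv = csv.reader(inf, delimiter=delimiter)
        header = next(incsv, [])
        for row in incsv:
            if not row:  # added guard: skip completely blank lines
                continue
            label = row[0]
            if label in data:
                data[label] = merge_rows(data[label], row)
            else:
                data[label] = row

    # Write the merged rows in numeric order of their label.
    keys = sorted(data, key=int)
    with open(out_path, "w", newline="") as outf:
        outcsv = csv.writer(outf, delimiter=delimiter)
        outcsv.writerow(header)
        outcsv.writerows(data[key] for key in keys)
```

Calling merge_csv("infile.csv", "outfile.csv") should then behave like main() does under Python 2.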
Edit: I made a few mods based on your sample data:
added a delimiter=";" argument to the csv reader and writer
added code to read and write the header
added a key clause so sort order is numeric, not lexicographic
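To see why the key clause matters, here is a small sketch with made-up labels (the values are illustrative only). The csv reader hands back strings, and strings sort character by character unless you tell sorted to compare them as integers.

```python
labels = ["10", "2", "1", "21"]  # made-up labels, as strings like csv.reader returns

lexicographic = sorted(labels)      # compares character by character
numeric = sorted(labels, key=int)   # compares the integer value of each label

print(lexicographic)  # ['1', '10', '2', '21']
print(numeric)        # ['1', '2', '10', '21']
```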
How it works:
for row in incsv: For each row in the data file, we get a list, something like ["0", "0", "", "", "", "", "", "", "", "", "", "", "-1.0", "0", "0", "-1", "0"]. Then label = row[0] gives label a value of "0" (your desired first-column value), and we look for data[label], a combined row from all preexisting rows having that label.
If that combined row already exists, we merge the new row into it (stored_row = merge_rows(stored_row, new_row)); otherwise it is created with the new row value (["0", "0", "", "", "", "", "", "", ...]). So effectively merge_rows is called for every occurrence of each label except the first time it appears.
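The accumulation step can be tried on its own with a few in-memory rows. This is a sketch in Python 3 (so zip_longest instead of izip_longest); the sample rows and the merge_calls counter are made up for illustration.

```python
from itertools import zip_longest


def merge_rows(a, b):
    return [x or y for x, y in zip_longest(a, b, fillvalue="")]


# Made-up sample rows: label "0" appears three times, label "1" once.
rows = [
    ["0", "a", ""],
    ["1", "", "b"],
    ["0", "", "c"],
    ["0", "", ""],
]

data = {}
merge_calls = 0  # illustrative counter, not in the original code
for row in rows:
    label = row[0]
    if label in data:
        # Label seen before: fold this row into the stored one.
        data[label] = merge_rows(data[label], row)
        merge_calls += 1
    else:
        # First time we see this label: store the row as is.
        data[label] = row

print(data)         # {'0': ['0', 'a', 'c'], '1': ['1', '', 'b']}
print(merge_calls)  # 2
```

Four rows with two distinct labels give two merges, one for each repeat occurrence.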
merge_rows takes a pair of lists and combines them. izip_longest returns corresponding entries, i.e. izip_longest([0, 1, 2], ["a", "b", "c"]) yields (0, "a"), (1, "b"), (2, "c"). If one list is shorter than the other, it pads it with fillvalue to match the length of the longest list it received. x and y get assigned the corresponding value from each list, and we or them together because or combines them the way you want: an empty string is falsy, so x or y gives the first non-empty value (('' or '1') == '1', ('1' or '') == '1', ('' or '') == ''). It then takes all the resulting values and returns them as a list, which is the resulting combined row.
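A quick way to convince yourself is to run the two pieces separately. The sketch below uses Python 3's zip_longest (the same function as izip_longest in Python 2) with made-up inputs of unequal length.

```python
from itertools import zip_longest

# Padding: the shorter list is extended with fillvalue.
pairs = list(zip_longest([0, 1, 2], ["a", "b"], fillvalue=""))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, '')]


def merge_rows(a, b):
    return [x or y for x, y in zip_longest(a, b, fillvalue="")]


# Merging: per column, the first non-empty value wins.
merged = merge_rows(["0", "", "5"], ["0", "7", "9", "3"])
print(merged)  # ['0', '7', '5', '3']
```

Note that when both rows have a non-empty value in the same column ("5" and "9" here), the value from the earlier row is kept. Also, the string "0" counts as non-empty, so it is not overwritten.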
Hope that helps.