
Joining all rows of a CSV file that have the same 1st column value

OK, major example update needed. I have exactly this: Joining all rows of a CSV file that have the same 1st column value in Python (first I must apologize for not getting how to ju

Solution 1:

import csv
from itertools import izip_longest    # Python 2 name; in Python 3 it is itertools.zip_longest

def merge_rows(a, b):
    return [x or y for x,y in izip_longest(a, b, fillvalue='')]

def main():
    data = {}

    # Python 2 opens csv files in binary mode; in Python 3 use open("infile.csv", newline="")
    with open("infile.csv", "rb") as inf:
        incsv = csv.reader(inf, delimiter=";")
        header = next(incsv, [])
        for row in incsv:
            label = row[0]
            try:
                data[label] = merge_rows(data[label], row)
            except KeyError:
                data[label] = row

    # write data in sorted order by label (numeric, not lexicographic);
    # iterating a dict yields its keys, so this works in Python 2 and 3
    keys = sorted(data, key=int)

    # Python 2 binary mode; in Python 3 use open("outfile.csv", "w", newline="")
    with open("outfile.csv", "wb") as outf:
        outcsv = csv.writer(outf, delimiter=";")
        outcsv.writerow(header)
        outcsv.writerows(data[key] for key in keys)

if __name__=="__main__":
    main()

Edit: I made a few mods based on your sample data:

  1. added a delimiter=";" argument to the csv reader and writer

  2. added code to read and write the header

  3. added a key clause so sort order is numeric, not lexicographic
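To see why the key clause in point 3 matters, here is a small sketch with made-up labels (not taken from your file). Labels read from a CSV are strings, so a plain sort compares them character by character:

```python
# Hypothetical labels, as csv.reader would return them (strings, not ints)
labels = ["10", "9", "2", "1"]

lexicographic = sorted(labels)        # string comparison: "10" sorts before "2"
numeric = sorted(labels, key=int)     # compares the integer values instead

print(lexicographic)  # ['1', '10', '2', '9']
print(numeric)        # ['1', '2', '9', '10']
```

Note that key=int raises ValueError if a label is not a valid integer, so this assumes the first column is always numeric.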

How it works:

for row in incsv: for each row in the data file we get a list of strings, something like ["0", "0", "", "", "", "", "", "", "", "", "", "", "-1.0", "0", "0", "-1", "0"]. Then label = row[0] gives label a value of "0", the first-column value you want to group on, and we look up data[label], the combined row built from all earlier rows having that label.
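A quick way to check what the reader yields, written for Python 3 and using a made-up two-line sample in place of infile.csv (the column names are assumptions, not your real header):

```python
import csv
import io

# Hypothetical stand-in for infile.csv: a header line and one data line
sample = "id;a;b\n0;;-1.0\n"

reader = csv.reader(io.StringIO(sample), delimiter=";")
header = next(reader, [])   # first line is the header
row = next(reader)          # each data line becomes a list of strings
label = row[0]              # the first-column value used as the grouping key

print(header)  # ['id', 'a', 'b']
print(row)     # ['0', '', '-1.0']
print(label)   # 0
```

Empty cells come back as empty strings, which is what lets the merge step treat them as "nothing here yet".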

If that combined row already exists, we merge the new row into it (data[label] = merge_rows(data[label], row)); otherwise the lookup raises KeyError and the entry is created from the new row (["0", "0", "", "", "", "", "", "", etc.]). So effectively merge_rows is called for every occurrence of a label except the first.
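The grouping step can be tried on its own without any files. This is a Python 3 sketch; group_rows is a hypothetical helper name wrapping the same loop as main(), and the three rows are invented:

```python
from itertools import zip_longest  # Python 3 name for izip_longest

def merge_rows(a, b):
    # Cell by cell, keep the first non-empty value
    return [x or y for x, y in zip_longest(a, b, fillvalue="")]

def group_rows(rows):
    """Combine rows that share the same first-column value (hypothetical helper)."""
    data = {}
    for row in rows:
        label = row[0]
        try:
            # data[label] raises KeyError the first time a label is seen
            data[label] = merge_rows(data[label], row)
        except KeyError:
            data[label] = row
    return data

rows = [
    ["0", "5", "", ""],
    ["1", "", "7", ""],
    ["0", "", "", "9"],
]
grouped = group_rows(rows)
print(grouped["0"])  # ['0', '5', '', '9']
print(grouped["1"])  # ['1', '', '7', '']
```

The two rows labelled "0" end up as one combined row, while "1" appears only once and is stored unchanged.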

merge_rows takes a pair of lists and combines them. izip_longest returns corresponding entries, i.e. izip_longest([0, 1, 2], ["a", "b", "c"]) gives (0, "a"), (1, "b"), (2, "c"). If one list is shorter than the other, it pads it with fillvalue to match the length of the longest list it received. x and y get assigned the corresponding value from each list, and we or them together because or returns the first operand if it is non-empty and the second otherwise, which combines them the way you want ('' or '1' == '1', '1' or '' == '1', '' or '' == ''). If both cells are non-empty, the value from the earlier row is kept. It then takes all the resulting values and returns them as a list: the resulting combined row.
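Here is the padding and the or step pulled apart, as a Python 3 sketch (zip_longest is the Python 3 name for izip_longest; the lists are invented):

```python
from itertools import zip_longest

# The shorter list is padded with fillvalue so every cell has a partner
pairs = list(zip_longest(["0", "a"], ["0", "", "c"], fillvalue=""))
print(pairs)     # [('0', '0'), ('a', ''), ('', 'c')]

# "or" keeps x when it is non-empty, otherwise falls back to y
merged = [x or y for x, y in pairs]
print(merged)    # ['0', 'a', 'c']

# When both cells hold a value, the one from the first list wins
conflict = [x or y for x, y in zip_longest(["0", "old"], ["0", "new"], fillvalue="")]
print(conflict)  # ['0', 'old']
```

The last case is worth knowing about: if two rows with the same label disagree in a column, the later value is silently dropped.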

Hope that helps.

