Converting Unordered List Of Tuples To Pandas Dataframe

July 30, 2023

I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e

Solution 1:

I'm not sure there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records? I still don't think this structure would be directly compatible, and from_items has since been removed from pandas.)

Here's a bit of manipulation to get what you're looking for:

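import pandas as pd

# `info` is assumed to be a list with one usaddress.parse() result per
# address, i.e. one list of (token, tag) tuples per row.

# Column names come from the tags of the first row only.
cols = [tag for _, tag in info[0]]

# Could use a nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([token for token, _ in row])

pd.DataFrame(info2, columns=cols)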

  AddressNumber    StreetName StreetNamePostType StreetNamePostDirectional   PlaceName StateName ZipCode
0           123  Pennsylvania                Ave                        NW  Washington        DC   20008
1           652          Polk                 St                       San  Francisco,        CA   94102
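
Because the column names above are taken from the first row only, tokens from rows whose tags differ end up under the wrong headers (in the second row, San lands under StreetNamePostDirectional). One possible way around this is to turn each row into a dict keyed by tag and let pandas line the columns up. The following is only a sketch: the `info` sample is a hand-written stand-in shaped like the parser output, and `row_to_record` is a hypothetical helper name, not part of pandas or usaddress.

```python
import pandas as pd

# Hypothetical sample shaped like a list of usaddress.parse() results:
# one list of (token, tag) tuples per address. Tags may repeat and may
# differ from row to row.
info = [
    [("123", "AddressNumber"), ("Pennsylvania", "StreetName"),
     ("Ave", "StreetNamePostType"), ("NW", "StreetNamePostDirectional"),
     ("Washington", "PlaceName"), ("DC", "StateName"), ("20008", "ZipCode")],
    [("652", "AddressNumber"), ("Polk", "StreetName"),
     ("St", "StreetNamePostType"), ("San", "PlaceName"),
     ("Francisco,", "PlaceName"), ("CA", "StateName"), ("94102", "ZipCode")],
]


def row_to_record(row):
    """Collapse one list of (token, tag) tuples into a {tag: text} dict."""
    record = {}
    for token, tag in row:
        # Tokens sharing a tag (e.g. two PlaceName tokens) are joined
        # with a single space, in the order they appear.
        record[tag] = f"{record[tag]} {token}" if tag in record else token
    return record


records = [row_to_record(row) for row in info]

# A list of dicts gives one column per distinct tag; tags missing from a
# row become NaN in that row.
df = pd.DataFrame(records)
print(df)
```

With this layout the order of the tuples within a row no longer matters, and a row that lacks a tag simply gets a missing value in that column.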

Solution 2:

Thank you for your responses! I ended up doing a completely different workaround as follows:

I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!

import re

import pandas as pd
import usaddress

parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
'SubaddressType','PlaceName','StateName','ZipCode']

addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']

df = pd.DataFrame({'Addresses': addr})
# Add one empty column per tag; the result has to be assigned back to df.
df = df.reindex(columns=['Addresses'] + parse_tags)

Then I created a new column called "Info" that holds the usaddress parse list converted to a string:

df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))

Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!

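for colname in parse_tags:
    # Raw string avoids invalid-escape warnings; only the first token per tag is kept.
    pattern = re.compile(r"\('(\S+)', '{}'\)".format(colname))
    # Test the findall result itself: 'StreetName' is also a substring of 'StreetNamePostType'.
    df[colname] = df['Info'].apply(lambda x: (pattern.findall(x) or [""])[0])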

This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
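
To see what the pattern in that loop actually captures, here is a small standalone sketch that needs only the standard library. The `info` string is an assumed example of what one "Info" cell might look like (the string form of a list of (token, tag) tuples), and `first_token` is a hypothetical helper name. A bare substring test for the tag name would not be a safe guard, because `StreetName` also occurs inside `StreetNamePostType`, so the sketch checks the `findall` result directly.

```python
import re

# Assumed example of one "Info" cell: str() of a list of (token, tag) tuples.
info = str([("652", "AddressNumber"), ("Polk", "StreetName"),
            ("St", "StreetNamePostType"), ("San", "PlaceName"),
            ("Francisco,", "PlaceName"), ("CA", "StateName")])


def first_token(tag, text):
    """Return the first token labelled exactly `tag`, or "" if there is none."""
    # The closing quote after the tag name is part of the pattern, so
    # 'StreetName' does not match inside 'StreetNamePostType'.
    pattern = r"\('(\S+)', '{}'\)".format(re.escape(tag))
    matches = re.findall(pattern, text)
    return matches[0] if matches else ""


print(first_token("StreetName", info))   # token tagged StreetName
print(first_token("PlaceName", info))    # only the first PlaceName token
print(first_token("ZipCode", info))      # tag absent, so an empty string
```

Because only the first match is kept, a multi-token value such as a two-word city name is truncated to its first word; joining all matches with a space would be one way to keep the whole value.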
