Converting Unordered List Of Tuples To Pandas Dataframe
Solution 1:
I'm not sure there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records or from_items? I still don't think this structure would be directly compatible, and from_items has since been removed from pandas.)
Here's a bit of manipulation to get what you're looking for:
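import pandas as pd

# Column names come from the tags of the first row, so this assumes
# every row carries the same tags in the same order.
cols = [j for _, j in info[0]]
# Could use a nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])
pd.DataFrame(info2, columns=cols)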
   AddressNumber    StreetName  StreetNamePostType  StreetNamePostDirectional   PlaceName  StateName  ZipCode
0            123  Pennsylvania                 Ave                         NW  Washington         DC    20008
1            652          Polk                  St                        San  Francisco,         CA    94102
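In the output above, the second address has no StreetNamePostDirectional token, so positional assignment puts "San" in that column. A minimal sketch of one way around this, keying each row by tag instead of by position, is shown here. The `info` literal is hand-written to mirror the two rows above (an assumption about what usaddress.parse returns for them), and `row_to_dict` and `df_by_tag` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Assumed shape of `info`: one list of (token, tag) tuples per address.
# Hard-coded here so the sketch runs without usaddress installed.
info = [
    [("123", "AddressNumber"), ("Pennsylvania", "StreetName"),
     ("Ave", "StreetNamePostType"), ("NW", "StreetNamePostDirectional"),
     ("Washington", "PlaceName"), ("DC", "StateName"), ("20008", "ZipCode")],
    [("652", "AddressNumber"), ("Polk", "StreetName"),
     ("St", "StreetNamePostType"), ("San", "PlaceName"),
     ("Francisco,", "PlaceName"), ("CA", "StateName"), ("94102", "ZipCode")],
]


def row_to_dict(row):
    """Group tokens by tag, joining tokens that share a tag with a space."""
    grouped = {}
    for token, tag in row:
        grouped.setdefault(tag, []).append(token)
    return {tag: " ".join(tokens) for tag, tokens in grouped.items()}


# A list of dicts lets pandas align values by key; tags missing from a
# row become NaN instead of shifting later values to the left.
df_by_tag = pd.DataFrame([row_to_dict(row) for row in info])
print(df_by_tag)
```

With this layout the order of tuples within a row no longer matters, and tags that appear in only some addresses simply leave gaps in the other rows.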
Solution 2:
Thank you for your responses! I ended up doing a completely different workaround as follows:
I checked the documentation to see all possible parse tags from usaddress, created a DataFrame with all possible tags as columns, and added one other column with the extracted addresses. Then I parsed and extracted the information into those columns using regex. Code below!
import re

import pandas as pd
import usaddress

parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
              'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
              'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
              'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
              'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
              'SubaddressType','PlaceName','StateName','ZipCode']
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']
df = pd.DataFrame({'Addresses': addr})
# pd.concat returns a new frame, so the result has to be assigned back to df.
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])
Then I created a new column, called "Info", that holds the usaddress parse list converted to a string:
df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))
Now here's the major workaround. I looped through each column name, looked for it in the corresponding "Info" cell, and applied a regular expression to extract the token wherever that tag existed!
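for colname in parse_tags:
    # Raw string avoids invalid-escape warnings. The closing quote after the tag
    # keeps 'StreetName' from being confused with 'StreetNamePostType'.
    pattern = re.compile(r"\('(\S+)', '{}'\)".format(colname))
    # Keep the first token with this tag, or "" when the tag is absent.
    df[colname] = df['Info'].apply(lambda x: (pattern.findall(x) or [""])[0])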
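The extraction step can be tried in isolation on a single "Info" string. The following is only a sketch: `info_cell` is a hypothetical hand-written cell, not real usaddress output, and `first_token` is a helper name introduced here for illustration. It shows two things worth knowing about this approach: a tag name that is a prefix of another tag (StreetName vs. StreetNamePostType) is safe only because the pattern includes the closing quote, and only the first token of a multi-token tag is kept.

```python
import re

# Hypothetical "Info" cell: the str() of a list of (token, tag) tuples.
info_cell = str([("652", "AddressNumber"), ("Polk", "StreetName"),
                 ("St", "StreetNamePostType"), ("San", "PlaceName"),
                 ("Francisco,", "PlaceName")])


def first_token(tag, text):
    """Return the first token tagged exactly `tag`, or "" if there is none."""
    pattern = r"\('(\S+)', '{}'\)".format(re.escape(tag))
    matches = re.findall(pattern, text)
    return matches[0] if matches else ""


street = first_token("StreetName", info_cell)   # exact tag only
place = first_token("PlaceName", info_cell)     # first of two PlaceName tokens
missing = first_token("ZipCode", info_cell)     # tag not present in this cell

# A cell that contains StreetNamePostType but no StreetName: a plain
# substring check for "StreetName" would succeed here, the pattern does not.
only_post_type = first_token("StreetName", str([("St", "StreetNamePostType")]))

print(repr(street), repr(place), repr(missing), repr(only_post_type))
```

If the full place name is needed, joining all matches (for example with `" ".join(matches)`) instead of taking `matches[0]` would be one option.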
This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!