Skip to content Skip to sidebar Skip to footer

Creating A Matrix From Pandas Dataframe To Display Connectedness

I have my data in this format in a pandas dataframe: Customer_ID Location_ID Alpha A Alpha B Alpha C Beta A Beta B Be

Solution 1:

Here is one approach that takes into account the multiplicity of visits (e.g. if Customer X visits both LocA and LocB twice, he will contribute 2 to the corresponding position in the final matrix).

Idea:

  1. For each location, count visits by customer.
  2. For each location pair, find the sum of minimal numbers of visits for each customer who visited both.
  3. Use unstack and cleanup.

Counter plays nicely here because counters support many natural arithmetic operations, like add, max etc.

import pandas as pd
from collections import Counter
from itertools import product

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B'],
    })


ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}


# In [7]: q.ctrs# Out[7]:# {'A': Counter({'Alpha': 1, 'Beta': 1}),#  'B': Counter({'Alpha': 1, 'Beta': 1}),#  'C': Counter({'Alpha': 1})}


ctrs = list(ctrs.items())
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
    for i, (loc1, ctr1) inenumerate(ctrs, start=1)
    for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]


df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
df2 = df2.unstack().fillna(0).astype(int)


#      Count# Loc2     A  B  C# Loc1# A        0  2  1# B        2  0  1# C        1  1  0

If you like to disregard multiplicities, replace Counter(gp.Customer_ID) with Counter(set(gp.Customer_ID)).

Solution 2:

I'm sure there's a more elegant way but here's a solution I came up with on the fly. Basically you build an adjacency list for each customer, then update the adjacency matrix accordingly:

import pandas as pd

#I'm assuming you can get your data into a pandas data frame:
data = {'Customer_ID':[1,1,1,2,2],'Location':['A','B','C','A','B']}
df = pd.DataFrame(data)

#Initialize an empty matrix
matrix_size = len(df.groupby('Location'))
matrix = [[0for col inrange(matrix_size)] for row inrange(matrix_size)]

#To make life easier, I made a map to go from locations #to row/col positions in the matrix
location_set = list(set(df['Location'].tolist()))
location_set.sort()
location_map = dict(zip(location_set,range(len(location_set))))

#Group data by customer, and create an adjacency list (dyct) for each#Update the matrix accordinglyfor name,group in df.groupby('Customer_ID'):
    locations = set(group['Location'].tolist())
    dyct = {}
    for i in locations:
        dyct[i] = list(locations.difference(i))

    #Loop through the adjacency list and update matrixfor node, edges in dyct.items(): 
        for edge in edges:
            matrix[location_map[edge]][location_map[node]] +=1

Post a Comment for "Creating A Matrix From Pandas Dataframe To Display Connectedness"