Counting Line Frequencies And Producing Output Files
With a text file like this:
a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f
How can one read it, and produce two output text files: one keeping only the lines representing the most
Solution 1:
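Here is an answer that does not use a frozenset. It assumes df has been read as in the older answer below.
import numpy as np
import pandas as pd
# Sort each row so that a;b and b;a become the same (A, B) pair.
# (df.apply(sorted, 1) returns a Series of lists in recent pandas, so rebuild the frame explicitly.)
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1), columns=['A', 'B'])
# Count each pair, most frequent first.
df_count = df1.groupby(['A', 'B']).size().reset_index(name='Count').sort_values('Count', ascending=False)
# List every pair under both of its letters, sorted by letter and then by descending count.
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']),
                    df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending=[True, False])
# The first row of each letter group is that letter's most frequent pair.
df_first = df_all.groupby(['letter']).first().reset_index()
# Top 25% of the distinct pairs by count.
top = len(df_count) // 4
df_top_25 = df_count.iloc[:top]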
------------older answer --------
Since order does not matter (a;b and b;a are the same pair), you can use a frozenset as the grouping key.
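As a quick illustration of why a frozenset works as an order-insensitive key, here is a minimal sketch using only the standard library. collections.Counter stands in for the pandas counting, and the lines list is simply the sample data from the question typed in by hand; both are assumptions made for illustration.

```python
from collections import Counter

# Sample lines from the question, one "x;y" pair per entry.
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# A frozenset ignores element order and is hashable, so it can be a dict key.
assert frozenset(["a", "b"]) == frozenset(["b", "a"])

# Count each unordered pair.
pair_counts = Counter(frozenset(line.split(";")) for line in lines)

# Show the three most frequent pairs (tie order is not significant).
for pair, count in pair_counts.most_common(3):
    print(sorted(pair), count)
```

Here a;b and b;a land under the same key, so that pair is counted three times across the twelve lines.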
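import pandas as pd
# Read the two ';'-separated columns; the file has no header row.
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
# Turn each row into a frozenset so that (a, b) and (b, a) compare equal.
s = df.apply(frozenset, axis=1)
# Count identical pairs, most frequent first.
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']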
Which will give you this
   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1
To get the highest combo for each letter, we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.
df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])
And since this is sorted by letter and count (descending) just get the first row of each group.
df_first = df_all.groupby('letter').first()
And to get the top 25%, just use
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
And then use .to_csv to output to file.
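Putting the steps together, the following is a minimal end-to-end sketch, not the answer's exact code. It assumes the sample data from the question (read from memory rather than from text.csv), normalises each pair to a sorted "x;y" string instead of a frozenset so that it writes cleanly to a file, and uses hypothetical output file names in a temporary directory.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample data from the question, read from memory instead of 'text.csv'.
raw = "a;b\nb;a\nc;d\nd;c\ne;a\nf;g\nh;b\nb;f\nb;f\nc;g\na;b\nd;f\n"
df = pd.read_csv(io.StringIO(raw), header=None, names=["A", "B"], sep=";")

# Normalise each row to a sorted "x;y" string so a;b and b;a coincide.
pairs = df.apply(lambda row: ";".join(sorted(row)), axis=1)
df_count = pairs.value_counts().rename_axis("pair").reset_index(name="Count")

# List each pair under both of its letters, then keep the best pair per letter.
long_form = df_count.assign(letter=df_count["pair"].str.split(";")).explode("letter")
df_first = (long_form.sort_values(["letter", "Count"], ascending=[True, False])
                     .groupby("letter", as_index=False).first())

# Top 25% of the distinct pairs by count.
top = len(df_count) // 4
df_top_25 = df_count.iloc[:top]

# Hypothetical output names; written to a temporary directory for this sketch.
out_dir = Path(tempfile.mkdtemp())
best_path = out_dir / "best_per_letter.csv"
top_path = out_dir / "top_25_percent.csv"
df_first.to_csv(best_path, index=False)
df_top_25.to_csv(top_path, index=False)
```

With the eight distinct pairs in the sample, top works out to 2. Replace out_dir with a real directory to keep the files; index=False stops pandas from writing the row index as an extra column.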