Pandas Filter Rows By Two Column Values With Case Insenstive

August 24, 2023 Post a Comment

I have a simple dataframe as follows: Last Known Date ConfigredValue ReferenceValue 0 24-Jun-17 False FALSE 1 25-Jun-17 FALSE FALSE 2 26-Jun-17 TRUE FALSE 3 2

Solution 1:

Option 1: convert to lowercase or to uppercase and compare

The simplest is to convert the two columns to lower (or to upper) before checking for equality:

df=df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]

df=df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2226-Jun-17TRUEFALSE

Option 2: Compare the lengths

In this particuler case, you can simply compare the lengths of TRUE and True, they are the same wether the string is upper or lower case:

df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2226-Jun-17TRUEFALSE

Option 3: Vectorized title

str.title() was also suggested in @0p3n5ourcE answer, here's a vectorized version of it:

df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]

Execution time

Benchmarking the speed shows that str.len() is a bit faster

In [35]: timeit df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]
1000 loops, best of 3: 496 µs per loop

In [36]: timeit df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]
1000 loops, best of 3: 496 µs per loop

In [37]: timeit df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]
1000 loops, best of 3: 495 µs per loop

In [38]: timeit df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]
1000 loops, best of 3: 479 µs per loop

Solution 2:

Better replace existing false with 'FALSE' with case = False parameter ie

df['ConfigredValue'] = df['ConfigredValue'].str.replace('false','FALSE',case=False)

df=df[df['ConfigredValue']!=df['ReferenceValue']]

Output:

   Last Known_Date ConfigredValue ReferenceValue
2     2  26-Jun-17           TRUE          FALSE

Solution 3:

Looks like the columns hold boolean values, if it is not a problem converting the columns to boolean datatype then, following can work too (where .title() is used to change first character of string to uppercase e.g. FALSE to False, or true to True which can be used to convert then to corresponding boolean value):

df['ConfigredValue'] = df['ConfigredValue'].apply(lambda row: eval(row.title()))
df['ReferenceValue'] = df['ReferenceValue'].apply(lambda row: eval(row.title()))

Then, using same comparison as above:

df[df['ConfigredValue'] != df['ReferenceValue']]

Output:

Last Known Date  ConfigredValue  ReferenceValue
226-Jun-17TrueFalse

Or, simply using title only similar to uppercase or lowercase:

df[df['ConfigredValue'].str.title() !=df['ReferenceValue'].str.title()]

Solution 4:

Outside the boxpandas.read_csv reads all of these in as boolean. You can dump to csv and read it in again. Then you can use pd.DataFrame.query

pd.read_csv(pd.io.common.StringIO(df.to_csv(index=False))).query(
    'ConfigredValue != ReferenceValue')

  Last Known Date  ConfigredValue  ReferenceValue
226-Jun-17TrueFalse

Python Tutorial for Beginners