
How To Extract Year (or Datetime) From A Column In A Pandas Dataframe That Contains Text

Suppose I have a pandas dataframe:

Id  Book
1   Harry Potter (1997)
2   Of Mice and Men (1937)
3   Babe Ruth Story, The (1948) Drama 948) Babe

Solution 1:

How about a simple Regex:

import re

text = 'Harry Potter (1997)'
re.findall(r'\((\d{4})\)', text)
# ['1997'] Note that this is a list of all the occurrences.

With a Dataframe, it can be done like this:

import pandas as pd

text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = r'\((\d{4})\)'
df['year'] = df.Book.str.extract(pattern, expand=False)  # expand=False returns a Series
df
#                   Book  year
# 1  Harry Potter (1997)  1997
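The extracted values are strings, so the question's "(or Datetime)" part still needs a conversion step. Here is a minimal sketch of one way to do it, assuming a nullable integer column is acceptable and that a datetime pinned to 1 January is what is wanted. The 'No Year Here' row and the `year_dt` column name are made up for illustration.

```python
import pandas as pd

# 'No Year Here' is a hypothetical row, added to show what happens without a match.
df = pd.DataFrame({'Book': ['Harry Potter (1997)',
                            'Of Mice and Men (1937)',
                            'No Year Here']})

# Extract the four digits between parentheses; rows without a match become NaN.
year_text = df['Book'].str.extract(r'\((\d{4})\)', expand=False)

# The nullable Int64 dtype keeps missing years as <NA> instead of leaving floats.
df['year'] = pd.to_numeric(year_text).astype('Int64')

# Optional: a datetime column; format='%Y' parses each year as 1 January of that year.
df['year_dt'] = pd.to_datetime(year_text, format='%Y')
```

Rows without a parenthesised year end up as `<NA>` in `year` and `NaT` in `year_dt`, so they can be filtered with `df['year'].isna()`.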

Finally, if you actually want to separate the title and the year (borrowing the dataframe construction from Philip's answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

sep = df['Book'].str.extract(r'(.*)\((\d{4})\)', expand=False)

sep  # A new df, separated into title and year
#                       0     1
# 0          Harry Potter  1997
# 1       Of Mice and Men  1937
# 2  Babe Ruth Story, The  1948
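A possible variation on the title/year split uses named groups, so the resulting columns get readable names instead of 0 and 1. This is only a sketch: the group names `title` and `year` are my own choice, the third row is shortened for readability, and the lazy `.*?` followed by `\s*` is there to keep the trailing space out of the title.

```python
import pandas as pd

df = pd.DataFrame({'Book': ['Harry Potter (1997)',
                            'Of Mice and Men (1937)',
                            'Babe Ruth Story, The (1948)   Drama']})

# Named groups become column names in the result.
# .*? is lazy, and \s* absorbs the space between the title and the "(".
parts = df['Book'].str.extract(r'(?P<title>.*?)\s*\((?P<year>\d{4})\)')

# Every row matched here, so a plain int conversion is safe.
parts['year'] = parts['year'].astype(int)
```

If some rows might lack a year, `astype(int)` would fail on the missing values, and a nullable dtype such as `'Int64'` (after `pd.to_numeric`) would be the safer choice.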

Solution 2:

You could do the following.

import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

df['Year'] = df['Book'].str.extract(r'(?!\()\b(\d+){1}')
  1. The first line imports pandas.
  2. The second line creates the dataframe, for the sake of illustration.
  3. The last line creates a new column 'Year' by extracting part of the string in the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
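One thing worth checking with this pattern is what it returns when the title itself contains digits. The sketch below compares it with the parenthesis-anchored pattern from Solution 1. The second title is not from the original data and is only there to illustrate the difference.

```python
import pandas as pd

# The second title is a hypothetical extra row whose name starts with a number.
titles = pd.Series(['Harry Potter (1997)', '2001: A Space Odyssey (1968)'])

# Pattern from this solution: the first run of digits that starts at a word boundary.
first_digits = titles.str.extract(r'(?!\()\b(\d+){1}', expand=False)

# Pattern from Solution 1: exactly four digits wrapped in parentheses.
in_parens = titles.str.extract(r'\((\d{4})\)', expand=False)

print(pd.DataFrame({'first_digits': first_digits, 'in_parens': in_parens}))
```

For the sample data in the question both patterns agree, so which one to use depends on whether titles with leading numbers can occur.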

Solution 3:

For the full Series, the answer is actually this:

books['title'].str.findall(r'\((\d{4})\)').str.get(0)
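To see how this behaves on edge cases, here is a small sketch with a hypothetical `books` frame (the original answer does not show its data). It assumes some titles may have no year at all and others may carry more than one.

```python
import pandas as pd

# Hypothetical data: one normal title, one without a year, one with two years.
books = pd.DataFrame({'title': ['Harry Potter (1997)',
                                'Untitled Draft',
                                'Reissue (1937) (1994)']})

# findall returns a list of every parenthesised four-digit number per row.
all_years = books['title'].str.findall(r'\((\d{4})\)')

# .str.get(0) picks the first match; rows with an empty list become missing.
first_year = all_years.str.get(0)

# .str.get(-1) picks the last match instead, which may be preferable
# when the year is expected at the end of the title.
last_year = all_years.str.get(-1)
```

Compared with `str.extract`, which only ever returns the first match, going through `findall` leaves the choice of which occurrence to keep.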
