
How To Identify Abbreviations/acronyms And Expand Them In SpaCy?

I have a large (~50k) term list and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or the e

Solution 1:

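Check out scispacy on GitHub, which implements the acronym identification heuristic described by Schwartz and Hearst in "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text" (2003). The heuristic works only if acronyms are "introduced" in the text, that is, the long form is followed directly by the short form in parentheses, as in: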

StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
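To make the "introduced" pattern concrete, here is a rough, dependency-free sketch of the general idea behind this kind of heuristic: find a parenthesized short form, then walk backwards through the preceding words, matching the short form's characters from right to left. This is a simplified illustration, not scispacy's code; the function names find_long_form and find_definitions, the regular expression, and the word-window limit are assumptions made for this example.

```python
import re


def find_long_form(short_form, candidate):
    """Match short_form's characters right to left inside candidate.

    Returns the matched long form, or None if the characters cannot be aligned.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Cut the candidate at the start of the word where the match began.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def find_definitions(text):
    """Return {short form: long form} for every "Long Form (SF)" pattern found."""
    pairs = {}
    # Assumed short-form shape: 2 to 10 characters, starting with a letter or digit.
    for m in re.finditer(r"\(([A-Za-z0-9][\w.-]{1,9})\)", text):
        short_form = m.group(1)
        words = text[: m.start()].split()
        # Only look at a small window of words before the parenthesis.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs


text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"
print(find_definitions(text))
```

A bare "SO" with no definition anywhere in the text yields nothing, which is why the heuristic depends on the acronym being introduced at least once.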

A working way to replace all acronyms in a piece of text with their long form using scispacy could then be:

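import spacy
# Importing AbbreviationDetector registers the "abbreviation_detector" factory with spaCy.
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")

# spaCy v3+ adds components by name; on spaCy v2 use nlp.add_pipe(AbbreviationDetector(nlp)).
nlp.add_pipe("abbreviation_detector")

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    # Keep each token's trailing whitespace so the original spacing survives.
    altered_tok = [tok.text_with_ws for tok in doc]
    for abrv in doc._.abbreviations:
        # Put the long form at the acronym's first token and blank any remaining tokens.
        altered_tok[abrv.start] = str(abrv._.long_form) + doc[abrv.end - 1].whitespace_
        for i in range(abrv.start + 1, abrv.end):
            altered_tok[i] = ""
    return "".join(altered_tok)

print(replace_acronyms(text))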
