How To Identify Abbreviations/acronyms And Expand Them In SpaCy?
I have a large (~50k) term list and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or the e
Solution 1:
Check out scispacy on GitHub, which implements the acronym identification heuristic described by Schwartz and Hearst (2003) in "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text". The heuristic works if acronyms are "introduced" in the text with a pattern like this:
StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
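To make the "long form (SHORT)" pattern concrete, here is a rough pure-Python sketch of that kind of heuristic. It is a simplified reading of the Schwartz and Hearst matching rule, not scispacy's actual code: the function names (`best_long_form`, `find_abbreviations`, `expand`), the regex, and the naive sentence split are all my own assumptions, and several validity checks from the paper are left out.

```python
import re

# Assumption: a short form is 2-10 characters inside one pair of parentheses.
SHORT_FORM = re.compile(r"\(([^()]{2,10})\)")


def best_long_form(short_form, candidate):
    """Match short_form against candidate right to left; return the long form or None."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Extend back to the beginning of the word that holds the first match.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return {short form: long form} for every 'long form (SHORT)' introduction found."""
    pairs = {}
    for match in SHORT_FORM.finditer(text):
        short_form = match.group(1).strip()
        if not short_form or not short_form[0].isalnum():
            continue
        if not any(c.isalpha() for c in short_form):
            continue
        # Only look at the current sentence (naive split on ., ! or ?).
        preceding = re.split(r"[.!?]\s", text[: match.start()])[-1]
        words = preceding.split()
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before "(".
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            pairs[short_form] = long_form
    return pairs


def expand(text, pairs):
    """Replace standalone short forms, leaving the parenthesised definition alone."""
    for short_form, long_form in pairs.items():
        pattern = r"(?<![\w(])" + re.escape(short_form) + r"(?![\w)])"
        text = re.sub(pattern, lambda _m, lf=long_form: lf, text)
    return text
```

Calling `find_abbreviations` on the example sentence gives a dictionary mapping "SO" to "StackOverflow", and `expand` then rewrites the later standalone "SO". For a large term list, that dictionary can be built once and reused for lookups in both directions.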
One way to replace every acronym in a piece of text with its long form could then be:
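import spacy
from scispacy.abbreviation import AbbreviationDetector  # importing registers "abbreviation_detector"

nlp = spacy.load("en_core_web_sm")
# spaCy v3: add the component by its registered name.
# On spaCy v2, use nlp.add_pipe(AbbreviationDetector(nlp)) instead.
nlp.add_pipe("abbreviation_detector")

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    # Work from the end of the doc so earlier token indices stay valid,
    # and replace the whole abbreviation span, not just its first token.
    for abrv in sorted(doc._.abbreviations, key=lambda span: span.start, reverse=True):
        altered_tok[abrv.start:abrv.end] = [str(abrv._.long_form)]
    return " ".join(altered_tok)

print(replace_acronyms(text))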