This function extracts named entities from texts, based on the entity tag
ent attributes of document objects parsed by spaCy (see
spacy_extract_entity(x, output = c("data.frame", "list"), type = c("all", "named", "extended"), multithread = TRUE, ...)
x: a character object or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)
output: type of returned object, either "data.frame" or "list"
type: type of named entities, either "all", "named", or "extended"
multithread: logical; if TRUE, the processing is parallelized using the pipe functionality of spaCy (https://spacy.io/api/pipe).
data.frame of tokens
When the option output = "data.frame" is selected, the function returns a data.frame with the following fields.
ent_type: type of entity (e.g. ORG, GPE)
start_id: serial number ID of starting token. This number corresponds with the number of data.frame returned from spacy_tokenize(x) with default options.
length: number of words (tokens) included in a named entity (e.g. for the entity "New York Stock Exchange", length = 4)
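The relationship between start_id, length, and the token sequence can be illustrated with a small base R sketch. It does not call spaCy: the token vectors below are hand-written assumptions about how the two example sentences might be tokenized, and the entity table is copied from the example output on this page. The names tokens, entities, and rebuilt are hypothetical and used only for illustration.

```r
# Hypothetical token vectors, written by hand to mimic a per-document
# tokenization of the two example sentences (an assumption, not spaCy output).
tokens <- list(
  doc1 = c("The", "Supreme", "Court", "is", "located", "in", "Washington", "D.C."),
  doc2 = c("Paul", "earned", "a", "postgraduate", "degree", "from", "MIT", ".")
)

# Entity table with the same fields and values as the example output.
entities <- data.frame(
  doc_id   = c("doc1", "doc1", "doc2", "doc2"),
  text     = c("The Supreme Court", "Washington D.C.", "Paul", "MIT"),
  ent_type = c("ORG", "GPE", "ORG", "ORG"),
  start_id = c(1, 7, 1, 7),
  length   = c(3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# For each entity, take `length` tokens starting at position `start_id`
# within the matching document and paste them back together.
rebuilt <- mapply(
  function(doc, start, len) {
    paste(tokens[[doc]][seq(start, length.out = len)], collapse = " ")
  },
  entities$doc_id, entities$start_id, entities$length,
  USE.NAMES = FALSE
)

# Under the tokenization assumed above, the rebuilt strings match the text field.
stopifnot(identical(rebuilt, entities$text))
```

Under these assumptions, each entity spans the tokens start_id through start_id + length - 1 of its document.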
txt <- c(doc1 = "The Supreme Court is located in Washington D.C.",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_entity(txt)
#>   doc_id              text ent_type start_id length
#> 1   doc1 The Supreme Court      ORG        1      3
#> 2   doc1   Washington D.C.      GPE        7      2
#> 3   doc2              Paul      ORG        1      1
#> 4   doc2               MIT      ORG        7      1
spacy_extract_entity(txt, output = "list")
#> $doc1
#> [1] "The Supreme Court" "Washington D.C."
#>
#> $doc2
#> [1] "Paul" "MIT"
#>
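The two output types in the example carry the same entity strings, grouped differently. As a rough sketch in base R (not a description of how spacyr builds its list output internally), the list form shown above can be recovered from the data.frame form by splitting the text field on doc_id. The object ents is a hypothetical stand-in for the data.frame result, with values copied from the example.

```r
# Hypothetical stand-in for the data.frame result shown in the example.
ents <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc2"),
  text   = c("The Supreme Court", "Washington D.C.", "Paul", "MIT"),
  stringsAsFactors = FALSE
)

# Group the entity strings by document, giving one character vector per doc_id.
ents_by_doc <- split(ents$text, ents$doc_id)

print(ents_by_doc)
```

This grouping can be handy when the data.frame has already been computed and a per-document list is needed without parsing the texts a second time.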