This function extracts named entities from texts, based on the ent (entity tag) attributes of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#section-named-entities).

spacy_extract_entity(
  x,
  output = c("data.frame", "list"),
  type = c("all", "named", "extended"),
  multithread = TRUE,
  ...
)

Arguments

x

a character object or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

output

type of returned object, either "list" or "data.frame".

type

type of named entities to extract, one of "named", "extended", or "all". See https://spacy.io/docs/usage/entity-recognition#entity-types for details.

multithread

logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api)

...

unused
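
As a rough illustration of the second input form accepted by x, the sketch below builds a small corpus data.frame following the TIF convention (character columns doc_id and text, with unique document ids). The object name corpus_df is hypothetical, and the final call is left as a comment because it assumes an initialized spaCy session.

```r
# Build a minimal TIF-style corpus: character columns doc_id and text,
# with unique document ids (per the TIF corpus data.frame convention).
corpus_df <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c(
    "The Supreme Court is located in Washington D.C.",
    "Paul earned a postgraduate degree from MIT."
  ),
  stringsAsFactors = FALSE
)

# Basic sanity checks before handing the object to the parser
stopifnot(
  is.character(corpus_df$doc_id),
  is.character(corpus_df$text),
  anyDuplicated(corpus_df$doc_id) == 0
)

# With spaCy initialized, the call would then be:
# spacy_extract_entity(corpus_df, output = "data.frame", type = "named")
```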

Value

either a list or data.frame of named entities

Details

When the option output = "data.frame" is selected, the function returns a data.frame with the following fields.

ent_type

type of entity (e.g. ORG for organizations)

start_id

serial number ID of the entity's starting token. This number corresponds to the token numbering in the data.frame returned from spacy_tokenize(x) with default options.

length

number of words (tokens) included in a named entity (e.g. for the entity "New York Stock Exchange", length = 4)
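
To make the relationship between start_id and length concrete, here is a minimal sketch in base R that recovers the entity text from a token table. The token table is written by hand for illustration (its token_id column and the helper entity_text() are hypothetical, not part of the package), and the entity rows are copied from the doc1 rows of the Examples output.

```r
# Hand-written token table for illustration; in practice the tokens would
# come from the tokenizer. token_id is a hypothetical per-document serial number.
tokens <- data.frame(
  doc_id = rep("doc1", 8),
  token_id = 1:8,
  token = c("The", "Supreme", "Court", "is", "located", "in",
            "Washington", "D.C."),
  stringsAsFactors = FALSE
)

# Entity rows copied from the example output for doc1
entities <- data.frame(
  doc_id = c("doc1", "doc1"),
  ent_type = c("ORG", "GPE"),
  start_id = c(1L, 7L),
  length = c(3L, 2L),
  stringsAsFactors = FALSE
)

# An entity spans tokens start_id, ..., start_id + length - 1
entity_text <- function(doc, start, len) {
  ids <- seq(start, length.out = len)
  doc_tokens <- tokens[tokens$doc_id == doc, ]
  paste(doc_tokens$token[match(ids, doc_tokens$token_id)], collapse = " ")
}

recovered <- mapply(
  entity_text,
  entities$doc_id, entities$start_id, entities$length,
  USE.NAMES = FALSE
)
recovered
# "The Supreme Court" "Washington D.C."
```

The same indexing applies per document, which is why the helper filters the token table by doc_id before matching ids.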

Examples

# \donttest{
spacy_initialize()
#> spaCy is already initialized
#> NULL
txt <- c(doc1 = "The Supreme Court is located in Washington D.C.",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_entity(txt)
#>   doc_id              text ent_type start_id length
#> 1   doc1 The Supreme Court      ORG        1      3
#> 2   doc1   Washington D.C.      GPE        7      2
#> 3   doc2              Paul   PERSON        1      1
#> 4   doc2               MIT      ORG        7      1
spacy_extract_entity(txt, output = "list")
#> $doc1
#> [1] "The Supreme Court" "Washington D.C."
#>
#> $doc2
#> [1] "Paul" "MIT"
#>
# }