The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options for the tagset returned: the universal dependency ("Google") part-of-speech tagset (pos) or the detailed, language-specific tagset (tag), as well as lemmatization (lemma). Dependency parsing, named entity recognition, and noun phrase detection are also available as options. Enabling all of these options returns the most extensive set of parsing results that spaCy provides.
spacy_parse(
  x,
  pos = TRUE,
  tag = FALSE,
  lemma = TRUE,
  entity = TRUE,
  dependency = FALSE,
  nounphrase = FALSE,
  multithread = TRUE,
  additional_attributes = NULL,
  ...
)
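For orientation, a minimal sketch of a typical call is shown below; the backend must be initialized with spacy_initialize() first, and the column names listed in the comment reflect the usual default output rather than a guaranteed schema.

spacy_initialize()
# parse a short text with the default options; the result is a data frame
# typically containing doc_id, sentence_id, token_id, token, lemma, pos and entity
parsed <- spacy_parse("spaCy excels at large-scale information extraction.")
head(parsed)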
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)

pos: logical; whether to return universal dependency POS tags (https://universaldependencies.org/u/pos/)

tag: logical; whether to return detailed part-of-speech tags. For the language model en, these follow the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes); annotation specifications for other available languages are on the spaCy website (https://spacy.io/api/annotation)

lemma: logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)

entity: logical; if TRUE, report named entities

dependency: logical; if TRUE, analyse and tag dependencies

nounphrase: logical; if TRUE, analyse and tag noun phrases

multithread: logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api)

additional_attributes: a character vector for extracting additional attributes of tokens from spaCy. When attribute names are supplied, the output data.frame contains additional variables corresponding to those names. For instance, additional_attributes = c("is_punct") adds a variable named is_punct, a Boolean (in R, logical) indicating whether the token is punctuation. A full list of available attributes is given at https://spacy.io/api/token#attributes; see the sketch following this argument list.

...: not used directly
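As noted in the additional_attributes entry, each supplied spaCy token attribute becomes a column of the same name in the output. A brief sketch, in which the attribute names are standard spaCy token attributes and the example text is arbitrary:

# "like_num" and "is_punct" are spaCy token attributes; each appears as a logical column
parsed <- spacy_parse("Send $100 to Alice!", additional_attributes = c("like_num", "is_punct"))
parsed[, c("token", "like_num", "is_punct")]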
a data.frame
of tokenized, parsed, and annotated tokens
if (FALSE) {
spacy_initialize()
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)
txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
txt3 <- "We analyzed the Supreme Court with three natural language processing tools."
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
}
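A short follow-up sketch, assuming the examples above have been run: the parsed result is an ordinary data frame, so rows carrying a named-entity annotation can be filtered with base R (non-entity tokens typically have an empty string in the entity column), and spacy_finalize() shuts down the background spaCy process when parsing is finished.

parsed <- spacy_parse(txt3, entity = TRUE)
subset(parsed, entity != "")   # keep only tokens annotated as part of a named entity
spacy_finalize()               # terminate the spaCy backend started by spacy_initialize()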