The spacy_parse() function calls spaCy to both tokenize and tag the
texts, and returns a data.table of the results. The function provides options
on the type of tagset to return ("google" or "detailed"; see the pos and
tag arguments below), as well as lemmatization (lemma). Dependency parsing
and named entity recognition are also available as options. If
full_parse = TRUE is supplied, the function returns the most extensive
set of parsing results from spaCy.
spacy_parse(
x,
pos = TRUE,
tag = FALSE,
lemma = TRUE,
entity = TRUE,
dependency = FALSE,
nounphrase = FALSE,
multithread = TRUE,
additional_attributes = NULL,
...
)
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)
pos: logical; whether to return universal dependency POS tags (http://universaldependencies.org/u/pos/)
tag: logical; whether to return detailed part-of-speech tags. For the
English model en, this is the OntoNotes 5 version of the Penn Treebank
tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation
specifications for other available languages are on the spaCy
website (https://spacy.io/api/annotation).
lemma: logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)
entity: logical; if TRUE, report named entities
dependency: logical; if TRUE, analyse and tag dependencies
nounphrase: logical; if TRUE, analyse and tag noun phrases
multithread: logical; if TRUE, the processing is parallelized
using spaCy's architecture (https://spacy.io/api)
additional_attributes: a character vector; this option extracts
additional attributes of tokens from spaCy. When the names of
attributes are supplied, the output data.frame will contain additional
variables corresponding to those names. For instance, with
additional_attributes = c("is_punct"), the output will include an
additional variable named is_punct, a Boolean (in R, logical)
variable indicating whether the token is punctuation. A full
list of available attributes is available from
https://spacy.io/api/token#attributes.
...: not used directly
Value: a data.frame of tokenized, parsed, and annotated tokens
# \donttest{
spacy_initialize()
#> spaCy is already initialized
#> NULL
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
#> doc_id sentence_id token_id token lemma pos entity
#> 1 text1 1 1 And and CCONJ
#> 2 text1 1 2 now now ADV
#> 3 text1 1 3 for for ADP
#> 4 text1 1 4 something something PRON
#> 5 text1 1 5 completely completely ADV
#> 6 text1 1 6 different different ADJ
#> 7 text1 1 7 . . PUNCT
spacy_parse(txt, pos = TRUE, tag = TRUE)
#> doc_id sentence_id token_id token lemma pos tag entity
#> 1 text1 1 1 And and CCONJ CC
#> 2 text1 1 2 now now ADV RB
#> 3 text1 1 3 for for ADP IN
#> 4 text1 1 4 something something PRON NN
#> 5 text1 1 5 completely completely ADV RB
#> 6 text1 1 6 different different ADJ JJ
#> 7 text1 1 7 . . PUNCT .
spacy_parse(txt, dependency = TRUE)
#> doc_id sentence_id token_id token lemma pos head_token_id dep_rel
#> 1 text1 1 1 And and CCONJ 3 cc
#> 2 text1 1 2 now now ADV 3 advmod
#> 3 text1 1 3 for for ADP 3 ROOT
#> 4 text1 1 4 something something PRON 3 pobj
#> 5 text1 1 5 completely completely ADV 6 advmod
#> 6 text1 1 6 different different ADJ 4 amod
#> 7 text1 1 7 . . PUNCT 3 punct
#> entity
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
#> doc_id sentence_id token_id token lemma pos head_token_id dep_rel
#> 1 doc1 1 1 The the DET 3 det
#> 2 doc1 1 2 fast fast ADJ 3 amod
#> 3 doc1 1 3 cat cat NOUN 4 nsubj
#> 4 doc1 1 4 catches catch VERB 5 compound
#> 5 doc1 1 5 mice mouse NOUN 5 ROOT
#> 6 doc1 1 6 . . PUNCT 5 punct
#> 7 doc1 1 7 \n \n SPACE 6
#> 8 doc1 2 1 The the DET 4 det
#> 9 doc1 2 2 quick quick ADJ 4 amod
#> 10 doc1 2 3 brown brown ADJ 4 amod
#> 11 doc1 2 4 dog dog NOUN 5 nsubj
#> 12 doc1 2 5 jumped jump VERB 5 ROOT
#> 13 doc1 2 6 . . PUNCT 5 punct
#> 14 doc2 1 1 This this DET 2 nsubj
#> 15 doc2 1 2 is be AUX 2 ROOT
#> 16 doc2 1 3 the the DET 5 det
#> 17 doc2 1 4 second second ADJ 5 amod
#> 18 doc2 1 5 document document NOUN 2 attr
#> 19 doc2 1 6 . . PUNCT 2 punct
#> 20 doc3 1 1 This this DET 2 nsubj
#> 21 doc3 1 2 is be AUX 2 ROOT
#> 22 doc3 1 3 a a DET 7 det
#> 23 doc3 1 4 " " PUNCT 7 punct
#> 24 doc3 1 5 quoted quote VERB 7 amod
#> 25 doc3 1 6 " " PUNCT 7 punct
#> 26 doc3 1 7 text text NOUN 2 attr
#> 27 doc3 1 8 . . PUNCT 2 punct
#> entity
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17 ORDINAL_B
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25
#> 26
#> 27
txt3 <- "We analyzed the Supreme Court with three natural language processing tools."
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
#> doc_id sentence_id token_id token lemma pos entity
#> 1 text1 1 1 We -PRON- PRON
#> 2 text1 1 2 analyzed analyze VERB
#> 3 text1 1 3 the the DET ORG_B
#> 4 text1 1 4 Supreme Supreme PROPN ORG_I
#> 5 text1 1 5 Court Court PROPN ORG_I
#> 6 text1 1 6 with with ADP
#> 7 text1 1 7 three three NUM CARDINAL_B
#> 8 text1 1 8 natural natural ADJ
#> 9 text1 1 9 language language NOUN
#> 10 text1 1 10 processing processing NOUN
#> 11 text1 1 11 tools tool NOUN
#> 12 text1 1 12 . . PUNCT
#> nounphrase whitespace
#> 1 beg_root TRUE
#> 2 TRUE
#> 3 beg TRUE
#> 4 mid TRUE
#> 5 end_root TRUE
#> 6 TRUE
#> 7 beg TRUE
#> 8 mid TRUE
#> 9 mid TRUE
#> 10 mid TRUE
#> 11 end_root FALSE
#> 12 FALSE
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
#> doc_id sentence_id token_id token lemma pos entity like_num
#> 1 text1 1 1 We -PRON- PRON FALSE
#> 2 text1 1 2 analyzed analyze VERB FALSE
#> 3 text1 1 3 the the DET ORG_B FALSE
#> 4 text1 1 4 Supreme Supreme PROPN ORG_I FALSE
#> 5 text1 1 5 Court Court PROPN ORG_I FALSE
#> 6 text1 1 6 with with ADP FALSE
#> 7 text1 1 7 three three NUM CARDINAL_B TRUE
#> 8 text1 1 8 natural natural ADJ FALSE
#> 9 text1 1 9 language language NOUN FALSE
#> 10 text1 1 10 processing processing NOUN FALSE
#> 11 text1 1 11 tools tool NOUN FALSE
#> 12 text1 1 12 . . PUNCT FALSE
#> is_punct
#> 1 FALSE
#> 2 FALSE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
#> 11 FALSE
#> 12 TRUE
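The parsed output can also be passed to spacyr's helper functions for further processing. As a brief sketch (assuming spaCy is still initialized and txt3 is defined as above): entity_extract() returns one row per named entity found by the parser, entity_consolidate() collapses multi-token entities such as "the Supreme Court" into single tokens, and spacy_finalize() shuts down the background spaCy process when you are finished.

```r
parsedtxt <- spacy_parse(txt3, entity = TRUE)

# one row per named entity recognized during parsing
entity_extract(parsedtxt)

# merge multi-token entities into single tokens in the parsed table
entity_consolidate(parsedtxt)

# terminate the spaCy Python process when done
spacy_finalize()
```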
# }