The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options for two tagsets: the coarse-grained universal tagset (pos) and a detailed, language-specific tagset (tag), as well as lemmatization (lemma). Dependency parsing, named entity recognition, and noun phrase detection are available as options (dependency, entity, and nounphrase), and further token attributes from spaCy can be requested through additional_attributes.

spacy_parse(x, pos = TRUE, tag = FALSE, lemma = TRUE,
  entity = TRUE, dependency = FALSE, nounphrase = FALSE,
  multithread = TRUE, additional_attributes = NULL, ...)

Arguments

x

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

pos

logical; whether to return the universal dependency POS tagset (http://universaldependencies.org/u/pos/)

tag

logical; whether to return detailed part-of-speech tags. For the language model en, this is the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation specifications for other available languages are on the spaCy website (https://spacy.io/api/annotation).

lemma

logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)

entity

logical; if TRUE, report named entities

dependency

logical; if TRUE, analyse and tag dependencies

nounphrase

logical; if TRUE, analyse and tag noun phrases

multithread

logical; if TRUE, the processing is parallelized using spaCy's pipe functionality (https://spacy.io/api/pipe)

additional_attributes

a character vector; this option is for extracting additional attributes of tokens from spaCy. When attribute names are supplied, the output data.frame will contain additional variables corresponding to those names. For instance, when additional_attributes = c("is_punct"), the output will include an additional variable named is_punct, a Boolean (in R, logical) variable indicating whether the token is punctuation. A full list of available attributes is documented at https://spacy.io/api/token#attributes.

...

not used directly
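As a sketch of the accepted input types for x (assuming a working spaCy installation and an initialized spacyr session; document names and texts here are illustrative):

```r
library(spacyr)
spacy_initialize()

# a plain (optionally named) character vector works directly
spacy_parse(c(d1 = "spaCy is great.", d2 = "And so is spacyr."))

# a TIF-compliant corpus data.frame: one row per document,
# with doc_id and text columns as required by the TIF spec
tif_corpus <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("spaCy is great.", "And so is spacyr."),
  stringsAsFactors = FALSE
)
spacy_parse(tif_corpus, lemma = FALSE)
```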

Value

a data.frame of tokenized, parsed, and annotated tokens

Examples

spacy_initialize()
#> spaCy is already initialized
#> NULL

# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
#>   doc_id sentence_id token_id      token      lemma   pos entity
#> 1  text1           1        1        And        and CCONJ
#> 2  text1           1        2        now        now   ADV
#> 3  text1           1        3        for        for   ADP
#> 4  text1           1        4  something  something  NOUN
#> 5  text1           1        5 completely completely   ADV
#> 6  text1           1        6  different  different   ADJ
#> 7  text1           1        7          .          . PUNCT
spacy_parse(txt, pos = TRUE, tag = TRUE)
#>   doc_id sentence_id token_id      token      lemma   pos tag entity
#> 1  text1           1        1        And        and CCONJ  CC
#> 2  text1           1        2        now        now   ADV  RB
#> 3  text1           1        3        for        for   ADP  IN
#> 4  text1           1        4  something  something  NOUN  NN
#> 5  text1           1        5 completely completely   ADV  RB
#> 6  text1           1        6  different  different   ADJ  JJ
#> 7  text1           1        7          .          . PUNCT   .
spacy_parse(txt, dependency = TRUE)
#>   doc_id sentence_id token_id      token      lemma   pos head_token_id dep_rel
#> 1  text1           1        1        And        and CCONJ             3      cc
#> 2  text1           1        2        now        now   ADV             3  advmod
#> 3  text1           1        3        for        for   ADP             3    ROOT
#> 4  text1           1        4  something  something  NOUN             3    pobj
#> 5  text1           1        5 completely completely   ADV             6  advmod
#> 6  text1           1        6  different  different   ADJ             4    amod
#> 7  text1           1        7          .          . PUNCT             3   punct
#>   entity
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
#>    doc_id sentence_id token_id    token    lemma   pos head_token_id dep_rel
#> 1    doc1           1        1      The      the   DET             3     det
#> 2    doc1           1        2     fast     fast   ADJ             3    amod
#> 3    doc1           1        3      cat      cat  NOUN             4   nsubj
#> 4    doc1           1        4  catches    catch  VERB             4    ROOT
#> 5    doc1           1        5     mice    mouse  NOUN             4    dobj
#> 6    doc1           1        6        .        . PUNCT             4   punct
#> 7    doc1           1        7       \n       \n SPACE             6
#> 8    doc1           2        1      The      the   DET             4     det
#> 9    doc1           2        2    quick    quick   ADJ             4    amod
#> 10   doc1           2        3    brown    brown   ADJ             4    amod
#> 11   doc1           2        4      dog      dog  NOUN             5   nsubj
#> 12   doc1           2        5   jumped     jump  VERB             5    ROOT
#> 13   doc1           2        6        .        . PUNCT             5   punct
#> 14   doc2           1        1     This     this   DET             2   nsubj
#> 15   doc2           1        2       is       be  VERB             2    ROOT
#> 16   doc2           1        3      the      the   DET             5     det
#> 17   doc2           1        4   second   second   ADJ             5    amod
#> 18   doc2           1        5 document document  NOUN             2    attr
#> 19   doc2           1        6        .        . PUNCT             2   punct
#> 20   doc3           1        1     This     this   DET             2   nsubj
#> 21   doc3           1        2       is       be  VERB             2    ROOT
#> 22   doc3           1        3        a        a   DET             7     det
#> 23   doc3           1        4        "        " PUNCT             7   punct
#> 24   doc3           1        5   quoted    quote  VERB             7    amod
#> 25   doc3           1        6        "        " PUNCT             7   punct
#> 26   doc3           1        7     text     text  NOUN             2    attr
#> 27   doc3           1        8        .        . PUNCT             2   punct
#>       entity
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7      GPE_B
#> 8
#> 9
#> 10
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17 ORDINAL_B
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25
#> 26
#> 27
txt3 <- "We analyzed the Supreme Court with three natural language processing tools."
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
#>    doc_id sentence_id token_id      token      lemma   pos     entity
#> 1   text1           1        1         We     -PRON-  PRON
#> 2   text1           1        2   analyzed    analyze  VERB
#> 3   text1           1        3        the        the   DET      ORG_B
#> 4   text1           1        4    Supreme    supreme PROPN      ORG_I
#> 5   text1           1        5      Court      court PROPN      ORG_I
#> 6   text1           1        6       with       with   ADP
#> 7   text1           1        7      three      three   NUM CARDINAL_B
#> 8   text1           1        8    natural    natural   ADJ
#> 9   text1           1        9   language   language  NOUN
#> 10  text1           1       10 processing processing  NOUN
#> 11  text1           1       11      tools       tool  NOUN
#> 12  text1           1       12          .          . PUNCT
#>    nounphrase whitespace
#> 1    beg_root       TRUE
#> 2                   TRUE
#> 3         beg       TRUE
#> 4         mid       TRUE
#> 5    end_root       TRUE
#> 6                   TRUE
#> 7         beg       TRUE
#> 8         mid       TRUE
#> 9         mid       TRUE
#> 10        mid       TRUE
#> 11   end_root      FALSE
#> 12                FALSE
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
#>    doc_id sentence_id token_id      token      lemma   pos     entity like_num
#> 1   text1           1        1         We     -PRON-  PRON               FALSE
#> 2   text1           1        2   analyzed    analyze  VERB               FALSE
#> 3   text1           1        3        the        the   DET      ORG_B    FALSE
#> 4   text1           1        4    Supreme    supreme PROPN      ORG_I    FALSE
#> 5   text1           1        5      Court      court PROPN      ORG_I    FALSE
#> 6   text1           1        6       with       with   ADP               FALSE
#> 7   text1           1        7      three      three   NUM CARDINAL_B     TRUE
#> 8   text1           1        8    natural    natural   ADJ               FALSE
#> 9   text1           1        9   language   language  NOUN               FALSE
#> 10  text1           1       10 processing processing  NOUN               FALSE
#> 11  text1           1       11      tools       tool  NOUN               FALSE
#> 12  text1           1       12          .          . PUNCT               FALSE
#>    is_punct
#> 1     FALSE
#> 2     FALSE
#> 3     FALSE
#> 4     FALSE
#> 5     FALSE
#> 6     FALSE
#> 7     FALSE
#> 8     FALSE
#> 9     FALSE
#> 10    FALSE
#> 11    FALSE
#> 12     TRUE
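Because the result is a regular data.frame, the parsed tokens can be filtered and summarized with ordinary R tools. A minimal sketch, assuming the parse above has been stored in an object named parsed (the variable name is illustrative):

```r
parsed <- spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))

# keep only content words
subset(parsed, pos %in% c("NOUN", "VERB", "ADJ"))

# count tokens per part of speech
table(parsed$pos)
```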