The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options for two tagsets, either "google" (universal) or "detailed", as well as lemmatization (lemma). Dependency parsing and named entity recognition are also available as options. If full_parse = TRUE is supplied, the function returns the most extensive set of parsing results from spaCy.

spacy_parse(
  x,
  pos = TRUE,
  tag = FALSE,
  lemma = TRUE,
  entity = TRUE,
  dependency = FALSE,
  nounphrase = FALSE,
  multithread = TRUE,
  additional_attributes = NULL,
  ...
)

Arguments

x

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)
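
For illustration, a TIF-compliant corpus data.frame needs a doc_id and a text column. A minimal sketch of the two plain-R input forms (the spacy_parse() call itself requires spacyr and an initialized spaCy backend, so it is shown commented out; the variable names here are only examples):

```r
# library(spacyr); spacy_initialize()  # required before parsing

txt <- c(d1 = "Hello world.", d2 = "A second text.")  # named character vector
# the equivalent TIF-compliant corpus data.frame:
tif_df <- data.frame(doc_id = names(txt),
                     text   = unname(txt),
                     stringsAsFactors = FALSE)
# parsed <- spacy_parse(tif_df)  # same parse as spacy_parse(txt)
```

With a named character vector, the names become the doc_id values in the output; with a TIF data.frame, the doc_id column is used directly.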

pos

logical; whether to return the Universal Dependencies POS tagset (http://universaldependencies.org/u/pos/)

tag

logical; whether to return detailed part-of-speech tags. For the English language model (en), these are the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation specifications for other available languages are provided on the spaCy website (https://spacy.io/api/annotation).

lemma

logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)

entity

logical; if TRUE, report named entities

dependency

logical; if TRUE, analyse and tag dependencies

nounphrase

logical; if TRUE, analyse and tag noun phrases

multithread

logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api)

additional_attributes

a character vector; this option extracts additional attributes of tokens from spaCy. When attribute names are supplied, the output data.frame will contain additional variables corresponding to those names. For instance, with additional_attributes = c("is_punct"), the output will include an additional variable named is_punct, a Boolean (in R, logical) variable indicating whether the token is punctuation. A full list of available attributes is provided at https://spacy.io/api/token#attributes.
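
Each requested attribute becomes an extra column that can be used for ordinary filtering. A sketch using a hand-built stand-in for what spacy_parse(txt, additional_attributes = c("is_punct")) would return (the data here are illustrative, not real parser output):

```r
# stand-in for a parsed data.frame with one additional attribute column
parsed <- data.frame(token    = c("Hello", "world", "!"),
                     is_punct = c(FALSE, FALSE, TRUE),
                     stringsAsFactors = FALSE)
# drop punctuation tokens using the extra logical column
no_punct <- subset(parsed, !is_punct)
```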

...

not used directly

Value

a data.frame of tokenized, parsed, and annotated tokens
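
Because the return value is an ordinary data.frame (one row per token), base R tools apply directly. A sketch using a hand-built stand-in with the documented columns (illustrative values, not real parser output):

```r
# stand-in mimicking the documented output columns
parsed <- data.frame(doc_id = "text1", sentence_id = 1L, token_id = 1:4,
                     token = c("And", "now", "for", "."),
                     pos   = c("CCONJ", "ADV", "ADP", "PUNCT"),
                     stringsAsFactors = FALSE)
pos_counts <- table(parsed$pos)  # tabulate POS frequencies
```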

Examples

# \donttest{
spacy_initialize()
#> spaCy is already initialized
#> NULL
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
#>   doc_id sentence_id token_id      token      lemma   pos entity
#> 1  text1           1        1        And        and CCONJ       
#> 2  text1           1        2        now        now   ADV       
#> 3  text1           1        3        for        for   ADP       
#> 4  text1           1        4  something  something  PRON       
#> 5  text1           1        5 completely completely   ADV       
#> 6  text1           1        6  different  different   ADJ       
#> 7  text1           1        7          .          . PUNCT       
spacy_parse(txt, pos = TRUE, tag = TRUE)
#>   doc_id sentence_id token_id      token      lemma   pos tag entity
#> 1  text1           1        1        And        and CCONJ  CC       
#> 2  text1           1        2        now        now   ADV  RB       
#> 3  text1           1        3        for        for   ADP  IN       
#> 4  text1           1        4  something  something  PRON  NN       
#> 5  text1           1        5 completely completely   ADV  RB       
#> 6  text1           1        6  different  different   ADJ  JJ       
#> 7  text1           1        7          .          . PUNCT   .       
spacy_parse(txt, dependency = TRUE)
#>   doc_id sentence_id token_id      token      lemma   pos head_token_id dep_rel
#> 1  text1           1        1        And        and CCONJ             3      cc
#> 2  text1           1        2        now        now   ADV             3  advmod
#> 3  text1           1        3        for        for   ADP             3    ROOT
#> 4  text1           1        4  something  something  PRON             3    pobj
#> 5  text1           1        5 completely completely   ADV             6  advmod
#> 6  text1           1        6  different  different   ADJ             4    amod
#> 7  text1           1        7          .          . PUNCT             3   punct
#>   entity
#> 1       
#> 2       
#> 3       
#> 4       
#> 5       
#> 6       
#> 7       

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.", 
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text." )
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
#>    doc_id sentence_id token_id    token    lemma   pos head_token_id  dep_rel
#> 1    doc1           1        1      The      the   DET             3      det
#> 2    doc1           1        2     fast     fast   ADJ             3     amod
#> 3    doc1           1        3      cat      cat  NOUN             4    nsubj
#> 4    doc1           1        4  catches    catch  VERB             5 compound
#> 5    doc1           1        5     mice    mouse  NOUN             5     ROOT
#> 6    doc1           1        6        .        . PUNCT             5    punct
#> 7    doc1           1        7       \n       \n SPACE             6         
#> 8    doc1           2        1      The      the   DET             4      det
#> 9    doc1           2        2    quick    quick   ADJ             4     amod
#> 10   doc1           2        3    brown    brown   ADJ             4     amod
#> 11   doc1           2        4      dog      dog  NOUN             5    nsubj
#> 12   doc1           2        5   jumped     jump  VERB             5     ROOT
#> 13   doc1           2        6        .        . PUNCT             5    punct
#> 14   doc2           1        1     This     this   DET             2    nsubj
#> 15   doc2           1        2       is       be   AUX             2     ROOT
#> 16   doc2           1        3      the      the   DET             5      det
#> 17   doc2           1        4   second   second   ADJ             5     amod
#> 18   doc2           1        5 document document  NOUN             2     attr
#> 19   doc2           1        6        .        . PUNCT             2    punct
#> 20   doc3           1        1     This     this   DET             2    nsubj
#> 21   doc3           1        2       is       be   AUX             2     ROOT
#> 22   doc3           1        3        a        a   DET             7      det
#> 23   doc3           1        4        "        " PUNCT             7    punct
#> 24   doc3           1        5   quoted    quote  VERB             7     amod
#> 25   doc3           1        6        "        " PUNCT             7    punct
#> 26   doc3           1        7     text     text  NOUN             2     attr
#> 27   doc3           1        8        .        . PUNCT             2    punct
#>       entity
#> 1           
#> 2           
#> 3           
#> 4           
#> 5           
#> 6           
#> 7           
#> 8           
#> 9           
#> 10          
#> 11          
#> 12          
#> 13          
#> 14          
#> 15          
#> 16          
#> 17 ORDINAL_B
#> 18          
#> 19          
#> 20          
#> 21          
#> 22          
#> 23          
#> 24          
#> 25          
#> 26          
#> 27          

txt3 <- "We analyzed the Supreme Court with three natural language processing tools." 
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
#>    doc_id sentence_id token_id      token      lemma   pos     entity
#> 1   text1           1        1         We     -PRON-  PRON           
#> 2   text1           1        2   analyzed    analyze  VERB           
#> 3   text1           1        3        the        the   DET      ORG_B
#> 4   text1           1        4    Supreme    Supreme PROPN      ORG_I
#> 5   text1           1        5      Court      Court PROPN      ORG_I
#> 6   text1           1        6       with       with   ADP           
#> 7   text1           1        7      three      three   NUM CARDINAL_B
#> 8   text1           1        8    natural    natural   ADJ           
#> 9   text1           1        9   language   language  NOUN           
#> 10  text1           1       10 processing processing  NOUN           
#> 11  text1           1       11      tools       tool  NOUN           
#> 12  text1           1       12          .          . PUNCT           
#>    nounphrase whitespace
#> 1    beg_root       TRUE
#> 2                   TRUE
#> 3         beg       TRUE
#> 4         mid       TRUE
#> 5    end_root       TRUE
#> 6                   TRUE
#> 7         beg       TRUE
#> 8         mid       TRUE
#> 9         mid       TRUE
#> 10        mid       TRUE
#> 11   end_root      FALSE
#> 12                 FALSE
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
#>    doc_id sentence_id token_id      token      lemma   pos     entity like_num
#> 1   text1           1        1         We     -PRON-  PRON               FALSE
#> 2   text1           1        2   analyzed    analyze  VERB               FALSE
#> 3   text1           1        3        the        the   DET      ORG_B    FALSE
#> 4   text1           1        4    Supreme    Supreme PROPN      ORG_I    FALSE
#> 5   text1           1        5      Court      Court PROPN      ORG_I    FALSE
#> 6   text1           1        6       with       with   ADP               FALSE
#> 7   text1           1        7      three      three   NUM CARDINAL_B     TRUE
#> 8   text1           1        8    natural    natural   ADJ               FALSE
#> 9   text1           1        9   language   language  NOUN               FALSE
#> 10  text1           1       10 processing processing  NOUN               FALSE
#> 11  text1           1       11      tools       tool  NOUN               FALSE
#> 12  text1           1       12          .          . PUNCT               FALSE
#>    is_punct
#> 1     FALSE
#> 2     FALSE
#> 3     FALSE
#> 4     FALSE
#> 5     FALSE
#> 6     FALSE
#> 7     FALSE
#> 8     FALSE
#> 9     FALSE
#> 10    FALSE
#> 11    FALSE
#> 12     TRUE
# }