This function extracts noun phrases from documents, based on the noun_chunks attribute of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#noun-chunks).

spacy_extract_nounphrases(x, output = c("data.frame", "list"),
  multithread = TRUE, ...)

Arguments

x

a character object or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

output

type of returned object, either "data.frame" or "list"

multithread

logical; if TRUE, processing is parallelized using spaCy's pipe functionality (https://spacy.io/api/pipe).

...

unused

Value

either a list or data.frame of noun phrases

Details

When the option output = "data.frame" is selected, the function returns a data.frame that identifies each noun phrase by doc_id and text, together with the following fields.

root_text

contents of root token

start_id

serial number ID of the starting token. This number corresponds to the token numbering in the data.frame returned by spacy_tokenize(x) with default options.

root_id

serial number ID of root token

length

number of words (tokens) in the noun phrase (e.g. for the noun phrase "individual car owners", length = 3)
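The relationship between start_id, root_id and length can be sketched in base R. This is a minimal illustration rather than part of the package: the rows are copied by hand from the example output on this page so that the code runs without spaCy, and the columns end_id and root_in_span are hypothetical helpers. It assumes that a noun phrase is a contiguous run of tokens.

```r
# Rows copied by hand from the documented example output, so this sketch
# runs without a spaCy installation.
np <- data.frame(
  doc_id   = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  text     = c("Natural language processing", "a branch", "computer science",
               "Paul", "a postgraduate degree", "MIT"),
  start_id = c(1L, 5L, 8L, 1L, 3L, 7L),
  root_id  = c(3L, 6L, 9L, 1L, 5L, 7L),
  length   = c(3L, 2L, 2L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: a noun phrase is a contiguous run of tokens, so its last
# token is start_id + length - 1. end_id is a hypothetical helper column.
np$end_id <- np$start_id + np$length - 1L

# Under the same assumption, the root token lies inside [start_id, end_id].
np$root_in_span <- np$root_id >= np$start_id & np$root_id <= np$end_id

print(np[, c("doc_id", "text", "start_id", "end_id", "root_id", "root_in_span")])
```

In these example rows the root is always the last token of the phrase, which is typical for English noun phrases but should not be relied on in general.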

Examples

spacy_initialize()
#> spaCy is already initialized
#> NULL
txt <- c(doc1 = "Natural language processing is a branch of computer science.",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_nounphrases(txt)
#>   doc_id                        text  root_text start_id root_id length
#> 1   doc1 Natural language processing processing        1       3      3
#> 2   doc1                    a branch     branch        5       6      2
#> 3   doc1            computer science    science        8       9      2
#> 4   doc2                        Paul       Paul        1       1      1
#> 5   doc2       a postgraduate degree     degree        3       5      3
#> 6   doc2                         MIT        MIT        7       7      1
spacy_extract_nounphrases(txt, output = "list")
#> $doc1
#> [1] "Natural language processing" "a branch"
#> [3] "computer science"
#>
#> $doc2
#> [1] "Paul"                  "a postgraduate degree" "MIT"
#>