Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)

Arguments

x

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)

what

the unit for splitting the text, available alternatives are:

"word"

word segmenter

"sentence"

sentence segmenter
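
A minimal sketch comparing the two units (the example sentence is invented here, and spaCy is assumed to have been initialized with spacy_initialize()):

spacy_tokenize("Dr. Smith arrived. She left early.", what = "word")
spacy_tokenize("Dr. Smith arrived. She left early.", what = "sentence")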

remove_punct

remove punctuation tokens.

remove_url

remove tokens that look like a URL or an email address.

remove_numbers

remove tokens that look like a number (e.g. "334", "3.1415", "fifty").
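
As an illustrative sketch, the remove_* options can be combined in a single call (the text is invented for this example and the output is not shown):

spacy_tokenize("Order 20 items at https://example.com today!",
               remove_punct = TRUE,
               remove_url = TRUE,
               remove_numbers = TRUE)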

remove_separators

remove whitespace separator tokens. This only takes effect when all other remove options (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE.

remove_symbols

remove symbol tokens, i.e. tokens tagged as SYM in the pos field, or currency symbols.

padding

if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.
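
A minimal sketch of the effect of padding (invented text; output not shown). With padding = TRUE, each removed punctuation token is replaced by an empty string so the remaining tokens keep their original positions:

spacy_tokenize("Hello, world!", remove_punct = TRUE, padding = FALSE)
spacy_tokenize("Hello, world!", remove_punct = TRUE, padding = TRUE)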

multithread

logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api).

output

type of the returned object, either "list" or "data.frame".
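
A sketch of the two output types (example documents invented; output not shown). The data.frame form is assumed here to return one row per token, keyed by document:

docs <- c(doc1 = "Spring is coming.", doc2 = "It is still cold.")
spacy_tokenize(docs, output = "list")        # named list, one character vector per document
spacy_tokenize(docs, output = "data.frame")  # long-format table, one row per token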

...

not used directly

Value

either a list or a data.frame of tokens, depending on the output argument

Examples

if (FALSE) {
spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
}