Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif).
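For instance, a minimal TIF-compliant input is a data.frame with doc_id and text character columns (a sketch with illustrative texts; assumes spacy_initialize() has been called):

tif_corpus <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("First document.", "Second document."),
  stringsAsFactors = FALSE
)
spacy_tokenize(tif_corpus)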
the unit for splitting the text; available alternatives are:
"word": word segmenter
"sentence": sentence segmenter
remove punctuation tokens.
remove tokens that look like a URL or email address.
remove tokens that look like a number (e.g. "334", "3.1415", "fifty").
remove spaces as separators. Spaces can be kept (remove_separators = FALSE) only when all other remove options (e.g. remove_punct) are also set to FALSE. When what = "sentence", this option will remove trailing spaces if TRUE.
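The remove options can be combined; for example (an illustrative sketch with a made-up text, again assuming spacy_initialize() has been called):

txt <- "Visit https://spacy.io for 3 demos!"
spacy_tokenize(txt,
               remove_punct = TRUE,    # drops "!"
               remove_url = TRUE,      # drops the URL token
               remove_numbers = TRUE)  # drops "3"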
remove symbols. Symbols are tokens tagged SYM in the part-of-speech field, or currency symbols.
if TRUE, leave an empty string where the removed tokens
previously existed. This is useful if a positional match is needed between
the pre- and post-selected tokens, for instance if a window of adjacency
needs to be computed.
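A short sketch of how padding preserves token positions (the text is illustrative; assumes spacy_initialize() has been called):

spacy_tokenize("Hello, world!", remove_punct = TRUE, padding = TRUE)
# removed punctuation tokens are returned as "" placeholders,
# so the remaining tokens keep their original positions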
logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api).
type of returned object, either list or data.frame.
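For example (illustrative; the exact column layout of the returned data.frame is an assumption here):

spacy_tokenize("A tiny example.", output = "data.frame")
# returns one row per token rather than a named list of character vectors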
not used directly
either a list or a data.frame of tokens, depending on the output argument
if (FALSE) {
library(spacyr)
spacy_initialize()

# tokenize a single text
txt <- "And now for something completely different."
spacy_tokenize(txt)

# tokenize a named character vector of documents
txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
}
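As a follow-up, the background spaCy process started by spacy_initialize() can be shut down once tokenization is done:

if (FALSE) {
# release the spaCy Python process when no longer needed
spacy_finalize()
}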