Description

Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
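Usage

A usage sketch reconstructed from the argument descriptions below; the default values shown are assumptions and may differ from the installed version of the package:

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)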
Arguments

x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif).
what: the unit for splitting the text. Available alternatives are:
    "word": word segmenter
    "sentence": sentence segmenter (illustrated at the end of Examples below)
remove_punct: remove punctuation tokens.

remove_url: remove tokens that look like a URL or email address.

remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty").
remove_separators: remove spaces as separators; this applies only when all other remove options (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE.
remove_symbols: remove symbols. Symbols are tokens tagged SYM in the pos field, or currency symbols.
padding: if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed; see the sketch after this arguments list.
multithread: logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api).
output: type of the returned object, either "list" or "data.frame"; see the example after the Value section below.
...: not used directly.
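A minimal sketch of how padding interacts with the removal options (the input text is hypothetical, and the exact tokenization depends on the loaded language model):

spacy_initialize()
txt <- "A dog, a cat."
# without padding, removed punctuation tokens simply disappear
spacy_tokenize(txt, remove_punct = TRUE)
# with padding = TRUE, an empty string holds the position of each removed
# token, so indices still align with those of the unfiltered result
spacy_tokenize(txt, remove_punct = TRUE, padding = TRUE)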
Value

Either a list or a data.frame of tokens, depending on the output argument.
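To illustrate the two return types (a sketch; the exact shape of the data.frame is an assumption, so inspect the result with str()):

toks <- spacy_tokenize("Hello world.", output = "list")
str(toks)    # a named list of character vectors, one element per document
tokdf <- spacy_tokenize("Hello world.", output = "data.frame")
str(tokdf)   # a long-form data.frame with one row per token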
Examples

if (FALSE) {
spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
}
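A further sketch showing sentence segmentation with what = "sentence" (the input text is hypothetical; spacy_finalize() releases the spaCy process when done):

if (FALSE) {
spacy_initialize()
txt3 <- "And now for something completely different. It was a dark night."
# returns one element per sentence rather than per word
spacy_tokenize(txt3, what = "sentence")
spacy_finalize()
}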