This package is an R wrapper around spaCy, the “industrial strength natural language processing” Python library from http://spacy.io.

Installing the package

  1. Install miniconda

    The easiest way to install spaCy and spacyr is through the auto-installation function in the spacyr package. This function uses a conda environment, so some version of conda must be installed on the system. You can install miniconda from https://conda.io/miniconda.html (choose the 64-bit version for your system).

    If you already have any version of conda, you can skip this step. To check, enter conda --version at the command line.

  2. Install the spacyr R package:

    • From GitHub:

      To install the latest version of the package from source, run:

      devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)

    • From CRAN:

      To install the released version, run install.packages("spacyr").

  3. Install spaCy in a conda environment

    • On Windows, you need to run R as an administrator for the installation to work properly. To do so, right-click the RStudio (or R) desktop icon and select “Run as administrator” when launching R.

    • To install spaCy, load spacyr and run spacy_install().

    This installs the latest version of spaCy (with its required packages) and the English language model. After installation, you can initialize spaCy in R with spacy_initialize().

    If spaCy was installed with this method, spacy_initialize() prints a message confirming a successful initialization, including the spaCy version and the language model.

  4. (optional) Add more language models

    For spaCy installed by spacy_install(), spacyr provides a helper function to install additional language models. For instance, to install the German language model, call spacy_download_langmodel("de").

    (Again, Windows users have to run this command as an administrator. Otherwise, creating the symlink to the language model will fail.)
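
The installation steps above can be sketched as a short R script. This is a minimal sketch under stated assumptions, not the package's official procedure: has_conda() and setup_spacyr() are hypothetical helper names introduced here, the PATH check is a simplification of step 1, and spacy_download_langmodel() is assumed to be the helper that step 4 refers to. spacyr itself must already be installed (step 2).

```r
# Step 1 check: is a conda executable visible on the PATH? (base R only)
has_conda <- function() {
  nzchar(unname(Sys.which("conda")))
}

# Hypothetical helper wrapping steps 3 and 4. It is only defined here, not
# called, because running it downloads spaCy and its language models.
setup_spacyr <- function(extra_model = NULL) {
  if (!has_conda()) {
    stop("conda was not found on the PATH; install miniconda first")
  }
  spacyr::spacy_install()        # spaCy plus the English language model
  if (!is.null(extra_model)) {
    # Assumed helper for additional language models, e.g. "de" for German
    spacyr::spacy_download_langmodel(extra_model)
  }
  spacyr::spacy_initialize()     # connect R to the new installation
}
```

Calling setup_spacyr("de") would then install spaCy, add the German model, and initialize the session. On Windows, it would need to be called from an R session started as administrator.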

Comments and feedback

We welcome your comments and feedback. Please file issues on the issues page, or send us your comments by email.

A walkthrough of spacyr

Starting a spacyr session

To allow R to access the underlying Python functionality, spacyr must open a connection to Python by being initialized within your R session.

We provide a function for this, spacy_initialize(), which attempts to make this process as painless as possible by searching your system for Python executables and testing which of them have spaCy installed. For power users (such as those with multiple installations of Python), it is possible to specify the path manually through the python_executable argument, which also makes initialization faster. (The path to the Python executable differs from system to system, so supply the value that applies to yours.)
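
A minimal sketch of the manual route described above, assuming the python_executable argument behaves as stated there. The path is a hypothetical placeholder, and the guard simply skips initialization when spacyr or that executable is missing.

```r
# Hypothetical location of a Python executable that has spaCy installed;
# replace it with the path on your own system.
python_path <- "/usr/local/bin/python"

# Initialize only when both spacyr and that executable are actually present.
if (requireNamespace("spacyr", quietly = TRUE) && file.exists(python_path)) {
  spacyr::spacy_initialize(python_executable = python_path)
} else {
  message("Skipping: spacyr or ", python_path, " is not available")
}
```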

Tokenizing and tagging texts

spacy_parse() is spacyr’s main function. It calls spaCy both to tokenize and to tag the texts. It provides two options for part-of-speech tagging, plus options to return word lemmas, recognized entities, and dependency parses. It returns a data.frame corresponding to the emerging text interchange format for token data.frames.

The approach to tokenizing taken by spaCy is inclusive: it includes all tokens without restrictions, including punctuation characters and symbols.

Example:
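
A minimal sketch of such a call, assuming spacyr and spaCy are installed as described earlier. The two texts are illustrative, the argument names follow the options listed above, and the parsing step is skipped when spacyr is not available.

```r
# Two short example texts; the element names become the doc_id values.
txt <- c(d1 = "spaCy excels at large-scale information extraction tasks.",
         d2 = "Mr. Smith goes to North Carolina.")

# Parse only when spacyr is installed (spaCy itself must also be available).
if (requireNamespace("spacyr", quietly = TRUE)) {
  spacyr::spacy_initialize()
  # Request both tagsets, lemmas, and entities; one row per token is returned.
  parsed <- spacyr::spacy_parse(txt, pos = TRUE, tag = TRUE,
                                lemma = TRUE, entity = TRUE)
  print(head(parsed))
}
```

Because tokenization is inclusive, punctuation such as the final periods would appear as rows of their own in the result.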

Two fields are available for part-of-speech tags. The pos field contains the Universal tagset for parts of speech, a general scheme that most users will find serves their needs and that also provides equivalencies across languages. The tag field contains a more detailed tagset, defined in each spaCy language model. For English, this is the OntoNotes 5 version of the Penn Treebank tag set.

For the German language model, the Universal tagset (pos) remains the same, but the detailed tagset (tag) is the TIGER Treebank scheme.

Using other language models

By default, spacyr loads an English language model. You can also load spaCy’s other language models, or use one of the language models with alpha support, by specifying the model option when calling spacy_initialize(). We have successfully tested the following language models with spaCy version 2.0.1.

Language      ModelName
German        de
Spanish       es
Portuguese    pt
French        fr
Italian       it
Dutch         nl

This is an example of parsing German texts.

## first finalize the spacy if it's loaded
spacy_finalize()
spacy_initialize(model = "de")
## Python space is already attached.  If you want to switch to a different Python, please restart R.
## successfully initialized (spaCy Version: 2.0.12, language model: de)
## (python options: type = "condaenv", value = "spacy_condaenv")

txt_german <- c(R = "R ist eine freie Programmiersprache für statistische Berechnungen und Grafiken. Sie wurde von Statistikern für Anwender mit statistischen Aufgaben entwickelt.",
               python = "Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache. Sie will einen gut lesbaren, knappen Programmierstil fördern.")
results_german <- spacy_parse(txt_german, dependency = TRUE, lemma = FALSE, tag = TRUE)
results_german
##    doc_id sentence_id token_id              token   pos   tag
## 1       R           1        1                  R PROPN    NE
## 2       R           1        2                ist   AUX VAFIN
## 3       R           1        3               eine   DET   ART
## 4       R           1        4              freie   ADJ  ADJA
## 5       R           1        5 Programmiersprache  NOUN    NN
## 6       R           1        6                für   ADP  APPR
## 7       R           1        7       statistische   ADJ  ADJA
## 8       R           1        8       Berechnungen  NOUN    NN
## 9       R           1        9                und  CONJ   KON
## 10      R           1       10           Grafiken  NOUN    NN
## 11      R           1       11                  . PUNCT    $.
## 12      R           2        1                Sie  PRON  PPER
## 13      R           2        2              wurde   AUX VAFIN
## 14      R           2        3                von   ADP  APPR
## 15      R           2        4       Statistikern  NOUN    NN
## 16      R           2        5                für   ADP  APPR
## 17      R           2        6           Anwender  NOUN    NN
## 18      R           2        7                mit   ADP  APPR
## 19      R           2        8      statistischen   ADJ  ADJA
## 20      R           2        9           Aufgaben  NOUN    NN
## 21      R           2       10         entwickelt  VERB  VVPP
## 22      R           2       11                  . PUNCT    $.
## 23 python           1        1             Python  NOUN    NN
## 24 python           1        2                ist   AUX VAFIN
## 25 python           1        3               eine   DET   ART
## 26 python           1        4        universelle   ADJ  ADJA
## 27 python           1        5                  , PUNCT    $,
## 28 python           1        6      üblicherweise   ADV   ADV
## 29 python           1        7     interpretierte   ADJ  ADJA
## 30 python           1        8             höhere   ADJ  ADJA
## 31 python           1        9 Programmiersprache  NOUN    NN
## 32 python           1       10                  . PUNCT    $.
## 33 python           2        1                Sie  PRON  PPER
## 34 python           2        2               will  VERB VMFIN
## 35 python           2        3              einen   DET   ART
## 36 python           2        4                gut   ADJ  ADJD
## 37 python           2        5           lesbaren   ADJ  ADJA
## 38 python           2        6                  , PUNCT    $,
## 39 python           2        7            knappen   ADJ  ADJA
## 40 python           2        8    Programmierstil  NOUN    NN
## 41 python           2        9            fördern  VERB VVFIN
## 42 python           2       10                  . PUNCT    $.
##    head_token_id dep_rel entity
## 1              2      sb       
## 2              2    ROOT       
## 3              5      nk       
## 4              5      nk       
## 5              2      pd       
## 6              5     mnr       
## 7              8      nk       
## 8              6      nk       
## 9              8      cd       
## 10             9      cj       
## 11             2   punct       
## 12             2      sb       
## 13             2    ROOT       
## 14            10     sbp       
## 15             3      nk  LOC_B
## 16             4     mnr       
## 17             5      nk       
## 18            10      mo       
## 19             9      nk       
## 20             7      nk       
## 21             2      oc       
## 22             2   punct       
## 23             2      sb MISC_B
## 24             2    ROOT       
## 25             9      nk       
## 26             9      nk       
## 27             4   punct       
## 28             7      mo       
## 29             4      cj       
## 30             9      nk       
## 31             2      pd       
## 32             2   punct       
## 33             2      sb       
## 34             2    ROOT       
## 35             8      nk       
## 36             5      mo       
## 37             8      nk       
## 38             5   punct       
## 39             5      cj       
## 40             9      oa       
## 41             2      oc       
## 42             2   punct

Note that additional language models must first be installed in spaCy. The German language model, for example, must be installed (for instance with python -m spacy download de) before you call spacy_initialize(model = "de").

When you finish

A background spaCy process is started when you run spacy_initialize(). Because spaCy’s language models are large, this process takes up a lot of memory (typically 1.5GB). When you no longer need the Python connection, you can close it (and terminate the process) by calling spacy_finalize().

By calling spacy_initialize() again, you can restart the spaCy backend.

Permanently setting the default Python

If you want spacyr to skip searching for a Python installation with spaCy, you can permanently set the path to the spaCy-enabled Python in an R startup file (on Mac/Linux, ~/.Rprofile), which is read every time a new R session is launched. You can set the option permanently when you call spacy_initialize():

spacy_initialize(save_profile = TRUE)

Once this is appropriately set up, the message from spacy_initialize() changes to something like:

## spacy python option is already set, spacyr will use:
##  condaenv = "spacy_condaenv"
## successfully initialized (spaCy Version: 2.0.11, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")

To ignore the permanently set options, initialize spaCy with spacy_initialize(refresh_settings = TRUE).

Using spacyr with other packages

Conformity to the Text Interchange Format

The Text Interchange Format (TIF) is an emerging standard structure for text package objects in R, such as corpus and token objects. spacy_parse() accepts a TIF corpus data.frame or a character object as valid input. Moreover, the data.frames returned by spacy_parse() and entity_consolidate() conform to the TIF standard for data.frame tokens objects. This makes spacyr easier to use with any text analysis package for R that works with TIF standard objects.
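
Because the tokens object is an ordinary data.frame, it can also be processed with base R alone. The sketch below uses a hand-built stand-in for spacy_parse() output (the first five rows and six columns of the German example above, typed in rather than computed), so it runs without spaCy. The variable names are illustrative.

```r
# Hand-built stand-in for spacy_parse() output, copied from the German example.
tokens <- data.frame(
  doc_id      = rep("R", 5),
  sentence_id = rep(1L, 5),
  token_id    = 1:5,
  token       = c("R", "ist", "eine", "freie", "Programmiersprache"),
  pos         = c("PROPN", "AUX", "DET", "ADJ", "NOUN"),
  tag         = c("NE", "VAFIN", "ART", "ADJA", "NN"),
  stringsAsFactors = FALSE
)

# Ordinary data.frame operations apply directly.
nouns <- tokens$token[tokens$pos %in% c("NOUN", "PROPN")]  # noun-like tokens
pos_counts <- table(tokens$pos)                            # tokens per POS tag
tokens_per_doc <- aggregate(token ~ doc_id, data = tokens, FUN = length)
```

The same subsetting and counting would work unchanged on a full spacy_parse() result, since it has the same column layout.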