Data Encryption

The scenario:

A group of people is working on a sensitive data set that, for practical reasons, needs to be stored somewhere whose security we’re not 100% happy with (e.g., Dropbox), or we’re concerned that files stored in plain text on users’ computers (e.g., laptops) may lead to the data being compromised.

If the data can be stored encrypted but everyone in the group can still read and write the data then we’ve improved the situation somewhat. But organising for everyone to get a copy of the key to decrypt the data files is non-trivial. The workflow described here aims to simplify this procedure using lower-level functions in the cyphr package.

The general procedure is this:

  1. A person will set up a set of personal keys and a key for the data. The data key will be encrypted with their personal key so they have access to the data but nobody else does. At this point the data can be encrypted.

  2. Additional users set up personal keys and request access to the data. Anyone with access to the data can grant access to anyone else.

Before doing any of this, everyone needs to have ssh keys set up. By default the package will use your ssh keys found at “~/.ssh”; see the main package vignette for how to use this.

For clarity here we will generate a key pair for each of two actors, Alice and Bob:

path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)

These would ordinarily be on different machines (nobody has access to anyone else’s private key) and they would be password protected. In the function calls below, all the path_user arguments would be omitted.

We’ll store data in the directory data; at present there is nothing there (this is in a temporary directory for compliance with CRAN policies but would ordinarily be somewhere persistent and, ideally, under version control).

data_dir <- file.path(tempdir(), "data")
dir.create(data_dir)
dir(data_dir)
## character(0)

First, create a personal set of keys. These will be shared across all projects and stored away from the data. Ideally one would do this with ssh-keygen at the command line, following one of the many guides available. A utility function ssh_keygen (which simply calls ssh-keygen for you) is available in this package though. You will need to generate a key on each computer you want access from. Don’t copy the key around. If you lose your user key you will lose access to the data!

Second, create a key for the data and encrypt that key with your personal key. Note that the data key is never stored directly - it is always stored encrypted by a personal key.

cyphr::data_admin_init(data_dir, path_user = path_key_alice)
## Generating data key
## Authorising ourselves
## Adding key 18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
##   user: root
##   host: e453a55c7d77
##   date: 2024-10-28 06:06:38.20521
## Verifying
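To illustrate what "stored encrypted by a personal key" means, here is a minimal sketch of key wrapping written directly against the openssl and sodium packages. It is not taken from cyphr's source. The variable names are invented for the illustration, and a freshly generated RSA key stands in for a user's ssh key.

```r
# Illustrative sketch only: wrap a symmetric data key with an RSA public key.
user_key <- openssl::rsa_keygen(bits = 2048)  # stand-in for a personal private key
user_pub <- user_key$pubkey                   # the matching public key

data_key <- sodium::keygen()                  # 32 random bytes: the symmetric data key

# Only this wrapped form would ever be written to disk
wrapped <- openssl::rsa_encrypt(data_key, user_pub)

# The holder of the private key can recover the original data key
unwrapped <- openssl::rsa_decrypt(wrapped, user_key)
identical(unwrapped, data_key)
```

Because only the wrapped bytes are stored, someone who can read the data directory but does not hold a matching private key learns nothing useful about the data key.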

The data key is very important. If it is deleted, then the data cannot be decrypted. So do not delete the directory data_dir/.cyphr! Ideally add it to your version control system so that it cannot be lost. Of course, if you’re working in a group, there are multiple copies of the data key (each encrypted with a different person’s personal key) which reduces the chance of total loss.

This command can be run multiple times safely; if it detects that it has already been run, the data key will not be regenerated.

cyphr::data_admin_init(data_dir, path_user = path_key_alice)
## Already set up at /tmp/RtmprgIrZh/data
## Verifying

Third, you can add encrypted data to the directory (or to anywhere really). When run, cyphr::data_key will verify that it can actually decrypt things.

key <- cyphr::data_key(data_dir, path_user = path_key_alice)

This object can be used with all the cyphr functions (see the “cyphr” vignette; vignette("cyphr")).

filename <- file.path(data_dir, "iris.rds")
cyphr::encrypt(saveRDS(iris, filename), key)
dir(data_dir)
## [1] "iris.rds"

The file is encrypted and so cannot be read with readRDS:

readRDS(filename)
## Error in readRDS(filename): unknown input format

But we can decrypt and read it:

head(cyphr::decrypt(readRDS(filename), key))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
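The same kind of key object should work with cyphr's other helpers, not only with saveRDS and readRDS. The following is a self-contained sketch under that assumption; demo_dir, demo_key and the file name iris.csv are names made up for this example, and a throwaway key pair is generated so that the block does not touch the objects created above.

```r
# Throwaway key pair and data directory so the example stands alone
path_key_demo <- cyphr::ssh_keygen(password = FALSE)
demo_dir <- tempfile()
dir.create(demo_dir)
cyphr::data_admin_init(demo_dir, path_user = path_key_demo)
demo_key <- cyphr::data_key(demo_dir, path_user = path_key_demo)

# Strings are encrypted to (and decrypted from) raw vectors
secret <- cyphr::encrypt_string("attack at dawn", demo_key)
cyphr::decrypt_string(secret, demo_key)

# Arbitrary R objects work the same way
blob <- cyphr::encrypt_object(list(a = 1, b = "two"), demo_key)
str(cyphr::decrypt_object(blob, demo_key))

# Other file writers and readers can be wrapped like saveRDS/readRDS
path_csv <- file.path(demo_dir, "iris.csv")
cyphr::encrypt(write.csv(iris, path_csv), demo_key)
dat <- cyphr::decrypt(read.csv(path_csv), demo_key)
nrow(dat)
```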

Fourth, have someone else join in. Recall that to simulate another person here, I’m going to pass the argument path_user = path_key_bob through to the functions. This contains the path to Bob’s ssh keypair. If run on an actually different computer this would not be needed; it is just to simulate two users in a single session for this vignette (see the minimal example below, where this is also simulated). Again, typically this user would not use the cyphr::ssh_keygen function but would use the ssh-keygen command from their shell.

We’re going to assume that the user can read and write to the data. This is the case for my use case, where the data are stored on Dropbox, and will be the case with GitHub-based distribution, though there would be a pull request step in here.

This user cannot read the data, though trying to will print a message explaining how you might request access:

key_bob <- cyphr::data_key(data_dir, path_user = path_key_bob)

But Bob is your collaborator and needs access! What they need to do is run:

cyphr::data_request_access(data_dir, path_user = path_key_bob)
## A request has been added
## Email someone with access to add you
## 
##     hash: 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## 
## If you are using git, you will need to commit and push first:
## 
##     git add .cyphr
##     git commit -m "Please add me to the dataset"
##     git push

(again, ordinarily you would not need the path_user = path_key_bob argument here)

The user should then send an email to someone with access and quote the hash in the message above.

Fifth, back on the first computer we can authorise the second user. First, see who has requested access:

req <- cyphr::data_admin_list_requests(data_dir)
req
## 1 key:
##   9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##     user: root
##     host: e453a55c7d77
##     date: 2024-10-28 06:06:38.314863

We can see the same hash here as above (9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12).

…and then grant access to them with the cyphr::data_admin_authorise function.

cyphr::data_admin_authorise(data_dir, yes = TRUE, path_user = path_key_alice)
## There is 1 request for access
## Adding key 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##   user: root
##   host: e453a55c7d77
##   date: 2024-10-28 06:06:38.314863
## Added 1 key
## If you are using git, you will need to commit and push:
## 
##     git add .cyphr
##     git commit -m "Authorised root"
##     git push

If you do not specify yes = TRUE, the function will prompt for confirmation at each key added.

This has cleared the request queue:

cyphr::data_admin_list_requests(data_dir)
## (empty)

and added it to our set of keys:

cyphr::data_admin_list_keys(data_dir)
## 2 keys:
##   18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
##     user: root
##     host: e453a55c7d77
##     date: 2024-10-28 06:06:38.20521
##   9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##     user: root
##     host: e453a55c7d77
##     date: 2024-10-28 06:06:38.314863

Finally, as soon as the authorisation has happened, the user can encrypt and decrypt files:

key_bob <- cyphr::data_key(data_dir, path_user = path_key_bob)
head(cyphr::decrypt(readRDS(filename), key_bob))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Minimal example

As above, but with less discussion:

Setup, on Alice’s computer:

cyphr::data_admin_init(data_dir, path_user = path_key_alice)
## Generating data key
## Authorising ourselves
## Adding key 18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
##   user: root
##   host: e453a55c7d77
##   date: 2024-10-28 06:06:38.419023
## Verifying

Get the data key:

key <- cyphr::data_key(data_dir, path_user = path_key_alice)

Encrypt a file:

cyphr::encrypt(saveRDS(iris, filename), key)

Request access, on Bob’s computer:

hash <- cyphr::data_request_access(data_dir, path_user = path_key_bob)
## A request has been added
## Email someone with access to add you
## 
##     hash: 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## 
## If you are using git, you will need to commit and push first:
## 
##     git add .cyphr
##     git commit -m "Please add me to the dataset"
##     git push

Alice authorises this request:

cyphr::data_admin_authorise(data_dir, yes = TRUE, path_user = path_key_alice)
## There is 1 request for access
## Adding key 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##   user: root
##   host: e453a55c7d77
##   date: 2024-10-28 06:06:38.464247
## Added 1 key
## If you are using git, you will need to commit and push:
## 
##     git add .cyphr
##     git commit -m "Authorised root"
##     git push

Bob can get the data key:

key <- cyphr::data_key(data_dir, path_user = path_key_bob)

Bob can read the secret data:

head(cyphr::decrypt(readRDS(filename), key))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Details & disclosure

Encryption does not rely on security through obscurity; it works because we can trust the underlying maths enough to be open about how and where things are stored.

Most encryption libraries require some degree of security in the underlying software. Because of the way R works this is very difficult to guarantee; it is trivial to rewrite code in running packages to skip past verification checks. So this package is not designed to (or able to) avoid exploits in your running code; an attacker could intercept your private keys, the private key to the data, or skip the verification checks that are used to make sure that the keys you load are what they say they are. However, the data are safe; only people who have keys to the data will be able to read it.

cyphr uses two different encryption algorithms; it uses RSA encryption via the openssl package for user keys, because there is a common file format for these keys so it makes user configuration easier. It uses the modern sodium package (and through that the libsodium library) for data encryption because it is very fast and simple to work with. This does leave two possible points of weakness as a vulnerability in either of these libraries could lead to an exploit that could allow decryption of your data.

Each user has a public/private key pair. Typically this is in ~/.ssh/id_rsa.pub and ~/.ssh/id_rsa, and if found these will be used. Alternatively the keypair can be stored elsewhere and pointed at with the USER_KEY or USER_PUBKEY environment variables. The key may be password protected (and this is recommended!) and the password will be requested without ever echoing it to the terminal.
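As a sketch of the environment-variable route, the block below points USER_KEY and USER_PUBKEY at a throwaway key pair and then calls the data functions without path_user. It assumes that the variables may hold the paths of the key files themselves and that cyphr::ssh_keygen writes files named id_rsa and id_rsa.pub into the directory it returns; demo_dir and demo_key are names made up for this example.

```r
# Throwaway key pair; assumed to contain files "id_rsa" and "id_rsa.pub"
path_key_demo <- cyphr::ssh_keygen(password = FALSE)

# Remember the current values so they can be restored afterwards
old <- Sys.getenv(c("USER_KEY", "USER_PUBKEY"), unset = NA)

Sys.setenv(USER_KEY = file.path(path_key_demo, "id_rsa"),
           USER_PUBKEY = file.path(path_key_demo, "id_rsa.pub"))

demo_dir <- tempfile()
dir.create(demo_dir)
cyphr::data_admin_init(demo_dir)       # note: no path_user argument
demo_key <- cyphr::data_key(demo_dir)  # keys are located via the variables

# Put the environment back the way it was
for (nm in names(old)) {
  if (is.na(old[[nm]])) {
    Sys.unsetenv(nm)
  } else {
    do.call(Sys.setenv, as.list(old[nm]))
  }
}
```

In practice these variables would more likely be set once, for example in a shell profile or ~/.Renviron, rather than inside a script.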

The data directory has a hidden directory .cyphr in it.

dir(data_dir, all.files = TRUE, no.. = TRUE)
## [1] ".cyphr"   "iris.rds"

This does not actually need to be stored with the data but it makes sense to (there are workflows where data is stored remotely where storing this directory might make sense). The “keys” directory contains a number of files; one for each person who has access to the data.

dir(file.path(data_dir, ".cyphr", "keys"))
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
## [2] "9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12"
names(cyphr::data_admin_list_keys(data_dir))
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
## [2] "9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12"

(the file test is a small file encrypted with the data key used to verify everything is working OK).

Each file is stored in RDS format and is a list with elements:

  • user: the reported user name of the person who created the request for access to the data
  • host: the reported computer name
  • date: the time the request was generated
  • pub: the RSA public key of the user
  • key: the data key, encrypted with the user key. Without the private key, this cannot be used. With the user’s private key this can be used to generate the symmetric key to the data.

h <- names(cyphr::data_admin_list_keys(data_dir))[[1]]
readRDS(file.path(data_dir, ".cyphr", "keys", h))
## $user
## [1] "root"
## 
## $host
## [1] "e453a55c7d77"
## 
## $date
## [1] "2024-10-28 06:06:38 UTC"
## 
## $pub
## [2048-bit rsa public key]
## md5: 77de30b70a2f9b728d7b5725ebbe20f4
## sha256: 18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3
## 
## $key
##   [1] 24 c8 9d 10 58 3f 92 39 c2 ad f1 9c 35 1c d1 55 10 5a 97 1d f4 69 3e 6b b2
##  [26] 43 a5 c0 af 76 a8 96 57 96 5a 51 bb 57 a3 d0 14 01 d7 4c 9f 41 99 e0 03 4c
##  [51] 50 fc ae 08 dc 34 36 b4 28 1b 9f 04 62 e8 d7 c9 67 a6 cb 4b b9 fd 9f 52 ba
##  [76] 90 9d 9a b0 7d 64 cf aa 56 e0 9d 77 b8 cd 6e 52 ac ef da e9 4d 6a e4 76 52
## [101] f8 d8 1d 7f c5 f2 bc d1 10 18 1b db 77 35 39 4e 8a 20 e9 77 fc 03 45 ec de
## [126] d6 65 3a c4 da 6d 83 ef 17 df 85 24 5c f4 c0 62 d0 46 0c 82 11 16 73 25 b3
## [151] ad 76 eb 13 85 24 24 e0 de d5 cb a1 78 e8 2e 08 f0 ae a5 3c eb b8 8d e3 53
## [176] 32 bd 6f 09 56 5a 6a c8 eb 20 ab e1 fe 1e 1c b3 29 65 97 b5 71 da 8e b6 32
## [201] a9 d3 1a 0a 75 78 c7 7c a9 74 3b 80 53 2b 42 b4 56 39 b7 3c ef 10 67 bf 34
## [226] 73 85 a5 fa 16 38 2a ba a8 63 09 9b ae 0a 67 07 af 12 7a 60 63 7c 63 12 53
## [251] cc 71 e5 3e cc e0

You can see that the hash of the public key is the same as the name of the stored file here (which is used to prevent collisions when multiple people request access at the same time).

h
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
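The correspondence between the file name and the public key can also be checked by hand. The sketch below assumes that the hash is the SHA-256 fingerprint of the stored public key, which is what the sha256 line in the printed $pub element suggests; demo_dir, demo_hash and entry are names made up for this example.

```r
# Self-contained setup with a throwaway key pair and data directory
path_key_demo <- cyphr::ssh_keygen(password = FALSE)
demo_dir <- tempfile()
dir.create(demo_dir)
cyphr::data_admin_init(demo_dir, path_user = path_key_demo)

# Name of the (only) key file, and its contents
demo_hash <- names(cyphr::data_admin_list_keys(demo_dir))[[1]]
entry <- readRDS(file.path(demo_dir, ".cyphr", "keys", demo_hash))

# Recompute the fingerprint of the stored public key as a hex string
fp <- openssl::fingerprint(entry$pub, hashfun = openssl::sha256)
fp_hex <- paste(as.character(unclass(fp)), collapse = "")

identical(fp_hex, demo_hash)
```

A check like this is also a reasonable way for someone granting access to confirm that the hash quoted in an email really belongs to the public key sitting in the request.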

When a request is posted it is an RDS file with all of the above except for the key element, which is added during authorisation.

(Note that the verification relies on the package code not being attacked, and given R’s highly dynamic nature an attacker could easily swap out the definition for the verification function with something that always returns TRUE.)

When an authorised user creates the data_key object (which allows decryption of the data) cyphr will:

  • read their private user key (probably from ~/.ssh/id_rsa)
  • read the encrypted data key from the data directory (the $key element from the list above).
  • decrypt this data key using their user key to yield the data symmetric key.
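The three steps above can be reproduced by hand as a rough sketch. This is not cyphr's own code: it assumes that the $key element is a plain RSA encryption of a 32-byte sodium key (its 256-byte length in the listing above is consistent with a 2048-bit RSA ciphertext) and that the private key file is called id_rsa. The names demo_dir, entry, sym and manual_key are made up for this example.

```r
# Self-contained setup with a throwaway key pair and data directory
path_key_demo <- cyphr::ssh_keygen(password = FALSE)
demo_dir <- tempfile()
dir.create(demo_dir)
cyphr::data_admin_init(demo_dir, path_user = path_key_demo)

# 1. read the private user key
private <- openssl::read_key(file.path(path_key_demo, "id_rsa"))

# 2. read the encrypted data key from the data directory
demo_hash <- names(cyphr::data_admin_list_keys(demo_dir))[[1]]
entry <- readRDS(file.path(demo_dir, ".cyphr", "keys", demo_hash))

# 3. decrypt it with the user key to get the symmetric key (assumed plain RSA)
sym <- openssl::rsa_decrypt(entry$key, private)
manual_key <- cyphr::key_sodium(sym)

# If the assumptions hold, the hand-built key decrypts what the
# key returned by cyphr::data_key encrypted
official_key <- cyphr::data_key(demo_dir, path_user = path_key_demo)
secret <- cyphr::encrypt_string("hello", official_key)
cyphr::decrypt_string(secret, manual_key)
```

In normal use there is no reason to do this manually; cyphr::data_key also runs the verification step described earlier.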

Limitations

In the Dropbox scenario, non-password protected keys will afford only limited protection. This is because even though the keys and data are stored separately on Dropbox, they will be in the same place on a local computer; if that computer is lost then the only thing preventing an attacker recovering the data is security through obscurity (the data would appear to be random junk but they will be able to run your analysis scripts as easily as you can). Password protected keys will improve this situation considerably as without a password the data cannot be recovered.
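To make the effect of a password concrete, the sketch below uses the openssl package directly (not cyphr, and not ssh-keygen) to write a password-protected private key and shows that it can only be loaded with the right passphrase. The file name and both passphrases are made up for this example.

```r
# Generate a key and store it encrypted under a passphrase
key <- openssl::rsa_keygen()
key_file <- tempfile(fileext = ".pem")
openssl::write_pem(key, key_file, password = "correct horse battery staple")

# Loading works when the right passphrase is supplied
loaded <- openssl::read_key(key_file, password = "correct horse battery staple")

# With the wrong passphrase the key cannot be loaded at all;
# the error is caught here and turned into NULL
wrong <- tryCatch(
  openssl::read_key(key_file, password = "not the passphrase"),
  error = function(e) NULL
)
is.null(wrong)
```

Someone who obtains the key file from a lost laptop is therefore still one passphrase away from the data key, which is the improvement described above.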

The data is not encrypted during a running R session. R allows arbitrary modification of code at runtime so this package provides no security from the point where the data can be decrypted. If your computer was compromised then stealing the data while you are running R should be assumed to be straightforward.