The scenario:
A group of people are working on a sensitive data set that for practical reasons needs to be stored in a place whose security we’re not 100% happy with (e.g., Dropbox), or we’re concerned that files stored in plain text on users’ computers (e.g., laptops) may lead to the data being compromised.
If the data can be stored encrypted but everyone in the group can still read and write the data then we’ve improved the situation somewhat. But organising for everyone to get a copy of the key to decrypt the data files is non-trivial. The workflow described here aims to simplify this procedure using lower-level functions in the cyphr package.
The general procedure is this:
A person will set up a set of personal keys and a key for the data. The data key will be encrypted with their personal key so they have access to the data but nobody else does. At this point the data can be encrypted.
Additional users set up personal keys and request access to the data. Anyone with access to the data can grant access to anyone else.
Before doing any of this, everyone needs to have ssh keys set up. By default the package will use your ssh keys found at “~/.ssh”; see the main package vignette for how to use this.
For clarity here we will generate two key pairs for two actors, Alice and Bob:
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)
These would ordinarily be on different machines (nobody has access to anyone else’s private key) and they would be password protected. In the function calls below, all the path_user arguments would be omitted.
We’ll store data in the directory data; at present there is nothing there (this is in a temporary directory for compliance with CRAN policies but would ordinarily be somewhere persistent and, ideally, under version control).
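The chunk that creates this directory is not shown here. A minimal sketch, assuming the directory sits under tempdir() and is bound to the name data_dir (the name used in the later calls):

```r
# Assumed setup: a "data" directory inside the session's temporary directory
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# List its contents; nothing has been written yet
dir(data_dir)
```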
## character(0)
First, create a personal set of keys. These will be shared across all projects and stored away from the data. Ideally one would do this with ssh-keygen at the command line, following one of the many guides available. A utility function ssh_keygen (which simply calls ssh-keygen for you) is available in this package though. You will need to generate a key on each computer you want access from. Don’t copy the key around. If you lose your user key you will lose access to the data!
Second, create a key for the data and encrypt that key with your personal key. Note that the data key is never stored directly - it is always stored encrypted by a personal key.
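The call for this step is not shown here. A sketch, assuming it is cyphr::data_admin_init with Alice's key passed as path_user; the data directory is a fresh temporary one (an assumption) so that the block runs on its own:

```r
# Stand-in personal key for Alice (no password, for demonstration only)
path_key_alice <- cyphr::ssh_keygen(password = FALSE)

# A fresh, empty data directory
data_dir <- tempfile("data_")
dir.create(data_dir)

# Generate the data key and store it encrypted with Alice's personal key
cyphr::data_admin_init(data_dir, path_user = path_key_alice)
```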
## Generating data key
## Authorising ourselves
## Adding key 18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.20521
## Verifying
The data key is very important. If it is deleted, then the data cannot be decrypted. So do not delete the directory data_dir/.cyphr! Ideally add it to your version control system so that it cannot be lost. Of course, if you’re working in a group, there are multiple copies of the data key (each encrypted with a different person’s personal key) which reduces the chance of total loss.
This command can be run multiple times safely; if it detects that it has already been run, the data key will not be regenerated.
## Already set up at /tmp/RtmprgIrZh/data
## Verifying
Third, you can add encrypted data to the directory (or to anywhere really). When run, cyphr::data_key will verify that it can actually decrypt things.
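A sketch of loading the key, assuming the call is cyphr::data_key (the function used for Bob's key later in this document). The setup lines and the string round trip are additions so that the block runs on its own:

```r
# Assumed setup: a stand-in key for Alice and an initialised data directory
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)

# Load the data key for this user
key <- cyphr::data_key(data_dir, path_user = path_key_alice)

# Round trip a short string to confirm that the key works
secret <- cyphr::encrypt_string("hello", key = key)
cyphr::decrypt_string(secret, key = key)
```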
The resulting key object can be used with all the cyphr functions (see the “cyphr” vignette; vignette("cyphr")):
filename <- file.path(data_dir, "iris.rds")
cyphr::encrypt(saveRDS(iris, filename), key)
dir(data_dir)
## [1] "iris.rds"
The file is encrypted and so cannot be read with readRDS:
## Error in readRDS(filename): unknown input format
But we can decrypt and read it:
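The calls for these two steps are not shown here. A self-contained sketch, assuming the same cyphr::decrypt(readRDS(...), key) pattern that is used for Bob's key later on; the setup lines are additions:

```r
# Assumed setup: stand-in key, initialised data directory, one encrypted file
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)
key <- cyphr::data_key(data_dir, path_user = path_key_alice)
filename <- file.path(data_dir, "iris.rds")
cyphr::encrypt(saveRDS(iris, filename), key)

# Reading the encrypted file directly fails; capture the error
res <- tryCatch(readRDS(filename), error = function(e) e)
inherits(res, "error")

# Wrapping the read in cyphr::decrypt recovers the data
head(cyphr::decrypt(readRDS(filename), key))
```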
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Fourth, have someone else join in. Recall that to simulate another person here, I’m going to pass an argument path_user = path_key_bob through to the functions. This contains the path to “Bob”’s ssh keypair. If run on a genuinely different computer this would not be needed; this is just to simulate two users in a single session for this vignette (see the minimal example below, where this is also simulated). Again, typically this user would not use the cyphr::ssh_keygen function but would use the ssh-keygen command from their shell.
We’re going to assume that the user can read and write to the data. This is true in my use case, where the data are stored on Dropbox, and will be true with GitHub-based distribution, though there would be a pull request step in here.
This user cannot read the data, though trying to will print a message explaining how you might request access:
But bob is your collaborator and needs access! What they need to do is run:
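The request call is not shown here. A sketch, assuming it is cyphr::data_request_access with Bob's key passed as path_user; the setup lines are additions so that the block runs on its own:

```r
# Assumed setup: stand-in keys for both users and a data directory
# that Alice has already initialised
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)

# Bob files a request for access to the data key
cyphr::data_request_access(data_dir, path_user = path_key_bob)
```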
## A request has been added
## Email someone with access to add you
##
## hash: 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##
## If you are using git, you will need to commit and push first:
##
## git add .cyphr
## git commit -m "Please add me to the dataset"
## git push
(again, ordinarily you would not need the bob bit here)
The user should then send an email to someone with access and quote the hash in the message above.
Fifth, back on the first computer we can authorise the second user. First, see who has requested access:
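The listing call is not shown here. A sketch, assuming it is cyphr::data_admin_list_requests and that the result is stored as req (a name chosen for illustration); the setup lines are additions:

```r
# Assumed setup: Alice has initialised the directory, Bob has requested access
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)
cyphr::data_request_access(data_dir, path_user = path_key_bob)

# List pending requests; the names are the hashes of the requesting keys
req <- cyphr::data_admin_list_requests(data_dir)
req
names(req)
```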
## 1 key:
## 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.314863
We can see the same hash here as above (9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12)
…and then grant access to them with the cyphr::data_admin_authorise function.
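The authorising call is not shown here. A sketch, assuming Alice's key is passed as path_user and yes = TRUE is set so that no prompt appears; the setup lines are additions:

```r
# Assumed setup: Alice has initialised the directory, Bob has requested access
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)
cyphr::data_request_access(data_dir, path_user = path_key_bob)

# Alice authorises every pending request; yes = TRUE skips the confirmation
cyphr::data_admin_authorise(data_dir, path_user = path_key_alice, yes = TRUE)
```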
## There is 1 request for access
## Adding key 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.314863
## Added 1 key
## If you are using git, you will need to commit and push:
##
## git add .cyphr
## git commit -m "Authorised root"
## git push
If you do not specify yes = TRUE, the function will prompt for confirmation as each key is added.
This has cleared the request queue:
## (empty)
and added it to our set of keys:
## 2 keys:
## 18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.20521
## 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.314863
Finally, as soon as the authorisation has happened, the user can encrypt and decrypt files:
key_bob <- cyphr::data_key(data_dir, path_user = path_key_bob)
head(cyphr::decrypt(readRDS(filename), key_bob))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
As above, but with less discussion:
Setup, on Alice’s computer:
## Generating data key
## Authorising ourselves
## Adding key 18:77:39:70:e4:7d:54:fe:88:c5:fb:4d:5d:7b:4e:91:1c:47:a8:c6:0f:dd:5d:61:97:91:1f:92:b1:7c:8d:b3
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.419023
## Verifying
Get the data key (key):
Encrypt a file:
Request access, on Bob’s computer:
## A request has been added
## Email someone with access to add you
##
## hash: 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
##
## If you are using git, you will need to commit and push first:
##
## git add .cyphr
## git commit -m "Please add me to the dataset"
## git push
Alice authorises this request:
## There is 1 request for access
## Adding key 9c:f0:8d:c9:6b:1e:05:3e:78:d9:58:4e:d2:5d:77:04:8d:ab:0f:57:c3:a3:27:1d:b7:b5:62:c6:b0:86:3f:12
## user: root
## host: e453a55c7d77
## date: 2024-10-28 06:06:38.464247
## Added 1 key
## If you are using git, you will need to commit and push:
##
## git add .cyphr
## git commit -m "Authorised root"
## git push
Bob can get the data key:
Bob can read the secret data:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
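The calls for the minimal example are not shown here. The steps above can be sketched as one script, assuming the same functions as in the longer walkthrough and a fresh temporary data directory (an assumption, so that the block runs on its own):

```r
# Stand-in key pairs for the two users (normally on separate machines)
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
path_key_bob <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)

# Alice: set up the data key, load it, and encrypt a file
cyphr::data_admin_init(data_dir, path_user = path_key_alice)
key <- cyphr::data_key(data_dir, path_user = path_key_alice)
filename <- file.path(data_dir, "iris.rds")
cyphr::encrypt(saveRDS(iris, filename), key)

# Bob: request access
cyphr::data_request_access(data_dir, path_user = path_key_bob)

# Alice: authorise the request without an interactive prompt
cyphr::data_admin_authorise(data_dir, path_user = path_key_alice, yes = TRUE)

# Bob: load the data key and read the secret data
key_bob <- cyphr::data_key(data_dir, path_user = path_key_bob)
head(cyphr::decrypt(readRDS(filename), key_bob))
```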
Encryption does not rely on security through obscurity; it works because we can rely on the underlying maths enough to be open about how things are stored and where.
Most encryption libraries require some degree of security in the underlying software. Because of the way R works this is very difficult to guarantee; it is trivial to rewrite code in running packages to skip past verification checks. So this package is not designed to (or able to) avoid exploits in your running code; an attacker could intercept your private keys, the private key to the data, or skip the verification checks that are used to make sure that the keys you load are what they say they are. However, the data are safe; only people who have keys to the data will be able to read it.
cyphr uses two different encryption algorithms; it uses RSA encryption via the openssl package for user keys, because there is a common file format for these keys so it makes user configuration easier. It uses the modern sodium package (and through that the libsodium library) for data encryption because it is very fast and simple to work with. This does leave two possible points of weakness as a vulnerability in either of these libraries could lead to an exploit that could allow decryption of your data.
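The two-layer design can be illustrated directly with the openssl and sodium packages. This is a sketch of the general technique (a symmetric data key wrapped with a user's RSA public key), not cyphr's internal code; all object names are chosen for illustration:

```r
# A user's RSA key pair (cyphr would read an existing ssh key instead)
user_key <- openssl::rsa_keygen()
user_pub <- user_key$pubkey

# A random symmetric key for the data, as raw bytes
data_key <- sodium::keygen()

# Encrypt some data with the symmetric key (libsodium, via sodium)
msg <- serialize(head(iris), NULL)
nonce <- sodium::random(24)
cipher <- sodium::data_encrypt(msg, data_key, nonce)

# Encrypt the symmetric key with the user's public RSA key (openssl)
data_key_enc <- openssl::rsa_encrypt(data_key, user_pub)

# To read the data: recover the symmetric key with the private RSA key,
# then decrypt the data with it
data_key_dec <- openssl::rsa_decrypt(data_key_enc, user_key)
plain <- sodium::data_decrypt(cipher, data_key_dec, nonce)
unserialize(plain)
```

Only the small symmetric key goes through RSA, so adding another user means encrypting that key once more with their public key rather than re-encrypting the data.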
Each user has a public/private key pair. Typically this is in ~/.ssh/id_rsa.pub and ~/.ssh/id_rsa, and if found these will be used. Alternatively the location of the keypair can be stored elsewhere and pointed at with the USER_KEY or USER_PUBKEY environment variables. The key may be password protected (and this is recommended!) and the password will be requested without ever echoing it to the terminal.
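A sketch of pointing the package at a key pair through these environment variables. It assumes that each variable holds the path to the corresponding key file and that cyphr::ssh_keygen writes files named id_rsa and id_rsa.pub into the directory it returns; check both against the package documentation before relying on them:

```r
# Stand-in key pair in a temporary directory (no password, demonstration only)
path_key <- cyphr::ssh_keygen(password = FALSE)

# Assumption: the variables hold file paths to the private and public key
Sys.setenv(USER_KEY = file.path(path_key, "id_rsa"),
           USER_PUBKEY = file.path(path_key, "id_rsa.pub"))

data_dir <- tempfile("data_")
dir.create(data_dir)

# No path_user argument: the key pair is located via the environment variables
cyphr::data_admin_init(data_dir)
key <- cyphr::data_key(data_dir)

# Restore the default key lookup
Sys.unsetenv(c("USER_KEY", "USER_PUBKEY"))
```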
The data directory has a hidden directory .cyphr in it.
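The listing call is not shown here. A sketch using base R's dir with hidden files included; the setup lines are additions, so this directory contains only .cyphr and no data file:

```r
# Assumed setup: a stand-in key and a freshly initialised data directory
path_key_alice <- cyphr::ssh_keygen(password = FALSE)
data_dir <- tempfile("data_")
dir.create(data_dir)
cyphr::data_admin_init(data_dir, path_user = path_key_alice)

# all.files = TRUE includes hidden entries; no.. = TRUE drops "." and ".."
dir(data_dir, all.files = TRUE, no.. = TRUE)
```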
## [1] ".cyphr" "iris.rds"
This does not actually need to be stored with the data but it makes sense to (there are workflows where data is stored remotely where storing this directory might make sense). The “keys” directory contains a number of files; one for each person who has access to the data.
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
## [2] "9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12"
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
## [2] "9cf08dc96b1e053e78d9584ed25d77048dab0f57c3a3271db7b562c6b0863f12"
(the file test is a small file encrypted with the data key used to verify everything is working OK).
Each file is stored in RDS format and is a list with elements:
h <- names(cyphr::data_admin_list_keys(data_dir))[[1]]
readRDS(file.path(data_dir, ".cyphr", "keys", h))
## $user
## [1] "root"
##
## $host
## [1] "e453a55c7d77"
##
## $date
## [1] "2024-10-28 06:06:38 UTC"
##
## $pub
## [2048-bit rsa public key]
## md5: 77de30b70a2f9b728d7b5725ebbe20f4
## sha256: 18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3
##
## $key
## [1] 24 c8 9d 10 58 3f 92 39 c2 ad f1 9c 35 1c d1 55 10 5a 97 1d f4 69 3e 6b b2
## [26] 43 a5 c0 af 76 a8 96 57 96 5a 51 bb 57 a3 d0 14 01 d7 4c 9f 41 99 e0 03 4c
## [51] 50 fc ae 08 dc 34 36 b4 28 1b 9f 04 62 e8 d7 c9 67 a6 cb 4b b9 fd 9f 52 ba
## [76] 90 9d 9a b0 7d 64 cf aa 56 e0 9d 77 b8 cd 6e 52 ac ef da e9 4d 6a e4 76 52
## [101] f8 d8 1d 7f c5 f2 bc d1 10 18 1b db 77 35 39 4e 8a 20 e9 77 fc 03 45 ec de
## [126] d6 65 3a c4 da 6d 83 ef 17 df 85 24 5c f4 c0 62 d0 46 0c 82 11 16 73 25 b3
## [151] ad 76 eb 13 85 24 24 e0 de d5 cb a1 78 e8 2e 08 f0 ae a5 3c eb b8 8d e3 53
## [176] 32 bd 6f 09 56 5a 6a c8 eb 20 ab e1 fe 1e 1c b3 29 65 97 b5 71 da 8e b6 32
## [201] a9 d3 1a 0a 75 78 c7 7c a9 74 3b 80 53 2b 42 b4 56 39 b7 3c ef 10 67 bf 34
## [226] 73 85 a5 fa 16 38 2a ba a8 63 09 9b ae 0a 67 07 af 12 7a 60 63 7c 63 12 53
## [251] cc 71 e5 3e cc e0
You can see that the hash of the public key is the same as the name of the stored file here (which is used to prevent collisions when multiple people request access at the same time).
## [1] "18773970e47d54fe88c5fb4d5d7b4e911c47a8c60fdd5d6197911f92b17c8db3"
When a request is posted it is an RDS file with all of the above except for the key element, which is added during authorisation.
(Note that the verification relies on the package code not being attacked, and given R’s highly dynamic nature an attacker could easily swap out the definition for the verification function with something that always returns TRUE.)
When an authorised user creates the data_key object (which allows decryption of the data), the package will:
read the user’s private key (by default ~/.ssh/id_rsa)
read the encrypted data key (the $key element from the list above) and decrypt it with that private key
In the Dropbox scenario, keys that are not password protected will afford only limited protection. This is because even though the keys and data are stored separately on Dropbox, they will be in the same place on a local computer; if that computer is lost then the only thing preventing an attacker recovering the data is security through obscurity (the data would appear to be random junk but they will be able to run your analysis scripts as easily as you can). Password protected keys will improve this situation considerably as without a password the data cannot be recovered.
The data is not encrypted during a running R session. R allows arbitrary modification of code at runtime so this package provides no security from the point where the data can be decrypted. If your computer was compromised then stealing the data while you are running R should be assumed to be straightforward.