While the gendercoder dictionaries aim to be as comprehensive as
possible, new typos and variations will inevitably occur in wild data.
Moreover, at present the dictionaries are limited to English-language
data that the authors have had access to. As such, if you are collecting
data, you will at some point want to add to the dictionaries or create
your own (and if so, we strongly encourage contributions, either as a
pull request via GitHub or by raising an issue so the team can help).
Let’s say I have free-text gender data, but some of it is not in English.
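library(gendercoder)
library(dplyr)

# Example data: free-text responses, including typos and non-English terms
df <- tibble(gender = c("male", "enby", "womn", "mlae", "mann",
                        "frau", "femme", "homme", "nin"))
df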
#> # A tibble: 9 × 1
#> gender
#> <chr>
#> 1 male
#> 2 enby
#> 3 womn
#> 4 mlae
#> 5 mann
#> 6 frau
#> 7 femme
#> 8 homme
#> 9 nin
I can create a new dictionary by creating a named vector, where the
names are the raw, uncoded values, and the values are the desired
outputs. This can then be used as the dictionary in the recode_gender()
function.
new_dictionary <- c(
  mann  = "man",
  frau  = "woman",
  femme = "woman",
  homme = "man",
  nin   = "man"
)
df %>%
  mutate(recoded_gender = recode_gender(gender,
                                        dictionary = new_dictionary,
                                        retain_unmatched = TRUE))
#> Some results not matched from the dictionary have been filled with the original values.
#> # A tibble: 9 × 2
#> gender recoded_gender
#> <chr> <chr>
#> 1 male male
#> 2 enby enby
#> 3 womn womn
#> 4 mlae mlae
#> 5 mann man
#> 6 frau woman
#> 7 femme woman
#> 8 homme man
#> 9 nin man
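To make the named-vector idea concrete, here is a minimal base R sketch of how such a lookup could work. It is an illustration only, not the implementation of recode_gender(): the helper name lookup_sketch() is hypothetical, and the lower-casing and whitespace trimming are assumptions made for the example.

```r
# Hypothetical helper illustrating a named-vector lookup.
# This is NOT gendercoder's code, only a sketch of the idea.
lookup_sketch <- function(x, dictionary, retain_unmatched = FALSE) {
  # Assumption: normalise case and surrounding whitespace before matching
  cleaned <- tolower(trimws(x))
  # Index the dictionary by name; names that are absent yield NA
  recoded <- unname(dictionary[cleaned])
  if (retain_unmatched) {
    # Fill unmatched entries with the original responses
    unmatched <- is.na(recoded)
    recoded[unmatched] <- x[unmatched]
  }
  recoded
}

new_dictionary <- c(mann = "man", frau = "woman", femme = "woman")

lookup_sketch(c("Frau", "mann ", "enby"), new_dictionary)
#> [1] "woman" "man"   NA

lookup_sketch(c("Frau", "mann ", "enby"), new_dictionary,
              retain_unmatched = TRUE)
#> [1] "woman" "man"   "enby"
```

Because the dictionary is an ordinary named character vector, anything that works on vectors (indexing, concatenation, saving to disk) works on it too.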
However, using just this new dictionary leaves several responses (male, enby, womn and mlae) uncoded, even though the built-in dictionaries could handle them. As the dictionaries are just named vectors, we can simply concatenate them to use both at the same time.
We can do this in-line…
df %>%
  mutate(recoded_gender = recode_gender(gender,
                                        dictionary = c(manylevels_en, new_dictionary),
                                        retain_unmatched = TRUE))
#> # A tibble: 9 × 2
#> gender recoded_gender
#> <chr> <chr>
#> 1 male man
#> 2 enby non-binary
#> 3 womn woman
#> 4 mlae man
#> 5 mann man
#> 6 frau woman
#> 7 femme woman
#> 8 homme man
#> 9 nin man
Alternatively, we can create a new dictionary and call it later, which is useful if you want to save an augmented dictionary for later use or contribute it to the package.
manylevels_plus <- c(manylevels_en, new_dictionary)
df %>%
  mutate(recoded_gender = recode_gender(gender,
                                        dictionary = manylevels_plus,
                                        retain_unmatched = TRUE))
#> # A tibble: 9 × 2
#> gender recoded_gender
#> <chr> <chr>
#> 1 male man
#> 2 enby non-binary
#> 3 womn woman
#> 4 mlae man
#> 5 mann man
#> 6 frau woman
#> 7 femme woman
#> 8 homme man
#> 9 nin man
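Before saving or sharing a combined dictionary, it can be worth checking whether any added entry reuses a name that already exists with a different coding. How recode_gender() resolves duplicated names is not covered here, so the base R sketch below treats any disagreement as something to review by hand. The small vectors are stand-ins for manylevels_en and new_dictionary, and the entry "m" is a made-up conflict for illustration.

```r
# Stand-ins for manylevels_en and new_dictionary (illustrative values only)
base_dictionary <- c(male = "man", enby = "non-binary", m = "man")
new_dictionary  <- c(mann = "man", frau = "woman", m = "male")

combined <- c(base_dictionary, new_dictionary)

# Names that occur more than once in the combined dictionary
dup_names <- unique(names(combined)[duplicated(names(combined))])

# Keep only the duplicated names whose codings disagree
conflicts <- dup_names[vapply(
  dup_names,
  function(nm) length(unique(combined[names(combined) == nm])) > 1,
  logical(1)
)]
conflicts
#> [1] "m"

# Save the dictionary for later use and read it back
path <- tempfile(fileext = ".rds")
saveRDS(combined, path)
restored <- readRDS(path)
identical(restored, combined)
#> [1] TRUE
```

Using saveRDS() keeps the names and values together in one file, so the restored object can be passed straight to the dictionary argument.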
Let’s say you are happy with your manylevels_plus dictionary and think
it should be part of the manylevels_en dictionary in the package. All
you need to do is fork the gendercoder repo, clone it to your local
device, rename your vector to manylevels_en, and use the
usethis::use_data() function to overwrite the manylevels_en dictionary.
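The code for this step is not reproduced in this text, so the following is a minimal sketch of what it might look like rather than the maintainers' exact script. It assumes you run it from the root of your cloned fork and that usethis is installed. The small vectors are stand-ins so the snippet runs on its own, and the guard around usethis::use_data() is an addition that skips the write when you are not inside a package directory.

```r
# Stand-ins so the sketch is self-contained (assumption: in your fork you
# would already have the real manylevels_en and your new_dictionary loaded)
manylevels_en   <- c(male = "man", enby = "non-binary")
new_dictionary  <- c(mann = "man", frau = "woman")
manylevels_plus <- c(manylevels_en, new_dictionary)

# Rename the augmented vector so it replaces the packaged dictionary
manylevels_en <- manylevels_plus

# Overwrite the stored dataset; only attempted when usethis is available
# and the working directory looks like a package (has a DESCRIPTION file)
if (requireNamespace("usethis", quietly = TRUE) && file.exists("DESCRIPTION")) {
  usethis::use_data(manylevels_en, overwrite = TRUE)
}
```

Setting overwrite = TRUE is needed because a dataset with that name already exists in the package.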
Once you’ve pushed the changes to your fork, you can make a pull request. Please tell us what you’re adding so we know what to look out for and how to test it.