Adding to the dictionary

Outline

While the gendercoder dictionaries aim to be as comprehensive as possible, new typos and variations will inevitably turn up in real-world data. Moreover, the dictionaries are currently limited to data the authors have had access to, all of which was collected in English. If you are collecting data, you will therefore at some point want to add to the dictionaries or create your own. If you do, we strongly encourage contributions, either as a pull request on GitHub or by raising an issue so the team can help.

Adding to the dictionary

Let’s say I have free-text gender data, but some of it is not in English.

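library(gendercoder)
library(dplyr)

# Example data, reconstructed from the printed output below
df <- tibble(gender = c("male", "enby", "womn", "mlae", "mann",
                        "frau", "femme", "homme", "nin"))
df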
#> # A tibble: 9 × 1
#>   gender
#>   <chr> 
#> 1 male  
#> 2 enby  
#> 3 womn  
#> 4 mlae  
#> 5 mann  
#> 6 frau  
#> 7 femme 
#> 8 homme 
#> 9 nin

I can create a new dictionary as a named character vector, where the names are the raw, uncoded values and the values are the desired outputs. This vector can then be passed as the dictionary argument of the recode_gender() function.

new_dictionary <- c(
  mann = "man", 
  frau = "woman", 
  femme = "woman", 
  homme = "man", 
  nin = "man")

df %>% 
  mutate(recoded_gender = recode_gender(gender, 
                                        dictionary = new_dictionary, 
                                        retain_unmatched = TRUE))
#> Some results not matched from the dictionary have been filled with the original values.
#> # A tibble: 9 × 2
#>   gender recoded_gender
#>   <chr>  <chr>         
#> 1 male   male          
#> 2 enby   enby          
#> 3 womn   womn          
#> 4 mlae   mlae          
#> 5 mann   man           
#> 6 frau   woman         
#> 7 femme  woman         
#> 8 homme  man           
#> 9 nin    man
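
To make the names-to-values idea concrete, here is a minimal base R sketch of a named-vector lookup with a fallback to the original value. It is an illustration only and not necessarily how recode_gender() is implemented; the lookup and responses objects are hypothetical.

```r
# Base R sketch of a named-vector lookup; not gendercoder's actual implementation.
lookup <- c(mann = "man", frau = "woman")   # names = raw values, values = coded outputs
responses <- c("frau", "male", "mann")      # hypothetical raw responses

# Indexing by name returns NA for any response missing from the lookup
recoded <- unname(lookup[responses])

# Fall back to the original response where no match was found
unmatched <- is.na(recoded)
recoded[unmatched] <- responses[unmatched]

recoded
```

Here recoded ends up as woman, male, man: the two German responses are coded, and the unmatched "male" is kept as-is, mirroring what retain_unmatched = TRUE does in the output above.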

However, as you can see, using just this new dictionary leaves several responses uncoded that the built-in dictionaries could handle. Because the dictionaries are just vectors, we can simply concatenate them to use both at the same time.

We can do this in-line…

df %>% 
  mutate(recoded_gender = recode_gender(gender, 
                                        dictionary = c(manylevels_en, new_dictionary), 
                                        retain_unmatched = TRUE))
#> # A tibble: 9 × 2
#>   gender recoded_gender
#>   <chr>  <chr>         
#> 1 male   man           
#> 2 enby   non-binary    
#> 3 womn   woman         
#> 4 mlae   man           
#> 5 mann   man           
#> 6 frau   woman         
#> 7 femme  woman         
#> 8 homme  man           
#> 9 nin    man

Alternatively, we can create a combined dictionary and call it later, which is useful if you want to save an augmented dictionary for later use or contribute it to the package.

manylevels_plus <- c(manylevels_en, new_dictionary)

df %>% 
  mutate(recoded_gender = recode_gender(gender, 
                                        dictionary = manylevels_plus, 
                                        retain_unmatched = TRUE))
#> # A tibble: 9 × 2
#>   gender recoded_gender
#>   <chr>  <chr>         
#> 1 male   man           
#> 2 enby   non-binary    
#> 3 womn   woman         
#> 4 mlae   man           
#> 5 mann   man           
#> 6 frau   woman         
#> 7 femme  woman         
#> 8 homme  man           
#> 9 nin    man
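
If you want to keep the augmented dictionary between sessions, one option is base R's saveRDS() and readRDS(). The sketch below assumes a small stand-in vector and a temporary file path, both hypothetical; in practice you would save your real manylevels_plus to a location of your choosing.

```r
# Hypothetical stand-in for the combined dictionary built above
manylevels_plus <- c(male = "man", frau = "woman")

# Write the dictionary to disk; a temporary file is used here for illustration only
dictionary_path <- file.path(tempdir(), "manylevels_plus.rds")
saveRDS(manylevels_plus, dictionary_path)

# In a later session, read it back; names and values are preserved
restored <- readRDS(dictionary_path)
identical(restored, manylevels_plus)
```

The restored vector can then be passed to recode_gender() through the dictionary argument, exactly as before.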

Making it official

Let’s say you are happy with your manylevels_plus dictionary and think it should be part of the manylevels_en dictionary in the package. Fork the gendercoder repo, clone it to your local machine, then assign your vector to manylevels_en and use the usethis::use_data() function to overwrite the packaged dictionary, as shown below.

manylevels_en <- manylevels_plus
usethis::use_data(manylevels_en, overwrite = TRUE)
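
Before overwriting the packaged data, it may be worth checking whether any of your keys already exist in the built-in dictionary with a different coded value. The following is a minimal base R sketch using hypothetical stand-in vectors (builtin and additions) rather than the real dictionaries; it makes no assumption about how recode_gender() resolves duplicate names.

```r
# Hypothetical stand-ins for the built-in dictionary and your additions
builtin <- c(male = "man", enby = "non-binary", femme = "woman")
additions <- c(frau = "woman", femme = "woman", enby = "nonbinary")

# Keys defined in both vectors
shared <- intersect(names(builtin), names(additions))

# Of those, the keys whose coded values disagree
conflicts <- shared[builtin[shared] != additions[shared]]
conflicts
```

In this example "femme" appears in both vectors with the same value and is harmless, while "enby" is flagged because the two vectors code it differently. Listing any such conflicts in your pull request makes the change easier to review.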

Once you’ve pushed the changes to your fork, you can make a pull request. Please tell us what you’re adding so we know what to look out for and how to test it.