Package: excluder 0.5.1

Jeffrey R. Stevens

excluder: Checks for Exclusion Criteria in Online Data

Data that are collected through online sources such as Mechanical Turk may require excluding rows because of IP address duplication, geolocation, or completion duration. This package facilitates exclusion of these data for Qualtrics datasets.

Authors: Jeffrey R. Stevens [aut, cre, cph], Joseph O'Brien [rev], Julia Silge [rev]

# Install 'excluder' in R:
install.packages('excluder', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Reviews: rOpenSci Software Review #455

Bug tracker: https://github.com/ropensci/excluder/issues

Pkgdown site: https://docs.ropensci.org

Datasets:
  • qualtrics_fetch - Example numeric metadata imported with 'qualtRics::fetch_survey()' from simulated Qualtrics study
  • qualtrics_fetch2 - Example numeric metadata imported with 'qualtRics::fetch_survey()' from simulated Qualtrics study but with labels included as column names
  • qualtrics_numeric - Example numeric metadata from simulated Qualtrics study
  • qualtrics_raw - Example text-based metadata from simulated Qualtrics study
  • qualtrics_text - Example text-based metadata from simulated Qualtrics study

On CRAN: excluder-0.5.1 (2024-01-13)

datacleaning, exclusion, mturk, qualtrics

5.51 score 9 stars 18 scripts 357 downloads 28 exports 31 dependencies

Last updated 15 days ago from: 680fb64793 (on main). Checks: 9 OK. Indexed: yes.

Target           Result  Latest binary
Doc / Vignettes  OK      Mar 05 2025
R-4.5-win        OK      Mar 05 2025
R-4.5-mac        OK      Mar 05 2025
R-4.5-linux      OK      Mar 05 2025
R-4.4-win        OK      Mar 05 2025
R-4.4-mac        OK      Mar 05 2025
R-4.4-linux      OK      Mar 05 2025
R-4.3-win        OK      Mar 05 2025
R-4.3-mac        OK      Mar 05 2025

Exports: %>%, check_duplicates, check_duration, check_ip, check_location, check_preview, check_progress, check_resolution, collapse_exclusions, deidentify, exclude_duplicates, exclude_duration, exclude_ip, exclude_location, exclude_preview, exclude_progress, exclude_resolution, mark_duplicates, mark_duration, mark_ip, mark_location, mark_preview, mark_progress, mark_resolution, remove_label_rows, rename_columns, unite_exclusions, use_labels

Dependencies: AsioHeaders, cli, cpp11, curl, dplyr, fansi, generics, glue, hms, ipaddress, janitor, lifecycle, lubridate, magrittr, maps, pillar, pkgconfig, purrr, R6, Rcpp, rlang, snakecase, stringi, stringr, tibble, tidyr, tidyselect, timechange, utf8, vctrs, withr

excluder

Rendered from excluder.Rmd using knitr::rmarkdown on Mar 05 2025.

Last update: 2023-02-13
Started: 2022-06-26

Citation

To cite excluder in publications, please use:

Stevens, J. R. (2021). excluder: An R package that checks for exclusion criteria in online data. Journal of Open Source Software, 6(67), 3893. https://doi.org/10.21105/joss.03893

Corresponding BibTeX entry:

  @Article{,
    title = {excluder: An R package that checks for exclusion criteria
      in online data},
    author = {Jeffrey R. Stevens},
    year = {2021},
    journal = {Journal of Open Source Software},
    volume = {6},
    number = {67},
    pages = {3893},
    url = {https://doi.org/10.21105/joss.03893},
    doi = {10.21105/joss.03893},
  }

Readme and manuals

excluder

The goal of {excluder} is to facilitate checking for, marking, and excluding rows of data frames for common exclusion criteria. This package applies to data collected from Qualtrics surveys, and default column names come from importing data with the {qualtRics} package.

This may be most useful for Mechanical Turk data to screen for duplicate entries from the same location/IP address or entries from locations outside of the United States. But it can be used more generally to exclude based on response durations, preview status, progress, or screen resolution.

More details are available on the package website and the getting started vignette.

Installation

You can install the stable released version of {excluder} from CRAN with:

install.packages("excluder")

You can install the development version from the rOpenSci R-universe (or from GitHub, using the commented-out lines) with:

# install.packages("remotes")
# remotes::install_github("ropensci/excluder")
install.packages("excluder", repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Verbs

This package provides three primary verbs:

  • mark functions add a new column to the original data frame that labels the rows meeting the exclusion criteria. This is useful to label the potential exclusions for future processing without changing the original data frame.
  • check functions search for the exclusion criteria and output a message with the number of rows meeting the criteria and a data frame of the rows meeting the criteria. This is useful for viewing the potential exclusions.
  • exclude functions remove rows meeting the exclusion criteria. This is safest to do after checking the rows to ensure the exclusions are correct.
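
The difference between the three verbs can be sketched in base R on a toy data frame. This is a minimal illustration, not the package's implementation: the data, the column names `id` and `duration`, and the 200-second threshold are assumptions made for the example.

```r
# Toy data frame standing in for survey metadata (hypothetical values).
responses <- data.frame(
  id = c("a", "b", "c", "d"),
  duration = c(465, 140, 213, 90)
)

# Assumed criterion for illustration: durations under 200 seconds are too quick.
too_quick <- responses$duration < 200

# "mark": keep every row and add a column labelling the rows that meet the criterion.
marked <- responses
marked$exclusion_duration <- ifelse(too_quick, "duration_quick", "")

# "check": return only the rows that meet the criterion.
checked <- responses[too_quick, ]

# "exclude": return only the rows that do not meet the criterion.
excluded <- responses[!too_quick, ]

nrow(marked)    # all rows are kept
nrow(checked)   # rows meeting the criterion
nrow(excluded)  # rows remaining after exclusion
```

Marking keeps the row count unchanged, while checking and excluding split the rows into two complementary subsets.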

Exclusion types

This package provides seven types of exclusions based on Qualtrics metadata. If you have ideas for other metadata exclusions, please submit them as issues. Note that the intent of this package is to provide functions for excluding rows based on general, frequently used metadata rather than on survey-specific data.

  • duplicates works with rows that have duplicate IP addresses and/or locations (latitude/longitude).
  • duration works with rows whose survey completion time is too short and/or too long.
  • ip works with rows whose IP addresses are not found in the specified country (note: this exclusion type requires an internet connection to download the country’s IP ranges).
  • location works with rows whose latitude and longitude are not found in the United States.
  • preview works with rows that are survey previews.
  • progress works with rows in which the survey was not complete.
  • resolution works with rows whose screen resolution is not acceptable.
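
To make two of these criteria concrete, here is a base R sketch of how duplicate IP addresses and narrow screen resolutions might be detected. It is an illustration under stated assumptions rather than the package's code: the vectors are made up, and the 1000-pixel minimum width is a hypothetical threshold.

```r
# Hypothetical IP addresses; NA mimics rows with no recorded address.
ip <- c("73.23.43.0", "16.140.105.0", "73.23.43.0", NA)

# Flag every non-missing address that occurs more than once,
# including the first occurrence.
dup_ip <- !is.na(ip) & (duplicated(ip) | duplicated(ip, fromLast = TRUE))

# Hypothetical resolution strings in "widthxheight" form.
resolution <- c("1366x768", "1680x1050", "800x600")

# Split each string on "x" and convert the width to a number.
parts <- strsplit(resolution, "x", fixed = TRUE)
width <- as.numeric(vapply(parts, `[`, character(1), 1))

# Assumed threshold for illustration only.
min_width <- 1000
too_narrow <- width < min_width

which(dup_ip)           # positions of rows sharing an IP address
resolution[too_narrow]  # resolutions below the assumed minimum width
```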

Usage

The verbs and exclusion types combine with _ to create the functions, such as check_duplicates(), exclude_ip(), and mark_duration(). Multiple functions can be linked together using the {magrittr} pipe %>%. For datasets downloaded directly from Qualtrics, use remove_label_rows() to remove the first two rows of labels and convert date and numeric columns in the metadata, and use deidentify() to remove standard Qualtrics columns with identifiable information (e.g., IP addresses, geolocation).
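
As a small illustration of the naming scheme, the following base R sketch builds the function names by pasting each verb onto each exclusion type. It only constructs the names as strings and does not call the package.

```r
# The three verbs and seven exclusion types listed above.
verbs <- c("mark", "check", "exclude")
types <- c("duplicates", "duration", "ip", "location",
           "preview", "progress", "resolution")

# Combine every verb with every type, separated by an underscore.
function_names <- as.vector(outer(verbs, types, paste, sep = "_"))

length(function_names)
"check_duplicates" %in% function_names
"exclude_ip" %in% function_names
```

Three verbs and seven types give 21 combinations, each of which appears in the package's list of exports.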

Marking

The mark_*() functions output the original data set with a new column specifying rows that meet the exclusion criteria. These can be piped together with %>% for multiple exclusion types.

library(excluder)
# Mark preview and short duration rows
df <- qualtrics_text %>%
  mark_preview() %>%
  mark_duration(min_duration = 200)
#> ℹ 2 rows were collected as previews. It is highly recommended to exclude these rows before further processing.
#> ℹ 23 out of 100 rows took less time than 200.
tibble::glimpse(df)
#> Rows: 100
#> Columns: 18
#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
#> $ exclusion_preview       <chr> "preview", "preview", "", "", "", "", "", "", …
#> $ exclusion_duration      <chr> "", "", "", "", "duration_quick", "", "duratio…

Use the unite_exclusions() function to unite all of the marked columns into a single column.

# Collapse labels for preview and short duration rows
df <- qualtrics_text %>%
  mark_preview() %>%
  mark_duration(min_duration = 200) %>%
  unite_exclusions()
#> ℹ 2 rows were collected as previews. It is highly recommended to exclude these rows before further processing.
#> ℹ 23 out of 100 rows took less time than 200.
tibble::glimpse(df)
#> Rows: 100
#> Columns: 17
#> $ StartDate               <dttm> 2020-12-11 12:06:52, 2020-12-11 12:06:43, 202…
#> $ EndDate                 <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
#> $ Status                  <chr> "Survey Preview", "Survey Preview", "IP Addres…
#> $ IPAddress               <chr> NA, NA, "73.23.43.0", "16.140.105.0", "107.57.…
#> $ Progress                <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
#> $ `Duration (in seconds)` <dbl> 465, 545, 651, 409, 140, 213, 177, 662, 296, 2…
#> $ Finished                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
#> $ RecordedDate            <dttm> 2020-12-11 12:10:30, 2020-12-11 12:11:27, 202…
#> $ ResponseId              <chr> "R_xLWiuPaNuURSFXY", "R_Q5lqYw6emJQZx2o", "R_f…
#> $ LocationLatitude        <dbl> 29.73694, 39.74107, 34.03852, 44.96581, 27.980…
#> $ LocationLongitude       <dbl> -94.97599, -121.82490, -118.25739, -93.07187, …
#> $ UserLanguage            <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"…
#> $ Browser                 <chr> "Chrome", "Chrome", "Chrome", "Chrome", "Chrom…
#> $ Version                 <chr> "88.0.4324.41", "88.0.4324.50", "87.0.4280.88"…
#> $ `Operating System`      <chr> "Windows NT 10.0", "Macintosh", "Windows NT 10…
#> $ Resolution              <chr> "1366x768", "1680x1050", "1366x768", "1536x864…
#> $ exclusions              <chr> "preview", "preview", "", "", "duration_quick"…
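
Conceptually, uniting the marked columns means joining the non-empty labels in each row into one string. The base R sketch below shows the idea on a toy data frame. It is not the package's implementation, and the comma separator and the toy values are assumptions.

```r
# Toy marked data with two exclusion columns (hypothetical values).
marked <- data.frame(
  id = c("a", "b", "c", "d"),
  exclusion_preview = c("preview", "", "", "preview"),
  exclusion_duration = c("", "", "duration_quick", "duration_quick"),
  stringsAsFactors = FALSE
)

# Find the columns that hold exclusion labels.
exclusion_cols <- grep("^exclusion_", names(marked), value = TRUE)

# For each row, join the non-empty labels with a comma (assumed separator).
labels <- apply(marked[exclusion_cols], 1, function(row) {
  paste(row[row != ""], collapse = ",")
})

# Replace the separate exclusion columns with the single united column.
united <- marked[setdiff(names(marked), exclusion_cols)]
united$exclusions <- unname(labels)

united$exclusions
```

Rows that meet no criterion keep an empty string, and rows that meet several criteria list every label.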

Checking

The check_*() functions output messages about the number of rows that meet the exclusion criteria. Because checks return only the rows meeting the criteria, they should not be connected via pipes unless you want to subset the second check criterion within the rows that meet the first criterion. Thus, in general, check_*() functions should be used individually. If you want to view the potential exclusions for multiple criteria, use the mark_*() functions.

# Check for preview rows
qualtrics_text %>%
  check_preview()
#> ℹ 2 rows were collected as previews. It is highly recommended to exclude these rows before further processing.
#>             StartDate             EndDate         Status IPAddress Progress
#> 1 2020-12-11 12:06:52 2020-12-11 12:10:30 Survey Preview      <NA>      100
#> 2 2020-12-11 12:06:43 2020-12-11 12:11:27 Survey Preview      <NA>      100
#>   Duration (in seconds) Finished        RecordedDate        ResponseId
#> 1                   465     TRUE 2020-12-11 12:10:30 R_xLWiuPaNuURSFXY
#> 2                   545     TRUE 2020-12-11 12:11:27 R_Q5lqYw6emJQZx2o
#>   LocationLatitude LocationLongitude UserLanguage Browser      Version
#> 1         29.73694         -94.97599           EN  Chrome 88.0.4324.41
#> 2         39.74107        -121.82490           EN  Chrome 88.0.4324.50
#>   Operating System Resolution
#> 1  Windows NT 10.0   1366x768
#> 2        Macintosh  1680x1050
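
The reason piped checks are rarely what you want can be shown with a base R sketch. The two helper functions here are hypothetical stand-ins that mimic the "return only matching rows" behaviour described above. They are not package functions, and the data are made up.

```r
# Toy data (hypothetical values).
responses <- data.frame(
  id = c("a", "b", "c", "d"),
  preview = c(TRUE, TRUE, FALSE, FALSE),
  duration = c(465, 140, 213, 90)
)

# Hypothetical stand-ins for two checks: each returns only the rows
# that meet its criterion.
check_preview_rows <- function(x) x[x$preview, ]
check_quick_rows <- function(x, min_duration = 200) {
  x[x$duration < min_duration, ]
}

# Run individually, each check sees all four rows.
preview_rows <- check_preview_rows(responses)
quick_rows <- check_quick_rows(responses)

# Chained, the second check only sees the output of the first,
# so the result contains only rows that meet both criteria.
both <- check_quick_rows(check_preview_rows(responses))

preview_rows$id
quick_rows$id
both$id
```

Chaining narrows the result to the intersection of the criteria, whereas marking keeps all rows and records each criterion in its own column.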

Excluding

The exclude_*() functions remove the rows that meet exclusion criteria. These, too, can be piped together. Since the output of each function is a subset of the original data with the excluded rows removed, the order of the functions will influence the reported number of rows meeting the exclusion criteria.

# Exclude short duration then incomplete progress rows
df <- qualtrics_text %>%
  exclude_duration(min_duration = 100) %>%
  exclude_progress()
#> ℹ 4 out of 100 rows of short and/or long duration were excluded, leaving 96 rows.
#> ℹ 4 out of 96 rows with incomplete progress were excluded, leaving 92 rows.
dim(df)
#> [1] 92 16
# Exclude incomplete progress then short duration rows
df <- qualtrics_text %>%
  exclude_progress() %>%
  exclude_duration(min_duration = 100)
#> ℹ 6 out of 100 rows with incomplete progress were excluded, leaving 94 rows.
#> ℹ 2 out of 94 rows of short and/or long duration were excluded, leaving 92 rows.
dim(df)
#> [1] 92 16
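
The reported counts differ because a row that meets both criteria is counted by whichever function runs first. The base R sketch below reproduces this on a toy data frame. The helper functions, column names, and values are hypothetical and are not part of the package.

```r
# Toy data: row "b" is both too quick and incomplete (hypothetical values).
responses <- data.frame(
  id = c("a", "b", "c", "d", "e"),
  progress = c(100, 50, 100, 40, 100),
  duration = c(300, 80, 90, 250, 400)
)

# Hypothetical stand-ins for two exclusions.
drop_quick <- function(x, min_duration = 100) x[x$duration >= min_duration, ]
drop_incomplete <- function(x) x[x$progress >= 100, ]

# Duration first, then progress.
after_quick <- drop_quick(responses)
removed_quick_first <- nrow(responses) - nrow(after_quick)
final_a <- drop_incomplete(after_quick)
removed_incomplete_second <- nrow(after_quick) - nrow(final_a)

# Progress first, then duration.
after_incomplete <- drop_incomplete(responses)
removed_incomplete_first <- nrow(responses) - nrow(after_incomplete)
final_b <- drop_quick(after_incomplete)
removed_quick_second <- nrow(after_incomplete) - nrow(final_b)

c(removed_quick_first, removed_incomplete_second)
c(removed_incomplete_first, removed_quick_second)
identical(final_a$id, final_b$id)
```

In both orders the first step removes two rows and the second removes one, but which criterion gets credit for row "b" changes.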

Although the order of the functions should not influence the final data set, processing large files may be faster if you remove preview and incomplete-progress rows first and check IP addresses and locations only after the other exclusions have been performed.

# Exclude rows
df <- qualtrics_text %>%
  exclude_preview() %>%
  exclude_progress() %>%
  exclude_duplicates() %>%
  exclude_duration(min_duration = 100) %>%
  exclude_resolution() %>%
  exclude_ip() %>%
  exclude_location()
#> ℹ 2 out of 100 preview rows were excluded, leaving 98 rows.
#> ℹ 6 out of 98 rows with incomplete progress were excluded, leaving 92 rows.
#> ℹ 9 out of 92 duplicate rows were excluded, leaving 83 rows.
#> ℹ 2 out of 83 rows of short and/or long duration were excluded, leaving 81 rows.
#> ℹ 3 out of 81 rows with unacceptable screen resolution were excluded, leaving 78 rows.
#> ℹ 10 out of 78 rows with IP addresses outside of US were excluded, leaving 68 rows.
#> ℹ 4 out of 68 rows outside of the US were excluded, leaving 64 rows.

Citing this package

To cite {excluder}, use:

Stevens, J. R. (2021). excluder: An R package that checks for exclusion criteria in online data. Journal of Open Source Software, 6(67), 3893. https://doi.org/10.21105/joss.03893

Contributing to this package

Contributions to {excluder} are most welcome! Feel free to check out open issues for ideas. Pull requests are encouraged, but you may want to raise an issue or contact the maintainer first.

Please note that the excluder project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Acknowledgments

I thank Francine Goh and Billy Lim for comments on an early version of the package, and rOpenSci editor Mauro Lepore and reviewers Joseph O’Brien and Julia Silge for their insightful feedback. This work was funded by US National Science Foundation grant NSF-1658837.

Help Manual

Help page - Topics
  • Check for duplicate IP addresses and/or locations - check_duplicates
  • Check for minimum or maximum durations - check_duration
  • Check for IP addresses from outside of a specified country. - check_ip
  • Check for locations outside of the US - check_location
  • Check for survey previews - check_preview
  • Check for survey progress - check_progress
  • Check screen resolution - check_resolution
  • Remove columns that could include identifiable information - deidentify
  • Exclude rows with duplicate IP addresses and/or locations - exclude_duplicates
  • Exclude rows with minimum or maximum durations - exclude_duration
  • Exclude IP addresses from outside of a specified country. - exclude_ip
  • Exclude locations outside of US - exclude_location
  • Exclude survey previews - exclude_preview
  • Exclude survey progress - exclude_progress
  • Exclude unacceptable screen resolution - exclude_resolution
  • Mark duplicate IP addresses and/or locations - mark_duplicates
  • Mark minimum or maximum durations - mark_duration
  • Mark IP addresses from outside of a specified country. - mark_ip
  • Mark locations outside of US - mark_location
  • Mark survey previews - mark_preview
  • Mark survey progress - mark_progress
  • Mark unacceptable screen resolution - mark_resolution
  • Example numeric metadata imported with 'qualtRics::fetch_survey()' from simulated Qualtrics study - qualtrics_fetch
  • Example numeric metadata imported with 'qualtRics::fetch_survey()' from simulated Qualtrics study but with labels included as column names - qualtrics_fetch2
  • Example numeric metadata from simulated Qualtrics study - qualtrics_numeric
  • Example text-based metadata from simulated Qualtrics study - qualtrics_raw
  • Example text-based metadata from simulated Qualtrics study - qualtrics_text
  • Remove two initial rows created in Qualtrics data - remove_label_rows
  • Rename columns to match standard Qualtrics names - rename_columns
  • Unite multiple exclusion columns into single column - unite_exclusions
  • Use Qualtrics labels as column names - use_labels