---
title: "An Introduction to the dataset Package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{An Introduction to the dataset Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
if (!requireNamespace("rdflib", quietly = TRUE)) {
  stop("Please install 'rdflib' to run this vignette.")
}
```

## Overview

The `dataset` package extends tidy data with semantic metadata, provenance, and machine-readable definitions. It supports a gradual workflow from provisional semantic harmonisation with `prelabel()` to formally defined variables with `defined()` and fully described datasets with `dataset_df()`. This makes datasets easier to exchange, reuse, publish, and serialize to RDF and other FAIR-compliant formats.

This vignette provides a high-level introduction. For details on key components, see:

- `vignette("prelabelled", package = "dataset")`: Handling Semantic Ambiguity with `prelabelled` Vectors
- `vignette("defined", package = "dataset")`: Semantic vectors with `defined()`
- `vignette("dataset_df", package = "dataset")`: Structuring and metadata with `dataset_df()`
- `vignette("rdf", package = "dataset")`: Exporting to RDF and Linked Data
- `vignette("bibrecord", package = "dataset")`: Creating rich citation metadata using `bibrecord()`

## Why extend tidy data?

Hadley Wickham (2014) defines [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) with three principles:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

This structure is ideal for analysis because it links the structure of a dataset with its meaning. A variable represents an underlying attribute, and an observation represents measurements collected on the same unit.

In practice, however, analysts rarely begin with perfectly harmonized data.
During data cleaning, transformation, and integration, they make many semantic decisions: resolving inconsistent coding schemes, standardizing categories, selecting units of measurement, or deciding how concepts from different sources correspond to one another. By the time a dataset is ready for analysis, these assumptions are usually clear to the analyst who created it.

The problem arises when the dataset leaves its original context. Other analysts may use different terminology, apply different coding conventions, or simply lack knowledge of the decisions that were made during data preparation. Even the original analyst may find these assumptions difficult to reconstruct months or years later.

The `dataset` package extends tidy data by making such semantic assumptions explicit and preserving them alongside the data. Rather than treating semantic harmonisation and data provenance as undocumented steps in a workflow, it allows them to be recorded incrementally as the dataset evolves.

The goal is not to burden analysts with complex semantic technologies. Instead, the package provides lightweight tools for gradually recording the information needed to review, reuse, audit, publish, and correctly combine datasets across projects, organisations, and time.

## Example: gradual semantic stabilisation

Many data integration problems begin with values that refer to the same concept but use different coding conventions.

```{r predefine}
library(dataset)
country <- prelabel(
  c("AD", "Andorra", "AND", "LI", "Liechtenstein"),
  labels = c(
    Andorra = "AD",
    AND = "AD",
    Liechtenstein = "LI"
  )
)
country
```

The `prelabelled` class records provisional semantic assumptions without requiring a formal semantic definition. In this example, "AD", "Andorra", and "AND" are treated as equivalent representations of the same geopolitical entity.

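
To make the idea of a provisional mapping concrete, the following base R sketch applies the same three assumptions with a plain named character vector. It does not use `prelabel()` and does not reproduce how the `prelabelled` class stores or applies its labels. The object names `raw`, `lookup`, and `harmonised` are illustrative only, and reading the names as variants and the values as preferred codes is an assumption.

```{r lookup-sketch}
# Raw values as they might arrive from different sources
raw <- c("AD", "Andorra", "AND", "LI", "Liechtenstein")

# Provisional assumptions (illustrative): names are variants,
# values are the preferred code
lookup <- c(Andorra = "AD", AND = "AD", Liechtenstein = "LI")

# Replace known variants; values without a mapping stay unchanged
matched <- raw %in% names(lookup)
harmonised <- raw
harmonised[matched] <- unname(lookup[raw[matched]])

harmonised
```

In a script like this the assumptions live apart from the data. The `prelabelled` vector instead keeps them attached to the values they describe.
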
The current mappings can be inspected directly:

```{r predefine-attribute}
attr(country, "prelabel")
```

This approach is useful during data cleaning and integration, where semantic assumptions may still evolve. Once these assumptions become sufficiently stable, they can be formalized with `defined()`.

For further information, see `vignette("prelabelled", package = "dataset")`: Handling Semantic Ambiguity with `prelabelled` Vectors.

## Example: defining semantically rich vectors

After values have been harmonized, variables can be formally defined with machine-readable semantic metadata.

Semantically rich vectors are vectors in a data frame that carry richer semantics than a simple column name: a long-form, human-readable title; a machine- and human-readable variable definition; and, if needed, a reference to an external resource that contains the codebook.

```{r definegdpdataset}
library(dataset)
gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)
geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)
gdp
geo
```

In this case, we define `geo` as the geopolitical entity identified by its `concept` URI, and we know that the `AD` value can resolve to Andorra through the `namespace` pattern.

These vectors now carry metadata that you can inspect directly, including their label, unit, and concept URI. This metadata is preserved even after transformation or storage.

For further information, see `vignette("defined", package = "dataset")`: Semantic vectors with `defined()`.

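
The `namespace` argument above contains a `$1` placeholder. As a rough illustration, and assuming the placeholder is simply replaced by the coded value (an assumption about the convention, not a description of the package internals), a value-level URI could be built in base R as follows. The helper `resolve_code()` is hypothetical and not part of the `dataset` package.

```{r namespace-sketch}
# Hypothetical helper: substitute each code for the "$1" placeholder
resolve_code <- function(code, namespace) {
  vapply(
    code,
    # fixed = TRUE treats "$1" as literal text, not as a regular expression
    function(x) sub("$1", x, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

resolve_code(c("AD", "LI"), "https://www.geonames.org/countries/$1/")
```
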
## Example: creating a dataset from a metadata-enriched data frame

```{r smalldatasetexample}
small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)
small_dataset
```

For further information, see `vignette("dataset_df", package = "dataset")`: Structuring and metadata with `dataset_df()`.

This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.

```{r dublincoremetadata}
as_dublincore(small_dataset)
```

For further information, see `vignette("bibrecord", package = "dataset")`: Creating rich citation metadata using `bibrecord()`.

## Exporting to RDF

As Carl Boettiger has shown in the vignettes accompanying the R package [rdflib](https://CRAN.R-project.org/package=rdflib) (see [A tidyverse lover's intro to RDF](https://docs.ropensci.org/rdflib/articles/rdf_intro.html)), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format. Our package tries to lower the burden of such retrofitting for those who are not familiar with RDF: it uses early binding and sensible defaults to serialise the dataset's contents and its bibliographic data to this format.

You can convert any `dataset_df` object into a tidy three-column representation (subject–predicate–object) using `dataset_to_triples()`:

```{r triplesexample}
triples <- dataset_to_triples(small_dataset, format = "nt")
triples
```

This subject–predicate–object format is compatible with semantic web tools such as SPARQL, `rdflib`, and triple stores.

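
The three-column idea can be illustrated without any RDF tooling. The sketch below pivots a small data frame into subject, predicate, and object columns using base R only. It is not how `dataset_to_triples()` is implemented. The row identifiers (`obs1`, `obs2`, `obs3`) and the helper name `to_long_triples()` are assumptions made for illustration.

```{r triples-sketch}
# A small tidy table; values mirror the example above, row ids are made up
tidy <- data.frame(
  rowid = c("obs1", "obs2", "obs3"),
  geo   = c("AD", "AD", "AD"),
  gdp   = c(2355, 2592, 2884),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (observation, variable) pair
to_long_triples <- function(df, id_col) {
  vars <- setdiff(names(df), id_col)
  pieces <- lapply(vars, function(v) {
    data.frame(
      subject   = df[[id_col]],
      predicate = v,
      object    = as.character(df[[v]]), # objects are stored as text here
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

long <- to_long_triples(tidy, id_col = "rowid")
long
```

Each cell of the original table becomes one row, so three observations of two variables yield six statements. In real RDF the subjects and predicates would be URIs rather than short names.
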
```{r ntexample}
mycon <- tempfile("my_dataset", fileext = ".nt")
my_description <- describe(x = small_dataset, con = mycon)
# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
```

```{r provenancexample}
## Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
```

For further information, see `vignette("rdf", package = "dataset")`: Exporting to RDF and Linked Data.

## Coercing back

There may be use cases when your richer dataset needs to be simplified to a base R `data.frame` or a `tbl_df`. We offer two coercion forms:

```{r smalldf}
small_df <- as.data.frame(small_dataset, strip_attributes = FALSE)
attr(small_dataset, "subject")
```

With `strip_attributes = FALSE`, the rich attributes remain in the base R `data.frame`. In most pipelines the attributes play no role, so you can retain them and perhaps later load the data back into a richer form.

You can also strip all these attributes and choose a `tbl_df` (if you have `tibble` installed):

```{r smalltbl}
small_tbl <- as_tibble(small_dataset, strip_attributes = TRUE)
small_tbl
```

## Summary

The *dataset* package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and meets interoperability standards like SDMX, DataCite, and Dublin Core.

Use it when you need:

- Meaningful variable descriptions and URIs
- Dataset-level metadata embedded directly in .rds or .rda files
- Easy export to RDF and semantic web formats
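
The point about `.rds` files rests on a base R behaviour: `saveRDS()` serialises an object together with its attributes. The sketch below shows this with a plain data frame and a made-up `subject` attribute. It does not use `dataset_df()`, and the attribute name and value are illustrative assumptions.

```{r rds-sketch}
# A plain data frame with an illustrative dataset-level attribute
plain_df <- data.frame(geo = rep("AD", 3), gdp = c(2355, 2592, 2884))
attr(plain_df, "subject") <- "Gross Domestic Product"

# Round-trip through a temporary .rds file
path <- tempfile(fileext = ".rds")
saveRDS(plain_df, path)
restored <- readRDS(path)

# The attribute is still attached after reading the file back
attr(restored, "subject")
```
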