A completely fictional dataset of monthly butterfly counts
butterflycount
A list with 5 dataframes (january, february, march, april, may) containing 3 columns, and 3 + n_month rows:
The date on which the imaginary count took place, in yyyy-mm-dd format
Number of fictional butterflies counted
Butterfly species name, only appears in april
...
A version of butterflycount made messy using the messy package. This dataset is only used for testing purposes.
butterflymess
A list with 5 dataframes (january, february, march, april, may) containing 3 columns, and 3 + n_month rows:
The date on which the imaginary, and messy, count took place, in yyyy-mm-dd format
Number of fictional butterflies counted
Butterfly species name, only appears in april
...
This function matches two dataframe objects by their unique identifier (usually "time" or "datetime" in a timeseries), and returns a new dataframe which contains only the rows that have changed compared to previous data. It will not return any new rows.
catch(df_current, df_previous, datetime_variable, ...)
df_current | data.frame, the newest/current version of dataset x.
df_previous | data.frame, the old version of dataset, for example x - t1.
datetime_variable | string, which variable to use as unique ID to join df_current and df_previous.
... | Other arguments passed on to waldo::compare(), such as tolerance.
The underlying functionality is handled by create_object_list().
A dataframe which contains only rows of df_current that have changes from df_previous, but without new rows. Also returns a waldo object as in loupe().
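As a rough illustration of the idea, and not the package's actual implementation, the matching and filtering step can be sketched in base R. The toy data frames, the `count` column name and the helper `changed_rows()` are assumptions made for this sketch only; NA values are not handled.

```r
# Hypothetical toy data: the `count` column name is an assumption
df_previous <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 12, 15)
)
df_current <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01")),
  count = c(10, 99, 15, 20)
)

# Hypothetical helper: keep rows of `current` that match `previous` on `id`
# but differ in at least one shared column. New rows are never returned.
changed_rows <- function(current, previous, id) {
  shared <- setdiff(intersect(names(current), names(previous)), id)
  idx <- match(current[[id]], previous[[id]]) # NA marks a new row
  matched <- !is.na(idx)
  changed <- rep(FALSE, nrow(current))
  for (col in shared) {
    differs <- current[[col]][matched] != previous[[col]][idx[matched]]
    changed[matched] <- changed[matched] | differs
  }
  current[changed, , drop = FALSE]
}

# Only the matched row whose count was altered is kept;
# the new 2024-04-01 row is not part of the result
changed_rows(df_current, df_previous, "time")
```

In this sketch the unique ID plays the role of datetime_variable: rows are paired on it before any values are compared.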
# Returning only matched rows which contain changes
df_caught <- butterfly::catch(
  butterflycount$march, # New or current dataset
  butterflycount$february, # Previous version you are comparing it to
  datetime_variable = "time" # Unique ID variable they have in common
)

df_caught
This function creates a list of objects which is used by all of loupe(), catch() and release().
create_object_list(df_current, df_previous, datetime_variable, ...)
df_current | data.frame, the newest/current version of dataset x.
df_previous | data.frame, the old version of dataset, for example x - t1.
datetime_variable | string, which variable to use as unique ID to join df_current and df_previous.
... | Other arguments passed on to waldo::compare(), such as tolerance.
This function matches two dataframe objects by their unique identifier (usually "time" or "datetime" in a timeseries). It informs the user of new (unmatched) rows which have appeared, and then returns a waldo::compare() call to give a detailed breakdown of changes.
The main assumption is that df_current and df_previous are newer and older versions of the same data, and that the datetime_variable variable name always remains the same. Elsewhere new columns can appear, and these will be returned in the report.
A list containing: a boolean, where TRUE indicates no changes to previous data and FALSE indicates unexpected changes; a dataframe of the current data without new rows; and a dataframe of new rows only.
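To make the shape of such a return value concrete, here is a minimal base R sketch that builds a comparable three-part list by hand. The element names (`no_changes`, `current_without_new`, `new_rows`), the `count` column and the toy data are all hypothetical; the names used by create_object_list() itself may differ.

```r
# Hypothetical toy data: four daily records, the last one is new
df_previous <- data.frame(
  time = as.Date("2024-01-01") + 0:2,
  count = c(10, 12, 15)
)
df_current <- data.frame(
  time = as.Date("2024-01-01") + 0:3,
  count = c(10, 12, 15, 20)
)

# Rows of the current data whose ID does not occur in the previous data
is_new <- !(df_current$time %in% df_previous$time)
current_without_new <- df_current[!is_new, , drop = FALSE]

# Element names below are assumptions made for this sketch
object_list <- list(
  no_changes = isTRUE(
    all.equal(current_without_new, df_previous, check.attributes = FALSE)
  ),
  current_without_new = current_without_new,
  new_rows = df_current[is_new, , drop = FALSE]
)

str(object_list)
```

Splitting off the new rows first means the comparison only ever looks at records that exist in both versions.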
butterfly_object_list <- butterfly::create_object_list(
  butterflycount$february, # New or current dataset
  butterflycount$january, # Previous version you are comparing to
  datetime_variable = "time" # Unique ID variable they have in common
)

butterfly_object_list

# You can pass other `waldo::compare()` options such as tolerance here
butterfly_object_list <- butterfly::create_object_list(
  butterflycount$march, # New or current dataset
  butterflycount$february, # Previous version you are comparing it to
  datetime_variable = "time", # Unique ID variable they have in common
  tolerance = 2
)

butterfly_object_list
A completely fictional dataset of daily precipitation
forestprecipitation
A list with 2 dataframes (january, february) containing 2 columns, and 6 rows. February intentionally resets to 1970-01-01.
The date on which the imaginary rainfall was measured, in yyyy-mm-dd format
Rainfall in mm
...
A loupe is a simple, small magnification device used to examine small details more closely.
loupe(df_current, df_previous, datetime_variable, ...)
df_current | data.frame, the newest/current version of dataset x.
df_previous | data.frame, the old version of dataset, for example x - t1.
datetime_variable | string, which variable to use as unique ID to join df_current and df_previous.
... | Other arguments passed on to waldo::compare(), such as tolerance.
This function is intended to aid in the verification of continually updating timeseries data, where we expect new values but want to ensure previous values remain unchanged.
This function matches two dataframe objects by their unique identifier (usually "time" or "datetime" in a timeseries). It informs the user of new (unmatched) rows which have appeared, and then returns a waldo::compare() call to give a detailed breakdown of changes. If you are not familiar with waldo::compare(), this is an expanded and more verbose function similar to base R's all.equal().
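For readers who know neither function, the base R side of that comparison can be sketched as follows. This uses all.equal() only; the toy data and the `count` column are assumptions, and the output of waldo::compare() is more detailed than what is shown here.

```r
# Hypothetical toy data: three records and two later versions of them
previous <- data.frame(time = 1:3, count = c(10, 12, 15))
unchanged <- previous
changed <- previous
changed$count[2] <- 99

# all.equal() returns TRUE when both objects agree ...
isTRUE(all.equal(unchanged, previous))

# ... and otherwise a character vector describing the differences,
# so wrap it in isTRUE() whenever a plain boolean is needed
all.equal(changed, previous)
isTRUE(all.equal(changed, previous))
```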
loupe() will then return TRUE if there are no changes to previous data, or FALSE if there are unexpected changes. If you want to extract changes as a dataframe, use catch(), or if you want to drop them, use release().
The main assumption is that df_current and df_previous are newer and older versions of the same data, and that the datetime_variable variable name always remains the same. Elsewhere new columns can appear, and these will be returned in the report.
The underlying functionality is handled by create_object_list().
A boolean where TRUE indicates no changes to previous data and FALSE indicates unexpected changes.
# Checking two dataframes for changes
# Returning TRUE (no changes) or FALSE (changes)
# This example contains no differences with previous data
butterfly::loupe(
  butterflycount$february, # New or current dataset
  butterflycount$january, # Previous version you are comparing it to
  datetime_variable = "time" # Unique ID variable they have in common
)

# This example does contain differences with previous data
butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
This function matches two dataframe objects by their unique identifier (usually "time" or "datetime" in a timeseries), and returns a new dataframe which contains the new rows (if present), but drops matched rows which contain changes from previous data.
release(df_current, df_previous, datetime_variable, include_new = TRUE, ...)
df_current | data.frame, the newest/current version of dataset x.
df_previous | data.frame, the old version of dataset, for example x - t1.
datetime_variable | string, which variable to use as unique ID to join df_current and df_previous.
include_new | boolean, should new rows be included? Default is TRUE.
... | Other arguments passed on to waldo::compare(), such as tolerance.
A dataframe which contains only rows of df_current that have not changed from df_previous, and includes new rows. Also returns a waldo object as in loupe().
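The effect of include_new can be illustrated with a small base R sketch. This is an approximation of the behaviour described above, not the package's own code; the helper `drop_changed()`, the toy data and the `count` column are hypothetical, and NA values are not handled.

```r
# Hypothetical toy data: one altered record (99) and one new record (20)
df_previous <- data.frame(
  time = as.Date("2024-01-01") + 0:2,
  count = c(10, 12, 15)
)
df_current <- data.frame(
  time = as.Date("2024-01-01") + 0:3,
  count = c(10, 99, 15, 20)
)

# Hypothetical helper: drop matched rows that changed, optionally keep new rows
drop_changed <- function(current, previous, id, include_new = TRUE) {
  idx <- match(current[[id]], previous[[id]]) # NA marks a new row
  is_new <- is.na(idx)
  shared <- setdiff(intersect(names(current), names(previous)), id)
  changed <- rep(FALSE, nrow(current))
  for (col in shared) {
    differs <- current[[col]][!is_new] != previous[[col]][idx[!is_new]]
    changed[!is_new] <- changed[!is_new] | differs
  }
  keep <- !changed & (include_new | !is_new)
  current[keep, , drop = FALSE]
}

# Unchanged rows plus the new row
drop_changed(df_current, df_previous, "time")

# Unchanged rows only
drop_changed(df_current, df_previous, "time", include_new = FALSE)
```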
# Dropping matched rows which contain changes, and returning unchanged rows
df_released <- butterfly::release(
  butterflycount$march, # New or current dataset
  butterflycount$february, # Previous version you are comparing it to
  datetime_variable = "time", # Unique ID variable they have in common
  include_new = TRUE # Whether to include new rows or not, default is TRUE
)

df_released
Check if a timeseries is continuous. Even if a timeseries does not contain obvious gaps, this does not automatically mean it is also continuous.
timeline(df_current, datetime_variable, expected_lag = 1)
df_current | data.frame, the newest/current version of dataset x.
datetime_variable | string, the "datetime" variable that should be checked for continuity.
expected_lag | numeric, the acceptable difference between timesteps for a timeseries to be classed as continuous. Any difference greater than expected_lag means the timeseries is not classed as continuous.
Measuring instruments can have different behaviours when they fail. For example, during power failure an internal clock could reset to "1970-01-01", or the manufacturing date (say, "2021-01-01"). As a result, there is no single predictable way to check whether a dataset is continuous.
The timeline_group() and timeline() functions attempt to give the user control over how to check for continuity by providing an expected_lag. The difference between timesteps in a dataset should not exceed the expected_lag.
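A minimal base R sketch of such a lag check is given below. It assumes daily Date values, and it treats a step as a break when the lag exceeds expected_lag or is negative (a clock reset); that rule, the helper name `is_continuous()` and the toy vectors are assumptions of this sketch rather than a description of how timeline() is implemented.

```r
# Hypothetical toy data: six consecutive days, and a series whose clock
# falls back to 1970-01-01 halfway through
continuous <- as.Date("2024-01-01") + 0:5
reset <- c(as.Date("2024-02-01") + 0:2, as.Date("1970-01-01") + 0:2)

# Hypothetical helper: TRUE when no step is larger than `expected_lag`
# and no step goes backwards in time
is_continuous <- function(x, expected_lag = 1) {
  lags <- as.numeric(diff(x), units = "days")
  !any(lags > expected_lag | lags < 0)
}

is_continuous(continuous, expected_lag = 1)
is_continuous(reset, expected_lag = 1)
```

The second series has no lag above one day in the forward direction, yet the jump back to 1970 still marks it as non-continuous.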
Note: for monthly data it is recommended you convert your Date column to a monthly format (e.g. 2024-October, 10-2024, Oct-2024 etc.), so a constant expected lag can be set (not a range of 29 - 31 days).
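The reason for this recommendation can be seen in a short base R sketch: the lag between monthly dates varies in days, but is constant once expressed in months. The toy dates and the `month_index` variable are assumptions for illustration; how timeline() itself parses monthly formats is not shown here.

```r
# Hypothetical monthly series, recorded on the first day of each month
monthly <- as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"))

# Measured in days, the lag changes from month to month
as.numeric(diff(monthly), units = "days")

# Measured in months (years * 12 + month number), the lag is constant
lt <- as.POSIXlt(monthly)
month_index <- (lt$year + 1900) * 12 + lt$mon
diff(month_index)
```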
A boolean, TRUE if the timeseries is continuous, and FALSE if there is more than one continuous timeseries within the dataset.
# A nice continuous dataset should return TRUE
butterfly::timeline(
  forestprecipitation$january,
  datetime_variable = "time",
  expected_lag = 1
)

# In February, our imaginary rain gauge's onboard computer had a failure.
# The timestamp was reset to 1970-01-01
butterfly::timeline(
  forestprecipitation$february,
  datetime_variable = "time",
  expected_lag = 1
)
If after using timeline() you have established a timeseries is not continuous, or if you are working with data where you expect distinct sequences or events, you can use timeline_group() to extract and classify different distinct continuous chunks of your data.
timeline_group(df_current, datetime_variable, expected_lag = 1)
df_current | data.frame, the newest/current version of dataset x.
datetime_variable | string, the "datetime" variable that should be checked for continuity.
expected_lag | numeric, the acceptable difference between timesteps for a timeseries to be classed as continuous. Any difference greater than expected_lag means the timeseries is not classed as continuous.
We attempt to do this without sorting or changing the data, for a couple of reasons:
There is no difference in dates: some instruments might record dates that appear identical, but are still in chronological order, for example high-frequency data in fractional seconds. This is a rare use case though.
Dates are generally ascending/descending, but the instrument has returned to origin. This is probably more common, and will result in a non-continuous dataset even though the records are still in chronological order. This is something we would like to discover, and it is accounted for in the logic in case_when().
Note: for monthly data it is recommended you convert your Date column to a monthly format (e.g. 2024-October, 10-2024, Oct-2024 etc.), so a constant expected lag can be set (not a range of 29 - 31 days).
A data.frame, identical to df_current, but with two extra columns: timeline_group, which assigns a number to each continuous set of data, and timelag, which specifies the time lags between rows.
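One way to picture these two columns is the following base R sketch, which numbers continuous chunks with a running count of breaks and does not sort the data. The break rule (lag greater than expected_lag, or negative), the `rainfall_mm` column and the toy values are assumptions; the package's own logic lives in case_when() and may differ in detail.

```r
# Hypothetical toy data in recording order: the clock resets after row 3
df <- data.frame(
  time = c(as.Date("2024-02-01") + 0:2, as.Date("1970-01-01") + 0:2),
  rainfall_mm = c(0, 2.5, 1, 0, 4, 3)
)
expected_lag <- 1

# Lag to the previous row in days; the first row has no predecessor
timelag <- c(NA, as.numeric(diff(df$time), units = "days"))

# A new chunk starts at the first row, after a gap, or after a jump backwards
is_break <- is.na(timelag) | timelag > expected_lag | timelag < 0

df$timelag <- timelag
df$timeline_group <- cumsum(is_break)

df
```

Because the rows are left in their recorded order, the reset shows up as the start of a second group instead of being hidden by sorting.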
# In February, our imaginary rain gauge's onboard computer had a failure.
# The timestamp was reset to 1970-01-01
# We want to group these different distinct continuous sequences:
butterfly::timeline_group(
  forestprecipitation$february,
  datetime_variable = "time",
  expected_lag = 1
)