--- title: "Fuzzy sets" abstract: > Describes the fuzzy sets, interpretation and how to work with them. date: "`r format(Sys.time(), '%Y %b %d')`" output: html_document: fig_caption: true code_folding: show self_contained: yes toc_float: collapsed: true toc_depth: 3 author: - name: Lluís Revilla email: lluis.revilla@gmail.com vignette: > %\VignetteIndexEntry{Fuzzy sets} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} %\DeclareUnicodeCharacter{2229}{$\cap$} %\DeclareUnicodeCharacter{222A}{$\cup$} editor_options: chunk_output_type: console --- ```{r setup, message=FALSE, warning=FALSE, include=FALSE} knitr::opts_knit$set(root.dir = ".") knitr::opts_chunk$set(collapse = TRUE, warning = TRUE, comment = "#>") library("BaseSet") library("dplyr") ``` # Getting started This vignettes supposes that you already read the "About BaseSet" vignette. This vignette explains what are the fuzzy sets and how to use them. As all methods for "normal" sets are available for fuzzy sets this vignette focuses on how to create, use them. # What are fuzzy sets and why/when use them ? Fuzzy sets are generalizations of classical sets where there is some vagueness, one doesn't know for sure something on this relationship between the set and the element. This vagueness can be on the assignment to a set and/or on the membership on the set. One way these vagueness arise is when classifying a continuous scale on a categorical scale, for example: when the temperature is 20ºC is it hot or not? If the temperature drops to 15ºC is it warm? When does the switch happen from warm to hot? In fuzzy set theories the step from a continuous scale to a categorical scale is performed by the [membership function](https://en.wikipedia.org/wiki/Membership_function_(mathematics)) and is called [fuzzification](https://en.wikipedia.org/wiki/Fuzzy_logic#Fuzzification). When there is a degree of membership and uncertainty on the membership it is considered a [type-2 fuzzy set](https://en.wikipedia.org/wiki/Type-2_fuzzy_sets_and_systems). We can understand it with using as example a paddle ball, it can be used for tennis or paddle (membership), but until we don't test it bouncing on the court we won't be sure (assignment) if it is a paddle ball or a tennis ball. We could think about a ball equally good for tennis and paddle (membership) but which two people thought it is for tennis and the other for paddle. These voting/rating system is also a common scenario where fuzzy sets arise. When several graders/people need to agree but compromise on a middle ground. One modern example of this is ratting apps where several people vote between 1 and 5 an app and the displayed value takes into consideration all the votes. As you can see when one does have some vagueness or uncertainty then fuzzy logic is a good choice. There have been developed several logic and methods. # Creating a fuzzy set To create a fuzzy set you need to have a column named "fuzzy" if you create it from a `data.frame` or have a named numeric vector if you create it from a `list`. These values are restricted to a numeric value between 0 and 1. The value indicates the strength, membership, truth value (or probability) of the relationship between the element and the set. ```{r fuzzy} set.seed(4567) # To be able to have exact replicates relations <- data.frame(sets = c(rep("A", 5), "B", "C"), elements = c(letters[seq_len(6)], letters[6]), fuzzy = runif(7)) fuzzy_set <- tidySet(relations) ``` # Working with fuzzy sets We can work with fuzzy sets as we do with normal sets. But if you remember that at the end of the previous vignette we used an Important column, now it is already included as fuzzy. This allows us to use this information for union and intersection methods and other operations: ## Union You can make a union of two sets present on the same object. ```{r union} BaseSet::union(fuzzy_set, sets = c("A", "B")) BaseSet::union(fuzzy_set, sets = c("A", "B"), name = "D") ``` We get a new set with all the elements on both sets. There isn't an element in both sets A and B, so the fuzzy values here do not change. If we wanted to use other logic we can with provide it with the FUN argument. ```{r union_logic} BaseSet::union(fuzzy_set, sets = c("A", "B"), FUN = function(x){sqrt(sum(x))}) ``` There are several logic see for instance `?sets::fuzzy_logic`. You should pick the operators that fit on the framework of your data. Make sure that the defaults arguments of logic apply to your data obtaining process. ## Intersection However if we do the intersection between B and C we can see some changes on the fuzzy value: ```{r intersection} intersection(fuzzy_set, sets = c("B", "C"), keep = FALSE) intersection(fuzzy_set, sets = c("B", "C"), keep = FALSE, FUN = "mean") ``` Different logic on the `union()`, `intersection()`, `complement()`, and `cardinality()`, we get different fuzzy values. Depending on the nature of our fuzziness and the intended goal we might apply one or other rules. ## Complement We can look for the complement of one or several sets: ```{r complement} complement_set(fuzzy_set, sets = "A", keep = FALSE) ``` Note that the values of the complement are `1-fuzzy` but can be changed: ```{r complement_previous} filter(fuzzy_set, sets == "A") complement_set(fuzzy_set, sets = "A", keep = FALSE, FUN = function(x){1-x^2}) ``` ## Subtract This is the equivalent of `setdiff`, but clearer: ```{r subtract} subtract(fuzzy_set, set_in = "A", not_in = "B", keep = FALSE, name = "A-B") # Or the opposite B-A, but using the default name: subtract(fuzzy_set, set_in = "B", not_in = "A", keep = FALSE) ``` Note that here there is also a subtraction of the fuzzy value. # Sizes If we consider the fuzzy values as probabilities then the size of a set is not fixed. To calculate the size of a given set we have `set_size()`: ```{r set_size} set_size(fuzzy_set) ``` Or an element can be in 0 sets: ```{r element_size} element_size(fuzzy_set) ``` In this example we can see that it is more probable that the element "a" is not present than the element "f" being present in one set. # Interpretation Sometimes it can be a bit hard to understand what do the fuzzy sets mean on your analysis. To better understand let's dive a bit in the interpretation with an example: Imagine you have your experiment where you collected data from a sample of cells for each cell (our elements). Then you used some program to classify which type of cell it is (alpha, beta, delta, endothelial), this are our sets. The software returns a probability for each type it has: the higher, the more confident it is of the assignment: ```{r cells_0} sc_classification <- data.frame( elements = c("D2ex_1", "D2ex_10", "D2ex_11", "D2ex_12", "D2ex_13", "D2ex_14", "D2ex_15", "D2ex_16", "D2ex_17", "D2ex_18", "D2ex_1", "D2ex_10", "D2ex_11", "D2ex_12", "D2ex_13", "D2ex_14", "D2ex_15", "D2ex_16", "D2ex_17", "D2ex_18", "D2ex_1", "D2ex_10", "D2ex_11", "D2ex_12", "D2ex_13", "D2ex_14", "D2ex_15", "D2ex_16", "D2ex_17", "D2ex_18", "D2ex_1", "D2ex_10", "D2ex_11", "D2ex_12", "D2ex_13", "D2ex_14", "D2ex_15", "D2ex_16", "D2ex_17", "D2ex_18"), sets = c("alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "endothel", "endothel", "endothel", "endothel", "endothel", "endothel", "endothel", "endothel", "endothel", "endothel", "delta", "delta", "delta", "delta", "delta", "delta", "delta", "delta", "delta", "delta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta"), fuzzy = c(0.18, 0.169, 0.149, 0.192, 0.154, 0.161, 0.169, 0.197, 0.162, 0.201, 0.215, 0.202, 0.17, 0.227, 0.196, 0.215, 0.161, 0.195, 0.178, 0.23, 0.184, 0.172, 0.153, 0.191, 0.156, 0.167, 0.165, 0.184, 0.162, 0.194, 0.197, 0.183, 0.151, 0.208, 0.16, 0.169, 0.169, 0.2, 0.154, 0.208), stringsAsFactors = FALSE) head(sc_classification) ``` Our question is **which type of cells did we have on the original sample?** We can easily answer this by looking at the relations that have higher confidence of the relationship for each cell. ```{r cells_classification} sc_classification %>% group_by(elements) %>% filter(fuzzy == max(fuzzy)) %>% group_by(sets) %>% count() ``` There is a cell that can be in two cell types, because we started with 10 cells and we have 11 elements here. However, how likely is that a cell is placed to just a single set? ```{r cells_subset} scTS <- tidySet(sc_classification) # Conversion of format sample_cells <- scTS %>% element_size() %>% group_by(elements) %>% filter(probability == max(probability)) sample_cells ``` There must be some cell misclassification: we have `r nElements(scTS)` cells in this example but there are `r sum(sample_cells$size)` cells that the maximum probability predicted for these types is a single cell type. Even the cell that had two equal probabilities for two cell types is more probable to be in no one of these cell types than in any of them. Ideally the predicted number of cells per type and the cells with higher confidence about the type should match. We can also look the other way around: How good is the prediction of a cell type for each cell? ```{r celltypes} scTS %>% set_size() %>% group_by(sets) %>% filter(probability == max(probability)) ``` We can see that for each cell type it is probable to have at least one cell and in the endothelial cell type two cells is the most probable outcome. However, these probabilities are lower than the probabilities of cells being assigned a cell type. This would mean that this method is not a good method or that the cell types are not specific enough for the cell. In summary, the cells that we had most probable are not those 4 cell types except in two cells were it might be. # Session info {.unnumbered} ```{r sessionInfo, echo=FALSE} sessionInfo() ```