Quickstart
Prepare a sample dataset for Naive Bayes
In the first step, we simulate a dataset with the following specifications:
- a total of 5 classes, of which 2 are known and 3 are unknown
- n_labeled = 300 labeled samples, evenly distributed over the 2 known classes
- n_unlabeled = 3000 unlabeled samples, evenly distributed over the 5 classes
- n_feats = 2 features (a two-dimensional feature space is easy to visualize)
Each class \(i\) is represented by a bivariate Gaussian distribution with mean vector \(\mu^{(i)} = (\mu_1^{(i)}, \mu_2^{(i)})\) and covariance matrix \[\Sigma = \left(\begin{array}{cc}1 & 0 \\ 0 & 1\end{array}\right)\] (identical for all classes). The known classes are 0 and 1.
library(mvtnorm)
set.seed(1)
# sample sizes
n_labeled <- 300
n_unlabeled <- 3000
n_feats <- 2
# mean values and covariance matrices
mu <- rbind(
  c(-5, 10),
  c(0, 5),
  c(-5, 0),
  c(-10, 5),
  c(10, -10)
)
rownames(mu) <- 0:4
sigma <- diag(rep(1, n_feats))
print(mu)
#> [,1] [,2]
#> 0 -5 10
#> 1 0 5
#> 2 -5 0
#> 3 -10 5
#> 4 10 -10
print(sigma)
#> [,1] [,2]
#> [1,] 1 0
#> [2,] 0 1
Given the data specifications, we simulate from the bivariate Gaussian distributions to generate the dataset from the class vector. We distinguish between the true class vector ytrue, which contains the labels 0-4 for all samples, and the model input class vector y, which contains the true label only for the labeled samples and NA otherwise.
# specify number of labeled / unlabeled samples per class
labeled_classes <- rep(0:1, each = n_labeled / 2)
unlabeled_classes <- rep(0:4, each = n_unlabeled / 5)
num_sample <- c(table(labeled_classes), table(unlabeled_classes))
# simulate X, ytrue and y
X <- c()
for (i in seq_along(num_sample)) {
  X <- rbind(X, rmvnorm(num_sample[i], mu[names(num_sample)[i], ], sigma))
}
# append a constant third feature column (the classifier ignores constant features)
X <- cbind(X, 1)
# labeled classes 0 and 1 enter the model as labels 1 and 2; unlabeled samples get NA
y <- rep(c(1:2, NA), c(table(labeled_classes), length(unlabeled_classes)))
ytrue <- rep(names(num_sample), num_sample)
colnames(X) <- paste0("x", 1:3)
# summaries of the model input class vector y, and the true class vector ytrue
summary(as.factor(y))
#> 1 2 NA's
#> 150 150 3000
summary(as.factor(ytrue))
#> 0 1 2 3 4
#> 750 750 600 600 600
Using the model input class vector y and the simulated feature matrix X, we generate the input dataset and the formula for the model.
# input dataset for the model
data <- as.data.frame(cbind(X, y))
# model formula
formula <- y ~ x1 + x2 + x3 - 1
# simulated data
head(data)
#> x1 x2 x3 y
#> 1 -5.626454 10.183643 1 1
#> 2 -5.835629 11.595281 1 1
#> 3 -4.670492 9.179532 1 1
#> 4 -4.512571 10.738325 1 1
#> 5 -4.424219 9.694612 1 1
#> 6 -3.488219 10.389843 1 1
The simulated dataset is shown in the following scatterplot:
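The rendered plot is not reproduced here. A minimal base R sketch that produces a comparable scatterplot, colored by the true class, could look as follows; this snippet is an assumption on our side and not part of the original code:
# hedged sketch (not in the original vignette): scatterplot of the simulated
# data colored by true class; filled points mark labeled samples,
# open circles the unlabeled ones
plot(data$x1, data$x2, col = as.integer(as.factor(ytrue)),
     pch = ifelse(is.na(y), 19, 1), xlab = "x1", ylab = "x2")
legend("topright", legend = levels(as.factor(ytrue)),
       col = 1:5, pch = 19)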
Train SSC-UC model
The SSC-UC model is trained via the SSC function of the SSCUC package and returns a BayesClassifier object covering both known and unknown classes. In this case, a naive Bayes classifier is used, specified by the argument naive = TRUE. As the log below shows, SSC repeatedly runs the EM algorithm with a decreasing number of candidate unknown classes and stops as soon as the BIC increases.
library(SSCUC)
# train model
model <- SSC(formula, data, naive = TRUE)
#> [1] "Starting EM with 4 classes"
#> Warning in BayesClassifier(formula, data, naive = naive, prior = prior, :
#> BayesClassifier removed 1157 NAs
#> [1] "EM converged after 4 iterations"
#> [1] "BIC: 19145.225046868"
#> [1] "Trying EM with 3 classes"
#> Warning in BayesClassifier(formula, data, naive = naive, prior = prior, :
#> BayesClassifier removed 1157 NAs
#> [1] "EM converged after 3 iterations"
#> [1] "BIC: 21005.6744068909"
#> [1] "BIC increased when updating - stopping"
summary(model)
#> BayesClassifier model with 6 classes and 2 non-constant features
#> Note: the total number of features is 3
#> ==============================
#> formula: y ~ x1 + x2 + x3 - 1
#> used features: x1, x2
#> parameters:
|    | mu          | Sigma         | prior |
|----|-------------|---------------|-------|
| 1  | -5.03,10.02 | 0.92,0,0,0.88 | 0.21  |
| 2  | -0.01,5.04  | 0.94,0,0,0.96 | 0.21  |
| U1 | -3.03,7.35  | 10.93,0,0,9.3 | 0.03  |
| U2 | -4.97,-0.02 | 1.03,0,0,1.07 | 0.18  |
| U3 | -10.08,5.05 | 0.99,0,0,0.88 | 0.18  |
| U4 | 9.98,-10.02 | 0.97,0,0,1.18 | 0.18  |
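Note that the estimated parameters of U2-U4 closely match the true parameters of the unknown classes 2-4, while U1 acts as a low-prior catch-all component with large variances. For a quick side-by-side comparison (a hypothetical check, not part of the original vignette), the true means can be printed again:
# hedged check: true means of the unknown classes 2-4, for comparison with
# the estimated means of U2-U4 in the summary above
print(mu[c("2", "3", "4"), ])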
Predict and evaluate unlabeled data using the SSC-UC model
Class labels for the unlabeled data are obtained using the predict function with the argument type = "class".
# predict on unlabeled data
pred <- predict(model, newdata = subset(data, is.na(y)), type = "class")
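As a quick plausibility check (not part of the original vignette), the distribution of the predicted labels can be inspected before building the confusion matrices:
# hedged check: number of unlabeled samples per predicted class
table(pred)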
Evaluate with the full confusion matrix (multiple unknown classes)
The full confusion matrix contains all the classes modeled by the BayesClassifier:
- the class labels 0-4 from the data specification as reference labels (note, however, that only classes 0 and 1 are known to the model a priori),
- the model's labels for the known classes (1-2) and the unknown classes (U1-U4) as predicted labels.
library(caret)
#> Loading required package: lattice
library(knitr)
# specify levels
levels <- sort(union(unique(pred), unique(ytrue)))
kable(confusionMatrix(
  reference = factor(ytrue[is.na(y)], levels = levels),
  data = factor(pred, levels = levels)
)$table[sort(unique(pred)), sort(unique(ytrue))])
|    | 0   | 1   | 2   | 3   | 4   |
|----|-----|-----|-----|-----|-----|
| 1  | 592 | 0   | 0   | 0   | 0   |
| 2  | 0   | 598 | 0   | 0   | 0   |
| U1 | 8   | 2   | 0   | 1   | 0   |
| U2 | 0   | 0   | 600 | 0   | 0   |
| U3 | 0   | 0   | 0   | 599 | 0   |
| U4 | 0   | 0   | 0   | 0   | 600 |
The following plot shows the unlabeled data with their predicted class labels:
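The plot is again not reproduced here; a hypothetical sketch in the same base R style as above could be:
# hedged sketch (not in the original vignette): unlabeled data colored by
# predicted class label
pred_f <- as.factor(pred)
plot(data$x1[is.na(y)], data$x2[is.na(y)], col = as.integer(pred_f),
     xlab = "x1", ylab = "x2")
legend("topright", legend = levels(pred_f),
       col = seq_along(levels(pred_f)), pch = 1)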
Evaluate with the reduced confusion matrix (all unknown classes as class "U")
As a second evaluation step, we can collapse all "unknown" classes detected by SSC-UC into a single class "U" (unknown). This reduces the predicted labels in the confusion matrix to the known classes (1-2) and "U".
# set all new class labels to "U"
pred[!pred %in% unique(ytrue)] <- "U"
# specify levels
levels <- sort(union(unique(pred), unique(ytrue)))
kable(confusionMatrix(
  reference = factor(ytrue[is.na(y)], levels = levels),
  data = factor(pred, levels = levels)
)$table[sort(unique(pred)), sort(unique(ytrue))])
|   | 0   | 1   | 2   | 3   | 4   |
|---|-----|-----|-----|-----|-----|
| 1 | 592 | 0   | 0   | 0   | 0   |
| 2 | 0   | 598 | 0   | 0   | 0   |
| U | 8   | 2   | 600 | 600 | 600 |
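Based on the reduced confusion matrix, an overall accuracy on the unlabeled data can be computed. The following sketch is hypothetical; it assumes that model label 1 corresponds to true class 0 and label 2 to true class 1, as the confusion matrix above suggests:
# hedged sketch: overall accuracy after collapsing the unknown classes;
# the mapping from true classes to model labels is an assumption
truth_mapped <- c("0" = "1", "1" = "2", "2" = "U", "3" = "U", "4" = "U")[ytrue[is.na(y)]]
mean(pred == truth_mapped)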
The following plot shows the unlabeled data with their predicted class labels when all unknown classes are considered as one class.
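Since pred has already been collapsed, re-running the plotting sketch from above (again hypothetical) produces this view:
# hedged sketch: same plot as before, now with the collapsed labels 1, 2, U
plot(data$x1[is.na(y)], data$x2[is.na(y)], col = as.integer(as.factor(pred)),
     xlab = "x1", ylab = "x2")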