Bayes Classifier example
Prepare a simulated dataset for a Bayes Classifier
In the first step, we simulate a dataset with the following specifications:
- a total of 10 classes, of which 4 are known and 6 are unknown
- n_labeled = 400 labeled samples, evenly distributed over the 4 known classes
- n_unlabeled = 5000 unlabeled samples, evenly distributed over all 10 classes
- n_feats = 2 features (to keep the data easy to visualize)
Each class \(i\) is represented by a bivariate Gaussian distribution with mean vector \(\mu^{(i)} = (\mu_1^{(i)}, \mu_2^{(i)})\) and symmetric covariance matrix \[\Sigma_i = \left(\begin{array}{cc}\sigma_{1,1}^{(i)} & \sigma_{1,2}^{(i)} \\ \sigma_{1,2}^{(i)} & \sigma_{2,2}^{(i)}\end{array}\right).\] The known classes are 0-3.
library(mvtnorm)
set.seed(1)
# sample sizes
n_labeled <- 400
n_unlabeled <- 5000
n_feats <- 2
# mean vectors: one row per class, drawn uniformly from [-10, 10]
mu <- t(replicate(10, runif(2, -10, 10)))
# covariance matrices: each row stores a vectorized 2x2 matrix t(A) %*% A,
# which is symmetric and positive semi-definite by construction
sigma <- t(replicate(10, {
  A <- matrix(runif(2^2) * 2 - 1, ncol = 2)
  as.vector(t(A) %*% A)
}))
rownames(mu) <- rownames(sigma) <- 0:9
print(mu)
#> [,1] [,2]
#> 0 -4.689827 -2.55752201
#> 1 1.457067 8.16415580
#> 2 -5.966361 7.96779370
#> 3 8.893505 3.21595585
#> 4 2.582281 -8.76427459
#> 5 -5.880509 -6.46886495
#> 6 3.740457 -2.31792564
#> 7 5.396828 -0.04601516
#> 8 4.352370 9.83812190
#> 9 -2.399296 5.54890443
print(sigma)
#> [,1] [,2] [,3] [,4]
#> 0 1.0873223 0.69488058 0.69488058 0.6528557
#> 1 0.2686249 0.50666811 0.50666811 1.0024862
#> 2 0.6486391 -0.09008240 -0.09008240 0.0409379
#> 3 0.3940044 -0.21990520 -0.21990520 0.5422173
#> 4 0.9611412 0.40244041 0.40244041 0.2316753
#> 5 0.4985329 0.39442301 0.39442301 0.3314552
#> 6 0.3384411 -0.08302008 -0.08302008 0.9109265
#> 7 0.3644605 0.25766878 0.25766878 0.5238927
#> 8 0.2758416 0.51517414 0.51517414 1.3789753
#> 9 0.1364133 -0.12600427 -0.12600427 0.1397050
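Because each \(\Sigma_i\) is constructed as \(A^\top A\), it is guaranteed to be symmetric and positive semi-definite. A quick sanity check (not part of the original vignette) could look like this:

# check that every row of sigma encodes a symmetric, positive semi-definite 2x2 matrix
all(apply(sigma, 1, function(s) {
  S <- matrix(s, nrow = 2, ncol = 2)
  isSymmetric(S) && all(eigen(S, only.values = TRUE)$values >= -1e-12)
}))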
Given these specifications, we simulate from the bivariate Gaussian distributions to generate the dataset given the class vector. We distinguish between the true class vector ytrue, which contains the labels 0-9 for all samples, and the model input class vector y, which contains the true label only for the labeled samples and NA otherwise.
# specify the number of labeled / unlabeled samples per class
labeled_classes <- rep(0:3, each = n_labeled / 4)
unlabeled_classes <- rep(0:9, each = n_unlabeled / 10)
num_sample <- c(table(labeled_classes), table(unlabeled_classes))
# simulate X, ytrue and y
X <- c()
for (i in seq_along(num_sample)) {
  X <- rbind(X, rmvnorm(
    num_sample[i],
    mu[names(num_sample)[i], ],
    matrix(sigma[names(num_sample)[i], ], nrow = 2, ncol = 2)
  ))
}
y <- rep(c(0:3, NA), c(table(labeled_classes), length(unlabeled_classes)))
ytrue <- rep(names(num_sample), num_sample)
colnames(X) <- paste0("x", 1:2)
# summaries of the model input class vector y, and the true class vector ytrue
summary(as.factor(y))
#> 0 1 2 3 NA's
#> 100 100 100 100 5000
summary(as.factor(ytrue))
#> 0 1 2 3 4 5 6 7 8 9
#> 600 600 600 600 500 500 500 500 500 500
Using the model input class vector y and the simulated feature matrix X, we generate the input dataset and the formula for the model.
# input dataset for the model
data <- as.data.frame(cbind(X, y))
# model formula ("- 1" drops the intercept; only the features x1 and x2 enter the model)
formula <- y ~ x1 + x2 - 1
# simulated data
head(data)
#> x1 x2 y
#> 1 -3.438659 -2.052945 0
#> 2 -4.343196 -2.430395 0
#> 3 -6.177755 -3.426197 0
#> 4 -5.090663 -2.765295 0
#> 5 -3.318454 -1.566550 0
#> 6 -4.953884 -2.801539 0
The simulated dataset is shown in the following scatterplot:
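In the rendered vignette, the figure is produced by a plot chunk. A minimal base-R sketch that reproduces such a scatterplot (colored by true class, filled points for labeled samples) could look like this; the later prediction plots can be drawn analogously:

# scatterplot sketch: color by true class, filled points for labeled samples
plot(X, col = rainbow(10)[as.integer(as.factor(ytrue))],
     pch = ifelse(is.na(y), 1, 19),
     xlab = "x1", ylab = "x2", main = "Simulated dataset")
legend("topleft", legend = 0:9, col = rainbow(10), pch = 19, title = "class", cex = 0.7)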
Train SSC-UC model
The SSC-UC model is trained via the SSC() function from the SSCUC package and returns a BayesClassifier object with known and unknown classes. In this case, a full (non-naive) Bayes classifier is used, specified by the argument naive = FALSE. As the log output below shows, the number of unknown classes is chosen via the BIC: after an initial clustering of the unlabeled data, the EM algorithm is re-run with fewer unknown classes as long as the BIC improves.
library(SSCUC)
# train model
model <- SSC(formula, data, naive = FALSE)
#> Warning in summary.mclustBIC(BIC, data, G = G, modelNames = modelNames): best
#> model occurs at the min or max of number of components considered!
#> Warning in Mclust(X[newclass_inds, !const_cols, drop = F], G = g_opts,
#> modelNames = gmmModelName, : optimal number of clusters occurs at max choice
#> [1] "Starting EM with 6 classes"
#> Warning in BayesClassifier(formula, data, naive = naive, prior = prior, :
#> BayesClassifier removed 1957 NAs
#> [1] "EM converged after 5 iterations"
#> [1] "BIC: 18418.8851091367"
#> [1] "Trying EM with 5 classes"
#> Warning in BayesClassifier(formula, data, naive = naive, prior = prior, :
#> BayesClassifier removed 1957 NAs
#> [1] "EM converged after 4 iterations"
#> [1] "BIC: 22778.8389840301"
#> [1] "BIC increased when updating - stopping"
summary(model)
#> BayesClassifier model with 10 classes and 2 non-constant features
#> ==============================
#> formula: y ~ x1 + x2 - 1
#> used features: x1, x2
#> parameters:
|    | mu          | Sigma                   | prior |
|----|-------------|-------------------------|-------|
| 0  | -4.66,-2.53 | 0.84,0.52,0.52,0.5      | 0.09  |
| 1  | 1.47,8.19   | 0.27,0.48,0.48,0.96     | 0.11  |
| 2  | -5.94,7.96  | 0.67,-0.09,-0.09,0.05   | 0.11  |
| 3  | 8.81,3.25   | 0.31,-0.18,-0.18,0.44   | 0.09  |
| U1 | 2.51,-8.77  | 0.96,0.4,0.4,0.24       | 0.09  |
| U2 | 4.55,-1.18  | 1.08,1.06,1.06,1.93     | 0.18  |
| U3 | -5.88,-6.47 | 0.56,0.45,0.45,0.4      | 0.09  |
| U4 | -2.4,5.55   | 0.13,-0.11,-0.11,0.13   | 0.09  |
| U5 | 4.37,9.92   | 0.27,0.5,0.5,1.37       | 0.09  |
| U6 | 0.29,1.24   | 43.47,11.45,11.45,17.91 | 0.05  |

Note that U1, U3, U4, and U5 closely recover the true parameters of the unknown classes 4, 5, 9, and 8, respectively (compare mu above); U2 lies between the true classes 6 and 7 and merges them into a single component, while U6, with its large covariance and small prior, acts as a catch-all for outlying samples.
Predict and evaluate unlabeled data using SSC-UC model
Class labels for the unlabeled data in the dataset are obtained using the predict function with the argument type = "class".
# predict on unlabeled data
pred <- predict(model, newdata = subset(data, is.na(y)), type = "class")
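A quick look at the distribution of the predicted labels (output omitted here; this check is not part of the original vignette):

# tabulate the predicted class labels of the unlabeled samples
table(pred)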
Evaluate in full confusion matrix (multiple unknown classes)
The full confusion matrix contains all the classes modeled by the BayesClassifier:
- the class labels 0-9 from the data specification as reference labels (note, however, that only classes 0-3 are known to the model a priori),
- the class labels for known classes (0-3) and unknown classes (U1-U6) as predicted labels.
library(caret)
#> Loading required package: lattice
library(knitr)
# specify levels
levels <- sort(union(unique(pred), unique(ytrue)))
kable(confusionMatrix(
  reference = factor(ytrue[is.na(y)], levels = levels),
  data = factor(pred, levels = levels)
)$table[sort(unique(pred)), sort(unique(ytrue))])
|    |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |
|----|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0  | 489 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 1  |   0 | 500 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 2  |   0 |   0 | 500 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 3  |   0 |   0 |   0 | 498 |   0 |   0 |   0 |   0 |   0 |   0 |
| U1 |   0 |   0 |   0 |   0 | 500 |   0 |   0 |   0 |   0 |   0 |
| U2 |   0 |   0 |   0 |   0 |   0 |   0 | 497 | 500 |   0 |   0 |
| U3 |   0 |   0 |   0 |   0 |   0 | 500 |   0 |   0 |   0 |   0 |
| U4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 | 500 |
| U5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 | 500 |   0 |
| U6 |  11 |   0 |   0 |   2 |   0 |   0 |   3 |   0 |   0 |   0 |
The following plot shows the unlabeled data with their predicted class labels:
Evaluate in reduced confusion matrix (all unknown classes as class “U”)
As a second evaluation step, we can summarize all "unknown" classes detected by SSC-UC into one class "U" (unknown). This reduces the set of labels in the confusion matrix to 0-3 and U.
# set all new class labels to "U"
pred[!pred %in% unique(ytrue)] <- "U"
# specify levels
levels <- sort(union(unique(pred), unique(ytrue)))
kable(confusionMatrix(
  reference = factor(ytrue[is.na(y)], levels = levels),
  data = factor(pred, levels = levels)
)$table[sort(unique(pred)), sort(unique(ytrue))])
|   |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |
|---|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0 | 489 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 1 |   0 | 500 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 2 |   0 |   0 | 500 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 3 |   0 |   0 |   0 | 498 |   0 |   0 |   0 |   0 |   0 |   0 |
| U |  11 |   0 |   0 |   2 | 500 | 500 | 500 | 500 | 500 | 500 |
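As an additional summary (not part of the original vignette), the overall accuracy on the unlabeled data can be computed by merging the truly unknown classes of the reference vector in the same way:

# recode the truly unknown classes (4-9) of the reference as "U" and compute accuracy
ytrue_unlabeled <- ytrue[is.na(y)]
ytrue_unlabeled[!ytrue_unlabeled %in% as.character(0:3)] <- "U"
mean(pred == ytrue_unlabeled)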
The following plot shows the unlabeled data with their predicted class labels, when considering all unknown classes as one class.