
Probabilistic Logistic Regression and Deep Learning



This article belongs to the series “Probabilistic Deep Learning”. This weekly series covers probabilistic approaches to deep learning. The main goal is to extend deep learning models to quantify uncertainty, i.e., know what they do not know.

In this article, we will introduce the concept of probabilistic logistic regression, a powerful technique that allows for the inclusion of uncertainty in the prediction process. We will explore how this approach can lead to more robust and accurate predictions, especially in cases where the data is noisy, or the model is overfitting. Additionally, by incorporating a prior distribution on the model parameters, we can regularize the model and prevent overfitting. This approach serves as a great first step into the exciting world of Bayesian Deep Learning.

Articles published so far:

  1. Gentle Introduction to TensorFlow Probability: Distribution Objects
  2. Gentle Introduction to TensorFlow Probability: Trainable Parameters
  3. Maximum Likelihood Estimation from scratch in TensorFlow Probability
  4. Probabilistic Linear Regression from scratch in TensorFlow
  5. Probabilistic vs. Deterministic Regression with Tensorflow
  6. Frequentist vs. Bayesian Statistics with Tensorflow
  7. Deterministic vs. Probabilistic Deep Learning
  8. Naive Bayes from scratch with TensorFlow
  9. Probabilistic Logistic Regression with TensorFlow
Figure 1: The motto for today: lines can separate more things than we give them credit for (source)

As usual, the code is available on my GitHub.

In our previous article in this series we built the Naive Bayes algorithm from scratch and used it to classify wine samples based on selected characteristics. This time, we will be using a probabilistic logistic regression approach. Since we already followed the end-to-end approach, I will skip most of the Exploratory Data Analysis section and the class prior distribution definition.

The only thing to note is that there is a difference in the features that we selected for this model.

Figure 2: Target samples distribution by alcohol and hue.

We will use hue and flavanoids as our independent variables. Notice how these features are more effective in separating the target variable than alcohol and hue.

Figure 3: Target samples distribution by flavanoids and hue.

Logistic regression is a widely used statistical method for binary classification: it models the probability of a binary response variable as a function of one or more predictor variables. The traditional logistic regression model is deterministic, in that it assumes the relationship between the predictor variables and the response variable is fixed and known. However, in many real-world applications the true relationship between the predictors and the response is uncertain, and a probabilistic approach is more appropriate.

Probabilistic logistic regression models the relationship between the predictor variables and the binary response variable using a probabilistic framework, and is able to account for uncertainty in the data and the model parameters. This is achieved by placing probability distributions over the model parameters, rather than assuming fixed values for them. In this way, probabilistic logistic regression models can provide more accurate predictions and better uncertainty quantification compared to traditional logistic regression models.

One of the most popular probabilistic formulations is the Bayesian logistic regression model. It is based on Bayes’ theorem, which states that the posterior probability of a model parameter given the data is proportional to the product of the likelihood of the data given the parameter and the prior probability of the parameter. Because the logistic (Bernoulli) likelihood has no conjugate prior, the posterior distribution over the parameters generally has no closed form and is approximated in practice, for example with the Laplace approximation, variational inference, or Markov chain Monte Carlo. The (approximate) posterior then allows us to compute the probability of the response variable given the predictor variables, known as the posterior predictive distribution.
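To make Bayes’ theorem concrete in this setting, here is a minimal sketch (illustrative only, and not the generative approach used later in this article) of the unnormalized log-posterior of a Bayesian logistic regression in TensorFlow Probability; the names weights, bias, features and labels are placeholders rather than objects defined elsewhere in this series:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def log_unnormalized_posterior(weights, bias, features, labels):
    # log p(w | D) = log p(D | w) + log p(w) + const  (Bayes' theorem)
    prior = tfd.Normal(loc=0., scale=1.)                        # prior over each parameter
    log_prior = tf.reduce_sum(prior.log_prob(weights)) + prior.log_prob(bias)
    logits = tf.linalg.matvec(features, weights) + bias         # linear predictor
    log_likelihood = tf.reduce_sum(tfd.Bernoulli(logits=logits).log_prob(labels))
    return log_likelihood + log_prior

Since this posterior has no closed form, it would typically be explored with MCMC (e.g. tfp.mcmc) or approximated with variational inference.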

In this section, we present a method for computing the class-conditional densities in the probabilistic approach to logistic regression. Our method is based on the maximum likelihood estimate for the means, which is given by

$$\hat{\mu}_{ik} = \frac{\sum_{n} X_i^{(n)}\,\delta(Y^{(n)} = y_k)}{\sum_{n} \delta(Y^{(n)} = y_k)},$$

where $X_i^{(n)}$ is the $i$-th feature of the $n$-th sample, $Y^{(n)}$ is the target label of the $n$-th sample, $y_k$ is the class label, and $\delta(Y^{(n)} = y_k)$ is an indicator function that equals 1 if $Y^{(n)} = y_k$, and 0 otherwise.
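As a quick illustration (a minimal sketch, assuming the training features and labels are already loaded into NumPy arrays named x_train and y_train, as used later in this article), the estimate reduces to a per-class average of the features:

import numpy as np

# mu[k, i] is the mean of feature i over the training samples of class k
mu = np.stack([x_train[np.squeeze(y_train) == k].mean(axis=0)
               for k in np.unique(y_train)]).astype(np.float32)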

To estimate the standard deviations $\sigma_i$, instead of using a closed-form solution we learn these parameters from the data. We achieve this by implementing a custom training loop, which optimizes the values of the standard deviations by minimizing the negative log-likelihood of the data.

Our function computes the means $\mu_{ik}$ of the class-conditional Gaussians according to the above equation. Then, it creates a multivariate Gaussian distribution object using MultivariateNormalDiag with the means set to $\mu_{ik}$ and the scales set to the TensorFlow Variable.

The function runs a custom training loop for a specified number of epochs. In each iteration, the negative log-likelihood of the data is computed, the gradients are backpropagated, and the scales variable is updated accordingly; the current value of the scales variable is also saved.

It returns a tuple of two objects: the scales variable at each iteration and the final learned batched MultivariateNormalDiag distribution object.

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions


def train(x, y, scales, optimiser, epochs):
    estimated_scales = []
    n_classes = np.unique(y).shape[0]

    # Maximum likelihood estimate of the class-conditional means (one row per class)
    mean_cond_class = []
    for c_k in range(n_classes):
        mean_cond_class.append(np.mean(x[np.squeeze(y == c_k)], axis=0))
    mean_cond_class = np.asarray(mean_cond_class, dtype=np.float32)

    # Batched diagonal Gaussian: one distribution per class, with trainable shared scales
    mv_normal_diag = tfd.MultivariateNormalDiag(loc=mean_cond_class, scale_diag=scales)

    # Add a batch axis so that log_prob broadcasts over the class dimension
    x = np.expand_dims(x, 1).astype('float32')

    for i in range(epochs):
        with tf.GradientTape() as tape:
            tape.watch(mv_normal_diag.trainable_variables)
            neg_log_probs = -mv_normal_diag.log_prob(x)
            # Negative log-likelihood of each sample under its own class distribution
            p1 = tf.reduce_sum(neg_log_probs[np.squeeze(y == 0)][:, 0])
            p2 = tf.reduce_sum(neg_log_probs[np.squeeze(y == 1)][:, 1])
            loss = p1 + p2
        grads = tape.gradient(loss, mv_normal_diag.trainable_variables)
        optimiser.apply_gradients(zip(grads, mv_normal_diag.trainable_variables))

        estimated_scales.append(mv_normal_diag.trainable_variables[0].numpy())
        print('Step {:03d}: Loss: {:.3f}: Scale1: {:.3f}: Scale2: {:.3f}'.format(
            i, loss,
            mv_normal_diag.trainable_variables[0].numpy()[0],
            mv_normal_diag.trainable_variables[0].numpy()[1]))

    estimated_scales = np.asarray(estimated_scales)
    return estimated_scales, mv_normal_diag

Let’s create our variables to be trained.

scales = tf.Variable([1., 1.], name='scales')
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
epochs = 100

We are now ready to start the training procedure.

scales_arr, class_conditionals_binary = train(x_train, y_train, scales, opt, epochs)

-----
Step 000: Loss: 290.708: Scale1: 0.990: Scale2: 0.990
Step 001: Loss: 288.457: Scale1: 0.980: Scale2: 0.980
Step 002: Loss: 286.196: Scale1: 0.970: Scale2: 0.970
Step 003: Loss: 283.924: Scale1: 0.960: Scale2: 0.960
Step 004: Loss: 281.641: Scale1: 0.950: Scale2: 0.950
Step 005: Loss: 279.348: Scale1: 0.940: Scale2: 0.940
[...]

Finally, we can check how the model separates our classes of wine.

Figure 4: Class-conditional density contours.

Using the function that we defined in the previous article, we can generate predictions for our test set. In the plot above we can see that the classes are well separated, and thus we get a good accuracy from our model.
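For reference, a prediction function along these lines can be sketched as follows; this is an assumption about its general shape rather than the exact implementation from the previous article, and it assumes the prior was built with explicit class probabilities (as in the parameter-extraction code further below):

import numpy as np
import tensorflow as tf

def predict_classes(prior, class_conditionals, x):
    # Generative classification: pick the class maximizing log prior + class-conditional log-likelihood
    x = np.expand_dims(x, 1).astype('float32')           # shape (N, 1, D), broadcasts over classes
    log_likelihoods = class_conditionals.log_prob(x)     # shape (N, n_classes)
    log_joint = tf.math.log(prior.parameters['probs']) + log_likelihoods
    return tf.argmax(log_joint, axis=-1).numpy()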

predictions = predict(prior_binary, class_conditionals_binary, x_test)

accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))

---------
Test accuracy: 0.92

To qualitatively assess the performance of our probabilistic logistic regression model, we plot the decision regions. These regions, defined by the boundaries that separate the two classes, provide insight into the model's ability to separate the classes. The model separates the two classes effectively, as evidenced by the visually distinct regions. Note, however, that the decision boundary is constrained to be linear, as per the assumptions of the logistic regression model.

plt.figure(figsize=(9, 5))
plot_data(x_train, y_train)
x0_min, x0_max = x_train[:, 0].min() - 0.5, x_train[:, 0].max() + 0.5
x1_min, x1_max = x_train[:, 1].min() - 0.5, x_train[:, 1].max() + 0.5
contour_plot((x0_min, x0_max), (x1_min, x1_max),
             lambda x: predict(prior_binary, class_conditionals_binary, x),
             1, label_colors, levels=[-0.5, 0.5, 1.5],
             num_points=200)
plt.title("Training set with decision regions")
plt.show()
Figure 5: Class-conditional decision regions.

In this section, we link the above definitions of the class-conditional densities to logistic regression. We show that the predictive distribution $P(Y=y_0 \mid X)$ can be written as

$$P(Y=y_0 \mid X) = \frac{P(X \mid Y=y_0)\,P(Y=y_0)}{P(X \mid Y=y_0)\,P(Y=y_0) + P(X \mid Y=y_1)\,P(Y=y_1)},$$

where $P(X \mid Y=y_0)$ and $P(X \mid Y=y_1)$ are the class-conditional densities, and $P(Y=y_0)$ and $P(Y=y_1)$ are the class priors.

This equation can be re-arranged to give $P(Y=y_0 \mid X) = \sigma(a)$, where

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

is the sigmoid function, and

$$a = \log \frac{P(X \mid Y=y_0)\,P(Y=y_0)}{P(X \mid Y=y_1)\,P(Y=y_1)}$$

is the log-odds.

With our additional modeling assumption of a shared covariance matrix $\Sigma$, it can be shown, using the Gaussian pdf, that $a$ is in fact a linear function of $X$,

$$a = w^\top X + w_0,$$

where

$$w = \Sigma^{-1}(\mu_0 - \mu_1), \qquad w_0 = -\tfrac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0 + \tfrac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + \log \frac{P(Y=y_0)}{P(Y=y_1)}.$$

This linear function, $a = w^\top X + w_0$, explains why the decision boundary of a logistic regression is linear. The parameters $w$ and $w_0$ are functions of the class-conditional densities $P(X \mid Y=y_0)$ and $P(X \mid Y=y_1)$ and the class priors $P(Y=y_0)$ and $P(Y=y_1)$. These parameters are typically estimated with maximum likelihood, as we have done in previous sections.
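For completeness, here is a brief sketch of that derivation: when the shared-covariance Gaussian densities are substituted into the log-odds, the normalization constants and the quadratic terms in $X$ cancel, leaving a linear function.

\begin{aligned}
a &= \log \frac{P(X \mid Y=y_0)\,P(Y=y_0)}{P(X \mid Y=y_1)\,P(Y=y_1)} \\
  &= -\tfrac{1}{2}(X-\mu_0)^\top \Sigma^{-1}(X-\mu_0)
     + \tfrac{1}{2}(X-\mu_1)^\top \Sigma^{-1}(X-\mu_1)
     + \log \frac{P(Y=y_0)}{P(Y=y_1)} \\
  &= (\mu_0-\mu_1)^\top \Sigma^{-1} X
     - \tfrac{1}{2}\mu_0^\top \Sigma^{-1}\mu_0
     + \tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
     + \log \frac{P(Y=y_0)}{P(Y=y_1)}
   = w^\top X + w_0 .
\end{aligned}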

In this section, we use the equations derived in previous sections to directly parameterize the output Bernoulli distribution of the generative logistic regression model. Specifically, we use the prior distribution and class-conditional distributions to compute the weights and bias terms $w$ and $w_0$.

To achieve this, we write a new function that takes the prior distribution and the class-conditional distributions as inputs. The function uses the parameters of these distributions to compute the weights and bias terms according to the equations derived in previous sections.

The inputs to the function are the prior distribution over the two classes and the class-conditional distributions.

The function then uses these inputs to compute the weights and bias terms as

$$w = \Sigma^{-1}(\mu_0 - \mu_1), \qquad w_0 = -\tfrac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0 + \tfrac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + \log \frac{P(Y=y_0)}{P(Y=y_1)}.$$

The function returns $w$ and $w_0$, which can be used to directly parameterize the output Bernoulli distribution of the generative logistic regression model. This allows for a more direct and transparent understanding of the model parameters and their relationship to the prior and class-conditional distributions.

def get_logistic_regression_params(prior, class_conditionals):
    # Shared covariance matrix (identical for both classes, since the scales are shared)
    cov = class_conditionals.covariance()[0]
    cov_inv = tf.linalg.inv(cov)
    # Class-conditional means
    mu0 = class_conditionals.parameters['loc'][0]
    mu1 = class_conditionals.parameters['loc'][1]
    # w = Sigma^{-1} (mu0 - mu1)
    w = np.matmul(cov_inv, (mu0 - mu1))
    # w0 = -1/2 mu0' Sigma^{-1} mu0 + 1/2 mu1' Sigma^{-1} mu1 + log(P(y0)/P(y1))
    w0 = - 0.5 * (np.matmul(tf.transpose(mu0), np.matmul(cov_inv, mu0))) \
         + 0.5 * (np.matmul(tf.transpose(mu1), np.matmul(cov_inv, mu1))) \
         + np.log(prior.parameters['probs'][0] / prior.parameters['probs'][1])
    return w, w0

w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)
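
As a quick sanity check (a sketch, assuming x_train, prior_binary and class_conditionals_binary are defined as above), the linear form $w^\top x + w_0$ should reproduce the log-odds computed directly from the prior and class-conditional densities, and the same logits can directly parameterize the output Bernoulli distribution:

# Log-odds from the linear form
a_linear = np.dot(x_train, w) + w0

# Log-odds computed directly from the generative model
log_liks = class_conditionals_binary.log_prob(np.expand_dims(x_train, 1).astype('float32'))
log_prior = np.log(prior_binary.parameters['probs'])
a_direct = (log_liks[:, 0] + log_prior[0]) - (log_liks[:, 1] + log_prior[1])

print(np.allclose(a_linear, np.array(a_direct), atol=1e-3))  # should print True, up to float precision

# The same logits parameterize the predictive Bernoulli distribution for class y0
predictive = tfd.Bernoulli(logits=a_linear.astype('float32'))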

We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.

fig, ax = plt.subplots(1, 1, figsize=(9, 5))
plot_data(x_train, y_train, alpha=0.35)
x0_min, x0_max = x_train[:, 0].min() - 0.5, x_train[:, 0].max() + 0.5
x1_min, x1_max = x_train[:, 1].min() - 0.5, x_train[:, 1].max() + 0.5
X0, X1 = get_meshgrid((x0_min, x0_max), (x1_min, x1_max))

logits = np.dot(np.array([X0.ravel(), X1.ravel()]).T, w) + w0
Z = tf.math.sigmoid(logits)
lr_contour = ax.contour(X0, X1, np.array(Z).T.reshape(*X0.shape), levels=10)
ax.clabel(lr_contour, inline=True, fontsize=10)
contour_plot((x0_min, x0_max), (x1_min, x1_max),
             lambda x: predict(prior_binary, class_conditionals_binary, x),
             1, label_colors, levels=[-0.5, 0.5, 1.5],
             num_points=200)
plt.title("Training set with prediction contours")
plt.show()

Figure 6: Density contours of the predictive distribution of our logistic regression model.

The above approach can be considered a form of Bayesian inference, as it involves incorporating prior knowledge about the model parameters through the prior distribution and updating this knowledge using the observed data through the class-conditional distributions. This is a key aspect of Bayesian inference, which aims to incorporate prior knowledge and uncertainty about the model parameters into the inference process.

In Bayesian inference, the goal is to compute the posterior distribution of the model parameters given the observed data. The above approach can be seen as a form of approximate Bayesian inference, as it uses the maximum likelihood estimates of the class-conditional densities together with the prior distribution to compute the weights and biases of the model. In addition, the shared covariance matrix constrains the class-conditional densities, which acts as a form of regularization.

It is worth noting that the above approach is not fully Bayesian, as it does not provide a closed form for the posterior of the model parameters. Instead, it uses an approximation based on the maximum likelihood estimates.

In this article, we presented a probabilistic approach to logistic regression that accounts for aleatoric uncertainty in the prediction process. By incorporating a prior distribution, the approach regularizes the model and helps prevent overfitting. We showed how to implement it using TensorFlow Probability and how to analyse its results.

It is worth noting that while our approach incorporates Bayesian principles, it is not a full Bayesian approach as we do not have a full posterior distribution of the model parameters. Nevertheless, accounting for aleatoric uncertainty in the prediction process already gives us more confidence about our predictive process.

Keep in touch: LinkedIn

[1] — Wine Dataset

[2] — Coursera: Deep Learning Specialization

[3] — Coursera: TensorFlow 2 for Deep Learning Specialization

[4] — TensorFlow Probability Guides and Tutorials

[5] — TensorFlow Probability Posts in TensorFlow Blog


