
Probabilistic vs. Deterministic Regression with Tensorflow | by Luís Roque | Dec, 2022



This article belongs to the series “Probabilistic Deep Learning”. This weekly series covers probabilistic approaches to deep learning. The main goal is to extend deep learning models to quantify uncertainty, i.e. know what they do not know.

This article will explore the main differences between deterministic and probabilistic regression. In general, deterministic regression is practical when the relationship between the independent and dependent variables is well understood and relatively stable. On the other hand, probabilistic regression is more appropriate when there is uncertainty or variability in the data. As an exercise to support our claims, we are going to fit a probabilistic model to non-linear data using TensorFlow Probability.

Articles published so far:

  1. Gentle Introduction to TensorFlow Probability: Distribution Objects
  2. Gentle Introduction to TensorFlow Probability: Trainable Parameters
  3. Maximum Likelihood Estimation from scratch in TensorFlow Probability
  4. Probabilistic Linear Regression from scratch in TensorFlow
  5. Probabilistic vs. Deterministic Regression with Tensorflow
Figure 1: Our mantra for today: not all lines are straight (source)

We develop our models using TensorFlow and TensorFlow Probability. TensorFlow Probability is a Python library built on top of TensorFlow. We are going to start with the basic objects that we can find in TensorFlow Probability and understand how we can manipulate them. We will increase complexity incrementally over the following weeks and combine our probabilistic models with deep learning on modern hardware (e.g. GPU).

As usual, the code is available on my GitHub.

Definitions

Deterministic regression is a type of regression analysis where the relationship between the independent and dependent variables is known and fixed. Hence, it is a helpful tool for predicting the value of a dependent variable given a set of known independent variables. In other words, if the same inputs are provided to a deterministic regression model, it will always produce the same output.

If we think about linear regression models, the Gauss-Markov theorem immediately comes to mind since it establishes the optimality of the ordinary least squares (OLS) estimator under certain assumptions. In particular, the Gauss-Markov theorem states that the OLS estimator is the best linear unbiased estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators. However, the Gauss-Markov theorem does not address the issue of uncertainty or belief in the estimates, which is a crucial aspect of the probabilistic approaches.

On the other hand, probabilistic regression treats the way that the independent and dependent variables interact as unknown and assumes that they can vary from one data set to another. Instead of predicting a single value for the dependent variable, a probabilistic regression model predicts a probability distribution for the possible values of the dependent variable. It allows the model to account for uncertainty and variability in the data and can provide more accurate predictions in some cases.

Let’s illustrate with a simple example. A researcher studies the relationship between the time students spend studying for a test and their scores. In this scenario, the researcher could use the OLS method to estimate the slope and intercept of the regression line and use the Gauss-Markov theorem to justify the choice of this estimator. However, as we stated before, the Gauss-Markov theorem does not address the issue of uncertainty or belief in the estimates. In the probabilistic world, the emphasis is on using probability to describe the uncertainty or belief in the model or its parameters, rather than just the optimality of the estimator. Thus, we might use a different approach to estimate the slope and intercept of the regression line and, consequently, come to a different conclusion about the relationship between study time and test scores, based both on the data and on our prior belief about the slope and intercept values.
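As a minimal sketch of the deterministic side of this example (the numbers below are made up purely for illustration), the OLS point estimates for the slope and intercept can be computed directly with numpy:

import numpy as np

# Hypothetical data: hours studied vs. test score (illustration only).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
score = np.array([52.0, 58.0, 61.0, 70.0, 75.0, 78.0])

# OLS point estimates via least squares on the design matrix [1, hours].
X = np.column_stack([np.ones_like(hours), hours])
intercept, slope = np.linalg.lstsq(X, score, rcond=None)[0]
print(f"OLS intercept: {intercept:.2f}, slope: {slope:.2f}")

These are single point estimates: running the code again on the same data always yields the same numbers, and nothing in the output expresses how confident we should be in them.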

Bayesian Statistics and the Bias–variance Trade-off

Probabilistic regression can be seen as a form of Bayesian statistics. It involves treating the unknown relationship between the independent and dependent variables as a random variable and estimating its probability distribution based on the available data. In this way, we can think of it as a way to incorporate uncertainty and variability into regression analysis. Recall that Bayesian statistics is a framework for statistical analysis in which all unknown quantities are treated as random variables, and their probability distributions are updated as new data is observed. This is in contrast to classical statistics, which typically treats unknown quantities as fixed but unknown parameters.

Another way to consider the difference between the two approaches is to look at the trade-off between bias and variance in statistical estimates. Bias refers to the difference between an estimator’s expected value and the true value of the parameter, while variance refers to the spread or variability of the estimated values. By providing a distribution over the model parameters rather than a single point estimate, probabilistic regression can help reduce bias in the estimates, improving the model’s overall accuracy. Additionally, probabilistic regression provides a measure of uncertainty or confidence in the estimated values, which helps when making decisions or predictions based on the model. It is particularly beneficial when working with noisy or incomplete data, where the uncertainty in the estimates is higher.

Let’s jump to an example to make these concepts easier to understand. We will not cover the fully Bayesian approach, which would entail estimating the epistemic uncertainty (the uncertainty of the model itself); we will study this type of uncertainty in a future article. Nevertheless, we will estimate a different kind of uncertainty: the aleatoric uncertainty, which can be defined as the uncertainty inherent in the generative process of the data.

This time, we will cover a more complex regression analysis: non-linear regression. In contrast to linear regression, which models the relationship between the variables using a straight line, non-linear regression allows more complex relationships between the variables to be modeled. This makes non-linear regression a valuable tool for many machine learning applications, where the relationships between the variables may be too complex to be accurately captured by a linear equation.

We start by creating some data that follows a non-linear pattern:

Notice that the noise εᵢ ~ N(0, 1) is independent and identically distributed, and that the data is generated as yᵢ = xᵢ³ + (3/15)(1 + xᵢ)εᵢ, so the standard deviation of the noise grows with x.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 1000)[:, np.newaxis]
y = np.power(x, 3) + (3 / 15) * (1 + x) * np.random.randn(1000)[:, np.newaxis]

plt.scatter(x, y, alpha=0.1)
plt.show()

Figure 2: Data artificially generated following a non-linear equation with Gaussian noise.

As usual, we define the negative log-likelihood as our loss function.

def negative_log_like(y_true, y_pred):
    # y_pred is a distribution object, so the loss is simply the negative
    # log-probability of the observed targets under that distribution.
    return -y_pred.log_prob(y_true)
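As a quick sanity check (a sketch, not part of the original code), we can evaluate this loss on a standard normal distribution; the result is simply the negative log-density of the observation:

import tensorflow_probability as tfp

tfd = tfp.distributions

# -log N(0 | 0, 1) = 0.5 * log(2 * pi) ≈ 0.9189
dist = tfd.Normal(loc=0., scale=1.)
print(negative_log_like(0., dist).numpy())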

As we saw in the previous articles, the way to extend our deterministic deep learning approach to be probabilistic is by using probabilistic layers, e.g. DistributionLambda. Recall that the DistributionLambda layer returns a distribution object. It is also the base class for several other probabilistic layers implemented in TensorFlow Probability, which we will use in future articles.

To build our model, we start by adding two dense layers. The first has 8 units and a sigmoid activation function. The second has 2 units and no activation function. We do not add one because we want to be able to parameterize the Gaussian distribution that follows with any real value. The Gaussian distribution is defined by the DistributionLambda layer. Remember that the scale of the distribution is the standard deviation, which must be a positive value. As before, we pass the corresponding tensor component through the softplus function to respect this constraint.
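As a small side check (not part of the model code itself), softplus maps any real number to a strictly positive one, which is exactly the constraint the standard deviation needs:

import tensorflow as tf

# softplus(z) = log(1 + exp(z)) > 0 for any real z, so it is a convenient way
# to turn an unconstrained network output into a valid standard deviation.
print(tf.math.softplus(-3.0).numpy())  # ≈ 0.0486
print(tf.math.softplus(0.0).numpy())   # ≈ 0.6931 (log 2)
print(tf.math.softplus(3.0).numpy())   # ≈ 3.0486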

Note that the real difference between a linear and a non-linear model is the added Dense layer as the first layer of the model.

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
tfd, tfpl = tfp.distributions, tfp.layers

model = Sequential([
    Dense(input_shape=(1,), units=8, activation='sigmoid'),
    Dense(2),
    tfpl.DistributionLambda(
        lambda p: tfd.Normal(loc=p[..., :1], scale=tf.math.softplus(p[..., 1:])))
])

model.compile(loss=negative_log_like, optimizer=RMSprop(learning_rate=0.01))
model.summary()

We can check our model’s output shape to understand better what is happening. We get an empty event shape and a batch shape of (1000, 1). The 1000 refers to the batch size (all 1000 data points), while the extra dimension does not make sense for our problem statement: we want each output to represent a single normally distributed random variable.

y_model = model(x)
y_sample = y_model.sample()
y_model

<tfp.distributions._TensorCoercible 'tensor_coercible' batch_shape=[1000, 1] event_shape=[] dtype=float32>
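To make the shape issue concrete, a quick check (not in the original code) shows that the log-probabilities consumed by the loss come out element-wise rather than one per data point:

# batch_shape=[1000, 1], event_shape=[]: log_prob is evaluated element-wise,
# so we get a (1000, 1) tensor instead of one log-density per example.
print(y_model.log_prob(y).shape)  # (1000, 1)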

We can use a wrapper that TensorFlow Probability provides to simplify the implementation of our last layer and make it more in line with what we expect to get as an output shape. By using the IndependentNormal layer, we can build a distribution similar to the one we built with DistributionLambda. At the same time, we can use a static method that outputs the number of parameters required by the probabilistic layer to define the number of units in the previous Dense layer: tfpl.IndependentNormal.params_size.

model = Sequential([
    Dense(input_shape=(1,), units=8, activation='sigmoid'),
    Dense(tfpl.IndependentNormal.params_size(event_shape=1)),
    tfpl.IndependentNormal(event_shape=1)
])

model.compile(loss=negative_log_like, optimizer=RMSprop(learning_rate=0.01))
model.summary()

As we can see, the shape is now correctly specified, as the extra dimension in the batch shape was moved to the event shape.

y_model = model(x)
y_sample = y_model.sample()
y_model

<tfp.distributions._TensorCoercible 'tensor_coercible' batch_shape=[1000] event_shape=[1] dtype=float32>
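Repeating the same check (again, just a sanity check) now yields a single log-density per example, which is what we want the negative log-likelihood to aggregate over:

# batch_shape=[1000], event_shape=[1]: the trailing dimension is now part of
# the event, so log_prob returns one value per data point.
print(y_model.log_prob(y).shape)  # (1000,)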

Time to fit the model to our synthetically generated data.

model.fit(x=x, y=y, epochs=500, verbose=False)
model.evaluate(x, y)
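As an optional check (not in the original code), we can inspect the learned standard deviation as a function of x. Since the simulated noise scale is (3/15)(1 + x), it should grow roughly linearly from about 0 at x = -1 to about 0.4 at x = 1:

# The model predicts a full distribution for every input, so we can read off
# the learned scale directly after training.
learned_sd = model(x).stddev().numpy()
print(learned_sd[0], learned_sd[-1])  # expected: close to 0 and close to 0.4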

As expected, we were able to capture the aleatoric uncertainty of the generative process of the data. This can be seen in the confidence intervals shown below. An even more interesting feature of the probabilistic model is that the samples we generate from it follow the original generative process of the data, as we can see below.

# Re-evaluate the model after training so that the plotted distribution
# uses the fitted weights.
y_model = model(x)
y_sample = y_model.sample()

y_hat = y_model.mean()
y_sd = y_model.stddev()
y_hat_u = y_hat + 2 * y_sd  # upper bound of the ±2σ interval
y_hat_d = y_hat - 2 * y_sd  # lower bound of the ±2σ interval

fig, (ax_0, ax_1) = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
ax_0.scatter(x, y, alpha=0.4, label='data')
ax_0.scatter(x, y_sample, alpha=0.4, color='red', label='model sample')
ax_0.legend()
ax_1.scatter(x, y, alpha=0.4, label='data')
ax_1.plot(x, y_hat, color='red', alpha=0.8, label=r'model $\mu$')
ax_1.plot(x, y_hat_u, color='green', alpha=0.8, label=r'model $\mu \pm 2 \sigma$')
ax_1.plot(x, y_hat_d, color='green', alpha=0.8)
ax_1.legend()
plt.show()

Figure 3: Generated samples from the probabilistic non-linear regression model (on the left) and its fitting to the data (on the right).

This article explored the main differences between deterministic and probabilistic regression. We saw that deterministic regression is practical when the relationship between the independent and dependent variables is well-understood and relatively stable. On the other hand, probabilistic regression is more appropriate when there is uncertainty or variability in the data. As an exercise, we then fitted a probabilistic model to non-linear data. By adding an extra dense layer with an activation function at the beginning of our model, we can learn non-linear patterns in the data. Our final layer is a probabilistic layer, which outputs a distribution object. To be more coherent with our problem statement, we extended our approach to using the IndependentNormal layer we explored a few articles ago. It allowed us to move batch dimensions to the event shape. Next, we fitted the data successfully while providing a measure for the aleatoric uncertainty. Finally, we generated new samples that closely followed the original generative process of the data.

Next week, we will explore the differences between a frequentist and a Bayesian approach. See you then!

Keep in touch: LinkedIn


