
SGDRegressor with Scikit-Learn: Untaught Lessons You Need to Know
By Angela Shi, March 2023



In the world of machine learning, the linear model is a fundamental technique that is widely used to predict numerical values based on input data. The SGDRegressor estimator from scikit-learn is a powerful tool that allows machine learning practitioners to perform linear regression quickly and efficiently.

However, the name SGDRegressor can be a bit confusing for beginners. In this article, we will explore the different components of SGDRegressor and explain how it works. We will also examine why the name may be misleading for beginners who are just starting to learn about machine learning.

Furthermore, we will delve into the idea that there are actually multiple models hidden within SGDRegressor, each with its own specific set of parameters and hyperparameters. This raises interesting questions about the definition of models in machine learning. For instance, is ridge regression or SVR a separate model, or is it simply one tuned version of a linear model?

By the time you finish reading this article, you will have a better understanding of the inner workings of SGDRegressor and a newfound appreciation for the complexities, and ultimately the simplicity, of linear models in machine learning.


SGDRegressor is a machine learning algorithm in Scikit-Learn that implements Stochastic Gradient Descent (SGD) to solve regression problems. It is a popular choice for large-scale regression tasks due to its ability to handle high-dimensional datasets and its fast training time.

SGDRegressor works by iteratively updating the model weights using a small random subset of the training data, rather than the entire dataset, which makes it computationally efficient for large datasets. It also includes several hyperparameters that can be tuned to optimize performance, including learning rate, penalty or regularization term, and number of iterations.

1.1 Linear regressor

SGDRegressor is a linear model that uses a linear function to predict the target variable. The linear function takes the form of:

y = w[0] + w[1] * x[1] + … + w[p] * x[p]

where x[1] to x[p] are the input features, w[1] to w[p] are the coefficients of the linear model, and w[0] (also written b) is the intercept term. The objective of the SGDRegressor algorithm is to find the values of w and b that minimize a chosen loss function between the predicted values and the actual values of the target variable.
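
As a quick, self-contained illustration (the numbers are made up), this is the computation that the linear function performs; after fitting, the values of w correspond to SGDRegressor's coef_ attribute and b to its intercept_ attribute:

import numpy as np

# Hypothetical coefficients and intercept for a model with p = 3 features
w = np.array([0.5, -1.2, 3.0])  # w[1] ... w[p]
b = 2.0                         # w[0], the intercept

# Two example rows of input features
X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, -1.0]])

# y = b + w[1] * x[1] + ... + w[p] * x[p], written as a dot product
y_pred = b + X @ w
print(y_pred)  # [ 8.5  -2.55]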

1.2 Stochastic Gradient Descent vs. Gradient Descent

The SGDRegressor algorithm uses stochastic gradient descent for optimization. Stochastic gradient descent is an iterative optimization algorithm that updates the parameters of the model using one training example (or a small batch of examples) at a time. At each step, the parameters are updated using the gradient of the cost function with respect to the parameters, computed on that sample.

Stochastic Gradient Descent (SGD) and Gradient Descent (GD) are both optimization algorithms commonly used in machine learning. While both aim to minimize a loss function, they differ in their approach and have their own pros and cons.

Pros of SGD:

  1. Faster convergence: Since SGD only considers one training example at a time, it updates the model parameters much faster than GD. This can result in faster convergence, especially for large datasets.
  2. Better generalization: SGD’s random sampling of training examples can prevent the model from overfitting to the training data, leading to better generalization to new data.
  3. Suitable for non-convex problems: SGD is more robust than GD for non-convex problems, as it can avoid getting stuck in local minima.

Cons of SGD:

  1. Can be noisy: Since SGD updates the model parameters based on a single training example at a time, the updates can be noisy and result in a lot of oscillation in the loss function.
  2. Requires more tuning: SGD requires tuning of its hyperparameters such as learning rate, momentum, and batch size, which can be time-consuming and require trial and error.
  3. May converge to suboptimal solution: SGD is prone to getting stuck in saddle points or plateau regions, leading to suboptimal solutions.

Pros of GD:

  1. Guaranteed convergence: GD guarantees convergence to the global minimum for convex problems, as it considers the entire training dataset to update the model parameters.
  2. Smooth convergence: GD’s updates are more stable since it considers the entire training dataset, leading to a smoother convergence.
  3. Simpler to tune: GD requires fewer hyperparameters to be tuned, making it easier to use and requiring less trial and error.

Cons of GD:

  1. Slow convergence: GD updates the model parameters based on the entire training dataset, which can be time-consuming for large datasets, leading to slower convergence.
  2. Prone to overfitting: GD can overfit to the training data, especially if the model has a large number of parameters or if the training data is noisy.
  3. Unsuitable for non-convex problems: GD can get stuck in local minima for non-convex problems, making it unsuitable for these types of problems.

In conclusion, while SGD and GD are both widely used optimization algorithms in machine learning, their effectiveness depends on the specific problem at hand. SGD is typically faster and better for large datasets and non-convex problems, while GD is more reliable for smaller datasets and convex problems.
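
To make this difference concrete, here is a minimal sketch of one pass of GD versus SGD for the squared-error loss on a toy linear problem. It is an illustration, not scikit-learn's actual implementation, and the learning rate and data are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w, b, lr = np.zeros(3), 0.0, 0.01

# Gradient Descent: ONE update per pass, using the gradient over all samples
residuals = y - (X @ w + b)
w = w + lr * 2 * X.T @ residuals / len(y)
b = b + lr * 2 * residuals.mean()

# Stochastic Gradient Descent: one update PER SAMPLE, in shuffled order
for i in rng.permutation(len(y)):
    err = y[i] - (X[i] @ w + b)
    w = w + lr * 2 * err * X[i]  # gradient of the squared error for one sample
    b = b + lr * 2 * err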

1.3 Using SGDRegressor in scikit-learn

Using SGDRegressor in scikit-learn is straightforward. First, we need to import the SGDRegressor class from the linear_model module of scikit-learn. Then, we can create an instance of the SGDRegressor class and fit the model to our training data.

Here’s an example of how to use SGDRegressor in scikit-learn:

from sklearn.linear_model import SGDRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
# (load_boston was removed from recent scikit-learn versions)
X, y = fetch_california_housing(return_X_y=True)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an SGDRegressor inside a pipeline: SGD is sensitive to feature scales,
# so the features are standardized first
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42),
)
# Fit the model to the training data
sgd_reg.fit(X_train, y_train)
# Make predictions on the test data
y_pred = sgd_reg.predict(X_test)
# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

In this example, we load the California housing dataset (the Boston housing dataset used in many older tutorials has been removed from recent versions of scikit-learn) and split it into training and test sets using the train_test_split function. Because SGD is sensitive to feature scales, we wrap SGDRegressor in a pipeline with a StandardScaler and fit it to the training data using the fit method. Finally, we make predictions on the test data using the predict method and calculate the mean squared error of the predictions using the mean_squared_error function.

In this section, we will discuss why the name SGDRegressor can be confusing for beginners in machine learning. While experts in the field may be familiar with the name and its associated algorithm, those new to the field may not immediately understand what it does or how it works.

However, we will also explain that for experts, the name is not necessarily a source of confusion. This is because they are already familiar with the algorithm and its purpose, and can easily distinguish it from other similar algorithms.

2.1 Why it is confusing

In my article Machine Learning in Three Steps: How to Efficiently Learn It, I explained how to improve your learning of machine learning algorithms by breaking down the process into three distinct steps: model algorithm, fitting algorithm, and tuning algorithm. Despite the simplicity of this approach, it can be challenging to apply in practice. The SGDRegressor algorithm serves as a prime example of this.

In my opinion, choosing an appropriate name that accurately reflects the characteristics of a machine learning algorithm is crucial for understanding and effectively using the algorithm. However, this is not always the case, as evidenced by the confusingly named “SGDRegressor” algorithm.

In scikit-learn, there is a convention of adding either “regressor” or “classifier” to the name of the machine learning model, depending on whether it is used for regression or classification tasks. This convention is evident in many examples, such as DecisionTreeClassifier vs. DecisionTreeRegressor, KNeighborsClassifier vs. KNeighborsRegressor, and so on.

However, the problem with the name “SGDRegressor” is that if we apply this naming convention to it, it implies that “SGD” is a machine learning model when in reality, it is a fitting algorithm. This can be particularly confusing for beginners who may not have a clear understanding of the different components of machine learning algorithms.

2.2 How to explain this naming

For experts, the naming of “SGDRegressor” may seem acceptable. The use of “SGD” suggests that the model is based on mathematical functions, that is, parametric models, rather than distance-based or tree-based models. In that sense, “SGD” hints at the underlying models, even though SGD itself is a fitting algorithm.

Although in theory SGD can be used to fit both linear and non-linear models, such as neural networks, in practice this estimator only implements linear models. A more precise name for this estimator could be “LinearSGDRegressor”, but it may not be necessary, since SGDRegressor already lives in scikit-learn’s “linear_model” module, which by definition only implements linear models.

The parameters available in SGDRegressor give us the flexibility to choose different combinations of loss (squared_error, huber, epsilon_insensitive, or squared_epsilon_insensitive) and penalty (l1, l2, elasticnet or none) functions. Some of these combinations correspond to traditional statistical models.

It is interesting to note that, historically, statisticians developed and constructed different models based on different assumptions and objectives. In contrast, machine learning provides a more unified framework where the linear model remains the same, but the loss function and penalty can be changed to achieve different objectives.

3.1 Loss functions

The loss functions squared_error, huber, epsilon_insensitive, and squared_epsilon_insensitive differ in their impact on the model and its performance.

  • The squared_error loss function, also known as mean squared error (MSE), penalizes larger errors more heavily than smaller ones, making it sensitive to outliers. This loss function is commonly used in linear regression.
  • The huber loss function is less sensitive to outliers than the squared_error loss function, as it transitions from quadratic to linear error for larger residuals. This makes it a good choice for datasets with some outliers.
  • The epsilon_insensitive loss function is used for regression problems where the target values are expected to fall within a certain range of values. It ignores errors smaller than a certain threshold, known as epsilon, and penalizes larger errors linearly. This loss function is commonly used in support vector regression.
  • The squared_epsilon_insensitive loss function is similar to the epsilon_insensitive loss function, but it penalizes larger errors more heavily by squaring them. This loss function can be useful when larger errors need to be penalized more than linearly, but smaller errors can be ignored.

Overall, the choice of loss function can impact the model’s ability to handle outliers and the level of sensitivity to errors. It is important to select the appropriate loss function for the specific problem at hand to ensure the best possible performance.

Here is an image summarizing all the loss functions with their mathematical expressions. Python implementations are also provided, allowing us to visualize and compare their behaviors. If you would like to access the Python code, you can support me on Ko-fi through the following link: https://ko-fi.com/s/4cc6555852.

loss function in SGDRegressor — image by the author

It should be noted that even though the SGDRegressor does not explicitly offer MAE loss, it can still be achieved by setting the epsilon insensitive loss function’s hyperparameter, epsilon, to zero. This is because the mathematical expression of the epsilon insensitive loss becomes equivalent to MAE loss when epsilon is zero.

In the end, despite the variety of loss functions, we can observe that their construction rests on two fundamental ideas: the basic loss is either quadratic or absolute, and the epsilon-tube concept splits the residuals into two distinct zones, a central zone and an outer zone, each treated with a different type of loss.

By adopting this approach, one can potentially create their own customized loss function!
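
The complete code behind the figure is only available through the link above, but as an independent, minimal sketch, the four losses can be written for a single residual r = y_true - y_pred as plain Python functions (the exact scaling constants used internally by scikit-learn may differ slightly; epsilon plays the role of the hyperparameter of the same name in SGDRegressor):

import numpy as np

def squared_error(r):
    return r ** 2

def huber(r, epsilon=0.1):
    # quadratic for small residuals, linear beyond the epsilon threshold
    return np.where(np.abs(r) <= epsilon,
                    0.5 * r ** 2,
                    epsilon * (np.abs(r) - 0.5 * epsilon))

def epsilon_insensitive(r, epsilon=0.1):
    # zero inside the epsilon tube, absolute (linear) outside
    return np.maximum(0.0, np.abs(r) - epsilon)

def squared_epsilon_insensitive(r, epsilon=0.1):
    # zero inside the epsilon tube, quadratic outside
    return np.maximum(0.0, np.abs(r) - epsilon) ** 2

r = np.linspace(-2, 2, 9)
# With epsilon = 0, the epsilon-insensitive loss reduces to the absolute (MAE) loss
print(np.allclose(epsilon_insensitive(r, epsilon=0.0), np.abs(r)))  # True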

3.2 Penalty term

When using the SGDRegressor, we can also specify penalties to be used in addition to the chosen loss function. The available penalties are L1, L2, Elastic Net, and None.

L1 penalty adds the absolute value of the coefficients to the loss function, leading to sparse models where some coefficients are set to zero. This penalty can be useful for feature selection and reducing overfitting.

L2 penalty adds the square of the coefficients to the loss function, leading to smaller but non-zero coefficients. This penalty can also help reduce overfitting and improve model generalization.

Elastic Net penalty combines both L1 and L2 penalties, allowing for both sparsity and non-zero coefficients. It involves two hyperparameters: alpha controls the overall regularization strength, and l1_ratio controls the balance between the L1 and L2 penalties.

Finally, None means no penalty is used, and the model is fitted with just the chosen loss function.

Choosing the right penalty depends on the specific problem and the nature of the data. In general, L1 and Elastic Net penalties can help with feature selection and sparse models, while L2 penalty can help with generalization and avoiding overfitting.
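
As a minimal sketch of how these penalties are selected in practice (the alpha and l1_ratio values below are illustrative, not recommendations), the same estimator is simply switched between them via its penalty, alpha, and l1_ratio parameters:

from sklearn.linear_model import SGDRegressor

# No penalty: the model is fitted with the chosen loss only
plain = SGDRegressor(loss="squared_error", penalty=None)

# L2 penalty: alpha sets the regularization strength
l2 = SGDRegressor(loss="squared_error", penalty="l2", alpha=1e-4)

# L1 penalty: encourages sparse coefficients
l1 = SGDRegressor(loss="squared_error", penalty="l1", alpha=1e-4)

# Elastic Net: l1_ratio mixes the two (1.0 = pure L1, 0.0 = pure L2)
enet = SGDRegressor(loss="squared_error", penalty="elasticnet",
                    alpha=1e-4, l1_ratio=0.5)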

3.3 Presentation of the Hidden Models

The SGDRegressor offers flexibility in specifying different combinations of loss and penalty parameters. Some combinations of loss and penalty parameters in SGDRegressor correspond to well-known models with their specific names. The following image provides an overview of these options, including the corresponding model names, and we will delve into a few specific ones.

Loss functions and penalty terms in SGDRegressor and corresponding models names — image by the author

Squared error

The simplest combination in SGDRegressor is squared_error with no penalty, which corresponds to the classic linear regression model available in Scikit-learn’s LinearRegression estimator. However, this naming convention can be misleading. Although all these models are regressors and linear, referring to them as “linear regression” can cause confusion. To avoid this, it is best to use the term “linear model” instead and reserve the term “linear regression” for OLS regression specifically.

Lasso is a linear (OLS) regression model with L1 regularization, which encourages sparsity in the coefficient estimates.

Ridge regression, on the other hand, adds an L2 penalty to the linear (OLS) regression model to help mitigate the effects of multicollinearity.

ElasticNet combines the penalties of both L1 and L2 regularization in order to achieve a balance between the sparsity of Lasso and the stability of Ridge.

With the squared error, we can easily identify specific model names for each penalty used. However, for the other loss functions, there is not always a specific name associated with each penalty. In my opinion, researchers historically focused on the effects of regularizing or penalizing the coefficients of linear regression, which led to specific names being given to each type of penalty. For other loss functions, adding a penalty term no longer seemed novel enough to deserve a name of its own.
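
As a rough check of this correspondence (only approximate, because SGD is a stochastic iterative solver and the regularization strength is parameterized differently across estimators), the ridge-like configuration of SGDRegressor can be compared with the dedicated Ridge estimator on standardized toy data:

import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=500)
X = StandardScaler().fit_transform(X)

# Ridge-like configuration: squared error loss + L2 penalty
sgd = SGDRegressor(loss="squared_error", penalty="l2", alpha=1e-3,
                   max_iter=10000, tol=1e-6, random_state=0).fit(X, y)

# Dedicated Ridge estimator; its alpha is not on exactly the same scale,
# roughly alpha_ridge is n_samples * alpha_sgd
ridge = Ridge(alpha=1e-3 * len(y)).fit(X, y)

print(sgd.coef_)
print(ridge.coef_)  # close, but not identical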

Huber loss

Huber loss function is an alternative loss function that is less sensitive to outliers, making it a good choice for datasets with significant outliers. The Huber loss function is a combination of the squared loss function and the absolute loss function. For small errors, it behaves like the squared loss function, while for larger errors, it behaves like the absolute loss function. This makes it more robust to outliers than the squared loss function, while still providing good performance for smaller errors.

Epsilon insensitive loss

Epsilon-insensitive loss is another type of loss function that is commonly used in a linear model. The epsilon parameter is used to define the range within which the error is considered to be zero. This loss function is useful for datasets with noisy output variables, as it can help to reduce the impact of small fluctuations in the output variable.

In fact, the combination of epsilon-insensitive loss and L2 regularization is also referred to as SVR (Support Vector Regression). The epsilon-insensitive tube makes a penalty term mandatory, as there can be an infinite number of solutions without it. The term “support vector” is used by analogy: just as L1 regularization on the coefficients drives some coefficients to zero, the epsilon-insensitive (absolute-type) loss applied to the residuals means that some data points have no influence on the coefficients; the remaining points, the only ones that matter, are called support vectors.

A noteworthy observation is that Huber regression is commonly portrayed as less sensitive, but it shares this attribute with SVR, as both utilize absolute values for larger values. While the term “epsilon insensitive” emphasizes the central zone where the error is zero, the absolute error for larger values can also have a significant impact on the final model.

To learn more about SVR and Epsilon Insensitive Loss with Scikit-learn, you can read this article: Understanding SVR and Epsilon Insensitive Loss with Scikit-learn
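
To illustrate the correspondence (again only approximate, since the solvers and the regularization parameterizations differ: alpha is a penalty strength while C is an inverse one), an SVR-like SGDRegressor can be set up alongside scikit-learn's LinearSVR:

from sklearn.linear_model import SGDRegressor
from sklearn.svm import LinearSVR

# SVR-like configuration: epsilon-insensitive loss + L2 penalty
sgd_svr = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1,
                       penalty="l2", alpha=1e-4, random_state=0)

# Dedicated linear SVR with the same epsilon tube
lin_svr = LinearSVR(epsilon=0.1, C=1.0)

# Both are then fitted the usual way, e.g. sgd_svr.fit(X_train, y_train)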

Squared epsilon insensitive loss

You may never have heard of this one before: squared epsilon insensitive loss! As its name indicates, it is based on the epsilon insensitive loss but uses the squared error instead of the absolute error. The question arises: why use this particular loss function? Well, why not.

The answer is that in machine learning, the “no free lunch” theorem suggests that no single model can perform well on all datasets, so it is worth testing different loss functions. In some cases, the squared epsilon insensitive loss may turn out to be the best choice.

3.4 One Tunable Model or Different Models?

To comprehend the different names assigned to the models, we can approach them from two distinct perspectives: the statistical perspective and the machine learning perspective.

In the field of statistics, various models have been developed over the years to tackle different types of problems. As a result, there are many different types of models with different assumptions, constraints, and properties. On the other hand, the machine learning framework is relatively simpler in that it revolves around the linear model, which can be easily modified by changing the loss function and applying a penalty to avoid overfitting.

This simplicity and flexibility make it easier for machine learning practitioners to experiment and adapt to different problem settings, leading to the development of new and effective models for a wide range of applications.

In my previous article, “Machine Learning in Three Steps: How to Efficiently Learn It”, I highlighted the importance of distinguishing the three parts of algorithms — the model algorithm, the model fitting algorithm, and the model tuning algorithm. This approach can help simplify the understanding of machine learning algorithms.

As for SGDRegressor, here are the three steps:

  • Model: we can view LASSO, ridge, elastic net, SVR, and Huber regression as ONE single model, which is the linear model represented as y = wX + b.
  • Fitting: the fitting algorithm used is stochastic gradient descent (SGD).
  • Tuning: the hyperparameters that can be tuned include the loss and penalty, among others.
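
As a minimal sketch of the tuning step (the grid values below are illustrative assumptions, not recommendations), the loss and penalty can be treated like any other hyperparameters and searched over with cross-validation:

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=5000, random_state=0))

param_grid = {
    "sgdregressor__loss": ["squared_error", "huber",
                           "epsilon_insensitive", "squared_epsilon_insensitive"],
    "sgdregressor__penalty": [None, "l2", "l1", "elasticnet"],
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
# search.fit(X_train, y_train)  # using the training data from section 1.3
# print(search.best_params_)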

Although scikit-learn offers several of these models as standalone estimators, such as LinearRegression, Lasso, and Ridge in the same linear_model module, it is not necessary to debate whether they are actually the same model or not. The focus should be on understanding their inner functioning, as the names can be misleading.

Before concluding this section, a thought occurred to me: is the estimator LinearRegression actually a machine learning model, given that it is not tunable and has no hyperparameters to adjust?

In conclusion, the SGDRegressor in scikit-learn offers a flexible and powerful tool for linear regression tasks. Its various loss functions and penalty options provide the user with many choices to customize the model to their specific needs. Additionally, SGD’s efficiency on large datasets, and its robustness on non-convex problems more generally, is a significant advantage over standard gradient descent. It is important to keep in mind that the loss and penalty parameters should be considered hyperparameters that require tuning. By applying the learning framework of model, fitting, and tuning, data scientists can use the SGDRegressor to achieve optimal results in their linear regression tasks.


