Evaluate the Performance Of Deep Learning Models in Keras
Last Updated on June 29, 2022
Keras is an easy-to-use and powerful Python library for deep learning.
There are a lot of decisions to make when designing and configuring your deep learning models. Most of these decisions must be resolved empirically through trial and error and evaluating them on real data.
As such, it is critically important to have a robust way to evaluate the performance of your neural networks and deep learning models.
In this post you will discover a few ways that you can use to evaluate model performance using Keras.
Let’s get started.
- May/2016: Original post
- Update Oct/2016: Updated examples for Keras 1.1.0 and scikit-learn v0.18.
- Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
- Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.
- Update Jun/2022: Update to TensorFlow 2.x syntax
Photo by Thomas Leuthard, some rights reserved.
Empirically Evaluate Network Configurations
There are a myriad of decisions you must make when designing and configuring your deep learning models.
Many of these decisions can be resolved by copying the structure of other people’s networks and using heuristics. Ultimately, the best technique is to actually design small experiments and empirically evaluate options using real data.
This includes high-level decisions like the number, size and type of layers in your network. It also includes the lower level decisions like the choice of loss function, activation functions, optimization procedure and number of epochs.
Deep learning is often used on problems that have very large datasets, that is, tens of thousands or hundreds of thousands of instances.
As such, you need to have a robust test harness that allows you to estimate the performance of a given configuration on unseen data, and reliably compare the performance to other configurations.
Data Splitting
The large amount of data and the complexity of the models require very long training times.
As such, it is typical to use a simple separation of data into training and test datasets or training and validation datasets.
Keras provides two convenient ways of evaluating your deep learning algorithms this way:
- Use an automatic validation dataset.
- Use a manual validation dataset.
Use an Automatic Validation Dataset
Keras can separate a portion of your training data into a validation dataset and evaluate the performance of your model on that validation dataset each epoch.
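You can do this by setting the validation_split argument on the fit() function to a fraction between 0 and 1, which is the proportion of your training dataset to hold back. Keras takes this portion from the end of the arrays you pass in, before any shuffling, so make sure your rows are not ordered by class.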
For example, a reasonable value might be 0.2 or 0.33 for 20% or 33% of your training data held back for validation.
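To make the fraction concrete, here is a minimal sketch of how a split value maps to row counts. The helper holdout_sizes() is hypothetical (it is not part of Keras), and it assumes the training count is rounded down with the remainder held out, which agrees with the 514 training samples reported in the run further down for the 768-row dataset.

```python
import math


def holdout_sizes(n_samples, validation_split):
    """Return (n_train, n_val) for a fractional hold-out taken from the end of the data."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    # Assumption: the training count is rounded down and the remainder is held out.
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    return n_train, n_samples - n_train


# The Pima Indians dataset has 768 rows.
print(holdout_sizes(768, 0.33))  # (514, 254)
print(holdout_sizes(768, 0.2))   # (614, 154)
```

Because the held-back rows come from the end of the data, a file that is sorted by class would give you a validation set containing only one class.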
The example below demonstrates the use of an automatic validation dataset on a small binary classification problem. All examples in this post use the Pima Indians onset of diabetes dataset. You can download it from the UCI Machine Learning Repository and save the data file in your current working directory with the filename pima-indians-diabetes.csv (update: download from here).
```python
# MLP with automatic validation set
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy
# fix random seed for reproducibility
numpy.random.seed(7)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
```
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example, you can see that the verbose output on each epoch shows the loss and accuracy on both the training dataset and the validation dataset.
```
…
Epoch 145/150
514/514 [==============================] - 0s - loss: 0.5252 - acc: 0.7335 - val_loss: 0.5489 - val_acc: 0.7244
Epoch 146/150
514/514 [==============================] - 0s - loss: 0.5198 - acc: 0.7296 - val_loss: 0.5918 - val_acc: 0.7244
Epoch 147/150
514/514 [==============================] - 0s - loss: 0.5175 - acc: 0.7335 - val_loss: 0.5365 - val_acc: 0.7441
Epoch 148/150
514/514 [==============================] - 0s - loss: 0.5219 - acc: 0.7354 - val_loss: 0.5414 - val_acc: 0.7520
Epoch 149/150
514/514 [==============================] - 0s - loss: 0.5089 - acc: 0.7432 - val_loss: 0.5417 - val_acc: 0.7520
Epoch 150/150
514/514 [==============================] - 0s - loss: 0.5148 - acc: 0.7490 - val_loss: 0.5549 - val_acc: 0.7520
```
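You do not have to read these numbers off the screen. The fit() function returns a History object whose history attribute is a dictionary mapping each metric name to a list with one value per epoch. The sketch below shows one way you might pick out the best epoch from such a dictionary. The helper best_epoch() is hypothetical, and the values are simply copied from the last six epochs of the log above; note that recent versions of Keras name the accuracy keys accuracy and val_accuracy rather than acc and val_acc.

```python
def best_epoch(history, metric="val_loss", mode="min"):
    """Return (position, value) of the best entry, with positions counted from 1."""
    values = history[metric]
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Validation figures copied from the last six epochs of the log above
# (epochs 145 to 150), so position 1 here corresponds to epoch 145.
tail = {
    "val_loss": [0.5489, 0.5918, 0.5365, 0.5414, 0.5417, 0.5549],
    "val_acc": [0.7244, 0.7244, 0.7441, 0.7520, 0.7520, 0.7520],
}
print(best_epoch(tail, "val_loss"))        # (3, 0.5365)
print(best_epoch(tail, "val_acc", "max"))  # (4, 0.752)
```

In a real run you would pass in the dictionary from training, for example history = model.fit(...) followed by best_epoch(history.history).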
Use a Manual Validation Dataset
Keras also allows you to manually specify the dataset to use for validation during training.
In this example we use the handy train_test_split() function from the Python scikit-learn machine learning library to separate our data into a training and test dataset. We use 67% for training and the remaining 33% of the data for validation.
The validation dataset can be specified to the fit() function in Keras by the validation_data argument. It takes a tuple of the input and output datasets.
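If you are curious what a shuffled train/validation split does under the hood, the idea can be sketched in a few lines of plain Python. The helper shuffle_split() below is a hypothetical, simplified stand-in and not scikit-learn's implementation; it assumes the hold-out count is rounded up, which gives the same 514/254 sizes as the run below for a 768-row dataset.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.33, seed=7):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the hold-out count is rounded up, the rest is used for training.
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = shuffle_split(768)
print(len(train_idx), len(val_idx))  # 514 254
```

Fixing the seed means the same rows land in the validation set on every run, which keeps comparisons between model configurations fair.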
```python
# MLP with manual validation set
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=150, batch_size=10)
```
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Like before, running the example provides verbose output of training that includes the loss and accuracy of the model on both the training and validation datasets for each epoch.
```
…
Epoch 145/150
514/514 [==============================] - 0s - loss: 0.4847 - acc: 0.7704 - val_loss: 0.5668 - val_acc: 0.7323
Epoch 146/150
514/514 [==============================] - 0s - loss: 0.4853 - acc: 0.7549 - val_loss: 0.5768 - val_acc: 0.7087
Epoch 147/150
514/514 [==============================] - 0s - loss: 0.4864 - acc: 0.7743 - val_loss: 0.5604 - val_acc: 0.7244
Epoch 148/150
514/514 [==============================] - 0s - loss: 0.4831 - acc: 0.7665 - val_loss: 0.5589 - val_acc: 0.7126
Epoch 149/150
514/514 [==============================] - 0s - loss: 0.4961 - acc: 0.7782 - val_loss: 0.5663 - val_acc: 0.7126
Epoch 150/150
514/514 [==============================] - 0s - loss: 0.4967 - acc: 0.7588 - val_loss: 0.5810 - val_acc: 0.6929
```
Manual k-Fold Cross Validation
The gold standard for machine learning model evaluation is k-fold cross validation.
It provides a robust estimate of the performance of a model on unseen data. It does this by splitting the training dataset into k subsets, then taking turns training a model on all subsets except one, which is held out, and evaluating the model on that held-out validation subset. The process is repeated until every subset has had a turn as the held-out validation set. The performance measure is then averaged across all of the models created.
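The splitting procedure itself is simple enough to sketch in plain Python. The generator kfold_indices() below is a hypothetical, unshuffled illustration of the idea and not the scikit-learn implementation: every row lands in exactly one held-out fold, and fold sizes differ by at most one.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in kfold_indices(10, 3):
    print(test)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

In practice you would shuffle the rows first, which is what the shuffle=True argument does in the scikit-learn class used later in this post.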
It is important to understand that cross validation is meant to evaluate a model design (e.g., a 3-layer vs. a 4-layer neural network) rather than a specific fitted model. We do not want to fit the competing designs on one specific dataset and compare the results, since a difference may simply be due to that particular dataset suiting one design better. Instead, we fit on multiple datasets, producing multiple fitted models of the same design, and compare the average performance measure.
Cross validation is often not used for evaluating deep learning models because of the greater computational expense. For example, k-fold cross validation is often used with 5 or 10 folds. As such, 5 or 10 models must be constructed and evaluated, greatly adding to the evaluation time of a model.
Nevertheless, when the problem is small enough or if you have sufficient compute resources, k-fold cross validation can give you a less biased estimate of the performance of your model.
In the example below we use the handy StratifiedKFold class from the scikit-learn Python machine learning library to split up the training dataset into 10 folds. The folds are stratified, meaning that the algorithm attempts to balance the number of instances of each class in each fold.
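To see what stratification means in practice, consider the simplified sketch below. The function stratified_folds() is hypothetical and much cruder than the scikit-learn class (there is no shuffling, for one): it groups the row indices by class and deals each group out to the folds in turn, so every fold ends up with roughly the same class mix as the full dataset.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal the members of each class out to the folds in turn.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


labels = [0] * 6 + [1] * 3  # toy data: two negatives for every positive
for fold in stratified_folds(labels, 3):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
# [0, 3, 6] [0, 0, 1]
# [1, 4, 7] [0, 0, 1]
# [2, 5, 8] [0, 0, 1]
```

Each fold keeps the two-to-one class ratio of the toy labels, which is the property that matters when one class is much rarer than the other.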
The example creates and evaluates 10 models using the 10 splits of the data and collects all of the scores. The verbose output for each epoch is turned off by passing verbose=0 to the fit() and evaluate() functions on the model.
The performance of each model is printed and stored. The mean and standard deviation of the scores are then printed at the end of the run to provide a robust estimate of model accuracy.
```python
# MLP for Pima Indians Dataset with 10-fold cross validation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy as np
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
```
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example will take less than a minute and will produce the following output:
```
acc: 77.92%
acc: 68.83%
acc: 72.73%
acc: 64.94%
acc: 77.92%
acc: 35.06%
acc: 74.03%
acc: 68.83%
acc: 34.21%
acc: 72.37%
64.68% (+/- 15.50%)
```
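You can check the summary line by hand. The sketch below recomputes it from the ten per-fold accuracies printed above, using only the Python standard library. It relies on the fact that np.std() divides by n by default (the population standard deviation), which corresponds to statistics.pstdev() rather than statistics.stdev().

```python
import statistics

# Per-fold accuracies copied from the run above, in percent.
cvscores = [77.92, 68.83, 72.73, 64.94, 77.92, 35.06, 74.03, 68.83, 34.21, 72.37]

mean = statistics.fmean(cvscores)
population_std = statistics.pstdev(cvscores)  # divides by n, like the np.std default
sample_std = statistics.stdev(cvscores)       # divides by n - 1

print("%.2f%% (+/- %.2f%%)" % (mean, population_std))  # 64.68% (+/- 15.50%)
print("sample std: %.2f%%" % sample_std)
```

With only 10 folds, the sample standard deviation is noticeably larger than the population one, so it is worth stating which of the two you report when you compare configurations.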
Summary
In this post you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data.
You discovered three ways that you can estimate the performance of your deep learning models in Python using the Keras library:
- Use Automatic Validation Datasets.
- Use Manual Validation Datasets.
- Use Manual k-Fold Cross Validation.
Do you have any questions about deep learning with Keras or this post? Ask your question in the comments and I will do my best to answer it.