The 1958 Perceptron as a Breast Cancer classifier | by Mario Emmanuel | Jan, 2023

By Jessie Hobb On Jan 3, 2023

A practical example implemented in Mathematica

The US Navy using the Mark I Perceptron to read letters. National Museum of the US Navy, 1960. Image from Wikimedia Commons (Public Domain).

The Rosenblatt Perceptron, developed by Frank Rosenblatt in 1958[1][2], is widely considered a starting point for neural networks because it was one of the first algorithms to demonstrate that a machine could learn from data.

The Perceptron was a simple model that consisted of a single layer of artificial neurons, or units, that could be trained to recognize patterns in input data. It was an early milestone in the field of artificial intelligence and paved the way for the development of more complex neural networks that have been used in a variety of applications.

The first hardware implementation was the Mark I Perceptron, built at the Cornell Aeronautical Laboratory in the late 1950s. It was one of the first machines capable of learning from examples, and it was used to perform pattern recognition tasks such as recognizing letters. The Mark I Perceptron was built as custom electromechanical hardware, with its adjustable weights stored in motor-driven potentiometers, and it consisted of a series of units that were connected together in a layered structure. The weights were adjusted during training so that the machine could recognize patterns in the data.

Figure 1. A detail of the Mark 1. Source: Wikimedia Commons.

This machine was used by the US Navy in the 1960s to read handwritten letters. At the time, the Navy was receiving a large volume of correspondence and needed a way to quickly and accurately process the letters. They turned to the Perceptron, which had the ability to learn from examples and recognize patterns in input data. By training the Perceptron on a large dataset of handwritten letters, the Navy was able to develop a system that could accurately read and classify the letters with minimal human intervention. This was a significant achievement at the time, as it demonstrated the potential of artificial intelligence and machine learning to automate tasks that had previously been done by humans. The success of the Perceptron in this application helped to establish the technology as a powerful tool for pattern recognition and opened the door to its use in a wide range of applications.

The Rosenblatt Perceptron is a simple model of an artificial neuron. In this model, the single neuron has multiple inputs, which are the values of the features in the input data, and a single output, which is a binary value that indicates which of two classes the input data belongs to. The Perceptron works by assigning a weight to each input, which represents the importance of that input in determining the output. The output is then calculated using a mathematical function that combines the weighted inputs with a bias term. The bias term is an extra learnable constant that shifts the decision threshold of the Perceptron. The Perceptron can be trained to recognize patterns in the input data by adjusting the weights and bias based on the errors in its predictions. This process of adjusting the weights and bias to minimize the errors is known as training the Perceptron.

Figure 2. The Perceptron corresponds to a neural network with a single neuron. No bias is shown in the image; it has to be added separately, folded into the weights through a constant input, or omitted altogether. Image by Chrislb from Wikimedia Commons.

The model consists of a single layer of units, each of which has d features as inputs and a binary output that can take on the values -1 and +1. The Perceptron uses these inputs and output to learn how to classify data into two classes.

Figure 3. Training observation Equations.

In order to train the Perceptron, we need a training set that consists of multiple observations, each of which consists of the d features (X vector) and the actual output (y). The Perceptron uses these observations to learn how to predict the output based on the features. The output of the Perceptron is calculated using the sign of the dot product between the weight vector and the feature vector.

The prediction function of the Rosenblatt Perceptron is used to calculate the output of the Perceptron based on the input data and the internal weights and bias of the unit. The output of the Perceptron is a binary value that indicates whether the input data belongs to one of two classes. The prediction function is defined as follows:

Figure 4. Prediction function without bias.

Figure 5. Prediction function with bias.

Where W is the vector of weights that corresponds to the input features, X is the vector of input values for the features, and b is the bias term; the model can be implemented with or without bias. The sign function returns -1 if its argument is negative and +1 if it is positive; when the argument is exactly zero a convention is needed, and the implementation in this article maps zero to -1. The Perceptron adjusts the weights and bias during training in order to minimize the errors in its predictions and improve its accuracy.
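The two prediction functions can be written out as follows. This is a reconstruction from the definitions in the text, not a copy of the original figures; the symbols \hat{y} (the predicted output) and w_j, x_j (the components of W and X) are notation assumed here.

```latex
\begin{align}
\hat{y} &= \operatorname{sign}(W \cdot X)
         = \operatorname{sign}\Big(\sum_{j=1}^{d} w_j x_j\Big)
         && \text{(without bias)} \\
\hat{y} &= \operatorname{sign}(W \cdot X + b)
         = \operatorname{sign}\Big(\sum_{j=1}^{d} w_j x_j + b\Big)
         && \text{(with bias)}
\end{align}
```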

The bias term in the Rosenblatt Perceptron is an additional weight that does not multiply any input feature; equivalently, it is the weight of a constant input. It shifts the decision boundary away from the origin and can be adjusted during training to improve the accuracy of the model. The bias can be implemented in one of two ways: as a constant +1 in one of the features, or as an external parameter that is adjusted separately from the other weights.

If the bias is implemented as a constant +1 feature (feature trick), it is treated like any other input feature and is given its own weight. This means that the bias term is included in the calculation of the output.

Alternatively, the bias can be implemented as an external parameter that is adjusted separately from the other weights. The bias is then added to the dot product of the weight vector and the feature vector before the sign function is applied.
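As a minimal sketch of why the two options are equivalent, the following compares both on a single observation. The weights, bias and observation (w, b, xObs) are hypothetical values chosen for illustration; nonZeroSign follows the convention used in the implementation later in this article.

```mathematica
(* Sign function that maps zero to -1, so the output is always -1 or +1 *)
nonZeroSign[x_] := If[x > 0.0, 1.0, -1.0];

(* Hypothetical weights, bias and observation, for illustration only *)
w = {0.5, -1.0};
b = 0.25;
xObs = {2.0, 1.5};

(* Option 1: bias as an external parameter added to the dot product *)
predExternal = nonZeroSign[w . xObs + b];

(* Option 2: feature trick, append a constant 1 to the features and b to the weights *)
wAug = Append[w, b];
xAug = Append[xObs, 1.0];
predTrick = nonZeroSign[wAug . xAug];

(* Both routes compute the sign of the same number, so the predictions agree *)
predExternal == predTrick
```

With the feature trick the bias is learned by the same update as every other weight, which is why many implementations prefer it.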

The bias term in the Rosenblatt Perceptron is useful when the feature variables are mean centered, but the mean of the binary class prediction is not 0, because it allows the model to adjust the decision boundary of the model in order to better fit the data. This can be especially important when the binary class distribution is highly imbalanced, as the model may tend to predict the majority class more often in order to minimize errors. In this case, the bias can be used to adjust the position of the decision boundary and improve the ability of the model to correctly classify the minority class.

The Rosenblatt Perceptron model did not include a formal definition of a loss function, even though its goal was to minimize the error between the predicted and actual values. In order to achieve this, the Perceptron adjusts the weights and bias of the model based on the errors in its predictions. One way to define the error in the Perceptron model is to use the least-squares method, which involves minimizing the sum of the squared differences between the predicted and actual values. This loss function can be expressed mathematically as follows:

Figure 6. A loss function for the Perceptron.
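Assuming n training observations (X_i, y_i) and the prediction function without bias, the least-squares loss described in the text would read as follows. The index i and the symbol L are notation introduced here, and this is a reconstruction rather than a copy of the original figure.

```latex
\[
L(W) = \sum_{i=1}^{n} \big( y_i - \operatorname{sign}(W \cdot X_i) \big)^{2}
\]
```

Because both the actual and the predicted values are either -1 or +1, each term is 0 for a correct prediction and 4 for a wrong one, so this loss is four times the number of misclassified observations.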

While gradient descent is the usual way to minimize loss functions in Machine Learning, it cannot be applied directly to the Rosenblatt Perceptron: the sign function is not differentiable at zero and its derivative is zero everywhere else, so the gradient of this loss carries no useful information. Instead, the Perceptron uses the perceptron learning rule, which adjusts the weights and bias of the model after each wrong prediction. The perceptron convergence theorem guarantees that this rule finds a solution in a finite number of updates when the two classes are linearly separable.

Figure 7. Equivalent to gradient descent of the Loss function.

Figure 8. Iterative equations to optimise Weight vector.
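A sketch of the iterative update, written to match the training loop used in the implementation later in this article: one observation (X_i, y_i) is processed at a time with learning parameter \alpha. The index i and the hat notation are assumptions made here.

```latex
\[
\hat{y}_i = \operatorname{sign}(W \cdot X_i),
\qquad
W \leftarrow W + \alpha \, (y_i - \hat{y}_i) \, X_i
\]
```

When the prediction is correct the factor (y_i - \hat{y}_i) is 0 and W is unchanged; when it is wrong the factor is +2 or -2, which moves W towards X_i for a missed positive and away from X_i for a false positive.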

The learning parameter in the Rosenblatt Perceptron model is a hyperparameter that determines the step size of the learning algorithm. It controls the rate at which the weights and bias of the model are updated based on the errors in the predictions. A larger learning parameter results in larger updates to the weights and bias, which can lead to faster learning but may also make the weights overshoot and oscillate from one observation to the next. A smaller learning parameter results in smaller updates, which makes learning slower but more stable.
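To make the effect of the step size concrete, here is a small sketch of a single update on one misclassified observation with two different learning parameters. The helper updateStep and the values w0, xObs and yObs are hypothetical and only serve the illustration.

```mathematica
nonZeroSign[x_] := If[x > 0.0, 1.0, -1.0];

(* One perceptron update: w + alpha (y - yhat) x *)
updateStep[w_, x_, y_, alpha_] := w + alpha (y - nonZeroSign[w . x]) x;

(* Hypothetical observation with true label -1 *)
w0 = {1.0, 1.0};
xObs = {2.0, 3.0};
yObs = -1.0;

(* w0 . xObs = 5 > 0, so the prediction is +1 and the observation is misclassified *)
wLarge = updateStep[w0, xObs, yObs, 0.9]  (* {-2.6, -4.4} *)
wSmall = updateStep[w0, xObs, yObs, 0.1]  (* {0.6, 0.4} *)

(* Predictions for the same observation after each update *)
{nonZeroSign[wLarge . xObs], nonZeroSign[wSmall . xObs]}  (* {-1., 1.} *)
```

In this example the large step corrects the prediction in one update, while the small step moves the weights in the right direction but still leaves the observation misclassified.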

The Perceptron model is often described as having a stochastic gradient descent (SGD) learning algorithm, even though Rosenblatt did not derive it as gradient descent on an explicit loss function. This is because the perceptron learning rule is similar in spirit to gradient descent: it iteratively adjusts the weights and bias of the model in order to minimize the errors in the predictions. Like gradient descent, the Perceptron uses a learning parameter to control the step size of the updates, and it can be seen as a form of online learning, as it processes the training data one sample at a time.
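The online, one-sample-at-a-time character of the rule can be sketched on a tiny, linearly separable toy problem. The data (toyX, toyY), the learning parameter 0.5 and the limit of 100 passes are assumptions for illustration; the third column is a constant 1 that plays the role of the bias.

```mathematica
nonZeroSign[x_] := If[x > 0.0, 1.0, -1.0];

(* Hypothetical, linearly separable toy data: {x1, x2, constant 1 for the bias} *)
toyX = {{2.0, 1.0, 1.0}, {3.0, 2.0, 1.0}, {-1.0, -2.0, 1.0}, {-2.0, -1.0, 1.0}};
toyY = {1.0, 1.0, -1.0, -1.0};

w = {0.0, 0.0, 0.0};
alpha = 0.5;

(* Repeat passes over the data until a full pass makes no mistakes (at most 100 passes) *)
Do[
  mistakes = 0;
  Do[
    yhat = nonZeroSign[w . toyX[[i]]];
    If[yhat != toyY[[i]],
      w = w + alpha (toyY[[i]] - yhat) toyX[[i]];
      mistakes++
    ],
    {i, Length[toyY]}
  ];
  If[mistakes == 0, Break[]],
  {epoch, 100}
];

(* After the loop, every toy observation is classified correctly *)
(nonZeroSign[w . #] & /@ toyX) == toyY
```

On data that is not linearly separable the inner loop would keep making mistakes, which is why the sketch caps the number of passes.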

Overall, the learning parameter in the Perceptron model plays a crucial role in controlling the speed and accuracy of the learning process. By adjusting the learning parameter, it is possible to fine-tune the performance of the Perceptron model and achieve better results on a variety of classification tasks.

The Wisconsin Breast Cancer dataset[3] is a commonly used dataset for demonstrating binary classifiers such as the Rosenblatt Perceptron. This dataset consists of 699 samples from breast tumour biopsies (fine needle aspirates) collected between 1989 and 1991, which have been classified as benign or malignant. The dataset includes a total of 9 features that describe the cells in each sample, each graded on a scale from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses.

The dataset is a good example to see the Perceptron in action because it is a relatively simple dataset in which the benign and malignant classes are largely separable. This means that the Perceptron should be able to learn to classify most of the samples correctly. In addition, the features in the dataset are well-defined and easy to understand, which makes it easy to interpret the results of the Perceptron model.

The Wisconsin Breast Cancer dataset includes the following features:

   #  Attribute                     Domain
-- -----------------------------------------
1. Sample code number            id number
2. Clump Thickness               1 - 10
3. Uniformity of Cell Size       1 - 10
4. Uniformity of Cell Shape      1 - 10
5. Marginal Adhesion             1 - 10
6. Single Epithelial Cell Size   1 - 10
7. Bare Nuclei                   1 - 10
8. Bland Chromatin               1 - 10
9. Normal Nucleoli               1 - 10
10. Mitoses                       1 - 10
11. Class :                       (2 for benign, 4 for malignant)

In this example the Perceptron is implemented in the Wolfram Language using Mathematica (it can easily be adapted to any other language).

The steps to implement will be:

1. Define the features.
2. Load the data (including cleaning).
3. Divide the dataset into training and test sets.
4. Assign an initial weight vector.
5. Train the model (optimise the weight vector).
6. Use the trained model to classify the test data set.
7. Calculate the accuracy.
8. Calculate the recall.
9. Calculate the confusion matrix.

Step 1. Define the features

features = {
"Sample code number", 
"Clump Thickness", 
"Uniformity of Cell Size", 
"Uniformity of Cell Shape",
"Marginal Adhesion",  
"Single Epithelial Cell Size" ,
"Bare Nuclei",
"Bland Chromatin",
"Normal Nucleoli",
"Mitoses",
"Class"
};
features = ToUpperCase[features];
features = StringReplace[features, " " -> "_"]
---
{"SAMPLE_CODE_NUMBER", "CLUMP_THICKNESS", "UNIFORMITY_OF_CELL_SIZE", \
"UNIFORMITY_OF_CELL_SHAPE", "MARGINAL_ADHESION", \
"SINGLE_EPITHELIAL_CELL_SIZE", "BARE_NUCLEI", "BLAND_CHROMATIN", \
"NORMAL_NUCLEOLI", "MITOSES", "CLASS"}

Step 2. Load the data and clean it

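(* SET THE WORKING DIRECTORY THAT CONTAINS THE DATASET *)
SetDirectory[
"/data/uci_breast_cancer/"];
(* THE FILE IS COMMA SEPARATED, SO THE FORMAT IS GIVEN EXPLICITLY *)
data = Import[
"/data/uci_breast_cancer/breast-cancer-wisconsin.data", "CSV"];
(* REMOVE THE ROWS THAT CONTAIN NON NUMERIC ENTRIES (MISSING VALUES) *)
data = DeleteCases[data, {___, x_ /; ! NumberQ[x], ___}];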
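The cleaning step can be checked on a few rows before running it on the whole file. The sketch below uses three illustrative rows in the same 11-column layout, one of which marks a missing value with "?" (the marker used in the original UCI file); the variable names rows and cleaned are hypothetical.

```mathematica
(* Illustrative rows in the 11-column layout; the second one has a missing value *)
rows = {
  {1000025, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2},
  {1057013, 8, 4, 5, 1, 2, "?", 7, 3, 1, 4},
  {1017122, 8, 10, 10, 8, 7, 10, 9, 7, 1, 4}
};

(* Same pattern as in Step 2: drop any row that contains a non-numeric entry *)
cleaned = DeleteCases[rows, {___, x_ /; ! NumberQ[x], ___}];

(* Number of rows removed by the cleaning *)
Length[rows] - Length[cleaned]  (* 1 *)
```

Applied to the full dataset, the same difference gives the number of incomplete samples that were discarded, which explains why the training and test sets below add up to fewer than 699 rows.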

Step 3. Divide dataset into training and test

(* OBTAIN DATA LENGTH *)
n = Length[data];
(* SET THE RATIO BETWEEN TEST AND TRAINING, TEST IS 20 PERCENT *)
testFraction = 0.2;
(* SET THE ALPHA VALUE (LEARNING PARAMETER USED IN STEP 5) *)
alpha = 0.9;
(* SHUFFLE THE DATA *)
randomizedData = RandomSample[data];
(* EXTRACT THE TRAINING AND TEST SETS *)
testData = Take[randomizedData, Round[testFraction*n]];
trainingData = Drop[randomizedData, Round[testFraction*n]];
(* GET THE LENGTHS OF EACH DATASET *)
lengthTestData = Length[testData]
lengthTrainingData = Length[trainingData]
---
137
546
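Because RandomSample shuffles differently on every run, the weights and metrics reported in this article change from run to run. A possible variation, sketched here on stand-in data, fixes the random seed and splits the shuffled list in one call with TakeDrop. The seed value 1958 and the names toyData, toyTest and toyTraining are arbitrary choices for this illustration.

```mathematica
(* Stand-in for the cleaned data: 10 hypothetical rows *)
toyData = Table[{i, 2 i}, {i, 10}];

(* Fix the seed so that the shuffle is the same on every run (the value is arbitrary) *)
SeedRandom[1958];
shuffled = RandomSample[toyData];

(* TakeDrop returns {first k elements, remaining elements} in one call *)
k = Round[0.2*Length[toyData]];
{toyTest, toyTraining} = TakeDrop[shuffled, k];

{Length[toyTest], Length[toyTraining]}  (* {2, 8} *)
```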

Step 4. Assign an initial Weight vector

W = ConstantArray[1, Length[features[[2 ;; 10]]]]
---
{1, 1, 1, 1, 1, 1, 1, 1, 1}

Step 5. Train the model (optimise the weight vector)

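(* SIGN FUNCTION THAT MAPS ZERO TO -1 SO THE OUTPUT IS ALWAYS -1 OR +1 *)
nonZeroSign[x_] := If[x > 0.0, 1.0, -1.0];
(* ONE PASS OVER THE TRAINING DATA, UPDATING W AFTER EACH OBSERVATION *)
Do[
(* THE FEATURES ARE COLUMNS 2 TO 10 *)
X = trainingData[[i, 2 ;; 10]];
(* MAP CLASS 2 (BENIGN) TO -1 AND CLASS 4 (MALIGNANT) TO +1 *)
Y = trainingData[[i, 11]] - 3;
(* PREDICTION WITH THE CURRENT WEIGHTS, NO BIAS TERM IS USED *)
EY = nonZeroSign[Dot[X, W]];
(* W ONLY CHANGES WHEN THE PREDICTION IS WRONG, Y - EY IS THEN +2 OR -2 *)
W = (W  + alpha (Y - EY) X);
, {i, 1, lengthTrainingData}
];
W
---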
{-2.6, 42.4, 6.4, 20.8, -78.2, 19., -26., 53.2, 6.4}

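Step 6. Use the trained model to classify the test data set

(* PREDICT THE CLASS OF EACH TEST OBSERVATION WITH THE TRAINED WEIGHTS *)
results = ConstantArray[0, lengthTestData];
Do[
X = testData[[i, 2 ;; 10]];
results[[i]] = nonZeroSign[Dot[X, W]];
, {i, 1, lengthTestData}
];
(* Y HOLDS THE ACTUAL CLASSES (-1 BENIGN, +1 MALIGNANT), EY THE PREDICTED ONES *)
Y = testData[[All, 11]] - 3.0
EY = results
---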
{-1., -1., 1., -1., -1., -1., 1., 1., -1., -1., -1., -1., -1., -1., \
-1., -1., -1., -1., -1., -1., 1., 1., -1., -1., -1., -1., 1., -1., \
1., -1., -1., -1., -1., -1., -1., -1., 1., -1., -1., -1., -1., -1., \
-1., -1., -1., -1., 1., -1., -1., 1., 1., -1., 1., 1., 1., -1., 1., \
-1., 1., -1., -1., -1., -1., 1., -1., 1., -1., -1., -1., 1., -1., \
-1., -1., 1., 1., 1., 1., 1., -1., 1., -1., 1., -1., -1., 1., 1., \
-1., -1., 1., 1., -1., -1., 1., 1., 1., -1., -1., -1., 1., -1., 1., \
-1., -1., 1., 1., 1., -1., -1., -1., -1., -1., -1., -1., -1., -1., \
-1., -1., 1., -1., 1., 1., -1., 1., -1., 1., -1., -1., 1., -1., -1., \
-1., -1., -1., -1., -1., -1., -1.}
{-1., -1., -1., -1., -1., -1., 1., 1., -1., -1., -1., -1., -1., -1., \
-1., -1., -1., -1., -1., 1., 1., 1., -1., -1., -1., -1., 1., -1., 1., \
-1., -1., -1., -1., -1., 1., 1., 1., -1., -1., -1., -1., 1., -1., 1., \
-1., 1., -1., 1., -1., -1., 1., -1., 1., 1., 1., -1., 1., -1., 1., \
1., -1., -1., -1., 1., 1., 1., 1., -1., -1., -1., -1., -1., -1., 1., \
1., 1., 1., -1., -1., 1., -1., 1., 1., -1., 1., 1., -1., 1., 1., 1., \
-1., -1., 1., 1., 1., -1., -1., 1., 1., -1., 1., -1., -1., 1., 1., \
-1., -1., -1., 1., -1., -1., 1., 1., 1., -1., -1., -1., 1., -1., 1., \
1., -1., 1., -1., 1., -1., -1., 1., -1., -1., -1., -1., -1., -1., 1., \
1., -1.}
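The prediction loop can also be written without an explicit Do, since the dot product of a matrix of feature rows with the weight vector yields one score per row. The sketch below shows the idea on hypothetical values (wDemo, xDemo); it is an alternative formulation, not the code that produced the output above.

```mathematica
nonZeroSign[x_] := If[x > 0.0, 1.0, -1.0];

(* Hypothetical weights and feature rows, for illustration only *)
wDemo = {1.0, -2.0};
xDemo = {{3.0, 1.0}, {1.0, 1.0}, {2.0, 1.0}};

(* The matrix-vector product gives one score per row; Map applies the sign to each score *)
predictions = nonZeroSign /@ (xDemo . wDemo)  (* {1., -1., -1.} *)
```

The third row has a score of exactly zero, which nonZeroSign maps to -1.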

Step 7. Calculate accuracy

testDataHit = MapThread[Equal, {EY, Y}]
testDataHitCount = Count[testDataHit, True]
EYAccuracy = testDataHitCount/Length[Y]*1.0
---
{True, True, False, True, True, True, True, True, True, True, True, \
True, True, True, True, True, True, True, True, False, True, True, \
True, True, True, True, True, True, True, True, True, True, True, \
True, False, False, True, True, True, True, True, False, True, False, \
True, False, False, False, True, False, True, True, True, True, True, \
True, True, True, True, False, True, True, True, True, False, True, \
False, True, True, False, True, True, True, True, True, True, True, \
False, True, True, True, True, False, True, True, True, True, False, \
True, True, True, True, True, True, True, True, True, False, True, \
True, True, True, True, True, True, False, True, True, False, True, \
True, False, False, False, True, True, True, True, True, True, True, \
True, True, True, True, True, True, True, True, True, True, True, \
True, True, False, False, True}
112
0.817518
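The same accuracy can be obtained more compactly by converting the hits to 1 and 0 and taking the mean. A minimal sketch with hypothetical labels (yTrue, yPred):

```mathematica
(* Hypothetical true labels and predictions *)
yTrue = {1.0, -1.0, -1.0, 1.0, -1.0};
yPred = {1.0, -1.0, 1.0, 1.0, -1.0};

(* Boole turns True/False into 1/0, so the mean is the fraction of correct predictions *)
accuracyDemo = N[Mean[Boole[MapThread[Equal, {yPred, yTrue}]]]]  (* 0.8 *)
```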

Step 8. Calculate recall

recall =
Count[
Thread[Thread[Y == 1.0] && Thread[EY == 1.0]]
, True]/(
Count[
Thread[Thread[Y == 1.0] && Thread[EY == 1.0]]
, True]
+
Count[
Thread[Thread[Y == 1.0] && Thread[EY == -1.0]]
, True]
)*1.0
---
0.863636

Step 9. Confusion matrix

benignPredictedBenign = 
Count[Thread[Thread[Y == -1.0] && Thread[EY == -1.0]], True]
benignPredictedMalignant = 
Count[Thread[Thread[Y == -1.0] && Thread[EY == 1.0]], True]
malignantPredictedBenign = 
Count[Thread[Thread[Y == 1.0] && Thread[EY == -1.0]], True]
malignantPredictedMalignant = 
Count[Thread[Thread[Y == 1.0] && Thread[EY == 1.0]], True]
MatrixPlot[
{
{benignPredictedBenign, benignPredictedMalignant},
{malignantPredictedBenign, malignantPredictedMalignant}
},
ImageSize -> 300,
ColorFunction -> "TemperatureMap",
FrameTicks -> {
{{1, "TUMOUR\nBENIGN"}, {2, "TUMOUR\nMALIGNANT"}},
{{1, "PREDICTED\nBENIGN"}, {2, "PREDICTED\nMALIGNANT"}},
{{1, ""}, {2, ""}},
{{1, ""}, {2, ""}}
},
PlotLabel -> "CONFUSION MATRIX PERCEPTRON",
Epilog -> {
Text[benignPredictedBenign, {1/2, 3/2}],
Text[benignPredictedMalignant, {3/2, 3/2}],
Text[malignantPredictedBenign, {1/2, 1/2}],
Text[malignantPredictedMalignant, {3/2, 1/2}]}
]
---
74
19
6
38

Figure 9. Confusion Matrix of our Perceptron
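The four counts can also be collected into a 2 x 2 matrix in a single expression by pairing each actual label with its prediction and counting every combination. This is a sketch on hypothetical labels (yTrue, yPred), with rows for the actual class and columns for the predicted class, both in the order -1, +1.

```mathematica
(* Hypothetical true labels and predictions *)
yTrue = {1.0, -1.0, -1.0, 1.0, -1.0};
yPred = {1.0, -1.0, 1.0, 1.0, -1.0};

(* Pair each actual label with its prediction *)
pairs = Transpose[{yTrue, yPred}];

(* Count every {actual, predicted} combination *)
confusion = Table[
  Count[pairs, {a, p}],
  {a, {-1.0, 1.0}}, {p, {-1.0, 1.0}}
]  (* {{2, 1}, {0, 2}} *)
```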

One way to measure the quality of a Perceptron (or any other classifier) is to evaluate its performance on a test dataset. There are several metrics that can be used to evaluate the performance of a classifier, such as accuracy, precision, recall, and F1 score.

Accuracy is the percentage of correct predictions made by the classifier. It is calculated as the number of correct predictions divided by the total number of predictions. However, accuracy can be misleading if the classes are imbalanced (i.e., one class is much more common than the other).

Recall is the percentage of positive cases that were correctly identified by the classifier. It is calculated as the number of true positive predictions divided by the total number of positive cases. Recall is particularly important in applications where false negatives must be minimized (e.g., a cancer detector). In our example we get a recall of 86%, which means that our predictor is missing 14% of the malignant tumours.

A confusion matrix is a table that shows the number of true positive, true negative, false positive, and false negative predictions made by a classifier. It is a useful tool for understanding the strengths and weaknesses of a classifier, and for comparing the performance of different classifiers.
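All four metrics mentioned in this section follow directly from the confusion matrix. The sketch below plugs in the four counts reported in Step 9, taking malignant as the positive class; the variable names are chosen here for illustration, and precision and F1 are not computed elsewhere in this article.

```mathematica
(* Counts from the confusion matrix in Step 9 (malignant is the positive class) *)
tn = 74; fp = 19; fn = 6; tp = 38;

acc = N[(tp + tn)/(tp + tn + fp + fn)]  (* 0.817518 *)
prec = N[tp/(tp + fp)]                  (* 0.666667 *)
rec = N[tp/(tp + fn)]                   (* 0.863636 *)
f1 = N[2 tp/(2 tp + fp + fn)]           (* 0.752475 *)
```

The accuracy and recall match the values obtained in Steps 7 and 8, while the precision shows that about one in three tumours predicted as malignant was in fact benign.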

In a cancer detector, the confusion matrix is particularly important because it can help identify cases where the classifier is making incorrect predictions. For example, if the classifier is making a large number of false negatives (i.e., it is missing a lot of cancer cases), it may be necessary to adjust the classifier or to gather more training data to improve its performance. On the other hand, if the classifier is making a large number of false positives (i.e., it is identifying a lot of benign cases as cancerous), it may be necessary to adjust the classifier to be more conservative in its predictions.

This article provided a basic mathematical description of the Perceptron, a type of single-layer neural network that was developed in the 1950s. It explained the mathematics behind Perceptrons, including how they can be used to classify data into different categories.

The Perceptron was applied to a well-known dataset in data science, the Wisconsin Breast Cancer (Original) dataset, and the article demonstrated how different metrics can be used to evaluate the performance of the classifier. The implementation of the Perceptron in Mathematica showed how easily these concepts can be represented in modern programming languages, and the different steps described in the implementation demonstrate how a classifier can be designed from scratch and its performance evaluated.

While Perceptrons are no longer used in modern machine learning practice due to the development of more advanced neural network architectures, the article showed that they are still a valuable tool for understanding the fundamental principles of neural networks.

[1] https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon

[2] https://psycnet.apa.org/record/1959-09865-001

[3] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) | https://archive-beta.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original (CC BY 4.0 license, see acknowledgements).

The dataset used was made available by the UCI Machine Learning Repository. Dataset created by:
1. Dr. William H. Wolberg, General Surgery Dept.
University of Wisconsin, Clinical Sciences Center
Madison, WI 53792
2. W. Nick Street, Computer Sciences Dept.
University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
3. Olvi L. Mangasarian, Computer Sciences Dept.
University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
Donor: Nick Street.
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (https://archive.ics.uci.edu/ml/about.html / https://archive.ics.uci.edu/ml/citation_policy.html).