Human-Learn: Rule-Based Learning as an Alternative to Machine Learning | by Khuyen Tran | Jan, 2023
Incorporate Domain Knowledge into Your Model with Rule-Based Learning
You are given a labeled dataset and asked to predict the labels of a new one. What would you do?
The first approach you would probably try is to train a machine learning model to find rules for labeling new data.
This is convenient, but it is challenging to know why the machine learning model comes up with a particular prediction. You also can’t incorporate your domain knowledge into the model.
Instead of depending on a machine learning model to make predictions, is there a way to set the rules for data labeling based on your knowledge?
That is when human-learn comes in handy.
human-learn is a Python package to create rule-based systems that are easy to construct and are compatible with scikit-learn.
To install human-learn, type:
pip install human-learn
In the previous article, I talked about how to create a human learning model by drawing:
In this article, we will learn how to create a model with a simple function.
Feel free to play and fork the source code of this article here:
To have a baseline for evaluating a rule-based model, let’s start by making predictions with a machine learning model.
We will use the Occupancy Detection Dataset from the UCI Machine Learning Repository as an example for this tutorial.
Our task is to predict room occupancy based on temperature, humidity, light, and CO2. A room is not occupied if Occupancy=0 and is occupied if Occupancy=1.
After downloading the dataset, unzip and read the data:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Get train and test data
train = pd.read_csv("occupancy_data/datatraining.txt").drop(columns="date")
test = pd.read_csv("occupancy_data/datatest.txt").drop(columns="date")
# Get X and y
target = "Occupancy"
train_X, train_y = train.drop(columns=target), train[target]
val_X, val_y = test.drop(columns=target), test[target]
Take a look at the first ten records of the train dataset:
train.head(10)
Train scikit-learn’s RandomForestClassifier model on the training dataset and use it to predict the test dataset:
# Train
forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)

# Predict
machine_preds = forest_model.predict(val_X)

# Evaluate
print(classification_report(val_y, machine_preds))
The score is pretty good. However, we are unsure how the model comes up with these predictions.
Let’s see if we can label the new data with simple rules.
There are four steps to create rules for labeling data:
- Generate a hypothesis
- Observe the data to validate the hypothesis
- Start with simple rules based on the observations
- Improve the rules
Generate a Hypothesis
Light in a room is a plausible indicator of whether the room is occupied. Thus, we can hypothesize that the brighter a room is, the more likely it is to be occupied.
Let’s see if this is true by looking at the data.
Observe the Data
To validate our guess, let’s use a box plot to find the difference in the amount of light between an occupied room (Occupancy=1) and an empty room (Occupancy=0).
import plotly.express as px
import plotly.graph_objects as go

feature = "Light"
px.box(data_frame=train, x=target, y=feature)
We can see a significant difference in the median light level between occupied and empty rooms.
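The gap seen in the box plot can also be checked numerically by comparing the median per class. The following is a minimal sketch that assumes a small hand-made DataFrame (toy, which is not the real occupancy data) with the same Light and Occupancy column names. On the real data, the same groupby call could be run on train.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only (not the UCI dataset)
toy = pd.DataFrame(
    {
        "Light": [0.0, 10.0, 20.0, 430.0, 450.0, 470.0],
        "Occupancy": [0, 0, 0, 1, 1, 1],
    }
)

# Median light level for each occupancy class
median_light = toy.groupby("Occupancy")["Light"].median()
print(median_light)
```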
Start with Simple Rules
Now, we will create rules for whether a room is occupied based on the light in that room. Specifically, if the amount of light is above a certain threshold, we predict Occupancy=1, and Occupancy=0 otherwise.
But what should that threshold be? Let’s start by picking 100 as the threshold and see what we get.
To create a rule-based model with human-learn, we will:
- Write a simple Python function that specifies the rules
- Use FunctionClassifier to turn that function into a scikit-learn model
import numpy as np
from hulearn.classification import FunctionClassifier

def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    return np.array(data[col] > threshold).astype(int)

mod = FunctionClassifier(create_rule, col='Light')
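To get a feel for the rule itself, it can help to call the function directly and look at what it returns. This sketch repeats create_rule so that it runs on its own and applies it to a tiny made-up DataFrame (toy_X); the values are illustrative and do not come from the occupancy dataset.

```python
import numpy as np
import pandas as pd


def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    # 1 where the column value is strictly above the threshold, else 0
    return np.array(data[col] > threshold).astype(int)


# Hypothetical light readings, for illustration only
toy_X = pd.DataFrame({"Light": [0.0, 99.0, 100.0, 101.0, 500.0]})

labels = create_rule(toy_X, col="Light")
print(labels)
```

Note that the comparison is strict, so a reading exactly equal to the threshold is labeled 0.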
Predict the test set and evaluate the predictions:
mod.fit(train_X, train_y)
preds = mod.predict(val_X)
print(classification_report(val_y, preds))
The accuracy is better than what we got earlier using RandomForestClassifier!
Improve the Rules
Let’s see if we can get a better result by experimenting with several thresholds. We will use parallel coordinates to analyze the relationships between a specific value of light and room occupancy.
from hulearn.experimental.interactive import parallel_coordinates

parallel_coordinates(train, label=target, height=200)
From the parallel coordinates, we can see that rooms with light above 250 lux have a high chance of being occupied. The optimal threshold that separates an occupied room from an empty room seems to be somewhere between 250 lux and 750 lux.
Let’s find the best threshold in this range using scikit-learn’s GridSearchCV.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(mod, cv=2, param_grid={"threshold": np.linspace(250, 750, 1000)})
grid.fit(train_X, train_y)
Get the best threshold:
best_threshold = grid.best_params_["threshold"]
best_threshold
> 364.61461461461465
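Conceptually, the search tries each candidate threshold and keeps the one that scores best. The sketch below shows that idea with plain NumPy on made-up data. It is not how GridSearchCV is implemented (there is no cross-validation here), and the names light, occupied, candidates, and accuracy_at are assumptions for illustration.

```python
import numpy as np

# Hypothetical light readings and labels, for illustration only
light = np.array([0.0, 50.0, 200.0, 300.0, 420.0, 450.0, 500.0, 700.0])
occupied = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Candidate thresholds to try
candidates = np.array([100.0, 250.0, 400.0, 600.0])


def accuracy_at(threshold: float) -> float:
    # Apply the "light above threshold" rule and compare with the labels
    preds = (light > threshold).astype(int)
    return float((preds == occupied).mean())


# Score every candidate and keep the one with the highest accuracy
scores = np.array([accuracy_at(t) for t in candidates])
best = candidates[scores.argmax()]
print(dict(zip(candidates.tolist(), scores.tolist())), best)
```

On the real data, scoring on held-out folds (as GridSearchCV does) is the safer choice, since a threshold tuned and scored on the same rows can look better than it is.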
Plot the threshold on the box plot.
Use the model with the best threshold to predict the test set:
human_preds = grid.predict(val_X)
print(classification_report(val_y, human_preds))
The threshold of 365 gives a better result than the threshold of 100.
Using domain knowledge to create rules with a rule-based model is nice, but there are some disadvantages:
- It doesn’t generalize well to unseen data
- It is difficult to come up with rules for complex data
- There is no feedback loop to improve the model
Thus, combining a rule-based model and an ML model will help data scientists scale and improve the model while still being able to incorporate their domain expertise.
One straightforward way to combine the two models is to decide whether to reduce false negatives or false positives.
Reduce False Negatives
You might want to reduce false negatives in scenarios such as predicting whether a patient has cancer (it is better to make a mistake telling patients that they have cancer than to fail to detect cancer).
To reduce false negatives, choose positive labels when two models disagree.
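One way to read this policy: predict positive if either model predicts positive. A minimal sketch, assuming both models output 0/1 labels as NumPy arrays; the arrays below are made up and stand in for the predictions of the ML model and the rule-based model.

```python
import numpy as np

# Hypothetical 0/1 predictions from the two models, for illustration only
ml_preds = np.array([0, 1, 0, 1, 0])
rule_preds = np.array([0, 0, 1, 1, 0])

# Positive whenever at least one model says positive (element-wise OR)
fewer_false_negatives = np.logical_or(ml_preds, rule_preds).astype(int)
print(fewer_false_negatives)
```

Because this policy can only add positive labels, it may raise the number of false positives in exchange.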
Reduce False Positives
You might want to reduce false positives in scenarios such as recommending videos that might be violent to kids (it is better to make the mistake of not recommending kid-friendly videos than to recommend adult videos to kids).
To reduce false positives, choose negative labels when two models disagree.
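One way to read this policy: predict positive only if both models predict positive. A minimal sketch, assuming both models output 0/1 labels as NumPy arrays; the arrays below are made up and stand in for the predictions of the ML model and the rule-based model.

```python
import numpy as np

# Hypothetical 0/1 predictions from the two models, for illustration only
ml_preds = np.array([0, 1, 0, 1, 0])
rule_preds = np.array([0, 0, 1, 1, 0])

# Positive only when both models say positive (element-wise AND)
fewer_false_positives = np.logical_and(ml_preds, rule_preds).astype(int)
print(fewer_false_positives)
```

Because this policy can only remove positive labels, it may raise the number of false negatives in exchange.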
You can also use other, more complex policy layers to decide which prediction to choose.
For a deeper dive into how to combine an ML model and a rule-based model, I recommend checking this excellent video by Jeremy Jordan.
Congratulations! You have just learned what a rule-based model is and how to combine it with a machine-learning model. I hope this article gives you the knowledge needed to develop your own rule-based model.