Techno Blender
Digitally Yours.

Human-Learn: Rule-Based Learning as an Alternative to Machine Learning | by Khuyen Tran | Jan, 2023



You are given a labeled dataset and asked to predict the labels of a new one. What would you do?

The first approach you would probably try is to train a machine learning model to find rules for labeling new data.

Image by Author

This is convenient, but it is challenging to know why the machine learning model comes up with a particular prediction. You also can’t incorporate your domain knowledge into the model.

Instead of depending on a machine learning model to make predictions, is there a way to set the rules for data labeling based on your knowledge?

Image by Author

That is when human-learn comes in handy.

human-learn is a Python package for creating rule-based systems that are easy to construct and compatible with scikit-learn.

To install human-learn, type:

pip install human-learn

In the previous article, I talked about how to create a human learning model by drawing:

In this article, we will learn how to create a model with a simple function.

Feel free to play and fork the source code of this article here:

To evaluate the performance of a rule-based model, let’s start with predicting a dataset using a machine learning model.

We will use the Occupancy Detection Dataset from the UCI Machine Learning Repository as an example for this tutorial.

Our task is to predict room occupancy based on temperature, humidity, light, and CO2. A room is not occupied if Occupancy=0 and is occupied if Occupancy=1.

After downloading the dataset, unzip and read the data:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Get train and test data
train = pd.read_csv("occupancy_data/datatraining.txt").drop(columns="date")
test = pd.read_csv("occupancy_data/datatest.txt").drop(columns="date")

# Get X and y
target = "Occupancy"
train_X, train_y = train.drop(columns=target), train[target]
val_X, val_y = test.drop(columns=target), test[target]

Take a look at the first ten records of the train dataset:

train.head(10)

Image by Author

Train scikit-learn’s RandomForestClassifier on the training dataset and use it to predict the test dataset:

# Create the model
forest_model = RandomForestClassifier(random_state=1)

# Train and predict
forest_model.fit(train_X, train_y)
machine_preds = forest_model.predict(val_X)

# Evaluate
print(classification_report(val_y, machine_preds))

Image by Author

The score is pretty good. However, we are unsure how the model comes up with these predictions.

Let’s see if we can label the new data with simple rules.

There are four steps to create rules for labeling data:

  1. Generate a hypothesis
  2. Observe the data to validate the hypothesis
  3. Start with simple rules based on the observations
  4. Improve the rules

Generate a Hypothesis

Light in a room is a good indicator of whether the room is occupied. Thus, we can assume that the brighter a room is, the more likely it is to be occupied.

Let’s see if this is true by looking at the data.

Observe the Data

To validate our guess, let’s use a box plot to compare the amount of light in an occupied room (Occupancy=1) and an empty room (Occupancy=0).

import plotly.express as px
import plotly.graph_objects as go

feature = "Light"
px.box(data_frame=train, x=target, y=feature)

Image by Author

We can see a significant difference in the median amount of light between occupied and empty rooms.

Start with Simple Rules

Now, we will create rules for whether a room is occupied based on the light in that room. Specifically, if the amount of light is above a certain threshold, we predict Occupancy=1; otherwise, we predict Occupancy=0.

Image by Author

But what should that threshold be? Let’s start by picking 100 as the threshold and see what we get.

Image by Author

To create a rule-based model with human-learn, we will:

  • Write a simple Python function that specifies the rules
  • Use FunctionClassifier to turn that function into a scikit-learn model

import numpy as np
from hulearn.classification import FunctionClassifier

def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    # Label a row as occupied (1) when the column value exceeds the threshold
    return np.array(data[col] > threshold).astype(int)

mod = FunctionClassifier(create_rule, col='Light')

Predict the test set and evaluate the predictions:

mod.fit(train_X, train_y)
preds = mod.predict(val_X)
print(classification_report(val_y, preds))

Image by Author

The accuracy is better than what we got earlier using RandomForestClassifier!

Improve the Rules

Let’s see if we can get a better result by experimenting with several thresholds. We will use parallel coordinates to analyze the relationships between a specific value of light and room occupancy.

from hulearn.experimental.interactive import parallel_coordinates

parallel_coordinates(train, label=target, height=200)

Image by Author

From the parallel coordinates, we can see that a room with light above 250 lux has a high chance of being occupied. The optimal threshold that separates an occupied room from an empty room seems to be somewhere between 250 lux and 750 lux.

Let’s find the best threshold in this range using scikit-learn’s GridSearchCV.

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(mod, cv=2, param_grid={"threshold": np.linspace(250, 750, 1000)})
grid.fit(train_X, train_y)

Get the best threshold:

best_threshold = grid.best_params_["threshold"]
best_threshold
> 364.61461461461465
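
To make the idea behind this search more concrete, here is a minimal sketch that scans candidate thresholds by hand and keeps the one with the highest accuracy. It is not what GridSearchCV does internally (there is no cross-validation here), and the light readings, labels, and the helper names rule and scan_thresholds are made up for illustration.

```python
import numpy as np

# Made-up light readings (lux) and occupancy labels, for illustration only
light = np.array([0, 20, 90, 150, 300, 420, 450, 480, 510, 600])
occupied = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def rule(light_values, threshold):
    """Label a room as occupied (1) when its light reading exceeds the threshold."""
    return (light_values > threshold).astype(int)

def scan_thresholds(light_values, labels, candidates):
    """Return the candidate threshold with the highest accuracy, and that accuracy."""
    accuracies = [np.mean(rule(light_values, t) == labels) for t in candidates]
    best = int(np.argmax(accuracies))  # on ties, the first (lowest) candidate wins
    return float(candidates[best]), float(accuracies[best])

candidates = np.linspace(250, 750, 11)  # 250, 300, ..., 750
best_threshold, best_accuracy = scan_thresholds(light, occupied, candidates)
print(best_threshold, best_accuracy)
```

On real data, scoring each candidate on held-out folds, as GridSearchCV does, is the safer choice because it guards against picking a threshold that only fits the training rows.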

Plot the threshold on the box plot.

Image by Author

Use the model with the best threshold to predict the test set:

human_preds = grid.predict(val_X)
print(classification_report(val_y, human_preds))

Image by Author

The threshold of 365 gives a better result than the threshold of 100.

Using domain knowledge to create rules with a rule-based model is nice, but there are some disadvantages:

  • It doesn’t generalize well to unseen data
  • It is difficult to come up with rules for complex data
  • There is no feedback loop to improve the model

Thus, combining a rule-based model and an ML model will help data scientists scale and improve the model while still being able to incorporate their domain expertise.

One straightforward way to combine the two models is to decide whether to reduce false negatives or false positives.

Reduce False Negatives

You might want to reduce false negatives in scenarios such as predicting whether a patient has cancer (it is better to make a mistake telling patients that they have cancer than to fail to detect cancer).

To reduce false negatives, choose positive labels when two models disagree.

Image by Author
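
One possible way to express this policy in code is an element-wise logical OR over the two prediction arrays. This is only a sketch: the function name prefer_positive and the example predictions are made up for illustration.

```python
import numpy as np

def prefer_positive(preds_a, preds_b):
    """Predict 1 whenever at least one of the two models predicts 1."""
    return np.logical_or(preds_a, preds_b).astype(int)

# Made-up predictions standing in for the ML model and the rule-based model
ml_preds = np.array([1, 0, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

combined = prefer_positive(ml_preds, rule_preds)
print(combined)
```

With the models from this article, the same function could be called as prefer_positive(machine_preds, human_preds).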

Reduce False Positives

You might want to reduce false positives in scenarios such as recommending videos to kids when some videos might be violent (it is better to make the mistake of not recommending kid-friendly videos than to recommend adult videos to kids).

To reduce false positives, choose negative labels when two models disagree.

Image by Author
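
One possible way to express this policy in code is an element-wise logical AND over the two prediction arrays. This is only a sketch: the function name prefer_negative and the example predictions are made up for illustration.

```python
import numpy as np

def prefer_negative(preds_a, preds_b):
    """Predict 1 only when both models predict 1."""
    return np.logical_and(preds_a, preds_b).astype(int)

# Made-up predictions standing in for the ML model and the rule-based model
ml_preds = np.array([1, 0, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

combined = prefer_negative(ml_preds, rule_preds)
print(combined)
```

With the models from this article, the same function could be called as prefer_negative(machine_preds, human_preds).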

You can also use other, more complex policy layers to decide which prediction to choose.
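
As one hypothetical example of such a policy layer, the sketch below keeps the ML model’s label when its predicted probability is far from 0.5 and falls back to the rule-based label otherwise. The function name confidence_policy, the margin parameter, and the numbers are assumptions made for illustration, not something taken from human-learn.

```python
import numpy as np

def confidence_policy(ml_proba, rule_preds, margin=0.2):
    """Use the ML label when its probability is at least `margin` away from 0.5;
    otherwise fall back to the rule-based label."""
    ml_proba = np.asarray(ml_proba)
    ml_preds = (ml_proba >= 0.5).astype(int)
    confident = np.abs(ml_proba - 0.5) >= margin
    return np.where(confident, ml_preds, np.asarray(rule_preds))

# Made-up positive-class probabilities and rule-based labels
ml_proba = np.array([0.95, 0.55, 0.40, 0.05])
rule_preds = np.array([0, 1, 1, 1])

combined = confidence_policy(ml_proba, rule_preds)
print(combined)
```

For a scikit-learn classifier, the positive-class probabilities could come from predict_proba(X)[:, 1]. The margin is a tuning knob and would need to be chosen on validation data.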

For a deeper dive into how to combine an ML model and a rule-based model, I recommend checking this excellent video by Jeremy Jordan.

Congratulations! You have just learned what a rule-based model is and how to combine it with a machine-learning model. I hope this article gives you the knowledge needed to develop your own rule-based model.


