
5 Ways of Implementing the Open-Closed Principle with Python

By Erdem Isbilen, March 2023




The Open-Closed Principle (OCP) is one of the five SOLID principles of object-oriented programming. It states that software entities, such as classes, modules, and functions, should be open for extension but closed for modification. In other words, you should be able to add new features to your software without having to modify existing code.

The goal of the OCP is to create software that is more flexible and easier to maintain over time. By designing software that can be extended without modifying existing code, you can reduce the risk of introducing new bugs and make your code easier to read and understand.
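As a minimal illustration of the idea (the class and function names below are made up for this example, not taken from the article's later code), compare a function that must be edited for every new case with a design that is extended by adding a new class:

# Closed to extension: every new discount type forces an edit to this function
def apply_discount(price, customer_type):
    if customer_type == 'regular':
        return price * 0.95
    elif customer_type == 'vip':
        return price * 0.80
    raise ValueError(customer_type)


# Open for extension: new rules are added as new classes, existing code stays untouched
class Discount:
    def apply(self, price):
        raise NotImplementedError

class RegularDiscount(Discount):
    def apply(self, price):
        return price * 0.95

class VipDiscount(Discount):
    def apply(self, price):
        return price * 0.80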

While the OCP is primarily concerned with software design, it can also be applied to data science. Data scientists often work with large, complex data sets and models that need to be updated and modified over time. By following the OCP, data scientists can ensure that their models are easily extensible and maintainable over time.

In the context of data science, a “model” typically refers to a mathematical or statistical representation of a real-world system or process. Models can be used to make predictions, classify data, or understand complex relationships between variables.

For example, a data scientist might build a machine learning model to predict customer churn for a business based on historical customer data. The model would be trained on a dataset of past customer behaviour and would use that information to make predictions about which customers are most likely to churn in the future.

Models can take many different forms, depending on the specific problem being solved and the data available. Some common types of models used in data science include regression models, decision trees, neural networks, and support vector machines.

Specifically, the OCP can help data scientists in the following ways:

  1. Facilitating model extension: By designing models to be open for extension, data scientists can easily add new features and functionality to their models without modifying the original code. This can help them keep their models up-to-date and relevant over time.
  2. Encouraging modular design: The OCP encourages modular design, which can make it easier to update and modify models over time. By breaking down models into smaller, more manageable components, data scientists can make changes to specific parts of the model without affecting the rest of the code.
  3. Enhancing maintainability: By designing models that are closed for modification, data scientists can ensure that their code is more stable and less prone to errors. This can make it easier to maintain and update models over time.

(1) Use abstraction:

One way to implement the OCP in Python is to use abstraction to hide implementation details and allow for extension without modification. In this example, we define an abstract base class DataTransformer that defines a single abstract method transform. This class serves as an abstraction for any data transformer that we might create in the future. We then define two concrete implementations of this abstract class: StandardScalerTransformer and LogTransformer.

from abc import ABC, abstractmethod

import numpy as np
import pandas as pd


class DataTransformer(ABC):
    @abstractmethod
    def transform(self, data):
        pass


class DataPipeline:
    def __init__(self, transformers):
        self.transformers = transformers

    def run(self, data):
        for transformer in self.transformers:
            data = transformer.transform(data)
        return data


class StandardScalerTransformer(DataTransformer):
    def __init__(self, mean=None, std=None):
        self.mean = mean
        self.std = std

    def fit(self, data):
        if self.mean is None:
            self.mean = data.mean()
        if self.std is None:
            self.std = data.std()

    def transform(self, data):
        self.fit(data)
        return (data - self.mean) / self.std


class LogTransformer(DataTransformer):
    def transform(self, data):
        # Element-wise natural log; assumes the values are positive
        return np.log(data)


if __name__ == '__main__':
    # Load data
    data = pd.read_csv('data.csv')

    # Instantiate transformers
    scaler = StandardScalerTransformer()
    log_transformer = LogTransformer()

    # Create and run pipeline
    pipeline = DataPipeline(transformers=[scaler, log_transformer])
    data = pipeline.run(data)

The DataPipeline class takes a list of DataTransformer objects as an argument to its constructor. This allows us to add transformers to the pipeline or remove them without modifying the class itself. The run method applies each transformer in the pipeline to the data sequentially.

Using this abstraction, we can easily create new transformers that implement the transform method and add them to the pipeline without modifying the existing DataPipeline class. This demonstrates the Open/Closed Principle in action: the DataPipeline class is open for extension (we can add new transformers to the pipeline) but closed for modification (we don’t need to modify the existing class to add new transformers).
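For example, a hypothetical MinMaxTransformer (the name and scaling rule are ours, not part of the original pipeline) can be dropped into the same pipeline without touching DataPipeline:

class MinMaxTransformer(DataTransformer):
    """Hypothetical transformer that rescales each column to the [0, 1] range."""

    def transform(self, data):
        return (data - data.min()) / (data.max() - data.min())


# The existing DataPipeline is reused unchanged; only the transformer list grows
pipeline = DataPipeline(transformers=[MinMaxTransformer(), LogTransformer()])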

(2) Use inheritance and/or composition:

Another way to implement the OCP in Python is to use inheritance to extend the behaviour of your classes. By creating a base class with a well-defined interface, you can create new subclasses that inherit that interface and add new functionality. This allows you to extend the behaviour of your code without modifying the original implementation.

We can also implement the OCP in Python with composition rather than inheritance. By defining objects that contain other objects, data scientists can create code that is more modular and easier to extend. The example below combines both approaches.

In this example, we define the DataAnalyzer base class with abstract methods preprocess and analyze, which must be implemented by subclasses. We also define an __init__ method that takes in the data to be analyzed.

We then define three subclasses for analyzing numerical, text, and image data, respectively. Each subclass overrides the preprocess and analyze methods to provide specialized functionality for that type of data.

For example, the NumericalDataAnalyzer subclass includes a preprocess method that scales the numerical data using a StandardScaler object, while the TextDataAnalyzer subclass includes a preprocess method that vectorizes the text data using a TfidfVectorizer object. Similarly, the ImageDataAnalyzer subclass includes a preprocess method that extracts features from the image data using a ResNet50 object.

from abc import ABC, abstractmethod

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


class DataAnalyzer(ABC):
    def __init__(self, data):
        self.data = data

    @abstractmethod
    def preprocess(self):
        pass

    @abstractmethod
    def analyze(self):
        pass


class NumericalDataAnalyzer(DataAnalyzer):
    def __init__(self, numerical_data):
        super().__init__(numerical_data)
        self.scaler = StandardScaler()

    def preprocess(self):
        self.data = self.scaler.fit_transform(self.data)

    def analyze(self):
        # analyze numerical data here
        pass


class TextDataAnalyzer(DataAnalyzer):
    def __init__(self, text_data):
        super().__init__(text_data)
        self.vectorizer = TfidfVectorizer()

    def preprocess(self):
        self.data = self.vectorizer.fit_transform(self.data)

    def analyze(self):
        # analyze text data here
        pass


class ImageDataAnalyzer(DataAnalyzer):
    def __init__(self, image_data):
        super().__init__(image_data)
        # Placeholder: any image feature extractor exposing an extract_features
        # method (e.g. a wrapper around a pretrained ResNet50) can be composed here
        self.feature_extractor = ResNet50()

    def preprocess(self):
        self.data = self.feature_extractor.extract_features(self.data)

    def analyze(self):
        # analyze image data here
        pass

By using inheritance to create specialized subclasses and composition to include specialized objects within those subclasses, we can analyze each type of data consistently and flexibly, while adhering to the Open-Closed Principle. If we need to add support for a new type of data in the future, we can simply create a new subclass that inherits from DataAnalyzer and composes the appropriate specialized objects. This approach allows us to extend our program without modifying the existing code, making it easier to maintain and reuse.
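As a sketch of such an extension, here is a hypothetical TimeSeriesDataAnalyzer that composes a pandas resampling step; the class name and its preprocessing choice are illustrative assumptions, not part of the original example:

class TimeSeriesDataAnalyzer(DataAnalyzer):
    """Hypothetical subclass adding time-series support without touching existing code."""

    def __init__(self, time_series_data, freq='D'):
        super().__init__(time_series_data)
        self.freq = freq  # resampling frequency, e.g. daily

    def preprocess(self):
        # Assumes self.data is a pandas DataFrame indexed by timestamps
        self.data = self.data.resample(self.freq).mean()

    def analyze(self):
        # analyze time-series data here
        pass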

(3) Use plugins:

The OCP can be applied to create a plugin architecture in Python. By defining a set of well-defined interfaces or abstract base classes, data scientists can allow other developers to write plugins that extend the functionality of their code without modifying the original implementation.

Suppose we have a script that performs some data processing on a given dataset. We want to be able to easily add and remove different data processing steps as plugins, without modifying the script code.

We can use a plugin architecture to define data processing steps as plugins that can be loaded dynamically at runtime. Here’s some example code:

import importlib


class DataProcessingPlugin:
    """Interface that every data processing plugin is expected to implement."""

    def process_data(self, data):
        raise NotImplementedError


class RemoveDuplicatesPlugin(DataProcessingPlugin):
    def process_data(self, data):
        # Remove duplicate rows from the data
        return data.drop_duplicates()


class ImputeMissingValuesPlugin(DataProcessingPlugin):
    def process_data(self, data):
        # Impute missing values in the data using mean imputation
        return data.fillna(data.mean())


def process_data(data, processing_steps):
    # Load plugins dynamically; each module plugins/<step>_plugin.py is expected
    # to expose a process_data(data) callable implementing the plugin interface
    plugins = [importlib.import_module(f'plugins.{step}_plugin') for step in processing_steps]

    # Apply each processing plugin to the data sequentially
    for plugin in plugins:
        data = plugin.process_data(data)

    return data

In this example, we define a DataProcessingPlugin class that defines the interface for data processing plugins. We also define two plugins, RemoveDuplicatesPlugin and ImputeMissingValuesPlugin, that implement the DataProcessingPlugin interface and define custom logic for removing duplicate rows and imputing missing values, respectively.

We then define a process_data function that takes a dataset and a list of processing steps as inputs. The function uses the importlib module to load each processing plugin corresponding to the given steps dynamically at runtime. It then applies each processing plugin to the data sequentially to produce a final processed dataset.

Using plugins in this way allows us to easily modify and experiment with different data processing steps without having to modify the process_data function code. It also makes it easier to share and reuse data processing code between different projects.
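For concreteness, here is a sketch of what one such plugin module might look like. The file path plugins/remove_duplicates_plugin.py and the module-level function are assumptions about the layout implied by the loader above, not something spelled out in the original code:

# plugins/remove_duplicates_plugin.py (hypothetical plugin module)

def process_data(data):
    """Entry point that the main script discovers and calls via importlib."""
    return data.drop_duplicates()

With that layout, calling process_data(df, ['remove_duplicates', 'impute_missing_values']) applies both plugins in order without the main script changing.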

(4) Use configuration files:

The OCP can be applied by using configuration files to control the behavior of a Python program. By separating configuration data from code, data scientists can create programs that are easier to extend and maintain. For example, a data scientist might define a configuration file that specifies the parameters for a machine learning model, allowing other developers to experiment with different parameter settings without modifying the original code.

Here is an example:

Suppose we have a dataset of customer reviews and we want to analyze the sentiment of each review using various machine learning models. We want to be able to easily swap out the machine learning model being used without changing the code of our sentiment analysis script.

We can use a configuration file to specify which machine learning model to use and its associated hyperparameters. Here’s an example configuration file:

{
    "model": "logistic_regression",
    "model_params": {
        "C": 1.0
    }
}

In this example, we’re using logistic regression as our machine learning model and specifying a regularization parameter of 1.0.

Our sentiment analysis script can then read in this configuration file and use the specified machine learning model to analyze the sentiment of each review. Here’s some example code:

import json

import pandas as pd
from sklearn.linear_model import LogisticRegression


def load_data(filename):
    # Load customer reviews from CSV file
    data = pd.read_csv(filename)

    # Remove any rows with missing data
    data.dropna(inplace=True)
    return data


def preprocess_data(data):
    # Preprocess customer reviews (elided here); the result is expected to
    # expose a feature matrix under 'X' and labels under 'y' for train_model
    # ...
    return processed_data


def train_model(data, model_name, model_params):
    # Train specified machine learning model on preprocessed data
    if model_name == 'logistic_regression':
        model = LogisticRegression(C=model_params['C'])
    else:
        raise ValueError('Invalid model name: {}'.format(model_name))
    model.fit(data['X'], data['y'])
    return model


if __name__ == '__main__':
    # Load configuration from file
    with open('config.json') as f:
        config = json.load(f)

    # Load data
    data = load_data('reviews.csv')

    # Preprocess data
    processed_data = preprocess_data(data)

    # Train model
    model = train_model(processed_data, config['model'], config['model_params'])

    # Use model to analyze sentiment of each review
    # ...

In this example, we define a load_data function to load customer reviews from a CSV file, a preprocess_data function to preprocess the data for use with the machine learning model, and a train_model function to train the specified machine learning model on the preprocessed data.

We read in the configuration file using the json.load function and pass the specified machine learning model and its hyperparameters to the train_model function. This allows us to switch out the machine learning model simply by modifying the configuration file, without having to change any of the code in our sentiment analysis script.

Using configuration files to specify parameters in this way is a common pattern in data science, as it allows for easy modification and experimentation without having to modify the code.
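One limitation of the if/else dispatch above is that supporting a new model still means editing train_model. A common way to keep that function closed to modification is a registry mapping model names to constructors; the sketch below is one possible restructuring under that assumption, not part of the original script:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Registry of supported models; adding a model means adding an entry, not editing train_model
MODEL_REGISTRY = {
    'logistic_regression': LogisticRegression,
    'random_forest': RandomForestClassifier,
}

def train_model(data, model_name, model_params):
    try:
        model_cls = MODEL_REGISTRY[model_name]
    except KeyError:
        raise ValueError('Invalid model name: {}'.format(model_name))
    model = model_cls(**model_params)
    model.fit(data['X'], data['y'])
    return model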

(5) Use dependency injection:

The OCP can be applied by using dependency injection to allow objects to be created and configured dynamically.

In this example, we define classes for data loading, data cleaning, feature engineering, and machine learning. The DataCleaner and FeatureEngineer classes each take a strategy as a dependency that is injected when an instance of the class is created.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class RemoveDuplicatesStrategy:
    """Example cleaning strategy injected into DataCleaner."""

    def clean(self, data):
        return data.drop_duplicates()


class AddFeaturesStrategy:
    """Example feature engineering strategy injected into FeatureEngineer."""

    def engineer(self, data):
        # Derive additional features here; returned unchanged in this sketch
        return data


class DataLoader:
    def __init__(self, filename):
        self.filename = filename

    def load_data(self):
        # Load data from file
        data = pd.read_csv(self.filename)
        return data


class DataCleaner:
    def __init__(self, strategy):
        self.strategy = strategy

    def clean_data(self, data):
        # Clean data using the injected strategy
        cleaned_data = self.strategy.clean(data)
        return cleaned_data


class FeatureEngineer:
    def __init__(self, strategy):
        self.strategy = strategy

    def engineer_features(self, data):
        # Engineer features using the injected strategy
        engineered_data = self.strategy.engineer(data)
        return engineered_data


class Model:
    def __init__(self):
        self.model = RandomForestClassifier()

    def train(self, X, y):
        # Train machine learning model on preprocessed data
        self.model.fit(X, y)

    def predict(self, X):
        # Use trained machine learning model to make predictions
        predictions = self.model.predict(X)
        return predictions


if __name__ == '__main__':
    # Create instances of data cleaning and feature engineering strategies
    cleaning_strategy = RemoveDuplicatesStrategy()
    feature_engineering_strategy = AddFeaturesStrategy()

    # Create instances of data loader, data cleaner, feature engineer, and model
    data_loader = DataLoader('data.csv')
    data_cleaner = DataCleaner(cleaning_strategy)
    feature_engineer = FeatureEngineer(feature_engineering_strategy)
    model = Model()

    # Load data
    data = data_loader.load_data()

    # Clean data
    cleaned_data = data_cleaner.clean_data(data)

    # Engineer features
    engineered_data = feature_engineer.engineer_features(cleaned_data)

    # Train model (assumes the dataset contains a 'target' column)
    X = engineered_data.drop('target', axis=1)
    y = engineered_data['target']
    model.train(X, y)

    # Make predictions
    predictions = model.predict(X)

We create instances of the data cleaning and feature engineering strategies and inject them into instances of the DataCleaner and FeatureEngineer classes, respectively. This allows us to easily switch out the data cleaning and feature engineering strategies with different implementations, without modifying any of the code in our machine learning script.

We also create an instance of the DataLoader class, which is not injected with any dependencies. This is because the data loading strategy is unlikely to change frequently, so it does not need to be specified as a dependency.

Using dependency injection in this way allows us to easily modify and experiment with different data cleaning and feature engineering strategies without having to modify the machine learning code.
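As a sketch of what such a swap looks like in practice (the DropOutliersStrategy name and its filtering rule are illustrative assumptions):

class DropOutliersStrategy:
    """Hypothetical alternative cleaning strategy."""

    def clean(self, data):
        # Keep only rows within 3 standard deviations of the column means
        numeric = data.select_dtypes('number')
        mask = ((numeric - numeric.mean()).abs() <= 3 * numeric.std()).all(axis=1)
        return data[mask]


# DataCleaner is reused unchanged; only the injected strategy differs
data_cleaner = DataCleaner(DropOutliersStrategy())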

By following the Open-Closed Principle, data scientists can create code that is not only functional, but also modular, extensible, and easy to maintain. This can help to ensure the longevity and effectiveness of our data science projects, enabling us to more easily meet the ever-changing demands of the data science landscape.


