Techno Blender
Digitally Yours.
Browsing Tag: datascience

How to prepare data for K-fold cross-validation in Machine Learning | by Andrew D #datascience | Dec, 2022

(Image by author.) Cross-validation is the first technique to use to avoid overfitting and data leakage when we want to train a predictive model on our data. Its function is essential, as it allows us to test functions and logic on our data in a safe way, namely by preventing these processes from contaminating our validation data. If we want to do preprocessing, feature engineering or other transformations, we must always partition our data correctly first. This ensures that our validation data is actually representative of our…
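As a rough illustration of the point in this excerpt (partition first, then preprocess), the sketch below splits a toy feature column into K folds and fits a standardization step on the training fold only before applying it to the validation fold. It is not the article's code: the helper names `k_fold_indices`, `fit_scaler` and `transform`, the toy data and the choice of standardization are all assumptions made for illustration, using only the Python standard library.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and yield (train, validation) index lists for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def fit_scaler(values):
    """Learn the mean and standard deviation from the given (training) values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for constant data
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Toy feature column, purely illustrative.
data = [float(x) for x in range(10)]

for train_idx, val_idx in k_fold_indices(len(data), k=5):
    train = [data[i] for i in train_idx]
    val = [data[i] for i in val_idx]
    # The scaler sees the training fold only, so nothing leaks from validation.
    mu, sigma = fit_scaler(train)
    train_scaled = transform(train, mu, sigma)
    val_scaled = transform(val, mu, sigma)
```

Fitting the scaler on the full column before splitting would let validation statistics influence training, which is the kind of contamination the excerpt warns about.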

A comprehensive introduction to Tensorflow’s Sequential API and model for deep learning | by Andrew D #datascience | Nov, 2022

Kickstart your understanding of one of TensorFlow’s most powerful sets of tools for deep learning. (Image by author.) TensorFlow is the framework created by Google that allows machine learning practitioners to create deep learning models, and it is often the first solution proposed to analysts approaching deep learning for the first time. The reason lies in the simplicity and intuitiveness of the TensorFlow Sequential API: it allows the analyst to create very complex and powerful neural networks while remaining…

Why having many features can hinder your model’s performance | by Andrew D #datascience | Oct, 2022

Feature engineering can be very useful for improving the performance of a predictive model. However, it can worsen our results if we don’t keep certain principles in mind. (Image by author.) A data analysis project always begins with a dataset. This may have been delivered by the customer, found publicly on sites like Kaggle.com, or created by us and our team. In any of these cases, the dataset will have an anatomy that varies according to the type of phenomenon it describes, and will have…

Content tagging with fuzzy logic in Python | by Andrew D #datascience | Oct, 2022

Learn how to use a simple script to tag textual content with fuzzy logic. (Image by author.) The term fuzzy logic refers to all those rules and functions that apply a heuristic based on the approximation of truth. In computer science, fuzzy logic is meant to provide a degree of truth rather than a strict true-or-false answer. It is an approach that allows us to give imprecise answers to equally imprecise questions. Let’s say we are asked how old we are, but in the form “Are you elderly?” Let’s look at this question from a logical point of view: if…
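The idea of a "degree of truth" for the question in this excerpt can be sketched with a simple membership function. This is only an illustration, not the tagging script the article refers to: the function name `elderly_degree` and the two age thresholds are hypothetical values chosen for the example.

```python
def elderly_degree(age, young_until=50.0, elderly_from=80.0):
    """Return a degree of truth in [0, 1] for the statement "this person is elderly".

    Both thresholds are hypothetical and chosen only for illustration.
    """
    if age <= young_until:
        return 0.0  # clearly not elderly
    if age >= elderly_from:
        return 1.0  # clearly elderly
    # Linear ramp between the two thresholds.
    return (age - young_until) / (elderly_from - young_until)


for age in (30, 65, 90):
    print(age, elderly_degree(age))
# 30 0.0
# 65 0.5
# 90 1.0
```

A classical yes/no rule would have to pick a single cutoff age, whereas the ramp returns an intermediate value for ages between the two thresholds.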

How to compute text similarity on a website with TF-IDF in Python | by Andrew D #datascience | Oct, 2022

A simple and effective approach to text similarity with TF-IDF and Pandas. (Image by author.) Calculating the similarity between two pieces of text is a very useful activity in data mining and natural language processing (NLP). It allows us both to isolate anomalies and diagnose specific problems, for example very similar or very different texts on a blog, and to group similar entities into useful categories. In this article we are going to use a script published here to scrape a blog and create a small corpus on…
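To make the TF-IDF similarity idea in this excerpt concrete, here is a minimal sketch using only the Python standard library rather than Pandas or the script the article mentions. The function names, the whitespace tokenizer, the smoothed IDF variant and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # high: most terms are shared
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0: no shared terms
```

On a real corpus, very high pairwise scores would point to near-duplicate texts and very low ones to outliers, which is the kind of diagnosis the excerpt describes.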

What is the difference between machine learning and deep learning? | by Andrew D #datascience | Sep, 2022

An introduction for new learners to these two often-confused concepts in data science. (Image by author.) It is not unusual to encounter the terms machine learning and deep learning in the context of data science. They are often used in the same way, and sometimes they are even used to mean the same thing. This often leads the inexperienced reader to misinterpret the two terms. This article aims to clarify the two terms, so that the reader can have a precise understanding of what machine learning…

What Is Cross-Validation in Machine Learning | by Andrew D #datascience | Aug, 2022

Learn what cross-validation is: a fundamental technique for building generalizable models. (Image by author.) The concept of cross-validation follows directly from that of overfitting, covered in my previous article. Cross-validation is one of the most effective techniques for avoiding overfitting and for understanding the performance of a predictive model. When I wrote about overfitting, I divided my data into training and test sets. The training set was used to train the model, the test set to evaluate its performance. But…

Overcome the biggest obstacle in machine learning: Overfitting | by Andrew D #datascience | Aug, 2022

Overfitting occurs when a predictive model learns to fit the training data well but fails to generalize to unseen data. (Image by author.) The best way to explain overfitting is through an example. Picture this scenario: we have just been hired as a data scientist at a company that develops photo processing software. The company recently decided to implement machine learning in its processes, and the intention is to create software that can distinguish original photos from edited photos. Our task…

The Explanation You Need on Binary Classification Metrics | by Andrew D #datascience | Aug, 2022

An intuitive overview of the most common metrics used to assess the quality of a binary classification model. (Photo by Annie Spratt on Unsplash.) The model assessment phase starts when we create a holdout set, which consists of examples the learning algorithm didn’t see during training. If our model performs well on the holdout set, we can say that it generalizes well and is of good quality. The most common way to assess whether a model is good or not is to compute a performance metric on the holdout data. This article will…
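Computing a performance metric on holdout data, as this excerpt describes, can be sketched as follows. The truncated excerpt does not say which metrics the article covers, so the choice of accuracy, precision, recall and F1 is an assumption, as are the function name `binary_metrics` and the toy labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical holdout labels and model predictions, purely illustrative.
holdout_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(holdout_labels, predictions))
```

With these toy labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics come out to 0.75.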

Feature Selection with Boruta in Python | by Andrew D #datascience

Learn how the Boruta algorithm works for feature selection. Explanation + template. (Photo by Caroline on Unsplash.) The feature selection process is fundamental in any machine learning project. In this post we’ll go through the Boruta algorithm, which allows us to create a ranking of our features, from the most to the least important for our model. Boruta is a simple-to-use and powerful technique that analysts should incorporate in their pipeline. Boruta is not a stand-alone algorithm: it sits on top of the Random…