Machine Learning on a Large Scale | by Pan Cretan | Jun, 2022


A demonstration using binomial and multinomial logistic regression in PySpark

Photo by David Jusko on Unsplash

With the release of Spark 3.2.1, which was deployed locally for this article, PySpark offers a fluent API that resembles scikit-learn in expressivity while additionally offering the benefits of distributed computing. This article demonstrates the use of the pyspark.ml module for constructing ML pipelines on top of Spark data frames (instead of RDDs, as with the older pyspark.mllib module). The functionality is exemplified using binomial and multinomial logistic regression, which admittedly are not the most advanced machine learning algorithms. Still, their simplicity makes them ideal for demonstrating the PySpark machine learning API. This tutorial may be of interest to readers who are new to machine learning with PySpark, as well as to readers who are more familiar with earlier versions of Spark and, in particular, with the pyspark.mllib module.

Table of contents

· Setting the scene
· Binomial logistic regression
Preparatory work
First modelling attempt
Assessing model quality
Cross-validation and hyper-parameter tuning
Model interpretation
· Multinomial logistic regression
· Conclusions

Setting the scene

We first create a Spark session by allocating 8 GiB of memory and four cores.
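A minimal sketch of such a session, together with the pyspark imports used in the rest of the article, could look as follows (the local master and configuration values are indicative rather than prescriptive):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    VectorAssembler, MinMaxScaler, StringIndexerModel, IndexToString
)
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Local session with four cores and 8 GiB of driver memory
spark = (
    SparkSession.builder
    .master('local[4]')
    .config('spark.driver.memory', '8g')
    .appName('pyspark-logistic-regression')
    .getOrCreate()
)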

The code above also contains all the pyspark imports required for the whole article. When it comes to other packages, we will add further imports as the need arises.

We will use the iris dataset obtained from seaborn using df = sns.load_dataset('iris'). This is a famous dataset that contains four continuous features, namely the sepal and petal lengths and widths of 150 iris flowers that belong to three different species: Iris setosa, Iris versicolor and Iris virginica. The dataset does not have null values and all features are reasonably well scaled, but we will return to this later.

Obviously this is a very small dataset that by no means requires distributed computing. However, given that the purpose of this article is to illustrate the PySpark machine learning API, choosing a small dataset is ideal for experimenting, especially when using cross-validation for hyper-parameter tuning as we do in this article. Using a basic machine learning algorithm and a small, reasonably clean dataset does not break new frontiers in data science, but these choices are intentional.

In order to get an idea of how well the classification may work, we plot the pairwise relationships in the dataset using sns.pairplot(df, hue='species'), which gives

Figure 1: pairwise relationships of features in dataset

With a cursory look we can see that Iris setosa is likely to be classified correctly, but we expect some difficulty in distinguishing Iris versicolor and Iris virginica.

For simplicity we will use the same dataset for both binomial and multinomial logistic regression. For binary classification we attempt to predict whether the species is Iris virginica or not.
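In pandas this recoding can be done along the following lines (the label strings are a choice made for this sketch; the original three-class column is reloaded later for the multinomial model):

import numpy as np
import seaborn as sns

df = sns.load_dataset('iris')
# Recode the target for the binary problem: Iris virginica vs. everything else
df['species'] = np.where(df['species'] == 'virginica', 'virginica', 'not virginica')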

From this point onwards all operations will take place in PySpark by converting the pandas data frame into a PySpark one
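This can be as simple as the following (the variable name sdf is used for the Spark data frame in the sketches below):

# Convert the pandas data frame into a Spark data frame; the schema is inferred
sdf = spark.createDataFrame(df)
sdf.printSchema()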

The conversion automatically produced the expected schema.

Binomial logistic regression

Preparatory work

PySpark uses transformers and estimators to transform data into machine learning features:

  • a transformer is an algorithm which can transform one data frame into another data frame
  • an estimator is an algorithm which can be fitted on a data frame to produce a transformer

The above means that a transformer does not depend on the data. A machine learning model is a transformer that takes a data frame with features and produces a data frame that also contains predictions via its .transform() method. On the other hand, an estimator has a .fit() method that accepts a data frame and produces a transformer. A pipeline in PySpark chains multiple transformers and estimators into an ML workflow. Users of scikit-learn will surely feel at home!

Going back to our dataset, we construct the first transformer to pack the four features into a vector
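A sketch of such an assembler, using the column names of the seaborn dataset:

# Pack the four continuous features into a single vector column
feature_assembler = VectorAssembler(
    inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    outputCol='features',
)
df_features = feature_assembler.transform(sdf)
df_features.select('features', 'species').show(3, truncate=False)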

The features column looks like an array but it is a vector. Conveniently, the vector assembler also populates the metadata property of the features column in the schema
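The metadata can be inspected directly on the schema and contains entries of roughly the following form:

# The vector assembler records the original column names in the schema metadata
df_features.schema['features'].metadata['ml_attr']['attrs']
# e.g. {'numeric': [{'idx': 0, 'name': 'sepal_length'}, {'idx': 1, 'name': 'sepal_width'}, ...]}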

and it is possible to retrieve the column names of the original features, although this may be more conveniently done using feature_assembler.getInputCols().

Although the features are more or less scaled, the interpretation of the fitted logistic regression coefficients will be facilitated if we ensure that all features range from 0 to 1. This can be achieved using a min-max scaler estimator
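A sketch of the scaling step:

# Scale all features to the [0, 1] range in one go
minMax_scaler = MinMaxScaler(inputCol='features', outputCol='features_scaled')
minMax_scaler_model = minMax_scaler.fit(df_features)
df_scaled = minMax_scaler_model.transform(df_features)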

In the code above, minMax_scaler_model is a transformer produced by fitting the minMax_scaler estimator to the data. It is convenient to be able to scale all continuous features in one go by using a vector. Incidentally, the pyspark.ml.feature module contains the vector_to_array() and array_to_vector() functions to interconvert vectors and arrays, so estimators like the minMax_scaler can also be used in data transformations beyond machine learning.

In principle, the features_scaled and species columns can now be used to fit a logistic regression model. However, before doing so we will introduce one more concept, the ML pipeline, which can be used to orchestrate ML workflows.
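The same two steps expressed as a pipeline might look like this:

# Chain the vector assembler and the min-max scaler into a single estimator
pipeline = Pipeline(stages=[feature_assembler, minMax_scaler])
pipeline_model = pipeline.fit(sdf)
df_scaled = pipeline_model.transform(sdf)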

The results are identical to before, but the code is more succinct. The pipeline is technically an estimator and has a .fit() method that returns a transformer. Behind the scenes, fitting a pipeline calls the .transform() method of transformers and the .fit() method of estimators, in the order in which they were introduced in the pipeline stages. In practice, we can build more than one pipeline with different transformers and estimators and experiment with building models to see the effect of our choices.

First modelling attempt

Before tuning the model and assessing its accuracy, it is useful to make a first, crude attempt to see what the chances are of ending up with a reasonable model. To do so, we do not use cross-validation and we specify only the obligatory model parameters, leaving all the rest at their defaults.

The first thing to do is to add one more pipeline stage, namely the conversion of the species column from string to numerical using a StringIndexerModel
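A sketch of such an indexer, built from an explicit list of labels so that the mapping is fixed (the label strings follow the recoding used earlier in this sketch):

# 'not virginica' -> 0.0 and 'virginica' -> 1.0, by construction
label_indexer = StringIndexerModel.from_labels(
    ['not virginica', 'virginica'], inputCol='species', outputCol='label'
)
pipeline = Pipeline(stages=[feature_assembler, minMax_scaler, label_indexer])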

as this is required later on. Alternatively, we could have used the StringIndexer estimator to create the model from the data, assigning the indices based on the frequency of the species names. We opted against this because we wanted to ensure that Iris virginica is mapped to 1.0 and not to 0.0. We then split the dataset into training and test sets.
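A sketch of the split; the 20/80 proportions match the training fraction mentioned below, while the seed value itself is arbitrary:

# 20% for training, 80% for testing; only the training set is cached
df_train, df_test = sdf.randomSplit([0.2, 0.8], seed=42)
df_train.cache()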

We specified the seed as a good practice, although in the Spark world this does not ensure deterministic behaviour due to the underlying partitioning of the data. If you are curious about this mind-boggling topic, the following experiment exemplifies the issue.
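One way to exemplify it (a sketch, not necessarily the exact experiment referred to) is to apply the same seeded split before and after a repartition and compare the selected rows:

# Same seed, different physical partitioning: the selected rows generally differ
train_a, _ = sdf.randomSplit([0.2, 0.8], seed=42)
train_b, _ = sdf.repartition(8).randomSplit([0.2, 0.8], seed=42)
print(train_a.subtract(train_b).count())  # typically non-zero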

You can read more in this article, including ideas on how to avoid this issue, e.g. by caching the whole dataset instead of only the training set (as done below). Storing the training and test sets and reading them again for building the model is another way to ensure deterministic behaviour. Note that the training dataset should in any case always be cached, given that it is used repeatedly when the model is fitted. This is likely the most typical use case for caching in Spark.

The only thing that remains is to fit the model and evaluate its accuracy. For the sake of completeness the whole code is provided so it is easier to follow along
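A sketch of the complete flow, under the assumptions made so far (two features only, labels fixed by the indexer, 20% of the data for training):

# Assemble only two of the four features to make the problem a little harder
feature_assembler = VectorAssembler(
    inputCols=['sepal_width', 'petal_width'], outputCol='features'
)
minMax_scaler = MinMaxScaler(inputCol='features', outputCol='features_scaled')
label_indexer = StringIndexerModel.from_labels(
    ['not virginica', 'virginica'], inputCol='species', outputCol='label'
)
lr = LogisticRegression(featuresCol='features_scaled', labelCol='label')
# Map the numerical predictions back to the original labels
index_to_string = IndexToString(
    inputCol='prediction', outputCol='predicted_species', labels=label_indexer.labels
)
pipeline = Pipeline(
    stages=[feature_assembler, minMax_scaler, label_indexer, lr, index_to_string]
)

df_train, df_test = sdf.randomSplit([0.2, 0.8], seed=42)
df_train.cache()

pipeline_model = pipeline.fit(df_train)
df_predictions = pipeline_model.transform(df_test)

# The fitted logistic regression stage, referred to as lr_model below
lr_model = pipeline_model.stages[3]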

For convenience we converted the predicted numerical values back to labels using pyspark.ml.feature.IndexToString(). Please note that we only used the sepal width and petal width as independent variables. In addition, we only used 20% of the dataset for training. The reason for these strange choices is that this machine learning problem is in fact very easy to tackle, and hence we almost always obtain a good model with little effort. By dropping some features and using a small training set, the computed metrics are not perfect from the outset.

We now have the predictions for the test set and can get a glimpse of how well we did.

Assessing model quality

Perhaps the most common way to obtain an idea about the performance of a binomial classification model is to compute the confusion matrix. In PySpark this is easily done manually with
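For example, with a group-by and a pivot:

# Confusion matrix: actual species vs. predicted species
confusion_matrix = (
    df_predictions
    .groupBy('species')
    .pivot('predicted_species')
    .count()
    .fillna(0)
    .toPandas()
    .set_index('species')
)
print(confusion_matrix)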

The confusion matrix can also be easily visualised using an annotated heatmap in seaborn
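For example:

import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the confusion matrix computed above
ax = sns.heatmap(confusion_matrix, annot=True, fmt='g', cmap='Blues', cbar=False)
ax.set_xlabel('predicted')
ax.set_ylabel('actual')
plt.show()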

that produces

Figure 2: Confusion matrix

If you try to follow along you may obtain slightly different results due to the different split into training and test sets.

There are many metrics that can be computed, but I list the most important that are also relevant for the ROC curve calculation later on:

  • Recall, sensitivity or true positive rate: It reflects the ability of the model to identify the positives and is defined as TP/(TP+FN)
  • Precision or positive predictive value: It shows how often a predicted positive is a true positive and is defined as TP/(TP+FP)
  • Specificity or true negative rate: It reflects the ability of the model to identify the negatives and is defined as TN/(TN+FP)
  • False positive rate: It reflects the probability of a false positive and is defined as FP/(TN+FP) = 1-specificity

These metrics are easy to calculate manually after the confusion matrix has been computed. Alternatively, the PySpark API can also be used for convenience
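A sketch of this, using the evaluation summary returned by the fitted model (the test set is first passed through the feature-engineering stages only, so that the model can add its own prediction columns):

# Prepare the test set with the first three pipeline stages (assembler, scaler, indexer)
df_test_prepared = df_test
for stage in pipeline_model.stages[:3]:
    df_test_prepared = stage.transform(df_test_prepared)

# Evaluation summary of the fitted logistic regression model on the test set
metrics = lr_model.evaluate(df_test_prepared)
print(metrics.recallByLabel)                # recall / sensitivity per label
print(metrics.precisionByLabel)             # precision per label
print(metrics.truePositiveRateByLabel)      # equal to the recall per label
print(metrics.falsePositiveRateByLabel)     # 1 - specificity per label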

The receiver operating characteristic (ROC) curve for the test set can be retrieved from metrics.roc, but we will also compute it manually using the raw probabilities. We use seaborn to visualise the results of the two approaches.
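Only the curve returned by the evaluation summary is sketched here; the manual computation from the raw probabilities is left out:

import seaborn as sns
import matplotlib.pyplot as plt

# The summary exposes the ROC curve as a data frame with FPR and TPR columns
roc_pdf = metrics.roc.toPandas()
ax = sns.lineplot(data=roc_pdf, x='FPR', y='TPR')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
plt.show()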

that produces

Figure 3: ROC curve. The curve starts at (0, 0), which corresponds to a threshold of 1, meaning that every prediction is negative. The curve ends at (1, 1), which corresponds to a threshold of 0, meaning that every prediction is positive.

I would not advise computing the ROC curve manually, for several reasons that include performance concerns. However, it is good to be cautious given that the PySpark machine learning API based on data frames is new and the documentation is not yet perfect. Checking that the results make sense is advisable to ensure that the API is used correctly. The PySpark API can also return the precision-recall (PR) curve, which is useful when the classes are very unbalanced. The purpose of the ROC curve is to select the threshold in order to achieve the desired sensitivity and specificity. The threshold can be seen as a parameter of the model, and as with all parameters it should not be calculated on the test set but on the training set, or even better, on the validation set (see next section). The PySpark API provides the ROC and PR curves for the training set through lr_model.summary.roc and lr_model.summary.pr.

The ROC curve is also used to compute the area under the ROC curve metric. The ROC curve of a perfect model approaches the top-left corner, whilst a random model approaches the diagonal (true positive rate = false positive rate). The area under the ROC curve ranges between 0 and 1 and can be computed via a BinaryClassificationEvaluator object.
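For example:

# Area under the ROC curve on the test set
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC'
)
print(evaluator.evaluate(df_predictions))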

The result is impressive, despite the attempt to hamper the model quality. The area under the ROC curve for the training set can be obtained from the model summary lr_model.summary.areaUnderROC. The BinaryClassificationEvaluator object can also be used to compute the area under the PR curve.

Cross-validation and hyper-parameter tuning

The logistic regression model, like most other models, has parameters that can be fine-tuned in order to optimise the model's accuracy and robustness. The previous section described a first modelling attempt that cut many corners. We used the default values for all parameters of the logistic regression model and we simplified model development by splitting the original dataset into two parts only, one for training and one for testing. This is a good way to start in order to obtain a first idea of what can be achieved. Once we are confident that we are likely to produce a viable model, we can use cross-validation to optimise the parameters of the model. This is an expensive operation. First of all, for a given set of model parameters we fit and evaluate the model multiple times (once per fold). Secondly, we try many different sets of model parameters. This section gives the complete code for binomial logistic regression using 4-fold cross-validation and serves as an example of how other machine learning models in PySpark can be trained and optimised.

Figure 4: Splitting the original dataset into training, validation and test sets (image by author)

Cross-validation requires three building blocks:

  • an estimator (model), which is typically packed in an ML pipeline
  • a grid with the hyper-parameters we are trying to tune
  • an evaluation metric that is essentially the objective function for the hyper-parameter tuning

In our example we use the pipeline set up earlier and fine-tune two parameters of the logistic regression model.

It may sound strange, but we will also use hyper-parameter tuning for selecting features, by adjusting the features passed to the vector assembler. This is a bit far-fetched for this example, but I opted for it to show that all parameters of all pipeline stages can in principle be included in the fine-tuning. The number of parameter combinations quickly grows out of hand, and in practice run-time constraints curb the enthusiasm.

The evaluator used is the area under the ROC curve, computed (correctly) over the validation sets and averaged over the different folds.
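A sketch of the three building blocks put together; the parameter values and feature subsets in the grid below are illustrative assumptions and not the exact grid behind the 165 combinations mentioned next:

# Hypothetical parameter grid: feature subsets plus two regularisation parameters
param_grid = (
    ParamGridBuilder()
    .addGrid(feature_assembler.inputCols, [
        ['sepal_width', 'petal_width'],
        ['petal_length', 'petal_width'],
        ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    ])
    .addGrid(lr.regParam, [0.0, 0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)

# Objective function: area under the ROC curve on the validation folds
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC'
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=4,
    collectSubModels=True,  # keep the per-fold models so they can be inspected later
    seed=42,
)
cv_model = cv.fit(df_train)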

The grid search examined 165 parameter combinations, for each of which four models were fitted and evaluated. This took ~350 sec, despite the fact that the iris dataset is very small compared to the datasets typically handled with Spark. The average area under the ROC curve for each parameter combination can be retrieved with cv_model.avgMetrics, which shows that several combinations achieved a nearly perfect metric (area under the ROC curve ~= 1). The best model can be retrieved with cv_model.bestModel (I am not sure how Spark selects the best model when two sets of parameters perform equally well, but this is less likely to happen in real-world use cases).

All the fitted models are stored and can be retrieved with cv_model.subModels[k][i], where k is the fold and i the index of the parameter set in the parameter grid. In our example, the best results were obtained for i=4, and the four models corresponding to the four folds can be obtained with [cv_model.subModels[k][4] for k in range(4)]. We should check the distribution of coefficients across the different folds, which is an indication of model stability, and even fit the whole training set once more using the optimal hyper-parameters. This goes beyond the scope of this article. We will instead use the best model returned by cv_model.bestModel with no further investigations.

Model interpretation

It is perhaps surprising that by using only the sepal and petal widths without regularisation we obtain a very good model. The coefficients and intercept of the linear model can be easily retrieved
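For example (assuming the pipeline layout of the earlier sketches, where the logistic regression model is the fourth stage):

import pandas as pd

best_pipeline_model = cv_model.bestModel
best_lr_model = best_pipeline_model.stages[3]   # the fitted logistic regression stage

# Feature names from the schema metadata written by the vector assembler
df_best = best_pipeline_model.transform(df_train)
attrs = df_best.schema['features'].metadata['ml_attr']['attrs']['numeric']
feature_names = [a['name'] for a in sorted(attrs, key=lambda a: a['idx'])]

coefficients = pd.Series(best_lr_model.coefficients.toArray(), index=feature_names)
print(coefficients)
print(best_lr_model.intercept)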

and stored in a pandas Series. Conveniently, the vector assembler stores the feature names as schema metadata that can be used to set the series index. The features had been min-max scaled, which helps with the interpretation. The petal width has roughly 4 times more impact than the sepal width.

This is a special situation: the hyper-parameter tuning happened to keep only two features, whilst no regularisation was necessary. This allows visualising the decision boundary, with the only complexity being the handling of the feature scaling.
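A sketch of such a plot, assuming the retained features are the sepal and petal widths and undoing the min-max scaling with the statistics stored in the fitted scaler (all data points are plotted here, whereas Figure 5 shows the training set only):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

w = best_lr_model.coefficients.toArray()
b = best_lr_model.intercept

# Min-max statistics of the fitted scaler (second pipeline stage), used to undo the scaling
scaler_model = best_pipeline_model.stages[1]
mins = scaler_model.originalMin.toArray()
maxs = scaler_model.originalMax.toArray()

# Decision boundary: w0*x0_scaled + w1*x1_scaled + b = 0, with x_scaled = (x - min)/(max - min)
x0 = np.linspace(df['sepal_width'].min(), df['sepal_width'].max(), 100)
x0_scaled = (x0 - mins[0]) / (maxs[0] - mins[0])
x1_scaled = -(w[0] * x0_scaled + b) / w[1]
x1 = x1_scaled * (maxs[1] - mins[1]) + mins[1]

ax = sns.scatterplot(data=df, x='sepal_width', y='petal_width', hue='species')
ax.plot(x0, x1, 'k--', label='decision boundary')
ax.legend()
plt.show()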

that produces

Figure 5: Decision boundary of best model

Figure 5 displays the training set (meaning all data points other than the test set) and we can see that there is one false negative and two false positives (one almost on the decision boundary), which is consistent with the confusion matrix for the training set. It all looks good.

Multinomial logistic regression

PySpark also supports multinomial logistic regression (softmax) and hence it is possible to predict all classes of the iris dataset in one go. We will not cover all the details because the article is already quite long. The complete code for a first attempt to fit a multinomial logistic regression model can be found below.
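A sketch of such a first attempt, reusing the structure of the binomial pipeline (the three-class target is reloaded; the indexer labels and the 20/80 split mirror the earlier choices):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Reload the original three-class target
sdf_multi = spark.createDataFrame(sns.load_dataset('iris'))

feature_assembler = VectorAssembler(
    inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    outputCol='features',
)
minMax_scaler = MinMaxScaler(inputCol='features', outputCol='features_scaled')
label_indexer = StringIndexerModel.from_labels(
    ['setosa', 'versicolor', 'virginica'], inputCol='species', outputCol='label'
)
lr = LogisticRegression(
    featuresCol='features_scaled', labelCol='label', family='multinomial'
)
pipeline = Pipeline(stages=[feature_assembler, minMax_scaler, label_indexer, lr])

df_train_multi, df_test_multi = sdf_multi.randomSplit([0.2, 0.8], seed=42)
df_train_multi.cache()

pipeline_model = pipeline.fit(df_train_multi)
df_predictions_multi = pipeline_model.transform(df_test_multi)

# Overall accuracy on the test set
evaluator = MulticlassClassificationEvaluator(
    labelCol='label', predictionCol='prediction', metricName='accuracy'
)
print(evaluator.evaluate(df_predictions_multi))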

Once more we obtain a good model with the first attempt.

