5 Python Libraries to Learn to Start Your Data Science Career | by Federico Trotta | Dec, 2022

By Jessie Hobb On Dec 3, 2022

Master these libraries for a smoother career path

If you want to study Python for Data Science to start a new career, I’m sure you are struggling with everything there is to know and master. I know you are overwhelmed by all the new concepts, including the mathematics you should know, and you may feel you’ll never reach your goal of landing a new job.

I know: job descriptions do not help with that. It really seems like Data Scientists must be aliens; even juniors, sometimes.

In my opinion, an important skill to master is learning to overcome the fear of “I have to know everything”. Believe me: especially at the beginning, if you are pursuing a junior position, you absolutely do not have to know everything. Well, to tell the truth: even seniors do not really know everything.

So, if you want to start a career in Data Science, in this article I’ll show you five Python libraries you absolutely have to know.

As we can see on their website, Anaconda is:

The world’s most popular open-source Python distribution platform

Anaconda is a Python distribution created specifically for Data Science, so strictly speaking it is not a library. However, in software development a library is a collection of related modules, and Anaconda bundles all the must-haves for Data Scientists, including the most used packages. For this reason we can think of it as a library, and it is a must-have for you too.

The first important thing provided by Anaconda is Jupyter Notebook, which is:

the original web application for creating and sharing computational documents. It offers a simple, streamlined, document-centric experience.

Jupyter Notebook is a web application that runs locally on your machine and was created specifically for Data Scientists. The main characteristic that makes it attractive (and very useful) for Data Scientists is that every cell can be run on its own, giving us the possibility to:

Do mathematical and coding experiments in independent cells, without affecting the whole code.
Write text, if needed, in Markdown cells; this makes Jupyter Notebooks the perfect environment to present scientific works with your code (so, you can forget LaTeX environments, if you want).

To get started with Jupyter Notebooks, I advise you to read this guide here.

Then, when you gain experience, you may need some shortcuts to speed up your work. You can use this guide here.

Also, as said before, Anaconda provides us with all the packages needed for Data Science, so we don’t have to install them ourselves. For example, say you need “pandas”; without Anaconda, you need to install it by typing $ pip install pandas in your terminal. With Anaconda you don’t have to do that because pandas comes preinstalled. A big advantage!

Pandas is a library that lets you import, manipulate and analyze data. On their website, they say that:
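If you are not sure whether a package is already available in your environment, you can check from Python itself. The snippet below is a minimal sketch that uses only the standard library; the list of names is my own choice (these are the import names of the libraries discussed in this article), and you can change it as you like.

```python
from importlib.util import find_spec

# Import names of the libraries covered in this article
# (note that scikit-learn is imported as "sklearn").
packages = ["pandas", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a top-level package cannot be found.
availability = {name: find_spec(name) is not None for name in packages}

for name, installed in availability.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If something is reported as missing, that is the moment to install it from the terminal.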

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

If you want to work with data you absolutely need to master Pandas because, nowadays, it is widely used by Data Scientists and Analysts.

The power of Pandas lies in the fact that this library lets us work with tabular data. In statistics, tabular data refers to data that is organized in a table with rows and columns. We typically refer to a table of this kind as a data frame.

This is important because we work with tabular data in a lot of situations; for example:

With Excel files.
With CSV files.
With databases.

A data frame is the representation of tabular data. Image from the pandas website here: https://pandas.pydata.org/docs/getting_started/index.html

The reality of many firms is that, regardless of your role, you’ll always have to deal, somehow, with data in Excel/CSV files and/or in databases; this is why Pandas is a fundamental resource for you to master.

Also, consider that you can even pull data from databases directly into your Jupyter Notebooks for further analysis in Pandas. We can do so using a library called pyodbc. Take a look at that here.
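To give you an idea of what importing, manipulating and analyzing data looks like, here is a small sketch. The CSV content, the column names and the numbers are invented for illustration; in real life you would pass the path of your file to read_csv instead of the in-memory text used here.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file on disk.
csv_text = """name,department,salary
Anna,Sales,42000
Luca,Sales,39000
Marta,IT,51000
Paolo,IT,48000
"""

# Import: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Manipulate: keep only the rows with a salary above 40000.
high_earners = df[df["salary"] > 40000]

# Analyze: average salary per department.
mean_by_department = df.groupby("department")["salary"].mean()

print(df.head())
print(high_earners)
print(mean_by_department)
```

For Excel files the idea is the same, with pd.read_excel in place of pd.read_csv.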
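Here is a sketch of the idea. Please note that I am not using pyodbc here: to keep the example runnable without a database server, I use sqlite3 from the Python standard library as a stand-in, with an in-memory database and a made-up orders table. With pyodbc the connection object would come from your own database instead, but the Pandas part stays conceptually the same.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a made-up table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Anna", 20.5), (2, "Luca", 35.0), (3, "Anna", 12.0)],
)
conn.commit()

# read_sql_query runs the query and returns the result as a data frame.
orders = pd.read_sql_query("SELECT customer, amount FROM orders", conn)
conn.close()

# From here on it is ordinary Pandas: total amount per customer.
totals = orders.groupby("customer")["amount"].sum()
print(totals)
```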

After data manipulation and analysis with Pandas, you typically want to make some plots. This can be done with matplotlib, which is:

a comprehensive library for creating static, animated, and interactive visualizations in Python

Matplotlib is the first plotting library I advise you to use, because it is widely used and, in my opinion, it helps you gain coding experience.

Matplotlib helps us create the most important plots we may need:

Statistical plots like histograms or bar charts.
Scatterplots.
Boxplots.

And many more. You can start with Matplotlib here, using their tutorials.
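As a quick taste, the sketch below draws a histogram, a scatterplot and a boxplot side by side. The data is randomly generated by me just to have something to plot (the "height" and "weight" names are invented), so do not read anything into the numbers.

```python
import random

import matplotlib.pyplot as plt

# Randomly generated, made-up data just to have something to plot.
random.seed(0)
heights = [random.gauss(170, 10) for _ in range(200)]
weights = [0.9 * h - 85 + random.gauss(0, 5) for h in heights]

# One figure with three plots side by side.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: distribution of a single variable.
axes[0].hist(heights, bins=20)
axes[0].set_title("Histogram")
axes[0].set_xlabel("height")

# Scatterplot: relationship between two variables.
axes[1].scatter(heights, weights, s=10)
axes[1].set_title("Scatterplot")
axes[1].set_xlabel("height")
axes[1].set_ylabel("weight")

# Boxplot: one box per variable.
axes[2].boxplot([heights, weights])
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(["height", "weight"])
axes[2].set_title("Boxplot")

fig.tight_layout()
```

In a Jupyter Notebook the figure is displayed right below the cell; in a plain script you would add plt.show() at the end.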

At a certain point, when you’ve gained experience in analyzing data, you may not be completely satisfied with Matplotlib; in my experience, this is mainly because advanced plots require a lot of code in Matplotlib. This is where Seaborn may help you. Seaborn, in fact:

is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

But what does it mean that Seaborn mainly helps us with advanced plots, letting us write less code than matplotlib? For example, say you have some data regarding people tipping waiters. We want to plot a graph of the total bill and the tip, but we also want to show whether the people were smokers or not and whether they were at the restaurant at dinner or at lunch. We can do so like this:

# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load the dataset
tips = sns.load_dataset("tips")

# Create the visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

And we get:

The visualization of the data coded above. The image is taken from one tutorial on the Seaborn website here: https://seaborn.pydata.org/tutorial/introduction.html

So, as we can see, with very few lines of code we can achieve a great result thanks to Seaborn.

So, a question may arise: “should I use Matplotlib or Seaborn?”

My advice is to start with Matplotlib and then move to Seaborn when you’ve gained some experience because the reality is that, most of the time, we use both Matplotlib and Seaborn (because remember: Seaborn is based on Matplotlib).
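To show what using both together can look like, here is a minimal sketch: Seaborn draws the plot, and Matplotlib is used to create the figure and to customize the title and the axis labels. The small data frame is invented by me (it only mimics the tips data), so that the example does not need to download anything.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up data frame that mimics the tips data.
data = pd.DataFrame(
    {
        "total_bill": [10.5, 22.0, 15.3, 30.1, 18.7, 25.4],
        "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.0],
        "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Lunch"],
    }
)

sns.set_theme()

# Matplotlib creates the figure and the axes...
fig, ax = plt.subplots(figsize=(6, 4))

# ...Seaborn draws on those axes...
sns.scatterplot(data=data, x="total_bill", y="tip", hue="time", ax=ax)

# ...and Matplotlib methods customize the result.
ax.set_title("Tip versus total bill")
ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
```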

The main thing that distinguishes a Data Analyst from a Data Scientist is the ability to use Machine Learning (ML). Machine Learning is the branch of Artificial Intelligence that focuses on the use of data and algorithms to make classifications or predictions.

In Python, ML models can be created and trained using a library called scikit-learn (often shortened to sklearn, the name used to import it), which is a library of:

Simple and efficient tools for predictive data analysis.

As a Data Scientist, much of your Machine Learning work will be done in scikit-learn, and this is why it is fundamental for you to master at least the basics of this library.
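The basics usually boil down to a few steps: split the data, fit a model, make predictions and measure how good they are. Below is a minimal sketch of these steps. The choices are mine and only illustrative: I use the iris dataset that ships with the library and a logistic regression classifier, but many other models follow the same fit/predict pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with the library: features X and labels y.
X, y = load_iris(return_X_y=True)

# Hold out part of the data to evaluate the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train (fit) the model on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict the labels of the test set and measure the accuracy.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on the test set: {accuracy:.2f}")
```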

I have presented the libraries in the order in which I advise you to learn them. So, first of all, install Anaconda to set up the environment and gain experience with Python, using Jupyter Notebooks. Then, start analyzing data with Pandas. Then visualize data with Matplotlib first and then with Seaborn. Finally, use scikit-learn for Machine Learning.