Introduction To Python For Data Visualization With Seaborn

Here we will learn how to create various kinds of plots using Seaborn, one of Python’s most efficient libraries built especially for data visualization.

Data visualization, which helps us present the results of our analysis, is primarily performed with Matplotlib, a powerful and comprehensive plotting library. One point where Matplotlib suffers, however, is code length: even small visualizations can require a significant amount of code.

This is where Seaborn comes to the rescue. Seaborn is used for plotting some of the most visually pleasing data visualizations.

Introduction to Seaborn

Seaborn is widely known for its ability to make statistical graphics in Python. Matplotlib is the library that acts as the basic building block for Seaborn, along with Pandas.

The salient features that make it so popular are as follows:

  1. Seaborn is used for visualizing univariate or bivariate distributions and even for comparing them.
  2. It simplifies the representation of complex datasets.
  3. Seaborn gives us control over Matplotlib’s figure styling through its built-in themes.
  4. Seaborn lets us choose from a variety of color palettes to make our plots clearer to the viewer.
  5. Seaborn provides dedicated support for plotting categorical values.

Overall, working directly with data frames and arrays makes Seaborn convenient for whole datasets, and it produces plots that carry a lot of information. One thing to remember is that Seaborn is not a substitute for Matplotlib; rather, it complements Matplotlib and improves on some of its weak points, giving a more complete toolkit.

Seaborn Tutorial Contents

In this tutorial, we’ll look at some of the most essential plots that can be built using the Seaborn library. Throughout, we’ll learn the concepts through examples and their explanations.

The Steps for Covering this Tutorial

  1. What is Data Visualization
  2. Why Data Visualization
  3. Installing Seaborn
  4. Importing libraries and dataset
  5. Different types of plots/graphs which are built using Seaborn
    • Bar Plot
    • Box Plot
    • Scatter Plot
    • Violin Plot
    • Swarm Plot
    • Overlaying Swarm Plot with Violin plot
    • Joint Kernel Density Plot
    • Joint Plot
    • Heatmap
    • Clustermap

Step 1: What is Data Visualization?

Data visualization can be defined as the process of extracting essential information from raw or processed data and representing it pictorially for better understanding and analysis of the facts and figures. It is an amalgamation of two fields, science and art: we apply both scientific and artistic skills when making any kind of visualization.

To be more precise, data visualization is a strategy of depicting the quantitative knowledge obtained through various data wrangling processes in a graphical manner.

The main objective of data visualization is to find the best way of presenting a dataset while keeping good visualization practices in mind, and this becomes much more difficult when we have to deal with large datasets.

Step 2: Why is Data Visualization Required?

In recent times, the amount of data being generated is gigantic, and we want it to be of some significance. According to some reports, more data now moves across the internet every second than was stored on the entire internet 20 years ago.

This is because people use the internet to connect not only with each other but also with machines, so the volume of data keeps growing exponentially.

The human mind cannot deal with such volumes of data unaided, so we need a tool that helps us understand what the data is trying to convey. Data visualization serves this purpose: it condenses large datasets and presents the useful information in the form of graphs, which also helps in delivering the content to the viewer.

There are numerous plots used in data visualization, such as histograms, pie charts, box plots, word clouds and scatter plots. The list of such graphs is long, and we’ll see many of them with examples very soon in this tutorial.

Step 3: Installing Seaborn

We will be working with Seaborn, a third-party Python library, so we need to install it on the system we are working on. I used Jupyter Notebook for this tutorial and would recommend using it as well.

You can go for the following installations if you wish:

  1. Python 2.7 / Python 3 (preferred)
  2. Python libraries for data visualization
    • Pandas
    • Matplotlib
    • Seaborn
  3. Jupyter Notebook

You can download and install the Anaconda Distribution, as it comes with the required packages. You only have to follow the installation steps on its download page.

If you already have Python and want to add Seaborn to your system, you can run pip install seaborn from the command line, which installs the latest version of Seaborn.

If you opt to work in Jupyter Notebook, then once you have installed Anaconda, open Jupyter Notebook (either through the command line or the Navigator app) and create a new notebook.

Step 4: Importing Libraries and dataset

First, we import Pandas, which is used for handling tabular (relational) datasets.

After this, we import Matplotlib, which helps us customize the plots that we create using Seaborn.

NOTE: While using Jupyter Notebook, we write %matplotlib inline for displaying the plots in the notebook.

Finally, we will import the library which is the base of this tutorial i.e. Seaborn.

Now we are all set to import the dataset which we will be using for Visualization purposes.

NOTE: We have used aliases for the imported libraries: pd for Pandas, plt for Matplotlib and sns for Seaborn. This is done for convenience, so that we can invoke these libraries and their methods using the aliases instead of writing the full name of the library.
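The imports described above can be sketched as follows. This is a minimal version using the aliases named in the note; the use of the matplotlib.pyplot submodule for plt is the usual convention and is assumed here.

```python
import pandas as pd               # data handling, aliased as pd
import matplotlib.pyplot as plt   # plot customization, aliased as plt
import seaborn as sns             # statistical plotting, aliased as sns

# In a Jupyter notebook, additionally run the magic command
#   %matplotlib inline
# in a cell so that plots are rendered inside the notebook.
# (It is IPython syntax, so it is left as a comment in this plain Python sketch.)
```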

The dataset for this tutorial is a very interesting one: the Pokémon dataset. It can be downloaded for free as the file Pokemon.csv.

After downloading the dataset, remember to keep the CSV file in the same folder as your Python file or Jupyter notebook, so that there are no issues with providing the location of the dataset.

Now to import the dataset we have to execute the following code.

NOTE: Here the index_col=0 argument tells Pandas to use the first column of the dataset as the index (ID) column.

Once we have imported the dataset, we can see what it looks like by calling the head() function of df (the dataframe). We can pass a number to head() to view more or fewer rows. The default is 5.
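A small sketch of loading and previewing the data is given here. Since the real Pokemon.csv is not bundled with this text, the example reads three illustrative rows from an in-memory string; the column layout shown is an assumption about the file, and with the real file you would pass its name to read_csv instead.

```python
import io
import pandas as pd

# Illustrative stand-in for Pokemon.csv: three rows in the assumed column layout.
csv_text = """#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
"""

# With the real file in the working directory this would be:
#   df = pd.read_csv("Pokemon.csv", index_col=0)
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.head())    # up to the first 5 rows (only 3 exist in this stand-in)
print(df.head(2))   # pass a number to see more or fewer rows
```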

Step 5: Let’s start plotting various graphs and visualizations.

Bar Plot

We use Seaborn’s built-in countplot() function to plot the bar graph, passing ‘Type 1’ as the value for the x-axis and ‘df’ as the value for data.

This is the output we obtain when we execute the above code, where the x-axis is labelled Type 1 and the y-axis is labelled count. But you may have noticed that the x-axis values are not readable due to the lack of space. To get rid of this problem we can try the following.

Here we have made some additions to the previous code to get a cleaner bar plot. We have used Matplotlib’s ‘xticks’ function with ‘rotation’ set to 70, which tilts the x-axis values by 70 degrees and makes them clearly visible. Moreover, we have passed another argument to countplot(), palette, with the value ‘rainbow’, which colors the bars in rainbow colours.

Along with this, we have used Matplotlib’s ‘rcParams’ settings to increase the size of the x-axis values and of the axis labels.

We can clearly see that the x-axis values are now presented neatly and the labels of both axes are clearly visible.
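The cleaned-up bar plot can be sketched roughly as below. The tiny dataframe is a stand-in for the Pokémon data, and the font sizes are arbitrary illustrative choices. Passing hue="Type 1" and dodge=False is my own addition rather than part of the original description: recent Seaborn releases deprecate passing palette without hue, and this form is intended to work on both older and newer versions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny stand-in for the Pokémon data; only the 'Type 1' column matters here.
df = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Grass", "Water", "Grass"]})

# Larger tick labels and axis labels (sizes chosen only for illustration).
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["axes.labelsize"] = 16

# hue mirrors x so that the palette is applied per bar; dodge=False keeps bars centered.
ax = sns.countplot(x="Type 1", hue="Type 1", data=df, palette="rainbow", dodge=False)
if ax.get_legend() is not None:   # drop the redundant legend if one was drawn
    ax.get_legend().remove()

plt.xticks(rotation=70)           # tilt the x-axis values by 70 degrees
# plt.show()                      # uncomment when running as a script
```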

Uses:
A bar plot is used to compare categories of data. It can be drawn either vertically or horizontally.

Box Plot

A box plot is the visual representation of the statistical five-number summary of a given dataset, i.e. minimum, first quartile, median, third quartile and maximum.

Box Plot representation

For plotting the box plot we have used Seaborn’s boxplot() function, but we can see that for some columns the result is not meaningful. We will therefore remove the columns that are redundant, like ‘Total’ (since we have the individual stats), and the ones that are not combat stats, i.e. ‘Legendary’ and ‘Generation’.

NOTE: The black dots which are visible are the outliers in the dataset.

Now we have removed the ‘#’, ‘Total’, ‘Legendary’ and ‘Generation’ columns from the dataset by using the drop() method of df, specifying the axis argument so that the labels are looked up among the columns rather than the rows.

Uses:
It is used in exploratory data analysis to show the shape of the distribution of the data.

Scatter Plot

Here we have used Seaborn’s lmplot() function for creating the scatter plot, providing ‘Sp. Atk’ and ‘Sp. Def’ as the x-axis and y-axis values respectively and ‘df’ as the value of the ‘data’ parameter.

We were looking to draw a scatter plot, but we have also obtained a regression line, which is why we see a diagonal line. This is because lmplot() is designed to fit and plot a regression line on top of the scatter points. (Recent versions of Seaborn also provide a dedicated scatterplot() function.)

To remove the regression line, we can set the ‘fit_reg’ argument to ‘False’. Moreover, we can use the ‘hue’ parameter to add a third dimension of information to the graph: with ‘Type 1’ as the value for ‘hue’, the points are colored by type.
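The following is a minimal sketch of such a call, using a synthetic dataframe in place of the Pokémon data (the numbers are random and only the three column names are taken from the text).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Pokémon data (random values, not real stats).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Type 1": np.repeat(["Grass", "Fire", "Water"], 20),
    "Sp. Atk": rng.integers(20, 130, 60),
    "Sp. Def": rng.integers(20, 130, 60),
})

# fit_reg=False suppresses the regression line; hue colors the points by type.
grid = sns.lmplot(x="Sp. Atk", y="Sp. Def", data=df, fit_reg=False, hue="Type 1")
```

Note that lmplot() returns a grid object holding the figure rather than a single axes object, which is why the result is named grid here.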

We can clearly view the different types of Pokémon and their special attack and special defense values through this scatter plot.

Uses:
Scatter plots help us find potential relationships between values and can help in detecting outliers in a dataset. They also give a compact overview of a large dataset.

Violin Plot

A violin plot is built using Seaborn’s violinplot() function. Before this, we can set the background style of the plot with the ‘set_style’ function, which here has been given the value ‘whitegrid’; other possible values are ‘dark’, ‘white’, ‘darkgrid’ and ‘ticks’.

Then we pass the x-axis and y-axis values and customize the plot using Matplotlib functions.
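A sketch of these steps is given below. The data is synthetic, the choice of ‘Type 1’ and ‘Attack’ as the axes is an assumption carried over from the later plots, and the title is a hypothetical example of a Matplotlib customization.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Pokémon data (random values, not real stats).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Type 1": np.repeat(["Grass", "Fire", "Water"], 20),
    "Attack": rng.integers(20, 130, 60),
})

sns.set_style("whitegrid")   # background theme, set before plotting
ax = sns.violinplot(x="Type 1", y="Attack", data=df)

# Example of a Matplotlib customization (hypothetical title).
plt.title("Attack by primary type")
```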

Uses:
They represent a large amount of information effectively in a small amount of space.

Swarm Plot

To plot a swarm plot, we use Seaborn’s ‘swarmplot()’ function with ‘Type 1’ and ‘Attack’ as the values. Before that, we set the size of the plot using Matplotlib’s ‘figure()’ function.
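In code, this could look roughly like the sketch below. The data is synthetic and the figure size of 10 by 6 inches is an arbitrary assumption, since the original size is not stated.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Pokémon data (random values, not real stats).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Type 1": np.repeat(["Grass", "Fire", "Water"], 20),
    "Attack": rng.integers(20, 130, 60),
})

plt.figure(figsize=(10, 6))   # set the figure size first (assumed size)
ax = sns.swarmplot(x="Type 1", y="Attack", data=df)
```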

Uses:
It gives a better representation of the distribution of values. It can be built on its own but is also a good complement to a box or violin plot.

Overlaying Swarm and Violin Plots

Overlaying swarm and violin plots of the same values can help us analyse the data in a much more efficient manner.

First, to make this plot, we make the figure a bit larger by using Matplotlib’s figure() function. After this, we use the violinplot() function to plot the violin plot with the ‘Type 1’ and ‘Attack’ values, and ‘df’ is given as the value of data. We also pass the ‘inner’ argument with ‘None’ as the value to remove the inner bars inside the violins.

Next, we plot the swarm plot using the swarmplot() function with the same values for the x-axis and y-axis. We also provide the title of the plot using Matplotlib’s title() function.

In this plot, we can see the swarm plot over the violin plot. But the points of the swarm plot are not clearly visible, so we will try to fix this using the following code:

In the above code, to make the swarm plot points visible we use two arguments, ‘color’ and ‘alpha’. We have specified the color value as ‘k’, which represents black, and ‘alpha’, which controls the transparency, is given the value ‘0.6’.
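The overlay described above can be sketched as follows. The data is synthetic, and both the figure size and the title text are hypothetical placeholders. Because both calls draw on the current axes, the swarm points end up on top of the violins.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Pokémon data (random values, not real stats).
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "Type 1": np.repeat(["Grass", "Fire", "Water"], 20),
    "Attack": rng.integers(20, 130, 60),
})

plt.figure(figsize=(10, 6))   # slightly larger figure (assumed size)

# Violins without the inner bars.
sns.violinplot(x="Type 1", y="Attack", data=df, inner=None)

# Black, semi-transparent points drawn on the same axes.
ax = sns.swarmplot(x="Type 1", y="Attack", data=df, color="k", alpha=0.6)

plt.title("Attack by Type")   # hypothetical title
```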

Now we can clearly see the points of swarm plot over the violin plots. Therefore, this is our final plot where the swarm plot is over the violin plot.

Uses:
It helps in visualizing the distribution of data and the probability density.

Joint Kernel Density Plot

This plot shows how the values of the two columns under consideration are jointly distributed.

For plotting the joint kernel density plot, we start with the styling, which is done through Seaborn and Matplotlib. After that, we use Seaborn’s kdeplot() function. Note that the arguments to kdeplot() are passed differently from the other plotting functions: here we take the columns directly from the dataframe (df) to supply the values for both axes.

Uses:
It estimates the probability density function (PDF) of the data, giving a smoothed picture of where the values are concentrated.

Joint Plot

A joint plot is built using Seaborn’s jointplot() function, to which we provide the x-axis and y-axis values. Along with these we give the ‘kind’ argument to specify the type of plot drawn on the joint axes; here we have given the value ‘scatter’, and we have also specified the ‘color’ value as ‘g’, i.e. green. Other values for ‘kind’ include ‘reg’, ‘resid’ and ‘hex’.

Uses:
Through joint plot, we get the liberty to use two plots for representation of the same data which helps in a better analysis.

Heatmap

For plotting the heatmap we will use a different dataset, ‘flights’, which is one of the example datasets that Seaborn provides, and we will load it using Seaborn itself.

Loading the dataset using Seaborn

Here we have used the load_dataset function to load the ‘flights’ dataset for Visualization.

Displaying the flights dataset.

 

Creating a pivot table.

To plot the heatmap, we first need to reshape the data into a matrix. This is done through the pivot() function, to which we pass month and year as the row and column labels respectively and passengers as the cell values.
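The reshaping step can be illustrated with a small sketch. The long-format table below is a made-up stand-in with the same three column names as the flights data (the passenger numbers are generated by a formula, not real counts); with the real data you would start from the frame returned by Seaborn’s load_dataset function.

```python
import pandas as pd

# Made-up long-format stand-in for the flights data: one row per (year, month).
months = ["Jan", "Feb", "Mar", "Apr"]
years = [1949, 1950, 1951]
flights_long = pd.DataFrame(
    [(y, m, 100 + 20 * i + 5 * j)
     for i, y in enumerate(years)
     for j, m in enumerate(months)],
    columns=["year", "month", "passengers"],
)

# An ordered categorical keeps the months in calendar order after pivoting.
flights_long["month"] = pd.Categorical(
    flights_long["month"], categories=months, ordered=True
)

# Rows become months, columns become years, cells hold the passenger counts.
flights = flights_long.pivot(index="month", columns="year", values="passengers")
print(flights)
```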

Plotting the Heatmap

Now to plot the heatmap, we use Seaborn’s heatmap() function, passing the pivoted flights data as one argument and the color map of the heatmap as ‘OrRd’, i.e. Orange and Red.

NOTE: There are many cmap values which can be passed for trying out.

In the heatmap we can see that as the passenger count gets higher the color becomes more intense, and for lower values the color is lighter in shade.

To make the heatmap more informative we can add some more arguments to the heatmap() function. Setting the ‘annot’ argument to ‘True’ displays the value in each cell, and the ‘fmt’ argument is used to display those values in a decimal format. Lastly, to make it neater we draw borders between the cells using the ‘linewidths’ argument, which can be given different values as per our requirements.
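Put together, an annotated heatmap might look like the sketch below. The small matrix is a made-up stand-in for the pivoted flights table, fmt="d" is assumed as the decimal integer format, and the linewidths value of 0.5 is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the pivoted flights table (months as rows, years as columns).
months = ["Jan", "Feb", "Mar", "Apr"]
years = [1949, 1950, 1951]
flights = pd.DataFrame(
    [[100 + 20 * i + 5 * j for i in range(len(years))] for j in range(len(months))],
    index=months,
    columns=years,
)

ax = sns.heatmap(
    flights,
    cmap="OrRd",      # orange-red color map: higher values are darker
    annot=True,       # write the value inside each cell
    fmt="d",          # format the annotations as decimal integers
    linewidths=0.5,   # thin borders between cells (assumed width)
)
```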

Uses:
Since it is a 2-dimensional figure, it helps in visualizing complex data. We can even use it to represent large datasets.

Clustermap

We’ll end this tutorial with another graph, the clustermap, which is built using Seaborn’s clustermap() function. A clustermap is very similar to a heatmap; the difference is that rows and columns with similar values are grouped together (clustered), as is evident from its name.

Here again, we have used Matplotlib for designing the figure. The other arguments are similar to the previous heatmap example.

Here you can see that the months are not in calendar order. This is because the clustermap has grouped the months on the basis of similar values, i.e. into clusters. Similarly, the years are clustered by the same criterion.

We can also control the clustering in a clustermap by using the ‘col_cluster’ and ‘row_cluster’ arguments, which take True/False values. In the above example, we have set col_cluster to ‘False’, resulting in no clustering for columns. Secondly, we have used standard_scale to standardize our dataset either by rows or by columns: when we pass ‘0’ each row is standardized, and when we pass ‘1’ each column is standardized.
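A sketch of a clustermap with these two options is given below, again on a made-up stand-in for the pivoted flights table. It assumes SciPy is installed, since clustermap() relies on it for the hierarchical clustering; the figure size is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the pivoted flights table (months as rows, years as columns).
months = ["Jan", "Feb", "Mar", "Apr"]
years = [1949, 1950, 1951]
flights = pd.DataFrame(
    [[100 + 20 * i + 5 * j for i in range(len(years))] for j in range(len(months))],
    index=months,
    columns=years,
)

grid = sns.clustermap(
    flights,
    cmap="OrRd",
    col_cluster=False,   # keep the columns (years) in their original order
    standard_scale=1,    # rescale each column to the 0-1 range
    figsize=(6, 6),      # assumed size
)

# The rescaled (and row-reordered) data that was actually drawn.
print(grid.data2d)
```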

Here at the top-left corner, we can see that the range of the color scale has changed from 200 to 600 in the previous example to 0.0 to 1.0; this is how the values are standardized.

Uses:
With the ability to standardize data, a clustermap can represent datasets whose columns differ hugely in scale, making them easier to analyze.

We have reached the end of this Introduction to Data Visualization with Seaborn. I hope you liked it and, most importantly, learned from it.
