Hello Coders!

Recently I came across another useful library called Pandas Profiling for doing exploratory data analysis in a couple of lines of code.

Pandas Profiling takes a pandas dataframe as input and performs a quick analysis of the dataset. It then produces a report with a brief summary of the dataset along with visualizations. In the generated report, you can find the column types, unique values, missing values and the most frequent values. It also flags duplicate rows and highly correlated variables.

It is similar to the SweetViz library, which I have already covered on this blog. If you haven’t read that blog post yet, you can read it at the link.

Now, I will show you an example of how to use the library to generate a report.

Installation

Install the latest version of the library:

pip install pandas-profiling

At the time of writing this blog, the latest version of the Pandas Profiling library is v3.0.0.

Prepare a report

To perform EDA, I have downloaded the dataset of Amazon Bestselling books 2009 - 2019 from Kaggle.

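import pandas as pd
import pandas_profiling as pp

# Load the downloaded Kaggle CSV into a dataframe
df = pd.read_csv("/content/bestsellers_with_categories.csv")

# Analyse the dataframe and write the report to an HTML file
profile = pp.ProfileReport(df)
profile.to_file("output.html")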

I will walk you through the above code now.

  1. Import the necessary packages - Pandas and Pandas-Profiling.
  2. Read the downloaded CSV file and create a dataframe.
  3. Pass this dataframe to ProfileReport().

It will analyse the dataset and produce a report containing the insights.

Then save the results to the file “output.html”. Let us have a quick look at the generated report.
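The steps above can also be wrapped in a small helper. This is only a sketch: the function name build_report, its parameters and the report title are my own choices for illustration, not part of the library. It assumes the title and minimal options of ProfileReport; minimal mode switches off the more expensive computations, which can help on larger datasets.

```python
import pandas as pd


def build_report(df: pd.DataFrame, out_path: str = "output.html", minimal: bool = False) -> str:
    """Profile `df` and write the HTML report to `out_path`.

    Hypothetical helper for illustration; not part of pandas-profiling.
    """
    # Imported inside the function so the helper can be defined
    # even before the library is installed.
    from pandas_profiling import ProfileReport

    # The title below is an arbitrary example value.
    profile = ProfileReport(df, title="Amazon Bestsellers Report", minimal=minimal)
    profile.to_file(out_path)
    return out_path
```

Calling build_report(df) on the dataframe from the example writes output.html, and build_report(df, minimal=True) does the same with the lighter configuration.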

Overview

Pandas Profiling Report Overview

Here we can see an overview of the dataset, such as the number of variables and the total number of records. We can also find the number of categorical and numerical variables, the number of missing cells and the number of duplicate rows.

In this Amazon Bestselling books dataset, we can see that there are seven variables and 550 rows.
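These overview figures can be cross-checked with plain pandas. A minimal sketch, using a tiny made-up dataframe rather than the Kaggle file; the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up sample (not the real dataset):
# one duplicate row and one missing cell.
df = pd.DataFrame(
    {
        "Name": ["Book A", "Book B", "Book B", "Book C"],
        "Price": [10.0, 8.0, 8.0, None],
        "Year": [2009, 2010, 2010, 2011],
    }
)

n_rows, n_variables = df.shape               # records and variables
missing_cells = int(df.isna().sum().sum())   # total missing cells
duplicate_rows = int(df.duplicated().sum())  # rows repeating an earlier row

print(n_variables, n_rows, missing_cells, duplicate_rows)
```

On the real dataframe, the same four lines should agree with the numbers shown in the Overview tab.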

Sample Records

Pandas Profiling Report Sample Records

If you scroll down to the bottom of the report, you can see a sample of the first 10 and last 10 records of the dataset.

Variables

Pandas Profiling Report Variables

Next, we will have detailed information about each and every variable along with a histogram.

For the imported dataset, we can see that there are 351 unique book names, which is 63.8% of the 550 rows. There are no missing values in the name column.
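The 63.8% figure is simply the number of distinct names divided by the number of rows. Here is a quick check of that arithmetic, plus a sketch of the same calculation on a pandas column; the small Series is made up for illustration.

```python
import pandas as pd

# Figures quoted from the report: 351 distinct names in 550 rows.
pct_unique = round(351 / 550 * 100, 1)
print(pct_unique)  # 63.8

# The same calculation on a made-up column.
names = pd.Series(["Book A", "Book B", "Book B", "Book C"])
distinct = names.nunique()                     # number of distinct values
share = round(distinct / len(names) * 100, 1)  # distinct values as % of rows
print(distinct, share)
```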

Interactions

Pandas Profiling Report Interactions

Pandas-Profiling automatically generates interaction plots for every pair of variables. In the above diagram, we can see the interaction plot of the “User Rating” and “Reviews” columns.

Correlation

Pandas Profiling Report Correlation

Correlation is another important feature of the report, used to check how each pair of variables is related.
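For a quick look at the same relationships outside the report, pandas can compute a correlation matrix directly. This is a sketch on made-up numbers; it uses pandas' default Pearson coefficient, so it may not match every matrix shown in the report.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame(
    {
        "Reviews": [100, 200, 300, 400],
        "Price": [5, 10, 15, 20],          # rises in step with Reviews
        "Year": [2012, 2011, 2010, 2009],  # falls as Reviews rises
        "Genre": ["Fiction", "Non Fiction", "Fiction", "Fiction"],
    }
)

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.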

Missing Values

Pandas Profiling Report Missing Values

Using this, we can check the number of missing values in each variable. Here, every variable has 550 records, so there are no missing values in this dataset.
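The per-variable counts behind this chart can be reproduced with DataFrame.count(), which returns the number of non-missing values in each column. A sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up sample with one missing price.
df = pd.DataFrame(
    {
        "Name": ["Book A", "Book B", "Book C"],
        "Price": [10.0, None, 12.0],
    }
)

non_missing = df.count()                         # non-missing values per column
missing = len(df) - non_missing                  # missing values per column
complete = missing[missing == 0].index.tolist()  # columns with no gaps

print(missing.to_dict(), complete)
```

A column is complete when its count equals the number of rows, which is the case for every column in the bestsellers dataset.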

You can find the code and a sample report in my GitHub repository.

References:

  1. Pandas Profiling Official documentation https://pandas-profiling.github.io/pandas-profiling/docs/master/
  2. Cover Image Credits - Pandas profiling documentation