Hello Coders!
Recently I came across another useful library called
Pandas Profiling for doing exploratory data analysis in a couple of lines of code.
Pandas Profiling takes a pandas dataframe as input and performs a quick analysis of the dataset. It then produces a report with a brief summary of the dataset along with visualizations. In the generated report, you can find the types of the columns, unique values, missing values and the most frequent values. It also displays the duplicate rows and highly correlated variables.
It is something similar to the SweetViz library which I have already shared in this blog. If you haven’t read the blogpost yet, you can read it in the
Now, I will show you an example of how to use the library to generate a report.
Install the latest version of the library
pip install pandas_profiling
At the time of writing this blog, the latest version of the pandas profiling library is v3.0.0.
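Since the report layout can differ between releases, it may help to check which version you actually have installed. Here is a small sketch using only the standard library; the helper name installed_version is my own and not part of pandas profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string such as the one mentioned above, or None if not installed.
print(installed_version("pandas-profiling"))
```

If this prints None, the install step above has not been run in the current environment.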
Prepare a report
To perform EDA, I have downloaded the dataset of
Amazon Bestselling books 2009 - 2019 from Kaggle.
import pandas as pd
import pandas_profiling as pp

df = pd.read_csv("/content/bestsellers_with_categories.csv")
profile = pp.ProfileReport(df)
profile.to_file("output.html")
I will walk you through the above code now.
- Import the necessary packages - Pandas and Pandas-Profiling.
- Read the downloaded CSV file and create a dataframe.
- Input this dataframe to the
It will analyse the dataset and produce a report containing the insights.
Then save the results to the file "output.html". Let us have a quick look at the generated report.
Here we can see an overview of the dataset, such as the number of variables and the total number of records. We can also find the number of categorical and numerical variables, the number of missing cells and the number of duplicate rows.
In this Amazon Bestselling books dataset, we can see that there are seven variables and 550 rows.
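If you want to cross-check the overview numbers, roughly the same figures can be computed with plain pandas. This is only a sketch on a small made-up dataframe (not the Kaggle file), and the report may count some items slightly differently.

```python
import pandas as pd

# Hypothetical toy data for illustration, not the real bestsellers file.
df = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Author": ["Ann", "Bob", "Ann", None],
    "Price": [10, 15, 10, 8],
})

overview = {
    "variables": df.shape[1],                        # number of columns
    "rows": df.shape[0],                             # number of records
    "missing_cells": int(df.isna().sum().sum()),     # empty cells across all columns
    "duplicate_rows": int(df.duplicated().sum()),    # repeats of an earlier row
    "numeric_variables": df.select_dtypes(include="number").shape[1],
}
print(overview)
```

Running the same dictionary on the dataframe read from the CSV should give seven variables and 550 rows, matching the report.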
If you scroll down to the bottom of the report, you can see a sample of the first 10 and last 10 records of the dataset.
Next, the report gives detailed information about each variable, along with a histogram.
For the imported dataset, we can see that there are 351 unique book names, which is 63.8% of the 550 rows. There are no missing values in the name column.
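The percentage is simply the number of distinct values divided by the number of rows. Here is a short sketch of that arithmetic in pandas, using a made-up series in place of the real name column:

```python
import pandas as pd

# Hypothetical stand-in for the name column.
names = pd.Series(["Book A", "Book B", "Book A", "Book C", "Book B"], name="Name")

distinct = names.nunique()                                # distinct non-missing values
distinct_pct = round(100 * distinct / len(names), 1)      # share of all rows
missing = int(names.isna().sum())                         # missing entries
print(distinct, distinct_pct, missing)

# The same formula with the numbers quoted in the post: 351 distinct names in 550 rows.
post_pct = round(100 * 351 / 550, 1)
print(post_pct)
```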
Pandas-Profiling automatically generates interaction plots for every pair of variables. In the above diagram, we can see the interaction plot of the "User Ratings" and "Reviews" columns.
Correlation is another important feature of the report. It is used to check how every pair of variables is related.
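To get a feel for what a correlation matrix contains, here is a minimal sketch with plain pandas on made-up numbers. It uses the Pearson coefficient, which is the pandas default; the report may show other coefficients as well.

```python
import pandas as pd

# Hypothetical toy data: Price rises with Reviews, Year falls with Reviews.
df = pd.DataFrame({
    "Reviews": [100, 200, 300, 400],
    "Price": [5, 10, 15, 20],
    "Year": [2019, 2018, 2017, 2016],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Fiction"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.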
Using the missing values view of the report, we can check the number of missing values in each variable. Here all the variables have 550 records, so there are no missing values in this dataset.
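The same per-column check can be done directly in pandas. This sketch uses a small made-up dataframe with a few gaps so that the output is not all zeros:

```python
import pandas as pd

# Hypothetical toy data with some missing entries.
df = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Author": ["Ann", None, "Cy"],
    "Price": [10.0, None, None],
})

non_missing = df.count()       # records present in each column
missing = df.isna().sum()      # records missing in each column
columns_with_gaps = missing[missing > 0].index.tolist()
print(non_missing)
print(columns_with_gaps)
```

For the bestsellers dataset, every column should show a count of 550 and the list of columns with gaps should be empty.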
You can find the code and a sample report in my
- Pandas Profiling Official documentation
- Cover Image Credits - Pandas profiling documentation