Data Visualization Cheat Sheet with Seaborn and Matplotlib | #14
Introduction
Exploratory Data Analysis (EDA) is an indispensable step in data mining. To interpret aspects of a data set such as its distribution, principal features, or the inferences it supports, we need to visualize the data in different graphs or images. Fortunately, Python offers many libraries that make visualization more convenient than ever. Some of the most widely used today are Matplotlib, Seaborn, Plotly and Bokeh.
Since my job concentrates on scrutinizing data from every angle, I have been exposed to many types of graphs. However, because there are so many functions and the code is not easy to remember, I sometimes forget the syntax and have to review or search for similar code on the Internet. This has wasted a lot of my time, hence my motivation for writing this article. Hopefully, it can be a small help to anyone who has the memory of a goldfish like me.
Data Description
My data comes from a public Kaggle dataset. It is a grocery dataset, and you can easily get it from the link below:
Groceries dataset
Dataset of 38765 rows for Market Basket Analysis
www.kaggle.com
This grocery data consists of 3 columns, which are:
- Member_number: id numbers of customers
- Date: date of purchasing
- itemDescription: Item name
Now, let’s have a look at the data frame and its information:
Figure 1: Data frame
Figure 2: Data’s description
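The original code for these two figures is not shown, so here is a minimal sketch of how the data frame and its description could be inspected. The sample rows are hypothetical stand-ins for the Kaggle CSV, and the day-month-year date strings are an assumption about the file's format.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Kaggle file. With the real
# download you would load it with pd.read_csv(<path to the CSV>) instead.
df = pd.DataFrame({
    "Member_number": [1808, 2552, 2300, 1187, 3037],
    "Date": ["21-07-2015", "05-01-2015", "19-09-2015", "12-12-2015", "01-02-2015"],
    "itemDescription": ["tropical fruit", "whole milk", "pip fruit",
                        "other vegetables", "whole milk"],
})

print(df.head())  # first rows of the data frame
df.info()         # column names, non-null counts and dtypes
```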
Import necessary packages
There are some packages that we should import first.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Visualize data
Line Chart
For this section, I will use a line graph to visualize the grocery store’s sales over the two years 2014 and 2015.
First, I will transform the data frame a bit to get the items counted by month and year.
Figure 3: Items Counted by Month-Year
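One possible way to do this transformation is sketched below. It is not necessarily the original code: the sample rows are made up, the date format is assumed to be day-month-year, and the names month_year and items_count are my own.

```python
import pandas as pd

# Hypothetical sample rows with the same columns as the grocery data.
df = pd.DataFrame({
    "Member_number": [1001, 1002, 1003, 1001, 1004, 1002, 1005],
    "Date": ["01-01-2014", "15-01-2014", "03-02-2014", "20-06-2014",
             "11-01-2015", "25-03-2015", "26-03-2015"],
    "itemDescription": ["whole milk", "soda", "whole milk", "yogurt",
                        "soda", "whole milk", "rolls/buns"],
})

# Parse the dates (day-month-year is an assumption about the file).
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# Reduce each date to its month and count the items bought in that month.
df["month_year"] = df["Date"].dt.to_period("M")
items_by_month = (
    df.groupby("month_year")["itemDescription"]
      .count()
      .reset_index(name="items_count")
)
print(items_by_month)
```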
After we have our data, let’s try to visualize it:
Figure 4: Line Chart of Items Counted by Month-Year
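A line chart like this can be drawn with plain Matplotlib. The sketch below uses made-up monthly counts rather than the real figures, and the column names are assumptions. In a notebook with %matplotlib inline the figure renders on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly counts, for illustration only (not the real numbers).
items_by_month = pd.DataFrame({
    "month_year": pd.period_range("2014-01", periods=6, freq="M").astype(str),
    "items_count": [1500, 1450, 1600, 1620, 1580, 1700],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(items_by_month["month_year"], items_by_month["items_count"], marker="o")
ax.set_xlabel("Month-Year")
ax.set_ylabel("Items counted")
ax.set_title("Items Counted by Month-Year")
ax.tick_params(axis="x", labelrotation=45)  # keep the month labels readable
fig.tight_layout()
```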
Bar Chart
A bar chart is used to show how values change over time or to compare figures across categories. Bar charts usually have two axes: one holds the objects or factors being analyzed, and the other holds their values.
For this dataset, I will use a bar chart to visualize the 10 best-selling categories in 2014 and 2015. You can display it as either a horizontal or a vertical bar chart. Let’s see how it looks.
Data Transformation
Figure 4: Items Counted by Categories
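A sketch of how the categories could be counted and cut down to the best sellers. The rows are hypothetical, and TOP_N is set to 3 only because the sample is tiny (the article uses 10).

```python
import pandas as pd

# Hypothetical purchases; only the item column matters for this step.
df = pd.DataFrame({
    "itemDescription": ["whole milk"] * 4 + ["other vegetables"] * 3
                       + ["rolls/buns"] * 2 + ["soda"],
})

TOP_N = 3  # the article uses 10; 3 keeps the toy example small

# value_counts() sorts categories from most to least frequent.
top_items = (
    df["itemDescription"]
      .value_counts()
      .head(TOP_N)
      .rename_axis("itemDescription")
      .reset_index(name="items_count")
)
print(top_items)
```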
Horizontal Bar Chart
Figure 5: Horizontal Bar Chart
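With Seaborn, a horizontal bar chart comes from putting the numeric column on x and the category on y. A minimal sketch with made-up counts (the real numbers are in the figure above):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up counts per category, for illustration only.
top_items = pd.DataFrame({
    "itemDescription": ["whole milk", "other vegetables", "rolls/buns"],
    "items_count": [400, 300, 200],
})

fig, ax = plt.subplots(figsize=(8, 4))
# Numeric values on x and categories on y give horizontal bars.
sns.barplot(data=top_items, x="items_count", y="itemDescription",
            color="steelblue", ax=ax)
ax.set_xlabel("Items counted")
ax.set_ylabel("Category")
fig.tight_layout()
```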
If you prefer a vertical bar chart, try this:
Figure 6: Vertical Bar Chart
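For the vertical version, swapping the two axes is enough. Again a sketch with made-up counts:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up counts per category, for illustration only.
top_items = pd.DataFrame({
    "itemDescription": ["whole milk", "other vegetables", "rolls/buns"],
    "items_count": [400, 300, 200],
})

fig, ax = plt.subplots(figsize=(8, 4))
# Categories on x and numeric values on y give vertical bars.
sns.barplot(data=top_items, x="itemDescription", y="items_count",
            color="steelblue", ax=ax)
ax.set_xlabel("Category")
ax.set_ylabel("Items counted")
ax.tick_params(axis="x", labelrotation=45)  # long names overlap otherwise
fig.tight_layout()
```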
Bar Chart with Hue Value
If you want to compare each category’s sales by year, what would your visualization look like? You can draw the graph by adding an element called a hue value.
Figure 7: Bar Chart with Hue Value
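In Seaborn the hue value is simply one more column passed to hue, which splits every bar into one bar per group. A sketch with made-up counts, where the year is stored as text so it is treated as a category:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up counts per category and year, in long format.
counts_by_year = pd.DataFrame({
    "itemDescription": ["whole milk", "whole milk",
                        "other vegetables", "other vegetables",
                        "rolls/buns", "rolls/buns"],
    "year": ["2014", "2015"] * 3,
    "items_count": [220, 180, 160, 140, 110, 90],
})

fig, ax = plt.subplots(figsize=(8, 4))
# hue="year" draws one bar per year inside each category.
sns.barplot(data=counts_by_year, x="items_count", y="itemDescription",
            hue="year", ax=ax)
ax.set_xlabel("Items counted")
ax.set_ylabel("Category")
fig.tight_layout()
```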
Now, can you see it more clearly?
Histogram
Imagine that I want to discover how often customers buy whole milk, the best-selling category. I will use a histogram to obtain this information.
Figure 8: Frequency of customers buying whole milk in 2014 and 2015
By looking at the visualization, we can see that customers hardly ever repurchase this item more than twice, and a lot of customers stop buying this product after their first purchase.
Pie chart
Actually, pie charts are quite poor at communicating the data. However, it does not hurt to learn this visualization technique.
For this data, I want to compare the sales of the top 10 categories with the rest in both 2014 and 2015. Now, let’s transform our data to get this information visualized.
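A sketch of one such transformation: label every purchase as belonging to a top category or to the rest, then count per year. The rows are hypothetical, TOP_N is 2 only because the sample is tiny, and ranking the categories over both years together is my assumption.

```python
import pandas as pd

# Hypothetical purchases with the year already extracted from the date.
df = pd.DataFrame({
    "year": [2014] * 6 + [2015] * 6,
    "itemDescription": ["whole milk", "whole milk", "whole milk",
                        "other vegetables", "other vegetables", "soda",
                        "whole milk", "whole milk", "other vegetables",
                        "soda", "yogurt", "yogurt"],
})

TOP_N = 2  # the article uses 10; 2 keeps the toy example small

# Rank categories over both years together (an assumption).
top_names = df["itemDescription"].value_counts().head(TOP_N).index

# Label each row, then count the rows per year and group.
is_top = df["itemDescription"].isin(top_names)
df["group"] = is_top.map({True: f"Top {TOP_N}", False: "Others"})
share = df.groupby(["year", "group"]).size().unstack(fill_value=0)
print(share)
```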
Our data is now ready. Let’s see the pies!
Figure 9: Pie Charts
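Two pies side by side can be drawn with Matplotlib subplots, one per year. The numbers below are made up for illustration and are not the figures from the article.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts: one row per year, one column per group.
share = pd.DataFrame(
    {"Top 10": [70, 60], "Others": [30, 40]},
    index=[2014, 2015],
)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, (year, row) in zip(axes, share.iterrows()):
    # autopct prints each wedge's percentage inside the pie.
    ax.pie(row, labels=row.index, autopct="%1.1f%%", startangle=90)
    ax.set_title(f"Year {year}")
fig.tight_layout()
```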
So the top 10 categories made up a smaller share of purchases in 2015 than in 2014, by 5.5%.
Swarm Plot
Another way to review your data is a swarm plot. In a swarm plot, points are adjusted (along the categorical axis only) so that they do not overlap. This is helpful as a complement to a box plot when you want to display all observations along with some representation of the underlying distribution.
As I want to see the number of items sold on each day of the week, I may use this type of chart to display the information. As usual, let’s first calculate the items sold and group them by categories and days.
After we obtain the data, let’s see what the graph looks like.
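A sketch of this grouping step, assuming day-month-year date strings and using hypothetical rows. The names day_of_week and items_count are my own.

```python
import pandas as pd

# Hypothetical purchases; 01-01-2014 was a Wednesday.
df = pd.DataFrame({
    "Date": ["01-01-2014", "01-01-2014", "02-01-2014",
             "05-01-2015", "08-01-2014"],
    "itemDescription": ["whole milk", "soda", "whole milk",
                        "whole milk", "whole milk"],
})

df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["day_of_week"] = df["Date"].dt.day_name()

# Items sold per weekday and category.
items_by_day = (
    df.groupby(["day_of_week", "itemDescription"])
      .size()
      .reset_index(name="items_count")
)
print(items_by_day)
```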
Figure 10: Swarm Chart
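A swarm plot of such a table can be drawn with sns.swarmplot, where every point is one category on one weekday. The counts below are made up, and only three weekdays are included to keep the sketch short.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up counts: each row is one category on one weekday.
items_by_day = pd.DataFrame({
    "day_of_week": ["Monday"] * 4 + ["Tuesday"] * 4 + ["Wednesday"] * 4,
    "items_count": [12, 15, 15, 30, 10, 11, 25, 26, 8, 20, 21, 22],
})

fig, ax = plt.subplots(figsize=(8, 4))
# Points are nudged sideways within each day so they do not overlap.
sns.swarmplot(data=items_by_day, x="day_of_week", y="items_count",
              order=["Monday", "Tuesday", "Wednesday"], ax=ax)
ax.set_xlabel("Day of week")
ax.set_ylabel("Items sold")
fig.tight_layout()
```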
Conclusion
In this article, I have shown you how to visualize your data with different types of charts. If you find it helpful, you can save it and review it anytime you want. It can save you tons of time down the road. :D
WRITTEN BY
Chi Nguyen
Thanks to Linda Chen.