Visualizing your Data

Communication is an essential skill for every data scientist. You can have the best idea in the world, but if you cannot convey it to others, it might as well be a figment of your imagination. When it comes to data, the best way to show off your work is through visualizations, but the question “Which kind of visualization should I use?” always pops up. In this story I will explore a collection of different plots: what they look like, how to use them, and a suggestion or two on why they should be used.

I will be using Seaborn version 0.11.0 for visualization and the Titanic dataset from Kaggle for the actual data.
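Every snippet in this story assumes that seaborn is imported as sns and that the training data sits in a DataFrame called df. Here is a minimal setup sketch; the file name train.csv and the helper load_titanic are my own assumptions for illustration, so adjust them to wherever you saved the Kaggle download.

```python
import pandas as pd
import seaborn as sns  # every plotting call below goes through the sns alias


def load_titanic(path="train.csv"):
    """Read the Kaggle Titanic training data.

    Hypothetical helper: the default path is an assumption, not part of the dataset.
    """
    return pd.read_csv(path)


# Usage, once the file has been downloaded from Kaggle:
# df = load_titanic()
# df.head()  # first five rows
```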

Figure: first five rows of the Titanic dataset

Count Plot

sns.countplot(x = 'Survived', data = df, palette = 'flare')

A humble count plot. It takes in a categorical column and plots the frequency of each unique value.
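The bar heights are nothing more than value counts, which is easy to check with pandas. A small sketch on made-up values (not the real Survived column):

```python
import pandas as pd

# Made-up stand-in for the Survived column (not the real Titanic values).
survived = pd.Series([0, 1, 1, 0, 0, 1, 0], name="Survived")

# One bar per unique value; the bar height is the number of rows with that value.
counts = survived.value_counts().sort_index()
print(counts)
```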

Distribution Plot — Histogram

sns.displot(data = df, kind = 'hist', x = 'Age', hue = 'Survived', multiple = 'stack', height = 5, aspect = 1.75, palette = 'flare')

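Using the displot() function with kind = 'hist', this takes in a numeric column, splits it into bins, and plots the frequency of each bin. View it as a count plot for numeric columns instead of categorical ones. Here multiple = 'stack' stacks the bars of the two Survived groups on top of each other.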
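To see what "frequencies of each bin" means in numbers, here is a rough sketch with NumPy. The ages and the choice of four bins between 0 and 80 are made up for illustration; seaborn chooses its own binning unless you tell it otherwise.

```python
import numpy as np

# Made-up ages for illustration (not the real Titanic values).
ages = np.array([2.0, 19.0, 22.0, 24.0, 28.0, 30.0, 35.0, 41.0, 54.0, 62.0])

# Four equal-width bins between 0 and 80 (an arbitrary choice for this sketch).
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# counts[i] is the height of the bar that spans edges[i] to edges[i + 1].
print(counts, edges)
```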

Distribution Plot — KDE

sns.displot(data = df, kind = 'kde', x = 'Age', hue = 'Survived', height = 5, aspect = 1.75, palette = 'flare')

Using the displot() function with kind = 'kde', this takes in a numeric column and plots a smooth estimate of its density using kernel density estimation.
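For the curious, a kernel density estimate is just a small Gaussian bump placed on every data point, averaged together. The sketch below builds one by hand with NumPy on made-up ages; the bandwidth rule used here (Scott's rule of thumb) is my assumption for the example and is not necessarily the exact setting behind the plot above.

```python
import numpy as np

# Made-up ages for illustration (not the real Titanic values).
ages = np.array([2.0, 19.0, 22.0, 24.0, 28.0, 30.0, 35.0, 41.0, 54.0, 62.0])

# Scott's rule-of-thumb bandwidth: sample standard deviation times n^(-1/5).
bandwidth = ages.std(ddof=1) * len(ages) ** (-1 / 5)


def kde(x, sample, h):
    """Average of Gaussian bumps of width h centred on each sample point."""
    z = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))


grid = np.linspace(ages.min() - 5 * bandwidth, ages.max() + 5 * bandwidth, 2000)
density = kde(grid, ages, bandwidth)

# A density curve encloses an area of approximately 1.
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```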

Distribution Plot — ECDF

sns.displot(data = df, kind = 'ecdf', x = 'Age', hue = 'Survived', height = 5, aspect = 1.75, palette = 'flare')

Using the displot() function with kind = 'ecdf'. Similar to the plots above, it takes in a numeric column, but it plots the cumulative proportion of observations at or below each value using the empirical cumulative distribution function.
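The ECDF is simple enough to compute by hand, which helps when reading the curve. A minimal sketch on made-up ages:

```python
import numpy as np

# Made-up ages for illustration (not the real Titanic values).
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])

x = np.sort(ages)
# After sorting, the i-th point (counting from 1) has i/n of the data at or below it.
proportion = np.arange(1, len(x) + 1) / len(x)


def ecdf_at(value):
    """Proportion of observations less than or equal to value."""
    return np.mean(ages <= value)


print(ecdf_at(35))
```

Reading the plot works the same way: pick an age on the x-axis and the curve tells you what share of passengers were that age or younger.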

Distribution Plot — Bivariate — Histogram

sns.displot(data = df, y = 'Age', x = 'Fare', height = 10,
aspect = 1, color = '#663777', kind = 'hist')

Using the displot() function with kind = 'hist' and two numerical columns as inputs, it plots a bivariate histogram: a grid of cells shaded by how many rows fall into each one.

Distribution — Bivariate — KDE

sns.displot(data = df, y = 'Age', x = 'Fare', kind = 'kde',
height = 10, aspect = 1, color = '#663777')

Using the displot() function with kind = 'kde' and two numerical columns as inputs, it plots a bivariate KDE, drawn as density contours.

Sadly, bivariate distribution plots currently support only histograms and KDEs.

Joint Plot — Histogram

sns.jointplot(data = df, x = 'Age', y = 'Fare', kind = 'hist',
xlim = (0, 85), ylim = (0, 200), height = 10, color = '#c4537e')

A joint plot combines univariate and bivariate distribution plots. The bivariate plot is in the center while the univariate plots sit along the top and right side. Here the jointplot() function is used with two numerical columns as input and kind = 'hist'.

Joint Plot — KDE

sns.jointplot(data = df, x = 'Age', y = 'Fare', kind = 'kde',
fill = True, xlim = (0, 85), ylim = (0, 200), height = 10,
color = '#c4537e')

Using the jointplot() function with kind = 'kde' and two numerical columns as inputs. The fill parameter has been set to True, which gives the graph its filled-in appearance.

Joint Plot — Hex

sns.jointplot(data = df, x = 'Age', y = 'Fare', kind = 'hex',
height = 10, xlim = (0, 85), ylim = (0, 500),
color = '#c4537e')

Using the jointplot() function with kind = 'hex' and two numerical columns as inputs. Essentially the same as the histogram version but with hexagonal bins, which can be more aesthetically pleasing if the data allows it.

Joint Plot — Scatter

sns.jointplot(data = df, x = 'Age', y = 'Fare', kind = 'scatter', hue = 'Pclass', xlim = (0, 85), ylim = (0, 200), height = 10, palette = 'flare')

Using the jointplot() function with kind = 'scatter' and two numerical columns as inputs. This joint plot uses a scatter plot instead of a distribution plot for the bivariate portion, while the univariate plots use a KDE for each hue group.

Bar Plot

sns.barplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = df, palette = 'flare', order = ['1','2','3'])

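Using the barplot() function with a categorical column for the x-input and a numerical column for the y-input. The plot shows the mean of the numerical column for each unique categorical value. The black lines on top of the bars are error bars; by default in seaborn 0.11 they show a bootstrapped 95% confidence interval around each mean rather than the standard error. The values passed to order must match how the column is stored: ['1','2','3'] assumes Pclass holds strings, whereas the raw integer column needs [1, 2, 3].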
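The bar height and its error bar can be approximated by hand. The sketch below computes a mean and a bootstrapped 95% interval with NumPy, which is the kind of interval seaborn 0.11 draws on a bar plot by default. The ages are made up, and the 1000 resamples and the seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up ages for the passengers of a single class.
ages = np.array([38.0, 35.0, 54.0, 58.0, 28.0, 19.0, 40.0, 49.0, 65.0, 45.0])

bar_height = ages.mean()  # the value the bar is drawn up to

# Resample with replacement, recompute the mean each time,
# and take the middle 95% of those means as the interval.
boot_means = np.array([
    rng.choice(ages, size=len(ages), replace=True).mean()
    for _ in range(1000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(bar_height, low, high)
```

A short error bar means the resampled means barely move, so the average is well pinned down; a long one means a handful of rows is driving it.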

Point Plot

sns.pointplot(x = 'Pclass', y = 'Age', hue = 'Survived', order = ['1','2','3'], data = df, palette = 'flare')

Using the pointplot() function with a categorical column for the x-input and a numerical column for the y-input. The point plot is essentially the bar plot with the bars removed: only the means and their error bars (a 95% confidence interval by default) are shown, with lines connecting the means.

Line Plot

sns.lineplot(x = 'Pclass', y = 'Age', hue = 'Survived', style = 'Embarked', data = df, palette = 'flare')

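Using the lineplot() function with either a categorical or numerical column for the x-input and a numerical column for the y-input. When several rows share an x value, seaborn plots their mean and connects the means with a line, so it can be treated as a point plot without the markers. By default the uncertainty is drawn as a shaded 95% confidence band instead of error bars.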

Box Plot

sns.boxplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = df, palette = 'flare')

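Using the boxplot() function with a numerical column as the y-input and, optionally, a categorical column as the x-input to draw one box per category. Each box marks the quartiles of the distribution, the whiskers extend to the most extreme points within 1.5 times the interquartile range, and anything beyond them is drawn as an individual point, which makes outliers easy to spot.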
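The pieces of a box plot can be worked out with a few lines of NumPy. A sketch on made-up ages, using the usual 1.5 × IQR rule for the whisker limits:

```python
import numpy as np

# Made-up ages for illustration (not the real Titanic values).
ages = np.array([2.0, 19.0, 22.0, 24.0, 26.0, 28.0, 30.0, 35.0, 38.0, 80.0])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box;
# anything beyond those fences is drawn as an individual outlier marker.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(q1, median, q3, outliers)
```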

Boxen Plot

sns.boxenplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = df, palette = 'flare')

Using the boxenplot() function with a numerical column as the input. Building off of the box plot, it is a bit more descriptive as it contains extra quantiles in the form of smaller boxes. The extra boxes help to visualize the distribution outside of the interquartile range and provide more information about its shape, especially in the tails.

Violin Plot

sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', split = True, data = df, palette = 'flare')

Using the violinplot() function with a numerical column as the input. Violin plots serve a similar purpose to boxen plots, but they draw a mirrored KDE of each group instead of stacked boxes. With split = True, the two hue levels share one violin, one on each half. When the amount of data and the number of outliers are large, violin plots tend to look cleaner.

Swarm Plot

sns.swarmplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = df, palette = 'flare')

Using the swarmplot() function with a numerical column as the input. It serves the same purpose as a violin plot, but it shows each row as a point on the plot, nudged sideways so that points do not overlap. A very clear way of viewing a distribution, but it can get messy on datasets with high row counts.

Scatter Plot

sns.scatterplot(x = 'Age', y = 'Fare', hue = 'Survived', data = df, palette = 'flare')

Using the scatterplot() function with two numerical columns as inputs. A simple and clear way of viewing the relationship between two columns and whether they correlate with each other.

Strip Plot

sns.stripplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = df, palette = 'flare')

Using the stripplot() function with a categorical column as the x-input and a numerical column as the y-input. It can be treated as a combination of a scatter plot and a swarm plot: every row is drawn as a point, but the points are spread with random jitter instead of being arranged so that they never overlap. Great for seeing how a numerical value is distributed within each category, but, similar to swarm plots, it can get messy when there are a lot of values.

Facet Grid

g = sns.FacetGrid(col = 'Embarked', row = 'Sex', data = df,
height = 5, palette = 'flare')
g.map(sns.swarmplot, 'Pclass', 'Age', 'Survived', palette = 'flare')

Facet grids are used when you want to plot a dataset onto multiple axes at once. This uses the FacetGrid() function, a map() call, two categorical columns, and a plot of your choice (a swarm plot in this example) with its own inputs. The facet grid will create a grid with rows and columns equal to the number of unique categorical values in their respective input column, and each grid square will be a different combination of those values. Each combination selects the subset of rows that the plot in that square is drawn from, so every square will have a different plot. This showcases a wide variety of different scenarios for your data and can greatly help your understanding of it.

Categorical Plot

sns.catplot(data = df, x = 'Pclass', y = 'Age', row = 'Sex',
col = 'Embarked', hue = 'Survived', kind = 'swarm',
palette = 'flare')

However, since facet grids go through a map() call, which can produce misaligned categories across facets when the category order is not fixed explicitly, it is more reliable to use the catplot() function. It builds the same kind of grid as FacetGrid(), with the limitation that it can only draw categorical plots onto it (we use a swarm plot again for our example).
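To check the "rows times columns" layout for yourself, here is a small sketch that runs on a made-up frame reusing the Titanic column names. I use kind = 'strip' instead of a swarm plot only to keep it light, and the Agg backend is an assumption so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import pandas as pd
import seaborn as sns

# Tiny made-up frame that reuses the Titanic column names.
toy = pd.DataFrame({
    "Sex": ["male", "female"] * 6,
    "Embarked": ["S", "S", "C", "C", "Q", "Q"] * 2,
    "Pclass": [1, 2, 3, 1, 2, 3, 3, 2, 1, 3, 2, 1],
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58, 20, 39],
})

g = sns.catplot(data=toy, x="Pclass", y="Age", row="Sex", col="Embarked", kind="strip")

# One row of axes per unique Sex, one column per unique Embarked.
print(g.axes.shape)
```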

Relationship Plot

sns.relplot(data = df, x = 'Age', y = 'Fare', row = 'Sex',
col = 'Embarked', hue = 'Survived', size = 'Pclass',
style = 'Parch', palette = 'flare')

The twin sibling of the catplot() function, relplot() works the same way but deals with numerical inputs instead. It draws relational plots (scatter plots by default, or line plots with kind = 'line'), covering ground that catplot() cannot. On a fun side note, it can encode up to 7 columns at once! (In the example, we have a column for grid row, grid column, x input, y input, hue, size, and style; not terribly recommended due to messiness, sadly…)

Linear Model Plot

g = sns.lmplot(x = 'Age', y = 'Fare', hue = 'Survived', data = df, height = 10, palette = 'flare', row = 'Sex', col = 'Embarked')

Using the lmplot() function and two numerical columns as inputs. The linear model plot acts like a scatter plot with facet grid rows and columns built in, similar to catplot() and relplot(). However, its main gimmick is that it fits a simple linear regression to each group and draws the fitted line over the points. While a straight line is often a questionable fit for the data, the ease of access makes it a useful tool for initial EDA.
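If you want the numbers behind a fitted line, a degree-1 least-squares fit in NumPy gives a slope, an intercept, and a quick R² to judge how much the line explains. This is a rough sketch on made-up (age, fare) pairs, not a reproduction of what lmplot() does internally.

```python
import numpy as np

# Made-up (age, fare) pairs for illustration.
age = np.array([22.0, 38.0, 26.0, 35.0, 54.0, 27.0, 14.0, 58.0])
fare = np.array([7.25, 71.28, 7.93, 53.10, 51.86, 11.13, 30.07, 26.55])

# Degree-1 least-squares fit: fare is approximated by slope * age + intercept.
slope, intercept = np.polyfit(age, fare, deg=1)
predicted = slope * age + intercept

# R^2 near 0 means the straight line explains little of the spread in fare.
ss_res = ((fare - predicted) ** 2).sum()
ss_tot = ((fare - fare.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)
```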

Pair Plot

sns.pairplot(df, hue = 'Survived', palette = 'flare')

Using the pairplot() function and a dataframe as an input. This plots the pairwise relationship between every pair of numerical columns in the dataframe, with a distribution plot on the diagonal squares where a column is paired with itself. A useful way of viewing correlation at a glance.

Heatmap

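# 'Family' is not a column in the raw Kaggle file; it is assumed to be an engineered feature already present in df
fig, ax = plt.subplots()  # needs: import matplotlib.pyplot as plt
corr = df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare', 'Family']].corr()
sns.heatmap(corr, square = True, annot = True, ax = ax)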

Using the heatmap() function with a correlation matrix as the input, it displays the correlation coefficient between every pair of numerical columns as a colored, annotated grid. Handy for spotting features that are strongly related to the target, or to each other, before fitting a linear regression.
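Each cell in the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. The sketch below computes one cell by hand on a made-up frame and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up frame that reuses a few Titanic column names.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 11.13],
    "SibSp": [1, 1, 0, 1, 0, 0],
})

corr = toy.corr()  # Pearson correlation by default

# Pearson r computed directly for one pair of columns.
a = toy["Age"] - toy["Age"].mean()
f = toy["Fare"] - toy["Fare"].mean()
r = (a * f).sum() / np.sqrt((a**2).sum() * (f**2).sum())
print(round(r, 3), round(corr.loc["Age", "Fare"], 3))
```

Values close to 1 or -1 mean the two columns move together almost linearly, while values near 0 mean a straight line captures little of their relationship.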

Combined Plots

Having gone over most of the possible plots in seaborn, here are some examples of layering multiple plots on top of each other.

fig, ax = plt.subplots()  # one shared Axes for all three layers; needs: import matplotlib.pyplot as plt
sns.boxplot(x = 'Pclass', y = 'Age', palette = 'flare',
data = df, ax = ax)
sns.violinplot(x = 'Pclass', y = 'Age', data = df,
palette = 'flare', ax = ax)
sns.swarmplot(x = 'Pclass', y = 'Age', color = 'black',
data = df, ax = ax)

A plot for viewing the distribution of a numerical column within each category, made by combining a box plot, a violin plot, and a swarm plot on the same axes. The box plot shows the quartiles, while the violin plot and the swarm plot help you visualize the shape of the distribution. Frankly it’s quite ugly, so I would not recommend using it other than as an example.

fig, ax = plt.subplots()  # one shared Axes for all three layers; needs: import matplotlib.pyplot as plt
sns.kdeplot(data = df, y = 'Age', x = 'Fare', color = '#663777', ax = ax)
sns.histplot(data = df, y = 'Age', x = 'Fare', color = '#663777', ax = ax)
sns.scatterplot(data = df, y = 'Age', x = 'Fare', color = '#663777', ax = ax)

Another plot to view distributions, this time between two numerical columns. This combines a bivariate KDE plot, a bivariate histogram plot, and a scatter plot together. While each plot individually does a decent job of visualizing the spread, layering them shows the overall density, the binned counts, and the individual points at once.

Final Comments

All plots above were made using the documentation on the seaborn website (https://seaborn.pydata.org/tutorial.html). Each plot has many parameters that were not fully covered given the scope of this story, so if you create your own plot I would highly advise giving the documentation a look-over and playing around with it. While this post was made with myself in mind, I would be perfectly happy if others were to use it as a visualization cheat sheet as well!

Best of luck,

Andrew Yeh
