When working with data, it’s quite common to have large expanses of it. A column related to the variety of car paint colors. A column related to the weights of different cruise ship passengers. A column related to the number of visitors an amusement park stall gets. These are all data where the number of rows could range from the tens to the thousands. When you have this much information, parsing through it and properly understanding it can be a hassle. The basic first step is to check the main Ms: the min, max, mean, median, mode, and “mstandard” deviation. While this gives you a few numbers that tell you how the data is spread out, it’s still quite rough. Just because you know the mean is 4,128 doesn’t ‘mean’ you can quite picture its significance. Knowing the standard deviation and the minimum, 25th percentile, 50th percentile, 75th percentile, and maximum values does help, but unless your audience is made up of superhumans with computers for brains, it’s hard for them to attach importance to raw numbers. The solution to this issue is to use visualizations.
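For reference, here is a minimal sketch of how those summary numbers could be computed using only Python’s standard `statistics` module. The `weights` list is made-up illustrative data, not taken from any real dataset, and the variable names are my own.

```python
import statistics

# Hypothetical passenger weights in kilograms, made up for illustration.
weights = [58, 61, 64, 64, 67, 70, 72, 75, 81, 95]

summary = {
    "min": min(weights),
    "max": max(weights),
    "mean": statistics.mean(weights),
    "median": statistics.median(weights),
    "mode": statistics.mode(weights),
    "std": statistics.stdev(weights),  # sample standard deviation
}

# Cut points for the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(weights, n=4)

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
print(f"quartiles: {q1:.2f}, {q2:.2f}, {q3:.2f}")
```

Even for ten values the output is just a column of numbers, which is exactly the problem a plot is meant to solve.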
Visualizations are ways to transform data from numbers and characters into aesthetically pleasing shapes and lines. There are many different types of visualizations, each with its own use cases and optimal situations. They range from histograms, which stuff data into bins to show the distribution using counts, to pie charts, which show what share of the whole each category makes up. These are just a few examples of how visualizations help portray your data in an easier-to-digest way for both you and your audience. When thinking of visualizations in Python, the first thing that a majority of users will think of is the Python library Matplotlib. Matplotlib is a comprehensive library that heavily cuts down on the amount of code one needs to create a shape or diagram in Python. With just a few lines of code and some data as food, one can instantly conjure up a visualization that suits their needs. As an example, we will simply use one line of code.
- matplotlib.pyplot.hist(df['sepal length (cm)'])
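For anyone who wants to run that one-liner from scratch, here is a minimal, self-contained sketch. It assumes `df` is a pandas DataFrame with a `sepal length (cm)` column; the handful of values below are illustrative stand-ins for the full column, and the bin count and output filename are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in values; the article's df holds the full column.
df = pd.DataFrame(
    {"sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 5.8, 6.3, 6.7, 7.0, 6.4, 5.7]}
)

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn bar patches.
counts, bin_edges, _ = ax.hist(df["sepal length (cm)"], bins=6)
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("count")
fig.savefig("sepal_length_hist.png")
```

In a notebook you can drop the `matplotlib.use("Agg")` line and the `savefig` call; the figure is displayed inline.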
The resulting plot is a histogram of the sepal length column from the first image. Instead of being overwhelmed by number after number, the visualization provides a nice clear overview of how the numbers are spread out. Just from a quick glance, most of the sepal lengths are around 4.8~6.8 centimeters, with lengths under 4.8 and above 6.8 standing out as comparatively rare. Matplotlib is the most well-known of the Python data visualization libraries because it is one of the earliest and most established of them, as well as the grandfather to many current libraries. One such grandchild, our main focus for today, is the Python library seaborn.
Like the other grandchildren, seaborn is a library that is built on top of matplotlib. Created by Michael Waskom, it utilizes much of matplotlib’s existing syntax and functions, while adding in a bit of its own flavor. A common question is why one would use seaborn when matplotlib, a stable, battle-hardened veteran of the data visualization field, is both reliable and well-grounded. The answer simply comes down to one issue: aesthetics. While it is true that beauty is in the eye of the beholder, and I have no desire to alienate readers who find matplotlib plots to be a thing of beauty, it isn’t a terribly unpopular opinion to say that default matplotlib visualizations are quite drab in appearance. Aesthetically pleasing matplotlib plots do exist, but they require quite a bit of effort and tweaking to look pleasing to the eye. A simple showcase of seaborn’s prowess is to recreate the earlier plot.
- seaborn.distplot(df['sepal length (cm)'])
Compare this plot with the one made earlier with matplotlib. Both plots were created with one line of code and have no extra styling applied. I do not consider myself a professional art critic refined in the ways of applying the 7 Principles of Art to a small histogram quickly made with Python, but in my humble opinion, if I were to showcase data to potential business partners or even just to myself, I would prefer looking at the seaborn plot. This is just one example of how seaborn touches up a simple plot so that even an inexperienced user can easily create a decent-looking one.
While the above example showed how easy it is to create a plot with minimal code, here are some examples of plots made by more proficient users of the library, taken from the gallery on the seaborn website.
- Joint and Marginal Histograms
- Violin Plot
- Tutorials on how to begin creating your own seaborn plots https://seaborn.pydata.org/tutorial.html
- Link to the gallery with more showcases on seaborn plots https://seaborn.pydata.org/examples/index.html
For all the strength and power that seaborn has, it is not without its downsides. Compared to matplotlib, seaborn can produce much more aesthetically pleasing plots without much effort, but this comes at the cost of some flexibility when it comes to the data itself. Seaborn’s goal of streamlining the plotting process gives up some of the fine-grained control matplotlib offers when preparing data for a plot. A big example is that seaborn has no dedicated pie chart function, whereas matplotlib draws pie charts with ease. Likewise, there are many occasions where using sister Python visualization libraries such as Bokeh or Gleam is preferable, as they can create more powerful and interactive visualizations that are convenient to place in websites. However, when it comes to creating a static plot and taking into account complexity, aesthetics, and ease of access, seaborn is, in my opinion, still the best visualization library to use and deserves a spot in every data scientist’s toolkit.
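As a footnote on the pie chart point, the two libraries can be combined: matplotlib draws the pie while seaborn supplies the colors. The sketch below assumes made-up paint color counts, and the labels, palette name, and filename are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical counts of car paint colors, made up for illustration.
labels = ["red", "blue", "silver", "black"]
counts = [12, 18, 30, 40]

# seaborn has no pie function, but its palettes work as matplotlib colors.
colors = sns.color_palette("pastel", n_colors=len(labels))

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    counts, labels=labels, colors=colors, autopct="%1.0f%%"
)
ax.set_title("Car paint colors (made-up data)")
fig.savefig("paint_colors_pie.png")
```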