Art of Data Visualization

"A picture is worth a thousand words."

Visualisation is a cornerstone of data science and one of the most powerful tools for exploratory data analysis. It is therefore important to understand the key principles behind data visualisation before applying your favourite statistical method. Mastering visualisation is a hallmark of an effective data scientist, because it is how you come to understand the relevant features of the data at hand.

Edward Tufte has given a lot of thought to this subject and has written several books on the topic. However, our focus in this post is on “visualisation for data exploration”, which is different from “visualisation for presentation” and does not concern itself with conveying information or knowledge to a broader audience.

Data can be more than just storytelling. Some charts can convince decision makers of the effectiveness of certain interventions, and others can increase public awareness. For example, this chart shows the clear effectiveness of vaccines.

Charts can also make the case that the world is improving. Some of the fun facts (as pointed out by Bill Gates) are:

  • You’re 37 times less likely to be killed by a bolt of lightning than you were at the turn of the century.

  • Time spent doing laundry fell from 11.5 hours a week in 1920 to an hour and a half in 2014.

  • You’re way less likely to die on the job.

  • The global average IQ score is rising by about 3 IQ points every decade.

An effective visualisation technique can convey characteristics of interest, such as: At what value is the distribution centred? Is the distribution symmetric? What range contains 95% of the data?

Histograms, for example, can convey this information effectively. A histogram divides the data into non-overlapping bins of equal width and plots the number of values that fall into each bin. Histograms tell us about the range of the data and in which interval the majority (e.g. 95%) of the data lies.
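As a minimal sketch of the binning described above, assuming NumPy and a synthetic, normally distributed sample (the data, the seed, the choice of 20 bins, and the helper name `plot_histogram` are illustrative assumptions, not part of the original post):

```python
import numpy as np

# Synthetic sample standing in for real data (assumption: any 1-D numeric array works).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Split the data range into 20 equal-width, non-overlapping bins and count values per bin.
counts, edges = np.histogram(data, bins=20)

# Where is the distribution centred, and which interval holds the central 95% of values?
median = np.median(data)
low, high = np.percentile(data, [2.5, 97.5])

print(f"range: [{data.min():.2f}, {data.max():.2f}]")
print(f"median: {median:.2f}")
print(f"central 95% interval: [{low:.2f}, {high:.2f}]")


def plot_histogram(values, bins=20):
    """Draw the histogram with Matplotlib (imported lazily, so the numbers above work without it)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    return fig, ax
```

Calling `plot_histogram(data)` followed by `matplotlib.pyplot.show()` should display the chart. The printed median and percentiles answer the "centre" and "95% range" questions numerically, while the plot lets you judge symmetry at a glance.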

Kernel Density Estimation:

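Kernel density estimation smooths the data into a continuous curve by placing a small kernel (commonly a Gaussian) on each observation and averaging them. We plot a kernel density estimate alongside the histogram of a NumPy array (representing an image from a depth sensor).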
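The original depth image is not available here, so the sketch below uses a hypothetical 64x64 array as a stand-in. It computes a Gaussian kernel density estimate by hand with NumPy. The bandwidth rule, the grid size, and the names `gaussian_kde` and `plot_hist_with_kde` are assumptions made for illustration only:

```python
import numpy as np

# Hypothetical stand-in for a depth image: distances in metres,
# with a near object in front of a far background.
rng = np.random.default_rng(0)
depth = rng.normal(loc=3.0, scale=0.2, size=(64, 64))
depth[16:48, 16:48] = rng.normal(loc=1.0, scale=0.1, size=(32, 32))
values = depth.ravel()  # histograms and KDE work on the flattened pixel values


def gaussian_kde(samples, grid, bandwidth):
    """Average a Gaussian bump centred on every sample, evaluated at each grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Normal-reference rule of thumb for the bandwidth (one common default, chosen for simplicity).
bandwidth = 1.06 * values.std() * values.size ** (-1 / 5)
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)
density = gaussian_kde(values, grid, bandwidth)


def plot_hist_with_kde(values, grid, density, bins=50):
    """Overlay the KDE curve on a density-normalised histogram (Matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, density=True, alpha=0.5, label="histogram")
    ax.plot(grid, density, label="KDE")
    ax.set_xlabel("depth (m)")
    ax.set_ylabel("density")
    ax.legend()
    return fig, ax
```

The histogram is drawn with `density=True` so that both curves share the same vertical scale. If seaborn is installed, `seaborn.histplot(values, kde=True)` should produce a similar combined plot in a single call.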

"The greatest value of a picture is when it forces us to notice what we never expected to see." - John W. Tukey

How to decide which visualisation tool is right for you?

  1. Scatter plots (numeric x numeric): If both dimensions of the data are numeric, the most natural first type of plot to consider is the scatter plot: plotting points that simply correspond to the different coordinates of the data.

  2. Line plots (numeric x numeric, sequential): Use these for two-dimensional data where one of the dimensions is naturally sequential. This comes up often, for instance when monitoring a time series, so that one of the two dimensions of each data point is time.

  3. Box-and-whisker and violin plots (categorical x numeric): When one dimension is numeric and the other is categorical, these plots show the distribution of the numeric values within each category.

  4. Heat maps and bubble plots (categorical x categorical): When both dimensions of our 2D data are categorical, we have even less information to use, since neither axis has a numeric scale. What remains to plot is typically the count of observations for each pair of categories.
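The four cases above can be read as a small decision rule. The sketch below encodes it in plain Python. The function names `is_numeric` and `suggest_plot` and the `sequential` flag are hypothetical, introduced only to illustrate the rule, and the result is a starting suggestion rather than a definitive choice:

```python
from numbers import Number


def is_numeric(column):
    """Treat a column as numeric if every value is a number (booleans excluded)."""
    return all(isinstance(v, Number) and not isinstance(v, bool) for v in column)


def suggest_plot(x, y, sequential=False):
    """Suggest a first plot for two columns, following the four cases above."""
    x_num, y_num = is_numeric(x), is_numeric(y)
    if x_num and y_num:
        # Two numeric dimensions: a line plot if one of them is naturally ordered (e.g. time).
        return "line plot" if sequential else "scatter plot"
    if x_num or y_num:
        # Exactly one numeric dimension: show its distribution per category.
        return "box-and-whisker or violin plot"
    # Both categorical: plot counts per pair of categories.
    return "heat map or bubble plot"


print(suggest_plot([1, 2, 3], [2.0, 4.1, 6.2]))                    # scatter plot
print(suggest_plot([0, 1, 2], [5.0, 5.5, 5.2], sequential=True))   # line plot
print(suggest_plot(["a", "b", "a"], [1.0, 2.0, 1.5]))              # box-and-whisker or violin plot
print(suggest_plot(["a", "b", "a"], ["x", "y", "x"]))              # heat map or bubble plot
```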

TODO: Add Examples below

Line Graph

Line Graph with error margins:

Repeated observations at each x value are aggregated (within each semantic group, if grouping is used) into a single central estimate, and an error band around the line represents the uncertainty of that estimate.
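One way to read the description above is sketched below: repeated observations at each x value are averaged, and a band of roughly 95% confidence is drawn around the mean. The synthetic data, the normal-approximation band (mean ± 1.96 standard errors), and the names `aggregate_with_band` and `plot_line_with_band` are assumptions for illustration, not necessarily what the original figure used:

```python
import numpy as np

# Synthetic repeated measurements: 30 noisy observations at each of 10 time steps.
rng = np.random.default_rng(0)
time = np.repeat(np.arange(10), 30)
signal = np.sin(time / 2.0) + rng.normal(scale=0.3, size=time.size)


def aggregate_with_band(x, y, z=1.96):
    """Return the unique x values, the mean of y at each, and lower/upper band edges."""
    xs = np.unique(x)
    means = np.array([y[x == v].mean() for v in xs])
    # Standard error of the mean at each x; ddof=1 gives the sample standard deviation.
    sems = np.array([y[x == v].std(ddof=1) / np.sqrt((x == v).sum()) for v in xs])
    return xs, means, means - z * sems, means + z * sems


xs, means, lower, upper = aggregate_with_band(time, signal)


def plot_line_with_band(xs, means, lower, upper):
    """Draw the mean line with a shaded error band (Matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(xs, means, label="mean")
    ax.fill_between(xs, lower, upper, alpha=0.3, label="approx. 95% band")
    ax.set_xlabel("time")
    ax.set_ylabel("signal")
    ax.legend()
    return fig, ax
```

If seaborn is available, `seaborn.lineplot(x=time, y=signal)` performs a comparable aggregation by default, estimating the band by bootstrapping rather than with the normal approximation used here.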

Histograms:

Add here

Read More:

[1] Andrew Abela. Advanced Presentations by Design: Creating Communication that Drives Action. Pfeiffer, 2nd edition, 2013.

[2] http://www.datasciencecourse.org/notes/visualization/
