Exploratory Data Analysis
"The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey
In the bigger picture of data-driven science, we start by collecting a data set of reasonable size, then look for patterns that can serve as hypotheses for future analysis. Exploratory data analysis is the search for patterns and trends in a given data set. Its goal is to get you thinking about your data and reasoning about your question: making sure we have the right data, spotting any problems with the dataset, determining whether the data can answer our question, and getting a rough idea of what the answer will look like.
People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data. They find looking at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work in part by hiding certain aspects of the data while making other aspects more clear.
Overall, the goal of exploratory analysis is to examine or explore the data and find relationships that weren't previously known. Exploratory analyses explore how different measures might be related to each other but do not confirm that a relationship is causative. After the basic exploratory analysis we can pause and consider whether our question needs refinement or whether we need to collect more or new data.
Exploratory data analysis is about looking carefully at your dataset: identifying errors in data collection or processing, finding violations of statistical assumptions, and suggesting interesting hypotheses.
As a first step we could answer the following questions: Who constructed this dataset, when, and why? What is the size of the data, and what do the various columns describe?
EDA always precedes formal (confirmatory) data analysis. EDA is useful for:
Detection of mistakes
Checking of assumptions
Determining relationships among the explanatory variables
Assessing the direction and rough size of relationships between explanatory and outcome variables
Preliminary selection of appropriate models of the relationship between an outcome variable and one or more explanatory variables.
Data Acquisition
Note: automate as much as possible so you can easily get fresh data.
List the data you need and how much you need.
Find and document where you can get that data (identify what kind of data the organization has)
Check how much space it takes and the engineering effort required (initial assessment).
Understand Operational Constraints (e.g. what data is actually available at inference time)
Legal/Ethics/Privacy
Get access authorizations and check legal obligations
Proactively identify ethical risks, including how your work could be misused by harassers, trolls, authoritarian governments, or for propaganda/disinformation campaigns (and plan how to reduce these risks)
Identify potential biases and potential negative feedback loops
Ensure sensitive information is deleted or protected (e.g. anonymized)
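As one minimal sketch of protecting identifiers, a salted one-way hash can replace a direct identifier before the data is shared. The `anonymize` helper and the `email` field here are hypothetical, not a full anonymization scheme:

```python
import hashlib

def anonymize(value: str, salt: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

# Hypothetical records containing a sensitive 'email' field
records = [
    {"email": "alice@example.com", "age": 34},
    {"email": "bob@example.com", "age": 41},
]
salt = "keep-this-secret"  # store the salt separately from the data
for r in records:
    r["email"] = anonymize(r["email"], salt)
```

The hash still lets you link rows belonging to the same person without revealing who they are; real anonymization may also require handling quasi-identifiers such as age or location.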
EDA Methods:
An EDA method is either non-graphical or graphical, and either univariate or multivariate (usually just bivariate). Overall, the four types of EDA are univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical.
Non-graphical methods generally involve calculating summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way.
Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.
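As a minimal sketch (assuming pandas and a small made-up table), univariate non-graphical EDA summarizes one column at a time, while bivariate non-graphical EDA examines the relationship between two columns:

```python
import pandas as pd

# Small made-up table of measurements
df = pd.DataFrame({
    "height_cm": [160, 172, 168, 181, 175],
    "weight_kg": [55, 70, 63, 82, 74],
})

# Univariate non-graphical EDA: summarize one column at a time
print(df["height_cm"].describe())

# Bivariate non-graphical EDA: how two columns relate
print(df[["height_cm", "weight_kg"]].corr())
```

The graphical counterparts would be, e.g., a histogram of one column and a scatter plot of the pair.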
EDA Summary:
Before EDA:
Check the size and type of data
See if the data is in an appropriate format - convert the data to a format you can easily manipulate (without changing the data itself)
Sample a test set, set it aside, and never look at it
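One way to carve out that test set, sketched here as an index-based split with a fixed random seed (the `split_test_set` function name is my own; scikit-learn's `train_test_split` does the same job in practice):

```python
import numpy as np

def split_test_set(n_rows: int, test_ratio: float = 0.2, seed: int = 42):
    """Partition row indices into train/test once, reproducibly."""
    rng = np.random.default_rng(seed)          # fixed seed: same split every run
    indices = rng.permutation(n_rows)
    n_test = int(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]  # (train, test)

train_idx, test_idx = split_test_set(1000)
```

Splitting by index lets you apply the identical split to features and labels, and the fixed seed ensures fresh runs never leak test rows into exploration.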
EDA:
Grab a copy of the data and read it in using appropriate Python or R libraries
Check the packaging, e.g. check the number of rows and columns and see that they match the dataset description.
Study each attribute and its characteristics (name, type, % of missing values, noisiness, usefulness, type of distribution)
Look at the top and the bottom of your data, e.g. using the head() and tail() commands, to get a sense of what the rows look like
Compute summary statistics. E.g. Tukey's five-number summary is a great start for numerical values, consisting of the extreme values (max and min) plus the median and quartile elements
Visualize the data: compute pairwise correlations and plot distributions to identify correlations and classes.
Formulate your question, e.g. are air pollution levels higher on the east coast than on the west coast? For supervised machine learning, identify the target attribute
Study how you would solve the problem manually
Identify the promising transformations you may want to apply
Make plans to collect more or different data (if needed and if possible)
All of this is an iterative process to get at the truth and answer our questions.
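The first few steps above can be sketched in Python with pandas. The dataset here is synthetic and stands in for whatever you would actually load with, say, pd.read_csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (e.g. one loaded with pd.read_csv)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pm25": rng.normal(12, 3, 100),         # made-up pollution readings
    "temperature": rng.normal(20, 5, 100),
})

# Check the packaging: do rows and columns match the dataset description?
print(df.shape)

# Look at the top and bottom of the data
print(df.head())
print(df.tail())

# Tukey's five-number summary per column: min, Q1, median, Q3, max
print(df.describe().loc[["min", "25%", "50%", "75%", "max"]])

# Pairwise correlations between the numeric columns
print(df.corr())
```

Each output feeds the iteration: a shape that disagrees with the documentation, odd head/tail rows, or an implausible extreme value all send you back to the data source.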
Statistical knowledge required:
(Expectation and mean, variance and standard deviation, covariance and correlation, median, quartiles, interquartile range, percentile/quantile, mode)
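All of these quantities can be computed with NumPy and the standard library. A toy sample, just to pin down the definitions:

```python
import numpy as np
from statistics import mode

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy sample

mean = x.mean()                       # expectation / mean
var = x.var()                         # population variance (ddof=0)
std = x.std()                         # standard deviation
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])   # quartiles
iqr = q3 - q1                         # interquartile range
mode_x = mode(x.tolist())             # most frequent value

y = 2 * x + 1                         # a variable perfectly linear in x
cov = np.cov(x, y)[0, 1]              # sample covariance (ddof=1)
corr = np.corrcoef(x, y)[0, 1]        # correlation: 1 for a perfect linear tie
```

Note the ddof conventions: NumPy's var/std default to the population formula, while np.cov defaults to the sample (n-1) formula.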
Visualization knowledge required:
Knowing which methods are suitable for which type of data