13  Exploratory data analysis

Data analysis involves steps like cleaning, transforming, inspecting, and modelling data to extract meaningful information. This process can serve various purposes, including exploratory and confirmatory analyses, as well as descriptive or predictive tasks.

Before building models or making predictions, it’s essential to explore the data to identify underlying patterns and structures. Data analysts employ both numerical and visual techniques to uncover insights that might be hidden within the dataset. However, it’s crucial for analysts to avoid over-interpreting apparent patterns and to ensure that the findings are reliable for the given data and potentially applicable to new datasets as well. Exploratory data analysis fills this role.
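As a small illustration of the numerical side of exploration, the sketch below computes a five-number summary and a mean using only Python's standard library. The data values and the helper name `five_number_summary` are hypothetical and chosen for illustration; the quartile convention (`method="inclusive"`) is likewise an assumption, and other conventions give slightly different quartiles.

```python
import statistics

# Hypothetical sample: made-up values for illustration only.
values = [12, 15, 11, 14, 13, 15, 98, 12, 14, 13]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a non-empty list of numbers."""
    ordered = sorted(data)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary(values)
mean = statistics.mean(values)

print("min, Q1, median, Q3, max:", summary)
print("mean:", mean)
```

In this made-up sample the mean sits well above the median because of the single large value, which is the kind of feature worth noticing, and questioning, before any modelling begins.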

Below are a few definitions of exploratory data analysis (EDA) from other sources.

13.1 Definitions

From Wikipedia:

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

From Wickham and Grolemund (2023):

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.

From SAS:

EDA is necessary for the next stage of data research. If there was an analogy to exploratory data analysis, it would be that of a painter examining their tools and available time, before deciding on what best to paint.

13.2 Origins

EDA came to the forefront with the publication of Tukey’s Exploratory Data Analysis (Tukey, 1977). Tukey’s aim in writing the book was to provide individual and isolated techniques useful to data analysts. All of the techniques in the book can be done by hand with pencil and paper.
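One technique associated with the book that suits pencil and paper is the stem-and-leaf display. The sketch below is a minimal version, assuming non-negative integers split into a tens stem and a units leaf. The function name `stem_and_leaf` and the data are hypothetical, and this is not a reproduction of any particular layout from the book.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return stem-and-leaf lines for a non-empty list of non-negative integers.

    Each value is split into a stem (tens and above) and a leaf (units digit).
    """
    groups = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)

    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Hypothetical data: made-up values for illustration only.
scores = [23, 25, 31, 34, 34, 38, 42, 47, 51, 68]
for line in stem_and_leaf(scores):
    print(line)
```

Each row lists the units digits of the values sharing a stem, so the display keeps every original value while still showing the shape of the distribution.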

Figure 13.1: Book cover of Tukey’s Exploratory Data Analysis

The following are some quotes by Tukey from the EDA book.

13.2.1 On measures

It is important to understand what you can do before you learn to measure how well you seem to have done it.

13.2.2 On pictures

The greatest value of a picture is when it forces us to notice what we never expected to see.

13.2.3 On exploration

Once upon a time, statisticians only explored.

13.2.4 On not having one right answer

There can be many ways to approach a body of data. Not all are equally good.