mGalarnyk / project1.md
Exploratory Data Analysis Project 1
This assignment uses data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets. In particular, we will be using the “Individual household electric power consumption Data Set” which I have made available on the course web site:
Dataset: Electric power consumption [20 MB]
Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Assignment #1 (demo). Exploratory data analysis with Pandas
mlcourse.ai – Open Machine Learning Course
Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
The same assignment is also available as a Kaggle Kernel, together with a solution.
In this task you should use Pandas to answer a few questions about the Adult dataset. (You don't have to download the data; it's already in the repository.) Choose the answers in the web form.
Unique values of features (for more information please see the link above):
age: continuous;
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked;
fnlwgt: continuous;
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool;
education-num: continuous;
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse;
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces;
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried;
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black;
sex: Female, Male;
capital-gain: continuous;
capital-loss: continuous;
hours-per-week: continuous;
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands;
salary: >50K, <=50K.
1. How many men and women (sex feature) are represented in this dataset?
2. What is the average age (age feature) of women?
3. What is the percentage of German citizens (native-country feature)?
4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?
6. Is it true that people who earn more than 50K have at least high school education? (education feature: Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate)
7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); the rest are considered bachelors.
9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work that many hours, and what is the percentage of those who earn a lot (>50K) among them?
10. Compute the average number of hours worked per week (hours-per-week) for those who earn a little and those who earn a lot (salary), for each country (native-country). What are these values for Japan?
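A minimal sketch of the pandas idioms these questions call for, using a tiny made-up stand-in for the Adult data (with the real dataset you would instead load the CSV from the course repository with pd.read_csv):

```python
import pandas as pd

# Tiny made-up stand-in for the Adult data, purely illustrative;
# the real dataset has ~32k rows and many more columns.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [39, 50, 38, 53],
    "native-country": ["United-States", "Germany", "United-States", "Germany"],
    "salary": [">50K", "<=50K", ">50K", ">50K"],
})

# Q1: counts of men and women
sex_counts = df["sex"].value_counts()

# Q2: average age of women
avg_female_age = df.loc[df["sex"] == "Female", "age"].mean()

# Q3: percentage of German citizens
pct_german = (df["native-country"] == "Germany").mean() * 100

# Q4-5: mean and standard deviation of age by salary group
age_by_salary = df.groupby("salary")["age"].agg(["mean", "std"])

print(sex_counts.to_dict(), avg_female_age, pct_german)
```

The same groupby/describe pattern extends directly to questions 7-10 on the full dataset.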
6.894: Interactive Data Visualization
Assignment 2: Exploratory Data Analysis
In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.
Step 1: Data Selection
First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets for you to choose from.
However, if you would like to investigate a different topic and dataset, you are free to do so. If working with a self-selected dataset, please check with the course staff to ensure it is appropriate for the course. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis after preparing the data.
After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.
Step 2: Exploratory Visual Analysis
Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Tableau. You should consider two different phases of exploration.
In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform "sanity checks" for patterns you expect to see!
In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc. ) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.
Final Deliverable
Your final submission should take the form of a Google Docs report – similar to a slide show or comic book – that consists of 10 or more captioned visualizations detailing your most important insights. Your "insights" can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures . We've annotated and graded this example to help you calibrate for the breadth and depth of exploration we're looking for.
Each visualization image should be a screenshot exported from a visualization tool, accompanied with a title and descriptive caption (1-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail for each caption such that anyone could read through your report and understand what you've learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image. To easily export images from Tableau, use the Worksheet > Export > Image... menu item.
The end of your report should include a brief summary of main lessons learned.
Recommended Data Sources
To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:
World Bank Indicators, 1960–2017. The World Bank has tracked global human development through indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you're also welcome to browse and use the original data by indicator or by country. Click on an indicator category or country to download the CSV file.
Chicago Crimes, 2001–present (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
Daily Weather in the U.S., 2017 . This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network . This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column .
Social mobility in the U.S. . Raj Chetty's group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.
The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files ( business , checkin , photos , review , tip , and user ), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp's Dataset License .
Additional Data Sources
If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!
- data.boston.gov - City of Boston Open Data
- MassData - State of Massachusetts Open Data
- data.gov - U.S. Government Open Datasets
- U.S. Census Bureau - Census Datasets
- IPUMS.org - Integrated Census & Survey Data from around the World
- Federal Elections Commission - Campaign Finance & Expenditures
- Federal Aviation Administration - FAA Data & Research
- fivethirtyeight.com - Data and Code behind the Stories and Interactives
- Buzzfeed News
- Socrata Open Data
- 17 places to find datasets for data science projects
You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau . Tableau provides a graphical interface focused on the task of visual data exploration. You will (with rare exceptions) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.
- Tableau - Desktop visual analysis software . Available for both Windows and MacOS; register for a free student license.
- Data Transforms in Vega-Lite . A tutorial on the various built-in data transformation operators available in Vega-Lite.
- Data Voyager , a research prototype from the UW Interactive Data Lab, combines a Tableau-style interface with visualization recommendations. Use at your own risk!
- R , using the ggplot2 library or with R's built-in plotting functions.
- Jupyter Notebooks (Python) , using libraries such as Altair or Matplotlib .
Data Wrangling Tools
The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!
- Tableau Prep - Tableau itself provides basic facilities for data import, transformation & blending; Tableau Prep is a more sophisticated data preparation tool.
- Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
- OpenRefine - A free, open source tool for working with messy data.
- Pandas - Data table and manipulation utilities for Python.
- dplyr - A library for data manipulation in R.
- Or, the programming language and tools of your choice...
The assignment score is out of a maximum of 10 points. Submissions that squarely meet the requirements will receive a score of 8. We will determine scores by judging the breadth and depth of your analysis, whether visualizations meet the expressiveness and effectiveness principles, and how well-written and synthesized your insights are.
We will use the following rubric to grade your assignment. Note that rubric cells may not map exactly to specific point scores.
This is an individual assignment. You may not work in groups.
Your completed exploratory analysis report is due by noon on Wednesday 2/19 . Submit a link to your Google Doc report using this submission form . Please double check your link to ensure it is viewable by others (e.g., try it in an incognito window).
Resubmissions. Resubmissions will be regraded by teaching staff, and you may earn back up to 50% of the points lost in the original submission. To resubmit this assignment, please use this form and follow the same submission process described above. Include a short one-paragraph description summarizing the changes from the initial submission. Resubmissions without this summary will not be regraded. Resubmissions will be due by 11:59pm on Saturday, 3/14. Slack days may not be applied to extend the resubmission deadline. The teaching staff will only begin to regrade assignments once the Final Project phase begins, so please be patient.
What Is Exploratory Data Analysis?
Many data scientists will agree that it is very easy to get lost in data—the more you collect, study and analyze, the more you want to explore. Rabbit holes of information are familiar and friendly places for data analysts and data scientists to dive into and spend hours extracting, modeling and analyzing these large datasets.
Data is collected and housed in some sort of data repository. It could be as simple as a spreadsheet or as complex as a database that comprises multiple spreadsheets or datasets. Generally, the rows in a database are individual records while the columns are the various characteristics of each record. But the human eye (and brain) can only scan so much data to analyze and learn from it. Exploratory data analysis allows analysts, scientists and business leaders to use visual tools to learn from the data.
What Is Exploratory Data Analysis? EDA Definition
Simply defined, exploratory data analysis (EDA for short) is what data analysts do with large sets of data, looking for patterns and summarizing the dataset’s main characteristics beyond what they learn from modeling and hypothesis testing. EDA is a philosophy that allows data analysts to approach a database without assumptions. When a data analyst employs EDA, it’s like they’re asking the data to tell them what they don’t know.
The National Institute of Standards and Technology (NIST) describes EDA as an approach to data analysis, not a model, whose techniques are used to:
- Maximize insights into a dataset.
- Uncover underlying structures.
- Extract important variables.
- Detect outliers and anomalies.
- Test underlying assumptions.
- Develop parsimonious models.
- Determine optimal factor settings.
NIST explains that EDA is an approach to data analysis that “postpones the usual assumptions about what kind of model the data [follows]” and allows the data to reveal its underlying structure and model.
EDA is typically used for these four goals:
- Exploring a single variable and looking at trends over time.
- Checking data for errors.
- Checking assumptions.
- Looking at relationships between variables.
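The four goals above can be sketched in a few lines of pandas; the daily sales series below is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales data, purely illustrative.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
df = pd.DataFrame({
    "date": dates,
    "units": rng.poisson(20, size=120),
    "revenue": rng.normal(500, 50, size=120),
})

# 1. A single variable over time: total units per month
monthly_units = df.groupby(df["date"].dt.month)["units"].sum()

# 2. Checking for errors: missing values and impossible negatives
n_errors = int(df.isna().sum().sum() + (df["units"] < 0).sum())

# 3. Checking assumptions: is revenue roughly symmetric?
revenue_skew = df["revenue"].skew()

# 4. Relationships between variables
units_revenue_corr = df["units"].corr(df["revenue"])
```

Each of these one-liners answers one of the four goals in a single pass, before any chart is drawn.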
Example of Exploratory Data Analysis
It is not unusual for a data scientist to employ EDA before any other data analysis or modeling. It is often a step in data analysis that lets data scientists look at a dataset to identify trends, outliers, patterns and errors.
Example 1: EDA in retail
In a retail setting, business intelligence applications and experts look at data to measure sales in terms of how many units were sold, how much customers spent, what buyers also bought and seasonality of sales; then, they compare sales month over month, quarter over quarter and year over year. There are a lot more data points that retailers look at, of course, but a data analyst is generally looking to answer specific questions: How many units were sold? Over what time period? For how much? What are the demographics of our customers? And so on.
An EDA approach asks different questions. For example, what trends did we see in the last year in units sold? In this retail case study example from You CANalytics , an analyst would look at this graphic and note an interesting finding in the number of product categories purchased in one year. The number of categories declined, as expected, but then it spiked at 50-plus. Why? Further investigation revealed that other retailers were buying from them and reselling their goods. This could allow the retailer to develop a business-to-business sales strategy and build relationships with these smaller retailers.
Example 2: EDA in health care research
In a study published in PLoS ONE on exploratory data analysis of a clinical study group, researchers used EDA to verify the homogeneity of their patient population and identify outliers, but they also used it to help them identify subpopulations.
The patients in the study were described by 40 attributes, including sex. The female groups were verified to be more homogeneous than the male set, which researchers segmented into five subgroups. The researchers recommended separate testing for the five male subgroups, to avoid drawing false conclusions in the clinical trials.
Example 3: EDA in electronic medical records
Hospitals, health departments and health care networks contain vast amounts of data collected from electronic medical records (EMR) that non–data experts don’t know what to do with. These EMRs are subject to intense compliance regulations in order to protect patients’ privacy. However, health care organizations are looking for ways to leverage the data without tying it to individuals.
In a study published in the Journal of Medical Internet Research, a group of researchers built a visual data mining system and tested it on the EMR of more than 14,000 patients who suffer from chronic kidney disease (CKD). The researchers took 13 years' worth of information to build visualizations of CKD progression over time, as well as of comorbid conditions present in CKD patients that may affect their outcomes.
Techniques and Tools
EDA methods are typically classified as graphical or non-graphical, and as univariate or multivariate. EDA relies heavily on visuals, which analysts use to look for patterns, outliers, trends and unexpected results.
Graphical vs. non-graphical EDA
Graphical exploratory data analysis employs visual tools to display data, such as:
- Box plots : used to graphically depict data through five summary points (minimum, first quartile, median, third quartile and maximum); also sometimes called box-and-whisker plots. Analysts use them to look at large sets of data. An example of this in practice is a utility that tracks water usage on a monthly basis.
- Heatmap : data visualization that uses colors to compare and contrast numbers in a set of data; also known as shading matrices. An example of this in practice would be traffic analyses, which look at heavy traffic patterns by time of day, day of the week and season.
- Histograms : bar charts that group numbers into a series of intervals, especially useful for continuous variables such as weights and measures. For example, a histogram can show agricultural growth with units grouped into height ranges (100–150 cm rather than 100, 101, 102, etc.).
- Line graphs : one of the most basic types of charts that plots data points on a graph; has a wealth of uses in almost every field of study.
- Pictograms : replace numbers with images to visually explain data. They’re common in the design of infographics, as well as visuals that data scientists can use to explain complex findings to non-data-scientist professionals and the public.
- Scattergrams or scatterplots : typically used to display two variables in a set of data and then look for correlations among the data. For example, scientists might use it to evaluate the presence of two particular chemicals or gases in marine life in an effort to look for a relationship between the two variables.
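The article discusses these chart types in the context of Tableau-style tools, but the same views can be sketched quickly with Matplotlib; the data below is synthetic and purely illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
usage = rng.normal(50, 10, size=300)                # e.g. monthly water usage
related = 0.8 * usage + rng.normal(0, 5, size=300)  # a correlated second variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(usage)                # box plot: quartiles plus outliers
axes[0].set_title("Box plot")
axes[1].hist(usage, bins=20)          # histogram: binned distribution
axes[1].set_title("Histogram")
axes[2].scatter(usage, related, s=8)  # scatterplot: relationship of two variables
axes[2].set_title("Scatterplot")
fig.savefig("eda_plots.png")
```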
Non-graphical exploratory data analysis involves data collection and reporting in nonvisual or non-pictorial formats.
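For contrast, a minimal non-graphical pass might look like the pandas sketch below (the customer records are hypothetical):

```python
import pandas as pd

# Hypothetical customer records, purely illustrative.
df = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "retail", "wholesale"],
    "spend": [120.0, 80.0, 900.0, 95.0, 1100.0],
})

# Frequency table for a categorical variable
segment_counts = df["segment"].value_counts()

# Numeric summary: count, mean, std, min, quartiles, max
spend_summary = df["spend"].describe()

# Group-wise summary linking the two variables
mean_spend_by_segment = df.groupby("segment")["spend"].mean()
```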
Assignment 1: Exploratory Data Analysis (EDA)
Description: Run and interpret EDA using SAS Studio on all the variables of the STOCKS dataset in SASHELP. Be sure to calculate the appropriate metrics for a few variables of your choice:
- Min
- Max
- Median
- Mean
Things you may comment on for each variable:
- Is the variable normally distributed?
- If not normally distributed, is it skewed, and to which side?
- If not normally distributed, is it very peaked, or very flat?
- Are there unusual values?
- Is there a category with very few values?
- What type of variable is it (continuous, categorical, ordinal, dichotomous, other)?
Explain in a Word document of up to two pages, with figures.
Answer & Explanation
The STOCKS dataset is a well-balanced dataset that can be used for a variety of statistical analyses, but it is important to keep in mind that some of the variables are not normally distributed. Therefore, it is recommended to use non-parametric statistical tests or transform the data before using parametric statistical tests.
Exploratory Data Analysis (EDA) on the STOCKS Dataset in SASHELP
Introduction
The STOCKS dataset in SASHELP contains data on stock prices for a number of companies. The dataset includes the following variables:
- Ticker: the stock symbol of the company
- Date: the date on which the stock price was recorded
- Open: the opening price of the stock on the given date
- High: the highest price of the stock on the given date
- Low: the lowest price of the stock on the given date
- Close: the closing price of the stock on the given date
- Volume: the number of shares traded on the given date

Descriptive Statistics
To begin our EDA, let's calculate some descriptive statistics for each of the variables. The following table shows the minimum, maximum, median, and mean for each variable:
Next, let's test each of the variables for normality. Normality is an important assumption for many statistical tests. The following table shows the results of the Shapiro-Wilk normality test for each variable:
As you can see, all of the variables except for Date are significantly non-normal (p-value < 0.05).
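The answer uses SAS Studio, but an equivalent Shapiro-Wilk check can be sketched in Python with scipy.stats (synthetic data here, since the SASHELP tables aren't reproduced in this document):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(100, 15, size=200)  # stand-in for a roughly normal variable
skewed_sample = rng.lognormal(3, 1, size=200)  # stand-in for a clearly non-normal variable

# Shapiro-Wilk: small p-value -> reject the hypothesis of normality
_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)

print(f"normal sample p={p_normal:.3f}, skewed sample p={p_skewed:.3g}")
```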
Skewness and Kurtosis
Since some of the variables are not normally distributed, let's take a look at their skewness and kurtosis. Skewness measures how symmetrical the distribution is, while kurtosis measures how peaked or flat the distribution is.
The following table shows the skewness and kurtosis for each variable:
The skewness and kurtosis values for all the variables are relatively close to 0, which suggests that the distributions are approximately symmetrical and mesokurtic (normally peaked).
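Again using Python as a stand-in for SAS, skewness and excess kurtosis can be computed with scipy.stats; values near 0 indicate a symmetric, mesokurtic distribution (the samples below are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
symmetric = rng.normal(0, 1, size=10_000)       # skewness ~ 0, excess kurtosis ~ 0
right_skewed = rng.exponential(1, size=10_000)  # skewness ~ 2, excess kurtosis ~ 6

# stats.kurtosis reports excess kurtosis (normal distribution -> 0)
print("symmetric:   ", stats.skew(symmetric), stats.kurtosis(symmetric))
print("right-skewed:", stats.skew(right_skewed), stats.kurtosis(right_skewed))
```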
Unusual Values
There are no unusual values in the dataset.

Categories with Very Few Values
There are no categories with very few values in the dataset.

Variable Types
The variable types are as follows:
- Ticker: categorical
- Date: continuous
- Open: continuous
- High: continuous
- Low: continuous
- Close: continuous
- Volume: continuous
Conclusion
The STOCKS dataset contains information on stock prices for a variety of companies over a long time frame. The descriptive statistics show that the variables are not normally distributed, but they are approximately symmetrical and mesokurtic. There are no unusual values in the dataset, and there are no categories with very few values.

Overall, the STOCKS dataset is a well-balanced dataset that can be used for a variety of statistical analyses. However, it is important to keep in mind that some of the variables are not normally distributed, so suitable statistical tests should be used.

Recommendations
Based on the results of our EDA, the following recommendations are made:
- Use non-parametric statistical tests, such as the Wilcoxon rank-sum test or the Kruskal-Wallis test, when comparing groups of data, since some of the variables are not normally distributed.
- Transform the data, for example by taking the logarithm, to make it more normally distributed before using parametric statistical tests.
- Be aware of the potential for outliers in the data, and use appropriate methods to identify and handle them.

Next Steps
The next steps in our analysis would be to perform more specific analyses, such as:
- Identifying patterns in the stock prices over time
- Comparing the performance of different stocks
- Forecasting future stock prices

We might also want to consider using other statistical techniques, such as machine learning, to build models that can predict future stock prices.
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Secondary Analysis of Electronic Health Records [Internet]. Cham (CH): Springer; 2016. doi: 10.1007/978-3-319-43742-2_15
Chapter 15. Exploratory Data Analysis
Matthieu Komorowski , Dominic C. Marshall , Justin D. Salciccioli , and Yves Crutain .
Published online: September 10, 2016.
In this chapter, the reader will learn about the most common tools available for exploring a dataset, which is essential in order to gain a good understanding of the features and potential issues of a dataset, as well as helping in hypothesis generation.
- Learning Objectives
- Why is EDA important during the initial exploration of a dataset?
- What are the most essential tools of graphical and non-graphical EDA?
Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim of exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data, usually through graphical representation [1]. EDA aims to assist the analyst's natural pattern recognition. Finally, feature selection techniques often fall under EDA. Since the seminal work of Tukey in 1977, EDA has gained a large following as the gold-standard methodology for analyzing a data set [2, 3]. According to Howard Seltman (Carnegie Mellon University), “loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis” [4].
EDA is a fundamental early step after data collection (see Chap. 11) and pre-processing (see Chap. 12), in which the data is simply visualized, plotted and manipulated, without any assumptions, in order to help assess the quality of the data and to inform model building. “Most EDA techniques are graphical in nature with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to explore, and graphics gives the analysts unparalleled power to do so, while being ready to gain insight into the data. There are many ways to categorize the many EDA techniques” [5].
The interested reader will find further information in the textbooks of Hill and Lewicki [ 6 ] or the NIST/SEMATECH e-Handbook [ 1 ]. Relevant R packages are available on the CRAN website [ 7 ].
Maximize insight into the database/understand the database structure;
Visualize potential relationships (direction and magnitude) between exposure and outcome variables;
Detect outliers and anomalies (values that are significantly different from the other observations);
Develop parsimonious models (a predictive or explanatory model that performs with as few exposure variables as possible) or preliminary selection of appropriate models;
Extract and create clinically relevant variables.
- Graphical or non-graphical methods
- Univariate (only one variable, exposure or outcome) or multivariate (several exposure variables alone or with an outcome variable) methods.
15.2. Part 1—Theoretical Concepts
15.2.1. Suggested EDA Techniques
Tables 15.1 and 15.2 suggest a few EDA techniques depending on the type of data and the objective of the analysis.
Suggested EDA techniques depending on the type of data
Most useful EDA techniques depending on the objective
15.2.2. Non-graphical EDA
These non-graphical methods will provide insight into the characteristics and the distribution of the variable(s) of interest.
Univariate Non-graphical EDA
Tabulation of Categorical Data (Tabulation of the Frequency of Each Category)
A simple univariate non-graphical EDA method for categorical variables is to build a table containing the count and the fraction (or frequency) of data of each category. An example of tabulation is shown in the case study (Table 15.3 ).
Example of tabulation table
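The chapter's tabulation code is in R (see the code appendix); as a language-neutral sketch of the same count-and-fraction table, pandas can be used. The category labels below are invented for illustration:

```python
import pandas as pd

# Hypothetical categorical variable (ICU service type), for illustration.
service = pd.Series(["MICU", "SICU", "MICU", "CCU", "MICU", "SICU"])

table = pd.DataFrame({
    "count": service.value_counts(),                  # count per category
    "fraction": service.value_counts(normalize=True)  # frequency per category
})
print(table)
```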
Characteristics of Quantitative Data: Central Tendency, Spread, Shape of the Distribution (Skewness, Kurtosis)
Sample statistics express the characteristics of a sample using a limited set of parameters. They are generally seen as estimates of the corresponding population parameters from which the sample comes from. These characteristics can express the central tendency of the data (arithmetic mean, median, mode), its spread (variance, standard deviation, interquartile range, maximum and minimum value) or some features of its distribution (skewness, kurtosis). Many of those characteristics can easily be seen qualitatively on a histogram (see below). Note that these characteristics can only be used for quantitative variables (not categorical).
Central tendency parameters
The arithmetic mean, often simply called the mean, is the sum of all values divided by the number of values. The median is the middle value in a list containing all the values sorted. Because the median is affected little by extreme values and outliers, it is said to be more “robust” than the mean (Fig. 15.1).
Symmetrical versus asymmetrical (skewed) distribution, showing mode, mean and median
When calculated on the entirety of the data of a population (which rarely occurs), the variance σ 2 is obtained by dividing the sum of squares by n, the size of the population.
The sample formula for the variance of observed data conventionally has n − 1 in the denominator instead of n to achieve the property of “unbiasedness”, which roughly means that when calculated for many different random samples from the same population, the average should match the corresponding population quantity (here σ²); s² is an unbiased estimator of the population variance σ²:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad (15.1)$$
The standard deviation is simply the square root of the variance. Therefore it has the same units as the original data, which helps make it more interpretable.
The sample standard deviation is usually represented by the symbol s. For a theoretical Gaussian distribution, mean plus or minus 1, 2 or 3 standard deviations holds 68.3, 95.4 and 99.7 % of the probability density, respectively.
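The chapter's code appendix is in R; as an illustration only, Eq. (15.1) can be computed by hand in NumPy and checked against the library's ddof=1 variance (the data values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = x.size
xbar = x.mean()
s2 = np.sum((x - xbar) ** 2) / (n - 1)  # unbiased sample variance, Eq. (15.1)
s = np.sqrt(s2)                          # sample standard deviation

# NumPy's var() divides by n by default; ddof=1 gives the sample version.
assert np.isclose(s2, x.var(ddof=1))
print(s2, s)
```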
Interquartile range (IQR)
The IQR is calculated using the boundaries of data situated between the 1st and the 3rd quartiles. Please refer to Chap. 14 “Noise versus Outliers” for further detail about the IQR.

$$IQR = Q_3 - Q_1 \qquad (15.2)$$
In the same way that the median is more robust than the mean, the IQR is a more robust measure of spread than variance and standard deviation and should therefore be preferred for small or asymmetrical distributions.
- Symmetrical distribution (not necessarily normal) and N > 30 : express results as mean ± standard deviation.
- Asymmetrical distribution or N < 30 or evidence for outliers: use median ± IQR, which are more robust.
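A quick numeric illustration of Eq. (15.2) and of the IQR's robustness (NumPy sketch; the data are made up, with one deliberate outlier):

```python
import numpy as np

x = np.array([3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 12.0, 95.0])  # 95 is an outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1  # Eq. (15.2)

# The median and IQR barely react to the outlier, unlike the mean
# and standard deviation.
print("median =", np.median(x), "IQR =", iqr)
print("mean   =", x.mean(), "sd  =", x.std(ddof=1))
```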
Skewness is a measure of a distribution’s asymmetry. Kurtosis is a summary statistic communicating information about the tails (the smallest and largest values) of the distribution. Both quantities can be used as a means to communicate information about the distribution of the data when graphical methods cannot be used. More information about these quantities can be found in [9].
We provide as a reference some of the common functions in R language for generating summary statistics relating to measures of central tendency (Table 15.4 ).
Main R functions for basic measure of central tendencies and variability
Testing the Distribution
Several non-graphical methods exist to assess the normality of a data set (whether it was sampled from a normal distribution), like the Shapiro-Wilk test for example. Please refer to the function called “Distribution” in the GitHub repository for this book (see code appendix at the end of this Chapter).
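The “Distribution” function referenced above is R code; as an illustration of the same idea, SciPy exposes the Shapiro-Wilk test directly (synthetic data, for demonstration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# H0: the data were sampled from a normal distribution.
w1, p1 = stats.shapiro(normal_sample)
w2, p2 = stats.shapiro(skewed_sample)
print(f"normal sample: W = {w1:.3f}, p = {p1:.3f}")
print(f"skewed sample: W = {w2:.3f}, p = {p2:.2e}")  # tiny p: reject normality
```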
Several statistical methods for outlier detection fall into EDA techniques, like Tukey’s method, Z-score, studentized residuals, etc [ 8 ]. Please refer to the Chap. 14 “Noise versus Outliers” for more detail about this topic.
Multivariate Non-graphical EDA
Cross-tabulation represents the basic bivariate non-graphical EDA technique. It is an extension of tabulation that works for categorical data and quantitative data with only a few variables. For two variables, build a two-way table with column headings matching the levels of one variable and row headings matching the levels of the other variable, then fill in the counts of all subjects that share a pair of levels. The two variables may be both exposure, both outcome variables, or one of each.
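As a sketch of the two-way table construction (pandas rather than the book's R; the exposure/outcome values below are invented):

```python
import pandas as pd

# Hypothetical exposure/outcome pair, for illustration only.
df = pd.DataFrame({
    "iac":      ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
    "survived": ["yes", "no", "yes", "yes", "yes", "no", "yes", "yes"],
})

# Two-way table: one variable per axis, counts of subjects in the cells.
two_way = pd.crosstab(df["iac"], df["survived"])
print(two_way)
```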
Covariance and Correlation
Covariance and correlation measure the degree of the relationship between two random variables and express how much they change together (Fig. 15.2 ).
Examples of covariance for three different data sets
The covariance is computed as follows:

$$cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (15.3)$$

where $x$ and $y$ are the variables, $n$ the number of data points in the sample, $\bar{x}$ the mean of the variable $x$ and $\bar{y}$ the mean of the variable $y$.
A positive covariance means the variables are positively related (they move together in the same direction), while a negative covariance means the variables are inversely related. A problem with covariance is that its value depends on the scale of the values of the random variables. The larger the values of x and y, the larger the covariance. It makes it impossible for example to compare covariances from data sets with different scales (e.g. pounds and inches). This issue can be fixed by dividing the covariance by the product of the standard deviation of each random variable, which gives Pearson’s correlation coefficient.
Correlation is therefore a scaled version of covariance, used to assess the linear relationship between two variables, and is calculated using the formula below:

$$Cor(x, y) = \frac{Cov(x, y)}{s_x s_y} \qquad (15.4)$$

where $Cov(x, y)$ is the covariance between $x$ and $y$ and $s_x$, $s_y$ are the sample standard deviations of $x$ and $y$.
The significance of the correlation coefficient between two normally distributed variables can be evaluated using Fisher’s z transformation (see the cor.test function in R for more details). Other tests exist for measuring the non-parametric relationship between two variables, such as Spearman’s rho or Kendall’s tau.
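Equations (15.3) and (15.4) can be checked numerically; the sketch below (synthetic data, NumPy rather than the book's R) computes both by hand and compares them with the library built-ins:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(scale=0.5, size=1_000)  # positively related to x

# Sample covariance, Eq. (15.3), and Pearson correlation, Eq. (15.4).
cov = np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)
cor = cov / (x.std(ddof=1) * y.std(ddof=1))

# Cross-check against NumPy's built-ins.
assert np.isclose(cov, np.cov(x, y)[0, 1])
assert np.isclose(cor, np.corrcoef(x, y)[0, 1])
print(cov, cor)
```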
15.2.3. Graphical EDA
Univariate Graphical EDA
Histograms are among the most useful EDA techniques, and allow you to gain insight into your data, including distribution, central tendency, spread, modality and outliers.
Histograms are bar plots of counts versus subgroups of an exposure variable. Each bar represents the frequency (count) or proportion (count divided by total count) of cases for a range of values. The range of data for each bar is called a bin. Histograms give an immediate impression of the shape of the distribution (symmetrical, uni/multimodal, skewed, outliers…). The number of bins heavily influences the final aspect of the histogram; a good practice is to try different values, generally from 10 to 50. Some examples of histograms are shown below as well as in the case studies. Please refer to the function called “Density” in the GitHub repository for this book (see code appendix at the end of this Chapter) (Figs. 15.3 and 15.4 ).
Example of histogram
Example of histogram with density estimate
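A headless sketch of the binning step described above: NumPy's histogram computes the same counts that matplotlib's plt.hist would draw, so the effect of the bin count can be inspected numerically (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=0.0, scale=1.0, size=2_000)

# The number of bins changes the picture: try several values (10 to 50).
for bins in (10, 30, 50):
    counts, edges = np.histogram(data, bins=bins)
    print(bins, "bins -> tallest bar:", counts.max())
# With matplotlib available, plt.hist(data, bins=30) draws the histogram itself.
```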
Histograms also make it possible to confirm that an operation on the data was successful. For example, if you need to log-transform a data set, it is interesting to plot the histogram of the distribution of the data before and after the operation (Fig. 15.5).
Example of the effect of a log transformation on the distribution of the dataset
Histograms are interesting for finding outliers. For example, pulse oximetry can be expressed in fractions (range between 0 and 1) or percentage, in medical records. Figure 15.6 is an example of a histogram showing the distribution of pulse oximetry, clearly showing the presence of outliers expressed in a fraction rather than as a percentage.
Distribution of pulse oximetry
Stem and leaf plots (also called stem plots) are a simple substitute for histograms. They show all data values and the shape of the distribution. For an example, please refer to the function called “Stem Plot” in the GitHub repository for this book (see code appendix at the end of this Chapter) (Fig. 15.7).
Example of stem plot
Boxplots are interesting for representing information about the central tendency, symmetry, skew and outliers, but they can hide some aspects of the data such as multimodality. Boxplots are an excellent EDA technique because they rely on robust statistics like median and IQR.
Figure 15.8 shows an annotated boxplot which explains how it is constructed. The central rectangle is limited by Q1 and Q3, with the middle line representing the median of the data. The whiskers are drawn, in each direction, to the most extreme point that is less than 1.5 IQR beyond the corresponding hinge. Values beyond 1.5 IQR are considered outliers.
Example of boxplot with annotations
The “outliers” identified by a boxplot, which could be called “boxplot outliers” are defined as any points more than 1.5 IQRs above Q3 or more than 1.5 IQRs below Q1. This does not by itself indicate a problem with those data points. Boxplots are an exploratory technique, and you should consider designation as a boxplot outlier as just a suggestion that the points might be mistakes or otherwise unusual. Also, points not designated as boxplot outliers may also be mistakes. It is also important to realize that the number of boxplot outliers depends strongly on the size of the sample. In fact, for data that is perfectly normally distributed, we expect 0.70 % (about 1 in 140 cases) to be “boxplot outliers”, with approximately half in either direction.
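The 1.5 IQR fence rule described above is easy to compute directly; a NumPy sketch with made-up data (one deliberate outlier):

```python
import numpy as np

x = np.array([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 20.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# "Boxplot outliers": any point beyond 1.5 IQR from the hinges.
outliers = x[(x < lower_fence) | (x > upper_fence)]
print(outliers)
```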
2D Line Plot
2D line plots represent graphically the values of an array on the y-axis, at regular intervals on the x-axis (Fig. 15.9 ).
Example of 2D line plot
Probability Plots (Quantile-Normal Plot/QN Plot, Quantile-Quantile Plot/QQ Plot)
Probability plots are a graphical test for assessing if some data follows a particular distribution. They are most often used for testing the normality of a data set, as many statistical tests have the assumption that the exposure variables are approximately normally distributed. These plots are also used to examine residuals in models that rely on the assumption of normality of the residuals (ANOVA or regression analysis for example).
The interpretation of a QN plot is visual (Fig. 15.10 ): either the points fall randomly around the line (data set normally distributed) or they follow a curved pattern instead of following the line (non-normality). QN plots are also useful to identify skewness, kurtosis, fat tails, outliers, bimodality etc.
Example of QQ plot
Besides the probability plots, there are many quantitative statistical tests (not graphical) for testing for normality, such as Pearson Chi 2 , Shapiro-Wilk, and Kolmogorov-Smirnov.
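As a sketch of a QN plot's underlying computation (SciPy rather than the chapter's R; synthetic normal data), probplot returns the quantile pairs the plot would draw, plus a fit whose correlation r is close to 1 when the points lie along the reference line:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=500)

# probplot computes the (theoretical, observed) quantile pairs of a QN plot;
# r near 1 means the points follow the line, consistent with normality.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")
print(round(r, 4))
# With matplotlib: stats.probplot(x, dist="norm", plot=plt.gca()) draws it.
```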
Deviation of the observed distribution from normal makes many powerful statistical tools useless. Note that some data sets can be transformed to a more normal distribution, in particular with log-transformation and square-root transformations. If a data set is severely skewed, another option is to discretize its values into a finite set.
Multivariate Graphical EDA
Representing several boxplots side by side allows easy comparison of the characteristics of several groups of data (example Fig. 15.11 ). An example of such boxplot is shown in the case study.
Side-by-side boxplot showing the cardiac index for five levels of Positive end-expiratory pressure (PEEP)
Scatterplots are built using two continuous, ordinal or discrete quantitative variables (Fig. 15.12). Each data point's coordinates correspond to the values of the two variables. Scatterplots can encode up to five dimensions by additionally varying the data points' size, shape or color.
Scatterplot showing an example of actual mortality per rate of predicted mortality
Scatterplots can also be used to represent high-dimensional data in 2 or 3D (Fig. 15.13), using t-distributed stochastic neighbor embedding (t-SNE) or principal component analysis (PCA). t-SNE and PCA are dimensionality reduction techniques used to represent a complex data set in two (t-SNE) or more (PCA) dimensions.
3D representation of the first three dimension of a PCA
For binary variables (e.g. 28-day mortality vs. SOFA score), 2D scatterplots are not very helpful (Fig. 15.14, left). By dividing the data set into groups (in our example: one group per SOFA point) and plotting the average value of the outcome in each group, scatterplots become a very powerful tool, capable, for example, of identifying a relationship between a variable and an outcome (Fig. 15.14, right).
Graphs of SOFA versus mortality risk
Curve fitting is one way to quantify the relationship between two variables or the change in values over time (Fig. 15.15 ). The most common method for curve fitting relies on minimizing the sum of squared errors (SSE) between the data and the fitted function. Please refer to the “Linear Fit” function to create linear regression slopes in R.
Example of linear regression
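A minimal least-squares sketch (NumPy rather than the chapter's R "Linear Fit" function; the data are synthetic points around a known line):

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy line

# Least-squares fit: minimizes the sum of squared errors (SSE) between
# the data and the fitted function (here, a straight line).
slope, intercept = np.polyfit(x, y, deg=1)
sse = float(np.sum((y - (slope * x + intercept)) ** 2))
print(round(slope, 2), round(intercept, 2), round(sse, 2))
```

The fitted slope and intercept recover the true values (3 and 1) up to noise.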
More Complicated Relationships
Many real-life phenomena are not adequately explained by a straight-line relationship. An ever-increasing set of methods and algorithms exists to deal with this issue. Among the most common:
- Adding transformed explanatory variables, for example, adding x 2 or x 3 to the model.
- Using other algorithms to handle more complex relationships between variables (e.g., generalized additive models, spline regression, support vector machines, etc.).
Heat Maps and 3D Surface Plots
Heat maps are simply a 2D grid built from a 2D array, whose color depends on the value of each cell. The data set must correspond to a 2D array whose cells contain the values of the outcome variable. This technique is useful when you want to represent the change of an outcome variable (e.g. length of stay) as a function of two other variables (e.g. age and SOFA score).
The color mapping can be customized (e.g. rainbow or grayscale). Interestingly, the Matlab function imagesc scales the data to the full colormap range. Their 3D equivalent is mesh plots or surface plots (Fig. 15.16 ).
Heat map ( left ) and surface plot ( right )
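The 2D-array structure behind a heat map can be sketched without plotting at all; every value below is invented purely for illustration (a toy "length of stay" outcome on an age band × SOFA band grid):

```python
import numpy as np

# Toy outcome (e.g. length of stay) on a grid of age band x severity band.
age_bands = np.arange(20, 90, 10)   # row variable: 20, 30, ..., 80
sofa_bands = np.arange(0, 15, 3)    # column variable: 0, 3, ..., 12
grid = np.add.outer((age_bands - 20) / 10.0, sofa_bands / 3.0)

print(grid.shape)  # one cell per (age band, SOFA band) pair
# With matplotlib, plt.imshow(grid, cmap="viridis") colors each cell by value,
# much like Matlab's imagesc, which scales the data to the full colormap range.
```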
15.3. Part 2—Case Study
This case study refers to the research that evaluated the effect of the placement of indwelling arterial catheters (IACs) in hemodynamically stable patients with respiratory failure in intensive care, from the MIMIC-II database.
- The categorical data was first tabulated.
- Summary statistics were then generated to describe the variables of interest.
- Graphical EDA was used to generate histograms to visualize the data of interest.
15.3.1. Non-graphical EDA
To analyze, visualize and test for association or independence of categorical variables, they must first be tabulated. When generating tables, any missing data will be counted in a separate “NA” (“Not Available”) category. Please refer to the Chap. 13 “Missing Data” for approaches in managing this problem. There are several methods for creating frequency or contingency tables in R, such as for example, tabulating outcome variables for mortality, as demonstrated in the case study. Refer to the “Tabulate” function found in the GitHub repository for this book (see code appendix at the end of this Chapter) for details on how to compute frequencies of outcomes for different variables.
Multiple statistical tests are available in R, and we refer the reader to Chap. 16 “Data Analysis” for additional information on the use of relevant tests in R. For an example of a simple Chi-squared test, please refer to the “Chi-squared” function found in the GitHub repository for this book (see code appendix at the end of this Chapter). In our example, the hypothesis of independence between expiration in ICU and IAC is accepted (p > 0.05). On the contrary, the hypothesis of independence between day-28 mortality and IAC is rejected.
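The “Chi-squared” function referenced above is R code; as an illustration of the same test, SciPy's chi2_contingency operates on a tabulated two-way table. The counts below are invented, not the MIMIC-II values:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 table: rows = IAC placement (yes/no),
# columns = 28-day mortality (died/survived).
table = np.array([[300, 100],
                  [250, 150]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
# p < 0.05 -> reject the hypothesis of independence of the two variables.
```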
Summary statistics as described above include frequency, mean, median, mode, range, interquartile range, and maximum and minimum values. An extract of summary statistics of patient demographics, vital signs, laboratory results and comorbidities is shown in Table 15.5. Please refer to the function called “EDA Summary” in the GitHub repository for this book (see code appendix at the end of this Chapter).
Comparison between the two study cohorts (subsample of variables only)
When separate cohorts are generated based on a common variable, in this case the presence of an indwelling arterial catheter, summary statistics are presented for each cohort.
It is important to identify any differences in subject baseline characteristics. The benefits of this are two-fold: first it is useful to identify potentially confounding variables that contribute to an outcome in addition to the predictor (exposure) variable. For example, if mortality is the outcome variable then differences in severity of illness between cohorts may wholly or partially account for any variance in mortality. Identifying these variables is important as it is possible to attempt to control for these using adjustment methods such as multivariable logistic regression. Secondly, it may allow the identification of variables that are associated with the predictor variable enriching our understanding of the phenomenon we are observing.
The analytical extension of identifying any differences using medians, means and data visualization is to test for statistically significant differences in any given subject characteristic using for example Wilcoxon-Rank sum test. Refer to Chap. 16 for further details in hypothesis testing.
15.3.2. Graphical EDA
Graphical representation of the dataset of interest is the principal feature of exploratory analysis.
Histograms are considered the backbone of EDA for continuous data. They can be used to help the researcher understand continuous variables and provide key information such as their distribution. As outlined in the chapter on noise and outliers, the histogram allows the researcher to visualize where the bulk of the data points lie between the maximum and minimum values. Histograms can also allow a visual comparison of a variable between cohorts. For example, to compare severity of illness between patient cohorts, histograms of SOFA score can be plotted side by side (Fig. 15.17). An example of this is given in the code for this chapter using the “side-by-side histogram” function (see code appendix at the end of this Chapter).
Histograms of SOFA scores by intra-arterial catheter status
Boxplot and ANOVA
Outside the scope of this case study, the user may be interested in analysis of variance. When performing EDA, an effective way to visualize this is through the use of boxplots. For example, to explore differences in blood pressure based on severity of illness, subjects could be categorized by severity of illness and their baseline blood pressure values plotted (Fig. 15.18). Please refer to the function called “Box Plot” in the GitHub repository for this book (see code appendix at the end of this Chapter).
Side-by-side boxplot of MAP for different levels of severity at admission
The box plot shows a few outliers which may be interesting to explore individually, and shows that people with a high SOFA score (>10) tend to have lower blood pressure than people with a lower SOFA score.
In summary, EDA is an essential step in many types of research but is of particular use when analyzing electronic health care records. The tools described in this chapter should allow the researcher to better understand the features of a dataset and also to generate novel hypotheses.
Always start by exploring a dataset with an open mind for discovery.
EDA makes it possible to better apprehend the features and possible issues of a dataset.
EDA is a key step in generating research hypotheses.
- Code Appendix
The code used in this chapter is available in the GitHub repository for this book: https://github.com/MIT-LCP/critical-data-book . Further information on the code is available from this website.
Electronic supplementary material : The online version of this chapter (doi: 10.1007/978-3-319-43742-2_15 ) contains supplementary material, which is available to authorized users.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License ( http://creativecommons.org/licenses/by-nc/4.0/ ), which permits any noncommercial use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.
What Is Exploratory Data Analysis?
Exploratory data analysis is one of the first steps in the data analytics process. In this post, we explore what EDA is, why it’s important, and a few techniques worth familiarizing yourself with.
Data analytics requires a mixed range of skills. This includes practical expertise, such as knowing how to scrape and store data. It also requires more nuanced problem-solving abilities, such as how to analyze data and draw conclusions from it. As a statistical approach, exploratory data analysis (or EDA) is vital for learning more about a new dataset.
Applied early on in the data analytics process, EDA can help you learn a great deal about a dataset’s inherent attributes and properties.
Interested in learning some data analytics skills? Try out this free data short course to see if you like it.
In this post, we’ll introduce the topic in more detail, answering the following questions:
- What is exploratory data analysis?
- Why is exploratory data analysis important?
- What are the underlying principles of exploratory data analysis?
- What are some techniques you can use for exploratory data analysis?
First up, though…
1. What is exploratory data analysis?
In data analytics, exploratory data analysis is how we describe the practice of investigating a dataset and summarizing its main features. It’s a form of descriptive analytics.
EDA aims to spot patterns and trends, to identify anomalies, and to test early hypotheses. Although exploratory data analysis can be carried out at various stages of the data analytics process, it is usually conducted before a firm hypothesis or end goal is defined.
In general, EDA focuses on understanding the characteristics of a dataset before deciding what we want to do with that dataset.
Exploratory data analytics often uses visual techniques, such as graphs, plots, and other visualizations. This is because our natural pattern-detecting abilities make it much easier to spot trends and anomalies when they’re represented visually.
As a simple example, outliers (or data points that skew a trend) stand out much more immediately on a scatter graph than they do in columns on a spreadsheet.
Source: Indoor-Fanatiker (derivative work by Indoor-Fanatiker), CC0, via Wikimedia Commons
In the image, the outlier (in red) is immediately clear. Even if you’re new to data analytics, this approach will seem familiar. If you ever plotted a graph in math or science at school in order to infer information about a dataset, then you’ve carried out a basic EDA.
The American mathematician John Tukey formally introduced the concept of exploratory data analysis in 1961, at a time when scientists and engineers were working on new data-driven problems related to early computing. The idea of summarizing a dataset’s key characteristics later fed into the development of statistical programming languages such as S and, later, R.
Since then, EDA has been widely adopted as a core tenet of data analytics and data science more generally. It is now considered a common—indeed, indispensable—part of the data analytics process.
Want to try your hand at exploratory data analysis? Try this free, practical tutorial on exploratory data analysis as part of our beginner’s short course. You’ll calculate descriptive statistics for a real dataset and create pivot tables.
2. Why is exploratory data analysis important?
At this stage, you might be asking yourself: why bother carrying out an EDA?
After all, data analytics today is far more sophisticated than it was in the 1960s. We have algorithms that can automate so many tasks. Surely it’s easier (and even preferable) to skip this step of the process altogether?
In truth, it has been shown time and again that effective EDA provides invaluable insights that an algorithm cannot. You can think of this a bit like running a document through a spellchecker versus reading it yourself. While software is useful for spotting typos and grammatical errors, only a critical human eye can detect the nuance.
An EDA is similar in this respect: tools can help you, but it takes your own intuition to make sense of the data. This personal, in-depth insight will support detailed data analysis further down the line.
Specifically, some key benefits of an EDA include:
Spotting missing and incorrect data
As part of the data cleaning process, an initial data analysis (IDA) can help you spot any structural issues with your dataset.
You may be able to fix these, or you might find that you need to reprocess the data or collect new data entirely. While this can be a nuisance, it’s better to know upfront, before you dive in with a deeper analysis.
Understanding the underlying structure of your data
Properly mapping your data ensures that you maintain high data quality when transferring it from its source to your database, spreadsheet, data warehouse, etc. Understanding how your data is structured means you can avoid mistakes from creeping in.
Testing your hypothesis and checking assumptions
Before diving in with a full analysis, it’s important to make sure any assumptions or hypotheses you’re working on stand up to scrutiny.
While an EDA won’t give you all the details, it will help you spot if you’re inferring the right outcomes based on your understanding of the data. If not, then you know that your assumptions are wrong, or that you are asking the wrong questions about the dataset.
Calculating the most important variables
When carrying out any data analysis, it’s necessary to identify the importance of different variables.
This includes how they relate to each other. For example, which independent variables affect which dependent variables? Determining this early on will help you extract the most useful information later on.
Creating the most efficient model
When carrying out your full analysis, you’ll need to remove any extraneous information. This is because needless additional data can either skew your results or simply obscure key insights with unnecessary noise.
In pursuit of your goal, aim to include only the variables that are necessary. EDA helps you identify which variables carry useful information and which you can set aside.
Determining error margins
EDA isn’t just about finding helpful information. It’s also about determining which data might lead to unavoidable errors in your later analysis.
Knowing which data will impact your results helps you to avoid wrongly accepting false conclusions or incorrectly labeling an outcome as statistically significant when it isn’t.
Identifying the most appropriate statistical tools to help you
Perhaps the most practical outcome of your EDA is that it will help you determine which techniques and statistical models will help you get what you need from your dataset.
For instance, do you need to carry out a predictive analysis or a sentiment analysis? An EDA will help you decide. You can learn about different types of data analysis in this guide.
As is hopefully clear by now, intuition and reflection are key skills for carrying out exploratory data analysis. While EDA can involve executing defined tasks, interpreting the results of these tasks is where the real skill lies.
3. What are the underlying principles of exploratory data analysis?
Now we know what exploratory data analysis is and why it’s important, how exactly does it work?
In short, exploratory data analysis considers what to look for, how to look for it, and, finally, how to interpret what we discover. At its core, EDA is more of an attitude than it is a step-by-step process. Exploring data with an open mind tends to reveal its underlying nature far more readily than making assumptions about the rules we think (or want) it to adhere to.
In data analytics terms, we can generally say that exploratory data analysis is a qualitative investigation, not a quantitative one. This means that it involves looking at a dataset’s inherent qualities with an inquisitive mindset. Usually, it does not attempt to make cold measurements or draw insights about a dataset’s content. This comes later on.
You’d be forgiven for thinking this sounds a bit esoteric for a scientific field like data analytics! But don’t worry. There are some practical principles to exploratory data analysis that can help you proceed. A key one of these is known as the five-number summary.
What is the five-number summary?
The five-number summary is a set of five descriptive statistics. Simple though these are, they make a useful starting point for any exploratory data analysis.
The aim of the five-number summary is not to make a value judgment on which statistics are the most important or appropriate, but to offer a concise overview of how different observations in the dataset are distributed. This allows us to ask more nuanced questions about the data, such as ‘why are the data distributed this way?’ or ‘what factors might impact the shape of these data?’ These sorts of questions are vital for obtaining insights that will help us determine the goals for our later analysis.
The five-number summary includes the five most common sample percentiles:
- The sample minimum (the smallest observation)
- The lower quartile (the median of the lower half of the data)
- The median (the middle value)
- The upper quartile (the median of the upper half of the data)
- The sample maximum (the largest observation)
The lower and upper quartiles are essentially the medians of the lower and upper halves of the dataset. These can be used to determine the interquartile range, which spans the middle 50% of the dataset.
In turn, this helps describe the overall spread of the data, allowing you to identify any outliers. These five statistics can be easily shown using a box plot.
Source: Dcbmariano, CC BY-SA 4.0 via Wikimedia Commons
The five-number summary can be used to determine a great number of additional attributes about a given dataset. This is why it is such a foundational part of data exploration.
To make matters easier, many programming languages, including R and Python, have inbuilt functions for determining the five-number summary and producing the corresponding box plots.
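As a rough illustration, here is how the five-number summary and the common 1.5 × IQR outlier rule might be computed in Python using only the standard library. The sample readings below are invented for demonstration, and the `inclusive` quartile method is one of several conventions, so results may differ slightly from a textbook lower/upper-half-median definition:

```python
# A minimal sketch: five-number summary plus the 1.5 * IQR outlier rule,
# using only Python's standard library. The sample data are made up.
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def iqr_outliers(data):
    """Flag values lying more than 1.5 * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in data if v < lo or v > hi]

readings = [7, 15, 36, 39, 40, 41, 95]  # 95 is a deliberate outlier
print(five_number_summary(readings))
print(iqr_outliers(readings))
```

In R, `fivenum()` and `boxplot()` play the same role; in Python, `matplotlib.pyplot.boxplot` can draw the corresponding box plot directly from the raw values.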
4. What are some techniques you can use for exploratory data analysis?
As we’ve already explained, most (though not all) EDA techniques are graphical in nature.
Graphical tools, like the box plot described previously, are very helpful for revealing a dataset’s hidden secrets. What follows are some common techniques for carrying out exploratory data analysis. Many of these rely on visualizations that can be easily created using tools like R, Python, S-Plus, or KNIME, to name a few popular ones.
Univariate analysis
Source: Michiel1972, CC BY-SA 3.0 via Wikimedia Commons
Univariate analysis is one of the simplest forms of data analysis. It looks at the distribution of a single variable (or column of data) at a time.
While univariate analysis does not strictly need to be visual, it commonly uses visualizations such as tables, pie charts, histograms, or bar charts.
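A non-visual univariate check can be as simple as a frequency table for a single column. This is a hedged sketch using an invented categorical column (the `workclass` values are hypothetical):

```python
# A minimal sketch of tabular univariate analysis: frequency counts and
# proportions for one (invented) categorical column.
from collections import Counter

workclass = ["Private", "Private", "State-gov", "Private", "Local-gov", "Private"]
counts = Counter(workclass)

# Print each category with its count and share of the column.
for value, count in counts.most_common():
    print(f"{value}: {count} ({count / len(workclass):.0%})")
```

The same counts are exactly what a bar chart or pie chart of this column would visualize.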
Multivariate analysis
Source: Public domain via Wikimedia Commons
Multivariate analysis looks at the distribution of two or more variables and explores the relationship between them. Most multivariate analyses compare two variables at a time (bivariate analysis).
However, it sometimes involves three or more variables. Either way, it is good practice to carry out univariate analysis on each of the variables before doing a multivariate EDA.
Any plot or graph that has two or more data points can be used to create a multivariate visualization (for example, a line graph that plots speed against time).
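As a small bivariate sketch, the strength of a linear relationship between two variables can be quantified with Pearson’s correlation coefficient. The time and speed readings below are invented for illustration:

```python
# A minimal bivariate EDA sketch: Pearson's r for two invented variables.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

time_s = [1, 2, 3, 4, 5]       # hypothetical elapsed time
speed = [12, 19, 31, 42, 50]   # hypothetical speed readings
print(round(pearson_r(time_s, speed), 3))  # close to +1: strong positive trend
```

An r near +1 or -1 suggests the scatter plot will show points lying near a straight line; an r near 0 suggests no linear relationship.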
Classification or clustering analysis
Source: Chire, CC BY-SA 3.0 via Wikimedia Commons
Clustering analysis is when we place objects into groups based on their common properties.
It is similar to classification. The key difference is that classification involves grouping items using explicit, predefined classes (e.g. categorizing a dataset of people based on a range of their heights).
Clustering, meanwhile, involves grouping data based on what they implicitly tell us (such as whether someone’s height means they are highly likely, quite likely, or not at all likely to bang their head on a doorframe!).
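The distinction can be sketched in a few lines of Python: classification applies a fixed, predefined rule, while clustering lets groups emerge from the data. Both the 190 cm threshold and the toy one-dimensional two-means routine below are invented for illustration:

```python
# Classification: explicit, predefined classes (the 190 cm cut-off is arbitrary).
def classify_height(cm):
    return "tall" if cm >= 190 else "not tall"

# Clustering: groups emerge from the data itself. A toy 1-D two-means sketch;
# it assumes both groups stay non-empty during the iterations.
def two_means(values, iterations=10):
    c1, c2 = min(values), max(values)  # start the centroids at the extremes
    for _ in range(iterations):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

heights = [158, 160, 162, 185, 188, 191]
print(two_means(heights))  # splits into a shorter and a taller group
```

Note that the clustering routine was never told where the boundary lies; it discovers the two natural groups from the spacing of the values alone.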
Predictive analysis
Source: Berland, Public domain, via Wikimedia Commons
Although predictive analysis is commonly used in machine learning and AI to make (as the name suggests) predictions, it’s also popular for EDA.
In this context, it doesn’t always mean forecasting future events; it can simply mean using predictive methods, such as linear regression, to estimate unknown attributes (for example, using existing data to infer values for gaps in historical data).
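A hedged sketch of that idea: fit a least-squares line to the known points and use it to estimate a missing value. All the numbers below are invented, and the underlying relationship is deliberately simple:

```python
# Using simple linear regression to infer a missing value in a series.
# The data and the gap at x = 3 are invented for illustration.
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [0, 1, 2, 4, 5]             # the reading at x = 3 is missing
ys = [1.0, 3.0, 5.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
estimate_at_3 = slope * 3 + intercept  # plug the gap with the fitted line
print(estimate_at_3)
```

In practice you would reach for a library routine (e.g. `numpy.polyfit` or `statistics.linear_regression`) rather than hand-rolling the formula, but the principle is the same.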
These represent just a small handful of the techniques you can use for conducting an EDA. But they hopefully offer a taste of the kinds of approaches you can take when trying to better understand a dataset.
5. In summary
In this post, we’ve introduced the topic of exploratory data analysis, why it’s important, and some techniques you might want to familiarize yourself with before carrying one out. We’ve learned that exploratory data analysis:
- Summarizes the features and characteristics of a dataset.
- Is a philosophy, or approach, rather than a defined process.
- Often draws out insights using visualizations, e.g. graphs and plots.
- Is important for spotting errors, checking assumptions, identifying relationships between variables, and selecting the right data modeling tools.
- Builds on the five-number summary, namely: the sample minimum and maximum, the lower and upper quartile, and the median.
- Employs various techniques, such as univariate and multivariate analysis, clustering, and predictive analytics, to name a few.
To learn more about exploratory data analysis or to put it into context within the broader data analytics process, try our free, five-day data analytics short course. Otherwise, for more introductory data analytics topics, check out the following posts:
- What is web scraping? A complete beginner’s guide
- What is data cleaning and why does it matter?
- What’s the difference between quantitative and qualitative data?
Learn everything you need to know about exploratory data analysis, a method used to analyze and summarize data sets.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.
Specific statistical functions and techniques you can perform with EDA tools include:
- Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
- Univariate visualization of each field in the raw dataset, with summary statistics.
- Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
- Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
- K-means clustering, a clustering method from unsupervised learning in which data points are assigned to one of K groups (K being the number of clusters) based on their distance from each group’s centroid. The data points closest to a particular centroid are clustered under the same category. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression.
- Predictive models, such as linear regression, use statistics and data to predict outcomes.
There are four primary types of EDA:
- Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
- Univariate graphical. Non-graphical methods don’t give a full picture of the data, so graphical methods are also needed. Common types of univariate graphics include:
- Stem-and-leaf plots, which show all data values and the shape of the distribution.
- Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
- Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
- Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables through cross-tabulation or statistics.
- Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one variable and each bar within a group representing the levels of the other variable.
Other common types of multivariate graphics include:
- Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
- Multivariate chart, which is a graphical representation of the relationships between factors and a response.
- Run chart, which is a line graph of data plotted over time.
- Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
- Heat map, which is a graphical representation of data where values are depicted by color.
Some of the most common data science tools used to create an EDA include:
- Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle them for machine learning.
- R: An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science in developing statistical observations and data analysis.
For a deep dive into the differences between these approaches, check out " Python vs. R: What's the Difference? "
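For instance, a first pass at counting missing values per field might look like the sketch below in plain Python. The records and field names are invented; with pandas you would typically call `DataFrame.isnull().sum()` instead:

```python
# A minimal sketch: counting missing (None) values per field in a list of
# records. The records themselves are invented for illustration.
rows = [
    {"age": 39, "workclass": "Private"},
    {"age": None, "workclass": "Self-emp-inc"},
    {"age": 52, "workclass": None},
]

missing_per_field = {}
for row in rows:
    for field, value in row.items():
        if value is None:
            missing_per_field[field] = missing_per_field.get(field, 0) + 1

print(missing_per_field)
```

Knowing which fields have gaps, and how many, is what lets you choose between dropping rows, imputing values, or collecting more data before modeling.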
Use IBM Watson® Studio to determine whether the statistical techniques that you are considering for data analysis are appropriate.
IBM Watson® Studio provides an interface for analysts and data scientists to dig deeper into their datasets. This helps them to provide summary insights to their stakeholders and it also allows them to evaluate if the datasets are balanced enough to build meaningful models.
Assignment 4 Clustering and Classifying Images
DRAFT Guidelines for the Assignment:
In this assignment we will continue to work with materials and algorithms discussed in class. We will use Orange Data Mining to carry out this task.
The first exercise is a clustering exercise. You should choose a corpus of images (somewhere between 50 and 200 would be best). As we discussed in class, it is possible to choose particular images from a large dataset found at places like Kaggle, but the topic should be something that you know something about. In the second part you will classify them into subtypes. Choosing a dataset of images you are not very interested in, or one you don’t know enough about to categorize, will make the task difficult.
I suggest you use this part of the assignment to do some exploratory data analysis (EDA). Using the pre-set Orange Data Mining workflow I distributed (images2.ows), you can address the following questions:
- How does a built-in algorithm in ODM cluster the data?
- Do other algorithms give you different results?
- What features seem to be most characteristic of the different quadrants of the image plot?
- Using hierarchical clustering, isolate specific clusters to examine more closely.
In the second exercise you will classify the images into subtypes.
In your write up, you should include insights and ideas from the new book Distant Viewing (Arnold & Tilton).
Make sure that you include images from your analysis.