## 6.894: Interactive Data Visualization

Assignment 2: exploratory data analysis.

In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.

## Part 1: Data Selection

First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If working with a self-selected dataset, please check with the course staff to ensure it is appropriate for the course. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.

## Part 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform "sanity checks" for patterns you expect to see!
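If you work in a programming tool rather than Tableau, the first-phase questions can be approximated with a small profiling pass. The following is a minimal sketch using only the Python standard library; the function name `profile_columns` and the sample rows are hypothetical and stand in for your own dataset.

```python
from collections import Counter


def profile_columns(rows):
    """Summarize each column of a table given as a list of dicts."""
    summary = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                pass
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if present and len(numeric) == len(present):
            # Every present value parsed as a number: report its extent.
            info["min"] = min(numeric)
            info["max"] = max(numeric)
        else:
            # Otherwise report the most frequent value and its count.
            info["top"] = Counter(present).most_common(1)
        summary[col] = info
    return summary


# Hypothetical rows, for illustration only.
rows = [
    {"station": "A", "temp_c": "12.5"},
    {"station": "A", "temp_c": ""},
    {"station": "B", "temp_c": "15.0"},
]
summary = profile_columns(rows)
```

Here `summary["temp_c"]` reports one missing value and a numeric extent of 12.5 to 15.0, which is the kind of data quality note worth capturing before moving to the second phase.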

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.

## Final Deliverable

Your final submission should take the form of a Google Docs report – similar to a slide show or comic book – that consists of 10 or more captioned visualizations detailing your most important insights. Your "insights" can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures. We've annotated and graded this example to help you calibrate for the breadth and depth of exploration we're looking for.

Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and a descriptive caption (1-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail in each caption that anyone could read through your report and understand what you've learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image. To easily export images from Tableau, use the Worksheet > Export > Image... menu item.

The end of your report should include a brief summary of main lessons learned.

## Recommended Data Sources

To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:

World Bank Indicators, 1960–2017. The World Bank has tracked global human development through indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you're also welcome to browse and use the original data by indicator or by country. Click on an indicator category or country to download the CSV file.

Chicago Crimes, 2001–present (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

Daily Weather in the U.S., 2017. This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.

Social mobility in the U.S. Raj Chetty's group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.

The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp's Dataset License.
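Because you don't need all of a large dataset to answer interesting questions, it can help to read only the first records of a file. The sketch below assumes the JSON files store one object per line, which you should verify against the files you download; the helper name `read_first_records` and the `stars` field in the sample are illustrative only.

```python
import io
import json
from itertools import islice


def read_first_records(lines, limit):
    """Parse JSON objects from at most `limit` lines of text."""
    records = []
    for line in islice(lines, limit):
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Hypothetical in-memory stand-in for an open file handle.
sample = io.StringIO('{"stars": 4}\n{"stars": 5}\n{"stars": 1}\n')
subset = read_first_records(sample, 2)
```

With a real file, pass the handle returned by `open(path, encoding="utf-8")` in place of `sample`.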

## Additional Data Sources

If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!

- data.boston.gov - City of Boston Open Data
- MassData - State of Massachusetts Open Data
- data.gov - U.S. Government Open Datasets
- U.S. Census Bureau - Census Datasets
- IPUMS.org - Integrated Census & Survey Data from around the World
- Federal Election Commission - Campaign Finance & Expenditures
- Federal Aviation Administration - FAA Data & Research
- fivethirtyeight.com - Data and Code behind the Stories and Interactives
- Buzzfeed News
- Socrata Open Data
- 17 places to find datasets for data science projects

## Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau. Tableau provides a graphical interface focused on the task of visual data exploration. You will (with rare exceptions) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.

- Tableau - Desktop visual analysis software. Available for both Windows and macOS; register for a free student license.
- Data Transforms in Vega-Lite. A tutorial on the various built-in data transformation operators available in Vega-Lite.
- Data Voyager, a research prototype from the UW Interactive Data Lab, combines a Tableau-style interface with visualization recommendations. Use at your own risk!
- R, using the ggplot2 library or with R's built-in plotting functions.
- Jupyter Notebooks (Python), using libraries such as Altair or Matplotlib.

## Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!

## Graphical Tools

- Tableau Prep - Tableau provides basic facilities for data import, transformation & blending. Tableau Prep is a more sophisticated data preparation tool.
- Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
- OpenRefine - A free, open source tool for working with messy data.

## Programming Tools

- JavaScript data utilities and/or the Datalib JS library.
- Pandas - Data table and manipulation utilities for Python.
- dplyr - A library for data manipulation in R.
- Or, the programming language and tools of your choice...

The assignment score is out of a maximum of 10 points. Submissions that squarely meet the requirements will receive a score of 8. We will determine scores by judging the breadth and depth of your analysis, whether visualizations meet the expressiveness and effectiveness principles, and how well-written and synthesized your insights are.

We will use the following rubric to grade your assignment. Note, rubric cells may not map exactly to specific point scores.

## Submission Details

This is an individual assignment. You may not work in groups.

Your completed exploratory analysis report is due by noon on Wednesday 2/19. Submit a link to your Google Doc report using this submission form. Please double check your link to ensure it is viewable by others (e.g., try it in an incognito window).

Resubmissions. Resubmissions will be regraded by teaching staff, and you may earn back up to 50% of the points lost in the original submission. To resubmit this assignment, please use this form and follow the same submission process described above. Include a short one-paragraph description summarizing the changes from the initial submission. Resubmissions without this summary will not be regraded. Resubmissions will be due by 11:59pm on Saturday, 3/14. Slack days may not be applied to extend the resubmission deadline. The teaching staff will only begin to regrade assignments once the Final Project phase begins, so please be patient.



## Numeracy and Data Analysis

Introduction.

Data analysis is the process of analysing, transforming and modelling available information in order to reach results that support sound decision making (Chatfield, 2016). A number of techniques help to produce these results. This report discusses the mean, median, mode, range and standard deviation, and uses a linear forecasting model to calculate forecast values.


To illustrate the usefulness of these data analysis techniques, weather data for Chester was collected for 10 days in June during the 2017-18 period (Weather of Chester, 2018).

## 1) Table format

## 2) Steps to calculate

i) Mean:

The mean, also known as the average, is calculated by adding the values together and dividing the sum by the total number of observations. The steps to calculate the mean are:

- Return the average of the numbers.
- Where needed, determine the average of the numbers that meet a single criterion.

Formula to calculate the mean in Excel: =AVERAGE(Value1, Value2…)

Formula in statistics: $\bar{x} = \frac{\sum x}{N}$

So the mean of Temperature = 13.5 °C.
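The same calculation can be sketched in Python. The readings below are hypothetical values chosen for illustration, not the Chester data, and `temps_c` is an assumed name.

```python
from statistics import mean

# Hypothetical temperature readings in degrees Celsius.
temps_c = [9, 12, 15, 15, 17]

# Mean by the formula: sum of the values divided by the number of values.
manual_mean = sum(temps_c) / len(temps_c)

# The standard library gives the same result.
library_mean = mean(temps_c)
```

For these five readings the sum is 68, so both approaches give 13.6.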

ii) Median:

In simple terms, the median is the middle value of a group of numbers, separating the lower half from the upper half. When a data series has an odd number of values, the median is the actual middle element; when the series has an even number of values, the median is the average of the two middle elements (Pole, West and Harrison, 2018). The basic steps to calculate the median are:

- Arrange the data in a series from lowest to greatest.
- If the number of values is odd, take the middle value as the median; if it is even, take the average of the two middle values.

Formula to calculate the median in Excel: =MEDIAN(number1, number2…)

Formula in statistics: Median = $\left(\frac{n+1}{2}\right)$th term

So the median of Temperature = 16 °C.
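The odd and even cases can be sketched as follows. The function name `median_of` and the input lists are hypothetical illustrations, not the report's data.

```python
def median_of(values):
    """Return the median: middle value, or mean of the two middle values."""
    ordered = sorted(values)  # step 1: arrange from lowest to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return ordered[mid]
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median_of([17, 9, 15, 12, 15])  # sorted: 9, 12, 15, 15, 17
even_median = median_of([17, 9, 15, 12])     # sorted: 9, 12, 15, 17
```

The odd case returns the third of five sorted values, 15, and the even case returns the average of 12 and 15, which is 13.5.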

iii) Mode:

The mode is a kind of average defined as the most frequently occurring number within a given series of data. Depending on the nature of the data, a series may have no single mode, or it may have two modes (bimodal) or several modes (multimodal). The systematic steps to calculate the mode of a data series are:

- Collect the data series and arrange it in ascending order so that repeated values are easy to identify.
- The value that appears the greatest number of times in the series is the modal value of that data series.

Formula to calculate a single mode in Excel: =MODE(number1, number2…)

Formula to return multiple modes in Excel: =MODE.MULT(number1, number2…)

So the mode of Temperature = 15 °C.
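As a rough Python counterpart, the standard library offers `statistics.mode` for a single mode and `statistics.multimode` for series with several modes. The input lists below are hypothetical.

```python
from statistics import mode, multimode

# Hypothetical series with one most frequent value.
single = mode([9, 12, 15, 15, 17])

# Hypothetical bimodal series: 9 and 15 each appear twice.
several = multimode([9, 9, 15, 15, 17])
```

`single` is 15, and `several` is the list `[9, 15]`, with modes listed in the order they first appear.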

iv) Range:

The range is the spread of values between a minimum and a maximum value. In Excel, a cell range is identified by the reference of its upper-left cell and the reference of its lower-right cell. If separate cells are added to the selection, the range is called an irregular cell range. In Excel, the minimum and maximum values are included (Wang and Sun, 2015). The statistical range takes one simple step: subtract the lowest value in the series from the highest value.

Range is calculated by subtracting the minimum value from the maximum value.

Formula to calculate the maximum value: =MAX(number1, number2…), which gives 17

Formula to calculate the minimum value: =MIN(number1, number2…), which gives 9

Range for Temperature = MAX – MIN = 17 – 9 = 8 °C.
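The subtraction can be sketched in Python. The readings are hypothetical, chosen only so that the maximum and minimum match the 17 and 9 reported above.

```python
# Hypothetical readings whose extremes match the reported 17 and 9.
temps_c = [9, 12, 15, 15, 17]

highest = max(temps_c)
lowest = min(temps_c)

# Range = maximum value minus minimum value.
value_range = highest - lowest
```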

v) Standard Deviation:

In statistics, the standard deviation (SD), denoted by the Greek letter sigma (σ) or the Latin letter s, is an important measure of the amount of variation or dispersion within a data set. It is calculated as the square root of the variance, which is based on the difference between each data point and the mean. The further the data points are from the mean, the more spread out the data and the higher the standard deviation. The steps to calculate the standard deviation are:

- Arrange the data in a continuous series so that the result can be determined easily.
- Apply the formula that matches the nature of the series: =STDEV.S( ), where S denotes the sample SD, or =STDEV.P( ), where P stands for population.

Standard deviation for Temperature = 8 °C
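The difference between the sample and population formulas can be sketched in Python, where `statistics.stdev` divides by n - 1 (comparable to STDEV.S) and `statistics.pstdev` divides by n (comparable to STDEV.P). The readings are hypothetical and are not the Chester data.

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical temperature readings in degrees Celsius.
temps_c = [9, 12, 15, 15, 17]

n = len(temps_c)
avg = sum(temps_c) / n

# Sum of squared differences between each data point and the mean.
squared_diffs = sum((t - avg) ** 2 for t in temps_c)

population_sd = sqrt(squared_diffs / n)    # divide by n
sample_sd = sqrt(squared_diffs / (n - 1))  # divide by n - 1

# The library functions agree with the manual calculation.
library_population_sd = pstdev(temps_c)
library_sample_sd = stdev(temps_c)
```

For these readings the squared differences sum to 39.2, giving a population SD of 2.8 and a sample SD of about 3.13.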

## 4. Linear forecasting model

To determine the value of m in y = mx + c, assume the values c = 30, x = 10 and y = 20 and substitute them into the equation: 20 = 10m + 30, so m = (20 - 30) / 10 = -1.

- Step to calculate m: m is the slope of the line, found by rearranging the equation as m = (y - c) / x.
- Step to calculate c: c is the constant term, calculated as c = y - mx.
- Using the calculated m and c values, forecast the weather indicator for day 15 and day 23.

The temperature forecast for day 15 is 12.886 °C, and the temperature forecast for day 23 is 16.2213 °C.

This forecast is calculated using the formula FORECAST.LINEAR(x, known_y's, known_x's).
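A forecast of this kind fits a least-squares straight line through the known points and evaluates it at the new x. The sketch below shows that calculation in Python under stated assumptions: the function names `fit_line` and `forecast` are hypothetical, and the day and temperature values are invented, exactly linear data rather than the Chester readings.

```python
def fit_line(xs, ys):
    """Return the least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept from c = y - m*x evaluated at the means.
    c = mean_y - m * mean_x
    return m, c


def forecast(x, m, c):
    """Evaluate the fitted line at x."""
    return m * x + c


# Hypothetical data lying exactly on y = 2x + 8.
days = [1, 2, 3, 4, 5]
temps_c = [10, 12, 14, 16, 18]

m, c = fit_line(days, temps_c)
day_15 = forecast(15, m, c)
```

For this invented series the fit recovers m = 2 and c = 8, so the day 15 forecast is 38.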

## Conclusion

This report concludes that data analysis is a crucial tool for determining the values that support better decision making and forecasting. Different techniques help to extract accurate values from a given data series.

- Chatfield, C., 2016. The analysis of time series: an introduction. Chapman and Hall/CRC.
- Pole, A., West, M. and Harrison, J., 2018. Applied Bayesian forecasting and time series analysis. Chapman and Hall/CRC.
- Wang, D. and Sun, Z., 2015. Big data analysis and parallel load forecasting of electric power user side. Proceedings of the CSEE, 35(3), pp.527-537.

## Numeracy and Data Analysis Related to Weather


For any city of your choice, gather humidity data for a period of 10 consecutive days; the data can be gathered from online sources. Once the collection is complete, prepare a report that takes the following into account:

- Data should be arranged in an appropriate format
- Data should be presented using two types of charts of your choice. For example, line chart, bar chart
- Calculate the following and highlight the final value:
- standard deviation
- How m is calculated, writing down each and every step
- How c is calculated, writing down each and every step
- Using the values of m and c, predict the humidity for days 15 and 20

The report should be approximately 1000 words.

Purpose of this assessment: The purpose of this unit is to provide students with an understanding of how management information and decision-making are enhanced by the application of statistical methods. Students can show how they can inform management thinking and will learn about a range of

## Numeracy and Data Analysis: Calculation of Mean, Mode, Median, Range, and Standard Deviation

