Module 1:2 - Visual Displays of Data


Introduction: Frequency Distribution

Good visual displays convey important information about your observations, bringing life and meaning to the values in your data set. A frequency distribution is a graphical display of how often particular outcomes occur. The frequency distribution graph is one of the most important displays of data; it is often called a histogram, especially when it displays frequencies for continuous data.

Frequency Distribution

Example: Cups of Coffee Consumed Each Day

A list of data values does not convey much meaning on its own. Visualizing the data lets us examine what the numbers tell us.

Frequency Distribution
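The idea of a frequency distribution can also be sketched outside JMP. The Python snippet below is a minimal illustration, not the module's method: it tallies how often each value occurs and prints a text histogram. The ten values are made up for illustration (chosen only so that one person reports 0 cups); they are not the module's data.

```python
from collections import Counter

# Hypothetical responses from 10 people (cups of coffee per day).
# These values are invented for illustration only.
cups_of_coffee = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]

# Count how many people reported each value.
frequencies = Counter(cups_of_coffee)

# Print a text histogram: one asterisk per person.
for cups in range(min(cups_of_coffee), max(cups_of_coffee) + 1):
    count = frequencies[cups]
    print(f"{cups} cups | {'*' * count} ({count})")
```

The raw list and the tally contain the same information, but the tally makes the shape of the data visible at a glance.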

Data organization

Opening the associated data in JMP, we see that the table consists of two columns: person and cup of coffee. It is advisable to make each row one unit of sampling, which in this example is each individual who responded to the survey. Typically, when we take samples, each person represents a unit. Since we observed 10 units in this study, we want 10 rows for the 10 individuals.

Data Table

Another good convention is to have a column that identifies each person or unit; in this example it is labeled persons.

For each individual, a separate column labeled “cup of coffee” records how many cups of coffee that individual reported drinking.
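The one-row-per-unit layout can be sketched in Python as a list of records. This is only an illustration of the convention; the `Response` class, its field names, and the values are hypothetical and do not come from the module's data table.

```python
from dataclasses import dataclass


@dataclass
class Response:
    person: int          # identifier column: which unit this row describes
    cups_of_coffee: int  # measurement column: the value that unit reported


# Hypothetical values for 10 survey respondents (illustration only).
hypothetical_cups = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]

# One row per sampled unit, with IDs 1 through 10.
table = [
    Response(person=i, cups_of_coffee=cups)
    for i, cups in enumerate(hypothetical_cups, start=1)
]

# The identifier makes it easy to exclude a unit later,
# for example if person 7 turned out to have given bad data.
bad_ids = {7}
clean = [row for row in table if row.person not in bad_ids]
print(len(table), len(clean))
```

Keeping the identifier separate from the measurement means rows can be excluded by ID without touching the measured values.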

Frequency Distribution using Distribution Platform

Frequency Distribution

To obtain a frequency distribution plot for the variable “cup of coffee” (the measurement we obtained from these 10 individuals), we use the Distribution platform via Analyze > Distribution. The Distribution platform is helpful whenever we are trying to understand the characteristics of a single variable. In this example, the variable we want to examine is “cup of coffee”, so we assign it to the Y, Columns role. Although the Persons column is in our data set, we won’t use it when visualizing coffee, since it is there only to keep track of the individuals. This is useful down the line: if one of the individuals gives us bad data, we know which row to exclude. Looking at the distribution graph, we can see right away that the distribution is symmetric, meaning that there are as many observations above the center as below it, with matching frequencies on each side.

Changing Bin Sizes Using Grabber

Bins with 2 unit Increments

Looking at the axis, we notice that the bin size is 1 unit: the bins increase in 1-unit increments. To change the bin size, select the grabber tool from the Tools menu. Clicking and dragging the bars to the right increases the bin size. Now that the bins increase in increments of 2 units, notice that there is a count of 1 in the bin from -1 to 1; this is because one individual reported drinking 0 cups of coffee. Be careful: the upper limit of this bin is exclusive, so a value of 1 does not count in this bin.



The next bin goes from 1 to 3. Clicking on a bin selects all the individuals counted in it. From the red triangle drop-down menu, select Histogram Options > Show Counts to display the count for each bin, which corresponds to the bin’s height.



Using the grabber tool and dragging the bars to the left decreases the bin size back to the original. Notice that even though different bin sizes produce different counts, they all display the same information about the data; it is just a matter of how many bins you want in order to capture the differences in frequency in your data.
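The effect of changing the bin size can be sketched in Python. The function below is a hypothetical helper, not part of JMP: it counts values in bins whose lower limit is inclusive and whose upper limit is exclusive, matching the behavior described above where a value of 1 falls in the 1 to 3 bin rather than the -1 to 1 bin. The data values are invented for illustration.

```python
import math
from collections import Counter


def bin_counts(values, width, origin):
    """Count values in half-open bins [lower, lower + width).

    Bins are aligned so that one bin starts exactly at `origin`.
    """
    counts = Counter()
    for value in values:
        k = math.floor((value - origin) / width)  # index of the bin
        lower = origin + k * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical cups-of-coffee values (illustration only).
cups = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]

narrow = bin_counts(cups, width=1, origin=0)   # 1-unit bins
wide = bin_counts(cups, width=2, origin=-1)    # 2-unit bins starting at -1

print(narrow)
print(wide)
```

With these made-up values, the single 0 is the only observation in the -1 to 1 bin, and both binnings account for all 10 observations. Only the grouping changes, not the underlying data.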

Changing Bin Sizes Using Specification Window

With the arrow tool, double-click the y-axis and JMP opens the Y Axis Specification window. This is another way to change the bin size: alter the value of “Increment”. If you enter 2 for “Increment” while keeping the minimum at 0 and the maximum at 10, the bin size increases to 2 units. This time, the lowest bin starts at 0 and goes to 2, which is what we specified in the Y Axis Specification window.

Y Axis Specification Window
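How a minimum, maximum, and increment determine the bins can be sketched as follows. The `bin_edges` function is a hypothetical helper written for this illustration; it is an assumption about the arithmetic involved, not a description of JMP's internals.

```python
def bin_edges(minimum, maximum, increment):
    """Return bin boundaries from minimum up to maximum in steps of increment."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    # Number of whole increments that fit between minimum and maximum.
    steps = int((maximum - minimum) // increment)
    return [minimum + i * increment for i in range(steps + 1)]


# Settings from the example: minimum 0, maximum 10, increment 2.
edges = bin_edges(0, 10, 2)

# Pair consecutive edges to get the bins themselves.
bins = list(zip(edges[:-1], edges[1:]))

print(edges)
print(bins)
```

Because the first edge is the specified minimum, the lowest bin starts at 0 here, whereas dragging with the grabber in the earlier example produced a lowest bin starting at -1.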




Example: SAT by Year

The sample data set “SAT by Year” has 408 rows and 19 columns. Let’s start with something simple: using the Distribution platform (Analyze > Distribution), assign the variables “SAT Verbal” and “SAT Math” to the Y, Columns role. The platform returns two histograms with information from all the states across all the years. If you double-click a single bin in one of the two graphs, JMP creates a new table containing the observations within the selected bin, along with the rest of their information.

New table for SAT Verbal=(580,589)
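Creating a new table from a selected bin amounts to filtering rows by a value range while keeping every column. The Python sketch below illustrates that idea with a handful of invented rows; the state labels, years, and scores are hypothetical, and treating the bin as 580 inclusive to 590 exclusive is an assumption.

```python
# Hypothetical rows; the real "SAT by Year" table has 408 rows and 19 columns.
rows = [
    {"State": "State A", "Year": 2000, "SAT Verbal": 583, "SAT Math": 590},
    {"State": "State B", "Year": 2000, "SAT Verbal": 512, "SAT Math": 520},
    {"State": "State C", "Year": 2001, "SAT Verbal": 589, "SAT Math": 601},
    {"State": "State D", "Year": 2001, "SAT Verbal": 590, "SAT Math": 575},
    {"State": "State E", "Year": 2002, "SAT Verbal": 580, "SAT Math": 566},
]


def rows_in_bin(table, column, lower, upper):
    """Return full rows whose value in `column` lies in [lower, upper)."""
    return [row for row in table if lower <= row[column] < upper]


# Keep every column for the rows that fall in the selected bin.
subset = rows_in_bin(rows, "SAT Verbal", 580, 590)

for row in subset:
    print(row)
```

The subset carries the other columns along, so the selected observations can be examined in full rather than only by their SAT Verbal value.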

Bin Sizes and Analysis

Using the grabber tool, we can increase the bin size by dragging to the right. Notice that as the bins become larger, the display loses fidelity: we can no longer tell where individual observations fall. Dragging to the left makes the bins very small, which gives a more detailed visual representation of the data set.


Bins with large increments
Bins with small increments
Uniform Scaling


How many bins to use (and how big or small to make them) when analyzing a data set is a judgment call. Remember that the reason we visualize data is to tell a story about the observations or measurements we have and to find meaning in them. There is no correct answer that limits you to a certain number of bins of a certain size. Instead, ask yourself what you want to convey, and create the visualization that conveys that message.


The Distribution platform gives us nice visualizations of where observations lie on the scale we observed. Because the two variables here are in the same units, it can be helpful to select Red Triangle Menu > Uniform Scaling. This option puts both variables on the same axis scale (it does not change the bin size). Uniform scales make it easier to compare distributions and to see where observations are. In this example, we can see that the spreads of SAT Verbal and SAT Math are about the same.
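The idea behind a uniform scale can be sketched as computing one axis range that covers every variable being compared. The function and score values below are hypothetical, and whether JMP's Uniform Scaling option computes its range exactly this way is an assumption.

```python
def shared_axis_range(*columns):
    """Return one (low, high) range that covers every column."""
    low = min(min(column) for column in columns)
    high = max(max(column) for column in columns)
    return low, high


# Hypothetical scores (illustration only).
sat_verbal = [480, 510, 545, 583, 590]
sat_math = [470, 520, 560, 575, 601]

# Each variable alone would get its own axis limits...
print(min(sat_verbal), max(sat_verbal))
print(min(sat_math), max(sat_math))

# ...while a uniform scale uses the same limits for both.
axis_min, axis_max = shared_axis_range(sat_verbal, sat_math)
print(axis_min, axis_max)
```

With both histograms drawn against the same limits, a difference in position or spread between the two variables reflects the data rather than a difference in axis scaling.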

Describing the Shape of Distribution

Ways to Describe the Shape

Figure 4.1

Symmetry

  • Symmetric: the distribution has an axis of reflection; that is, it falls equivalently in the positive and negative directions
  • Asymmetric: the distribution does NOT fall equivalently in the positive and negative directions; an asymmetric distribution has skew

Skew

  • Positively skewed = the tail is dragged in the positive direction
  • Negatively skewed = the tail is dragged in the negative direction
  • Obtaining a numeric measure of skewness in JMP: Analyze >> Distribution >> red triangle under Summary Statistics >> Customize Summary Statistics >> turn on Skewness >> JMP returns a numeric value of skewness
  • A larger absolute value of skewness represents a larger skew
  • A symmetric distribution does NOT have skew
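The sign conventions in the list above can be checked with a small Python sketch. The function uses one common sample skewness formula (the adjusted Fisher-Pearson coefficient); that JMP reports exactly this formula is an assumption here, and the data sets are invented for illustration.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness.

    G1 = n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3),
    where s is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


# Hypothetical data sets (illustration only).
symmetric = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]
right_tailed = [1, 1, 1, 2, 2, 3, 4, 9]        # tail toward larger values
left_tailed = [-v for v in right_tailed]       # mirror image of the above

print(sample_skewness(symmetric))      # approximately zero
print(sample_skewness(right_tailed))   # positive
print(sample_skewness(left_tailed))    # negative
```

Mirroring a data set flips the sign of its skewness but not the magnitude, which is why the absolute value is what indicates how strongly skewed a distribution is.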

Example to Work With

SAT score by Year from JMP Module 1.2 Journal

Analyze >> Distribution >> Put “Population” and “ACT Score (2004)” in Y, Columns >> Click “OK”

The Population distribution has a positive skew. The numeric value of skewness is 2.59 (see Figure 4.1).

The ACT Score distribution has a negative skew. The numeric value of skewness is -0.85 (see Figure 4.1).

Graph Builder Basics

Graph Builder

Graph Builder is a drag-and-drop graphing platform.

The variable drop zones are Y, X, Group Y, Group X, Wrap, Overlay, Color, Size, and Map Shape.

Options for graphing visualization include: Scatter plot, Smooth line of fit, Line of Fit, Ellipse, Contour, Line plot, Bar chart, Area, Box Plot, Histogram, Heatmap, Pie chart, Treemap plot, Caption Box, Formula, and Map Shapes.

JMP Demonstration

SAT score by Year from JMP Module 1.2 Journal

Graph >> Graph Builder

When a variable (e.g., SAT Math) is placed on the Y axis and there is no variable on the X axis, jittering occurs (Figure 5.1). Jittering spreads the points out along the X axis so that they are not stacked on top of each other. Jittering is useful because it shows how much data is present at different levels of the axis. To turn off jittering, right-click the graph >> Points >> uncheck Jitter (Figure 5.2).

Figure 5.1
Figure 5.2
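Jittering can be sketched as adding a small random horizontal offset to each point while leaving its value unchanged. The snippet below is a minimal illustration; the uniform random offset, the `width` parameter, and the score values are assumptions, and JMP's actual jitter method may differ.

```python
import random


def jitter_positions(values, width=0.4, seed=0):
    """Return (x, y) pairs with a random x offset in [-width/2, width/2].

    The y values (the data) are left untouched; only the horizontal
    position changes, so that equal values do not overlap.
    """
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    return [(rng.uniform(-width / 2, width / 2), value) for value in values]


# Hypothetical SAT Math scores with repeated values (illustration only).
sat_math = [500, 500, 500, 520, 520, 560]

points = jitter_positions(sat_math)
for x, y in points:
    print(f"x={x:+.3f}  y={y}")
```

Without the offset, the three scores of 500 would be drawn at the same spot and look like a single point; with it, the number of observations at each level becomes visible.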

Put Region on the X axis and SAT Math on the Y axis. The graph shows the distribution of SAT Math scores within each region (Figure 5.3). Changing the element to Histogram gives a nice way to compare the distributions (Figure 5.4). Changing it to Bar is often not preferable, because the height of the bar shows only one dimension of the data (Figure 5.5). Changing it to Box Plot gives a good sense of the spread of the data (Figure 5.6).

Figure 5.3
Figure 5.4
Figure 5.5
Figure 5.6
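Why a bar can hide what a box plot reveals is easy to see with numbers. The Python sketch below groups invented scores by region and compares a single summary value per group (here assumed to be the mean, as the bar height) with a few of the values a box plot conveys. The region labels and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (region, SAT Math score) observations (illustration only).
observations = [
    ("Region A", 450), ("Region A", 550), ("Region A", 650),
    ("Region B", 540), ("Region B", 550), ("Region B", 560),
]

# Group the scores by region, as placing Region on the X axis does.
by_region = defaultdict(list)
for region, score in observations:
    by_region[region].append(score)

summary = {}
for region, scores in by_region.items():
    summary[region] = {
        "bar_height": mean(scores),  # the one number a bar would show
        "min": min(scores),          # spread information a box plot adds
        "median": median(scores),
        "max": max(scores),
    }

for region, stats in summary.items():
    print(region, stats)
```

In this made-up example both regions have the same mean, so their bars would be identical, yet the scores in Region A are spread over a much wider range than those in Region B.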

Put SAT Math on Y and SAT Verbal on X. JMP returns a scatter plot because both are continuous variables (Figure 5.7). In a scatter plot, you can add a line of fit, a smoother, or an ellipse to show more information (Figures 5.8, 5.9, 5.10). For the smoother, you can control the smoothness of the line by changing “Lambda” under “Smoother.” The ellipse shows the bivariate density, that is, where most of the observations are in the plot.

Figure 5.7
Figure 5.8
Figure 5.9
Figure 5.10
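A line of fit through a scatter plot can be sketched with ordinary least squares. The snippet below is a minimal illustration under the assumption that the line is a simple least-squares regression of Y on X; the function name and the score values are hypothetical.

```python
def line_of_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired scores (illustration only).
sat_verbal = [480, 500, 520, 540, 560]  # X
sat_math = [490, 505, 530, 545, 570]    # Y

slope, intercept = line_of_fit(sat_verbal, sat_math)
print(f"SAT Math = {intercept:.1f} + {slope:.2f} * SAT Verbal")
```

A positive slope indicates that higher SAT Verbal scores tend to go with higher SAT Math scores in these made-up data.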

To see whether SAT Math and SAT Verbal scores follow the same trajectory across the years, we can simply add “Year” in Graph Builder. If you put “Year” in Group X, Graph Builder returns a panel for each year in the data set along the X axis, showing the relationship between SAT Math and SAT Verbal in each year (Figure 5.11). If you put “Year” in Group Y, Graph Builder does the same thing along the Y axis (Figure 5.12). If you put “Year” in Wrap, JMP arranges the panels in a small grid, which shows a lot more, especially when there are many levels (Figure 5.13). If you put “Year” in Overlay, JMP shows the relationship for all the years in a single plot, so it looks messier (Figure 5.14).

Figure 5.11
Figure 5.12
Figure 5.13
Figure 5.14
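The difference between Group X, Group Y, Wrap, and Overlay is in how the per-year subsets are arranged. The Python sketch below illustrates the arrangements only; the rows are invented, the helper names are hypothetical, and the three-column grid for Wrap is an arbitrary choice rather than JMP's rule.

```python
from collections import defaultdict

# Hypothetical (year, SAT Verbal, SAT Math) rows (illustration only).
rows = [
    (2000, 500, 510), (2000, 520, 515),
    (2001, 505, 512), (2001, 525, 530),
    (2002, 510, 520),
    (2003, 515, 522),
    (2004, 498, 505),
]


def split_into_panels(table):
    """Group the (verbal, math) points by year, one panel per year."""
    panels = defaultdict(list)
    for year, verbal, math_score in table:
        panels[year].append((verbal, math_score))
    return dict(sorted(panels.items()))


def arrange(keys, columns):
    """Lay panel keys out row by row in a grid with `columns` columns."""
    return [keys[i:i + columns] for i in range(0, len(keys), columns)]


panels = split_into_panels(rows)
years = list(panels)

group_x = arrange(years, columns=len(years))  # one row of panels
group_y = arrange(years, columns=1)           # one column of panels
wrap = arrange(years, columns=3)              # panels wrapped into a grid

# Overlay keeps a single panel and tags each point with its year instead.
overlay = [(year, v, m) for year, pts in panels.items() for v, m in pts]

print(group_x)
print(group_y)
print(wrap)
print(len(overlay), "points in one overlaid panel")
```

The grid layout keeps each panel larger than a single long row or column would, which is why wrapping helps when the grouping variable has many levels.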

Put the variable “State” in Map Shape. JMP returns a US map. Then drop SAT Math in the middle of the graph, and the graph shows the geographic distribution of SAT Math scores as a heat map (Figure 5.15). The other grouping variables are still usable.

Figure 5.15

Click “Start Over” to begin building the graph again.

Clicking the “Done” button finishes the graph. To bring the control panel back, click “Show Control Panel” under the red triangle in “Graph Builder.”