The "center" of a data set is also a way of describing its location. The two most widely used measures of the "center" of the data are the mean (average) and the median. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.
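The effect of an outlier on the two measures can be seen in a short Python sketch. The weights below are hypothetical values chosen only for illustration; they do not come from the text. The sketch uses the `mean` and `median` functions from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical weights in pounds (illustrative values, not from the text).
weights = [150, 155, 160, 165, 170]

# The same list with the largest weight replaced by one extreme value.
weights_with_outlier = [150, 155, 160, 165, 400]

# Without the outlier, the mean and the median agree.
print(mean(weights), median(weights))

# With the outlier, the mean is pulled upward but the median does not move.
print(mean(weights_with_outlier), median(weights_with_outlier))
```

In this sketch the mean rises from 160 to 206 when the outlier is introduced, while the median stays at 160.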
The letter used to represent the sample mean is an \(x\) with a bar over it (pronounced “\(x\) bar”): \(\overline{x}\). The Greek letter \(\mu\) (pronounced "mew") represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.
You can quickly find the location of the median by using the expression
\[\frac{n+1}{2}\]
The letter \(n\) is the total number of data values in the sample. If \(n\) is an odd number, the median is the middle value of the ordered data (ordered smallest to largest). If \(n\) is an even number, the median is equal to the average of the two middle values after the data has been ordered. For example, if the total number of data values is 97, then
\[\frac{n+1}{2} = \frac{97+1}{2} = 49\]
The median is the 49th value in the ordered data. If the total number of data values is 100, then
\[\frac{n+1}{2} = \frac{100+1}{2} = 50.5\]
The median occurs midway between the 50th and 51st values. The location of the median and the value of the median are not the same. The upper case letter \(M\) is often used to represent the median. The next example illustrates the location of the median and the value of the median.
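The location rule can be sketched as a small Python function. The name `median_location` is a hypothetical helper introduced here for illustration; it returns the 1-based position of the median in the ordered data, not the median's value.

```python
def median_location(n):
    """Return the 1-based position of the median among n ordered values."""
    return (n + 1) / 2


# Odd n: the location is a whole number, so the median is a single data value.
print(median_location(97))   # 49.0

# Even n: the location ends in .5, so the median lies midway between two values.
print(median_location(100))  # 50.5
```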
The following dataset is in order from smallest to largest:
3; 4; 8; 8; 10; 11; 12; 13; 14; 15
Calculate the mean and the median.
Answer
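The calculation for the mean is:
\[\overline{x} = \frac{3 + 4 + 8 + 8 + 10 + 11 + 12 + 13 + 14 + 15}{10} = \frac{98}{10} = 9.8\]
To find the median, \(M\), first use the formula for the location. The location is:
\[\frac{n+1}{2} = \frac{10+1}{2} = 5.5\]
Starting at the smallest value, the median is located between the 5th and 6th values (the 10 and 11):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15
\[M = \frac{10 + 11}{2} = 10.5\]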
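The same steps can be followed in Python as a check. This is a minimal sketch that assumes the data are already ordered; the variable names are illustrative.

```python
data = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15]  # already ordered smallest to largest

n = len(data)
mean = sum(data) / n                 # 98 / 10

location = (n + 1) / 2               # 1-based location of the median
if location.is_integer():
    # Odd n: the median is the value at that position.
    median = data[int(location) - 1]
else:
    # Even n: average the two values on either side of the location.
    lower = data[int(location) - 1]  # 5th value
    upper = data[int(location)]      # 6th value
    median = (lower + upper) / 2

print(mean, location, median)        # 9.8 5.5 10.5
```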
The following dataset is ordered from smallest to largest. Calculate the mean and median.
3; 4; 5; 7; 7; 7; 7; 8; 8; 9; 9; 10; 10; 10; 10; 10; 11; 12; 12; 13; 14; 14; 15; 15; 17; 17; 18; 19; 19; 19; 21; 21; 22; 22; 23; 24; 24; 24; 24
Answer
Mean: \(3 + 4 + 5 + 7 + 7 + 7 + 7 + 8 + 8 + 9 + 9 + 10 + 10 + 10 + 10 + 10 + 11 + 12 + 12 + 13 + 14 + 14 + 15 + 15 + 17 + 17 + 18 + 19 + 19 + 19 + 21 + 21 + 22 + 22 + 23 + 24 + 24 + 24 + 24 = 544\), so \(\overline{x} = \frac{544}{39} \approx 13.95\)
Median: There are 39 values, so the location of the median is \(\frac{39+1}{2} = 20\). Starting at the smallest value, the median is the 20th term, which is 13.
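For a longer list like this one, the arithmetic is easy to check with Python's standard `statistics` module. The sketch below simply re-enters the 39 values from the exercise.

```python
from statistics import mean, median

data = [3, 4, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        13, 14, 14, 15, 15, 17, 17, 18, 19, 19, 19, 21, 21, 22, 22, 23,
        24, 24, 24, 24]

print(len(data))             # 39 values
print(sum(data))             # 544
print(round(mean(data), 2))  # 13.95
print(median(data))          # 13, the 20th ordered value
```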
Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. If there are no repeats in a dataset, meaning each value occurs exactly one time, there is no mode.
Statistics exam scores for 20 students are as follows. Find the mode.
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Answer
The most frequent score is 72, which occurs five times. Mode = 72.
The numbers of books checked out from the library by 25 students are as follows. Find the mode.
0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 8; 10; 10; 11; 11; 12; 12
Answer
There is a tie for the most frequent value: 7 and 8 both occur four times. Mode = 7 and 8.
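Both mode examples can be checked with `multimode` from Python's standard `statistics` module (available in Python 3.8 and later), which returns every value tied for the highest frequency. This is a sketch using the two data sets above; the variable names are illustrative.

```python
from statistics import multimode

scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83,
          84, 84, 84, 90, 93]
books = [0, 0, 0, 1, 2, 3, 3, 4, 4, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8,
         10, 10, 11, 11, 12, 12]

print(multimode(scores))  # [72]
print(multimode(books))   # [7, 8]
```

Note one difference in convention: when no value repeats, `multimode` returns every value in the list, whereas the definition used here says such a data set has no mode.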
One characteristic of a data set that no measure of center captures is the variation of the data within the set. To describe the variation quantitatively, we use measures of variation or measures of spread. Just as there are several different measures of center, there are also several different measures of variation. In this section, we examine two of the most frequently used measures of variation: the range and standard deviation.
The range of a data set is the difference between the maximum (largest) and minimum (smallest) observations.
Find the range of the data:
8; 12; 13; 11; 10; 9; 14; 8; 6; 14; 7; 8; 13
Solution
The range of the data is the difference between the largest and the smallest values in the data set: \(14 - 6 = 8\).
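The same calculation in Python, as a minimal sketch using the built-in `max` and `min` functions (the name `data_range` is illustrative):

```python
data = [8, 12, 13, 11, 10, 9, 14, 8, 6, 14, 7, 8, 13]

# Range = largest observation minus smallest observation.
data_range = max(data) - min(data)

print(max(data), min(data), data_range)  # 14 6 8
```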
The range only measures the total variation and doesn't capture any variation between the minimum and maximum observed values. In contrast to the range, the standard deviation takes into account all the observations. It is the preferred measure of variation when the mean is used as the measure of center. Roughly speaking, the standard deviation measures variation by indicating how far, on average, the observations are from the mean. For a data set with a large amount of variation, the observations will, on average, be far from the mean; so the standard deviation will be large. For a data set with a small amount of variation, the observations will, on average, be close to the mean; so the standard deviation will be small.
If \(x\) is a number, then the difference "\(x\) – mean" is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is \(x - \mu\). For sample data, in symbols a deviation is \(x - \bar{x}\).
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar, but not identical. Therefore, the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter \(s\) represents the sample standard deviation and the Greek letter \(\sigma\) (sigma, lower case) represents the population standard deviation. If the sample has the same characteristics as the population, then \(s\) should be a good estimate of \(\sigma\).
To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the squares of the deviations (the \(x - \bar{x}\) values for a sample, or the \(x - \mu\) values for a population). The symbol \(\sigma^2\) represents the population variance; the population standard deviation \(\sigma\) is the square root of the population variance. The symbol \(s^2\) represents the sample variance; the sample standard deviation \(s\) is the square root of the sample variance. You can think of the standard deviation as a special average of the deviations.
If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by \(N\), the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by \(n - 1\), one less than the number of items in the sample.
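Written out in symbols, the two cases described above take the following standard forms, where \(\sum\) denotes a sum over all the data values. For a population of \(N\) values:
\[\sigma^2 = \frac{\sum (x - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}\]
For a sample of \(n\) values:
\[s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\]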
For the sample standard deviation, the denominator is \(n - 1\), that is, one less than the sample size.
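The difference between the two denominators can be seen by working through the procedure in Python. This sketch reuses the ten-value data set from the earlier mean and median example (3; 4; 8; 8; 10; 11; 12; 13; 14; 15) and computes the result both ways purely for comparison; which one applies depends on whether the values are treated as a sample or as a population. The final line cross-checks against `stdev` and `pstdev` from the standard `statistics` module.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15]
n = len(data)
mean = sum(data) / n

# One squared deviation (x - mean)^2 per data value.
squared_deviations = [(x - mean) ** 2 for x in data]
total = sum(squared_deviations)

# Treating the data as a sample: divide by n - 1.
sample_sd = sqrt(total / (n - 1))

# Treating the data as a whole population: divide by n.
population_sd = sqrt(total / n)

print(round(sample_sd, 2), round(population_sd, 2))      # 4.05 3.84

# Cross-check against the standard library implementations.
print(round(stdev(data), 2), round(pstdev(data), 2))
```

Because the sample formula divides by a smaller number, the sample standard deviation is always slightly larger than the population value computed from the same numbers.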