In this section, we will look at alternative measures of centre (the mean) and spread (the standard deviation) that are only useful when working with symmetric distributions without outliers. While this may seem unnecessarily restrictive, these two measures have the advantage of being able to fully describe the centre and spread of a symmetric distribution with only two numbers.
1. The mean.
The mean is the balance point of a distribution.
A general characteristic of symmetric distributions is that the mean and the median coincide.
Example 1:
The balance point of the distribution is also the point that splits the distribution in half. That is, there are two data points to the left of the mean and two to the right.
The relationship between the mean and the median
Whereas the median lies at the midpoint of a distribution, the mean is the balance point of the distribution. For approximately symmetric distributions, the median and the mean will be approximately equal in value.
NOTE: If a distribution is symmetric, there will be little difference in the value of the mean and median and we can use either. In such circumstances, the mean is often preferred because it is more familiar to most people.
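As a minimal sketch of the note above, the following Python snippet compares the mean and the median of a small symmetric data set. The five values are hypothetical and are not taken from the examples in this section.

```python
from statistics import mean, median

# Hypothetical symmetric data set (five values, evenly spread around 3).
symmetric = [1, 2, 3, 4, 5]

centre_mean = mean(symmetric)      # balance point
centre_median = median(symmetric)  # middle value

# For perfectly symmetric data the two measures coincide.
print("mean:", centre_mean, "median:", centre_median)
```

Here two values lie below the mean and two lie above it, so the balance point and the midpoint are the same number.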
Example 2:
Note that the mean is affected by changing the largest data value (8), but the median is not.
Example 3:
Choosing between the mean and the median
The mean and the median are both measures of the centre of a distribution. If the distribution is:
+ symmetric and there are no outliers, either the mean or the median can be used to indicate the centre of the distribution
+ clearly skewed and/or there are outliers, it is more appropriate to use the median to indicate the centre of the distribution.
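The guideline above can be illustrated with a short Python sketch. The data are hypothetical: the largest value of a symmetric data set is replaced by an outlier to show how each measure of centre responds.

```python
from statistics import mean, median

# Hypothetical symmetric data set.
original = [2, 3, 4, 5, 6, 7, 8]

# Same data, but the largest value (8) is replaced by an outlier (80).
with_outlier = original[:-1] + [80]

print("original:     mean =", mean(original), " median =", median(original))
print("with outlier: mean =", round(mean(with_outlier), 2),
      " median =", median(with_outlier))
```

The median stays at the middle value, while the mean is pulled towards the outlier. This is why the median is the safer choice for skewed data or data with outliers.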
EXERCISE 1:
A group of 195 people were asked to record (to one decimal place) the average number of hours they spent on email each week over a 10 week period. The data are shown in the following histogram:
(a) Find possible values for the median.
(b) Find the maximum value for the IQR.
(c) How can the mean and standard deviation be estimated?
SOLUTION
This is a problem based on a histogram.
+ Total number of people (N): 195
+ Data: Average number of hours spent on email per week.
+ Frequency Distribution (from the histogram):
Conclusion: The median lies in the interval 5.0 to 9.9 hours.
(c) How can the mean and standard deviation be estimated?
Conclusion: The standard deviation is approximately 7.39 hours.
* To learn more about how to apply these formulas, watch the video "Find the Mean, Variance, & Standard Deviation of Frequency Grouped Data Table":
https://youtu.be/NhahVPv6CeM?si=jeyGFHDpPeVVFf5s
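One common way to estimate the mean and standard deviation from a histogram is to treat every value in a class as if it sat at the class midpoint. The sketch below assumes that approach and uses the sample (n - 1) divisor. The class intervals and frequencies are hypothetical: they are not the values from the histogram in this exercise, so the result will not match the 7.39 hours quoted above.

```python
from math import sqrt

# Hypothetical (lower, upper, frequency) classes, for illustration only.
# These are NOT the frequencies from the exercise's histogram.
classes = [(0.0, 4.9, 10), (5.0, 9.9, 20), (10.0, 14.9, 10)]

n = sum(f for _, _, f in classes)                     # total frequency
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]  # class midpoints
freqs = [f for _, _, f in classes]

# Estimated mean: frequency-weighted average of the midpoints.
est_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Estimated sample variance and standard deviation (divisor n - 1).
est_var = sum(f * (m - est_mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
est_sd = sqrt(est_var)

print("estimated mean:", round(est_mean, 2))
print("estimated standard deviation:", round(est_sd, 2))
```

Replacing `classes` with the intervals and frequencies read from the actual histogram gives the estimates for the exercise.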
EXERCISE 2:
2. The standard deviation
- To measure the spread of a data distribution around the median (M) we use the interquartile range (IQR).
- To measure the spread of a data distribution about the mean (x̄) we use the standard deviation (s).
Although not easy to see from the formula, the standard deviation is the square root of an average of the squared deviations of each data value from the mean.
We work with the squared deviations because the sum of the deviations around the mean (the balance point) will always be zero.
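The formula itself is not reproduced in this section. Assuming the usual sample standard deviation with divisor n - 1 (an assumption, although it does reproduce the answer s = 6.67 cm in the heights example), the formula and the zero-sum property of the deviations can be written as:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},
\qquad
\sum_{i=1}^{n} (x_i - \bar{x}) = 0
```

Here the x_i are the data values, x̄ is their mean and n is the number of values.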
Example: The following are the heights (in cm) of a group of women.
176 160 163 157 168 172 173 169
Determine the mean and standard deviation of the women’s heights. Give your answers correct to two decimal places.
Solution:
The mean height of the women is x̄ = 167.25 cm and the standard deviation is s = 6.67 cm.
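The calculation can be checked with a short Python sketch. It assumes the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Heights (in cm) from the example.
heights = [176, 160, 163, 157, 168, 172, 173, 169]

x_bar = mean(heights)

# Deviations from the mean; these always sum to zero.
deviations = [x - x_bar for x in heights]

# Standard deviation from the definition, with divisor n - 1.
s_manual = sqrt(sum(d ** 2 for d in deviations) / (len(heights) - 1))

# The same quantity from the standard library.
s_library = stdev(heights)

print("sum of deviations:", sum(deviations))
print("mean:", round(x_bar, 2))
print("standard deviation:", round(s_manual, 2))
```

Both `s_manual` and `s_library` round to 6.67, which agrees with the stated solution.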
EXERCISES
To determine whether a mean or standard deviation makes sense, we have to look at the type of data each variable represents.
Mean and standard deviation are only meaningful for Quantitative (Numerical) data, where the numbers represent an actual quantity or measurement. They do not make sense for Qualitative (Categorical) data, where numbers are just labels or categories.
It does not make sense to calculate a mean or standard deviation for:
b. Sex
d. Post code
f. Weight (underweight, normal, overweight)
Extra question 1.1: Please explain your answer.
(a) What data type is the variable "Sex"? Why not calculate Mean/SD? Which measure of central tendency for this variable?
Answer: Categorical (Nominal).
- These are descriptive categories (e.g., Male, Female). You cannot mathematically average "Male" and "Female."
- Use Mode.
(b) What data type is the variable "Post code"? Why not calculate Mean/SD? Which measure of central tendency for this variable?
Answer: Categorical (Nominal)
- Although postcodes are numerical, they are labels for locations. Adding two postcodes together and dividing by two results in a meaningless number.
- Use Mode.
(c) What data type is the variable "Weight (underweight, normal, overweight)"? Why not calculate Mean/SD? Which measure of central tendency for this variable?
Answer: Categorical (Ordinal)
- While these categories have an order, they are still labels. You would use frequencies (percentages) or a median here, rather than a mean.
- Use Median or Mode.
Extra question 1.2: Please explain your answer.
(a) What data type is the variable "Speed (in km/h)"? Why calculate Mean/SD?
Answer: Quantitative.
Speed is a measurement. An "average speed" is a common and useful metric.
(b) What data type is the variable "Age (in years)"? Why calculate Mean/SD?
Answer: Quantitative.
Age represents a count of years. Calculating the "average age" of a group is standard practice.
(c) What data type is the variable "Neck circumference (in cm)"? Why calculate Mean/SD?
Answer: Quantitative.
This is a physical measurement in centimetres.
Extra question 1.3: Which Measure to Use?
Answers:
Level of Measurement - Best Measure - Why?
1. Nominal (Labels, no order) - Mode - It is the only measure that identifies the most frequent category.
2. Ordinal (Labels with order) - Median - It finds the middle "rank" without assuming the distance between ranks is equal.
3. Interval/Ratio (Symmetric) - Mean - It uses every data point to find the mathematical centre.
4. Interval/Ratio (Skewed/Outliers) - Median - It isn't "pulled" away by extreme values (like a few millionaires in a salary study).
This video provides a clear visual walkthrough of how the shape of your data distribution dictates whether you should report the mean, median, or mode.
https://youtu.be/SNP6EH27WY4?si=iFCJAQv5_DqDI-XK
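The choice of measure by level of measurement, summarised in the answers above, can be sketched in Python. All data and variable names below are hypothetical and serve only to illustrate the mode for nominal data, the median for ordinal data, and the median for skewed numerical data.

```python
from statistics import median, median_low, mode

# Hypothetical data for each level of measurement.
sex = ["F", "M", "F", "F", "M"]                                  # nominal
weight_order = ["underweight", "normal", "overweight"]           # category order
weight = ["normal", "overweight", "normal", "underweight", "overweight"]  # ordinal
salaries = [40, 45, 50, 55, 1000]                                # skewed, in thousands

# Nominal: only the most frequent category is meaningful.
most_common_sex = mode(sex)

# Ordinal: convert categories to ranks, take the middle rank, map it back.
ranks = [weight_order.index(w) for w in weight]
median_weight = weight_order[median_low(ranks)]

# Skewed numerical data: the median is not pulled by the extreme value.
median_salary = median(salaries)

print(most_common_sex, median_weight, median_salary)
```

`median_low` is used for the ordinal case so that the result is always an actual rank rather than an average of two ranks.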
Q2.