Chapter 20 - Comparing means using paired data
In-class activity
Researchers want to estimate the difference between the true average weight of adult standard poodles and that of adult miniature poodles. (Define adult as 4 years old.)
They sample 40 newborn standard poodle puppies and 40 newborn miniature poodle puppies. They then randomly select 40 typical dog owners and give each owner a pair of puppies (1 standard and 1 miniature). The researchers weigh the dogs when they are 4 years old.
The data
Here is what the data looks like. The measurements are in lbs:
Exploring correlation
Typically, when we collect paired data, we hope the two measurements within each pair will be positively correlated.
This seems reasonable in our case, where the pairs are linked by “owner”. An owner who tends to over-feed is more likely to have heavier dogs. Another factor might be the amount of exercise an owner gives the dogs.
Let’s check the sample correlation of weight measurements on standard and mini poodles in our paired data.
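As a rough sketch of this check, the code below computes the sample correlation with `cor()` on a simulated stand-in for the poodle data. The data frame `sim_poodles`, its column names `standard` and `mini`, and every number used to generate it are assumptions made for illustration only; they are not the real `poodles_weight` values.

```r
# Simulated stand-in data (NOT the real poodles_weight object).
# Column names `standard` and `mini` are hypothetical.
set.seed(1)
n <- 40
owner_effect <- rnorm(n, mean = 0, sd = 4)   # shared by both dogs of the same owner

sim_poodles <- data.frame(
  standard = 55 + owner_effect + rnorm(n, sd = 3),
  mini     = 14 + 0.5 * owner_effect + rnorm(n, sd = 1.5)
)

# Sample (Pearson) correlation between the two paired measurements
r <- cor(sim_poodles$standard, sim_poodles$mini)
r
```

To run the same check on the real data, swap in the actual table and column names.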
Your turn
It’s your turn to do the data analysis. Recall the first step in analyzing paired data. The data is stored in the poodles_weight object.
If you need a hint for wrangling data, click “hints”.
You can find the difference in paired data stored in two columns by:
table$col_name1 - table$col_name2
Replace “table”, “col_name1”, and “col_name2” with the actual names of the data table and its columns.
Answer the following questions on paper. We’ll mark it together in-class!
a) Do the data come from independent or dependent samples? Explain.
b) Let \(\mu_1\) denote the population average weight of adult standard poodles and \(\mu_2\) denote the population average weight of adult mini poodles. What statistic do we use to estimate \(\mu_1 - \mu_2\)?
c) What is the sampling distribution of the statistic in part b)?
d) What are the assumptions/conditions required for part c)?
e) Find a 90% confidence interval for the difference in true mean weight between the two populations.
f) A 100-year-old encyclopedia noted that the difference in average weights is 30 lbs. However, researchers believe that the weight of miniature poodles has stayed close to its historic level while the weight of standard poodles has decreased over time. Write down appropriate null and alternative hypotheses.
g) Calculate the p-value for the test in part f), and state the conclusion of the hypothesis test at a 2% significance level.
h) What are the assumptions/conditions required for part g)?
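If you want to check your paper calculations with software, the sketch below shows one possible workflow on simulated data: reduce each pair to a difference, then treat the differences as a single sample. The simulated numbers, the null value `mu0`, and the direction `alt` are placeholders assumed for illustration; choose your own to match the hypotheses you wrote down.

```r
# Simulated stand-in data; all numbers are assumptions for illustration.
set.seed(2)
n <- 40
owner_effect <- rnorm(n, sd = 4)
standard <- 55 + owner_effect + rnorm(n, sd = 3)
mini     <- 14 + 0.5 * owner_effect + rnorm(n, sd = 1.5)

# Step 1: one difference per pair
d <- standard - mini

# Step 2: 90% confidence interval from the one-sample t formula
d_bar  <- mean(d)
se_d   <- sd(d) / sqrt(n)
t_crit <- qt(0.95, df = n - 1)                 # 5% in each tail
ci_manual <- c(d_bar - t_crit * se_d, d_bar + t_crit * se_d)

# The same interval via t.test() applied to the differences
ci_ttest <- t.test(d, conf.level = 0.90)$conf.int

# Step 3: p-value for a one-sample t test on the differences
mu0 <- 40        # placeholder null value, not taken from the activity
alt <- "less"    # placeholder; one of "two.sided", "less", "greater"
t_stat   <- (d_bar - mu0) / se_d
p_manual <- pt(t_stat, df = n - 1)             # lower tail matches alt = "less"
p_ttest  <- t.test(d, mu = mu0, alternative = alt)$p.value

ci_manual
p_manual
```

The manual formulas and `t.test()` should agree; computing both is a handy way to catch arithmetic slips.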
Extra optional material
Let us illustrate, via a simulation study, why you need to use the correct approach for paired data: otherwise, confidence intervals and hypothesis tests won’t behave as intended in repeated sampling.
Let \(\delta := \mu_1-\mu_2\) denote the parameter of interest, i.e. (population mean weight of standard poodles) minus (population mean weight of mini poodles).
Let \(\bar{D}\) denote the sample mean of the differences, which we use to estimate \(\delta\).
Pop-quiz: when using a paired sample, what is the correct formula for \(SE(\bar{D})\)? Type the numerical value of \(SE(\bar{D})\) for the poodles_weight dataset in the box below to check you’ve got the right answer.
What if we use the wrong method of analysis?
Suppose we accidentally forget that we have paired data and use the analysis method for independent samples. What happens? We would use \(\bar{Y}_1-\bar{Y}_2\) to estimate \(\delta\), with \(SE(\bar{Y}_1-\bar{Y}_2)\) calculated via the formula \(\sqrt{\dfrac{s^2_{1} }{n_1} + \dfrac{s_{2}^2 }{n_2}}\) (an incorrect approach, since we have paired data!).
Calculate the numerical value of \(SE(\bar{Y}_1-\bar{Y}_2)\) using the formula intended for independent samples in the coding block below.
Type the numerical value of \(SE(\bar{Y}_1-\bar{Y}_2)\), computed with the formula intended for independent samples, below to check your answer:
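For reference, here is a minimal sketch of how the independent-samples formula could be coded, next to the standard error computed from the paired differences. It runs on simulated stand-in data (all numbers are assumptions for illustration), so its output is not the answer for poodles_weight.

```r
# Simulated stand-in data; all numbers are assumptions for illustration.
set.seed(3)
n <- 40
owner_effect <- rnorm(n, sd = 4)
standard <- 55 + owner_effect + rnorm(n, sd = 3)
mini     <- 14 + 0.5 * owner_effect + rnorm(n, sd = 1.5)

# Standard error based on the paired differences
se_paired <- sd(standard - mini) / sqrt(n)

# Standard error from the formula intended for independent samples
# (here n_1 = n_2 = n)
se_indep <- sqrt(var(standard) / n + var(mini) / n)

c(se_paired = se_paired, se_indep = se_indep)
```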
Fluke or anticipated behaviour?
It is a mathematical fact that the point estimate of \(\mu_1-\mu_2\) will be the same regardless of whether you apply the correct analysis for paired data or the incorrect one intended for independent samples (i.e. \(\bar{d} = \bar{y}_1 - \bar{y}_2\)).
However, when you compare the correct SE with the incorrect SE, which one is larger? Let’s try redoing our fictional study on poodles B=1000 times:
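A repeated-sampling study of this kind might be sketched as follows. The data-generating model (a shared owner effect plus independent noise) and all of its numbers are assumptions for illustration, and `simulate_once()` is a hypothetical helper, not the original simulation code.

```r
set.seed(4)
B <- 1000   # number of repeated studies
n <- 40     # pairs per study

# One fictional study: returns both standard errors and the sample correlation
simulate_once <- function(n) {
  owner_effect <- rnorm(n, sd = 4)                       # shared within a pair
  standard <- 55 + owner_effect + rnorm(n, sd = 3)
  mini     <- 14 + 0.5 * owner_effect + rnorm(n, sd = 1.5)
  c(se_paired = sd(standard - mini) / sqrt(n),
    se_indep  = sqrt(var(standard) / n + var(mini) / n),
    r         = cor(standard, mini))
}

# replicate() returns a 3 x B matrix; transpose so each row is one study
results <- as.data.frame(t(replicate(B, simulate_once(n))))

# Proportion of studies where the paired SE is the smaller one
mean(results$se_paired < results$se_indep)

# Proportion of studies with a positive sample correlation
mean(results$r > 0)
```

Comparing the two proportions shows how closely the ordering of the two SEs tracks the sign of the sample correlation.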
This shows that the behaviour we noticed (i.e. the SE from the correct analysis for paired data is smaller than the SE from incorrectly applying the method for independent samples to paired data) is not a fluke. It happens whenever the paired data have a positive sample correlation.
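To see why, here is a short derivation. The notation is assumed here for illustration: \(s_d\) is the sample standard deviation of the differences, \(r\) is the sample correlation between the paired measurements, and \(n\) is the number of pairs (so \(n_1 = n_2 = n\)). The sample variance of the differences satisfies

\[ s_d^2 = s_1^2 + s_2^2 - 2 r s_1 s_2, \]

so dividing by \(n\) gives

\[ SE(\bar{D})^2 = \frac{s_d^2}{n} = \frac{s_1^2}{n} + \frac{s_2^2}{n} - \frac{2 r s_1 s_2}{n} = SE(\bar{Y}_1-\bar{Y}_2)^2 - \frac{2 r s_1 s_2}{n}. \]

Since \(s_1\) and \(s_2\) are positive, the SE from the paired analysis is smaller than the SE from the independent-samples formula exactly when \(r > 0\).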
As a result, will a 90% confidence interval constructed using the wrong method be needlessly long, or too short?
Final thoughts
When measurements on paired subjects truly have a moderate/strong positive relationship, the data would nearly always show a positive sample correlation if the study were repeated.
When this is the case, if we use the method for independent data to analyze paired data, will the probability of the confidence interval capturing the true \((\mu_1-\mu_2)\) be noticeably greater or less than intended?
As for hypothesis testing, will P(Type I error) be noticeably greater or less than intended?
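One way to explore the confidence-interval question empirically is to estimate coverage by simulation. This is a sketch under an assumed data-generating model, not a definitive study. The model numbers and the true difference `delta` are made up for illustration. The incorrect analysis is carried out with `t.test()` on the two columns, which by default uses the unpooled standard error from the independent-samples formula.

```r
set.seed(5)
B <- 2000          # number of repeated studies
n <- 40            # pairs per study
delta <- 55 - 14   # true mu_1 - mu_2 under the assumed model

covers <- replicate(B, {
  owner_effect <- rnorm(n, sd = 4)
  standard <- 55 + owner_effect + rnorm(n, sd = 3)
  mini     <- 14 + 0.5 * owner_effect + rnorm(n, sd = 1.5)

  # Correct: one-sample t interval on the paired differences
  ci_paired <- t.test(standard - mini, conf.level = 0.90)$conf.int
  # Incorrect: two-sample interval that ignores the pairing
  ci_indep  <- t.test(standard, mini, conf.level = 0.90)$conf.int

  c(paired = ci_paired[1] <= delta && delta <= ci_paired[2],
    indep  = ci_indep[1]  <= delta && delta <= ci_indep[2])
})

# Estimated coverage of the nominal 90% interval under each method
coverage <- rowMeans(covers)
coverage
```

Compare each estimated coverage with the nominal 90% level before answering the questions above.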
© 2024 Vivian Meng – Material Licensed under CC BY-NC-ND 4.0