Moving from Descriptive Analyses Into Inference
Objectives: In this notebook, we’ll discuss the Scientific Method as it relates to data-based research. We’ll also discuss ethical analyses and avoiding \(p\)-hacking (also known as data snooping or data dredging).
Exploratory Analysis and Sample Data versus Inference and Populations
In the previous notebook, TidyAnalysesInR, you conducted an exploratory analysis of the penguins_samp1 dataset. You summarized individual variables and looked for associations between groups of variables. Everything you did there, though, reflected only the \(44\) penguins in that particular dataset. These were your sample penguins, and you did a great job describing them – but how can we use those sample penguins to make claims about all penguins?
The Importance of Random Representative Samples
The good news is that, under certain conditions, we can use sample data to make claims at the population level. Those conditions are
- that the sample is randomly selected.
- that the sample is representative of the population.
Even when these conditions are satisfied, we know that repeated samples will result in slightly different summary characteristics, but statistical tools help us to quantify how much these summary characteristics are expected to vary from one sample to the next. Gaining a handle on this sample variation allows us to estimate how far the true population characteristics may fall from our sample characteristics. Understanding this allows us to make assertions (with some degree of confidence) regarding the population we seek to understand.
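To make this sample-to-sample variation concrete, here is a minimal simulation sketch. It assumes you have the palmerpenguins and dplyr packages installed (datasets like penguins_samp1 are drawn from data like these). Each random sample of \(44\) penguins produces a slightly different mean bill length, and the spread of those means is exactly the variation we want to quantify.

```r
# A minimal sketch of sampling variability, assuming the palmerpenguins
# and dplyr packages are installed.
library(palmerpenguins)
library(dplyr)

set.seed(123)  # make the simulation reproducible

complete_penguins <- penguins %>%
  filter(!is.na(bill_length_mm))

# Draw 1000 random samples of 44 penguins; record each sample's mean
sample_means <- replicate(1000, {
  complete_penguins %>%
    slice_sample(n = 44) %>%
    pull(bill_length_mm) %>%
    mean()
})

# The spread of the sample means is the sampling variability;
# its standard deviation approximates the standard error of the mean
summary(sample_means)
sd(sample_means)
```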
There will be more on these statistical tools and using them to build estimates for population characteristics (population parameters) in the next notebook. For now, we’ll highlight some pitfalls and some of the ways statistical inference can go wrong.
What to Watch Out for In Inference
It’s worth highlighting some avoidable pitfalls when conducting statistical inference as well as understanding some issues that are unavoidable. We’ll discuss the avoidable issue of data snooping/data dredging/p-hacking, and we’ll also highlight unavoidable errors that come along with the uncertainty inherent in our samples.
We’ll start with the issue of data snooping below.
Why Can’t We Use penguins_samp1 Anymore?
We’ve already looked at our sample penguin data. We explored that data without stating any hypotheses ahead of time. We searched for relationships within that dataset, and we were successful in finding some. How do we know that those relationships/associations aren’t just noise in the data? Simply put, we don’t!
Unfortunately, we can’t undo this now either. As ethical scientists, we can’t use the penguins_samp1 data for inference tasks because we already know what those data say. We are now in a scenario where the data has informed which questions should be asked, whereas the data should only be used to answer the questions we had before we looked at it. To put it in schooling terms – the data wrote the exam, so the data certainly has an advantage in answering the questions.
For Example: In our analysis from the TidyAnalysesInR notebook, at the end of our exploratory analysis, we asked whether the average penguin bill_length_mm exceeded \(45\)mm. I suggested that you ask this question because the average bill length for a penguin in that dataset was over \(46\)mm. If we test that hypothesis using the data from last time (penguins_samp1), it should be no surprise that we obtain a statistically significant result – we snooped for it.
Uncertainty and Errors in Statistics
Statistics helps us better understand the world around us by quantifying uncertainty. This means that none of the claims we make using statistics can be made with 100% certainty. Perhaps you remember two major tools from inferential statistics – confidence intervals and hypothesis tests.
Confidence Intervals: Provide a range of plausible values within which we expect a population parameter to live. For a 95% confidence interval, we would expect that if 100 independent random samples were collected and a confidence interval for our parameter of interest was built corresponding to each sample, approximately 95 of those confidence intervals would contain the true population parameter.
- We don’t take 100 independent random samples – we take a single one. How do we know that the random sample we’ve obtained isn’t one of those 5% of samples resulting in a confidence interval that doesn’t contain the true population parameter? We don’t – we just know that it is more likely that we’ve obtained an interval that does contain the population parameter we are looking for.
- If the cost of building an interval that misses the population parameter is quite high, then we’ll need to change our confidence level. For example, building 99% confidence intervals, we expect only one out of every 100 random samples to result in a confidence interval that misses the population parameter.
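As a quick illustration of the mechanics, the sketch below builds \(t\)-based confidence intervals in R. The sample here is simulated stand-in data (any numeric sample would do); this is a sketch, not a prescribed workflow.

```r
# Sketch: t-based confidence intervals for a mean, using a hypothetical
# simulated sample of 44 bill lengths (values are not real measurements).
set.seed(456)
bill_lengths <- rnorm(n = 44, mean = 44, sd = 5)  # stand-in sample data

# 95% confidence interval for the population mean
t.test(bill_lengths, conf.level = 0.95)$conf.int

# Raising the confidence level to 99% widens the interval
t.test(bill_lengths, conf.level = 0.99)$conf.int
```

Notice the trade-off: the 99% interval misses the population parameter less often, but it is wider, so each individual interval is less precise.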
Hypothesis Tests: In classical hypothesis testing, we assume a null claim (“there’s nothing to see here”) and test it against an alternative. Our goal is to determine whether our data are compatible with the null claim, or whether the observed data are so unlikely under the assumption that the null claim is true that we should believe the alternative instead. As with confidence intervals, there is no guarantee that we come to the correct decision, even if all of our math is right and our samples are random. There are two types of errors that can be made.
- Type I Error: A Type I error occurs when we reject the null hypothesis, but the null hypothesis correctly describes the reality we live in. The likelihood of such an error is \(\alpha\), the level of significance of the test. That is, for a test done at the 5% level of significance, the likelihood of rejecting a null hypothesis when we shouldn’t have is 5%.
- Type II Error: A Type II error occurs when we fail to reject the null hypothesis, but the null hypothesis does not reflect the reality we live in – the alternative hypothesis is the one compatible with our reality. The likelihood of a Type II error is more difficult to obtain; it depends on the true effect size, the sample size, and \(\alpha\). The probability of a Type II error is one minus the power of the test.
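To see these pieces in code, the sketch below runs a one-sample \(t\)-test of the null hypothesis that mean bill length is \(45\)mm against a two-sided alternative, on simulated stand-in data, and compares the resulting \(p\)-value to \(\alpha\).

```r
# Sketch: a one-sample t-test with H0: mu = 45 vs. HA: mu != 45,
# on simulated stand-in data (not real penguin measurements).
set.seed(789)
bill_lengths <- rnorm(n = 44, mean = 46, sd = 5)

test_result <- t.test(bill_lengths, mu = 45, alternative = "two.sided")
test_result$p.value

# Reject H0 at the 5% level of significance only if p < alpha
alpha <- 0.05
test_result$p.value < alpha
```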
Both Type I and Type II errors can occur in biological datasets. For example, if our sample size is too small, we may commit a Type II error because we did not collect enough data to detect meaningful differences. This is common in ecology, where each data point is labor-intensive to collect.
At the other extreme, very large sample sizes make even tiny differences statistically significant, so we may flag effects that are not biologically meaningful. And if our sample differs systematically from the population, we might detect a difference that is real for our sample but does not hold for the entire population.
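The built-in power.t.test() function makes the sample-size/Type II error trade-off concrete. The effect size (delta) and standard deviation below are hypothetical placeholder values, chosen only to illustrate the pattern.

```r
# Sketch: power of a one-sample t-test to detect a 1mm difference in
# a mean, with hypothetical delta and sd values.
power.t.test(n = 20,  delta = 1, sd = 5, sig.level = 0.05,
             type = "one.sample")$power  # small n: low power, high Type II risk

power.t.test(n = 200, delta = 1, sd = 5, sig.level = 0.05,
             type = "one.sample")$power  # larger n: much higher power
```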
This is why it is important to think about experimental design before collecting data!
Snooping for Errors
Since we’ve identified that our data can mislead us into erroneous conclusions, we should be able to identify more concretely why we can no longer use the penguins_samp1 data. All of that exploratory work has steered us directly toward those 5% of findings that are attributable purely to sampling error (noise in the data)! There are some things we can do to protect against this.
Don’t: Conduct exploratory analyses, find interesting associations in your data, and then run tests (or build confidence intervals) on that same data to confirm those findings. This is data snooping and is unethical statistics.
- We can’t let the data “write the exam and also take the exam”.
Bare Minimum: If you are working with data that has already been collected and you have no opportunity to collect new data, you should randomly split your data into an exploratory set (to explore and generate hypotheses) and a separate test set (to test those hypotheses on). In using this strategy, the test set should remain completely unseen during the exploratory phase. A sketch of such a split appears after this list.
Best (With Current Data): If you have no opportunity to collect new data, refer back to the Bare Minimum above. Otherwise, if you are working with data that has already been collected and you’ll be able to collect more, you can use your existing data to explore and generate hypotheses. Once you are done with this phase, you should register your hypotheses and collect a new (and independent) random sample of observations to test those hypotheses on.
Best (Without Current Data): See the previous bullet point. If you can afford (in time, money, etc.) to collect two independent random samples, you should proceed using that strategy. If you can’t afford to collect two independent random samples and you have theoretical or prior empirical justification for particular hypotheses, you should register those hypotheses, collect your random sample of data, and then test your hypotheses.
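Here is a minimal sketch of the Bare Minimum splitting strategy. The dataset used is a hypothetical stand-in for whatever data you have already collected, and the 50/50 split proportion is just one reasonable choice.

```r
# Sketch: randomly splitting an existing dataset into an exploratory
# set and an unseen test set; the penguins data is a stand-in here.
library(palmerpenguins)
library(dplyr)

set.seed(2024)  # make the split reproducible

dat <- penguins %>%
  mutate(.row_id = row_number())  # an id column makes the split unambiguous

exploratory_set <- dat %>% slice_sample(prop = 0.5)
test_set <- dat %>% anti_join(exploratory_set, by = ".row_id")

# Explore and generate hypotheses with exploratory_set ONLY;
# touch test_set exactly once, at hypothesis-testing time.
```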
Warning: The more hypotheses we test on a single set of data, the greater the likelihood that we obtain at least one result simply due to sampling error. That is, we increase the likelihood of claiming a significant finding when it really is just noise in the data.
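A quick calculation shows how fast this danger compounds. Assuming independent tests each conducted at \(\alpha = 0.05\), the probability of at least one false positive across \(m\) tests is \(1 - (1 - \alpha)^m\).

```r
# Sketch: the family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 5, 10, 20)

fwer <- 1 - (1 - alpha)^m  # P(at least one Type I error across m tests)
round(fwer, 3)
# With 10 tests this is already about 0.40; with 20, about 0.64
```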
Your Task
Over the next several class meetings, you will:
- Select one of the biological datasets available. You can read the description of the dataset and variables, but don’t snoop!
- Following the principles outlined above, decide on a hypothesis that you could test with this dataset. Formulate both a null and alternative hypothesis.
- Design and conduct an analysis of the dataset. Can you reject your null hypothesis?
- Be sure your analysis is well documented and described in your notebook. Envision your notebook as your methodology that you’re sharing with the scientific community.
- Once your analysis is complete, follow the Pull -> Commit -> Push workflow to publish your notebook to your GitHub repository.