Interpreting Group-Level Data for Individual-Level Application

How accurate is your gadget of choice at predicting your individual response to exercise? How close is your watch's V̇O₂max estimate? How close was its predicted race time to your real result? How far apart are your FTP/threshold estimates between a ramp test, a 20-minute test, and a 40-60 minute time trial?

How come, with millions of people giving away their biometric data to tech companies, it's still so difficult to predict individual responses?

In a few words: because lots of data narrows group-level confidence intervals, but does little to narrow individual-level prediction intervals.

Let’s simulate some data and I’ll try to explain what those words mean. Then we can look at real-world lactate data and consider what interpretations we can draw from it.

This is a quick re-write of a Twitter thread I posted in September 2023. Since X née Twitter is more difficult to read these days without an account, and less worth it than ever to create one, it might be worth transferring some of my threads back here to the old blog. We'll see!

There is uncertainty in every measurement. Uncertainty comes from measurement error and ‘real’ variation in whatever we are trying to observe.

If I stick my finger up into the wind, I will have a large measurement error: my finger isn't a very sensitive instrument for measuring wind speed. The wind might also vary a lot from one moment to the next.

If I'm recording my heart rate during exercise, the device I use might have more or less measurement error. Consider a 12-lead EKG measuring electrical activity across the torso and reporting a continuous wave pattern, compared to a chest strap detecting the same signal from only two electrodes and reporting bpm once per second. Then compare that to an optical watch trying to derive HR from changes in light reflectance through the skin between pulses. Measurement error grows as we move from the EKG to the chest strap to the optical watch.

ref: https://militaryhealth.bmj.com/content/early/2023/11/24/military-2023-002541. I have no idea what this paper is about; it just has a nice figure showing the spread of measurement error between chest and wrist HR devices. Via @DrSianAllen.

Then there are all kinds of sources of biological variability that will change my heart rate from one moment to the next, and between different exercise sessions at the same workload. (We tried to quantify this day-to-day variability in HR and other common cycling metrics in a recent paper.)

Those are examples of individual-level uncertainty, but this of course scales up to group-level uncertainty. We (should) know that different athletes can have very different heart rates at the same workload, or even at the same relative intensity. 150 bpm will mean very different things to different people: the HR maintained during steady-state exercise might be 150 bpm in one athlete and 190 bpm in another.


Simulated Example of Power Data

Let’s consider something more obvious: power output or threshold power will be very different between athletes of different fitness, training history, competitive levels, etc.

If we are interested in estimating the mean power output from the population of endurance-trained athletes, we can’t measure everyone. So we take a random sample of individuals from the population.

By taking a random sample, we hope that all of the random variation in all of the possible differences between individuals washes out, and we get a representative estimate of the true mean value of the population.

Simulated data. x-axis is power output, y-axis is the number of observations, for both the sample (observed) and the population (unobserved). The 'population' dataset is simulated to have a mean of 250 W, normally distributed across a range from roughly 100 to 400 W. We randomly sample 20 observations from that population. Our sample group has an observed mean of 242 W.
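Here's a minimal sketch of that simulation in R. The population SD of 50 W is my assumption (it makes ±1.96 SD span roughly 152-348 W, consistent with the figures below); the seed and exact values are arbitrary.

```r
set.seed(1)

# 'True' population: normally distributed power outputs, mean 250 W, SD 50 W
population <- rnorm(1e5, mean = 250, sd = 50)

# Take a random sample of 20 athletes from that population
sample_20 <- sample(population, 20)

mean(sample_20)  # lands near, but rarely exactly at, 250 W
```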

Confidence Intervals

The confidence interval (CI) represents a range of uncertainty around an observed estimate: how close we think our estimate is to the 'true' value. Or, a range of values within which we can be more confident the 'true' value exists, although we cannot be certain where within that interval the value truly lies. The true value might as well be any value within that range.

The CI depends on the number of samples (observations) and the variation between those observations. We assume that the variation of the observed sample represents the 'true' variation in the population. Thus, if the sample is not random and representative of the true population, this assumption is violated and our estimate will not represent the true population.

Simulated data. x-axis is power output, y-axis is the number of observations from the sampled data. Red shaded area is the 95% confidence interval. These data are the same as above and will be used below for our example. The 95% CI suggests we can be confident that the true unobserved population mean is somewhere within 224-261 W, based on the number of samples and the variation we observe among the sampled observations.
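Continuing the sketch above, the 95% CI around the sample mean comes from the standard error of the mean and the t-distribution:

```r
# 95% CI around the sample mean: mean ± t * (SD / sqrt(n))
n  <- length(sample_20)
se <- sd(sample_20) / sqrt(n)           # standard error of the mean
ci <- mean(sample_20) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci                                      # a range like the 224-261 W in the figure

# Equivalently:
t.test(sample_20)$conf.int
```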

The number of observations is also important for estimating the CI around the estimated mean, because more observations give us a better estimate of the true distribution of scores within the population.

A larger sample size should improve the estimate of the mean (the sample mean) toward the true population mean, as well as tighten the CI around that estimate. More observations – more data – helps to improve our confidence in our estimate of the true mean of the population.

Simulated data. x-axis is power output, y-axis is the number of observations from the sampled data. Red shaded area is the 95% confidence interval. More observations reveal more about the distribution of the population. Our estimated mean power output (249 W) has moved toward the real mean of 250 W, and our 95% confidence interval has narrowed to 242-256 W. We can be more confident that the real mean is close to our observed estimate (the sample mean).
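We can watch the CI tighten as the sample grows; a quick loop over increasing sample sizes from the same simulated population:

```r
# CI width shrinks roughly in proportion to 1 / sqrt(n)
for (n in c(20, 100, 1000, 10000)) {
  s  <- sample(population, n)
  ci <- t.test(s)$conf.int
  cat(sprintf("n = %5d: mean = %3.0f W, 95%% CI = %3.0f-%3.0f W\n",
              n, mean(s), ci[1], ci[2]))
}
```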

So it is generally simple to improve an estimate of a group-level (population-level) parameter. More data allows us to be more confident in our estimate and predictions.

But what about predicting any one individual observation from within that population?

Prediction Intervals

The prediction interval (PI) captures the uncertainty of predicting individual observations. Like the CI, the PI represents a range of uncertainty, but at the individual level: a range within which we can expect the next observation to fall, and that observation might fall anywhere within this range.

Probabilistically, if the population is normally distributed (as in our simulated example), then we are more likely to observe values close to the mean; however, we cannot know whether the next observation – the next athlete who walks into our lab to be tested – will be near the mean or an outlier near the margins of the population.

Simulated data. x-axis is power output, y-axis is the number of observations from the sampled data. Red shaded area is 95% confidence interval (CI). Blue shaded area is 95% prediction interval (PI). Roughly, the 95% PI contains 95% of the current observations. The PI will always be wider than the CI when there is variation within the population from which the sample has been taken.
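The PI for the next single observation is built from the full sample SD, not the standard error, which is why it stays wide; a sketch continuing from the sample above:

```r
# 95% PI for the next single observation: mean ± t * SD * sqrt(1 + 1/n)
n <- length(sample_20)
pi_range <- mean(sample_20) +
  c(-1, 1) * qt(0.975, df = n - 1) * sd(sample_20) * sqrt(1 + 1/n)
pi_range  # far wider than the CI: it must cover individual athletes, not the mean
```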

What happens when we scale up our number of observations, like large population-level datasets that our gadgets are collecting?

More data narrows the confidence interval, i.e. improves our confidence and reduces uncertainty about the true population mean value. With gigantic population-scale datasets, our gadgets can be extremely confident about where the population mean falls, with narrow CIs.

However, the prediction interval has remained wide in this simulated dataset. Why?

Simulated data. x-axis is power output, y-axis is the number of observations from the sampled data. Red shaded area is the 95% confidence interval (CI). Blue shaded area is the 95% prediction interval (PI). As n → N, the CI shrinks toward zero, but the PI converges on ±1.96 SD, the range of values within which 95% of the population falls (e.g. ~152-348 W).

Prediction intervals are determined primarily by the magnitude of real variation within the population. So as we sample more observations, we get a better estimate of that real variation, but the variation itself is not reduced.
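Numerically, the contrast looks like this (same simulated population as above; the half-widths are the distance from the mean to the edge of each interval):

```r
# As n grows, the CI half-width collapses toward zero,
# but the PI half-width settles at ~1.96 * SD (about 98 W with SD = 50)
for (n in c(20, 1000, 100000)) {
  s <- sample(population, n, replace = TRUE)
  t_crit  <- qt(0.975, df = n - 1)
  ci_half <- t_crit * sd(s) / sqrt(n)
  pi_half <- t_crit * sd(s) * sqrt(1 + 1/n)
  cat(sprintf("n = %6d: CI half-width = %5.1f W, PI half-width = %5.1f W\n",
              n, ci_half, pi_half))
}
```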

Different athletes really truly do have very different power outputs, and so our PI will always remain wide to reflect the uncertainty we will have for whether the next athlete we sample will have a higher or a lower power output.

Think of it as:
⚫ As the number of samples (n) approaches the population size (N),
🔴 Confidence intervals converge on (→) the true mean value, i.e. the CI shrinks to zero.
🔵 Prediction intervals converge on (→) the real variation within the population, i.e. ±1.96 SDs, the range containing 95% of the population.

If we measured every individual in the population, then our ‘estimated’ mean score would exactly equal the true population mean score, and our range of uncertainty around that estimate would be zero; CI would converge to a range of zero around the mean.

But if we measured every individual except one, we still don’t know where that last individual observation will fall within the population distribution.

Interim TLDR

  • More data improves the accuracy of group-level estimates, such as the mean score
  • More data reduces uncertainty and narrows confidence intervals around a group-level estimate
  • More data does not improve individual estimates (as much), and does not reduce the uncertainty (the prediction interval) around predicting the next individual-level observation (as much)
  • A sample of observations that does not represent a random draw from the population will result in inaccurate group-level and individual-level estimates

In my opinion, we need to consider group- and individual-level uncertainty whenever we are interpreting research or individual athlete training data.


Real Data of Blood Lactate Curves

Here is a real-world example from a dataset where we collected blood lactate ([BLa]) samples. These data came from the same experiment from which we published test-retest reliability values in common cycling metrics.

Briefly, we collected capillary [BLa] from 21 trained female & male cyclists, at the end of each 5-minute work stage during a graded ‘5-1’ cycling assessment to maximal task tolerance.

The [BLa] scores were then plotted as a function of 'relative intensity', i.e. a percent scale of the highest workload attained by each athlete. We then fit a cubic polynomial to each athlete's scores to generate their estimated individual lactate curve.

x-axis is relative exercise intensity, y-axis is [BLa]. Cubic polynomial curves fit to observed data points (not shown) taken at each work stage.
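A sketch of those per-athlete fits in R; `bla_data` and its columns `Subject`, `Intensity`, and `BLa` are hypothetical names, not our actual dataset:

```r
# Fit a cubic polynomial to each athlete's lactate scores
fits <- lapply(split(bla_data, bla_data$Subject), function(d) {
  lm(BLa ~ poly(Intensity, 3), data = d)
})

# Predicted lactate curve for one athlete across the intensity range
grid <- data.frame(Intensity = seq(40, 100, by = 1))
predict(fits[[1]], newdata = grid)
```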

We can see some clear variation between each individual lactate curve. The values near maximal exercise intensity appear to have a wider spread than at lower intensities, but maybe in relative terms the spread is about the same, i.e. the absolute variance is wider simply because the values themselves are larger at higher intensities?
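One quick way to check that hunch is to compare the absolute spread (SD) against the relative spread (coefficient of variation, CV = SD/mean) of [BLa] across intensity bins; a sketch using the same hypothetical `bla_data`:

```r
# Compare absolute (SD) vs relative (CV) spread across intensity bins
bins   <- cut(bla_data$Intensity, breaks = seq(40, 100, by = 10))
abs_sd <- tapply(bla_data$BLa, bins, sd)
cv     <- tapply(bla_data$BLa, bins, function(x) sd(x) / mean(x))
round(rbind(abs_sd, cv), 2)  # similar CV across bins would support the hunch
```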

How do our confidence and prediction intervals look on these continuous data?

If we wanted to estimate the group-level mean [BLa] curve, we can look at the 95% confidence interval around the marginal ensemble (group-level) mean [BLa] curve.

x-axis is relative exercise intensity, y-axis is [BLa]. The highlighted curve is the group estimated marginal trendline, with the shaded area representing the 95% CI from a linear mixed effects model with structure BLa ~ poly(Intensity, 3) + (0 + Intensity | Subject) + (1 | Subject).
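That model can be fit with lme4, and the marginal (group-level) curve with its 95% CI extracted with emmeans; a sketch with the same hypothetical `bla_data`:

```r
library(lme4)
library(emmeans)

# Cubic fixed effect of intensity; random intercepts and intensity slopes by subject
fit <- lmer(
  BLa ~ poly(Intensity, 3) + (0 + Intensity | Subject) + (1 | Subject),
  data = bla_data
)

# Estimated marginal mean [BLa] with 95% CI at selected intensities
emmeans(fit, ~ Intensity, at = list(Intensity = seq(50, 100, by = 10)))
```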

It's a bit visually deceptive: because the line is trending up at higher intensity, the shaded CI range seems smaller there than where the line is more horizontal at lower intensity. In reality, the 95% CI range is slightly wider in absolute terms at 100% than at lower intensities, but of course it is narrower in relative terms compared to the marginal group mean value. The absolute 95% CI range (the top to bottom of the shaded area) is around 1.0 mmol·L⁻¹ at lower intensities and 1.5 mmol·L⁻¹ at higher intensities.

But does the group-level mean [BLa] curve actually have any meaning for us?

It is important to distinguish when we are trying to estimate a group- or population-level mean response from when we are trying to predict a future individual observation.

To predict an individual-level response we can look at the 95% prediction interval.

x-axis is relative exercise intensity, y-axis is [BLa]. The highlighted curve is the group estimated marginal trendline, with shaded areas representing 95% CI and PI, respectively, from the same mixed effects model as above.

The prediction intervals remain wide, at around 4.5 mmol·L⁻¹, across the range of exercise intensity, indicating low confidence in predicting the [BLa] curve for the next athlete we test (given no additional information). The PI would be wider still if we measured only a single observation of [BLa] at peak exercise intensity.
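There's no single built-in PI for mixed models, but a rough sketch of the idea: the PI for a new athlete has to add the between-subject variance components and the residual variance on top of the marginal mean. This is an approximation (covariances ignored, and the intensity coding is assumed), not the exact method behind the figure:

```r
# Approximate 95% PI half-width for a new subject at a given intensity x
vc <- as.data.frame(VarCorr(fit))
x  <- 1.0  # e.g. 100% relative intensity, assuming a 0-1 coding of Intensity

var_new <- vc$vcov[which(vc$var1 == "(Intercept)")] +     # random intercept variance
           vc$vcov[which(vc$var1 == "Intensity")] * x^2 + # random slope variance
           sigma(fit)^2                                   # residual variance
1.96 * sqrt(var_new)  # PI half-width around the marginal mean curve
```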


Between-Individual vs Within-Individual Uncertainty

Now hopefully we understand a bit more about the uncertainty we should expect for group-level and individual-level observations. These are used to estimate and predict values between individuals within a certain population.

What about quantifying uncertainty within a single individual across repeated measurements?

This is the difference between group-level research and individual-level application, i.e. coaching! Group-level research makes single observations from multiple individuals to estimate a population.

Individual-level coaching is about making multiple repeated observations over time from a single individual to predict future observations from that individual. This allows us to have more confidence in making decisions, prescribing training, predicting future observations, etc.

To quantify the within-individual variability we should expect for common cycling metrics between any given session, take a look at our recently published study where we compared heart rate, oxygen uptake (V̇O₂), NIRS muscle oxygenation (SmO₂), and the same [BLa] data I’ve presented here.
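The key within-individual statistics there are the standard error of measurement (SEM) and the minimal detectable change (MDC), which appear again in the TLDR below. A minimal sketch from test-retest data, with hypothetical vectors `trial1` and `trial2` holding the same metric measured twice in each athlete:

```r
# Within-individual uncertainty from test-retest data
diffs <- trial2 - trial1          # difference between the two trials per athlete
sem <- sd(diffs) / sqrt(2)        # standard error of measurement
mdc <- 1.96 * sqrt(2) * sem       # minimal detectable change at the 95% level
c(SEM = sem, MDC = mdc)           # a real change should exceed the MDC
```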

Representative example of muscle oxygen saturation (SmO₂), heart rate (HR), and blood lactate ([BLa]) data measured in the same athlete across two different trials. Metrics without y-axis labels are arbitrarily scaled to peak values.

TLDR

  • Every measurement has uncertainty
  • Consider if we are trying to make group-level estimates from group-level data (confidence intervals), …
  • Or individual-level predictions from group-level data (prediction intervals), …
  • Or individual-level predictions from individual-level repeated measures (SEM & MDC).
  • Understanding the sources and magnitude of uncertainty (our CIs & PIs!) can help us to be more confident – not less confident – in how we are interpreting our data and applying that information for our athletes.
