With most of the world still in the grips of the COVID-19 pandemic, much attention is now focused on the development of a safe and effective vaccine.

6 October 2020

Never in our memories have clinical trials received so many column inches, and the representativeness of sample data is top of mind for many.

“A representative sample is a subset of a population that seeks to accurately reflect the characteristics of the larger group.”

Investopedia

In many applications of data science, it is advantageous to have sample data that is representative of the population you are modelling. However, this is not always possible, and techniques have been developed to derive information from different data sets. The requirements of the data will depend on the specific questions to be answered. For example, the data needed to identify rare side effects of a drug may be very different to the data sample needed to assess the average efficacy of a vaccine.

“The selection of people from whom to collect input depends upon the specific questions and issues to be addressed… Thus, the selection process starts by considering the research question: what are the specific objectives to be addressed by collecting patient input?”

US Food and Drug Administration (FDA) guidance note, June 2020

Why is representation a challenge?

Given the choice, researchers would generally collect perfectly representative data samples to study. Sometimes, however, such samples are not available:

Medical trials seeking to understand a specific condition often need to exclude subjects who suffer from other medical conditions. This can result in a reduced pool of older subjects as co-morbidities become more common with age.
For testing the efficacy of coronavirus vaccines, study participants with high exposure to the virus are not available in countries where case numbers are very low, leading many phase II and phase III trials to take place in countries with high numbers of infections.
Pension plans trying to predict life spans of their participants traditionally look at mortality data for their historical pensioner population. This population may be very different from that now working for the sponsor, resulting in experience data not representative of the future population.

Other times, the very process of selecting a data sample will result in biases in the resulting sample:

Many psychology studies advertise for subjects solely amongst university students.
Language barriers and cultural differences may result in certain minority groups not responding to clinical trial appeals.
Until the 1990s, women of childbearing age were often excluded from clinical trials over concerns about side effects on fetuses, like those observed with the drug thalidomide in the ’50s and ’60s.

Getting the most out of available data

Accepting these limitations on sampling, techniques have been developed to help control for unrepresentative data: namely, studying subgroups that are likely to have different results in a study. In a medical trial, these would be groups that exhibit physiological differences that could result in different responses to a treatment or drug. In the case of COVID-19, early differences in how the disease affected people might suggest grouping by gender, age and ethnicity.

To ensure that medical studies looking at subgroups of a general population are comparable, the FDA has developed standard terminology for age, gender, and ethnicity groups. The UK’s National Institute for Health and Care Excellence (NICE) has set out guidance for using subgroups in analyzing patient data. In particular, NICE advises that subgroups should be defined before any study takes place to increase the credibility of the subgroup analysis.

When subgrouping techniques are used, the data sample does not need to be representative of the ultimate population. However, the subgroups need to provide the building blocks to understand the ultimate population, and there must be enough data in each subgroup of the sample to provide credible results.
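The building-block idea can be sketched as a simple reweighting calculation. The example below is a minimal illustration only, assuming hypothetical subgroup names, efficacy estimates and population shares (none of these figures come from a real trial): each subgroup's result is weighted by its share of the ultimate population rather than by its share of the sample.

```python
def reweight(subgroup_estimates, shares):
    """Combine subgroup estimates using a given population mix.

    subgroup_estimates: estimate per subgroup (e.g. observed efficacy)
    shares: weight of each subgroup in the population of interest
    """
    missing = set(shares) - set(subgroup_estimates)
    if missing:
        # A subgroup with no sample data cannot act as a building block.
        raise ValueError(f"no sample data for subgroups: {sorted(missing)}")
    total = sum(shares.values())
    return sum(subgroup_estimates[g] * w / total for g, w in shares.items())


# Hypothetical figures, for illustration only.
sample_estimates = {"under_65": 0.90, "65_plus": 0.70}
sample_shares = {"under_65": 0.85, "65_plus": 0.15}      # mix in the trial
population_shares = {"under_65": 0.60, "65_plus": 0.40}  # mix we care about

naive = reweight(sample_estimates, sample_shares)
adjusted = reweight(sample_estimates, population_shares)
print(f"sample-weighted: {naive:.3f}, population-weighted: {adjusted:.3f}")
```

In this toy case the trial under-samples the older subgroup, so the sample-weighted figure overstates the result for the ultimate population. The sketch says nothing about whether each subgroup has enough data to be credible, which would need a separate check.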

Applications in the world of longevity modelling

As computing power has increased, the tools to understand longevity (or on the other side of the coin, mortality) patterns in populations have developed.

The traditional approach to understanding current longevity for a given population is to create a life table, which combines data from a large sample to give probabilities of survival at different ages (and usually by gender). One of the earliest published life tables was developed by Edmond Halley in 1693. The life table probabilities can be pieced together to calculate life expectancy for people at different ages. However, using a single life table to predict life expectancy requires the underlying data to be representative of your population.
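As a rough illustration of how life table probabilities are pieced together, the sketch below computes a curtate life expectancy (the expected number of complete future years lived) from a list of one-year death probabilities. The table values and the function name are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def curtate_life_expectancy(q, start=0):
    """Expected number of complete future years lived from position `start`.

    q[i] is the probability of dying within the year at the i-th age in the
    table. The chance of surviving k more years is the product of the first
    k one-year survival probabilities; summing those gives the expectancy.
    """
    survival = 1.0
    expectancy = 0.0
    for qx in q[start:]:
        survival *= 1.0 - qx    # probability of surviving one more year
        expectancy += survival  # each surviving year adds to the total
    return expectancy


# Hypothetical toy table for ages 90 to 94; everyone is assumed to die by 95.
q = [0.20, 0.25, 0.50, 0.50, 1.00]

print(round(curtate_life_expectancy(q), 4))           # from age 90
print(round(curtate_life_expectancy(q, start=2), 4))  # from age 92
```

The same table gives a different life expectancy at each starting age, which is why a single table only helps if its underlying probabilities suit the population being studied.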

The first step in addressing this representation issue is to group the data in the underlying sample. It is now common for standard mortality tables to be developed for different groups, such as blue- or white-collar workers, those with high or low affluence, and specific professions. This gives some extra precision for pension plans and insurers trying to predict longevity for a specific population. However, the number of groups is limited by the size of the original data set, and this approach still relies on the grouped sample data being representative of the population. For example, a blue-collar table is only relevant for a blue-collar pension plan if the blue-collar data sample is representative of the plan’s population across other dimensions that impact longevity, such as income and lifestyle.

To deal with this problem, Club Vita use multivariate analysis to look at multiple longevity characteristics (such as affluence, address, type of work, etc.) at once and develop a suite of different base tables that capture longevity for large combinations of underlying characteristics. Instead of splitting the data into ever smaller groups, we can analyze the incremental effects of varying specific characteristics within the same model.

This results in the ability to capture a longevity assumption appropriate for each individual within a population based on their characteristics. An appropriate assumption for any population can then be created by piecing together the assumptions for the individuals that make up that population, removing the need for representativeness in the original data sample. This is the approach we use to generate our VitaCurves model.
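The general idea of building a population assumption from individual assumptions can be sketched as follows. This is not the VitaCurves methodology, which is not detailed here; it is a minimal illustration assuming a hypothetical baseline death probability and hypothetical log-scale effects for two characteristics, with every name and number invented for the example.

```python
import math

# Hypothetical baseline one-year death probability at a single age.
BASE_Q = 0.02

# Hypothetical incremental effects on the log scale, as might be estimated
# within one model rather than by splitting the data into separate groups.
EFFECTS = {
    "affluence": {"low": 0.30, "high": -0.30},
    "occupation": {"manual": 0.20, "non_manual": 0.00},
}


def individual_q(person):
    """One-year death probability for one person, given their characteristics."""
    log_adjustment = sum(EFFECTS[c][person[c]] for c in EFFECTS)
    return min(1.0, BASE_Q * math.exp(log_adjustment))


def population_q(people):
    """Average death probability across the individuals in a population."""
    return sum(individual_q(p) for p in people) / len(people)


# A hypothetical plan whose membership mix differs from any sample mix.
plan = [
    {"affluence": "low", "occupation": "manual"},
    {"affluence": "high", "occupation": "non_manual"},
    {"affluence": "high", "occupation": "non_manual"},
]

print(f"plan-level assumption: {population_q(plan):.5f}")
```

Because the assumption is assembled person by person, changing the plan's membership changes the result without any need to refit the model or to find a sample with the same mix.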
