Administrative data is the term used to describe everyday data about individuals collected by government departments and agencies. Examples include exam results, benefit receipt and National Insurance payments.
Attrition is the discontinued participation of study participants in a longitudinal study. Attrition can reflect a range of factors, from the study participant not being traceable to them choosing not to take part when contacted. Attrition is problematic both because it can lead to bias in the study findings (if the attrition is higher among some groups than others) and because it reduces the size of the sample.
Body mass index is a measure used to assess if an individual is a healthy weight for their height. It is calculated by dividing the individual’s weight by the square of their height, and it is typically expressed in units of kg/m².
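The calculation can be sketched in Python (the function name and the sample measurements are hypothetical, for illustration only):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

# Example: a 70 kg person who is 1.75 m tall
print(round(bmi(70, 1.75), 1))  # 22.9 kg/m²
```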
Cohort studies are concerned with charting the lives of groups of individuals who experience the same life events within a given time period. The best known examples are birth cohort studies, which follow a group of people born in a particular period.
Complete case analysis is the term used to describe a statistical analysis that only includes participants for which we have no missing data on the variables of interest. Participants with any missing data are excluded.
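As a minimal sketch of this idea using pandas (the dataset and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical survey extract with missing values
df = pd.DataFrame({
    "age": [25, 34, None, 41],
    "income": [21000, None, 30000, 45000],
})

# Complete case analysis: keep only participants with no missing data
# on the variables of interest
complete = df.dropna(subset=["age", "income"])
print(len(complete))  # 2 of the 4 participants remain
```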
Conditioning refers to the process whereby participants’ answers to some questions may be influenced by their participation in the study – in other words, their responses are ‘conditioned’ by their being members of a longitudinal study. Examples would include study respondents answering questions differently or even behaving differently as a result of their participation in the study.
Confounding occurs where the relationship between independent and dependent variables is distorted by one or more additional, and sometimes unmeasured, variables. A confounding variable must be associated with both the independent and dependent variables but must not be an intermediate step in the relationship between the two (i.e. not on the causal pathway).
For example, we know that physical exercise (an independent variable) can reduce a person’s risk of cardiovascular disease (a dependent variable). Age is a confounder of that relationship: it is associated with, but not caused by, physical activity, and it is also associated with cardiovascular disease. See also ‘unobserved heterogeneity’, below.
Cross-sectional surveys involve interviewing a fresh sample of people each time they are carried out. Some cross-sectional studies are repeated regularly and can include a large number of repeat questions (questions asked on each survey round).
Data harmonisation involves retrospectively adjusting data collected by different surveys to make it possible to compare the data that was collected. This enables researchers to make comparisons both within and across studies. Repeating the same longitudinal analysis across a number of studies allows researchers to test whether results are consistent across studies, or differ in response to changing social conditions.
Data imputation is a technique for replacing missing data with an alternative estimate. There are a number of different approaches, including mean substitution and model-based multivariate approaches.
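Mean substitution, the simplest of these approaches, can be sketched with pandas (the income values are hypothetical):

```python
import pandas as pd

# Hypothetical income variable with two missing values
income = pd.Series([20000.0, None, 30000.0, None])

# Mean substitution: replace each missing value with the observed mean
imputed = income.fillna(income.mean())
print(imputed.tolist())  # [20000.0, 25000.0, 30000.0, 25000.0]
```

Model-based multivariate approaches (such as multiple imputation) are more sophisticated, but follow the same principle of filling in missing values with plausible estimates.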
Data linkage simply means connecting two or more sources of administrative, educational, geographic, health or survey data relating to the same individual for research and statistical purposes. For example, linking housing or income data to exam results data could be used to investigate the impact of socioeconomic factors on educational outcomes.
Dummy variables, also called indicator variables, are sets of dichotomous (two-category) variables we create to enable subgroup comparisons when we are analysing a categorical variable with three or more categories.
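A common way to create dummy variables in pandas is sketched below (the category labels are hypothetical); note how dropping one category leaves it as the reference group:

```python
import pandas as pd

# Hypothetical three-category variable
region = pd.Series(["North", "South", "Midlands", "South"])

# One dichotomous indicator per category; dropping the first category
# (alphabetically, 'Midlands') leaves it as the reference group
dummies = pd.get_dummies(region, drop_first=True)
print(list(dummies.columns))  # ['North', 'South']
```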
General ability is a term used to describe cognitive ability, and is sometimes used as a proxy for intelligence quotient (IQ) scores.
Heterogeneity is a term that refers to differences, most commonly differences in characteristics between study participants or samples. It is the opposite of homogeneity, which is the term used when participants share the same characteristics. Where there are differences between study designs, this is sometimes referred to as methodological heterogeneity. Both participant and methodological differences can cause divergences between the findings of individual studies, and if these divergences are greater than would be expected by chance alone, we call this statistical heterogeneity. See also: unobserved heterogeneity.
Household panel surveys collect information about the whole household at each wave of data collection, to allow individuals to be viewed in the context of their overall household. To remain representative of the population of households as a whole, studies will typically have rules governing how new entrants to the household are added to the study.
Kurtosis is sometimes described as a measure of ‘tailedness’. It is a characteristic of the distribution of observations on a variable and denotes the heaviness of the distribution’s tails. To put it another way, it is a measure of how thin or fat the lower and upper ends of a distribution are.
Longitudinal studies gather data about the same individuals (‘study participants’) repeatedly over a period of time, in some cases from birth until old age. Many longitudinal studies focus upon individuals, but some look at whole households or organisations.
Non-response bias is a type of bias introduced when those who participate in a study differ from those who do not in a way that is not random (for example, if attrition rates are particularly high among certain sub-groups). Non-random attrition over time can mean that the sample no longer remains representative of the original population being studied. Two approaches are typically adopted to deal with this type of missing data: weighting survey responses to re-balance the sample, and imputing values for the missing information.
Observational studies focus on observing the characteristics of a particular sample without attempting to influence any aspects of the participants’ lives. They can be contrasted with experimental studies, which apply a specific ‘treatment’ to some participants in order to understand its effect.
Panel studies follow the same individuals over time. They vary considerably in scope and scale. Examples include online opinion panels and short-term studies whereby people are followed up once or twice after an initial interview.
A percentile is a measure that allows us to explore the distribution of data on a variable. It denotes the percentage of individuals or observations that fall below a specified value on a variable. The value that splits the number of observations evenly, i.e. 50% of the observations on a variable fall below this value and 50% above, is called the 50th percentile or more commonly, the median.
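For example, the median of a small hypothetical set of observations can be found with Python’s standard library:

```python
import statistics

# Hypothetical observations on a variable
data = [3, 1, 4, 1, 5, 9, 2, 6, 5]

# The 50th percentile (median) splits the sorted observations evenly:
# half fall below it and half above
print(statistics.median(data))  # 4
```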
In prospective studies, individuals are followed over time and data about them is collected as their characteristics or circumstances change.
Recall error or bias describes the errors that can occur when study participants are asked to recall events or experiences from the past. It can take a number of forms – participants might completely forget something happened, or misremember aspects of it, such as when it happened, how long it lasted, or other details. Certain questions are more susceptible to recall bias than others. For example, it is usually easy for a person to accurately recall the date they got married, but it is much harder to accurately recall how much they earned in a particular job, or what their mood was at a particular time.
Record linkage studies involve linking together administrative records (for example, benefit receipts or census records) for the same individuals over time.
A reference group is a category on a categorical variable to which we compare other values. It is a term that is commonly used in the context of regression analyses in which categorical variables are being modelled.
Residuals are the differences between the observed values of the outcome and the values predicted by the model, i.e. the distance of each actual value from the estimated value on the regression line.
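A minimal illustration with hypothetical observed and fitted values:

```python
# Hypothetical observed outcomes and the values a fitted regression
# line predicts for the same cases
observed  = [2.0, 4.1, 5.9, 8.2]
predicted = [2.0, 4.0, 6.0, 8.0]

# Residual = actual value minus the model's estimated value
residuals = [round(o - p, 1) for o, p in zip(observed, predicted)]
print(residuals)  # [0.0, 0.1, -0.1, 0.2]
```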
Respondent burden is a catch-all phrase that describes the perceived burden faced by participants as a result of their being involved in a study. It could include the time spent taking part in interviews and any inconvenience this may cause, as well as any difficulties faced as a result of the content of the interview.
In retrospective studies, individuals are sampled and information is collected about their past. This might be through interviews in which participants are asked to recall important events, or by identifying relevant administrative data to fill in information on past events and circumstances.
A sample is a subset of a population that is used to represent the population as a whole. This reflects the fact that it is often not practical or necessary to survey every member of a particular population. In the case of birth cohort studies, the larger ‘population’ from which the sample is drawn comprises those born in a particular period. In the case of a household panel study like Understanding Society, the larger population from which the sample was drawn comprised all residential addresses in the UK.
A sampling frame is a list of the target population from which potential study participants can be selected.
Skewness is a measure of how asymmetrical the distribution of observations on a variable is. If the distribution has a more pronounced/longer tail at the upper end of the distribution (right-hand side), we say that the distribution is positively skewed. If it is more pronounced/longer at the lower end (left-hand side), we say that it is negatively skewed.
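The standard moment-based measure can be sketched in Python with hypothetical data:

```python
def skewness(xs):
    """Third standardised moment: positive when the right tail is longer."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# One large value stretches the upper (right-hand) tail
print(skewness([1, 2, 2, 3, 10]) > 0)  # True: positively skewed
```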
Study participants are the individuals who are interviewed as part of a longitudinal study.
Survey weights can be used to adjust a survey sample so it is representative of the survey population as a whole. They may be used to reduce the impact of attrition on the sample, or to correct for certain groups being over-sampled.
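As an illustration of how weights change an estimate, here is a weighted mean computed over hypothetical responses (the values and weights are invented):

```python
# Hypothetical responses where one group suffered more attrition, so its
# remaining members are weighted up
values  = [10, 20, 30, 40]
weights = [1.0, 1.0, 2.0, 2.0]

# Weighted mean: each response counts in proportion to its weight
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(round(weighted_mean, 1))  # 28.3 (the unweighted mean would be 25.0)
```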
A sweep is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 sweep of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term wave often has the same meaning.
The target population is the population of people that the study team wants to research, and from which a sample will be drawn.
Tracing (or tracking) describes the process by which study teams attempt to locate participants who have moved from the address at which they were last interviewed.
Unobserved heterogeneity is a term that describes the existence of unmeasured (unobserved) differences between study participants or samples that are associated with the (observed) variables of interest. The existence of unobserved variables means that statistical findings based on the observed data may be incorrect.
Variables is the term that tends to be used to describe data items within a dataset. So, for example, a questionnaire might collect information about a participant’s job (its title, whether it involves any supervision, the type of organisation they work for and so on). This information would then be coded using a code-frame and the results made available in the dataset in the form of a variable about occupation. In data analysis, variables can be described as ‘dependent’ and ‘independent’, with the dependent variable being a particular outcome of interest (for example, high attainment at school) and the independent variables being the variables that might have a bearing on this outcome (for example, parental education, gender and so on).
A wave is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.
The first thing study teams need to decide is who the study will focus on.
Think back to the three examples in the last section – each has a different sample population.
Most studies select their sample from within certain geographic limits. This might be for practical or scientific reasons. The geographic limits could be very small, for instance a city or county, or very large, such as the whole of the UK.
The first two examples are known as cohort studies and target specific groups or sections of the population. Cohort study samples share a common experience at a particular point in time. For example, a birth cohort follows children born within a specific period. Other cohorts follow groups of students in the same year at school, patients diagnosed with a certain disease at a particular point in time, or new recruits entering an organisation or industry in a given year.
Some studies, like Understanding Society, target the UK population as a whole. One challenge this presents is the fact that the population is always changing.
Studies that seek to represent the whole population must be ‘dynamic’ – that is, there needs to be a way in which new members can join the sample. Otherwise there is a risk that, over time, the sample will become increasingly different to the population it is meant to represent.
Understanding Society creates a dynamic sample by including people who move into participating households. For example, if the child of a participating household leaves home to move in with a partner, the partner will join the sample. Similarly, if a couple breaks up and forms two new households, both new households become part of the sample.
To select a sample, researchers need a ‘sampling frame’. This is a list of everyone in the target population of interest, from which a sample can be drawn. The choice of sampling frame depends on who the study wants to sample and when they would like to first interview them.
For example, the SWS wanted to interview women before they became pregnant, which ruled out certain sampling options (such as recruiting the sample through maternity records).
When assessing the sampling frame used for a study, it is important to consider how accurately the frame reflects the target population of interest. For example, does it include people who are not in the target population at all (and who need to be identified and weeded out)? Or is it missing people who are in the target population?
Child Benefit Records were used as the sampling frame for the Millennium Cohort Study. At the time, Child Benefit was universal, which meant that the list of recipients in 2000-01 (when the study started) was an accurate reflection of all UK families with a child born in the study’s target year.
However, Child Benefit Records are no longer as suitable a sampling frame for birth cohort studies because the benefit is no longer universal. Changes made in 2013 mean that the records under-represent higher earners, who are no longer entitled to Child Benefit. If a study were to use the current Child Benefit Records as a sampling frame today, the sample would under-represent higher income households.
Planning for all surveys involves considering the likely achieved sample size – that is, how many participants are likely to take part.
Cross-sectional study teams will identify the ideal achieved sample size, as well as the likely response rate – that is, the number of people who complete the survey divided by the number of people who were invited to take part (minus any who turn out to be ineligible). Study teams usually issue a sample that is larger than their ideal achieved sample size to take into account that response rates are never 100%.
With longitudinal studies, these calculations are more complex: study teams need to think about how the sample will hold up over a longer time period, across repeated rounds of data collection.
An important consideration for longitudinal study teams is attrition – that is, participants dropping out of the study, either permanently or temporarily.
Some attrition is unavoidable (for example, participants might die or leave the country). Other attrition is avoidable but challenging to overcome (for example, keeping in touch with participants who move or persuading reluctant participants to take part).
The sample design for a longitudinal study will involve making judgements about the starting sample size needed to ensure that the study can withstand likely attrition levels over time.
In the case of some longitudinal studies, the target population is much larger than the desired number of participants, so a smaller subsample needs to be selected. Study teams use various methods to make sure that this subsample is as representative of the target population as possible, and these methods have become more sophisticated over time.
For example, the first three British birth cohorts selected their sample of births by choosing a specific week within the relevant year (1946, 1958 and 1970). All births within those weeks were eligible to be included in the first round of each study. The 1958 and 1970 birth cohorts included these participants in subsequent waves of the study; in the case of the 1946 birth cohort, a subsample of cases from the first study were followed up.
There were several limitations to this approach. In particular, the sample is potentially not representative of everyone born in that year – only of those born in that season. This makes it impossible to use the data to explore issues like whether season of birth affects later outcomes, such as educational attainment.
This is one of the reasons that the most recent birth cohort, the Millennium Cohort Study, selected its sample of births from across a whole school year. This allows researchers to be confident that the data collected can be used to make inferences about the wider population born at the turn of the century.
However, it is important to be aware that there is a debate within epidemiology about whether the importance of having representative samples drawn from well-defined populations has been overrated. Instead, it is argued, some research questions are better addressed by sample designs that focus upon particular groups of interest rather than by seeking to obtain a representative sample of the relevant population as a whole. For an introduction to this discussion see this article in the Longitudinal and Life Course Studies journal.
As covered in the previous sections, most longitudinal study teams aim to select representative samples that reflect the composition of the target population. However, unless the starting sample is very large indeed, this means that there will be relatively small numbers of participants from minority groups.
While the proportions of participants from minority groups might accurately reflect the make-up of the wider population, the small numbers involved can constrain the research that can be done on these groups.
For example, imagine a particular group represents 2 per cent of the UK population as a whole. If a longitudinal study achieves 8,000 interviews in its first sweep of data collection, it will include around 160 participants from the minority group – too small for any detailed statistical analysis, especially if some of these participants drop out at subsequent sweeps.
As a result, some studies now ‘boost’ the number of participants from particular groups. Examples of longitudinal studies that have taken this approach include:
If a study contains a boosted number of participants from a particular group, survey weights should be applied to adjust the overall results so that they are representative of the population as a whole. Sample weighting involves some individuals counting as less than one case, while others may count for more.
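One common way to derive such weights is to divide each group’s share of the population by its share of the sample; the sketch below uses hypothetical shares for a boosted group:

```python
# Hypothetical boost: a group forming 2% of the population is over-sampled
# to 10% of the achieved sample
population_share = {"boosted": 0.02, "other": 0.98}
sample_share = {"boosted": 0.10, "other": 0.90}

# Weight = population share / sample share, so boosted cases count as
# less than one case each and the rest as slightly more than one
weights = {g: population_share[g] / sample_share[g] for g in population_share}
print(round(weights["boosted"], 2), round(weights["other"], 2))  # 0.2 1.09
```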