Administrative data is the term used to describe everyday data about individuals collected by government departments and agencies. Examples include exam results, benefit receipt and National Insurance payments.
Attrition is the discontinued participation of study participants in a longitudinal study. Attrition can reflect a range of factors, from study participants being untraceable to their choosing not to take part when contacted. Attrition is problematic both because it can lead to bias in the study findings (if the attrition is higher among some groups than others) and because it reduces the size of the sample.
Cohort studies are concerned with charting the lives of groups of individuals who experience the same life events within a given time period. The best known examples are birth cohort studies, which follow a group of people born in a particular period.
Conditioning refers to the process whereby participants’ answers to some questions may be influenced by their participation in the study – in other words, their responses are ‘conditioned’ by their being members of a longitudinal study. Examples would include study respondents answering questions differently or even behaving differently as a result of their participation in the study.
Confounding occurs where the relationship between independent and dependent variables is distorted by one or more additional, and sometimes unmeasured, variables. A confounding variable must be associated with both the independent and dependent variables but must not be an intermediate step in the relationship between the two (i.e. not on the causal pathway).
For example, we know that physical exercise (an independent variable) can reduce a person’s risk of cardiovascular disease (a dependent variable). Age is a confounder of that relationship because it is associated with, but not caused by, physical exercise and is also associated with cardiovascular health. See also ‘unobserved heterogeneity’, below.
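The exercise example can be made concrete with a small sketch. The counts below are invented purely for illustration (they are not taken from any study), and the function names are hypothetical. Within each age band, exercisers have half the disease risk of non-exercisers, but a comparison that ignores age makes exercise look far more protective, because in this made-up sample younger people both exercise more and have less disease.

```python
# Invented counts for illustration only:
# (age band, exercises?) -> (number of people, number with the disease)
counts = {
    ("younger", True): (800, 40),
    ("younger", False): (200, 20),
    ("older", True): (200, 40),
    ("older", False): (800, 320),
}

def risk(groups):
    """Share of people with the disease across the given groups."""
    people = sum(counts[g][0] for g in groups)
    cases = sum(counts[g][1] for g in groups)
    return cases / people

def risk_ratio(age_bands):
    """Risk among exercisers divided by risk among non-exercisers."""
    exercisers = [(band, True) for band in age_bands]
    non_exercisers = [(band, False) for band in age_bands]
    return risk(exercisers) / risk(non_exercisers)

crude = risk_ratio(["younger", "older"])  # ignores age altogether
younger_only = risk_ratio(["younger"])    # compares like with like
older_only = risk_ratio(["older"])

print(f"crude: {crude:.2f}, younger: {younger_only:.2f}, older: {older_only:.2f}")
```

Comparing within age bands is only one simple way of taking a confounder into account, and it only works for confounders that have actually been measured.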
Cross-sectional surveys involve interviewing a fresh sample of people each time they are carried out. Some cross-sectional studies are repeated regularly and can include a large number of repeat questions (questions asked on each survey round).
Data harmonisation involves retrospectively adjusting data collected by different surveys to make it possible to compare the data that was collected. This enables researchers to make comparisons both within and across studies. Repeating the same longitudinal analysis across a number of studies allows researchers to test whether results are consistent across studies, or differ in response to changing social conditions.
Data linkage simply means connecting two or more sources of administrative, educational, geographic, health or survey data relating to the same individual for research and statistical purposes. For example, linking housing or income data to exam results data could be used to investigate the impact of socioeconomic factors on educational outcomes.
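As a minimal sketch of the idea, the snippet below links two invented sets of records on a shared person identifier. The identifiers, field names and the `link_records` helper are all hypothetical, and the sketch assumes that both sources carry the same identifier, which will not always be the case in practice.

```python
# Hypothetical records keyed by an invented person identifier.
household_income = {
    "P001": {"income_band": "low"},
    "P002": {"income_band": "high"},
    "P003": {"income_band": "middle"},
}
exam_results = {
    "P001": {"maths_grade": 5},
    "P003": {"maths_grade": 7},
    "P004": {"maths_grade": 6},
}

def link_records(left, right):
    """Combine the records of people who appear in both sources."""
    linked = {}
    for person_id in sorted(left.keys() & right.keys()):
        linked[person_id] = {**left[person_id], **right[person_id]}
    return linked

linked = link_records(household_income, exam_results)

# People found in only one source cannot be linked.
unlinked = sorted(household_income.keys() ^ exam_results.keys())

print(linked)
print(unlinked)
```

Only people present in both sources end up in the linked data, so it is worth checking who has been left out and whether they differ from those who were linked.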
Household panel surveys collect information about the whole household at each wave of data collection, to allow individuals to be viewed in the context of their overall household. To remain representative of the population of households as a whole, studies will typically have rules governing how new entrants to the household are added to the study.
Longitudinal studies gather data about the same individuals (‘study participants’) repeatedly over a period of time, in some cases from birth until old age. Many longitudinal studies focus upon individuals, but some look at whole households or organisations.
Non-response bias is a type of bias introduced when those who participate in a study differ from those who do not in a way that is not random (for example, if attrition rates are particularly high among certain sub-groups). Non-random attrition over time can mean that the sample no longer remains representative of the original population being studied. Two approaches are typically adopted to deal with this type of missing data: weighting survey responses to re-balance the sample, and imputing values for the missing information.
Observational studies focus on observing the characteristics of a particular sample without attempting to influence any aspects of the participants’ lives. They can be contrasted with experimental studies, which apply a specific ‘treatment’ to some participants in order to understand its effect.
Panel studies follow the same individuals over time. They vary considerably in scope and scale. Examples include online opinion panels and short-term studies whereby people are followed up once or twice after an initial interview.
In prospective studies, individuals are followed over time and data about them is collected as their characteristics or circumstances change.
Recall error or bias describes the errors that can occur when study participants are asked to recall events or experiences from the past. It can take a number of forms – participants might completely forget something happened, or misremember aspects of it, such as when it happened, how long it lasted, or other details. Certain questions are more susceptible to recall bias than others. For example, it is usually easy for a person to accurately recall the date they got married, but it is much harder to accurately recall how much they earned in a particular job, or what their mood was at a particular time.
Record linkage studies involve linking together administrative records (for example, benefit receipts or census records) for the same individuals over time.
Respondent burden is a catch-all phrase that describes the perceived burden faced by participants as a result of their being involved in a study. It could include time spent taking part in the interview and inconvenience this may cause, as well as any difficulties faced as a result of the content of the interview.
In retrospective studies, individuals are sampled and information is collected about their past. This might be through interviews in which participants are asked to recall important events, or by identifying relevant administrative data to fill in information on past events and circumstances.
A sample is a subset of a population that is used to represent the population as a whole. This reflects the fact that it is often not practical or necessary to survey every member of a particular population. In the case of birth cohort studies, the larger ‘population’ from which the sample is drawn comprises those born in a particular period. In the case of a household panel study like Understanding Society, the larger population from which the sample was drawn comprised all residential addresses in the UK.
A sampling frame is a list of the target population from which potential study participants can be selected.
Study participants are the individuals who are interviewed as part of a longitudinal study.
Survey weights can be used to adjust a survey sample so it is representative of the survey population as a whole. They may be used to reduce the impact of attrition on the sample, or to correct for certain groups being over-sampled.
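One simple way of constructing such weights, sketched below with invented figures, is to give each group a weight equal to its share of the population divided by its share of the sample. The group names, the outcome values and the `group_weights` helper are hypothetical, and the population shares are assumed to be known from another source. Real studies typically use more elaborate weighting schemes.

```python
# Invented figures: population shares are assumed to be known in advance.
population_share = {"under_40": 0.5, "40_plus": 0.5}

# Each respondent is (group, outcome). The older group is over-represented.
respondents = [("under_40", 10.0)] * 20 + [("40_plus", 20.0)] * 80

def group_weights(sample, pop_share):
    """Weight for each group = population share / sample share."""
    n = len(sample)
    sample_counts = {}
    for group, _ in sample:
        sample_counts[group] = sample_counts.get(group, 0) + 1
    return {g: pop_share[g] / (count / n) for g, count in sample_counts.items()}

weights = group_weights(respondents, population_share)

unweighted_mean = sum(y for _, y in respondents) / len(respondents)
weighted_mean = (
    sum(weights[g] * y for g, y in respondents)
    / sum(weights[g] for g, _ in respondents)
)

print(weights)
print(unweighted_mean, weighted_mean)
```

In this made-up sample the unweighted average is pulled towards the over-represented older group, while the weighted average gives each group the influence it has in the population.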
Sweep is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 sweep of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term wave often has the same meaning.
The target population is the population of people that the study team wants to research, and from which a sample will be drawn.
Tracing (or tracking) describes the process by which study teams attempt to locate participants who have moved from the address at which they were last interviewed.
Unobserved heterogeneity is a term from econometrics that describes the existence of variables about an individual that have not been measured (unobserved) but are associated with the (observed) variables of interest. The existence of unobserved variables means that statistical findings based on the observed data may be incorrect.
Variables is the term that tends to be used to describe data items within a dataset. So, for example, a questionnaire might collect information about a participant’s job (its title, whether it involves any supervision, the type of organisation they work for and so on). This information would then be coded using a code-frame and the results made available in the dataset in the form of a variable about occupation. In data analysis variables can be described as ‘dependent’ and ‘independent’, with the dependent variable being a particular outcome of interest (for example, high attainment at school) and the independent variables being the variables that might have a bearing on this outcome (for example, parental education, gender and so on).
Wave is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.
Longitudinal studies have a number of particular advantages in terms of the quantity or quality of the data that they collect:
Detail over the life course. The value of longitudinal studies increases as each sweep builds on what is already known about the study participants. This means that on many topics, longitudinal studies typically contain far more detailed information than could be collected through a one-off survey. For example, many studies collect a detailed array of information about study participants’ education, work histories and health conditions.
Establishing the order in which events occur. Longitudinal data collection allows researchers to build up a more accurate and reliably ordered account of the key events and experiences in study participants’ lives. Understanding the order in which events occur is important in assessing causation.
Reducing recall bias. Longitudinal studies help reduce the impact of recall error or bias, which occurs when people forget or misremember events when asked about them later. In longitudinal studies, participants provide information about their current circumstances, or are asked to remember events over only a short period of time (that is, since the time of the last sweep).
Many of the advantages of longitudinal studies relate to the analytic questions their data can help address. For example, longitudinal data help with:
Exploring patterns of change and the dynamics of individual behaviour. Longitudinal data allows researchers to explore dynamic rather than static concepts. This is important for understanding how people move from one situation to another (for example, through work, poverty, parenthood, ill health and so on).
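The difference between a static and a dynamic picture can be sketched with a few invented employment histories. The participants, statuses and the `count_transitions` helper below are hypothetical. Counting who is employed at each sweep, as a cross-sectional survey would, suggests that nothing changes, whereas following the same individuals shows that people are moving in and out of work.

```python
from collections import Counter

# Hypothetical employment status for three participants at three sweeps.
histories = {
    "P001": ["employed", "employed", "unemployed"],
    "P002": ["unemployed", "employed", "employed"],
    "P003": ["employed", "unemployed", "employed"],
}

def count_transitions(status_by_person):
    """Count moves from one sweep's status to the next, person by person."""
    transitions = Counter()
    for statuses in status_by_person.values():
        for before, after in zip(statuses, statuses[1:]):
            transitions[(before, after)] += 1
    return transitions

transitions = count_transitions(histories)

# The static view: how many people are employed at each sweep.
employed_per_sweep = [
    sum(statuses[sweep] == "employed" for statuses in histories.values())
    for sweep in range(3)
]

print(employed_per_sweep)
print(dict(transitions))
```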
The link between earlier life circumstances and later outcomes. By building up detailed information over time, longitudinal studies are able to paint a rich and accurate picture of participants’ lives.
In the case of birth cohort studies this has allowed researchers to explore how circumstances earlier in life can influence later outcomes. For example, some of the most well-known findings from the cohort studies describe the long-lasting reach of socio-economic disadvantage in childhood.
Longitudinal data also allow us to assess the time-related characteristics of particular events or circumstances (that is, their duration, frequency or timing). For example, does the impact of ill health change depending on when in their life someone becomes ill, how long they remain ill, and how often they experience illnesses?
Providing insights into causal mechanisms and processes. Many surveys provide evidence about the association between particular circumstances and outcomes. For example, a cross-sectional study might find that the unemployed have poorer health than those in work (so, in other words, there is an association between health and employment status). But interpreting this association is more challenging. Might, for example, unemployment be the cause of poor health – or perhaps poor health could lead to unemployment? Longitudinal data cannot definitively ‘prove’ causality, but unlike data from cross-sectional studies, it has a number of important attributes that give more insights into the causal processes that might be involved:
Distinguishing between age and cohort effects. Longitudinal studies can help researchers to distinguish between changes that happen as people get older, known as ‘age effects’, and generational differences that reflect the historical, economic and social context within which different cohorts grew up, known as ‘cohort’ or ‘generational’ effects.
For example, cross-sectional data might show a clear relationship between age and political affiliation (with older age groups being more likely to vote for the Conservative party). Longitudinal data would allow analysts to investigate whether the older generations in the UK are more likely than younger ones to support the Conservative party (a cohort effect), or whether all people become more likely to vote Conservative as they get older (an age effect).
Age and cohort/generational effects also need to be distinguished from ‘period’ effects; these refer to forces that influence everyone – for example, key events in history that affect everyone irrespective of their age or the generation they were born into.
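One way to see why these three effects are hard to separate is that a person’s age is fixed once the year of the survey (the period) and their year of birth (the cohort) are known. The sketch below uses arbitrary years chosen for illustration, and the `age_at` helper is hypothetical. In a single cross-sectional survey, each age corresponds to exactly one birth year, so age and cohort cannot be told apart. Following several cohorts over time means the same age is observed for people born in different years.

```python
def age_at(survey_year, birth_year):
    """Age follows directly from the survey year and the year of birth."""
    return survey_year - birth_year

# One cross-sectional survey: every age maps to exactly one birth year.
cross_section = [(age_at(2020, born), born) for born in (1950, 1970, 1990)]

# One cohort followed over time: age changes, birth year stays the same.
one_cohort = [(age_at(year, 1970), 1970) for year in (2000, 2010, 2020)]

# Two cohorts followed over time: the same age appears for both birth years.
several_cohorts = [
    (age_at(year, born), born)
    for born in (1950, 1970)
    for year in (2000, 2020)
]
same_age_different_cohorts = sorted(
    born for age, born in several_cohorts if age == 50
)

print(cross_section)
print(one_cohort)
print(same_age_different_cohorts)
```

Note that the two cohorts reach age 50 in different survey years, so comparing them at the same age still leaves period effects to be considered.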