Administrative data

Administrative data is the term used to describe everyday data about individuals collected by government departments and agencies. Examples include exam results, benefit receipt and National Insurance payments.


Attrition

Attrition is the discontinued participation of study participants in a longitudinal study. Attrition can reflect a range of factors, from the study participant not being traceable to them choosing not to take part when contacted. Attrition is problematic both because it can lead to bias in the study findings (if the attrition is higher among some groups than others) and because it reduces the size of the sample.

Body mass index

Body mass index is a measure used to assess if an individual is a healthy weight for their height. It is calculated by dividing the individual’s weight (in kilograms) by the square of their height (in metres), and it is typically represented in units of kg/m².
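
As a minimal sketch of the calculation described above, the Python function below divides weight by the square of height. The function name and the example participant (70 kg, 1.75 m) are illustrative assumptions, not taken from any study.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: weight divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical participant: 70 kg and 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))  # 22.9
```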

Cohort studies

Cohort studies are concerned with charting the lives of groups of individuals who experience the same life events within a given time period. The best known examples are birth cohort studies, which follow a group of people born in a particular period.

Complete case analysis

Complete case analysis is the term used to describe a statistical analysis that only includes participants for whom we have no missing data on the variables of interest. Participants with any missing data are excluded.
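
The idea can be illustrated with a short Python sketch. The records, variable names and the use of `None` to mark a missing value are all assumptions made for the example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "bmi": 22.9, "smoker": False},
    {"id": 2, "bmi": None, "smoker": True},
    {"id": 3, "bmi": 27.4, "smoker": None},
    {"id": 4, "bmi": 31.0, "smoker": True},
]
variables_of_interest = ["bmi", "smoker"]


def complete_cases(rows, variables):
    """Keep only rows with no missing value on any variable of interest."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


complete = complete_cases(records, variables_of_interest)
print([row["id"] for row in complete])  # [1, 4]
```

Participants 2 and 3 are dropped because each has one missing value, which shows how quickly a complete case analysis can shrink the sample.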


Conditioning

Conditioning refers to the process whereby participants’ answers to some questions may be influenced by their participation in the study – in other words, their responses are ‘conditioned’ by their being members of a longitudinal study. Examples would include study respondents answering questions differently or even behaving differently as a result of their participation in the study.


Confounding

Confounding occurs where the relationship between independent and dependent variables is distorted by one or more additional, and sometimes unmeasured, variables. A confounding variable must be associated with both the independent and dependent variables but must not be an intermediate step in the relationship between the two (i.e. not on the causal pathway).

For example, we know that physical exercise (an independent variable) can reduce a person’s risk of cardiovascular disease (a dependent variable). We can say that age is a confounder of that relationship as it is associated with, but not caused by, physical activity and is also associated with coronary health. See also ‘unobserved heterogeneity’, below.


Cross-sectional surveys

Cross-sectional surveys involve interviewing a fresh sample of people each time they are carried out. Some cross-sectional studies are repeated regularly and can include a large number of repeat questions (questions asked on each survey round).

Data harmonisation

Data harmonisation involves retrospectively adjusting data collected by different surveys to make it possible to compare the data that was collected. This enables researchers to make comparisons both within and across studies. Repeating the same longitudinal analysis across a number of studies allows researchers to test whether results are consistent across studies, or differ in response to changing social conditions.

Data imputation

Data imputation is a technique for replacing missing data with an alternative estimate. There are a number of different approaches, including mean substitution and model-based multivariate approaches.
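
Mean substitution, the simplest of the approaches mentioned above, can be sketched in Python as follows. The function name and the sleep data are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean


def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical hours of sleep per night; two participants did not answer.
hours_sleep = [7.0, None, 6.0, 8.0, None]
imputed = mean_substitute(hours_sleep)
print(imputed)  # [7.0, 7.0, 6.0, 8.0, 7.0]
```

Because every gap is filled with the same value, mean substitution tends to understate how much the variable really varies, which is one reason model-based approaches are often preferred.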

Data linkage

Data linkage simply means connecting two or more sources of administrative, educational, geographic, health or survey data relating to the same individual for research and statistical purposes. For example, linking housing or income data to exam results data could be used to investigate the impact of socioeconomic factors on educational outcomes.
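
In code, linkage amounts to joining records on a shared identifier. The Python sketch below assumes, purely for illustration, that both sources hold a common participant ID; in practice linkage is usually more involved and is carried out under strict governance.

```python
# Hypothetical sources keyed by a shared participant ID.
survey = {
    101: {"household_income": 28000},
    102: {"household_income": 41000},
    103: {"household_income": 19500},
}
exam_records = {
    101: {"exam_score": 64},
    103: {"exam_score": 71},
    104: {"exam_score": 58},
}


def link_records(left, right):
    """Combine the records of individuals who appear in both sources."""
    shared_ids = left.keys() & right.keys()
    return {pid: {**left[pid], **right[pid]} for pid in shared_ids}


linked = link_records(survey, exam_records)
print(sorted(linked))  # [101, 103]
```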

Dummy variables

Dummy variables, also called indicator variables, are sets of dichotomous (two-category) variables we create to enable subgroup comparisons when we are analysing a categorical variable with three or more categories.
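
A minimal Python sketch of dummy coding follows. The three-category smoking variable, the `is_` naming and the choice of ‘never’ as the omitted comparison category are assumptions made for the example.

```python
def make_dummies(values, reference):
    """Create one 0/1 variable per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical categorical variable with three categories.
smoking_status = ["never", "former", "current", "never"]
dummies = make_dummies(smoking_status, reference="never")
for status, row in zip(smoking_status, dummies):
    print(status, row)
```

Participants in the omitted category score 0 on every dummy variable, so each dummy compares one category against that category.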

General ability

General ability is a term used to describe cognitive ability, and is sometimes used as a proxy for intelligence quotient (IQ) scores.


Heterogeneity

Heterogeneity is a term that refers to differences, most commonly differences in characteristics between study participants or samples. It is the opposite of homogeneity, which is the term used when participants share the same characteristics. Where there are differences between study designs, this is sometimes referred to as methodological heterogeneity. Both participant and methodological differences can cause divergences between the findings of individual studies and, if these are greater than would be expected by chance alone, we call this statistical heterogeneity. See also: unobserved heterogeneity.

Household panel surveys

Household panel surveys collect information about the whole household at each wave of data collection, to allow individuals to be viewed in the context of their overall household. To remain representative of the population of households as a whole, studies will typically have rules governing how new entrants to the household are added to the study.


Kurtosis

Kurtosis is sometimes described as a measure of ‘tailedness’. It is a characteristic of the distribution of observations on a variable and denotes the heaviness of the distribution’s tails. To put it another way, it is a measure of how thin or fat the lower and upper ends of a distribution are.
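
One common way to put a number on this is the moment-based definition, the fourth central moment divided by the squared variance, under which a normal distribution scores about 3. The Python sketch below uses that definition with made-up data; statistical packages may apply small-sample corrections or report ‘excess’ kurtosis (this value minus 3) instead.

```python
def kurtosis(values):
    """Moment-based kurtosis: fourth central moment over squared variance."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m4 = sum((v - centre) ** 4 for v in values) / n
    return m4 / m2 ** 2


# Hypothetical data: evenly spread values versus a few extreme ones.
thin_tails = [1, 2, 3, 4, 5]
fat_tails = [-10, 0, 0, 0, 0, 0, 0, 10]
print(kurtosis(thin_tails))  # 1.7
print(kurtosis(fat_tails))   # 4.0
```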

Longitudinal studies

Longitudinal studies gather data about the same individuals (‘study participants’) repeatedly over a period of time, in some cases from birth until old age. Many longitudinal studies focus upon individuals, but some look at whole households or organisations.

Non-response bias

Non-response bias is a type of bias introduced when those who participate in a study differ from those who do not in a way that is not random (for example, if attrition rates are particularly high among certain sub-groups). Non-random attrition over time can mean that the sample no longer remains representative of the original population being studied. Two approaches are typically adopted to deal with this type of missing data: weighting survey responses to re-balance the sample, and imputing values for the missing information.

Observational studies

Observational studies focus on observing the characteristics of a particular sample without attempting to influence any aspects of the participants’ lives. They can be contrasted with experimental studies, which apply a specific ‘treatment’ to some participants in order to understand its effect.

Panel studies

Panel studies follow the same individuals over time. They vary considerably in scope and scale. Examples include online opinion panels and short-term studies whereby people are followed up once or twice after an initial interview.


Percentile

A percentile is a measure that allows us to explore the distribution of data on a variable. It denotes the percentage of individuals or observations that fall below a specified value on a variable. The value that splits the number of observations evenly, i.e. 50% of the observations on a variable fall below this value and 50% above, is called the 50th percentile or, more commonly, the median.
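
The Python sketch below illustrates both ideas with hypothetical data on weekly alcohol units: the percentage of observations falling below a given value, and the median as the value that splits the observations evenly. The helper name `percent_below` is an assumption for the example.

```python
from statistics import median


def percent_below(values, threshold):
    """Percentage of observations that fall strictly below the threshold."""
    return 100 * sum(v < threshold for v in values) / len(values)


# Hypothetical alcohol units per week for eight participants.
units_per_week = [0, 2, 4, 6, 10, 14, 20, 30]

middle = median(units_per_week)
print(middle)                                 # 8.0
print(percent_below(units_per_week, middle))  # 50.0
print(percent_below(units_per_week, 20))      # 75.0
```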

Prospective study

In prospective studies, individuals are followed over time and data about them is collected as their characteristics or circumstances change.

Recall error or bias

Recall error or bias describes the errors that can occur when study participants are asked to recall events or experiences from the past. It can take a number of forms – participants might completely forget something happened, or misremember aspects of it, such as when it happened, how long it lasted, or other details. Certain questions are more susceptible to recall bias than others. For example, it is usually easy for a person to accurately recall the date they got married, but it is much harder to accurately recall how much they earned in a particular job, or how their mood was at a particular time.

Record linkage

Record linkage studies involve linking together administrative records (for example, benefit receipts or census records) for the same individuals over time.

Reference group

A reference group is a category on a categorical variable to which we compare other values. It is a term that is commonly used in the context of regression analyses in which categorical variables are being modelled.


Residuals

Residuals are the differences between the observed values and the expected values predicted by the model (from the constant and the predictors), i.e. the distance of each actual value from the estimated value on the regression line. Each residual is an estimate of the error for that observation.
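
As a hedged illustration, the Python sketch below fits a straight line to made-up data by ordinary least squares and computes each residual as the observed value minus the value estimated by the line. The data and function name are assumptions for the example.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: return (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical predictor (x) and outcome (y) values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print([round(r, 1) for r in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

With an intercept in the model, the residuals from a least squares fit sum to zero (up to rounding), which is a useful sanity check.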

Respondent burden

Respondent burden is a catch-all phrase that describes the perceived burden faced by participants as a result of their being involved in a study. It could include time spent taking part in the interview and inconvenience this may cause, as well as any difficulties faced as a result of the content of the interview.

Retrospective study

In retrospective studies, individuals are sampled and information is collected about their past. This might be through interviews in which participants are asked to recall important events, or by identifying relevant administrative data to fill in information on past events and circumstances.


Sample

A sample is a subset of a population that is used to represent the population as a whole. This reflects the fact that it is often not practical or necessary to survey every member of a particular population. In the case of birth cohort studies, the larger ‘population’ from which the sample is drawn comprises those born in a particular period. In the case of a household panel study like Understanding Society, the larger population from which the sample was drawn comprised all residential addresses in the UK.

Sampling frame

A sampling frame is a list of the target population from which potential study participants can be selected.


Skewness

Skewness is a measure of how asymmetrical the distribution of observations on a variable is. If the distribution has a more pronounced/longer tail at the upper end of the distribution (right-hand side), we say that the distribution is positively skewed. If it is more pronounced/longer at the lower end (left-hand side), we say that it is negatively skewed.
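
A common moment-based measure divides the third central moment by the variance raised to the power 1.5. Under the standard convention, a longer upper (right-hand) tail gives a positive value and a longer lower (left-hand) tail gives a negative value. The Python sketch below uses made-up data, and statistical packages may apply small-sample corrections.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data sets.
long_upper_tail = [1, 2, 2, 3, 12]
long_lower_tail = [-12, -3, -2, -2, -1]
symmetric = [1, 2, 3, 4, 5]

print(skewness(long_upper_tail) > 0)  # True
print(skewness(long_lower_tail) < 0)  # True
print(skewness(symmetric))            # 0.0
```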

Study participants

Study participants are the individuals who are interviewed as part of a longitudinal study.

Survey weights

Survey weights can be used to adjust a survey sample so it is representative of the survey population as a whole. They may be used to reduce the impact of attrition on the sample, or to correct for certain groups being over-sampled.
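
The effect of weighting can be sketched with a weighted average in Python. The responses and weights below are invented for the example; real survey weights are supplied with the dataset and documented by the study team.

```python
def weighted_mean(values, weights):
    """Average in which each response counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical responses: 1 = smoker, 0 = non-smoker.
smoker = [1, 0, 0, 0]
# Hypothetical weights: the first respondent stands in for an under-represented group.
weights = [3.0, 1.0, 1.0, 1.0]

print(sum(smoker) / len(smoker))       # 0.25 (unweighted)
print(weighted_mean(smoker, weights))  # 0.5 (weighted)
```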


Sweep

Sweep is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 sweep of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term wave often has the same meaning.

Target population

The population of people that the study team wants to research, and from which a sample will be drawn.

Tracing (or tracking)

Tracing (or tracking) describes the process by which study teams attempt to locate participants who have moved from the address at which they were last interviewed.

Unobserved heterogeneity

Unobserved heterogeneity is a term that describes the existence of unmeasured (unobserved) differences between study participants or samples that are associated with the (observed) variables of interest. The existence of unobserved variables means that statistical findings based on the observed data may be incorrect.


Variables

Variables is the term that tends to be used to describe data items within a dataset. So, for example, a questionnaire might collect information about a participant’s job (its title, whether it involves any supervision, the type of organisation they work for and so on). This information would then be coded using a code-frame and the results made available in the dataset in the form of a variable about occupation. In data analysis variables can be described as ‘dependent’ and ‘independent’, with the dependent variable being a particular outcome of interest (for example, high attainment at school) and the independent variables being the variables that might have a bearing on this outcome (for example, parental education, gender and so on).


Wave

Wave is the term used to refer to a round of data collection in a particular longitudinal study (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.

Learning Hub

Health behaviours

Why use longitudinal data to study health behaviours?

Medical professionals are increasingly concerned with the threat posed to the population’s health by modern lifestyles. Smoking, drinking, poor diet, lack of exercise or sleep, stress and many other common features of daily life are posing serious public health risks.

While some habits, like drug use, could have an immediate impact on health, the danger often lies in the cumulative effects of sustained unhealthy behaviour over time.

Longitudinal studies make a unique contribution to our understanding of health behaviours by tracking people’s lifetime habits. The data have been used to track these cumulative effects, and to determine whether there are critical points in our lives where changing habits can make the biggest difference. Longitudinal evidence has been instrumental in addressing major public health problems, such as obesity, smoking, alcohol consumption and common mental illnesses.

Selected longitudinal evidence on health behaviours

The risk of child obesity has increased almost three-fold in five generations

Researchers have discovered that children born since 1990 are up to three times more likely than older generations to be overweight or obese by age 10, by comparing data from five different longitudinal studies.

Health ‘benefits’ of moderate drinking are overstated

According to findings from the 1958 National Child Development Study, claims that, for example, a glass of red wine a day is good for your health are unfounded. Researchers were able to bust these myths by looking at how people’s drinking patterns change over the course of their lives, and crucially in relation to their health and education.

Women with fewer educational qualifications are less likely to eat well during pregnancy

According to findings from the Southampton Women’s Survey, 55 per cent of women with no qualifications had extremely poor diets while pregnant, compared to just 3 per cent of women with degrees. The effect of education remained even when taking into account their social class, whether they lived in a deprived neighbourhood, and whether they received benefits.

Teen cannabis use linked to other illicit drug taking in early adulthood

According to findings from the Avon Longitudinal Study of Parents and Children, regular and occasional cannabis use in adolescence is associated with a greater risk of other illicit drug taking in early adulthood.

Middle-aged couch potatoes may be ‘planted’ more than 30 years earlier

According to findings from the 1970 British Cohort Study, children who watched a lot of TV also spent more time in front of the screen at age 42 than those who had watched relatively little television in childhood. Forty-two-year-olds who watched TV for at least three hours a day were more likely to be in only ‘fair’ or ‘poor’ health and to report that they were either overweight or obese.

Many people in the UK struggle to get a good night’s sleep

According to findings from Understanding Society, those who struggle to get the recommended 7 hours of good quality sleep per night are less likely to report good health and mental wellbeing.

What information do longitudinal studies collect on health behaviours?

Longitudinal studies collect such an array of information about participants’ lifestyles that it would be impossible to list them all. These are some of the most common health behaviours covered by the studies.

Alcohol consumption

Information on alcohol consumption is often quite extensive in longitudinal studies. Most studies (depending on their focus) will ask study participants how much and what they drink. For younger participants, many studies also ask about the drinking habits of the people around them, for instance, whether they have friends, siblings or parents who drink. In older studies, participants who have abstained from alcohol or cut back the amount they drink might be asked about their reasons why.

Policymakers and researchers are increasingly interested in the age at which people start drinking (alcohol initiation). Younger participants will be asked at very early ages (10 or 11) if they’ve ever had a drink, and some older studies have attempted to collect this information retrospectively. Find out more about collecting information retrospectively in the Study Design module.

Some studies will ask a standard series of questions designed to determine whether someone is (or is at risk of becoming) an alcoholic. These questions might include ‘How often during the last year have you had a feeling of guilt or remorse after drinking?’, ‘How often during the last year have you failed to do what was normally expected from you because of your drinking?’ or ‘Have you or somebody else been injured as a result of your drinking?’

Diet and nutrition

Information on study participants’ diets can range considerably in detail depending on the purpose of the study (read more about scientific aims and objectives). However, diet is such an important aspect of our health that most studies will cover it in some way.

Many studies ask about junk food, such as how often participants have takeaways, ready-meals, sweets, crisps and fizzy drinks.

Studies or sweeps with a strong health focus may go into much more depth, asking participants about certain foods within each food group. For instance, this can be as specific as asking how often they eat apples, pears, oranges, grapes, etc. to determine their fruit intake. Health-related studies and sweeps will also include a good deal of information on drinks, including tea, coffee, water and milk. They might also cover dietary supplements, like vitamins.

And what about breakfast, supposedly the most important meal of the day? A lot of studies ask specifically about breakfast, including how often participants skip breakfast, or go to breakfast clubs (for school-age participants).

Illegal substance use

Longitudinal studies tend to cover legal substances (such as alcohol and tobacco) in greater detail as these are more widely used in the general population. However, participants’ use of hard drugs can have a detrimental impact on health and other areas of life. For this reason, most studies will collect some information on illegal substance use (with greater detail in medically-focused studies).

As with alcohol, questions about illegal substance use tend to focus on what drugs participants have taken and how often. Some will also ask specifically whether participants have ever suffered from drug addiction. The longitudinal nature of the studies allows researchers to determine the stage of life when participants took drugs, for instance, whether a mother took drugs during pregnancy.

Some studies will also ask about the drug habits of important people in the participants’ lives, for example whether their friends or parents use drugs or have substance abuse problems.

Physical activity (and inactivity)

Longitudinal studies cover physical activity and sedentary behaviour in quite some depth, and using a range of methods. Broadly speaking, longitudinal studies often cover:

  • how often participants get exercise
  • the intensity level of exercise
  • how often they take part in light or moderate ‘work’, such as housework, caring, or gardening – activities that participants might not classify as exercise
  • how often participants take part in various individual and team sports
  • for children, whether they are physically active in and out of school
  • how much time they spend sitting, lying down, or in front of the TV or other screens.

Many studies also take a range of measures of physical function – how well participants can perform basic tasks of everyday life. Participants’ physical ability may be related to how active they are, particularly at important life stages, such as infancy/early years and later life.

Many studies are now asking their participants to wear activity monitors, which will provide objective measures of their active and sedentary behaviour. These measures can be compared to the information reported by participants about their physical activity.


Sleep

Getting a good night’s sleep is known to be related to many different aspects of our health. Longitudinal studies are increasingly interested in participants’ sleep patterns and behaviours. Many studies cover the following aspects of sleep:

  • average hours per night
  • time taken to fall asleep
  • average number of times participants wake up in the night
  • sleep routines, including bedtimes (for younger participants)
  • whether participants share a bed or bedroom (for younger participants)
  • whether participants have bad dreams, night terrors, sleepwalking, sleep talking
  • whether participants feel rested when they wake up
  • how often participants feel tired.


Smoking (tobacco)

For decades, longitudinal studies have contributed to our understanding of the harmful effects of smoking. In fact, the 1958 National Child Development Study is credited with discovering the lasting effects of smoking during pregnancy on child development.

The questions longitudinal studies ask about cigarette smoking are very similar to the questions about alcohol consumption. Most studies collect information on the average number of cigarettes participants smoke per day, and the age they started smoking. If a participant has cut back on smoking, many studies will ask when and why.

Some studies ask in more detail about other types of tobacco use, such as pipes, cigars, rolled cigarettes, chewing tobacco, e-cigarettes, filtered/unfiltered cigarettes – but this kind of question is less common than questions about types of alcohol.

Information about parents’ smoking habits is collected when participants are younger. Some studies will continue to ask about sources of second-hand smoke in later years, such as whether people often smoke around participants, or if they have a partner at home who smokes.

Find out more about what information longitudinal studies collect in the Introduction to longitudinal studies module.

How do longitudinal studies collect information on health behaviours?

Information on health behaviours is collected using a wide range of different methods and modes. Study teams select the data collection tools most appropriate to the particular behaviour being examined. However, as with other topics, a great deal of information is collected through questionnaires.

Understandably, questions about health behaviour tend to be asked of the study participants directly. When study participants are younger, parents will be asked about both their own habits and those of their children. Participants are usually only asked to recall their health behaviours over a relatively short period of time.

In the case of some health behaviours, study teams might opt to include standardised, validated tests in their questionnaires, which are designed to assess whether someone has a disorder or is putting their health at risk. One example is the World Health Organisation’s Alcohol Use Disorders Identification Test, a series of questions used by several studies to assess whether someone has alcoholism.

But people can have misconceptions about even their own behaviour. For instance, participants might think they drink less than they actually do, or that they get more exercise. Different methods of data collection can help improve the accuracy of participants’ responses. For instance, studies are increasingly asking participants to complete food diaries. This involves keeping a (more or less) real-time record of the food they eat. Another example is asking participants to wear activity monitors to measure how often they are physically active and how often they are sedentary. Some of the most interesting analyses look at the differences between objective and self-reported measures.

There are practical considerations that need to be taken into account when collecting data using these methods. For instance, study teams need to consider whether it is reasonable to ask busy participants to keep diaries and wear electronic devices and for how long (two days, two weeks?). There are also logistical challenges in getting equipment like activity monitors back from participants afterwards. Read more about practical considerations in the Study design module.

Study teams also need to think about survey mode when collecting data about health behaviours. For instance, will participants be more or less likely to answer sensitive questions if they are sat face-to-face with an interviewer, or if they answer the questions at home on paper or online? Studies may opt for mixing modes (for example, some face-to-face, some paper self-completion) to get the best possible response. Read more about mixing modes in the Study design module.

Find out more about the importance of data collection methods and modes for longitudinal studies in the Study design module.

Advantages of using longitudinal data

There is a wide range of advantages to using longitudinal data to study health behaviours.

Information over the life course: Unlike cross-sectional studies, longitudinal studies track participants’ habits over an extended period of time. This approach is critical to studies of health behaviours. Some habits may be ingrained from childhood, while others change over time. The consequences of our lifetime behaviours are often cumulative. In other cases, the point at which we change our behaviours is important. For example, are the benefits of quitting smoking early in adulthood the same as quitting in later life? Longitudinal studies allow researchers to establish the order in which life events occur. Learn more about the differences between longitudinal and cross-sectional studies in the Introduction to longitudinal studies module.

Prospective data collection: Health behaviours can be difficult to remember accurately. Longitudinal studies typically ask participants to remember back a short time, and many are starting to collect real-time or near real-time information through diaries and activity monitors. Learn more about the benefits of prospective data collection in the Study design module.

Opportunities for intergenerational comparisons: Many longitudinal studies, following people born in different generations, have measured health behaviours, allowing for cross-study comparisons.

Sample size: Longitudinal studies have large sample sizes, which allow for robust assessments of the common health behaviours that are linked to key public health problems. Learn more about longitudinal study samples in the Study design module.

Find out more about the strengths of longitudinal data in the Introduction to longitudinal studies module.

Challenges of using longitudinal data

The challenges of using longitudinal data to study health behaviours depend on what data you use. These are some of the common issues researchers can face.

Self-reported information: Much of the information collected by longitudinal studies relies on participants’ own reports of their habits. This can be difficult to do accurately, and it is possible that people over- or under-report their own behaviour. This can be problematic if it is not random, for example, if people with unhealthy diets are more likely to under-report their consumption of junk food than others.

Limits on objective data collection: Studies cannot monitor their participants all the time, so information from diaries or activity monitors is often limited to a two-day period. It is possible that these two days are uncharacteristic for some participants, or that participants are conscious of the monitor or diary and behave differently while they are using it. Also, some data from activity monitors and time diaries can be very complex and challenging to use.

Short-term recall: If participants are asked to recall information, such as how much they drink, they will only be asked to remember back over a short period, for example the past week. This could also result in under- or over-reporting of health behaviours if the past week was unusual for some participants.

Influence of participation: Of course, some participants might be influenced by being part of a study to clean up their act. They may become more aware of lifestyle-related diseases, or just more aware of their own behaviour by taking part in the study. Read more about weaknesses of longitudinal data in the Introduction to longitudinal studies module.

Changes to measures over time: Ways of measuring health behaviours are always changing in order to improve the quality of data collected. However, this can pose a challenge if studies want to assess change over time by looking at comparable questions. You can read more about these challenges in the Study design module. It can also be the case that studies do not collect the same level of detail on certain health behaviours at every sweep. There could be various reasons for this, such as the specific focus of the sweep, or the funding available.

Sample size: While longitudinal studies benefit from large sample sizes, they are sometimes not large enough to look at uncommon behaviours or conditions/diseases. You can read more about this issue in the Study design module.

Find out more about the challenges of longitudinal data in the Introduction to longitudinal studies module.

CLOSER studies to consider

Hertfordshire Cohort Study

This study has collected detailed information on diet, alcohol consumption and other health behaviours among an ageing population. This data has been linked to the participants’ birth records, allowing for research into the role of early life circumstances in establishing healthy or unhealthy habits among older people.

MRC National Survey of Health and Development

The 1946 British birth cohort is the oldest national cohort study, and is invaluable for studying how lifetime health and health behaviours are related to ageing.

1958 National Child Development Study

This study has a breadth of data across health behaviours, rich social data and a focused biomedical sweep at age 44/45.

1970 British Cohort Study

This study contains a similar spread of information across different health behaviours as the 1946 and 1958 cohorts, and is carrying out a biomedical sweep at age 46. However, unlike the previous generations, this cohort have grown up with fast food and more sedentary lifestyles – making for interesting cross-study comparisons.

Avon Longitudinal Study of Parents and Children

This study has a strong biomedical focus, and includes very detailed information on diet and substance use, among other health behaviours, which can be compared to its rich data on participants’ health. It has also collected information from several generations: participants, their parents, grandparents and now their children.

Southampton Women’s Survey

Exploring women’s diets before and during pregnancy was one of the scientific objectives of this study. It includes rich information on diet from before women became pregnant and throughout their children’s lives.

Millennium Cohort Study

This study – the most recent of the national birth cohorts to start – includes information on health behaviours from very early life, including alcohol and cigarette use from age 11, detailed information on sleep at age 14, combined with information on parents’ health behaviours.

Understanding Society

This study has collected information on nutrition, smoking and physical activity at regular intervals from whole households of participants. It has also collected information on sleep and alcohol consumption at certain sweeps, and regularly conducts a full health assessment with a sub-sample of adults.