Now that the dataset is loaded and initial preparation is complete, we can begin exploring the data.
Running the Stata command ‘describe’ gives a summary of the dataset, including the number of observations and a table of the variable names and labels.
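As a minimal sketch, assuming the prepared dataset is already loaded in memory, the command can be typed in the Command window or run from a do-file:

```stata
* List every variable in the dataset in memory, with its storage type and label
describe
```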
There are 4,497 observations and 8 variables. The ncdsid variable holds a unique identifier code for each study participant. The family background variables record whether the participant’s mother and father left education at the minimum age (n016nmed, n716dade) and their father’s social class (n1171). n622 is the sex of the study participant. The early life factors are their ‘general ability’ (n920) and body-mass index at age 11 (bmi11), and our outcome variable is body-mass index at age 42 (bmi42). Note that ‘CM’ in some of the variable labels stands for ‘cohort member’, i.e. a participant in the study.
We can use the ‘summarize’ command to learn more about the variables we will employ in our analyses.
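A sketch of the call, using the variable names listed above. The variable list is an assumption about which variables to include; running ‘summarize’ with no list would cover every variable in the dataset.

```stata
* Number of observations, mean, standard deviation, minimum and maximum
* for the outcome, early life, sex and family background variables
summarize bmi42 bmi11 n920 n622 n016nmed n716dade n1171
```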
As the ‘summarize’ output table shows, there are no missing data: each variable has 4,497 observations. Survey datasets will usually have at least some missing data, but we have already removed any study participants with missing data for the purposes of our analyses. As indicated by the minimum and maximum values in the output table, besides the identifier ncdsid the dataset has 3 continuous variables (bmi42, n920 and bmi11), 3 dichotomous variables (n622, n016nmed and n716dade) and 1 categorical variable (n1171).
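These points can be double-checked with two further commands. This is an optional sketch rather than part of the walkthrough: ‘misstable summarize’ reports any variables that contain missing values, and ‘tabulate’ shows the categories of a variable with their frequencies.

```stata
* Report variables with missing values (none are expected here,
* because incomplete cases were removed during preparation)
misstable summarize bmi42 bmi11 n920 n622 n016nmed n716dade n1171

* Frequency table for the categorical variable, father's social class
tabulate n1171
```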