Confidence intervals are used to express the uncertainty associated with a population estimate. For example, imagine we wanted to use a survey to estimate the mean age of a population. A 95% confidence interval tells us that if we sampled this same population many times, and generated a CI each time, 95% of these CIs would contain the true mean age of the population.

Source: Stat Trek Statistics Dictionary.
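
As an illustration, a 95% confidence interval for a mean can be sketched in Python using the normal approximation; the ages below are invented for the example.

```python
import statistics

# Hypothetical survey responses: ages of 12 sampled respondents
ages = [34, 45, 29, 51, 38, 42, 60, 27, 33, 48, 55, 41]

n = len(ages)
mean_age = statistics.mean(ages)

# Standard error of the mean: sample standard deviation / sqrt(n)
se = statistics.stdev(ages) / n ** 0.5

# 95% confidence interval using the normal approximation (z = 1.96)
ci_low = mean_age - 1.96 * se
ci_high = mean_age + 1.96 * se
```

With a larger sample the standard error shrinks and the interval narrows.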

Aggregate or macro data are data about populations, groups, regions or countries. These are data that have been averaged, totalled or otherwise derived from the individual-level data found in survey datasets.

Sampling bias occurs when a sample statistic does not accurately reflect the true value of the parameter in the target population. Sample estimates might be too high or too low compared to the true population values. This may arise where the sample is not representative of the population.

Source: SAGE Research Methods.

A survey case is a unit for which values are captured. Typically, surveys use individuals, families/households or institutions/organisations as observation units (cases). In survey datasets, cases are usually stored in rows.

A variable that can take one value from a discrete and mutually exclusive set of responses. For example, a marital status variable can include the categories single (never married), married, civil partnership, divorced, widowed etc., and a respondent can be assigned only one value from this list.

Choropleth maps colour or shade different areas according to a range of values, e.g. population density or per-capita income.

The process of dividing a population into groups, then selecting a simple random sample of groups and sampling everyone in those groups. An example of this is geographical clustering, which is often efficiently applied in face-to-face surveys. Clustering of addresses limits travel for interviewers and so allows survey producers to sample more respondents for a given budget.

Sources: An Introduction to Statistical Methods and Data Analysis.
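
A minimal sketch of the two stages in Python, with a made-up population of households grouped into geographic clusters:

```python
import random

# Hypothetical population: households grouped into geographic clusters
clusters = {
    "North": ["h1", "h2", "h3"],
    "South": ["h4", "h5"],
    "East": ["h6", "h7", "h8"],
    "West": ["h9", "h10"],
}

random.seed(1)

# Stage 1: simple random sample of clusters
chosen = random.sample(list(clusters), k=2)

# Stage 2: everyone in each selected cluster enters the sample
sample = [household for c in chosen for household in clusters[c]]
```

An interviewer only needs to visit the two selected areas, which is where the cost saving comes from.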

A codebook describes the contents, structure, and layout of a data collection. Codebooks begin with basic front matter, including the study title, name of the principal investigator(s), table of contents, and an introduction describing the purpose and format of the codebook. Some codebooks also include methodological details, such as how weights were computed, and data collection instruments, while others, especially with larger or more complex data collections, leave those details for a separate user guide and/or data collection instrument.

A control variable is a variable that is included in an analysis in order to control or eliminate its influence on the variables of interest. For example, if we are looking at the relationship between having a university degree and smoking prevalence, we might need to consider the impact of age at the same time, since respondents from older generations are more likely to smoke than those from younger generations. If we control for age, we can see whether graduates are less likely to smoke than non-graduates once age has been accounted for.

Source: SAGE Research Methods.

Cross-sectional data are collected from a sample at a single point in time. It is often likened to taking a snapshot. Cross-sectional studies are quick and relatively simple, but they cannot provide information about the change in the same individuals or units over time. They can however be used to look at aggregate changes in the population as a whole.

Source: SAGE Research Methods.

A variable that is created from one or more existing variables through calculation or other data processing. For example, a respondent’s estimated annual income from savings and investments could be derived from several reported income variables.
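
For instance, a derived total could be computed from several reported components; the variable names and figures here are invented:

```python
# Hypothetical reported income components for one respondent (GBP per year)
record = {"savings_interest": 120.0, "dividends": 340.0, "bond_income": 90.0}

# Derived variable: total annual income from savings and investments
record["investment_income"] = sum(
    record[k] for k in ("savings_interest", "dividends", "bond_income")
)
```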

Statistics used to describe basic features of one or more variables in a study. They provide simple summaries about the data and do not test any hypotheses about the data.

Source: Research Methods and Knowledge Base.
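
A quick sketch of such summaries using Python's standard statistics module; the scores are illustrative:

```python
import statistics

# Illustrative values for a single variable
scores = [12, 15, 11, 19, 14, 13, 16]

# Simple descriptive summaries: no hypothesis is being tested here
summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
}
```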

The documentation for a dataset contains information about the dataset, such as who collected it, how and when, and how the data were processed to produce derived variables.

Equal interval simply divides the data range into equal-sized subranges. For example, if your data ranged from 0 to 300 and you specified three classes, the ranges would be: 0–100, 101–200, and 201–300.
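
The break points can be computed directly; a sketch:

```python
def equal_interval_breaks(low, high, n_classes):
    """Upper bounds of n_classes equal-sized subranges of [low, high]."""
    width = (high - low) / n_classes
    return [low + width * i for i in range(1, n_classes + 1)]

# Three classes over the 0-300 example give upper bounds 100, 200, 300
breaks = equal_interval_breaks(0, 300, 3)
```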

Data that contain information about the sampled units (e.g. respondents, households) measured on two or more occasions.

Source: SAGE Research Methods.

In a survey setting, microdata are individual-level data stored as cases, usually with one case per respondent. In a business microdata setting, data are stored at the firm level, with one case per firm. Cases are usually stored in rows.

Some variables have values that are recorded as missing. These values may be missing unintentionally (due to data entry errors) or may stem from the survey design (e.g. if only part of the sample were asked a particular question). Sometimes non-substantive responses (such as ‘don’t know’) are also recorded as missing values. To draw accurate inferences from the data, missing values need to be treated prior to analysis, e.g. excluded.
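
For example, one common treatment is to exclude missing codes before computing a statistic; here -9 is an invented ‘don’t know’ code and None marks a data-entry gap:

```python
import statistics

# Hypothetical responses: -9 codes 'don't know', None is a data-entry gap
responses = [23, 35, -9, 41, None, 29, -9, 52]

MISSING = {-9, None}
valid = [r for r in responses if r not in MISSING]

# Mean computed over substantive responses only
mean_valid = statistics.mean(valid)
```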

This method attempts to model mathematically or statistically data from two or more variables measured on the same observations. Multivariate statistical modelling often involves a dependent variable and multiple independent variables. Examples of multivariate analyses are factor analysis, latent class analysis, and multivariate regressions. In contrast, a univariate method involves the analysis of a single variable.

Sources: Centre for Statistical Methodology; STATA; Science Direct; UCLA Institute for Digital Research & Education.

The Natural breaks (Jenks) method groups similar values together, and breaks are assigned where there are relatively large distances between the classes. This reduces variance within classes and maximises variance between classes.
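
The idea can be sketched for a single break point: try every split of the sorted values and keep the one minimising the total within-class sum of squared deviations (the data are invented):

```python
def ssd(vals):
    """Sum of squared deviations from the class mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_single_break(values):
    """First value of the upper class under the best two-class split."""
    vals = sorted(values)
    split = min(range(1, len(vals)), key=lambda i: ssd(vals[:i]) + ssd(vals[i:]))
    return vals[split]

# The clear gap between 3 and 9 is where the natural break lands
break_value = best_single_break([1, 2, 2, 3, 9, 10, 11])
```

The full Jenks algorithm extends this search to several break points at once.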

This is a categorical variable whose values represent categories that do not have a natural order. The values assigned to the categories can be presented in any order. For example, there is no natural order to a set of categories describing the religion a person follows.

Non-substantive responses are responses that do not offer a quantifiable value. Examples include responses such as: ‘Unsure / undecided’, ‘Cannot recall’, ‘Have no idea’, ‘Don’t know’ (DK). Unlike substantive responses, they cannot be used in analysis.

Source: American Association for Public Opinion Research (PDF).

This is a categorical variable whose values represent categories that have a natural order. For example, a highest level of qualification variable might follow an order such as:

- higher degree
- first degree
- further education below degree
- GCSE or equivalent
- no qualification

An outlier is an extreme value that differs greatly from other values in a set of values.

Source: Stat Trek Statistics Dictionary.

A panel refers to a survey sample in which the same units or respondents are surveyed or interviewed on two or more occasions (waves).

Source: SAGE Research Methods.

In survey design, a population is an entire collection of observation units, for example all 'residents in England and Wales in 2011', about which researchers seek to draw inferences.

Statistics produced using a sample of cases (sample statistics), which are designed to produce an estimate about the characteristics of the population (population parameter).

Source: SAGE Research Methods.

Precision is a measure of the variation of a survey estimator for a population parameter.

It refers to the size of the deviations from a survey estimate (i.e. a survey statistic, such as a mean or percentage) that occur over repeated application of the same probability-based sampling procedures using the same sampling frame and sample size. Standard errors and confidence intervals are two commonly used measures of precision.

Source: SAGE Research Methods.

A sample based on random selection of elements. It should be possible for the sample designer to calculate the probability with which an element in the population is selected for inclusion in the sample.

PSPP is an open-source statistics package with a design and basic functionality similar to SPSS. Visit the website for more information.

Quantile classification arranges data so that each class contains the same number of features. This results in an equal distribution of shading across the map. It can, however, produce a misleading map, as similar features can fall in different classes, and widely different features in the same class.
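
Quantile break points can be computed with the standard library; here 12 invented values are split into four classes of three:

```python
import statistics

# Hypothetical area values, e.g. population densities
values = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

# Break points splitting the values into 4 classes of equal count
breaks = statistics.quantiles(values, n=4)
```

Note how the class widths differ (5.5 wide, then 9.5, then 12.5): equal counts, unequal ranges.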

A variable that stores responses given to a question in the survey in their original form.

A representative sample is one that replicates the characteristics of the population.

A person, or other entity, who responds to a survey.

A sample is a subset of a population.

The process of selecting and examining a portion (a sample) of a larger group of potential participants (a population) in order to produce inferences that apply to the broader group of participants.

Source: SAGE Research Methods.

SPSS is a commercial statistics package. Visit the website for more information.

Standard error measures the uncertainty associated with an estimate. The standard error of the mean is a measure of how representative a sample is of the population from which it was drawn. It measures the amount by which a sample statistic (such as a percentage) varies from the true population value.

Standard error is related to the standard deviation, which can be used to calculate it: for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size.

The standard error is also inversely proportional to the square root of the sample size; the larger the sample size, the smaller the standard error.
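
The relationship can be checked numerically; a sketch with invented measurements:

```python
import statistics

# Illustrative sample measurements
sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]

sd = statistics.stdev(sample)      # sample standard deviation
n = len(sample)
se = sd / n ** 0.5                 # standard error of the mean

# Quadrupling the sample size halves the standard error
se_quadrupled_n = sd / (4 * n) ** 0.5
```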

A statistical model is a theoretical construction of the relationship between explanatory variables and variables of interest, created to better understand that relationship.

They typically consist of a collection of probability distributions and are used to describe patterns of variability that data may display.

The statistical model is expressed as a function. For example, a researcher may model a linear relationship using the regression function below:

y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{i}x_{i}

In this model, y represents the outcome variable and x_{i} represents the corresponding predictor variables. The term b_{0} is the intercept of the model. Each b_{i} is a regression coefficient and represents the numerical relationship between the *i*th predictor variable and the outcome.
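
A minimal sketch of fitting the single-predictor case, y = b_{0} + b_{1}x_{1}, by ordinary least squares in plain Python; the data are invented:

```python
# Illustrative data for one predictor and one outcome
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = cov(x, y) / var(x); b0 = mean(y) - b1 * mean(x)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b0 = mean_y - b1 * mean_x
```

Models with several predictors are fitted by the same least-squares principle, usually via a statistics package.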

Statistical modelling is a major topic and outside the scope of this module. Readers who want to know more will find extensive accounts of statistical models, including linear regression and logistic regression, in statistical texts and online.

Sources: Science Direct; Magoosh Statistics Blog.

The type of probability sampling where researchers divide the population into non-overlapping groups (strata) and collect a simple random sample of participants from each stratum. In contrast, cluster sampling uses simple random sampling to select clusters and then samples everyone in those clusters.

Sources: An Introduction to Statistical Methods and Data Analysis; SAGE Research Methods
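
A sketch of stratified sampling in Python, with an invented population divided into age strata:

```python
import random

# Hypothetical population split into non-overlapping strata
strata = {
    "under_30": ["p1", "p2", "p3", "p4"],
    "30_to_59": ["p5", "p6", "p7", "p8", "p9", "p10"],
    "60_plus": ["p11", "p12", "p13", "p14"],
}

random.seed(7)

# Simple random sample of 2 participants from every stratum
sample = {name: random.sample(units, k=2) for name, units in strata.items()}
```

Every stratum is guaranteed representation, unlike in simple random sampling of the whole population.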

A structured interview follows a strict protocol using a set of defined questions administered in the same order to all interviewees. It allows for a quick collection of focused data, however there are limited opportunities for probing and further exploration of topics. The interviews are usually conducted face to face or over the phone.

Survey nonresponse can occur at both an item and unit level.

Item nonresponse occurs when a sample member responds to the survey, but fails to provide a valid response to a particular item (e.g. a question they refuse to answer).

Unit nonresponse occurs when eligible sample members either cannot be contacted, refuse to participate in the survey or do not provide sufficient information for their responses to be valid. Unit nonresponse can be a source of bias in survey estimates and reducing unit nonresponse is an important objective of good survey practice.

Source: SAGE Research Methods.

The unit which is being analysed. This is synonymous with the case.

Univariate analysis models data that consist of a single variable. Examples of univariate analyses include descriptive statistics (mean, standard deviation, kurtosis), goodness-of-fit tests, and the Student’s t-test.

Source: Science Direct.

A representation of a characteristic for one case. For a given variable, values may vary from one case to another. E.g. for the variable ‘sex’ the values may be ‘male’ or ‘female’.

A description of the values a variable can take on. Sometimes nominal values are coded as numbers and the label helps to describe what each of these numbers means. E.g. for the variable ‘sex’ the values may be:

- female
- male
- other

Variables are attributes that describe cases and can vary from one entity to another. In surveys, a variable is usually a characteristic that varies between cases.

Source: Stat Trek Statistics Dictionary.

Weighting is a statistical adjustment made to survey data to improve accuracy of survey estimates. Weighting can correct for unequal probabilities of selection and survey non-response.

Source: SAGE Research Methods.
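
For example, a weighted mean up-weights under-represented respondents; the values and weights here are invented:

```python
# Hypothetical respondent values with design weights
values = [20, 30, 40]
weights = [1.0, 2.0, 1.0]  # the second respondent counts double

weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
```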