# Regression

## Definitions

• Regression (analysis) involves identifying the relationship between the mean of a dependent variable and one or more ‘independent’ variables.
Sewell

• ‘Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.’
Encyclopædia Britannica

• ‘c. Statistics. The relationship between the mean value of a random variable and the corresponding values of one or more other variables; coefficient of regression = regression coefficient in sense 8 below.’
Oxford English Dictionary, 2002.

• ‘A word, introduced by Galton and deriving from the phrase 'regression towards the mean' that is often used as shorthand for linear regression or multiple regression models. In these models the mean of one variable Y is presumed to be dependent on one or more other variables (x1, x2,...). The variable Y is variously known as the response variable or outcome variable. The x-variables are known as predictor variables, explanatory variables, controlled variables or, potentially confusingly, independent variables. In the context of a factorial experiment the x-variables are the factors.’
Oxford Dictionary of Statistics, 2002.

• ‘Regression analysis is a statistical technique for investigating and modeling the relationship between variables.’
Montgomery, Peck and Vining (2001)

## Introduction

Regression is the art of inferring E(Y|x) (the expected value of Y, given x) from a number of realisations of the pair (x, y). One can never generalize beyond one’s data without making subjective assumptions (Hume 1739–40; Mitchell 1980; Schaffer 1994; Wolpert 1996). Although linearity is rather special (outside quantum mechanics no real system is truly linear), detecting linear relations has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood, well developed and efficient. A linear model may also be useful for modelling a non-linear process: for example, the simplest non-trivial model obtainable from the Taylor expansion of any infinitely-differentiable function is a linear model (the first-order truncation of the Taylor series). For these reasons it is often reasonable to assume, in the first instance, that a linear relationship exists between X and Y.

A system of linear equations is considered overdetermined if there are more equations than unknowns. In general, regression analysis is performed when there are enough (x, y) pairs for the system to be overdetermined. The problem, then, is to find an approximate solution to an overdetermined system of linear equations. To find the best approximate solution, one needs to define an error function and then minimise it.

The Gauss-Markov theorem states that in a linear model in which the errors have expectation zero, are uncorrelated and have equal variances, a best linear unbiased estimator (BLUE) of the coefficients is given by the least-squares estimator. Note that, being Bayesians at heart, we are careful to make all of our assumptions explicit. Linear (least squares) regression admits a closed-form solution, which is unique when the design matrix has full column rank; the trick is to compute it with a numerically stable method, such as QR decomposition.
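As a minimal sketch of the QR approach to least squares mentioned above, the following Python snippet fits a straight line to an overdetermined system using NumPy. The function name `fit_least_squares_qr` and the data values are hypothetical, chosen only for illustration, and the sketch assumes the design matrix has full column rank.

```python
import numpy as np


def fit_least_squares_qr(x, y):
    """Fit y ~ b0 + b1 * x by least squares using a thin QR decomposition.

    Hypothetical helper for illustration; assumes the design matrix has
    full column rank (here: x contains at least two distinct values).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) next to the predictor.
    A = np.column_stack([np.ones_like(x), x])
    # Thin QR: A = Q R, where Q has orthonormal columns and R is upper triangular.
    Q, R = np.linalg.qr(A)
    # Minimising ||A b - y||^2 reduces to solving the small system R b = Q^T y.
    # np.linalg.solve is a general solver; a dedicated back-substitution
    # routine would exploit the triangular structure of R.
    return np.linalg.solve(R, Q.T @ y)


# Made-up data: five observations, two unknowns, so the system is overdetermined.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

beta = fit_least_squares_qr(x, y)
print("intercept, slope:", beta)
```

For this made-up data the fit comes out at an intercept of approximately 1.10 and a slope of approximately 1.96. Working with Q and R avoids forming the product of the design matrix with its own transpose, which squares the condition number of the problem; that is the usual argument for preferring QR over solving the normal equations directly.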