BINARY OUTCOME, MULTIPLE PREDICTORS
(LOGISTIC REGRESSION)

Chapter incomplete as of 1/20/00

Contents: Background | Simple Logistic Regression | Multiple Logistic Regression

Background

Logistic Regression Model

Suppose that you have a binary outcome Y (e.g., DISEASE: yes or no) and binary predictor X (e.g., EXPOSURE: yes or no). In such circumstances, you could use a two-by-two table to estimate the relative risk (cohort studies only) or odds ratio (cohort or case-control studies) to quantify the association between X and Y. Now suppose you have binary outcome Y and continuous predictor X (e.g., AGE: years). The 2-by-2 table option is no longer viable. Problems like this call for logistic regression. Logistic regression also accommodates predictors that are binary or have multiple categorical levels, as well as problems with multiple independent variables, making it a versatile and powerful technique.

Let us start by considering a simple logistic regression model -- "simple" in the sense that there is only one independent variable X. The logistic regression model relates the expected value of the binary response Y to X (or to multiple predictors X1, X2, ...) through an S-shaped curve called the logistic function:

E(Y|x) = exp(ß0 + ß1x) / (1 + exp(ß0 + ß1x))

where E(Y|x) is the expected value of Y given x, "exp" denotes exponentiation of the natural logarithmic base e (i.e., 2.71828...), and ß0 and ß1 are regression coefficients. When the dependent variable is binary, with 0 indicating "failure" and 1 indicating "success," the mean response E(Y|x) is a probability. Denoting this probability by p, we have

p = exp(ß0 + ß1x) / (1 + exp(ß0 + ß1x))

We then apply the logit transformation to p, where the logit is the natural log of the odds:

logit = loge[p / (1 - p)]

For example, if the probability of an outcome is p = .1, then the logit = loge(.1 / .9) = -2.1972.

The logit transformation is elegant, for when it is applied to the logistic function above we get:

logit = ß0 + ß1x

thus "linearizing" the function, providing the equation in a form for regression analysis, where ß0 represents the intercept and ß1 represents the slope of the model.

Notice that the logit (log odds) is the dependent variable in this model. For example, if the estimate of ß1 were .25, this would suggest that the logit increases by an average of .25 units for each unit increase in X. We can also use the estimate of ß1 to determine the corresponding shift in the odds ratio (OR). For example, suppose that we have binary predictor X, with X = 1 when exposed and X = 0 when unexposed. Then,

loge(OR) = loge[(odds|X = 1) / (odds|X =0)]
= loge(odds|X=1) - loge(odds|X=0)
= logit(X=1) - logit(X=0)
= (ß0 + ß1) - ß0
= ß1

Thus the slope is just the loge of the odds ratio, and OR = exp(ß1).

Now suppose we have a continuous predictor variable X, with one group having X = x* and another group having X = x. The group with X = x will serve as the baseline for comparison. The log odds ratio associated with x* relative to x is given by:

loge(OR) = logit(X=x*) - logit(X=x) = (ß0 + ß1x*) - (ß0 + ß1x) = ß1(x* - x)

Therefore, the loge(OR) for each unit increase in X is ß1, and OR = exp(ß1). In addition, the intercept parameter ß0 represents the log odds of the outcome at baseline (X = 0).

Fitting the Model

Fitting the logistic function to a data set is not an easy matter. For simplicity, let us consider a simple logistic regression model (having one independent variable X) with n observations. The jth individual has level X = xj and outcome yj = 1 if positive for the outcome (diseased) and yj = 0 if disease free. We will assume that the logistic function holds in the population and will use a maximum likelihood method to estimate the logistic regression parameters ß0 and ß1. For the jth individual, the probability of the observed outcome is P(Y = yj | X = xj) = pj if yj = 1 and qj = 1 - pj if yj = 0. For the complete sample, the probability of all the observed outcomes, given the xj's in the data set, is

L(ß0, ß1) = [p1^y1 q1^(1-y1)][p2^y2 q2^(1-y2)] . . . [pn^yn qn^(1-yn)]

We wish to derive estimates of ß0 and ß1 that maximize the value of L. Such estimates, called maximum likelihood estimates, are computationally complex, usually involving the simultaneous solution of non-linear equations, but are fairly straightforward to obtain with the assistance of available computational algorithms.
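
As a rough illustration of what such an algorithm does, the Python sketch below (a hypothetical helper, not the routine Epi Info 2000 actually uses) writes the log of L(ß0, ß1) as a function of the data and hands it to a general-purpose optimizer:

import numpy as np
from scipy.optimize import minimize

def fit_simple_logistic(x, y):
    # Maximum likelihood fit of logit = b0 + b1*x for a 0/1 outcome y
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def neg_log_likelihood(beta):
        b0, b1 = beta
        eta = b0 + b1 * x                            # the logit for each observation
        # log L = sum[y*eta - log(1 + exp(eta))], written in a numerically stable form
        return -np.sum(y * eta - np.logaddexp(0.0, eta))

    result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
    return result.x                                  # maximum likelihood estimates of b0, b1

# Hypothetical usage: with the AGE and SMOKE columns loaded as arrays,
# b0, b1 = fit_simple_logistic(age, smoke)
# should reproduce the estimates reported by Epi Info 2000 below.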

Simple Logistic Regression

To illustrate logistic regression analysis, let us use the data set FEV.DBF. (The data set is in dBase III format so that it imports readily into Epi Info 2000.) The data, originally part of a larger study of pulmonary function in children and adolescents as reported by Rosner (1995, p. 40) and Tager et al. (1985), are documented in the code book below:

Filename: fev
Size:     654 records of 19 bytes each.

Name       Type    Len Description
---------- ------- --- ------------
ID         Real      5 Identification number (1, 2, . . . , 654)
AGE        Integer   2 Age in years (min = 3, max = 19)
FEV        Real      6 Forced expiratory volume (l/sec) (min = 0.791, max = 5.793)
HEIGHT     Real      4 Height in inches (min = 46, max = 74)
SEX        Integer   1 Sex (1 = male, 0 = female)
SMOKE      Integer   1 Smoking status (1 = current smoker, 0 = non-smoker)

Let us ascertain determinants of smoking. (SMOKE will be the outcome variable for our illustrative analyses.) To start, let us look at the association between SEX and SMOKE. Of course we could do a simple 2-by-2 table analysis (TABLES SEX SMOKE) and determine that 26 of the 336 males smoke (7.7%) while 39 of the 318 females smoke (12.3%), giving OR^male = 0.60. To run a logistic regression analysis in Epi Info 2000, click on Logistic Regression, fill in the dialogue box with the outcome variable (SMOKE), and highlight the variable SEX in the Other Variables box. Then click Statistics | Unconditional. (Conditional logistic regression would be used if there were matching in the sampling design.)

Output is:

MVAWIN - Multivariate Analysis for Windows  -  Fri Jan 21 12:59:42 2000

Term                  Cutpoint      Coeff.      S.E.     Z-Statistic    P-Value
----                  --------      ------      ----     -----------    -------
SEX                     SINGLE     -0.5108      0.2663     -1.9183      0.0551
CONSTANT                           -1.9676      0.1710    -11.5098      0.0000

Term                              Odds Ratio     Lower       Upper   
----                              ----------     -----       -----   
SEX                                  0.600       0.356       1.011

Log-Likelihood (cycle  1)   :   -211.7240
Log-Likelihood (cycle  4)   :   -209.8468

-2*Maximized Log-Likelihood :    419.6936

       TEST             Statistic   D.F.    P-Value                
       ----             ---------   ----    -------                
Score                      3.7390      1     0.0532
Likelihood Ratio           3.7545      1     0.0527



Notice that the coefficient associated with SEX is -0.5108, with a standard error of 0.2663. This is the maximum likelihood estimate of ß1. (Let b1 represent this estimate.) The term labeled CONSTANT represents the estimate of the intercept (b0). The exponent of the slope estimate is the odds ratio estimate, in this case OR^ = exp(-0.5108) = 0.60. Notice that this odds ratio statistic is identical to that obtained from the 2-by-2 analysis. A 95% confidence interval for the OR based on the logistic regression model is (0.36, 1.01).
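
The arithmetic linking the coefficient, its standard error, and the reported odds ratio and confidence limits can be checked directly. The short Python sketch below assumes the usual Wald-type interval, exp(b1 ± 1.96·SE), which reproduces the limits shown in the output; it also recomputes the crude odds ratio from the 2-by-2 counts given earlier:

import math

b1, se = -0.5108, 0.2663                  # coefficient and S.E. for SEX from the output
print(math.exp(b1))                       # 0.600, the odds ratio estimate
print(math.exp(b1 - 1.96 * se))           # 0.356, lower 95% confidence limit
print(math.exp(b1 + 1.96 * se))           # 1.011, upper 95% confidence limit

# Crude odds ratio from the 2-by-2 counts (26 of 336 males and 39 of 318 females smoke)
print((26 / 310) / (39 / 279))            # also about 0.60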

The output also displays log likelihoods for the first and last iterations of the maximum likelihood procedure. The initial log likelihood is computed for a model that contains only the constant term. The final log likelihood gives the maximum likelihood solution with the independent variable(s) included. The greater the difference between these two log-likelihood statistics, the greater the improvement in model fit as a result of including the independent variable. In this example, the difference = -211.7240 - (-209.8468) = -1.8772. Multiplying this result by -2 gives 3.7544, which is known as the likelihood ratio statistic. Under the null hypothesis that ß1 = 0, this statistic has a chi-square distribution with 1 degree of freedom for each parameter estimated beyond the constant term. The reported p value of 0.0527 indicates that the relationship between SEX and SMOKE is not quite significant at the alpha = .05 level.
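
The likelihood ratio calculation just described takes only a couple of lines to reproduce; the chi-square p value below comes from scipy's chi-square survival function with 1 degree of freedom, since one parameter was estimated beyond the constant:

from scipy.stats import chi2

ll_null, ll_full = -211.7240, -209.8468   # log likelihoods from the output above
lr_statistic = -2 * (ll_null - ll_full)   # 3.7544
p_value = chi2.sf(lr_statistic, df=1)     # about 0.0527
print(lr_statistic, p_value)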

The score statistic, in this case 3.7390, is based on the derivative of the log likelihood evaluated at b1 = 0, and can be used for a similar purpose (i.e., to test ß1 = 0 against a chi-square distribution).

Illustrative example with a continuous X: Now let us look at the association between a continuous variable, say AGE, and the binary outcome SMOKE. The logistic regression procedure is run again, this time selecting AGE as the independent variable. Output is:

MVAWIN - Multivariate Analysis for Windows  -  Fri Jan 21 13:04:14 2000

Term                  Cutpoint      Coeff.      S.E.     Z-Statistic    P-Value
----                  --------      ------      ----     -----------    -------
AGE                     SINGLE      0.4836      0.0551      8.7734      0.0000
CONSTANT                           -7.7439      0.7089    -10.9242      0.0000

Term                              Odds Ratio     Lower       Upper   
----                              ----------     -----       -----   
AGE                                  1.622       1.456       1.807

Log-Likelihood (cycle  1)   :   -211.7240
Log-Likelihood (cycle  6)   :   -159.2825

-2*Maximized Log-Likelihood :    318.5650

       TEST             Statistic   D.F.    P-Value                
       ----             ---------   ----    -------                
Score                    106.8740      1     0.0000
Likelihood Ratio         104.8831      1     0.0000

This shows an odds ratio estimate of 1.62 (95% confidence interval: 1.46, 1.81) for each additional year of age -- smoking is definitely age related in this population.
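
Because the earlier derivation showed that the log odds ratio for a difference of (x* - x) units is ß1(x* - x), the same output can be used to estimate the odds ratio for any age difference, not just one year. A brief Python sketch (the 5-year comparison is simply an illustrative choice, not part of the original analysis):

import math

b1, se = 0.4836, 0.0551                   # AGE coefficient and S.E. from the output
print(math.exp(b1))                       # 1.622, the OR per additional year
print(math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))    # 1.456, 1.807

k = 5                                     # an illustrative 5-year age difference
print(math.exp(k * b1))                   # about 11.2, the OR for a 5-year difference
print(math.exp(k * (b1 - 1.96 * se)), math.exp(k * (b1 + 1.96 * se)))   # its 95% CI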

END OF DRAFT