BINARY OUTCOME, MULTIPLE PREDICTORS
(LOGISTIC REGRESSION)

Chapter incomplete as of 1/20/00

Contents: Background | Simple Logistic Regression | Multiple Logistic Regression

Background

Logistic Regression Model

Suppose that you have a binary outcome Y (e.g., DISEASE: yes or no) and binary predictor X (e.g., EXPOSURE: yes or no). In such circumstances, you could use a two-by-two table to estimate the relative risk (cohort studies only) or odds ratio (cohort or case-control studies) to quantify the association between X and Y. Now suppose you have binary outcome Y and continuous predictor X (e.g., AGE: years). The 2-by-2 table option is no longer viable. Problems like this call for logistic regression. Logistic regression also accommodates predictors that are binary or have multiple categorical levels, as well as problems with multiple independent variables, making it a versatile and powerful technique.

Let us start by considering a simple logistic regression model -- "simple" in the sense that there is only one independent variable X. The logistic regression model relates the expected value of the binary response Y to X (or to multiple predictors X1, X2, ...) through an S-shaped curve called the logistic function:

E(Y|x) = exp(ß0 + ß1x) / (1 + exp(ß0 + ß1x))

where E(Y|x) is the expected value of Y given x, "exp" denotes exponentiation of the natural logarithmic base e (i.e., 2.71828...), and ß0 and ß1 are regression coefficients. When the dependent variable is binary, with 0 indicating "failure" and 1 indicating "success," the mean response E(Y|x) is a probability. Denoting this probability by p, we have

p = exp(ß0 + ß1x) / (1 + exp(ß0 + ß1x))

We then apply the logit transformation to p, where the logit is the natural log of the odds:

logit = loge[p / (1 - p)]

For example, if the probability of an outcome is p = .1, then the logit = loge(.1 / .9) = -2.1972.

The logit transformation is elegant, for when it is applied to the logistic function above we get:

logit = ß0 + ß1x

thus "linearizing" the function, providing the equation in a form for regression analysis, where ß0 represents the intercept and ß1 represents the slope of the model.

Notice that the logit (log odds) is the dependent variable in this model. For example, if the estimate of ß1 were .25, this would suggest that the logit increases by an average of .25 units for each unit increase in X. We can also use the estimate of ß1 to determine the corresponding shift in the odds ratio (OR). For example, suppose that we have binary predictor X, with X = 1 when exposed and X = 0 when unexposed. Then,

loge(OR) = loge[(odds|X = 1) / (odds|X =0)]
= loge(odds|X=1) - loge(odds|X=0)
= logit(X=1) - logit(X=0)
= (ß0 + ß1) - ß0
= ß1

Thus the slope is just the loge of the odds ratio, and OR = exp(ß1).

Now suppose we have a continuous predictor variable X, with one group having X = x* and another group having X = x. The group with X = x will serve as the baseline for comparison. The log odds ratio associated with x* relative to x is given by:

loge(OR) = logit(X=x*) - logit(X=x) = (ß0 + ß1x*) - (ß0 + ß1x) = ß1(x* - x)

Therefore, the loge(OR) for each unit increase in X is ß1, and OR = exp(ß1). In addition, the intercept parameter ß0 represents the log odds of the outcome at baseline (X = 0).

Fitting the Model

Fitting the logistic function to a data set is not an easy matter. For simplicity, let us consider a simple logistic regression model (having one independent variable X) with n observations. The jth individual has level X = xj and outcome yj = 1 if positive for the outcome (diseased) and yj = 0 if disease free. We will assume that the logistic function holds in the population and will use a maximum likelihood method to estimate the logistic regression parameters ß0 and ß1. For the jth individual, the probability of the observed outcome is P(Y = yj | X = xj) = pj if yj = 1 and qj = 1 - pj if yj = 0. For the complete sample, the probability of all the observed outcomes, given the xj's in the data set, is

L(ß0, ß1) = [p1^y1 q1^(1-y1)][p2^y2 q2^(1-y2)] . . . [pn^yn qn^(1-yn)]

We wish to derive estimates of ß0 and ß1 that maximize the value of L. Such estimates, called maximum likelihood estimates, are computationally complex, usually involving the simultaneous solution of non-linear equations, but are fairly straightforward to obtain with the assistance of available computational algorithms.
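
As a rough illustration of what such an algorithm does, the Python sketch below (a hypothetical helper, not the routine Epi Info 2000 actually uses) writes the log of L(ß0, ß1) as a function of the data and hands it to a general-purpose optimizer:

import numpy as np
from scipy.optimize import minimize

def fit_simple_logistic(x, y):
    # Maximum likelihood fit of logit = b0 + b1*x for a 0/1 outcome y
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def neg_log_likelihood(beta):
        b0, b1 = beta
        eta = b0 + b1 * x                            # the logit for each observation
        # log L = sum[y*eta - log(1 + exp(eta))], written in a numerically stable form
        return -np.sum(y * eta - np.logaddexp(0.0, eta))

    result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
    return result.x                                  # maximum likelihood estimates of b0, b1

# Hypothetical usage: with the AGE and SMOKE columns loaded as arrays,
# b0, b1 = fit_simple_logistic(age, smoke)
# should reproduce the estimates reported by Epi Info 2000 below.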

Simple Logistic Regression

To illustrate logistic regression analysis, let us use the data set FEV.DBF. (The data set is in dBase III format so that it imports readily into Epi Info 2000.) The data, originally part of a larger study of pulmonary function in children and adolescents as reported by Rosner (1995, p. 40) and Tager et al. (1985), are documented in the code book below:

Filename: fev
Size:     654 records of 19 bytes each.

Name       Type    Len Description
---------- ------- --- ------------
ID         Real      5 Identification number (1, 2, . . . , 654)
AGE        Integer   2 Age in years (min = 3, max = 19)
FEV        Real      6 Forced expiratory volume (l/sec) (min = 0.791, max = 5.793)
HEIGHT     Real      4 Height in inches (min = 46, max = 74)
SEX        Integer   1 Sex (1 = male, 0 = female)
SMOKE      Integer   1 Smoking status (1 = current smoker, 0 = non-smoker)

Let us ascertain determinants of smoking. (SMOKE will be the outcome variable for our illustrative analyses.) To start, let us look at the association between SEX and SMOKE. Of course we could do a simple 2-by-2 table analysis (TABLES SEX SMOKE) and determine that 26 of the 336 males smoke (7.7%) while 39 of the 318 females smoke (12.3%), giving OR^male = 0.60. To run a logistic regression analysis in Epi Info 2000, click on Logistic Regression, fill in the dialogue box with the outcome variable (SMOKE), and highlight the variable SEX in the Other Variables box. Then click Statistics | Unconditional. (Conditional logistic regression would be used if there were matching in the sampling design.)

Output is:

MVAWIN - Multivariate Analysis for Windows  -  Fri Jan 21 12:59:42 2000

Term                  Cutpoint      Coeff.      S.E.     Z-Statistic    P-Value
----                  --------      ------      ----     -----------    -------
SEX                     SINGLE     -0.5108      0.2663     -1.9183      0.0551
CONSTANT                           -1.9676      0.1710    -11.5098      0.0000

Term                              Odds Ratio     Lower       Upper   
----                              ----------     -----       -----   
SEX                                  0.600       0.356       1.011

Log-Likelihood (cycle  1)   :   -211.7240
Log-Likelihood (cycle  4)   :   -209.8468

-2*Maximized Log-Likelihood :    419.6936

       TEST             Statistic   D.F.    P-Value                
       ----             ---------   ----    -------                
Score                      3.7390      1     0.0532
Likelihood Ratio           3.7545      1     0.0527



Notice that the coefficient associated with SEX is -0.5108, with a standard error of 0.2663. This is the maximum likelihood estimate of ß1. (Let b1 represent this estimate.) The term labeled CONSTANT represents the estimate of the intercept (b0). The exponent of the slope estimate is the odds ratio estimate, in this case OR^ = exp(-0.5108) = 0.60. Notice that this odds ratio statistic is identical to that obtained from the 2-by-2 analysis. A 95% confidence interval for the OR based on the logistic regression model is (0.36, 1.01).
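
The arithmetic linking the coefficient, its standard error, and the reported odds ratio and confidence limits can be checked directly. The short Python sketch below assumes the usual Wald-type interval, exp(b1 ± 1.96·SE), which reproduces the limits shown in the output; it also recomputes the crude odds ratio from the 2-by-2 counts given earlier:

import math

b1, se = -0.5108, 0.2663                  # coefficient and S.E. for SEX from the output
print(math.exp(b1))                       # 0.600, the odds ratio estimate
print(math.exp(b1 - 1.96 * se))           # 0.356, lower 95% confidence limit
print(math.exp(b1 + 1.96 * se))           # 1.011, upper 95% confidence limit

# Crude odds ratio from the 2-by-2 counts (26 of 336 males and 39 of 318 females smoke)
print((26 / 310) / (39 / 279))            # also about 0.60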

The output also displays log likelihoods for the first and last iterations of the maximum likelihood procedure. The initial log likelihood is computed for a model that contains only the constant term. The final log likelihood gives the maximum likelihood solution with the independent variable(s) included. The greater the difference between these two log-likelihood statistics, the greater the improvement in model fit as a result of including the independent variable. In this example, the difference = -211.7240 - (-209.8468) = -1.8772. Multiplying this result by -2 gives 3.7544, which is known as the likelihood ratio statistic. Under the null hypothesis that ß1 = 0, this statistic has a chi-square distribution with 1 degree of freedom for each parameter estimated beyond the constant term. The reported p value of 0.0527 indicates that the relationship between SEX and SMOKE is not quite significant at the alpha = .05 level.
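
The likelihood ratio calculation just described takes only a couple of lines to reproduce; the chi-square p value below comes from scipy's chi-square survival function with 1 degree of freedom, since one parameter was estimated beyond the constant:

from scipy.stats import chi2

ll_null, ll_full = -211.7240, -209.8468   # log likelihoods from the output above
lr_statistic = -2 * (ll_null - ll_full)   # 3.7544
p_value = chi2.sf(lr_statistic, df=1)     # about 0.0527
print(lr_statistic, p_value)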

The score statistic, in this case 3.7390, is based on the derivative of the log likelihood evaluated at b1 = 0, and can be used for a similar purpose (i.e., to test ß1 = 0 against a chi-square distribution).

Illustrative example with a continuous X: Now let us look at the association between a continuous variable, say AGE, and the binary outcome SMOKE. The logistic regression procedure is run again, this time selecting AGE as the independent variable. Output is:

MVAWIN - Multivariate Analysis for Windows  -  Fri Jan 21 13:04:14 2000

Term                  Cutpoint      Coeff.      S.E.     Z-Statistic    P-Value
----                  --------      ------      ----     -----------    -------
AGE                     SINGLE      0.4836      0.0551      8.7734      0.0000
CONSTANT                           -7.7439      0.7089    -10.9242      0.0000

Term                              Odds Ratio     Lower       Upper   
----                              ----------     -----       -----   
AGE                                  1.622       1.456       1.807

Log-Likelihood (cycle  1)   :   -211.7240
Log-Likelihood (cycle  6)   :   -159.2825

-2*Maximized Log-Likelihood :    318.5650

       TEST             Statistic   D.F.    P-Value                
       ----             ---------   ----    -------                
Score                    106.8740      1     0.0000
Likelihood Ratio         104.8831      1     0.0000

This shows an odds ratio estimate of 1.62 (95% confidence interval: 1.46, 1.81) for each additional year of age -- smoking is definitely age related in this population.
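
Because the earlier derivation showed that the log odds ratio for a difference of (x* - x) units is ß1(x* - x), the same output can be used to estimate the odds ratio for any age difference, not just one year. A brief Python sketch (the 5-year comparison is simply an illustrative choice, not part of the original analysis):

import math

b1, se = 0.4836, 0.0551                   # AGE coefficient and S.E. from the output
print(math.exp(b1))                       # 1.622, the OR per additional year
print(math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))    # 1.456, 1.807

k = 5                                     # an illustrative 5-year age difference
print(math.exp(k * b1))                   # about 11.2, the OR for a 5-year difference
print(math.exp(k * (b1 - 1.96 * se)), math.exp(k * (b1 + 1.96 * se)))   # its 95% CI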

END OF DRAFT