Binary Outcome, Independent Groups, Cohort Studies

Background
• Exposure Groups • 2-by-2 Table • Illustrative Data Set (TOXIC.REC)
Descriptive Methods
Inferential Statistics
• Estimation • Hypothesis Testing
Power and Sample Size Requirements
• Power • Sample Size Requirements
Exercises
References

Background

Exposure Groups

In the previous chapter we considered a binary outcome in the form of an incidence proportion or prevalence from a single group. This chapter compares incidence proportions or prevalences in two groups. One group is characterized by an "exposure" and the other group by the exposure's absence (smokers and non-smokers, for instance). The exposure can in fact represent any behavior (e.g., smoking), treatment (e.g., an intervention intended to prevent smoking), trait (e.g., a family history of smoking) or environmental exposure (e.g., passive exposure to tobacco smoke).

2-by-2 Table

Data of this form are often displayed in a 2-by-2 table. Epi Info sets up its tables as follows:

	Disease+	Disease-
Exposure +	a	b	n₁
Exposure -	c	d	n₂
	m₁	m₂	N

Users should be aware that two-by-two tables may be set up with exposure status in rows and disease status in columns (as above), or vice versa. This will not materially affect the interpretation of results, but it will alter the formulas. Moreover, textbooks differ in how they display the table. For example, Rosner (1995), Schlesselman (1982), Kahn & Sempos (1989), and Kelsey et al. (1996) show the exposure across table rows (as above). Other texts, such as Kleinbaum et al. (1982), Rothman (1986), Greenberg et al. (1996), and Gerstman (1998), have the disease variable as the row variable and exposure as the column variable.

In discussing these data, let p₁ represent the incidence proportion or prevalence of disease in group 1:

p₁ = a / n₁

Let p₂ represent the incidence proportion or prevalence of disease in group 2:

p₂ = c / n₂

The ratio of p₁ and p₂ is the relative risk (RR):

RR = p₁ / p₂

The relative risk is a measure of association between the exposure and disease, with values of 1 indicating no association.

The association between the exposure and disease can also be quantified in terms of a risk difference (RD):

RD = p₁ - p₂

or in terms of an incidence odds ratio (OR):

OR = ( p₁/[1 - p₁]) / (p₂/[1 - p₂])

Let us continue our practice of placing "hats" (^) over symbols to denote estimators. (Hats should actually be placed over the symbols, but because of my technical limitations on the Web I show them next to symbols.) Parameters will be rendered "hatless." For example, RR^ represents the relative risk estimator and RR represents the relative risk parameter.

Illustrative Data Set (TOXIC.REC)

To illustrate techniques in this chapter, let us consider a cohort study of toxicity in cancer patients undergoing bone marrow ablation with the drug cytarabine (Jolson et al., 1992). Data are stored in TOXIC.REC, with the exposure status of individuals contained in the variable GENERIC (1 = generic manufacturer of the drug, 2 = innovator manufacturer of the drug) and the outcome stored in the variable TOX (1 = yes, 2 = no). We wish to determine whether treatment with the drug produced by the generic manufacturer is associated with higher toxicity rates than treatment with the innovator product.

Descriptive Methods

When working with data in an Epi Info REC file, data are cross-tabulated in the ANALYSIS program with the TABLES command used as follows:

EPI6> TABLES <exposure> <disease>

where <exposure> and <disease> represent the names of the exposure and disease variables, respectively.

For example, to cross-tabulate the illustrative data, the following commands are issued:

EPI6> READ TOXIC.REC
EPI6> TABLES GENERIC TOX

Output is:

                           TOX
GENERIC    |          1          2        Total
-----------+-----------------------------+------
         1 |         11         14       | 25
         2 |          3         31       | 34
-----------+-----------------------------+------
     Total           14         45         59

Based on these data we calculate:

p^₁ = 11 / 25 = .440 (incidence of toxicity in group 1)
p^₂ = 3 / 34 = .088 (incidence of toxicity in group 2)

Most importantly,

RR^ = p^₁ / p^₂ = 0.440 / 0.088 = 5.0 (relative risk)

In addition:

OR^ = ( p^₁/[1 - p^₁]) / (p^₂/[1 - p^₂]) = (.440 / .560) / (.088 / .912) = 8.1 (disease odds ratio) and

RD^ = p^₁ - p^₂ = 0.440 - 0.088 = 0.352 (risk difference).

The relative risk estimate of 5.0 suggests that people exposed to the generic product had 5 times the risk of toxicity of people exposed to the innovator product.
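The descriptive calculations above can also be checked outside Epi Info. The following is a minimal Python sketch, not part of Epi Info; the function name two_by_two_measures is a hypothetical helper, and the cell labels a, b, c, d follow the 2-by-2 table layout shown earlier (exposure in rows, disease in columns). It assumes no zero cells in the unexposed row.

```python
def two_by_two_measures(a, b, c, d):
    """Risks and measures of association for a 2-by-2 table with
    exposure in rows (a, b = exposed; c, d = unexposed) and
    disease in columns (a, c = diseased; b, d = not diseased)."""
    n1, n2 = a + b, c + d        # row totals: exposed, unexposed
    p1, p2 = a / n1, c / n2      # incidence proportions in each group
    return {
        "p1": p1,
        "p2": p2,
        "RR": p1 / p2,                             # relative risk
        "RD": p1 - p2,                             # risk difference
        "OR": (p1 / (1 - p1)) / (p2 / (1 - p2)),   # incidence odds ratio
    }

# TOXIC.REC counts: 11 and 14 in the generic group, 3 and 31 in the innovator group
toxic = two_by_two_measures(11, 14, 3, 31)
for name, value in toxic.items():
    print(f"{name} = {value:.3f}")
```

Because the sketch carries unrounded proportions, its relative risk comes out slightly below the hand-calculated 5.0, in line with the 4.99 that Epi Info prints.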

Inferential Statistics

Inferential statistics are printed below the 2-by-2 table. Estimates for the odds ratio parameter and relative risk parameter are included. Below these are the hypothesis testing statistics.

Estimation

Estimates for the Relative Risk Parameter

The output showing estimates for the relative risk parameter is:

RISK RATIO(RR)(Outcome:TOX=1; Exposure:GENERIC=1) 4.99
95% confidence limits for RR 1.55 < RR < 16.03

The relative risk point estimate (RR^) = p^₁ / p^₂ = 0.440 / 0.088 = 4.99. This is the statistic of choice when working with cohort data, showing clearly that the incidence of toxicity associated with the generic product is 5 times higher than with the innovator product. The 95% confidence interval for the RR parameter is (1.55, 16.03).

To avoid giving a false impression of precision, risk ratio and odds ratio estimates should be reported to one-decimal place, and no more. For example, the RR estimate for the illustrative example should be reported as 5.0 (95% confidence interval for RR: 1.6, 16.0).
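The text does not say how Epi Info computes the confidence limits for the RR. As a sketch only, the following Python code uses the common log-scale (Taylor series) approximation, in which the standard error of ln(RR^) is taken as the square root of 1/a - 1/n₁ + 1/c - 1/n₂. The choice of method and the function name rr_confidence_interval are assumptions made for illustration.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def rr_confidence_interval(a, b, c, d, level=0.95):
    """Point estimate and approximate confidence limits for the
    relative risk, computed on the log scale (an assumed method)."""
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_log_rr = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper

rr, lower, upper = rr_confidence_interval(11, 14, 3, 31)
print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

For the illustrative data this approximation gives limits that agree with the 1.55 and 16.03 shown in the output above, which suggests (but does not prove) that Epi Info uses a similar calculation.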

Estimates for the Odds Ratio Parameter

Odds ratio estimates are also relevant -- although less often used -- representing the ratio of incidence odds for the outcome. The output showing estimates for the incidence odds ratio parameter is:

Odds ratio                                           8.12
Cornfield 95% confidence limits for OR               1.67 < OR < 44.69*
                                                     *May be inaccurate
Maximum likelihood estimate of OR (MLE)              7.81
Exact 95% confidence limits for MLE                  1.71 < OR < 50.40
Exact 95% Mid-P limits for MLE                       1.97 < OR < 39.64
Probability of MLE >= 7.81 if population OR = 1.0    0.00223872

The OR^ = (.440 / .560) / (.088 / .912) = 8.1. This estimate is also calculated by a maximum likelihood method (OR^ = 7.81), which is said to be more accurate when working with small samples. (The maximum likelihood method is based on a more complex algorithm that will not be covered here.) In most instances, the two methods will differ only slightly, and the regular method is adequate.

95% confidence intervals for the OR parameter are calculated with three methods:

Cornfield's method (Taylor series variance estimate)
Exact MLE (maximum likelihood estimation) method
Exact MLE method, Mid-P

For the current example, Cornfield's method may be inaccurate. (Notice the inaccuracy warning printed below the first confidence interval.) In such instances, it is prudent to report one of the exact MLE confidence intervals. The illustrative example has an exact Mid-P 95% confidence interval for the OR of (1.97, 39.64).

Working with Cross-Tabulated Data

Output from EpiTable is:

      Ill+     Ill-
    +-----------------+
Exp+| 11     | 14     |
    +--------+--------+
Exp-| 3      | 31     |
    +-----------------+

Measure of association and 95% confidence interval

Relative risk                    4.99     1.55, 16.03
Risk difference             35.18/100    13.51, 56.85
Attributable fraction           79.9%     35.5, 93.8
Population risk dif.         14.9/100      5.7, 24.1
Population attr. fraction       62.8%     27.9, 73.7
Odds ratio                       8.12     1.95, 33.73

Notice that in addition to the relative risk and odds ratio (and several other statistics) the risk difference is calculated. The risk difference = 44.0 / 100 - 8.8 / 100 = 35.2 / 100, indicating an excess of 35.2 cases per 100 exposures. The 95% confidence interval for the risk difference is (13.5 per 100, 56.9 per 100).
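The confidence interval for the risk difference can be approximated by hand with the usual large-sample (Wald) formula, RD^ ± z × sqrt(p^₁[1 - p^₁] / n₁ + p^₂[1 - p^₂] / n₂). The Python sketch below applies that formula; whether EpiTable uses exactly this method is an assumption, and rd_confidence_interval is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def rd_confidence_interval(a, b, c, d, level=0.95):
    """Risk difference with a large-sample (Wald) confidence interval."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    rd = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return rd, rd - z * se, rd + z * se

rd, lower, upper = rd_confidence_interval(11, 14, 3, 31)
# Express per 100 exposures, as in the EpiTable output
print(f"RD = {100 * rd:.1f} per 100 "
      f"(95% CI {100 * lower:.1f} to {100 * upper:.1f} per 100)")
```

For the illustrative data the result agrees with the EpiTable limits to within rounding.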

Hypothesis Testing

Testing the Relative Risk for Significance

We want to test whether an observed association is significant. The null and alternative hypotheses can be expressed in multiple, synonymous ways. Let us use the following form:

H₀: RR = 1
H₁: RR ≠ 1

Relationship Between Confidence Interval and Hypothesis Test

We can conduct the test at α = .05 with the 95% confidence interval for the RR. When the 95% confidence interval for the RR parameter includes the value of 1, the null hypothesis will be retained. When the 95% interval excludes this value, the null hypothesis will be rejected. For the illustrative data set, the 95% confidence interval for the RR is (1.6, 16.0). Because this interval excludes 1, the null hypothesis is rejected and the association is statistically significant.

Chi-Square Test

The most common approach to testing is the chi-square test.

Epi Info's TABLES command calculates three different chi-squared statistics. These are:

The uncorrected chi-squared statistic
The Mantel-Haenszel chi-squared statistic
The Yates (continuity) corrected chi-squared statistic

All these chi-square statistics are based on a comparison of the observed table counts to the expected table counts. Expected counts (E_i) represent hypothetical occurrence if there was no association between the exposure and disease, and are calculated:

E_i = (row total * column total / grand total)

For the illustrative example, expected counts are:

                           TOX
GENERIC    |          1          2        Total
-----------+-----------------------------+------
         1 |        5.93       19.07     | 25
         2 |        8.07       25.93     | 34
-----------+-----------------------------+------
     Total           14         45         59

The formula for the uncorrected chi-square statistic is:

χ²-stat (uncorrected) = Σ[(O_i - E_i)² / E_i]

where O_i represents the observed count in cell i {i: a, b, c, d} and E_i represents the expected count in cell i.

For the illustrative data set, χ²-stat = [(11 - 5.93)² / (5.93) + (14 - 19.07)² / (19.07) + (3 - 8.07)² / (8.07) + (31 - 25.93)² / (25.93)] = 9.86. (Epi Info carries unrounded expected counts and therefore reports 9.85.)

Under the null hypothesis, the chi-square test statistic for a 2-by-2 table has a chi-square sampling distribution with 1 degree of freedom.

Output for the illustrative set is:

                         Chi-Squares   P-values
                         -----------   --------
        Uncorrected:         9.85     0.00169835
        Mantel-Haenszel:     9.68     0.00185979
        Yates corrected:     8.00     0.00467202

One may ask which of the above chi-square statistics is best. Although each chi-square statistic has its advantages, it is unclear which is best (e.g., see Mantel, 1974). For the sake of consistency, let us report Yates's corrected chi-square statistic, since it is the least likely to result in a false rejection. Our illustrative example shows chi-square, Yates's (1, N = 59) = 8.00, p = .0047.
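The expected counts and the three chi-square statistics can be reproduced with a short Python sketch. The function names are hypothetical. The Mantel-Haenszel statistic is computed here as the uncorrected statistic multiplied by (N - 1) / N, and the Yates statistic subtracts 0.5 from each absolute difference |O_i - E_i| before squaring; these are the standard textbook forms, assumed rather than taken from the Epi Info documentation.

```python
from math import erfc, sqrt

def expected_counts(table):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistics(table):
    """Uncorrected, Mantel-Haenszel, and Yates chi-squares for a 2-by-2 table."""
    expected = expected_counts(table)
    n = sum(sum(row) for row in table)
    # Pair each observed count with its expected count
    cells = [(o, e) for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row)]
    uncorrected = sum((o - e) ** 2 / e for o, e in cells)
    mantel_haenszel = uncorrected * (n - 1) / n
    yates = sum(max(abs(o - e) - 0.5, 0) ** 2 / e for o, e in cells)
    return {"Uncorrected": uncorrected,
            "Mantel-Haenszel": mantel_haenszel,
            "Yates corrected": yates}

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return erfc(sqrt(chi_square / 2))

toxic = [[11, 14], [3, 31]]   # rows: GENERIC = 1, 2; columns: TOX = 1, 2
results = chi_square_statistics(toxic)
for name, stat in results.items():
    print(f"{name}: {stat:.2f}  p = {p_value_1df(stat):.8f}")
```

The p-value helper relies on the fact that a chi-square variate with 1 degree of freedom is the square of a standard normal variate, so no statistical library is needed.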

Assumptions

The above chi-square statistics assume:

sampling independence (within and between groups), and
expected frequencies greater than or equal to 5.

If the expected frequency assumption is violated, Epi Info issues the following warning and recommendation:

An expected value is less than 5; recommend Fisher exact results

Please heed this recommendation.

Fisher's Test

Fisher's exact test is based on summing exact (hypergeometric) probabilities for tables that are equally or more extreme than the observed results, assuming the null hypothesis is true and the table's margins are fixed. (This procedure is explained in Rosner, 1995, p. 376.) To illustrate Fisher's test, let us consider a study performed to confirm the association between a drug called Kayexelate and the occurrence of colonic necrosis in post-operative patients (Gerstman et al., 1992). In this study, we compare colonic necrosis rates in Kayexelate-exposed and unexposed patients. Data are stored in KX-NECRO.REC as variables KX (exposed to Kayexelate: Y/N) and NECRO (colonic necrosis: Y/N).

Data are "crunched" with the commands:

EPI6> READ KX-NECRO
EPI6> TABLES KX NECRO

Output is:

                  NECRO
KX         |     +     - | Total
-----------+-------------+------
         + |     2   115 |   117
         - |     0   862 |   862
-----------+-------------+------
     Total |     2   977 |   979

Results show that 2 of the 117 Kayexelate-exposed patients experienced colonic necrosis; 0 of the 862 non-exposed patients were so affected. Under the null hypothesis, we expect the following values:

	NECRO+	NECRO-
KX+:	0.24	116.76	117
KX-:	1.76	860.24	862
	2	977	979

Because two of the expected counts are less than 5, the chi-square tests should be avoided in favor of Fisher's test. Fisher's test results, reported by Epi Info, are:

Fisher exact: 1-tailed P-value: 0.0141750 <---
2-tailed P-value: 0.0141750 <---

Assuming alpha = .05, the null hypothesis is rejected.
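To make the logic of Fisher's test concrete, the Python sketch below enumerates every table that shares the observed margins and sums the hypergeometric probabilities of the relevant ones. Two conventions are assumed here and may differ from Epi Info's: the one-tailed P-value sums tables with at least as many exposed cases as observed, and the two-tailed P-value sums all tables whose probability is no larger than that of the observed table. The function name fisher_exact is a hypothetical helper.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """One- and two-tailed Fisher exact P-values for a 2-by-2 table
    (exposure in rows, disease in columns), with margins held fixed."""
    n1, n2 = a + b, c + d          # exposed and unexposed totals
    m1 = a + c                     # total number of cases
    n = n1 + n2

    def prob(x):
        """Hypergeometric probability that x of the m1 cases are exposed."""
        return comb(n1, x) * comb(n2, m1 - x) / comb(n, m1)

    smallest, largest = max(0, m1 - n2), min(n1, m1)
    p_observed = prob(a)
    # Upper tail: as many or more exposed cases than observed
    one_tailed = sum(prob(x) for x in range(a, largest + 1))
    # Two tails: every table no more probable than the observed one
    # (the tiny tolerance guards against floating-point ties)
    two_tailed = sum(prob(x) for x in range(smallest, largest + 1)
                     if prob(x) <= p_observed * (1 + 1e-9))
    return one_tailed, two_tailed

one_tailed, two_tailed = fisher_exact(2, 115, 0, 862)
print(f"1-tailed P = {one_tailed:.7f}, 2-tailed P = {two_tailed:.7f}")
```

With only two cases in total, the observed table is the most extreme one possible, which is why the one-tailed and two-tailed P-values coincide in this example.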

Power and Sample Size Requirements

Power

The power of a hypothesis test for cohort studies depends on the sizes of the two groups being compared, the expected proportions in the two groups, and the alpha level of the test. We can use EpiTable | Sample | Power calculation | Cohort Study to perform power analyses of this sort. This program prompts the user for (a) the number of subjects in the exposed group (n₁), (b) the ratio of unexposed to exposed subjects (n₂ / n₁), (c) the expected relative risk (a RR "worth detecting"; see comment, below), (d) the "attack rate" [expected proportion] among unexposed subjects (p₂), and (e) the alpha level of the test. Calculations are based on formulas in Fleiss (1981, pp. 44 - 45).

As an illustrative example, let us consider a study that has 100 exposed subjects, 100 unexposed subjects (so that allocation ratio = 1), an expected proportion of 10% in the unexposed group, an alpha level of .05, and an expected RR of 2. Based on these assumptions, EpiTable calculates a power of 42.4%. (This is inadequate, since power should be at least 80%.)
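A power calculation of this kind can be sketched in Python for the special case of equal group sizes. The sketch assumes a two-sided normal-approximation test with a continuity correction, where the correction is removed from the per-group sample size with the approximation n' = n - 2 / |p₁ - p₂|. This is an interpretation of the Fleiss approach, not EpiTable's actual code, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def cohort_power_equal_groups(n, p2, rr, alpha=0.05):
    """Approximate power to detect relative risk `rr` with n subjects
    per group and proportion p2 in the unexposed group (two-sided test)."""
    p1 = rr * p2                         # expected proportion in exposed group
    diff = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    # Assumed continuity-correction adjustment to the per-group size
    n_adj = max(n - 2 / diff, 0.0)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = ((diff * sqrt(n_adj) - z_alpha * sqrt(2 * p_bar * (1 - p_bar)))
              / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return NormalDist().cdf(z_beta)

power = cohort_power_equal_groups(n=100, p2=0.10, rr=2)
print(f"Power = {100 * power:.1f}%")
```

Under these assumptions the sketch returns a power of roughly 42% for the illustrative example, in line with the EpiTable figure.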

Comment: In determining power, the investigator must have in mind the relative risk worth detecting. In the words of Fleiss (1981, p. 34): "An investigator will often have some idea of the order of magnitude of proportions he or she is studying. This knowledge might come from previous research, from an accumulation of clinical experience, from small-scale pilot work, or from vital statistics reports. Given at least some information, the investigator can, using his or her imagination and expertise, come up with an estimate of a difference between two proportions that is scientifically or clinically important. Given no information, the investigator has no basis for designing the study intelligently and would be hard put to justify designing it at all." When large gaps in background knowledge exist, the researcher should pursue a pilot study before embarking on the full-scale investigation.

Sample Size

Since conducting a study with insufficient power is a waste of time, the investigator should determine the sample size requirements of a study before data are collected. This can be done with the program EpiTable | Sample | Sample Size | Cohort Study. For example, to achieve 80% power to detect a RR of 2 in a study with an allocation ratio of 1:1, an expected proportion in the unexposed group of 10%, and alpha = .05, EpiTable determines that 219 subjects are needed in each group. To achieve 80% power to detect a RR of 3, we need 72 subjects in each group. To achieve 90% power to detect a RR of 3, we need 92 subjects per group.
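The sample size figures above can be approximated with the continuity-corrected formula for comparing two proportions with equal group sizes, as given by Fleiss. The Python sketch below is an illustration under that assumption (two-sided test, 1:1 allocation); it is not EpiTable's code, and the function name cohort_sample_size is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def cohort_sample_size(p2, rr, power=0.80, alpha=0.05):
    """Subjects needed per group (1:1 allocation, two-sided test) to
    detect relative risk `rr` given proportion p2 in the unexposed group."""
    p1 = rr * p2
    diff = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sample size before the continuity correction
    n_unadj = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
               / diff ** 2)
    # Continuity-corrected sample size, rounded up to a whole subject
    n = n_unadj / 4 * (1 + sqrt(1 + 4 / (n_unadj * diff))) ** 2
    return ceil(n)

for rr, power in [(2, 0.80), (3, 0.80), (3, 0.90)]:
    n = cohort_sample_size(p2=0.10, rr=rr, power=power)
    print(f"RR = {rr}, power = {power:.0%}: {n} subjects per group")
```

Rounding up rather than to the nearest integer is a deliberate choice, since rounding down would leave the study slightly short of the target power.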

Suggestion: Try replicating the above calculations by way of practice.

Exercises

(1) EAR.REC: Otitis Media Clinical Trial (Rosner, 1990, p. 68, modified)

Data are from a clinical trial on the treatment of acute otitis media in children. Group 1 received a 14-day trial of cefaclor. Group 2 received a 14-day trial of amoxicillin. This information is contained in the variable called AB (1 = cefaclor, 2 = amoxicillin). A total of 278 infected-ears were treated, with clearance of infection represented in variable CLEAR (1 = yes, 2 = no).

(A) Calculate the clearance rates (p^₁ and p^₂) associated with each antibiotic. Report both relevant counts and percentages.
(B) Calculate the relative rate of clearance associated with cefaclor. Include a 95% confidence interval for RR.
(C) Perform a test of association. Report the null and alternative hypotheses, let alpha = .05, report hypothesis testing statistics, state your conclusion, and interpret your overall results.

(2) PRISON.REC: Human Immunodeficiency Virus Infection in a Women's Correctional Institution (Smith et al., 1991)

A study of HIV infection in women entering the New York State Prison system cross-classified 465 inmates with respect to HIV sero-positivity (variable HIV) and history of intravenous drug use (variable IVDU).

(A) Calculate the prevalence of HIV in each exposure group.
(B) Calculate the prevalence ratio, while including a 95% confidence interval for the parameter.
(C) Perform a test of the association.

(3) LABOR.REC: Induction of Labor and Meconium Staining

Induced labor (by administering pitocin and other hormones) in near-term pregnancies is a common obstetrical procedure. Meconium staining during childbirth is a sign of fetal distress. Use LABOR.REC to determine whether there is an association between INDUCE and MECON. In so doing,

(A) Report the incidence of meconium staining in each group.
(B) Calculate the relative risk of meconium staining associated with the induction of labor. Include a 95% confidence interval.
(C) Test the association for significance. (List the null and alternative hypotheses, set your alpha level, report test statistics, and state your conclusion.)
(D) Summarize the above descriptive and inferential findings.

(4) OSWEGO.REC: Food Poisoning in Oswego, New York (Centers for Disease Control, 1992)

Data from an outbreak of gastrointestinal illness following a church supper in upstate New York are reported in OSWEGO.REC. Variables in the data set are self-explanatory (use the VARIABLES command to see variable names). Based on these data, fill in the table below and determine the most likely source of the agent.

Food	Ate Food			Did Not Eat Food			Relative Risk	95% conf. int.	p*
	Ill	Total	%	Ill	Total	%
Baked Ham	29	46	63.0%	17	29	58.6%	1.1	0.7 - 1.6	.70
Spinach	___	___	___	___	___	___	___	___	___
Mashed Pot.	___	___	___	___	___	___	___	___	___
Cabbage Sal.	___	___	___	___	___	___	___	___	___
Jell-O	___	___	___	___	___	___	___	___	___
Rolls	___	___	___	___	___	___	___	___	___
Brown bread	___	___	___	___	___	___	___	___	___
Milk	___	___	___	___	___	___	___	___	___
Coffee	___	___	___	___	___	___	___	___	___
Water	___	___	___	___	___	___	___	___	___
Cakes	___	___	___	___	___	___	___	___	___
Van. ice cream	___	___	___	___	___	___	___	___	___
Choc. ice cream	___	___	___	___	___	___	___	___	___
Fruit salad	___	___	___	___	___	___	___	___	___

* uncorrected chi-square or Fisher's exact test, as appropriate.

(5) FOODBRNE: Foodborne Outbreak X

The instructor will provide you with the background and data for a foodborne disease outbreak. Computerize and analyze these data in a way similar to the above. Your table should look something like this:

Food	Ate Food			Did Not Eat Food			Relative Risk	95% conf. int.	p*
	Ill	Total	%	Ill	Total	%
Food1	___	___	___	___	___	___	___	___	___
Food2	___	___	___	___	___	___	___	___	___
etc.	___	___	___	___	___	___	___	___	___

Identify the most likely source of exposure.

(6) RESTENOS: Restenosis Following Coronary Atherectomy (Zhou et al., 1996)

Each year, cardiologists open many clogged arteries only to have these same arteries restenose following surgery. A study sponsored by the NIH / Heart, Lung and Blood Institute was performed to determine whether silent infection with a common virus (cytomegalovirus) was predictive of the regrowth of arterial plaque. In 21 of the 49 patients with serologic evidence of cytomegalovirus infection, regrowth of arterial plaque was noted. In contrast, 2 of the 26 patients without serologic evidence of cytomegalovirus had plaque regrowth.

(A) Create a 2-by-2 table with these data.
(B) Determine the relative risk of restenosis associated with cytomegalovirus infection. Include a 95% confidence interval and p value for this estimate.
(C) Do data support the theory that subclinical viral infections may play a role in arteriosclerosis?

(7) PHENFORM: Phenformin and Cardiovascular Death (Osborn, 1979, modified)

In a clinical trial of phenformin for the treatment of diabetes, 26 out of 204 patients treated with phenformin died from cardiovascular disease, whereas 2 of 64 control patients died of cardiovascular disease. Based on these data:

(A) Calculate cardiovascular death rates in each group.
(B) Put these data into a 2-by-2 table and use either EpiTable or STATCALC to calculate the relative risk of cardiovascular death associated with phenformin. Include a 95% confidence interval for RR.
(C) Perform a hypothesis test to determine whether the observed relative risk is significant.
(D) Briefly interpret your results.

(8) SIZE-COH: Cohort Power and Sample Size Exercises

(A) Assume: alpha = .05; power = .8; allocation ratio = 1:1, and background rate (p₂) of 25%. What size sample is needed to detect RR = 2? RR = 3? RR = 4?
(B) What is the power of a study looking for RR = 2, assuming n₁ = 50, n₂ = 100, p₂ = 5%, and alpha = .05? What if the true RR = 3? What if RR = 4?

(9) BI-HELM1.REC: Bicycle Helmet Use in Two Northern California Counties (Perales et al., 1994)

In 1991, 1491 bicyclists were hospitalized for head injuries in California. Forty percent of these injuries were in 0- to 12-year-olds. BI-HELM1 contains bicycle helmet use data for 1651 bicycle riders in two northern California counties: Santa Clara County and Contra Costa County. A data documentation table for the data set is shown below:

Variable	Type	Len	Description
SCHOOL	Real	8	1=Kennedy (Santa Clara County)
			2=Los Arboles (Santa Clara County)
			3=Cassell (Santa Clara County)
			4=Miner (Santa Clara County)
			5=Sakamoto (Santa Clara County)
			6=Toyon (Santa Clara County)
			7=Lietz (Santa Clara County)
			8=Sedgewick (Santa Clara County)
			9=Belshaw (Santa Clara County)
			10=Disco Bay (Contra Costa County)
			11=Fair Oaks (Contra Costa County)
			12=Grant (Contra Costa County)
			13=Walnut Acres (Contra Costa County)
			14=Standwood (Contra Costa County)
			15=Downer (Contra Costa County)
COUNTY	Integer	2	1 = Santa Clara
			2 = Contra Costa
HELMETUSE	Integer	1	Rider wearing helmet: 1 = yes / 2 = no
MATCHVAR	Integer	1	Matching variable based on percent of school population receiving reduced or free meals at school; a surrogate measure of neighborhood SES. School pairs (Santa Clara school / Contra Costa school) as follows:
			3: Miner / Fair Oaks
			4: Sedgewick / Strandwood
			5: Sakamoto / Walnut Acres
			6: Toyon / Disco Bay
			7: Lietz / Belshaw

Complete the following analyses:

(A) Calculate the helmet-use rates in Santa Clara County (p^₁) and Contra Costa County (p^₂). Report relevant counts and percentages.
(B) Calculate the rate ratio and a 95% confidence interval for RR.
(C) Test whether rates differ significantly. (List the null and alternative hypotheses; let alpha = .05; report the hypothesis testing statistic; state the conclusion to the test.)

(10) Directed Paraphrasing Question

You are asked by a colleague how much data is needed to study reproductive risk factors for breast cancer in women. In plain language, and in less than 5 minutes, paraphrase what you know about estimating the sample size requirements of cohort studies. Your aim is to direct your colleague toward collecting a reasonable amount of data.

(11) OC/MI

A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40 to 44 years of age. It is found that among 5000 current OC users at baseline, 13 women develop a myocardial infarction (MI) over a 3-year period, whereas among 10,000 non-users, 7 develop an MI over a 3-year period. Based on these findings,

(A) Display the data in 2-by-2 table form.
(B) Calculate incidence rates of MI in the OC users and non-users. Also calculate the point estimate and 95% confidence interval for the relative risk.
(C) Assess the statistical significance of the results (let alpha = .01; show all steps of the test). Create an expected table, and based on this expectation, justify the use of either Yates's chi-square test or Fisher's test.
(D) Summarize your results in plain language.

(12) OC/BRCA Sample Size

List the determinants of sample size requirements in cohort studies. Then, using this background knowledge, tell me how you would go about estimating the size of study needed to determine the relationship between OC use and breast cancer over a three-year period. Perform some sample calculations, making clear the basis of your assumptions.


References

Centers for Disease Control. [CDC] (1992). Oswego: an outbreak of gastrointestinal illness following a church supper. E. I. S. Course. Washington, D.C.: Association of Teachers of Preventive Medicine, 1015 15th Street, N.W., Suite 405, 20005.

Centers for Disease Control [CDC]. (1992). Epi Info 2000 Documentation. Unpublished (part of online help).

Fleiss J. L. (1981). Statistical Methods for Rates and Proportions (2nd ed.). New York: John Wiley & Sons.

Gerstman, B. B., Kirkman, R., & Platt, R. (1992). Intestinal necrosis associated with postoperative orally administered sodium polystyrene sulfonate in sorbitol. American Journal of Kidney Diseases, 20, 159-161.

Gerstman, B. B. (1998). Epidemiology Kept Simple: An Introduction to Classic and Modern Epidemiology. New York: John Wiley & Sons.

Goodman, S. N. (1993). p Values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485 - 496.

Greenberg, R. S., Daniels, S. R., Flanders, W. D., Eley, J. W., Boring, J.R. (1996). Medical Epidemiology. Norwalk, Connecticut: Appleton and Lange.

Jolson, H. M., Bosco, L., Bufton, M. G., Gerstman, B. B., Rinsler, S. S., and Williams, E. (1992). Cerebellar toxicity associated with high dose cytarabine therapy. Journal of the National Cancer Institute, 84, 500 - 505.

Kahn H. A., & Sempos C. T. (1989). Statistical Methods in Epidemiology. New York: Oxford University Press.

Kelsey, J. L., Whittemore, A. S., Evans, A. S., Thompson, W. D. (1996). Methods in Observational Epidemiology (2nd ed.) New York: Oxford University Press.

Kleinbaum, D. G., Kupper, L. L., Morgenstern, H. (1982). Epidemiologic Research. New York: Van Nostrand Reinhold.

Osborn, J. F. (1979). Statistical Exercises in Medical Research. New York: John Wiley & Sons.

Perales, D. & Gerstman, B. B. (1995, March). A bi-county comparative study of bicycle helmet knowledge and use by California elementary school children. The Ninth Annual California Conference on Childhood Injury Control, San Diego, CA.

Rosner, B. (1990). Fundamentals of Biostatistics (3rd ed.) Boston: PWS - Kent Publishing.

Rosner, B. (1995). Fundamentals of Biostatistics (4th ed.). Belmont, CA: Duxbury Press.

Rothman, K. J. (1986). Modern Epidemiology. Boston: Little, Brown and Company.

Schlesselman J. J. (1982). Case-Control Studies. New York: Oxford University Press.

Smith, P. F., Mikl, J., Truman, B. I., et al. (1991). HIV infection among women entering the N.Y. state correctional system. American Journal of Public Health, 81(Supplement), 35-40.

Zhou, Y. F., Leon, M. B., Waclawiw, M. A., Popma, J. J., Yu, Z. X., Finkel, T., & Epstein, S. E. (1996). Association between prior cytomegalovirus infection and the risk of restenosis after coronary atherectomy. New England Journal of Medicine, 335, 624-630.
