BIOSTAT Lab Notes  Last updated: 12/22/2008   

Lab 1: Sampling a Population and Entering Data 

1. Random numbers

My random numbers below were generated  Aug 2001 using www.random.org. Sampling was done with replacement.

Draw      Random number
first     80
second    485
third     15
fourth    43
fifth     419
sixth     531
seventh   351
eighth    327
ninth     219
tenth     546

2. Data / hand-written table. 

id age sex date hiv sbp1 sbp2
80 21 F 1/9/04 2 120 126
485 42 M 10/21/04 2 118 118
15 5 M 1/12/05 2 83 86
43 11 F 2/17/04 1 126 124
419 30 M 12/28/04 2 119 118
531 50 M 12/29/04 2 114 118
351 28 M 8/19/04 2 119 119
327 27 M 8/31/04 2 115 111
219 24 M 8/19/04 2 127 129
546 52 M 10/13/04 2 94 89

3. Variable names and types. 

4. Data entry. 

Lab 2: Stemplots and Frequency Tables

1. Stem-and-leaf plots 

a.  Single stem-values

|0|5
|1|1
|2|1478
|3|0
|4|2
|5|02
x10

b. Split stem-values   

|0|5
|1|1
|1|
|2|14
|2|78
|3|0
|3|
|4|2
|4|
|5|02
x10

c. Which of the above plots do you prefer? Plot (a). Why? It does a better job of showing the shape of the distribution. [Reasonable interpretations of your own data are acceptable.]

d. My distribution has a mound, no pronounced skew, and no apparent outliers. (Keep in mind that with small data sets it is difficult to make reliable statements about the shape of a distribution; this analysis is offered for practice purposes only.)

e. The mean looks to be in the high 20s or low 30s.

f. Data spread from 5 to 52. 

2. SPSS stemplot 

3.  Frequency Table

a. By hand 

Age (years)   Frequency Count   Relative Freq (%)   Cumulative Freq (%)
0-9                  1                 10                   10
10-19                1                 10                   20
20-29                4                 40                   60
30-39                1                 10                   70
40-49                1                 10                   80
50-59                2                 20                  100
Total               10                100                   --

b. Tallied with SPSS, with AGEGRP coded 0 = 0 - 9, 1 = 10 - 19, and so on.

 

          Code     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0            1         10.0        10.0              10.0
          1            1         10.0        10.0              20.0
          2            4         40.0        40.0              60.0
          3            1         10.0        10.0              70.0
          4            1         10.0        10.0              80.0
          5            2         20.0        20.0             100.0
          Total       10        100.0       100.0

c. Optional -- SPSS recode command

Lab 3: Summary Statistics and Boxplots

1. Sample mean 


a. n = 10; Sum(x) = 290; x̄ = (1/n) × Sum(x) = (1/10) × 290 = 29.0


b. Uses of the sample mean  

(i) Predicts a value selected at random from the sample 

(ii) Predicts a value selected at random from the population 

(iii) Predicts the population mean 

 

2. Variance and standard deviation 


a. Deviations table 

i x deviation sq deviation
1 21 -8 64
2 42 13 169
3 5 -24 576
4 11 -18 324
5 30 1 1
6 50 21 441
7 28 -1 1
8 27 -2 4
9 24 -5 25
10 52 23 529
Sums 290 0 2,134

 

 

b. SS = sum of last column = 2134  

 

Comment: Here is how this formula looks in horizontal fashion: (21 - 29)² + (42 - 29)² + (5 - 29)² + (11 - 29)² + (30 - 29)² + (50 - 29)² + (28 - 29)² + (27 - 29)² + (24 - 29)² + (52 - 29)²


c. s² = 2134 / (10 - 1) = 237.1111 years²


d. s = √(237.1111) = 15.3984 ≈ 15.4 years


e. 95% of the data will lie within two standard deviations of the mean when the distribution is Normal.

 

f. ...at least 75% of the values will lie within 2 standard deviations of the mean.
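
Optional check, not part of the lab manual: the mean, sum of squares, variance, and standard deviation in items 1 and 2 can be reproduced with a minimal Python sketch (the list name ages is arbitrary; any statistics package would do).

# Check of the Lab 3 hand calculations (sketch only).
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]   # AGE values from the Lab 1 table

n = len(ages)                              # n = 10
mean = sum(ages) / n                       # x-bar = 29.0
ss = sum((x - mean) ** 2 for x in ages)    # sum of squared deviations = 2134
variance = ss / (n - 1)                    # s^2 = 237.1111 years^2
sd = variance ** 0.5                       # s = 15.3984 years

print(n, mean, ss, round(variance, 4), round(sd, 4))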

3. Quartiles and boxplot 

Ordered Array:

     5     11     21     24     27  |  28     30     42     50     52
               low group            |            high group

 

Five point summary: 5, 21, 27.5, 42, 52 

IQR = 42 - 21 = 21

FU = 42 + (1.5)(21) = 73.5; hence, no upper outside values and upper inside value = 52. 

FL = 21 - (1.5)(21) = -10.5; hence, no lower outside values and the lower inside value = 5.

Boxplot 
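
Optional check, not part of the lab manual: a minimal Python sketch of the five-number summary, IQR, and fences above, assuming the lab's median-of-halves rule for the quartiles (SPSS and other software may use a slightly different quartile rule).

# Five-number summary, IQR, and outlier fences (sketch only).
data = sorted([21, 42, 5, 11, 30, 50, 28, 27, 24, 52])

def median(values):
    mid = len(values) // 2
    return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

n = len(data)
q1 = median(data[:n // 2])            # 21
q2 = median(data)                     # 27.5
q3 = median(data[(n + 1) // 2:])      # 42
iqr = q3 - q1                         # 21
fu = q3 + 1.5 * iqr                   # upper fence 73.5
fl = q1 - 1.5 * iqr                   # lower fence -10.5

print(data[0], q1, q2, q3, data[-1], iqr, fl, fu)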

4. SPSS 

Lab 4: The Binomial pmf  

1. Introduction to the binomial.  

n = 3 

p = 0.265   

 

2. Mean and standard deviation. 

a. μ = np = 3 × 0.265 = 0.795
b. σ = sqrt(npq) = sqrt(3 × 0.265 × 0.735) = 0.7644

 

3. Choose function [binomial coefficients]

a. 3C0 = 3! / [(0!)(3 - 0)!] = (3 × 2 × 1) / [(1)(3 × 2 × 1)] = 1
b. 3C1 = 3! / [(1!)(3 - 1)!] = (3 × 2 × 1) / [(1)(2 × 1)] = 3
c. 3C3 = 3! / [(3!)(3 - 3)!] = (3 × 2 × 1) / [(3 × 2 × 1)(1)] = 1


4. [Binomial probabilities].
a. q = 1 - .265 = .735
b. Pr(X = 0) = (3C0)(0.265^0)(0.735^(3-0)) = (1)(1)(0.3971) = 0.3971
c. Pr(X = 1) = (3C1)(0.265^1)(0.735^(3-1)) = (3)(0.265)(0.5402) = 0.4295
d. Pr(X = 2) = (3C2)(0.265^2)(0.735^(3-2)) = (3)(0.0702)(0.7350) = 0.1548
e. Pr(X = 3) = (3C3)(0.265^3)(0.735^(3-3)) = (1)(0.0186)(1) = 0.0186
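
Optional check, not part of the lab manual: a minimal Python sketch (assuming SciPy is installed) that reproduces the mean, standard deviation, choose-function values, and binomial probabilities in items 2-4.

# Binomial checks for n = 3, p = 0.265 (sketch only).
from math import comb, sqrt
from scipy.stats import binom

n, p = 3, 0.265
q = 1 - p

print(n * p, sqrt(n * p * q))                # mean 0.795, sd ~0.764
print(comb(3, 0), comb(3, 1), comb(3, 3))    # choose-function values: 1, 3, 1

for k in range(n + 1):
    # Pr(X = k) = nCk * p^k * q^(n-k); binom.pmf(k, n, p) gives the same value
    print(k, round(comb(n, k) * p**k * q**(n - k), 4), round(binom.pmf(k, n, p), 4))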

 

5. Area under the curve.

a. Shade bar corresponding to Pr(X = 0)

 

 


b. Shade bar corresponding to Pr(X = 3) 

This AUC = .0186 × 1 = .0186 


c. Pr(X = 0) + Pr(X = 3) = 0.3971 + .0186 = .4157 

 

Visually: (not shown)

6. Cumulative probabilities 
a. Pr(X ≤ 0) = 0.3971
b. Pr(X ≤ 1) = 0.3971 + 0.4295 = 0.8267
c. Pr(X ≤ 2) = 0.8267 + 0.1548 = 0.9815
d. Pr(X ≤ 3) = 0.9815 + 0.0186 = 1.0000
e. Shaded area corresponding to Pr(X ≤ 1) = (0.3971 × 1.0) + (0.4295 × 1.0) = 0.8267 

7. Right tail. 

a. Shade area corresponding to Pr(X > 1) 

b. Pr(X > 1) = Pr(X = 2) + Pr (X = 3) =  0.1548 + 0.0186 = 0.1733 

c. Pr(X > 1) = 1 - Pr(X ≤ 1) = 1 - 0.8267 = 0.1733 
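
Optional check, not part of the lab manual: the cumulative and right-tail probabilities in items 6 and 7, sketched in Python (assuming SciPy). Small differences from the hand answers reflect rounding.

# Cumulative and right-tail binomial probabilities (sketch only).
from scipy.stats import binom

n, p = 3, 0.265

print(binom.cdf(1, n, p))   # Pr(X <= 1), about 0.8265
print(binom.sf(1, n, p))    # Pr(X > 1) = 1 - Pr(X <= 1), about 0.1735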

 

8. Use of StaTable binomial probability calculator (optional/recommended).

Lab 5: The Normal Probability Distributions

1. Introduction

2. X ~ N(120, 20)

a. Label μ with 120

b. Label the μ ± σ markers with 100 and 140.

c. Label the μ ± 2σ markers with 80 and 160.

 

 

 

d. 95% 

 

3. Standard Normal random variable 

a.  z = (80 - 120) / 20 = -2
b.  z = (100 - 120) / 20 = -1
c.  z = (160 - 120) / 20 = 2

4. Normal probabilities

a. Note steps...
State: Pr(X ≤ 100) 

Standardize: z = (100 - 120) / 20 = -1

Sketch: not shown

Table B: Pr(Z ≤ -1) = .1587 


b. Pr(X > 100) = Pr(Z > -1) = 1 - Pr(Z ≤ -1) = 1 - .1587 = 0.8413
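
Optional check, not part of the lab manual: the Normal probabilities from item 4, sketched in Python (assuming SciPy).

# Pr(X <= 100) and Pr(X > 100) for X ~ N(120, 20) (sketch only).
from scipy.stats import norm

mu, sigma = 120, 20
z = (100 - mu) / sigma            # -1.0

print(norm.cdf(z))                # Pr(Z <= -1), about 0.1587
print(1 - norm.cdf(z))            # Pr(Z > -1), about 0.8413
print(norm.cdf(100, mu, sigma))   # same lower-tail probability without standardizing by hand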

Lab 6: Introduction to Inference 

1. Population & sample 

a. x̄ = 29.0 (unique to Dr. Gerstman's sample)
b. The x̄s differ because of random [sampling] variation.  
c. Ways x̄ and μ differ (p. 157)
(1) different sources - sample vs. population 
(2) x̄ is calculated from the sample; μ is not 
(3) x̄ is a random variable; μ is a constant
(4) different notation (obvious but important)

2. Experiment simulation 

As of September 2003, the distribution looked like this. (This distribution will change over time, as additional student means are added to the data.)

a. The mean of the SDM was 29.4 (as of September 2003; will change)
b. Standard deviation of the SDM was 4.05 (as of September 2003; will change)
c.  More Normal.

This table reviews the different distributions that students should keep in mind during the lab (not part of the lab manual, but a good review in case a student is confused).

Various distributions considered in this lab:

                           Population            Sample                 Simulated SDM
                           (population2.sav)     (LnameF10.sav)         (SampleMeans.sav)

Number of observations     N = 600               n = 10                 changes as data accrue

Mean                       μ = 29.5              x̄ varies from          changes, but should approach
                                                 sample to sample       29.5 as the study continues

Standard deviation         σ = 13.59             s varies from          changes, but should approach
                                                 sample to sample       4.3 as the study continues

 

3. SDM model, right tail  

a. 2.5%
b. The theory performed well. It predicted that between 2 and 3 sample means would exceed 38. This turned out to be true.


4. Confidence interval for μ, σ known 

a. No answer required; the question gives SE = σ / √n = 13.59 / √10 = 4.30.

b. x̄ ± (1.96)(SE) = 29.0 ± (1.96)(4.30) = 29.0 ± 8.4 = 20.6 to 37.4 

c. This particular CI did capture μ. 

d. 5% 
e. 29.0 ± (1.645)(4.30) = 29.0 ± 7.1 = 21.9 to 36.1  
f. 29.0 ± (2.576)(4.30) = 29.0 ± 11.1 = 17.9 to 40.1 

g. It increased. 
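
Optional check, not part of the lab manual: the 90%, 95%, and 99% intervals in parts b, e, and f, sketched in Python (assuming SciPy for the z quantiles).

# z confidence intervals for mu with sigma known (sketch only).
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 29.0, 13.59, 10
se = sigma / sqrt(n)                      # ~4.30

for conf in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)      # 1.645, 1.960, 2.576
    print(conf, round(xbar - z * se, 1), round(xbar + z * se, 1))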

 

5. Hypothesis test 

a. H0: μ = 32 versus Ha: μ ≠ 32 

b. zstat = (x̄ - μ0) / SE = (29.0 - 32) / 4.30 = -0.70
c. The curve should accurately show the location of the student's zstat with the tail of the distribution shaded. Dr. Gerstman's one-tailed P = Pr(Z < -0.70) = 0.2420 
d. Two-tailed P = 0.4840 
e. My test provides statistically nonsignificant evidence against the null hypothesis. (Comment: Retention of a false null hypothesis would result in a type II error. This is one of the key limitations of statistical hypothesis testing.) 
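
Optional check, not part of the lab manual: the z statistic and P-values from item 5, sketched in Python (assuming SciPy); exact values differ slightly from the table-based answers because of rounding.

# One-sample z test of H0: mu = 32 with sigma known (sketch only).
from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 29.0, 32, 13.59, 10
se = sigma / sqrt(n)
zstat = (xbar - mu0) / se                 # about -0.70

print(round(zstat, 2))
print(norm.cdf(zstat))                    # one-tailed P, about 0.24
print(2 * norm.cdf(-abs(zstat)))          # two-tailed P, about 0.485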

Lab 7: Inference About a Mean 

1. Student's pdf 

a. t9,.95 = 1.83; right tail = 0.05

b. t9,.975 = 2.26; right tail = 0.025  


c. t9,.995 = 3.25; right tail = 0.005

2. t Confidence Interval 

a. The standard deviation of AGE in GerstmanB10.sav is 15.398
b. SE = s / √n = 15.398 / √10 = 4.8693 
c. 95% confidence interval for μ = x̄ ± (t9,.975)(SE) = 29.00 ± (2.262)(4.869) = 29.00 ± 11.01 = (18.0, 40.0).
Comment: This confidence interval suggests the population mean age is between 18 and 40. 
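
Optional check, not part of the lab manual: the 95% t interval in part c, sketched in Python (assuming SciPy for the t quantile).

# 95% t confidence interval for mu (sketch only).
from math import sqrt
from scipy.stats import t

xbar, s, n = 29.00, 15.398, 10
se = s / sqrt(n)                         # ~4.869
tcrit = t.ppf(0.975, df=n - 1)           # t(9, .975) ~ 2.262

print(round(xbar - tcrit * se, 1), round(xbar + tcrit * se, 1))   # about (18.0, 40.0)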

3. SPSS.  

4. Paired samples

a. These are paired samples because each data point in sample one is uniquely paired to a data point in sample two (the two blood pressure measurements are taken on the same individual).

b. Data

SBP1   SBP2   DELTA
120    126     -6
118    118      0
 83     86     -3
126    124      2
 87     82      5
114    118     -4
119    119      0
115    111      4
127    129     -2
 94     89      5

c. Stemplot (including interpretation) 

-1|
-0|6
-0|432
 0|0024
 0|55
 1|
 x10 (DELTA) 

 

d. Descriptive statistics for the delta variable:  x̄ = 0.1        s = 3.87155        n = 10  

 

e. Hypothesis test 

H0: μd = 0  versus  Ha: μd ≠ 0 
tstat = (0.100 - 0) / 1.224 = 0.08; df = 10 - 1 = 9

One-sided P > 0.25; two-sided P > 0.50 (two-sided P ≈ .92 using Table G) 

The evidence against the null hypothesis is not statistically significant. 
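
Optional check, not part of the lab manual: the paired t test on DELTA, sketched in Python (assuming SciPy). The exact two-sided P is about 0.94, consistent with the table-based answer above.

# Paired t test on the SBP data (sketch only).
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t, ttest_rel

sbp1 = [120, 118, 83, 126, 87, 114, 119, 115, 127, 94]
sbp2 = [126, 118, 86, 124, 82, 118, 119, 111, 129, 89]
delta = [a - b for a, b in zip(sbp1, sbp2)]     # DELTA = SBP1 - SBP2

n = len(delta)
se = stdev(delta) / sqrt(n)                     # ~1.224
tstat = mean(delta) / se                        # ~0.08 with df = 9

print(round(tstat, 2), 2 * t.sf(abs(tstat), df=n - 1))   # two-sided P, about 0.94
print(ttest_rel(sbp1, sbp2))                    # the same test in one call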

5. SPSS

6. Power of the test (optional)
a. Power = Φ[-1.96 + (5 × √10 / 5)] = Φ(1.20) = 0.8849 = 88.5%
b. Power = Φ[-1.96 + (2 × √10 / 5)] = Φ(-0.70) = 0.2420 = 24.2%
c. Lowering the difference worth detecting decreased the power of the study (making it more difficult to detect the difference).
d. Power = Φ[-1.96 + (2 × √50 / 5)] = Φ(0.87) = .8078 = 80.8%
e. Increasing the sample size increased the power of the test.
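
Optional check, not part of the lab manual: the power calculations in parts a, b, and d, sketched in Python (assuming SciPy); Φ is the standard Normal cumulative distribution, and small differences from the hand answers are due to rounding of the z values.

# Power = Phi(-z_{1-alpha/2} + delta * sqrt(n) / sigma)  (sketch only).
from math import sqrt
from scipy.stats import norm

def power(delta, n, sigma, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + delta * sqrt(n) / sigma)

print(power(5, 10, 5))    # about 0.885 (lab: 0.8849)
print(power(2, 10, 5))    # about 0.24  (lab: 0.2420)
print(power(2, 50, 5))    # about 0.81  (lab: 0.8078)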

Lab 8: Comparing Two Means 

1. Data

a. Summary statistics for the AGE variable in GerstmanB10.sav:      n1 = 10        x̄1 = 29.00         s1 = 15.40
b. Summary statistics for group 2 (given in the problem):           n2 = 15        x̄2 = 42.47         s2 = 12.48

 

2. Estimation  

a. x̄1 - x̄2 = 29.00 - 42.47 = -13.47. Group 2 has the higher average age, by 13.47 years. 
b. SE = √(s1²/n1 + s2²/n2) = 5.840  
c. dfconserv = 9 
d. 95% confidence interval for μ1 - μ2 = (x̄1 - x̄2) ± (t9,.975)(SE) = -13.47 ± (2.262)(5.840) = -13.47 ± 13.21 = (-26.68 to -0.26). The SE and mean difference should match the SPSS output, but this CI will be a little wider (more conservative) than the SPSS confidence interval from the "equal variances not assumed" method. 
e. The interpretation should address the fact that the CI is trying to capture μ1 - μ2 with 95% confidence.

 

3. Hypothesis test 

a. H0: μ1 - μ2 = 0 versus Ha: μ1 - μ2 ≠ 0
b. tstat = (29.00 - 42.47) / 5.840 = -2.31 with dfconserv = 9 
c. P-value based on the t sampling distribution

Two-sided P ≈ 0.047 via Table G. The hand-calculated results will be close but not identical to the "equal variance not assumed" t test results produced by SPSS.
Interpretation: The data provide significant evidence against H0.
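
Optional check, not part of the lab manual: the confidence interval and t test from items 2 and 3, sketched in Python (assuming SciPy) with the conservative df = 9 used above.

# Two-sample comparison from summary statistics, conservative df (sketch only).
from math import sqrt
from scipy.stats import t

n1, xbar1, s1 = 10, 29.00, 15.40
n2, xbar2, s2 = 15, 42.47, 12.48

diff = xbar1 - xbar2                        # -13.47
se = sqrt(s1**2 / n1 + s2**2 / n2)          # ~5.84
df = min(n1, n2) - 1                        # conservative df = 9

tcrit = t.ppf(0.975, df)
print(round(diff - tcrit * se, 2), round(diff + tcrit * se, 2))   # about (-26.68, -0.26)

tstat = diff / se                           # about -2.31
print(round(tstat, 2), 2 * t.sf(abs(tstat), df))                  # two-sided P, about 0.046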

4. SPSS

5. Conditions. Graphical methods such as stemplots can be used to assess Normality, although with small samples it is very difficult to check. In practice, it is enough that the two distributions have similar shapes and no strong outliers. 

Lab 9: Inference About a Proportion 

1. Sample size requirement    Make sure this problem is done accurately.

a. n = (1.96)²(0.5)(0.5) / 0.1² ≈ 96.04; use 97

b. n = (1.96)²(0.5)(0.5) / 0.05² ≈ 384.16; use 385

c. n = (1.96)²(0.5)(0.5) / 0.025² ≈ 1536.64; use 1537
d. The sample size requirement quadruples each time the margin of error is halved.
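
Optional check, not part of the lab manual: the sample-size formula n = z² p(1-p) / m² with p = 0.5, sketched in Python (assuming SciPy for the z quantile).

# Sample size needed to estimate a proportion with margin of error m (sketch only).
from math import ceil
from scipy.stats import norm

def n_required(m, p=0.5, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)      # 1.96 for 95% confidence
    return ceil(z**2 * p * (1 - p) / m**2)

print(n_required(0.10), n_required(0.05), n_required(0.025))   # 97, 385, 1537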

 

2. New data set: population2.sav

 

3. SPSS tabulation Make sure this problem is properly done.

x = 66        n = 600

 

4. Estimation Make sure this problem is done in its entirety.

a. p̂ = 66 / 600 = .110, or 11.0% 

b. Confidence interval for p. Note that the plus-four estimate p~ = 68 / 604 = .1126;  q~ = .8874;  SEp~ = .01286. 

95% CI = .1126 ± (1.96)(.01286) = .1126 ± .0252 = (.0874 to .1379) 

Interpretation: Based on these data we can be 95% confident that the parameter (the proportion in the source population) is between 8.8% and 13.8%. 

 

5. One-sample z test of the proportion Make sure this problem is done in its entirety.

a. H0: p = 0.10 vs. Ha: p ≠ 0.10
b. SE = .01225; zstat = (0.11 - 0.1) / .01225 = 0.82
c. P-value = 0.412; the evidence against H0 is not significant.  

d. The 95% confidence interval captured the value hypothesized by the null (i.e., p0 = .10). Thus, the test of H0: p = 0.10 will not be significant at the alpha = .05 level (i.e., P > .05).
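
Optional check, not part of the lab manual: the plus-four interval from item 4 and the z test from item 5, sketched in Python (assuming SciPy).

# Plus-four 95% CI and one-sample z test for a proportion (sketch only).
from math import sqrt
from scipy.stats import norm

x, n = 66, 600

# plus-four ("p-tilde") confidence interval
ptilde = (x + 2) / (n + 4)                       # ~0.1126
se_ci = sqrt(ptilde * (1 - ptilde) / (n + 4))    # ~0.01286
print(round(ptilde - 1.96 * se_ci, 4), round(ptilde + 1.96 * se_ci, 4))   # about (.0874, .1378)

# z test of H0: p = 0.10; the SE uses the null value p0
p0, phat = 0.10, x / n
se_test = sqrt(p0 * (1 - p0) / n)                # ~0.01225
zstat = (phat - p0) / se_test                    # ~0.82
print(round(zstat, 2), 2 * norm.sf(abs(zstat)))  # two-sided P, about 0.41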


6. WinPepi. The WinPepi results are shown in the lab workbook.

 

Lab 10: Comparing Proportions 

1. Cross-tabulation 

2. Estimation 

a. Prevalences and prevalence ("risk") difference: 

Prevalence in females = p̂1 = 14 / 159 = .08805

Prevalence in males = p̂2 = 52 / 441 = .1179

p̂1 - p̂2 = .08805 - .1179 = -0.02985

Interpretation: The prevalence in females in the sample is about 3.0 percentage points lower than in males.

 

b. 95% confidence interval for p1 - p2  

 

Note: Using WinPepi, the 95% CI by Wilson's method is -0.078 to 0.031. 

 

3. Hypothesis test  

a. H0: p1 = p2 versus Ha: p1 ≠ p2 (equivalently, H0: p1 - p2 = 0 versus Ha: p1 - p2 ≠ 0)
b. zstat = 1.03
c. P = 0.302 


d. The evidence against H0 is not significant ("no significant difference in prevalence by gender"). Discuss the results in your own words.
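
Optional check, not part of the lab manual: the two-proportion z test with a pooled standard error, sketched in Python (assuming SciPy). The negative sign on zstat simply reflects subtracting the male proportion from the female proportion; its magnitude matches the 1.03 reported above.

# Two-proportion z test with pooled SE (sketch only).
from math import sqrt
from scipy.stats import norm

x1, n1 = 14, 159     # females
x2, n2 = 52, 441     # males

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                            # 66 / 600 = 0.11
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

zstat = (p1 - p2) / se
print(round(p1 - p2, 4), round(zstat, 2), 2 * norm.sf(abs(zstat)))   # about -0.0299, -1.03, P ~ 0.30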

4. WinPepi. (optional)