BIOSTAT Lab Notes  Last updated: 12/22/2008   

Lab 1: Sampling a Population and Entering Data 

1. Random numbers

My random numbers below were generated  Aug 2001 using www.random.org. Sampling was done with replacement.

Draw      Random number
first     80
second    485
third     15
fourth    43
fifth     419
sixth     531
seventh   351
eighth    327
ninth     219
tenth     546

2. Data / hand-written table. 

id age sex date hiv sbp1 sbp2
80 21 F 1/9/04 2 120 126
485 42 M 10/21/04 2 118 118
15 5 M 1/12/05 2 83 86
43 11 F 2/17/04 1 126 124
419 30 M 12/28/04 2 119 118
531 50 M 12/29/04 2 114 118
351 28 M 8/19/04 2 119 119
327 27 M 8/31/04 2 115 111
219 24 M 8/19/04 2 127 129
546 52 M 10/13/04 2 94 89

3. Variable names and types. 

4. Data entry. 

Lab 2: Stemplots and Frequency Tables

1. Stem-and-leaf plots 

a.  Single stem-values

|0|5
|1|1
|2|1478
|3|0
|4|2
|5|02
x10

b. Split stem-values   

|0|5
|1|1
|1|
|2|14
|2|78
|3|0
|3|
|4|2
|4|
|5|02
x10

c. Which of the above plots do you prefer? Plot (a). Why? It does a better job of showing the shape of the distribution. [Reasonable interpretations of your own data are acceptable.]

d. My distribution has a mound, no pronounced skew, and no apparent outliers. (Keep in mind that with small data sets it is difficult to make reliable statements about the shape of a distribution; this analysis is offered for practice purposes only.)

e. The mean looks to be in the high 20s or low 30s.

f. Data spread from 5 to 52. 

2. SPSS stemplot 

3.  Frequency Table

a. By hand 

Age (years)   Frequency Count   Relative Freq (%)   Cumulative Freq (%)
0-9                  1                 10                   10
10-19                1                 10                   20
20-29                4                 40                   60
30-39                1                 10                   70
40-49                1                 10                   80
50-59                2                 20                  100
Total               10                100                   --

b. Tallied with SPSS, with AGEGRP coded 0 = 0 - 9, 1 = 10 - 19, and so on.

 

          Code     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0            1         10.0        10.0              10.0
          1            1         10.0        10.0              20.0
          2            4         40.0        40.0              60.0
          3            1         10.0        10.0              70.0
          4            1         10.0        10.0              80.0
          5            2         20.0        20.0             100.0
          Total       10        100.0       100.0

c. Optional -- SPSS recode command

Lab 3: Summary Statistics and Boxplots

1. Sample mean 


a. n = 10; Sum(x) = 290; x̄ = (1/n) × Sum(x) = (1/10) × 290 = 29.0


b. Uses of the sample mean  

(i) Predicts a value selected at random from the sample 

(ii) Predicts a value selected at random from the population 

(iii) Predicts the population mean 

 

2. Variance and standard deviation 


a. Deviations table 

i x deviation sq deviation
1 21 -8 64
2 42 13 169
3 5 -24 576
4 11 -18 324
5 30 1 1
6 50 21 441
7 28 -1 1
8 27 -2 4
9 24 -5 25
10 52 23 529
Sums 290 0 2,134

 

 

b. SS = sum of last column = 2134  

 

Comment: Here is how this formula looks in horizontal fashion: (21 - 29)² + (42 - 29)² + (5 - 29)² + (11 - 29)² + (30 - 29)² + (50 - 29)² + (28 - 29)² + (27 - 29)² + (24 - 29)² + (52 - 29)²


c. s² = 2134 / (10 - 1) = 237.1111 years²


d. s = √(237.1111) = 15.3984 ≈ 15.4 years


e. 95% of the data will lie within two standard deviations of the mean when the distribution is Normal.

 

f. ...at least 75% of the values will lie within 2 standard deviations of the mean.
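
Optional check, not part of the lab manual: the mean, sum of squares, variance, and standard deviation in items 1 and 2 can be reproduced with a minimal Python sketch (the list name ages is arbitrary; any statistics package would do).

# Check of the Lab 3 hand calculations (sketch only).
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]   # AGE values from the Lab 1 table

n = len(ages)                              # n = 10
mean = sum(ages) / n                       # x-bar = 29.0
ss = sum((x - mean) ** 2 for x in ages)    # sum of squared deviations = 2134
variance = ss / (n - 1)                    # s^2 = 237.1111 years^2
sd = variance ** 0.5                       # s = 15.3984 years

print(n, mean, ss, round(variance, 4), round(sd, 4))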

3. Quartiles and boxplot 

Ordered Array:

     5     11     21     24     27  |  28     30     42     50     52
               low group            |            high group

 

Five point summary: 5, 21, 27.5, 42, 52 

IQR = 42 - 21 = 21

FU = 42 + (1.5)(21) = 73.5; hence, no upper outside values and upper inside value = 52. 

FL = 21 - (1.5)(21) = -10.5; hence, no lower outside values and the lower inside value = 5.

Boxplot 
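
Optional check, not part of the lab manual: a minimal Python sketch of the five-number summary, IQR, and fences above, assuming the lab's median-of-halves rule for the quartiles (SPSS and other software may use a slightly different quartile rule).

# Five-number summary, IQR, and outlier fences (sketch only).
data = sorted([21, 42, 5, 11, 30, 50, 28, 27, 24, 52])

def median(values):
    mid = len(values) // 2
    return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

n = len(data)
q1 = median(data[:n // 2])            # 21
q2 = median(data)                     # 27.5
q3 = median(data[(n + 1) // 2:])      # 42
iqr = q3 - q1                         # 21
fu = q3 + 1.5 * iqr                   # upper fence 73.5
fl = q1 - 1.5 * iqr                   # lower fence -10.5

print(data[0], q1, q2, q3, data[-1], iqr, fl, fu)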

4. SPSS 

Lab 4: The Binomial pmf  

1. Introduction to the binomial.  

n = 3 

p = 0.265   

 

2. Mean and standard deviation. 

a. μ = np = 3 × 0.265 = 0.795
b. σ = sqrt(npq) = sqrt(3 × 0.265 × 0.735) = 0.7644

 

3. Choose function [binomial coefficients]

a. 3C0 = 3! / [(0!)(3 - 0)!] = (3 × 2 × 1) / [(1)(3 × 2 × 1)] = 1
b. 3C1 = 3! / [(1!)(3 - 1)!] = (3 × 2 × 1) / [(1)(2 × 1)] = 3
c. 3C3 = 3! / [(3!)(3 - 3)!] = (3 × 2 × 1) / [(3 × 2 × 1)(1)] = 1


4. [Binomial probabilities].
a. q = 1 - .265 = .735
b. Pr(X = 0) = (3C0)(0.265^0)(0.735^(3-0)) = (1)(1)(0.3971) = 0.3971
c. Pr(X = 1) = (3C1)(0.265^1)(0.735^(3-1)) = (3)(0.265)(0.5402) = 0.4295
d. Pr(X = 2) = (3C2)(0.265^2)(0.735^(3-2)) = (3)(0.0702)(0.7350) = 0.1548
e. Pr(X = 3) = (3C3)(0.265^3)(0.735^(3-3)) = (1)(0.0186)(1) = 0.0186
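
Optional check, not part of the lab manual: a minimal Python sketch (assuming SciPy is installed) that reproduces the mean, standard deviation, choose-function values, and binomial probabilities in items 2-4.

# Binomial checks for n = 3, p = 0.265 (sketch only).
from math import comb, sqrt
from scipy.stats import binom

n, p = 3, 0.265
q = 1 - p

print(n * p, sqrt(n * p * q))                # mean 0.795, sd ~0.764
print(comb(3, 0), comb(3, 1), comb(3, 3))    # choose-function values: 1, 3, 1

for k in range(n + 1):
    # Pr(X = k) = nCk * p^k * q^(n-k); binom.pmf(k, n, p) gives the same value
    print(k, round(comb(n, k) * p**k * q**(n - k), 4), round(binom.pmf(k, n, p), 4))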

 

5. Area under the curve.

a. Shade bar corresponding to Pr(X = 0)

 

 


b. Shade bar corresponding to Pr(X = 3) 

This AUC = .0186 × 1 = .0186 


c. Pr(X = 0) + Pr(X = 3) = 0.3971 + .0186 = .4157 

 

Visually: (not shown)

6. Cumulative probabilities 
a. Pr(X ≤ 0) = 0.3971
b. Pr(X ≤ 1) = 0.3971 + 0.4295 = 0.8267
c. Pr(X ≤ 2) = 0.8267 + 0.1548 = 0.9815
d. Pr(X ≤ 3) = 0.9815 + 0.0186 = 1.0000
e. Shaded area corresponding to Pr(X ≤ 1) = (0.3971 × 1.0) + (0.4295 × 1.0) = 0.8267 

7. Right tail. 

a. Shade area corresponding to Pr(X > 1) 

b. Pr(X > 1) = Pr(X = 2) + Pr (X = 3) =  0.1548 + 0.0186 = 0.1733 

c. Pr(X > 1) = 1 - Pr(X ≤ 1) = 1 - 0.8267 = 0.1733 
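
Optional check, not part of the lab manual: the cumulative and right-tail probabilities in items 6 and 7, sketched in Python (assuming SciPy). Small differences from the hand answers reflect rounding.

# Cumulative and right-tail binomial probabilities (sketch only).
from scipy.stats import binom

n, p = 3, 0.265

print(binom.cdf(1, n, p))   # Pr(X <= 1), about 0.8265
print(binom.sf(1, n, p))    # Pr(X > 1) = 1 - Pr(X <= 1), about 0.1735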

 

8. Use of StaTable binomial probability calculator (optional/recommended).

Lab 5: The Normal Probability Distributions

1. Introduction

2. X ~ N(120, 20)

a. Label μ with 120

b. Label the μ ± σ markers with 100 and 140.

c. Label the μ ± 2σ markers with 80 and 160.

 

 

 

d. 95% 

 

3. Standard Normal random variable 

a.  z = (80 - 120) / 20 = -2
b.  z = (100 - 120) / 20 = -1
c.  z = (160 - 120) / 20 = 2

4. Normal probabilities

a. Note steps...
State: Pr(X ≤ 100) 

Standardize: z = (100 - 120) / 20 = -1

Sketch: not shown

Table B: Pr(Z ≤ -1) = .1587 


b. Pr(X > 100) = Pr(Z > -1) = 1 - Pr(Z ≤ -1) = 1 - .1587 = 0.8413
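
Optional check, not part of the lab manual: the Normal probabilities from item 4, sketched in Python (assuming SciPy).

# Pr(X <= 100) and Pr(X > 100) for X ~ N(120, 20) (sketch only).
from scipy.stats import norm

mu, sigma = 120, 20
z = (100 - mu) / sigma            # -1.0

print(norm.cdf(z))                # Pr(Z <= -1), about 0.1587
print(1 - norm.cdf(z))            # Pr(Z > -1), about 0.8413
print(norm.cdf(100, mu, sigma))   # same lower-tail probability without standardizing by hand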

Lab 6: Introduction to Inference 

1. Population & sample 

a. x̄ = 29.0 (unique to Dr. Gerstman's sample)
b. The x̄s differ because of random [sampling] variation.  
c. Ways x̄ and μ differ (p. 157)
(1) different sources - sample vs. population 
(2) x̄ is calculated from the sample; μ is not 
(3) x̄ is a random variable; μ is a constant
(4) different notation (obvious but important)

2. Experiment simulation 

As of September 2003, the distribution looked like this. (This distribution will change over time, as additional student means are added to the data.)

a. The mean of the SDM was 29.4 (as of September 2003; will change)
b. Standard deviation of the SDM was 4.05 (as of September 2003; will change)
c.  More Normal.

This table reviews the different distributions that students should keep in mind during the lab (not part of the lab manual, but a good review in case a student is confused).

Various distributions considered in this lab:

                           Population            Sample                 Simulated SDM
                           (population2.sav)     (LnameF10.sav)         (SampleMeans.sav)

Number of observations     N = 600               n = 10                 changes as data accrue

Mean                       μ = 29.5              x̄ varies from          changes, but should approach
                                                 sample to sample       29.5 as the study continues

Standard deviation         σ = 13.59             s varies from          changes, but should approach
                                                 sample to sample       4.3 as the study continues

 

3. SDM model, right tail  

a. 2.5%
b. The theory performed well. It predicted that between 2 and 3 sample means would exceed 38. This turned out to be true.


4. Confidence interval for μ, σ known 

a. No answer required; the question gives SE = σ / √n = 13.59 / √10 = 4.30.

b. x̄ ± (1.96)(SE) = 29.0 ± (1.96)(4.30) = 29.0 ± 8.4 = 20.6 to 37.4 

c. This particular CI did capture μ. 

d. 5% 
e. 29.0 ± (1.645)(4.30) = 29.0 ± 7.1 = 21.9 to 36.1  
f. 29.0 ± (2.576)(4.30) = 29.0 ± 11.1 = 17.9 to 40.1 

g. It increased. 
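
Optional check, not part of the lab manual: the 90%, 95%, and 99% intervals in parts b, e, and f, sketched in Python (assuming SciPy for the z quantiles).

# z confidence intervals for mu with sigma known (sketch only).
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 29.0, 13.59, 10
se = sigma / sqrt(n)                      # ~4.30

for conf in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)      # 1.645, 1.960, 2.576
    print(conf, round(xbar - z * se, 1), round(xbar + z * se, 1))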

 

5. Hypothesis test 

a. H0: μ = 32 versus Ha: μ ≠ 32 

b. zstat = (x̄ - μ0) / SE = (29.0 - 32) / 4.30 = -0.70
c. The curve should accurately show the location of the student's zstat with the tail of the distribution shaded. Dr. Gerstman's one-tailed P = Pr(Z < -0.70) = 0.2420 
d. Two-tailed P = 0.4840 
e. My test provides statistically nonsignificant evidence against the null hypothesis. (Comment: Retention of a false null hypothesis would result in a type II error. This is one of the key limitations of statistical hypothesis testing.) 
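
Optional check, not part of the lab manual: the z statistic and P-values from item 5, sketched in Python (assuming SciPy); exact values differ slightly from the table-based answers because of rounding.

# One-sample z test of H0: mu = 32 with sigma known (sketch only).
from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 29.0, 32, 13.59, 10
se = sigma / sqrt(n)
zstat = (xbar - mu0) / se                 # about -0.70

print(round(zstat, 2))
print(norm.cdf(zstat))                    # one-tailed P, about 0.24
print(2 * norm.cdf(-abs(zstat)))          # two-tailed P, about 0.485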

Lab 7: Inference About a Mean 

1. Student's pdf 

a. t9,.95 = 1.83; right tail = 0.05

b. t9,.975 = 2.26; right tail = 0.025  


c. t9,.995 = 3.25; right tail = 0.005

2. t Confidence Interval 

a. The standard deviation of AGE in GerstmanB10.sav is 15.398
b. SE = s / √n = 15.398 / √10 = 4.8693 
c. 95% confidence interval for μ = x̄ ± (t9,.975)(SE) = 29.00 ± (2.262)(4.869) = 29.00 ± 11.01 = (18.0, 40.0).
Comment: This confidence interval suggests the population mean age is between 18 and 40. 
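
Optional check, not part of the lab manual: the 95% t interval in part c, sketched in Python (assuming SciPy for the t quantile).

# 95% t confidence interval for mu (sketch only).
from math import sqrt
from scipy.stats import t

xbar, s, n = 29.00, 15.398, 10
se = s / sqrt(n)                         # ~4.869
tcrit = t.ppf(0.975, df=n - 1)           # t(9, .975) ~ 2.262

print(round(xbar - tcrit * se, 1), round(xbar + tcrit * se, 1))   # about (18.0, 40.0)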

3. SPSS.  

4. Paired samples

a. These are paired samples because each data point in sample one is uniquely paired to a data point in sample two (the two blood pressure measurements are taken on the same individual).

b. Data

SBP1   SBP2   DELTA
120    126     -6
118    118      0
 83     86     -3
126    124      2
 87     82      5
114    118     -4
119    119      0
115    111      4
127    129     -2
 94     89      5

c. Stemplot (including interpretation) 

-1|
-0|6
-0|432
 0|0024
 0|55
 1|
 x10 (DELTA) 

 

d. Descriptive statistics for the delta variable:  x̄ = 0.1        s = 3.87155        n = 10  

 

e. Hypothesis test 

H0: μd = 0  versus  Ha: μd ≠ 0 
tstat = (0.100 - 0) / 1.224 = 0.08; df = 10 - 1 = 9

One-sided P > 0.25; two-sided P > 0.50 (two-sided P ≈ .92 using Table G) 

The evidence against the null hypothesis is not statistically significant. 
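
Optional check, not part of the lab manual: the paired t test on DELTA, sketched in Python (assuming SciPy). The exact two-sided P is about 0.94, consistent with the table-based answer above.

# Paired t test on the SBP data (sketch only).
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t, ttest_rel

sbp1 = [120, 118, 83, 126, 87, 114, 119, 115, 127, 94]
sbp2 = [126, 118, 86, 124, 82, 118, 119, 111, 129, 89]
delta = [a - b for a, b in zip(sbp1, sbp2)]     # DELTA = SBP1 - SBP2

n = len(delta)
se = stdev(delta) / sqrt(n)                     # ~1.224
tstat = mean(delta) / se                        # ~0.08 with df = 9

print(round(tstat, 2), 2 * t.sf(abs(tstat), df=n - 1))   # two-sided P, about 0.94
print(ttest_rel(sbp1, sbp2))                    # the same test in one call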

5. SPSS

6. Power of the test (optional)
a. Power = Φ[-1.96 + (5 × √10 / 5)] = Φ(1.20) = 0.8849 = 88.5%
b. Power = Φ[-1.96 + (2 × √10 / 5)] = Φ(-0.70) = 0.2420 = 24.2%
c. Lowering the difference worth detecting decreased the power of the study (making it more difficult to detect the difference).
d. Power = Φ[-1.96 + (2 × √50 / 5)] = Φ(0.87) = .8078 = 80.8%
e. Increasing the sample size increased the power of the test.
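
Optional check, not part of the lab manual: the power calculations in parts a, b, and d, sketched in Python (assuming SciPy); Φ is the standard Normal cumulative distribution, and small differences from the hand answers are due to rounding of the z values.

# Power = Phi(-z_{1-alpha/2} + delta * sqrt(n) / sigma)  (sketch only).
from math import sqrt
from scipy.stats import norm

def power(delta, n, sigma, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + delta * sqrt(n) / sigma)

print(power(5, 10, 5))    # about 0.885 (lab: 0.8849)
print(power(2, 10, 5))    # about 0.24  (lab: 0.2420)
print(power(2, 50, 5))    # about 0.81  (lab: 0.8078)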

Lab 8: Comparing Two Means 

1. Data

a. Summary statistics for the AGE variable in GerstmanB10.sav:      n1 = 10        x̄1 = 29.00         s1 = 15.40
b. Summary statistics for group 2 (given in the problem):           n2 = 15        x̄2 = 42.47         s2 = 12.48

 

2. Estimation  

a. x̄1 - x̄2 = 29.00 - 42.47 = -13.47. Group 2 has the higher average age, by 13.47 years. 
b. SE = √(s1²/n1 + s2²/n2) = 5.840  
c. dfconserv = 9 
d. 95% confidence interval for μ1 - μ2 = (x̄1 - x̄2) ± (t9,.975)(SE) = -13.47 ± (2.262)(5.840) = -13.47 ± 13.21 = (-26.68 to -0.26). The SE and mean difference should match the SPSS output, but this CI will be a little wider (more conservative) than the SPSS confidence interval from the "equal variances not assumed" method. 
e. The interpretation should address the fact that the CI is trying to capture μ1 - μ2 with 95% confidence.

 

3. Hypothesis test 

a. H0: μ1 - μ2 = 0 versus Ha: μ1 - μ2 ≠ 0
b. tstat = (29.00 - 42.47) / 5.840 = -2.31 with dfconserv = 9 
c. P-value based on the t sampling distribution

Two-sided P ≈ 0.047 via Table G. The hand-calculated results will be close but not identical to the "equal variance not assumed" t test results produced by SPSS.
Interpretation: The data provide significant evidence against H0.
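
Optional check, not part of the lab manual: the confidence interval and t test from items 2 and 3, sketched in Python (assuming SciPy) with the conservative df = 9 used above.

# Two-sample comparison from summary statistics, conservative df (sketch only).
from math import sqrt
from scipy.stats import t

n1, xbar1, s1 = 10, 29.00, 15.40
n2, xbar2, s2 = 15, 42.47, 12.48

diff = xbar1 - xbar2                        # -13.47
se = sqrt(s1**2 / n1 + s2**2 / n2)          # ~5.84
df = min(n1, n2) - 1                        # conservative df = 9

tcrit = t.ppf(0.975, df)
print(round(diff - tcrit * se, 2), round(diff + tcrit * se, 2))   # about (-26.68, -0.26)

tstat = diff / se                           # about -2.31
print(round(tstat, 2), 2 * t.sf(abs(tstat), df))                  # two-sided P, about 0.046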

4. SPSS

5. Conditions. Graphical methods such as stemplots can be used to assess Normality, although with small samples it is very difficult to check. In practice, it is enough that the two distributions have similar shapes and no strong outliers. 

Lab 9: Inference About a Proportion 

1. Sample size requirement    Make sure this problem is done accurately.

a. n = (1.96)²(0.5)(0.5) / 0.1² ≈ 96.04; use 97

b. n = (1.96)²(0.5)(0.5) / 0.05² ≈ 384.16; use 385

c. n = (1.96)²(0.5)(0.5) / 0.025² ≈ 1536.64; use 1537
d. The sample size requirement quadruples each time the margin of error is halved.
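
Optional check, not part of the lab manual: the sample-size formula n = z² p(1-p) / m² with p = 0.5, sketched in Python (assuming SciPy for the z quantile).

# Sample size needed to estimate a proportion with margin of error m (sketch only).
from math import ceil
from scipy.stats import norm

def n_required(m, p=0.5, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)      # 1.96 for 95% confidence
    return ceil(z**2 * p * (1 - p) / m**2)

print(n_required(0.10), n_required(0.05), n_required(0.025))   # 97, 385, 1537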

 

2. New data set: population2.sav

 

3. SPSS tabulation Make sure this problem is properly done.

x = 66        n = 600

 

4. Estimation Make sure this problem is done in its entirety.

a. p̂ = 66 / 600 = .110, or 11.0% 

b. Confidence interval for p. Note that the plus-four estimate p~ = 68 / 604 = .1126;  q~ = .8874;  SEp~ = .01286. 

95% CI = .1126 ± (1.96)(.01286) = .1126 ± .0252 = (.0874 to .1379) 

Interpretation: Based on these data we can be 95% confident that the parameter (the proportion in the source population) is between 8.8% and 13.8%. 

 

5. One-sample z test of the proportion Make sure this problem is done in its entirety.

a. H0: p = 0.10 vs. Ha: p ≠ 0.10
b. SE = .01225; zstat = (0.11 - 0.1) / .01225 = 0.82
c. P-value = 0.412; the evidence against H0 is not significant.  

d. The 95% confidence interval captured the value hypothesized by the null (i.e., p0 = .10). Thus, the test of H0: p = 0.10 will not be significant at the alpha = .05 level (i.e., P > .05).
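
Optional check, not part of the lab manual: the plus-four interval from item 4 and the z test from item 5, sketched in Python (assuming SciPy).

# Plus-four 95% CI and one-sample z test for a proportion (sketch only).
from math import sqrt
from scipy.stats import norm

x, n = 66, 600

# plus-four ("p-tilde") confidence interval
ptilde = (x + 2) / (n + 4)                       # ~0.1126
se_ci = sqrt(ptilde * (1 - ptilde) / (n + 4))    # ~0.01286
print(round(ptilde - 1.96 * se_ci, 4), round(ptilde + 1.96 * se_ci, 4))   # about (.0874, .1378)

# z test of H0: p = 0.10; the SE uses the null value p0
p0, phat = 0.10, x / n
se_test = sqrt(p0 * (1 - p0) / n)                # ~0.01225
zstat = (phat - p0) / se_test                    # ~0.82
print(round(zstat, 2), 2 * norm.sf(abs(zstat)))  # two-sided P, about 0.41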


6. WinPepi. The WinPepi results are shown in the lab workbook.

 

Lab 10: Comparing Proportions 

1. Cross-tabulation 

2. Estimation 

a. Prevalences and prevalence ("risk") difference: 

Prevalence in females = p̂1 = 14 / 159 = .08805

Prevalence in males = p̂2 = 52 / 441 = .1179

p̂1 - p̂2 = .08805 - .1179 = -0.02985

Interpretation: The prevalence in females in the sample is about 3.0 percentage points lower than in males.

 

b. 95% confidence interval for p1 - p2  

 

Note: Using WinPepi, the 95% CI by Wilson's method is -0.078 to 0.031. 

 

3. Hypothesis test  

a. H0: p1 = p2 versus Ha: p1 ≠ p2 (equivalently, H0: p1 - p2 = 0 versus Ha: p1 - p2 ≠ 0)
b. zstat = 1.03
c. P = 0.302 


d. The evidence against H0 is not significant ("no significant difference in prevalence by gender"). Discuss the results in your own words.
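
Optional check, not part of the lab manual: the two-proportion z test with a pooled standard error, sketched in Python (assuming SciPy). The negative sign on zstat simply reflects subtracting the male proportion from the female proportion; its magnitude matches the 1.03 reported above.

# Two-proportion z test with pooled SE (sketch only).
from math import sqrt
from scipy.stats import norm

x1, n1 = 14, 159     # females
x2, n2 = 52, 441     # males

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                            # 66 / 600 = 0.11
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

zstat = (p1 - p2) / se
print(round(p1 - p2, 4), round(zstat, 2), 2 * norm.sf(abs(zstat)))   # about -0.0299, -1.03, P ~ 0.30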

4. WinPepi. (optional)