Statistical Tests on Data: R programming


sosal 2014. 8. 21. 14:23

* http://sosal.kr/
* made by so_Sal
*/

Data

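Pima Indians: a people of Mongoloid descent who migrated to the Americas between the 9th and 13th centuries

Staple diet: plant-based (tree shoots, wild plants, wheat, beans, squash, etc.)

Since the 1960s, a high-fat, high-calorie diet has led to an increase in diabetes cases.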

Pima.tr data

8 variables

npreg: number of pregnancies.

glu: plasma glucose concentration in an oral glucose tolerance test.

bp: diastolic blood pressure (mm Hg).

skin: triceps skin fold thickness (mm).

bmi: body mass index (weight in kg/(height in m)^2).

ped: diabetes pedigree function.

age: age in years.

type: Yes or No, for diabetic according to WHO criteria.
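Before running any test it helps to confirm that the data set really has the shape described above. A minimal sketch, assuming only that the MASS package is installed:

```r
library(MASS)  # Pima.tr ships with the MASS package

str(Pima.tr)         # column types: numeric/integer measurements plus the factor `type`
dim(Pima.tr)         # number of rows and columns
table(Pima.tr$type)  # how many subjects fall in each diabetes class
```

The group sizes shown by table() are the ones that reappear later as degrees of freedom in the two sample tests.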

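Normality Test

one sample t-test

A way to judge whether the data follow a normal distribution

H0 (null hypothesis): the given data follow a normal distribution.

Ha (alternative hypothesis): the given data do not follow a normal distribution.

- Shapiro-Wilk normality test

The shapiro.test() function can be used to test a variable for normality.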

> library(MASS) # the package that contains Pima.tr

> attach(Pima.tr) # after attaching, bmi can be used directly instead of Pima.tr$bmi

> head(Pima.tr)

  npreg glu bp skin  bmi   ped age type
1     5  86 68   28 30.2 0.364  24   No
2     7 195 70   33 25.1 0.163  55  Yes
3     5  77 82   41 35.8 0.156  35   No
4     0 165 76   43 47.9 0.259  26   No
5     0 107 60   25 26.4 0.133  23   No
6     5  97 76   27 35.6 0.378  52  Yes

- type

Yes: patient with diabetes

No: patient without diabetes

> shapiro.test(bmi)

Shapiro-Wilk normality test

data: bmi

W = 0.991, p-value = 0.2523

# At a significance level of 0.05 the p-value is too large to reject the null hypothesis,

# so bmi can be treated as following a normal distribution.
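The decision above can also be made in code rather than by reading the printout. The following is a small sketch; the names alpha, sw and reject_normality are just illustrative, and the data is accessed as Pima.tr$bmi so that it does not depend on attach():

```r
library(MASS)

alpha <- 0.05                           # significance level used in the text
sw <- shapiro.test(Pima.tr$bmi)         # returns an object of class "htest"

sw$statistic                            # the W statistic
sw$p.value                              # the p-value as a plain number
reject_normality <- sw$p.value < alpha  # TRUE would mean H0 is rejected
reject_normality
```

Note that failing to reject H0 is not proof of normality; it only means the test found no evidence against it.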

- Result of qqnorm(bmi): normal Q-Q plot of bmi
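For reference, a Q-Q plot like the one mentioned here can be drawn as follows. This is only a sketch: the title and the red reference line are my own choices, and the correlation at the end is an informal summary of how straight the plot is, not a formal test.

```r
library(MASS)

bmi <- Pima.tr$bmi

qq <- qqnorm(bmi, main = "Normal Q-Q plot of bmi")  # draws the plot, returns the plotted x/y invisibly
qqline(bmi, col = "red")                            # reference line through the first and third quartiles

cor(qq$x, qq$y)  # close to 1 when the points lie near a straight line
```

Points that stay close to the reference line agree with the Shapiro-Wilk result above.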

one sample t-test

> bmi.ttest <- t.test(bmi, mu=30)

> bmi.ttest

One Sample t-test

data: bmi

t = 5.3291, df = 199, p-value = 2.661e-07

alternative hypothesis: true mean is not equal to 30

95 percent confidence interval:

31.45521 33.16479

sample estimates:

mean of x

32.31

# The p-value is below 0.05, so the null hypothesis (mean = 30) is rejected -> the mean bmi differs from 30.
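To see where the numbers in this output come from, the t statistic and p-value can be recomputed by hand from the usual formula t = (mean - mu0) / (s / sqrt(n)). A minimal sketch (the variable names are illustrative):

```r
library(MASS)

x   <- Pima.tr$bmi
mu0 <- 30                  # hypothesised mean from the text
n   <- length(x)

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # t statistic
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)   # two-sided p-value

res <- t.test(x, mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = p_manual, t.test = res$p.value)
```

Both pairs should agree, which confirms that the test is about the mean of bmi and not about its distribution.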

> names(bmi.ttest)

[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative" "method" "data.name"

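statistic: the test statistic

parameter: the parameter of the test distribution (here, the degrees of freedom)

p.value: the p-value, and so on; each component can be inspected by name.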
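As a short illustration, the components listed by names() can be pulled out with `$`. The helper variables ci and level below are my own names:

```r
library(MASS)

bmi.ttest <- t.test(Pima.tr$bmi, mu = 30)

bmi.ttest$statistic  # t value
bmi.ttest$parameter  # degrees of freedom
bmi.ttest$p.value    # p-value
bmi.ttest$conf.int   # confidence interval, with the level stored as an attribute
bmi.ttest$estimate   # sample mean

ci    <- as.numeric(bmi.ttest$conf.int)           # drop the attribute to get a plain vector
level <- attr(bmi.ttest$conf.int, "conf.level")   # confidence level of the interval
```

This is handy when many tests are run in a loop and only the p-values or intervals need to be collected.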

two sample t-test & F test

> var.test(bmi ~ type) # test of equal variances between two groups / '~' splits bmi by the factor type

F test to compare two variances

data: bmi by type

F = 1.7595, num df = 131, denom df = 67, p-value = 0.01115

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

1.140466 2.637564

sample estimates:

ratio of variances

1.75945

# The F test rejects the null hypothesis (p-value < 0.05), so the two variances are not equal.
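The F statistic here is simply the ratio of the two sample variances. A sketch that reproduces it by hand, assuming the first factor level ("No") goes in the numerator as in the output above:

```r
library(MASS)

bmi_no  <- Pima.tr$bmi[Pima.tr$type == "No"]
bmi_yes <- Pima.tr$bmi[Pima.tr$type == "Yes"]

f_manual <- var(bmi_no) / var(bmi_yes)  # ratio of sample variances
df1 <- length(bmi_no) - 1               # numerator degrees of freedom
df2 <- length(bmi_yes) - 1              # denominator degrees of freedom

# two-sided p-value: twice the smaller tail probability of the F distribution
lower    <- pf(f_manual, df1, df2)
p_manual <- 2 * min(lower, 1 - lower)

res <- var.test(bmi ~ type, data = Pima.tr)
c(manual = f_manual, var.test = unname(res$statistic))
c(manual = p_manual, var.test = res$p.value)
```

Passing data = Pima.tr makes the call work without attach().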

> t.test(bmi ~ type)

Welch Two Sample t-test

data: bmi by type

t = -4.512, df = 171.457, p-value = 1.188e-05

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-5.224615 -2.044547

sample estimates:

mean in group No mean in group Yes

31.07424 34.70882

# Conclusion: the mean bmi differs between subjects with and without diabetes.
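Because the F test rejected equal variances, the Welch version that t.test() runs by default is the appropriate one here. For comparison, the sketch below also runs the pooled-variance version with var.equal = TRUE; the names welch and pooled are illustrative:

```r
library(MASS)

welch  <- t.test(bmi ~ type, data = Pima.tr)                    # default: var.equal = FALSE (Welch)
pooled <- t.test(bmi ~ type, data = Pima.tr, var.equal = TRUE)  # classical pooled-variance t-test

welch$parameter   # fractional Welch-Satterthwaite degrees of freedom
pooled$parameter  # n1 + n2 - 2 degrees of freedom

tapply(Pima.tr$bmi, Pima.tr$type, mean)  # group means, the same in both tests
```

Only the standard error and the degrees of freedom differ between the two versions; the estimated group means are identical.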

Paired t-test

Data: anorexia

Treat: Factor of three levels: "Cont" (control), "CBT" (Cognitive Behavioural treatment) and "FT" (family treatment).

Prewt: Weight of patient before study period, in lbs.

Postwt: Weight of patient after study period, in lbs.

Compares the means of two paired (dependent) measurements taken on the same subjects (means equal / not equal)

> FT <- subset(anorexia, Treat=='FT')

> head(FT)

   Treat Prewt Postwt
56    FT  83.8   95.2
57    FT  83.3   94.3
58    FT  86.0   91.5
59    FT  82.5   91.9
60    FT  86.7  100.3
61    FT  79.6   76.7

> shapiro.test(FT$Prewt - FT$Postwt)

Shapiro-Wilk normality test

data: FT$Prewt - FT$Postwt

W = 0.9536, p-value = 0.5156

# The p-value exceeds 0.05, so normality of the differences is not rejected.

> t.test( FT$Prewt, FT$Postwt, paired=TRUE )

Paired t-test

data: FT$Prewt and FT$Postwt

t = -4.1849, df = 16, p-value = 0.0007003

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-10.94471 -3.58470

sample estimates:

mean of the differences

-7.264706

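Since the p-value is below the 0.05 significance level, we can conclude that the difference before and after family treatment is not 0.

-> Family treatment can be judged to have an effect.

Paired test of the weight difference when the treatment is CBT (Cognitive Behavioural Treatment)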
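A paired t-test is the same thing as a one sample t-test on the per-patient differences, which is why normality was checked on FT$Prewt - FT$Postwt. A minimal sketch of that equivalence (d, paired and one_smp are illustrative names):

```r
library(MASS)

FT <- subset(anorexia, Treat == "FT")
d  <- FT$Prewt - FT$Postwt  # per-patient difference (before minus after)

paired  <- t.test(FT$Prewt, FT$Postwt, paired = TRUE)
one_smp <- t.test(d, mu = 0)  # one sample t-test on the differences

c(paired = unname(paired$statistic), one_sample = unname(one_smp$statistic))
c(paired = paired$p.value,           one_sample = one_smp$p.value)
mean(d)  # negative: patients weighed more after treatment on average
```

The negative mean difference indicates weight gain, since the difference is defined as before minus after.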

> CBT <- subset(anorexia, Treat=='CBT')

> shapiro.test( CBT$Prewt - CBT$Postwt )

Shapiro-Wilk normality test

data: CBT$Prewt - CBT$Postwt

W = 0.8962, p-value = 0.007945

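The p-value is below 0.05, so the null hypothesis is rejected. => The differences do not follow a normal distribution.

Therefore the nonparametric Wilcoxon signed rank test is used in place of the t-test.

Because the data are paired, it suffices to test whether the before/after difference is 0.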

> wilcox.test( CBT$Prewt, CBT$Postwt, paired=TRUE )

Wilcoxon signed rank test with continuity correction

data: CBT$Prewt and CBT$Postwt

V = 131.5, p-value = 0.06447

alternative hypothesis: true location shift is not equal to 0

Warning message:

In wilcox.test.default(CBT$Prewt, CBT$Postwt, paired = TRUE) :

cannot compute exact p-value with ties

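The test is run with the paired=TRUE option. The p-value (0.06447) exceeds the 0.05 significance level, so the null hypothesis cannot be rejected: there is no statistically significant evidence of a weight difference before and after CBT.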
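The warning appears because some absolute differences are tied, so R falls back to a normal approximation. One way to make that choice explicit is to pass exact = FALSE, as in this sketch (d and res are illustrative names):

```r
library(MASS)

CBT <- subset(anorexia, Treat == "CBT")
d   <- CBT$Prewt - CBT$Postwt

anyDuplicated(abs(d[d != 0])) > 0  # TRUE when some non-zero absolute differences are tied

# exact = FALSE requests the normal approximation (with continuity correction) directly
res <- wilcox.test(CBT$Prewt, CBT$Postwt, paired = TRUE, exact = FALSE)
res$p.value
```

The p-value is the same as in the output above; only the request for an exact computation is dropped.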
