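Note: This post is updated continuously during my study period, for my own learning. It is not written for visitors. Typing practice! ^^
Correlation Analysis
Pearson (parametric) correlation analysis
- Linear relationship between two continuous variables
- Assumes the population data follow a normal distribution
- The goal is to estimate the population correlation
- The Pearson sample correlation coefficient estimates the population correlation coefficient
- The sign (positive/negative) gives the direction of the correlation
- The absolute value of the coefficient gives the strength of the correlation
- cor(x, y, method="pearson") # Pearson sample correlation coefficient
- cor.test(x, y, method="pearson") # test of the population correlation
- cor(iris[, c(1:4)]) # correlation matrix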
# Load dataset iris
data(iris)
# check missing value
table(is.na(iris))
FALSE
750
# Check structure
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#
# Pearson sample coefficient
cor(iris$Sepal.Length,iris$Sepal.Width, method="pearson")
[1] -0.1175698
# Significance test
cor.test(iris$Sepal.Length,iris$Sepal.Width, method="pearson")
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519 # p > 0.05: fail to reject H0
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.27269325 0.04351158
sample estimates:
cor
-0.1175698
#
# correlation matrix
cor(iris[,c(1,2,3,4)]) # iris variables 1, 2, 3, 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
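The cor() and cor.test() results above can be reproduced by hand, which makes the link between the coefficient and the test statistic explicit. The following is a minimal sketch; the variable names (r_manual, t_stat, p_value) are my own and are not part of the original notes.

```r
# Pearson's r and its t statistic computed by hand (sketch)
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_stat <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(r = r_manual, t = t_stat, df = n - 2, p = p_value))
```

The printed values should agree with the cor.test() output shown earlier (t = -1.4403, df = 148, p-value = 0.1519).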
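Nonparametric correlation analysis
- Used when the population is not normally distributed, hence a nonparametric test
- Spearman's rank correlation test
- Kendall's tau test
- H0: the two continuous variables x and y are independent
- cor(x, y, method="spearman") # Spearman sample correlation coefficient
- cor.test(x, y, method="spearman") # Spearman correlation test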
# Spearman non-parametric test
cor(iris$Sepal.Length,iris$Sepal.Width, method="spearman")
[1] -0.1667777
# significance test
cor.test(iris$Sepal.Length,iris$Sepal.Width, method="spearman")
Spearman's rank correlation rho
data: iris$Sepal.Length and iris$Sepal.Width
S = 656283, p-value = 0.04137 # p < 0.05: reject H0 (unlike the Pearson test)
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1667777
Warning message:
In cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method = "spearman") :
Cannot compute exact p-value with ties
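Spearman's rho is simply Pearson's r computed on the ranks of the data, and Kendall's tau (listed above but not run) uses the same cor() / cor.test() interface. A minimal sketch, assuming the same iris columns as above; the object names are illustrative only.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Spearman's rho is Pearson's r applied to the ranks (ties get average ranks)
rho_manual <- cor(rank(x), rank(y), method = "pearson")
rho_builtin <- cor(x, y, method = "spearman")
print(c(manual = rho_manual, builtin = rho_builtin))

# Kendall's tau: same interface, method = "kendall"
tau <- cor(x, y, method = "kendall")

# exact = FALSE requests the normal approximation, which avoids the tie warning
kendall_test <- cor.test(x, y, method = "kendall", exact = FALSE)
print(tau)
print(kendall_test$p.value)
```

Passing exact = FALSE to the Spearman test as well should suppress the tie warning shown above, since an exact p-value is no longer requested.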
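Linear Regression Models
Simple Linear Regression (SLR)
- SLR : Simple Linear Regression
* Regression model
- Error term assumptions (normality, equal variance, independence)
- Regression equation y = beta0 + beta1*x
- Regression coefficients beta0 and beta1
- beta1 = slope = size of the effect on y of a one-unit change in x
- Solved by ordinary least squares (OLS)
* Significance test of the regression
- Goodness of the regression equation: R2 = SSR/SST
- Analysis of variance: F-test, p-value
- SST = SSR + SSE
* Checking assumptions (regression diagnostics)
- Normality: shapiro.test()
- Homoscedasticity
- Independence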
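The OLS solution and the sum-of-squares decomposition listed above can be written out directly. This is a sketch on the built-in cars dataset (the same data used in the lm() run in this section); the variable names are my own.

```r
x <- cars$speed
y <- cars$dist

# OLS estimates: beta1 = Sxy / Sxx, beta0 = mean(y) - beta1 * mean(x)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
y_hat <- beta0 + beta1 * x

# Sum-of-squares decomposition: SST = SSR + SSE
sst <- sum((y - mean(y))^2)      # total
ssr <- sum((y_hat - mean(y))^2)  # explained by the regression
sse <- sum((y - y_hat)^2)        # residual
r2 <- ssr / sst

# Overall F statistic with 1 and n - 2 degrees of freedom
f_stat <- (ssr / 1) / (sse / (length(y) - 2))

print(c(beta0 = beta0, beta1 = beta1, R2 = r2, F = f_stat))
```

These values should match the summary(lm(dist ~ speed, cars)) output in this section: intercept -17.5791, slope 3.9324, R-squared 0.6511, F = 89.57.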
# Simple linear regression
# dataset - base - cars
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
# Check dataset cars
table(is.na(cars))
FALSE
100
# Pearson correlation test
cor.test(cars$speed,cars$dist)
Pearson's product-moment correlation
data: cars$speed and cars$dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6816422 0.8862036
sample estimates:
cor
0.8068949
#
# SLR
model<-lm(dist~speed, cars)
summary(model)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# normality test
shapiro.test(model$residuals)
Shapiro-Wilk normality test
data: model$residuals
W = 0.94509, p-value = 0.02152
# Durbin-Watson independence test
install.packages("lmtest")
library(lmtest)
dwtest(model)
Durbin-Watson test
data: model
DW = 1.6762, p-value = 0.09522
alternative hypothesis: true autocorrelation is greater than 0
#
# THE END
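The diagnostics above cover normality and independence but not the equal-variance assumption. One possible check, sketched here in base R, is a studentized Breusch-Pagan style test: regress the squared residuals on the predictor and use n * R2 as a chi-square statistic. This is my own addition, not part of the original notes; the lmtest package also provides bptest(), which may be more convenient.

```r
model <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(model)^2 ~ speed, data = cars)

# LM statistic = n * R^2 of the auxiliary regression,
# compared with a chi-square distribution (df = number of predictors = 1)
lm_stat <- nrow(cars) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(LM = lm_stat, p = p_value))
```

A small p-value would suggest that the residual variance changes with speed, i.e. that the equal-variance assumption is doubtful.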
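Multiple Linear Regression (MLR)
- MLR : Multiple Linear Regression
* Regression model
- Regression equation y = f(xi, betai)
- Regression coefficients beta0, beta1, ...
* Significance test of the regression
- Goodness of the regression equation: R2 = SSR/SST
- Analysis of variance: F-test, p-value
- SST = SSR + SSE
* Checking assumptions (regression diagnostics)
- Normality: shapiro.test()
- Homoscedasticity
- Independence
- Multicollinearity: vif() from the car package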
# Load datasets - MASS - Cars93
library(MASS)
str(Cars93)
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
# Check missing value
table(is.na(Cars93))
FALSE TRUE
2498 13
#
# Select Xi and Y(Price) variables
cars93<-Cars93[,c('Price','MPG.city','EngineSize','RPM','Fuel.tank.capacity','Weight')]
str(cars93)
'data.frame': 93 obs. of 6 variables:
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
table(is.na(cars93))
FALSE
558
# Check the correlation matrix
cor(cars93) # Weight has the strongest correlation with Price.
Price MPG.city EngineSize RPM Fuel.tank.capacity Weight
Price 1.000000000 -0.5945622 0.5974254 -0.004954931 0.6194800 0.6471790
MPG.city -0.594562163 1.0000000 -0.7100032 0.363045129 -0.8131444 -0.8431385
EngineSize 0.597425392 -0.7100032 1.0000000 -0.547897805 0.7593062 0.8450753
RPM -0.004954931 0.3630451 -0.5478978 1.000000000 -0.3333452 -0.4279315
Fuel.tank.capacity 0.619479981 -0.8131444 0.7593062 -0.333345218 1.0000000 0.8940181
Weight 0.647179005 -0.8431385 0.8450753 -0.427931473 0.8940181 1.0000000
# MLR - base -lm()
model<-lm(Price~., cars93)
summary(model)
Call:
lm(formula = Price ~ ., data = cars93)
Residuals:
Min 1Q Median 3Q Max
-11.344 -3.552 -0.556 2.252 35.390
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.107015 14.098134 -2.703 0.00826 **
MPG.city -0.290017 0.231941 -1.250 0.21451
EngineSize 4.303060 1.329544 3.236 0.00171 **
RPM 0.007066 0.001378 5.127 1.76e-06 ***
Fuel.tank.capacity 0.111959 0.481680 0.232 0.81675
Weight 0.004375 0.003386 1.292 0.19973
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.508 on 87 degrees of freedom
Multiple R-squared: 0.5707, Adjusted R-squared: 0.546
F-statistic: 23.13 on 5 and 87 DF, p-value: 1.053e-14
#
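# Variable selection (stepwise, by AIC)
# step() has no 'method' argument; the option is 'direction'.
# The trace below only drops terms, i.e. backward elimination (the default when no scope is given).
model2<-step(model, direction="backward")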
Start: AIC=354.19
Price ~ MPG.city + EngineSize + RPM + Fuel.tank.capacity + Weight
Df Sum of Sq RSS AIC
- Fuel.tank.capacity 1 2.29 3687.3 352.24
- MPG.city 1 66.22 3751.2 353.84
- Weight 1 70.72 3755.7 353.95
<none> 3685.0 354.19
- EngineSize 1 443.68 4128.7 362.76
- RPM 1 1113.58 4798.6 376.74
Step: AIC=352.24
Price ~ MPG.city + EngineSize + RPM + Weight
Df Sum of Sq RSS AIC
- MPG.city 1 77.37 3764.6 352.18
<none> 3687.3 352.24
- Weight 1 122.89 3810.2 353.29
- EngineSize 1 450.82 4138.1 360.97
- RPM 1 1152.50 4839.8 375.54
Step: AIC=352.18
Price ~ EngineSize + RPM + Weight
Df Sum of Sq RSS AIC
<none> 3764.6 352.18
- EngineSize 1 446.63 4211.3 360.60
- Weight 1 480.84 4245.5 361.35
- RPM 1 1147.43 4912.1 374.92
#
summary(model2)
Call:
lm(formula = Price ~ EngineSize + RPM + Weight, data = cars93)
Residuals:
Min 1Q Median 3Q Max
-10.511 -3.806 -0.300 1.447 35.255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.793292 9.106309 -5.688 1.62e-07 ***
EngineSize 4.305387 1.324961 3.249 0.00163 **
RPM 0.007096 0.001363 5.208 1.22e-06 ***
Weight 0.007271 0.002157 3.372 0.00111 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.504 on 89 degrees of freedom
Multiple R-squared: 0.5614, Adjusted R-squared: 0.5467
F-statistic: 37.98 on 3 and 89 DF, p-value: 6.746e-16
#
# normality test
shapiro.test(model2$residuals)
Shapiro-Wilk normality test
data: model2$residuals
W = 0.85873, p-value = 5.976e-08
#
# multicollinearity
install.packages("car")
library(car)
vif(model2) # variance inflation factor (VIF); values below 10 are usually considered acceptable
EngineSize RPM Weight
4.108869 1.437810 3.520026
#
# THE END
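The VIF values reported by car::vif() can be reproduced without the car package: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. A minimal sketch, assuming the three predictors kept in model2; the helper names are illustrative.

```r
library(MASS)  # Cars93

predictors <- Cars93[, c("EngineSize", "RPM", "Weight")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(vif_manual)
```

The result should agree with the vif(model2) output above (4.108869, 1.437810, 3.520026).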
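Generalized Linear Model (GLM)
- GLM : Generalized Linear Model
- For a response Y that follows a binomial, multinomial, Poisson, Gaussian, etc. distribution
- Uses a linear predictor and a link function
- link function(E[Y]) = linear predictor, i.e. g(E[Y]) = beta0 + beta1*x1 + ...
- glm(formula, family, data, ...)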
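As an illustration of the glm() call and of how the link function ties the mean of Y to the linear predictor, here is a small logistic regression sketch. The dataset (built-in mtcars) and the model am ~ wt are my own choice for the example and do not come from the original notes.

```r
# Logistic regression: binary response am (0 = automatic, 1 = manual)
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
print(coef(fit))

# Linear predictor eta = beta0 + beta1 * wt
eta <- predict(fit, type = "link")

# Fitted mean mu = E[Y] = P(am = 1)
mu <- predict(fit, type = "response")

# The link function connects the two: logit(mu) = eta
print(all.equal(qlogis(mu), eta))
```

Changing the family argument (for example poisson or gaussian) switches the assumed distribution of Y and its default link while the formula interface stays the same.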

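Goodness-of-Fit (GOF) Test
- GOF : Goodness-of-Fit
- Null hypothesis (H0): the population follows the OO distribution.
Distribution | library | function |
Normal | base | ks.test(data, "pnorm", mean, sd) |
Normal | base | shapiro.test(data) |
Normal | nortest | ad.test(data) |
Exponential | base | ks.test(data, "pexp", rate) |
Poisson | base | ks.test(data, "ppois", lambda) |
Note: ks.test() assumes a continuous distribution, so for Poisson (discrete) data its p-value is only approximate.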
# Load dataset iris
str(iris)
# Check missing values
table(is.na(iris))
#
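# Normality test
# ks.test() compares against N(0, 1) unless mean and sd are given
ks.test(iris$Sepal.Length, "pnorm", mean=mean(iris$Sepal.Length), sd=sd(iris$Sepal.Length))
shapiro.test(iris$Sepal.Length)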
#
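The table above lists "pexp" without an example. The sketch below shows how distribution parameters are passed to ks.test(), using simulated data; the sample, the seed, and the rate value are all hypothetical and chosen only for illustration.

```r
set.seed(123)             # reproducible simulated sample
x <- rexp(200, rate = 2)  # hypothetical data drawn with true rate = 2

# Extra arguments after the distribution name are passed to it as parameters
fit_exp <- ks.test(x, "pexp", rate = 2)
print(fit_exp$p.value)

# Against a standard normal the same (all-positive) sample fits poorly
fit_norm <- ks.test(x, "pnorm", mean = 0, sd = 1)
print(fit_norm$p.value)
```

Because every simulated value is positive, the comparison with N(0, 1) is rejected, while the comparison with the matching exponential distribution tests the hypothesis the data were actually generated under.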
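Test of Independence
- Relationship between two or more categorical variables
- Two classification variables are observed in a single population
- Analysis of a cross table (two-way table, contingency table)
- Null hypothesis (H0): the two variables are independent (i.e., unrelated).
- Performed with chisq.test() (also used for the test of homogeneity)
- Fisher's exact test (when 20% or more of the cells have an expected count of 5 or less): fisher.test()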
# Contingency table (cross table)
# Counts for each combination of the two variables
library(MASS) # provides the survey dataset
table(survey$Sex, survey$Smoke)
Heavy Never Occas Regul
Female 5 99 9 5
Male 6 89 10 12
# Independence test - chisq.test()
# Test whether sex and smoking status are independent
chisq.test(survey$Sex, survey$Smoke)
Pearson's Chi-squared test
data: survey$Sex and survey$Smoke
X-squared = 3.5536, df = 3, p-value = 0.3139
#
# fisher exact test
fisher.test(survey$Sex, survey$Smoke)
Fisher's Exact Test for Count Data
data: survey$Sex and survey$Smoke
p-value = 0.3105
alternative hypothesis: two.sided
#
# THE END
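The chi-square statistic reported above can be rebuilt from the contingency table: under independence the expected count of a cell is (row total * column total) / grand total. A minimal sketch, with variable names of my own choosing:

```r
library(MASS)  # survey dataset

obs <- table(survey$Sex, survey$Smoke)  # rows with NA are dropped by table()

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
print(c(X2 = x2, df = df, p = p_value))

# Share of cells with an expected count of 5 or less
# (the rule of thumb for switching to Fisher's exact test)
print(mean(expected <= 5))
```

The statistic, degrees of freedom, and p-value should match the chisq.test() output above (X-squared = 3.5536, df = 3, p-value = 0.3139).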