[Big Data Analysis Engineer Practical Exam, Part 5] Study Practice - Statistical Analysis

SOS Data Lab (에스오에스데이터랩) 2025. 6. 7. 11:06

Note) These notes are for my own study and will be updated continuously during the study period. They are not written for visitors. Just typing practice! ^^

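Correlation analysis

Pearson parametric correlation analysis

- Linear relationship between two continuous variables

- Assumes the population data follow a normal distribution

- The goal is to estimate the population correlation

- The population correlation coefficient is estimated by the Pearson sample correlation coefficient

- The sign (positive/negative) gives the direction of the correlation

- The absolute value of the coefficient gives the strength of the correlation

- cor(x, y, method="pearson") # Pearson sample correlation coefficient

- cor.test(x, y, method="pearson") # test of the population correlation

- cor(iris[, c(1:4)]) # correlation matrix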

# Load dataset iris
data(iris)
# check missing value
table(is.na(iris))
FALSE 
  750 
# Check structure
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#
# Pearson sample coefficient
cor(iris$Sepal.Length,iris$Sepal.Width, method="pearson")
[1] -0.1175698
# Significance test
cor.test(iris$Sepal.Length,iris$Sepal.Width, method="pearson")
 Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519 # p > 0.05: fail to reject H0
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor 
-0.1175698 
#
# correlation matrix 
cor(iris[,c(1,2,3,4)]) # iris variables 1, 2, 3, 4
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
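
As a cross-check on the output above, the Pearson coefficient and the t statistic reported by cor.test() can be reproduced from their textbook definitions. This is only a minimal sketch; the names r_manual, t_manual and p_manual are illustrative and do not appear in the original notes.

```r
# Reproduce cor() and the cor.test() t statistic by hand (illustrative sketch)
data(iris)
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# Pearson r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Under H0: rho = 0, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom
t_manual <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

print(c(r = r_manual, t = t_manual, p = p_manual))
```

The three values should agree with the cor() and cor.test() output shown above (cor = -0.1175698, t = -1.4403, p-value = 0.1519).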

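Nonparametric correlation analysis

- The population is not assumed to be normally distributed, so a nonparametric test is used.

- Spearman's rank correlation test

- Kendall's tau test

- H0 : the two continuous variables x and y are independent.

- cor(x, y, method="spearman") # Spearman sample correlation coefficient
- cor.test(x, y, method="spearman") # Spearman correlation test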

# Spearman non-parametric test
cor(iris$Sepal.Length,iris$Sepal.Width, method="spearman")
[1] -0.1667777
# Significance test
cor.test(iris$Sepal.Length,iris$Sepal.Width, method="spearman")
  Spearman's rank correlation rho

data:  iris$Sepal.Length and iris$Sepal.Width
S = 656283, p-value = 0.04137 # p < 0.05: reject H0 at the 5% level (the Pearson test above did not; this p-value is approximate because of ties)
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.1667777 

Warning message:
In cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method = "spearman") :
  Cannot compute exact p-value with ties
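
Two points from the list above can be checked directly: Spearman's rho is simply the Pearson correlation computed on ranks, and Kendall's tau uses the same cor() / cor.test() interface. The following is a rough sketch under those assumptions; the variable names are illustrative only.

```r
# Spearman as "Pearson on ranks", plus Kendall's tau (illustrative sketch)
data(iris)
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# rank() assigns average ranks to ties, which is what method = "spearman" uses
rho_ranks <- cor(rank(x), rank(y), method = "pearson")
rho_builtin <- cor(x, y, method = "spearman")
print(c(rho_ranks = rho_ranks, rho_builtin = rho_builtin))

# Kendall's tau: same interface, different rank-based statistic
tau <- cor(x, y, method = "kendall")
print(tau)

# exact = FALSE asks for the normal approximation, so no ties warning is raised
kendall_test <- cor.test(x, y, method = "kendall", exact = FALSE)
print(kendall_test$p.value)
```

Passing exact = FALSE to the Spearman test as well should avoid the ties warning shown above, since the exact p-value is then never attempted.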

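Linear regression models

Simple linear regression (SLR)

- SLR : Simple Linear Regression

* Regression model

- Error-term assumptions (normality, homoscedasticity, independence)

- Regression equation: y = beta0 + beta1*x + error

- Regression coefficients beta0 (intercept) and beta1

- beta1 = slope = size of the effect of x on y (change in y per unit change in x)

- Solved by ordinary least squares (OLS)

* Significance test of the regression

- Goodness of the regression equation: R2 = SSR/SST

- ANOVA F-test, p-value

- SST = SSR + SSE

* Checking assumptions (regression diagnostics)

- Normality: shapiro.test()

- Homoscedasticity

- Independence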
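
To make the quantities in the list above concrete, here is a small sketch that computes the OLS coefficients from their closed form and checks the decomposition SST = SSR + SSE and R2 = SSR/SST on the built-in cars data. The variable names (b0, b1, sst, ssr, sse, r2) are illustrative and not part of the original notes.

```r
# OLS closed form and sum-of-squares decomposition (illustrative sketch)
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates for simple linear regression
b1 <- cov(x, y) / var(x)       # slope
b0 <- mean(y) - b1 * mean(x)   # intercept
print(c(b0 = b0, b1 = b1))

fit <- lm(dist ~ speed, data = cars)
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)      # total sum of squares
ssr <- sum((y_hat - mean(y))^2)  # regression (explained) sum of squares
sse <- sum((y - y_hat)^2)        # error (residual) sum of squares
r2 <- ssr / sst
print(c(SST = sst, SSR = ssr, SSE = sse, R2 = r2))

# Checks: the decomposition holds and R2 matches summary()
print(all.equal(sst, ssr + sse))
print(all.equal(r2, summary(fit)$r.squared))
```

In simple linear regression R2 is also the square of the Pearson correlation between x and y, so cor(x, y)^2 gives the same number.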

# Simple linear regression
# dataset - base - cars
str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
# Check dataset cars
table(is.na(cars))
FALSE 
  100
# Pearson correlation test
cor.test(cars$speed,cars$dist)
 Pearson's product-moment correlation

data:  cars$speed and cars$dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor 
0.8068949 
#
# SLR
model<-lm(dist~speed, cars)  
summary(model)
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
# normality test
shapiro.test(model$residuals)
Shapiro-Wilk normality test

data:  model$residuals
W = 0.94509, p-value = 0.02152
# Durbin-Watson independence test
install.packages("lmtest")
library(lmtest)
dwtest(model)
 Durbin-Watson test

data:  model
DW = 1.6762, p-value = 0.09522
alternative hypothesis: true autocorrelation is greater than 0
#
# THE END
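
The session above checks normality and independence but not homoscedasticity. One common choice, which is my assumption and not something stated in the notes, is the Breusch-Pagan test; lmtest::bptest(model) provides it. The sketch below computes the studentized (Koenker) version by hand with base R only, so the mechanics are visible: regress the squared residuals on the predictors and use n * R2 as a chi-square statistic.

```r
# Studentized Breusch-Pagan test by hand (illustrative sketch, base R only)
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the same predictor
aux_data <- data.frame(res2 = residuals(fit)^2, speed = cars$speed)
aux <- lm(res2 ~ speed, data = aux_data)

# LM statistic = n * R^2 of the auxiliary regression,
# chi-square with (number of predictors) degrees of freedom
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p = bp_p))
```

A small p-value would suggest that the residual variance changes with speed; H0 here is constant variance.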

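Multiple linear regression (MLR)

- MLR : Multiple Linear Regression

* Regression model

- Regression equation: y = beta0 + beta1*x1 + ... + betap*xp + error

- Regression coefficients beta0, beta1, ..., betap

* Significance test of the regression

- Goodness of the regression equation: R2 = SSR/SST
- ANOVA F-test, p-value
- SST = SSR + SSE

* Checking assumptions (regression diagnostics)

- Normality: shapiro.test()

- Homoscedasticity

- Independence

- Multicollinearity: vif() from the car package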

# Load datasets - MASS - Cars93
library(MASS)
str(Cars93)
'data.frame':   93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
# Check missing value
table(is.na(Cars93))
FALSE  TRUE 
 2498    13 
#
# Select Xi and Y(Price) variables
cars93<-Cars93[,c('Price','MPG.city','EngineSize','RPM','Fuel.tank.capacity','Weight')]
str(cars93)
'data.frame':   93 obs. of  6 variables:
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
table(is.na(cars93))
FALSE 
  558
# Check the correlation matrix
cor(cars93) # Weight has the strongest correlation with Price (0.647)
                          Price   MPG.city EngineSize          RPM Fuel.tank.capacity     Weight
Price               1.000000000 -0.5945622  0.5974254 -0.004954931          0.6194800  0.6471790
MPG.city           -0.594562163  1.0000000 -0.7100032  0.363045129         -0.8131444 -0.8431385
EngineSize          0.597425392 -0.7100032  1.0000000 -0.547897805          0.7593062  0.8450753
RPM                -0.004954931  0.3630451 -0.5478978  1.000000000         -0.3333452 -0.4279315
Fuel.tank.capacity  0.619479981 -0.8131444  0.7593062 -0.333345218          1.0000000  0.8940181
Weight              0.647179005 -0.8431385  0.8450753 -0.427931473          0.8940181  1.0000000

# MLR - base -lm()
model<-lm(Price~., cars93)
summary(model)
Call:
lm(formula = Price ~ ., data = cars93)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.344  -3.552  -0.556   2.252  35.390 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -38.107015  14.098134  -2.703  0.00826 ** 
MPG.city            -0.290017   0.231941  -1.250  0.21451    
EngineSize           4.303060   1.329544   3.236  0.00171 ** 
RPM                  0.007066   0.001378   5.127 1.76e-06 ***
Fuel.tank.capacity   0.111959   0.481680   0.232  0.81675    
Weight               0.004375   0.003386   1.292  0.19973    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.508 on 87 degrees of freedom
Multiple R-squared:  0.5707,    Adjusted R-squared:  0.546 
F-statistic: 23.13 on 5 and 87 DF,  p-value: 1.053e-14
#
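# Variable selection by AIC
# step() has no 'method' argument; the argument is 'direction'.
# The trace below is backward elimination.
model2<-step(model, direction="backward")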
Start:  AIC=354.19
Price ~ MPG.city + EngineSize + RPM + Fuel.tank.capacity + Weight

                     Df Sum of Sq    RSS    AIC
- Fuel.tank.capacity  1      2.29 3687.3 352.24
- MPG.city            1     66.22 3751.2 353.84
- Weight              1     70.72 3755.7 353.95
<none>                            3685.0 354.19
- EngineSize          1    443.68 4128.7 362.76
- RPM                 1   1113.58 4798.6 376.74

Step:  AIC=352.24
Price ~ MPG.city + EngineSize + RPM + Weight

             Df Sum of Sq    RSS    AIC
- MPG.city    1     77.37 3764.6 352.18
<none>                    3687.3 352.24
- Weight      1    122.89 3810.2 353.29
- EngineSize  1    450.82 4138.1 360.97
- RPM         1   1152.50 4839.8 375.54

Step:  AIC=352.18
Price ~ EngineSize + RPM + Weight

             Df Sum of Sq    RSS    AIC
<none>                    3764.6 352.18
- EngineSize  1    446.63 4211.3 360.60
- Weight      1    480.84 4245.5 361.35
- RPM         1   1147.43 4912.1 374.92
#
summary(model2)
Call:
lm(formula = Price ~ EngineSize + RPM + Weight, data = cars93)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.511  -3.806  -0.300   1.447  35.255 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -51.793292   9.106309  -5.688 1.62e-07 ***
EngineSize    4.305387   1.324961   3.249  0.00163 ** 
RPM           0.007096   0.001363   5.208 1.22e-06 ***
Weight        0.007271   0.002157   3.372  0.00111 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.504 on 89 degrees of freedom
Multiple R-squared:  0.5614,    Adjusted R-squared:  0.5467 
F-statistic: 37.98 on 3 and 89 DF,  p-value: 6.746e-16
#
# normality test
shapiro.test(model2$residuals)
Shapiro-Wilk normality test

data:  model2$residuals
W = 0.85873, p-value = 5.976e-08
#
# multicollinearity
install.packages("car")
library(car)
vif(model2) # variance inflation factor; values below 10 are usually considered acceptable
EngineSize        RPM     Weight 
  4.108869   1.437810   3.520026 
#
# THE END
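
For reference, the variance inflation factor of a predictor is 1 / (1 - R2_j), where R2_j comes from regressing that predictor on the other predictors. The sketch below recomputes the three VIF values with base R and MASS only (no car package); vif_manual and the helper logic are illustrative names, not part of the original notes.

```r
# VIF by hand: regress each predictor on the others (illustrative sketch)
library(MASS)  # provides Cars93

predictors <- c("EngineSize", "RPM", "Weight")

vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Auxiliary regression of predictor v on the remaining predictors
  aux <- lm(reformulate(others, response = v), data = Cars93)
  1 / (1 - summary(aux)$r.squared)
})

print(vif_manual)
```

The result should match the vif(model2) output above, since model2 contains only these three numeric predictors.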

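Generalized linear model (GLM)

- GLM : Generalized Linear Model

- Y may be a binomial, multinomial, or Poisson variable, or follow a Gaussian or other probability distribution

- Uses a linear predictor and a link function

- link function(E[Y]) = linear predictor (beta0 + beta1*x1 + ...)

- glm(formula, family, data, ...)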
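
The notes give only the glm() signature, so here is a minimal sketch of a logistic regression to show how the link function connects the linear predictor to the fitted mean. The dataset (mtcars) and the model am ~ wt are my own choice for illustration and do not come from the original notes.

```r
# Logistic regression with glm() (illustrative sketch; dataset choice is an assumption)
# am: transmission (0 = automatic, 1 = manual), wt: weight
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
print(coef(summary(fit)))

eta <- predict(fit, type = "link")      # linear predictor beta0 + beta1 * wt
mu <- predict(fit, type = "response")   # fitted probability E[Y]

# The inverse logit link maps the linear predictor to the fitted mean
print(all.equal(unname(plogis(eta)), unname(mu)))
```

Changing family (for example poisson or gaussian) changes the assumed distribution of Y and the default link, while the formula interface stays the same.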

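GOF (goodness-of-fit) test

- GOF : Goodness-of-Fit

- Null hypothesis (H0) : the population distribution is the XX distribution.

Distribution	Package	Function
Normal	stats (base R)	ks.test(data, "pnorm", mean, sd)
	stats (base R)	shapiro.test(data)
	nortest	ad.test(data)
Exponential	stats (base R)	ks.test(data, "pexp", rate)
Poisson	stats (base R)	ks.test(data, "ppois", lambda)

Note) ks.test() assumes a continuous distribution, so for Poisson counts the result is only approximate. Without the distribution parameters, ks.test() compares against the default parameters (for example N(0, 1) for "pnorm").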

# Load dataset iris
str(iris)
# Check missing values
table(is.na(iris))
#
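# Normality test
# Without mean and sd, ks.test() would compare against N(0, 1)
ks.test(iris$Sepal.Length, "pnorm", mean(iris$Sepal.Length), sd(iris$Sepal.Length))
shapiro.test(iris$Sepal.Length)
#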
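
The table above also lists the exponential distribution, for which the notes show no run. The sketch below uses simulated data (my assumption, since no exponential dataset appears in the notes) and tests it against an exponential distribution with a known, hypothesized rate.

```r
# KS goodness-of-fit test against an exponential distribution (illustrative sketch)
set.seed(1)                  # reproducible simulated sample
x <- rexp(100, rate = 2)     # simulated data; not from the original notes

# H0: x follows an exponential distribution with rate = 2
ks_exp <- ks.test(x, "pexp", rate = 2)
print(ks_exp)
```

If the rate were estimated from the same sample (for example 1 / mean(x)) instead of being fixed in advance, the reported p-value would only be approximate.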

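Independence test

- Relationship between two or more categorical variables

- Two classification variables are observed in a single population

- Analysis of cross tables, two-way classification tables, contingency tables

- Null hypothesis (H0): the two variables are independent (that is, unrelated).
- Performed with chisq.test() (also used for the homogeneity test)

- Fisher's exact test (when 20% or more of the cells have an expected count of 5 or less) : fisher.test()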

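# Contingency table (cross-tabulation)
# Counts for each combination of the two variables
library(MASS) # provides the survey dataset
table(survey$Sex, survey$Smoke) 
         Heavy Never Occas Regul
  Female     5    99     9     5
  Male       6    89    10    12
# Independence test - chisq.test()

# Independence test between sex and smoking status
chisq.test(survey$Sex, survey$Smoke)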
 Pearson's Chi-squared test

data:  survey$Sex and survey$Smoke
X-squared = 3.5536, df = 3, p-value = 0.3139
#
# fisher exact test
fisher.test(survey$Sex, survey$Smoke)
 Fisher's Exact Test for Count Data

data:  survey$Sex and survey$Smoke
p-value = 0.3105
alternative hypothesis: two.sided
#
# THE END
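
The rule of thumb for switching to Fisher's exact test depends on the expected counts, which chisq.test() returns in its result object. The sketch below inspects them and picks a test accordingly; the 20% / count-of-5 threshold follows the rule quoted in these notes, and the names tab, res and low_share are illustrative.

```r
# Inspect expected counts before choosing chisq.test() or fisher.test() (illustrative sketch)
library(MASS)  # provides the survey dataset

tab <- table(survey$Sex, survey$Smoke)
res <- chisq.test(tab)

# Expected counts under independence: row total * column total / grand total
print(round(res$expected, 2))

# Share of cells with an expected count of 5 or less
low_share <- mean(res$expected <= 5)
print(low_share)

if (low_share >= 0.2) {
  print(fisher.test(tab)$p.value)  # small expected counts: use the exact test
} else {
  print(res$p.value)               # chi-square approximation is acceptable
}
```

Passing the table to chisq.test() gives the same test as passing the two factors, as in the session above.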

License: Attribution, NonCommercial, NoDerivatives