
[Big Data Analysis Engineer Practical Exam, Part 2] Study Practice - Machine Learning, Classification

에스오에스데이터랩 2025. 5. 25. 13:02

Note) These notes are for my own study and will keep being updated during the study period. They are not written for visitors. Typing practice! ^^

 

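Variable names used at each machine learning step

Step  Description                             Variable
1     Preprocess the dataset.                 -
2     Split the dataset.                      train, test
3     Build the machine learning model.       model
4     Classify the test set.                  yhat
5     Validate against the actual classes.    -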

 

 

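Libraries and functions by ML model

ML classification model            R library      Function
K-nearest neighbors (KNN)          class          knn()
Naive Bayes                        e1071          naiveBayes()
Multinomial logistic regression    nnet           multinom()
Artificial neural network (ANN)    nnet           nnet()
Decision tree (DT)                 rpart          rpart()
Support vector machine (SVM)       kernlab        ksvm()
Random forest                      randomForest   randomForest()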

 

 

 

ConfusionMatrix

                        test$Species (actual)
                        TRUE     FALSE
yhat         POSITIVE   T.P      F.P
(predicted)  NEGATIVE   F.N      T.N

 

accuracy = (T.P + T.N) ÷ ( T.P + T.N + F.P + F.N)

sensitivity = T.P ÷ ( T.P + F.N)

specificity = T.N ÷ ( T.N + F.P)
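
The three metrics can be checked numerically. Below is a minimal base R sketch using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The counts are made-up numbers chosen only for illustration, not results from these notes.

```r
# Hypothetical counts for a binary classifier (illustrative values only)
TP <- 40; FP <- 5
FN <- 10; TN <- 45

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of all cases classified correctly
sensitivity <- TP / (TP + FN)                   # share of actual positives that were found
specificity <- TN / (TN + FP)                   # share of actual negatives that were rejected

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

With these counts, accuracy is 85/100, sensitivity is 40/50, and specificity is 45/50.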


 

# Load datasets - base - read.csv()

# Uses the built-in iris dataset

# Check missing values - base - table(is.na())

table(is.na(iris))  # count the NA values

FALSE 
  750 
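
table(is.na(iris)) reports only the overall count. To see which column any missing values would sit in, a per-column count is one option. This is a small sketch in base R; the variable names are my own.

```r
# Count missing values per column of the built-in iris dataset
na_per_column <- colSums(is.na(iris))
print(na_per_column)

# Total number of cells checked: 150 rows x 5 columns
total_cells <- nrow(iris) * ncol(iris)
print(total_cells)
```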

 

Machine Learning - Splitting the Dataset

 

# Split datasets - caret - createDataPartition()

idx <- createDataPartition()  # random index sampling without replacement that respects class frequencies

train <- ...   # build the training set

test <- .....   # build the test set

table(train$Species); table(test$Species)  # print the number of rows per Species

R code

# Split datasets - caret - createDataPartition()
library(caret)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)  # stratified 70% sample of row indices
train <- iris[idx, ]   # training set
test <- iris[-idx, ]   # test set
table(train$Species); table(test$Species)

 

R code output

> table(train$Species);table(test$Species)

    setosa versicolor  virginica 
        35         35         35 

    setosa versicolor  virginica 
        15         15         15 
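
createDataPartition() samples within each class, which is why every species appears 35 times in train and 15 times in test. The base R sketch below imitates that idea by hand. It is not caret's actual implementation, and the helper name stratified_index and the seed are made up for illustration.

```r
# Hypothetical helper: draw a fraction p of row indices from each class, without replacement
stratified_index <- function(y, p = 0.7) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)  # arbitrary seed so the split can be repeated
idx   <- stratified_index(iris$Species, p = 0.7)
train <- iris[idx, ]
test  <- iris[-idx, ]

table(train$Species)
table(test$Species)
```

Because the sampling is done per class, the 70/30 ratio holds inside every species rather than only on average.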

 

# K-fold cross validation
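
The heading above is left without code in these notes. As one possible way to fill it in, here is a sketch of 5-fold cross-validation for KNN written with base R and class::knn(). The fold count, k = 3, and the seed are my own assumptions.

```r
library(class)  # provides knn()

set.seed(42)                     # arbitrary seed for a repeatable fold assignment
n_folds <- 5
# Assign each row to one of the folds at random, with equal fold sizes here (150 / 5)
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(n_folds), function(f) {
  train <- iris[fold_id != f, ]
  test  <- iris[fold_id == f, ]
  yhat  <- knn(train = train[, -5], test = test[, -5], cl = train$Species, k = 3)
  mean(yhat == test$Species)     # accuracy on the held-out fold
})

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```

Each row is used for testing exactly once, so the averaged accuracy depends less on a single lucky or unlucky split.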

 

Machine Learning - Classification Models

 

Target variable: $Species (categorical variable, the class label)

Explanatory variables: the 4 columns other than $Species (continuous variables)

 

#1. KNN - class - knn()

yhat <- knn( )  # classify the test set by nearest neighbors, using the training set and test set together

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values

R code

# KNN - class - knn()
library(class)
yhat<-knn(train[,-5], cl=train$Species, test[,-5], k=3)
confusionMatrix(yhat, test$Species)

 

R code output

# confusionMatrix(yhat, test$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         1
  virginica       0          2        14

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.8173, 0.986)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.9333
Specificity                 1.0000            0.9667           0.9333
Pos Pred Value              1.0000            0.9286           0.8750
Neg Pred Value              1.0000            0.9355           0.9655
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3111
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9167           0.9333
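
The per-class figures in this output come from treating each species in turn as the positive class and the other two as negative. The sketch below re-enters the KNN confusion matrix shown above by hand and recomputes accuracy, sensitivity, specificity, and Kappa in base R, so no package is needed. The variable names are my own.

```r
classes <- c("setosa", "versicolor", "virginica")
# Rows = prediction, columns = reference, copied from the KNN output
cm <- matrix(c(15,  0,  0,
                0, 13,  1,
                0,  2, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
TP <- diag(cm)
FP <- rowSums(cm) - TP   # predicted as the class, but actually another class
FN <- colSums(cm) - TP   # actually the class, but predicted as another class
TN <- n - TP - FP - FN

accuracy    <- sum(TP) / n
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

# Cohen's Kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(rbind(sensitivity, specificity), 4)
round(cohen_kappa, 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Kappa lines printed by confusionMatrix().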

 

#2. NB - e1071 - naiveBayes()

model <- naiveBayes()  # build a Naive Bayes model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values

R code

library(e1071)
model<-naiveBayes(Species~.,train, laplace=1)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         14         0
  virginica       0          1        15

Overall Statistics
                                          
               Accuracy : 0.9778          
                 95% CI : (0.8823, 0.9994)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9667          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9333           1.0000
Specificity                 1.0000            1.0000           0.9667
Pos Pred Value              1.0000            1.0000           0.9375
Neg Pred Value              1.0000            0.9677           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3111           0.3333
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9667           0.9833

 

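#3. Logistic regression - nnet - multinom()

model <- multinom()  # build a multinomial logistic regression model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values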

R code

library(nnet)
model<-multinom(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         0
  virginica       0          2        15

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           1.0000
Specificity                 1.0000            1.0000           0.9333
Pos Pred Value              1.0000            1.0000           0.8824
Neg Pred Value              1.0000            0.9375           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3333
Detection Prevalence        0.3333            0.2889           0.3778
Balanced Accuracy           1.0000            0.9333           0.9667

 

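#4. ANN - nnet - nnet()

model <- nnet()  # build a neural network model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values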

R code

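library(nnet)
model<-nnet(Species~., train, size=3)
yhat<-predict(model, test, type="class")
yhat<-as.factor(yhat)  # predict() returns character labels here, so convert to a factor
confusionMatrix(yhat, test$Species)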

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         0
  virginica       0          2        15

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           1.0000
Specificity                 1.0000            1.0000           0.9333
Pos Pred Value              1.0000            1.0000           0.8824
Neg Pred Value              1.0000            0.9375           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3333
Detection Prevalence        0.3333            0.2889           0.3778
Balanced Accuracy           1.0000            0.9333           0.9667

 

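#5. DT - rpart - rpart()

model <- rpart()  # build a decision tree model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values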

R code

library(rpart)
model<-rpart(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         3
  virginica       0          2        12

Overall Statistics
                                          
               Accuracy : 0.8889          
                 95% CI : (0.7595, 0.9629)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 1.408e-14       
                                          
                  Kappa : 0.8333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.8000
Specificity                 1.0000            0.9000           0.9333
Pos Pred Value              1.0000            0.8125           0.8571
Neg Pred Value              1.0000            0.9310           0.9032
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.2667
Detection Prevalence        0.3333            0.3556           0.3111
Balanced Accuracy           1.0000            0.8833           0.8667

 

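#6. SVM - kernlab - ksvm()

model <- ksvm()  # build a support vector machine model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values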

R code

library(kernlab)
model<-ksvm(Species~., train, kernel="rbfdot")
yhat<-predict(model, test)
confusionMatrix(yhat, test$Species)

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         0
  virginica       0          2        15

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           1.0000
Specificity                 1.0000            1.0000           0.9333
Pos Pred Value              1.0000            1.0000           0.8824
Neg Pred Value              1.0000            0.9375           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3333
Detection Prevalence        0.3333            0.2889           0.3778
Balanced Accuracy           1.0000            0.9333           0.9667

 

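#7. RF - randomForest - randomForest()

model <- randomForest()  # build a random forest model from the training set

yhat <- predict(model, ....)  # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...)  # compare the predicted yhat with the actual values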

R code

library(randomForest)
model<-randomForest(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

 

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         0
  virginica       0          2        15

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           1.0000
Specificity                 1.0000            1.0000           0.9333
Pos Pred Value              1.0000            1.0000           0.8824
Neg Pred Value              1.0000            0.9375           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3333
Detection Prevalence        0.3333            0.2889           0.3778
Balanced Accuracy           1.0000            0.9333           0.9667

 

Accuracy by model

Model                              Accuracy
K-nearest neighbors (KNN)          0.9333
Naive Bayes                        0.9778
Multinomial logistic regression    0.9556
Artificial neural network (ANN)    0.8889
Decision tree (DT)                 0.8889
Support vector machine (SVM)       0.9556
Random forest                      0.9556

 

 


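Machine learning steps and variable names

Step  Description                             Variable
1     Preprocess the dataset.                 -
2     Split the dataset.                      train, test
3     Build the machine learning model.       model
4     Classify the test set.                  yhat
5     Validate against the actual classes.    -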

 

Handwritten recall

 

 

Write it out every morning...

Write it out whenever it comes to mind...

 

 

Complete R code

> # ML - Classification
> 
> # Split dataset - caret - createDataPartition()
> library(caret)
> idx<-createDataPartition(iris$Species, p=0.7, list=F)
> train<-iris[idx,]
> test<-iris[-idx,]
> table(train$Species) ; table(test$Species)
>
> #1 KNN -class -knn()
> library(class)
> yhat<-knn(train[,-5], cl=train$Species, test[,-5], k=3)
> confusionMatrix(yhat, test$Species)
> 
> #2 NB - e1071 -naiveBayes()
> library(e1071)
> model<-naiveBayes(Species~., train, laplace=1)
> yhat<-predict(model, test, type="class")
> confusionMatrix(yhat, test$Species)
>               
> #LR - nnet - multinom()
> library(nnet)
> model<-multinom(Species~., train)
# weights:  18 (10 variable)
initial  value 115.354290 
iter  10 value 13.863136
iter  20 value 5.725088
iter  30 value 5.284753
iter  40 value 5.225208
iter  50 value 5.209580
iter  60 value 5.204902
iter  70 value 5.203633
iter  80 value 5.202805
final  value 5.202441 
converged
> yhat<-predict(model, test, type="class")
> confusionMatrix(yhat, test$Species)
>
> # ANN - nnet - nnet()
> library(nnet)
> model<-nnet(Species~., train, size=3)
# weights:  27
initial  value 117.328797 
iter  10 value 48.757928
iter  20 value 48.532610
iter  30 value 48.520212
iter  40 value 48.519961
iter  50 value 48.513659
iter  60 value 47.228537
iter  70 value 6.007694
iter  80 value 5.407288
iter  90 value 5.253816
iter 100 value 5.213450
final  value 5.213450 
stopped after 100 iterations
> yhat<-predict(model, test, type="class")
> yhat<-as.factor(yhat) # as a factor
> confusionMatrix(yhat, test$Species)
> 
> # DT - rpart - rpart()
> library(rpart)
> model<-rpart(Species~.,train)
> yhat<-predict(model, test, type="class")
>  confusionMatrix(yhat, test$Species)
> 
> # SVM - kernlab - ksvm()
> library(kernlab)

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

> model<-ksvm(Species~., train, kernel="rbfdot")
>  yhat<-predict(model, test)
>  confusionMatrix(yhat, test$Species)
> 
> # RF - randomForest -randomForest()
> library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

> model<-randomForest(Species~., train)
> yhat<-predict(model, test)
> confusionMatrix(yhat, test$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          0        15

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9213, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000
>

 

Depending on how the data is split, the accuracy can even come out as 1.0!
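
To see how much the result moves with the split, the same model can be run over several random splits. This is a sketch that assumes the caret and class packages used earlier are installed. The seeds 1 to 20 and the choice of KNN with k = 3 are arbitrary.

```r
library(caret)  # createDataPartition()
library(class)  # knn()

split_accuracy <- sapply(1:20, function(seed) {
  set.seed(seed)  # a different seed gives a different 70/30 split
  idx   <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  yhat  <- knn(train = train[, -5], test = test[, -5], cl = train$Species, k = 3)
  mean(yhat == test$Species)  # accuracy for this particular split
})

range(split_accuracy)    # lowest and highest accuracy across the splits
summary(split_accuracy)
```

Calling set.seed() before the split also makes any single run repeatable, which helps when comparing models on the same train and test sets.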