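[Big Data Analysis Engineer Practical Exam, Part 2] Study Log - Machine Learning, Classification

SOS Data Lab (에스오에스데이터랩), 2025. 5. 25. 13:02

Note: this post is for my own study and will be updated continually throughout the study period. It is not written for visitors. Typing practice! ^^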

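Variable names used at each machine learning step

Step	Description	Variable name
1	Preprocess the dataset.	-
2	Split the dataset.	train, test
3	Build the machine learning model.	model
4	Classify the test set.	yhat
5	Validate against the actual class labels.	-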

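Libraries and functions by ML model

ML classification model	R library	Function
K-nearest neighbors (KNN)	class	knn( )
Naive Bayes	e1071	naiveBayes( )
Multinomial logistic regression	nnet	multinom( )
Artificial neural network (ANN)	nnet	nnet( )
Decision tree (DT)	rpart	rpart( )
Support vector machine (SVM)	kernlab	ksvm( )
Random forest	randomForest	randomForest( )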

Confusion matrix

		test$Species (actual)
		POSITIVE	NEGATIVE
yhat (predicted)	POSITIVE	T.P	F.P
yhat (predicted)	NEGATIVE	F.N	T.N

accuracy = (T.P + T.N) ÷ (T.P + T.N + F.P + F.N)

sensitivity = T.P ÷ (T.P + F.N)

specificity = T.N ÷ (T.N + F.P)
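The three metrics can be checked with a few lines of base R. This is a minimal sketch using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The function name `binary_metrics` and the counts passed to it are made up for illustration and are not results from this post.

```r
# Compute accuracy, sensitivity, and specificity from 2x2 confusion-matrix counts.
# binary_metrics() is a hypothetical helper; the counts below are illustration values.
binary_metrics <- function(tp, fp, fn, tn) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),  # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
  )
}

m <- binary_metrics(tp = 40, fp = 5, fn = 10, tn = 45)
print(round(m, 4))
```

With these counts, 50 cases are actually positive (40 found, 10 missed) and 50 are actually negative (45 found, 5 missed).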

# Load datasets - base - read.csv()

# Uses the built-in iris dataset

# Check missing values - base - table(is.na())

table(is.na(iris)) # count the NA values

FALSE
750
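The single count of 750 is 150 rows times 5 columns, all non-missing. When a dataset does contain NA values, a per-column count is usually more useful than one overall total. A small base-R sketch of that check (the variable names are my own):

```r
# Count missing values per column and count the rows that have no NA at all.
na_per_column <- colSums(is.na(iris))
print(na_per_column)

n_complete <- sum(complete.cases(iris))
print(n_complete)
```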

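Machine learning - splitting the dataset

# Split datasets - caret - createDataPartition()

idx <- createDataPartition() # random sampling of row indices without replacement, preserving class frequencies

train <- ... # build the training set

test <- ..... # build the test set

table(train$Species); table(test$Species) # print the Species counts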

R code

# Split datasets - caret - createDataPartition()
library(caret)
idx<-createDataPartition(iris$Species, p=0.7, list=F)
train<-iris[idx,]
test<-iris[-idx,]
table(train$Species);table(test$Species)

R code output

> table(train$Species);table(test$Species)

    setosa versicolor  virginica
        35         35         35

    setosa versicolor  virginica
        15         15         15
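The equal 35/15 counts per species come from sampling within each class. The same idea can be sketched in base R as below. This is not caret's implementation, only an illustration of stratified sampling; the seed value is arbitrary and the variable name `rows_by_class` is my own.

```r
# Stratified 70/30 split in base R: sample 70% of the row indices within each class.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train <- iris[idx, ]
test  <- iris[-idx, ]
print(table(train$Species))
print(table(test$Species))
```

Setting a seed before the split also makes the later accuracy figures repeatable from run to run.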

# K-fold cross validation
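The heading above names K-fold cross validation, but the post gives no code for it. Below is a minimal base-R sketch, assuming 5 folds and the k=3 KNN classifier used elsewhere in this post. The fold count, the seed, and the unstratified fold assignment are my assumptions, not the author's method. caret provides helpers such as createFolds() for the same purpose; the sketch avoids them to stay self-contained.

```r
library(class)  # knn(); a recommended package bundled with most R distributions

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
k_folds <- 5
# Assign each row to one of k folds at random (not stratified by class).
fold <- sample(rep(seq_len(k_folds), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(k_folds), function(f) {
  train <- iris[fold != f, ]  # k-1 folds for training
  test  <- iris[fold == f, ]  # 1 fold held out for testing
  yhat  <- knn(train[, -5], test[, -5], cl = train$Species, k = 3)
  mean(yhat == test$Species)  # accuracy on the held-out fold
})

print(round(fold_accuracy, 4))
print(mean(fold_accuracy))  # cross-validated accuracy estimate
```

Averaging over folds gives an estimate that depends less on one particular train/test split.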

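Machine learning - classification models

Target variable: $Species (categorical, class variable)

Explanatory variables: the 4 columns other than $Species (continuous variables)

#1. KNN - class - knn()

yhat <- knn( ) # build the nearest-neighbor model from the training and test sets in a single call

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values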

R code

# KNN - class - knn()
library(class)
yhat<-knn(train[,-5], cl=train$Species, test[,-5], k=3)
confusionMatrix(yhat, test$Species)

R code output

# confusionMatrix(yhat, test$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         1
  virginica       0          2        14

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.8173, 0.986)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9

Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.9333
Specificity                 1.0000            0.9667           0.9333
Pos Pred Value              1.0000            0.9286           0.8750
Neg Pred Value              1.0000            0.9355           0.9655
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3111
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9167           0.9333
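The per-class figures in this output can be reproduced by hand by treating each species as "positive" and the other two as "negative" (one-vs-rest). A sketch in base R, with the matrix typed in from the KNN output above; the variable names are my own.

```r
# Confusion matrix copied from the KNN output (rows = prediction, columns = reference).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(15,  0,  0,
                0, 13,  1,
                0,  2, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)

# One-vs-rest counts for each class.
tp <- diag(cm)
fp <- rowSums(cm) - tp       # predicted as the class, actually another class
fn <- colSums(cm) - tp       # actually the class, predicted as another class
tn <- sum(cm) - tp - fp - fn # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(accuracy, 4))
print(round(rbind(sensitivity, specificity), 4))
```

For versicolor this gives 13/15 for sensitivity and 29/30 for specificity, matching the 0.8667 and 0.9667 reported by confusionMatrix().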

#2. NB - e1071 - naiveBayes()

model <- naiveBayes() # build the naive Bayes model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values

R code

library(e1071)
model<-naiveBayes(Species~.,train, laplace=1)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         14         0
  virginica       0          1        15

Overall Statistics

               Accuracy : 0.9778
                 95% CI : (0.8823, 0.9994)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9667

Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9333           1.0000
Specificity                 1.0000            1.0000           0.9667
Pos Pred Value              1.0000            1.0000           0.9375
Neg Pred Value              1.0000            0.9677           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3111           0.3333
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9667           0.9833

#3. Logistic regression - nnet - multinom()

model <- multinom() # build the multinomial logistic regression model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values

R code

library(nnet)
model<-multinom(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         0
  virginica       0          2        15

Overall Statistics

               Accuracy : 0.9556
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9333

Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           1.0000
Specificity                 1.0000            1.0000           0.9333
Pos Pred Value              1.0000            1.0000           0.8824
Neg Pred Value              1.0000            0.9375           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3333
Detection Prevalence        0.3333            0.2889           0.3778
Balanced Accuracy           1.0000            0.9333           0.9667

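#4. ANN - nnet - nnet()

model <- nnet() # build the artificial neural network model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values

R code

library(nnet)
model<-nnet(Species~., train, size=3)
yhat<-predict(model, test, type="class")
yhat<-as.factor(yhat) # predict() returns a character vector here, so convert it to a factor
confusionMatrix(yhat, test$Species)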
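The as.factor() step is needed because predict() on an nnet model with type="class" returns a character vector, not a factor. Because as.factor() builds its levels only from the values that occur, a species the network never predicts would be missing from the levels. The sketch below sets the levels explicitly instead. It uses a simple random split and base table() rather than caret; the seed, the split, and trace = FALSE are assumptions made to keep the example self-contained.

```r
library(nnet)  # nnet(); a recommended package bundled with most R distributions

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
idx   <- sample(nrow(iris), 105)  # simple random 70/30 split, assumed for this sketch
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- nnet(Species ~ ., data = train, size = 3, trace = FALSE)
raw   <- predict(model, test, type = "class")
print(class(raw))  # character, not factor

# Fix the levels explicitly so a class that is never predicted still appears.
yhat <- factor(raw, levels = levels(test$Species))
print(table(Prediction = yhat, Reference = test$Species))
```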

R code output

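#5. DT - rpart - rpart()

model <- rpart() # build the decision tree model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values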

R code

library(rpart)
model<-rpart(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)
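The type="class" argument matters here. For a classification tree, predict() without it returns a matrix of class probabilities, which confusionMatrix() cannot compare with test$Species. A short sketch of the difference, assuming a simple random split with an arbitrary seed:

```r
library(rpart)  # rpart(); a recommended package bundled with most R distributions

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
idx   <- sample(nrow(iris), 105)  # simple random 70/30 split, assumed for this sketch
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- rpart(Species ~ ., data = train)

prob <- predict(model, test)                  # default: one probability column per class
yhat <- predict(model, test, type = "class")  # factor of predicted labels

print(head(prob, 3))
print(head(yhat, 3))
```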

R code output

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         3
  virginica       0          2        12

Overall Statistics

               Accuracy : 0.8889
                 95% CI : (0.7595, 0.9629)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 1.408e-14

                  Kappa : 0.8333

Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.8000
Specificity                 1.0000            0.9000           0.9333
Pos Pred Value              1.0000            0.8125           0.8571
Neg Pred Value              1.0000            0.9310           0.9032
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.2667
Detection Prevalence        0.3333            0.3556           0.3111
Balanced Accuracy           1.0000            0.8833           0.8667

#6. SVM - kernlab - ksvm()

model <- ksvm() # build the support vector machine model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values

R code

library(kernlab)
model<-ksvm(Species~., train, kernel="rbfdot")
yhat<-predict(model, test)
confusionMatrix(yhat, test$Species)

R code output

#7. RF - randomForest - randomForest()

model <- randomForest() # build the random forest model from the training set

yhat <- predict(model, ....) # feed the test set to the model and store the predicted classes in yhat

confusionMatrix(yhat, ...) # compare the predicted yhat with the actual values

R code

library(randomForest)
model<-randomForest(Species~., train)
yhat<-predict(model, test, type="class")
confusionMatrix(yhat, test$Species)

R code output

Accuracy by model

Model	Accuracy
K-nearest neighbors (KNN)	0.9333
Naive Bayes	0.9778
Multinomial logistic regression	0.9556
Artificial neural network (ANN)	0.8889
Decision tree (DT)	0.8889
Support vector machine (SVM)	0.9556
Random forest	0.9556

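Machine learning steps and variable names

Step	Description	Variable name
1	Preprocess the dataset.	-
2	Split the dataset.	train, test
3	Build the machine learning model.	model
4	Classify the test set.	yhat
5	Validate against the actual class labels.	-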

Handwritten recall

Written every morning...

Written whenever it comes to mind...

Full R code

> # ML - Classification
> 
> # Split dataset - caret - createDataPartition()
> library(caret)
> idx<-createDataPartition(iris$Species, p=0.7, list=F)
> train<-iris[idx,]
> test<-iris[-idx,]
> table(train$Species) ; table(test$Species)
>
> #1 KNN -class -knn()
> library(class)
> yhat<-knn(train[,-5], cl=train$Species, test[,-5], k=3)
> confusionMatrix(yhat, test$Species)
> 
> #2 NB - e1071 -naiveBayes()
> library(e1071)
> model<-naiveBayes(Species~., train, laplace=1)
> yhat<-predict(model, test, type="class")
> confusionMatrix(yhat, test$Species)
>               
> #LR - nnet - multinom()
> library(nnet)
> model<-multinom(Species~., train)
# weights:  18 (10 variable)
initial  value 115.354290 
iter  10 value 13.863136
iter  20 value 5.725088
iter  30 value 5.284753
iter  40 value 5.225208
iter  50 value 5.209580
iter  60 value 5.204902
iter  70 value 5.203633
iter  80 value 5.202805
final  value 5.202441 
converged
> yhat<-predict(model, test, type="class")
> confusionMatrix(yhat, test$Species)
>
> # ANN - nnet - nnet()
> library(nnet)
> model<-nnet(Species~., train, size=3)
# weights:  27
initial  value 117.328797 
iter  10 value 48.757928
iter  20 value 48.532610
iter  30 value 48.520212
iter  40 value 48.519961
iter  50 value 48.513659
iter  60 value 47.228537
iter  70 value 6.007694
iter  80 value 5.407288
iter  90 value 5.253816
iter 100 value 5.213450
final  value 5.213450 
stopped after 100 iterations
> yhat<-predict(model, test, type="class")
> yhat<-as.factor(yhat) # as a factor
> confusionMatrix(yhat, test$Species)
> 
> # DT - rpart - rpart()
> library(rpart)
> model<-rpart(Species~.,train)
> yhat<-predict(model, test, type="class")
>  confusionMatrix(yhat, test$Species)
> 
> # SVM - kernlab - ksvm()
> library(kernlab)

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

> model<-ksvm(Species~., train, kernel="rbfdot")
> yhat<-predict(model, test)
> confusionMatrix(yhat, test$Species)
> 
> # RF - randomForest -randomForest()
> library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

> model<-randomForest(Species~., train)
> yhat<-predict(model, test)
> confusionMatrix(yhat, test$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          0        15

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9213, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000
>

Depending on how the data happens to be split,

an accuracy of 1.0 can also come out!
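How much the accuracy moves with the split can be seen by repeating the split under different seeds. The sketch below does this for the k=3 KNN classifier with a base-R stratified split. The helper name `split_accuracy`, the 20 seeds, and the choice of KNN are my assumptions for illustration, not part of the original run.

```r
library(class)  # knn(); a recommended package bundled with most R distributions

# Accuracy of k=3 KNN on one stratified 70/30 split, determined by the seed.
# split_accuracy() is a hypothetical helper written for this sketch.
split_accuracy <- function(seed) {
  set.seed(seed)
  rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
  idx <- unlist(lapply(rows_by_class, sample, size = 35), use.names = FALSE)
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  yhat  <- knn(train[, -5], test[, -5], cl = train$Species, k = 3)
  mean(yhat == test$Species)
}

acc <- sapply(1:20, split_accuracy)
print(round(acc, 4))
print(range(acc))  # spread of accuracy across the 20 splits
```

With only 45 test rows, each misclassified flower changes the accuracy by about 0.022, so differences of a few points between models on a single split should be read with caution.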

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)


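* SOS Data Lab (에스오에스데이터랩), a growth partner for small and medium-sized businesses *

CEO / Consultant / Certified Management Consultant (Ministry of SMEs and Startups registration no. 12107). Contact: dsyoon63@gmail.com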
