Machine Learning (8): Gradient Descent, XGBoost, LightGBM, CatBoost

민서타 2023. 9. 30. 17:17

Gradient_Descent

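-The goal is to minimize a loss function: for f(x) = wx + b, the slope (weight) w and the y-intercept b are adjusted step by step until the predictions approach the targets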

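-If the learning rate is set too high, the updates can overshoot the minimum and diverge; if it is set too low, convergence is slow and the search can stall in a local minimum, so choosing this hyperparameter appropriately is important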
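The effect of the learning rate is easiest to see on a toy problem. The sketch below is an illustration that is not part of the original notes: it runs plain gradient descent on the assumed one-dimensional loss L(w) = (w - 3)^2 and compares a very small, a moderate, and an overly large step size.

```python
def minimize_quadratic(learning_rate, steps=50, w=0.0):
    """Gradient descent on the toy loss L(w) = (w - 3)**2 (an assumed example)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # dL/dw
        w = w - learning_rate * grad    # one update step
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 50 steps = {minimize_quadratic(lr):.4f}")
```

With the small rate the estimate has barely moved toward 3 after 50 steps, with the moderate rate it is essentially at 3, and with the large rate each step overshoots further and the estimate diverges.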

1. Generating the data

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

plt.rcParams['figure.figsize'] = (12.0, 9.0)

from sklearn.datasets import make_regression

# Generate data: number of samples, number of independent variables, number of dependent variables, standard deviation of the Gaussian noise

x, y = make_regression(n_samples=1000, n_features=1, n_targets=1, noise=5.0, random_state=42)

print(x.shape)

print(y.shape)

data = pd.DataFrame({"x" : x.reshape(-1, ), "y" : y})

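2. Implementing gradient descent

-The cost function is typically MSE or MAE (MAE is the more robust choice because it is far less affected by outliers); the gradient_descent function in this section minimizes MSE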
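To make the outlier remark concrete, here is a small sketch with made-up numbers (the values are hypothetical and chosen only for illustration). It computes both metrics on clean targets and again after one target is replaced by an outlier.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1]
y_true_outlier = y_true[:-1] + [50.0]   # last target replaced by an outlier

print("MSE:", mse(y_true, y_pred), "->", mse(y_true_outlier, y_pred))
print("MAE:", mae(y_true, y_pred), "->", mae(y_true_outlier, y_pred))
```

Because MSE squares each error, the single outlier inflates it by a far larger factor than it inflates MAE.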

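def gradient_descent(w=0.1, b=0.1, learning_rate=1e-2, max_iter=100, tol=1e-4):
    # x has shape (n, 1) and y has shape (n,); flatten x so that
    # y_hat - y stays element-wise instead of broadcasting to (n, n)
    x_flat = x.reshape(-1)
    for _ in range(max_iter):
        y_hat = w * x_flat + b
        error = ((y_hat - y) ** 2).mean()  # MSE
        if error <= tol:
            break
        # gradient steps scaled by the learning rate
        # (the constant factor 2 of the MSE gradient is absorbed into it)
        w_grad = learning_rate * ((y_hat - y) * x_flat).mean()
        b_grad = learning_rate * (y_hat - y).mean()
        w = w - w_grad
        b = b - b_grad
    return w, b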

w, b = gradient_descent(learning_rate=0.01, max_iter = 10000)

sns.scatterplot(data=data, x="x", y="y", label="Data")

plt.plot(x, w * x + b, color='red', label='Regression Line')

plt.xlabel("X")

plt.ylabel("Y")

plt.title("Linear Regression with Gradient Descent")

plt.legend()

plt.show()

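3. What went wrong during implementation

-At first the array dimensions were not matched (x has shape (1000, 1) while y has shape (1000,)), so the regression line was drawn as a wrong straight line; reshaping the arrays to the same shape produced the correct fit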
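The dimension problem comes from NumPy broadcasting. A minimal sketch on a five-point toy array (the arrays here are made up to mirror the shapes returned by make_regression) shows how the residual silently becomes a matrix:

```python
import numpy as np

x = np.arange(5, dtype=float).reshape(-1, 1)   # shape (5, 1), like make_regression's X
y = 2.0 * np.arange(5, dtype=float)            # shape (5,),  like make_regression's y

w, b = 2.0, 0.0
bad_residual = (w * x + b) - y                 # (5, 1) - (5,) broadcasts to (5, 5)
good_residual = (w * x.reshape(-1) + b) - y    # (5,) - (5,) stays element-wise

print(bad_residual.shape)    # (5, 5)
print(good_residual.shape)   # (5,)
```

No error is raised in the broadcast case, which is why the mistake only shows up as a wrong regression line.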

XGBoost

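-An improved GBM (much faster, implemented to support parallel training, and includes a penalty term that guards against overfitting)

-Supports both classification and regression, and handles missing values internally

-Convenient for datasets of roughly 1,000 <= N <= 30,000 rows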
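The penalty term mentioned above is usually written as part of a regularized objective. The form below follows the XGBoost documentation and is given here as a reference sketch; T (number of leaves in a tree), w_j (leaf weights), K (number of trees), and gamma are symbols that do not appear in these notes, while lambda and alpha correspond to the reg_lambda and reg_alpha parameters used further down.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

The first sum is the training loss; the second penalizes trees with many leaves or large leaf weights.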

1. Preparing the data

import os

import gc

import re

import pickle

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import warnings

warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix, f1_score

from xgboost import XGBClassifier, XGBRegressor

from collections import Counter

2. Data preprocessing # XGBoost does not accept the special characters [, ] and < in feature names; data = a pre-operative patient mortality dataset taken from GitHub

# Feature Name Cleaning

regex = re.compile(r"\[|\]|<", re.IGNORECASE)

data.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in data.columns.values]

col = []

missing = []

level = []

for name in data.columns:
    # share of missing values in the column
    miss = data[name].isnull().sum() / data.shape[0]
    missing.append(round(miss, 4))
    # number of distinct non-missing levels
    lel = data[name].dropna()
    level.append(len(list(set(lel))))
    # column name
    col.append(name)

tg_view = pd.concat([pd.DataFrame(col, columns = ['name']),
                     pd.DataFrame(missing, columns = ['Miss_Percentage']),
                     pd.DataFrame(level, columns = ['level'])], axis = 1)

drop_col = tg_view['name'][(tg_view['level'] <= 1) | (tg_view['Miss_Percentage'] >= 0.8)]

data.drop(columns = drop_col, inplace = True)

print(data.shape)
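The feature-name cleaning at the start of this step can be tried in isolation. The sketch below reuses the same regular expression on a few hypothetical column names (they are invented for illustration and are not from the dataset).

```python
import re

# Same pattern as in the notes: '[', ']' and '<' are each replaced by '_'
pattern = re.compile(r"\[|\]|<")

# Hypothetical column names, for illustration only
columns = ["age", "bmi[kg/m2]", "preop_hb<10", "sex"]
cleaned = [pattern.sub("_", name) for name in columns]

print(cleaned)   # ['age', 'bmi_kg/m2_', 'preop_hb_10', 'sex']
```

Since sub leaves a name unchanged when nothing matches, the extra membership check in the original one-liner is not strictly needed.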

3. Model training and evaluation

Y = data['censor']

X = data.drop(columns= 'censor')

idx = list(range(X.shape[0]))

train_idx, valid_idx = train_test_split(idx, test_size = 0.3, random_state = 42)

print(len(train_idx), len(valid_idx))

print(Counter(Y.iloc[train_idx]), Counter(Y.iloc[valid_idx]))

good_model = XGBClassifier(n_estimators = 15, learning_rate = 0.3, max_depth = 3, reg_alpha = 0.1, objective = 'binary:logistic', random_state = 49 )

good_model.fit(X.iloc[train_idx], Y.iloc[train_idx])

#Train test

y_pre_train = good_model.predict(X.iloc[train_idx])

cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)

print('Train ACC : {}'.format((cm_train[0,0] + cm_train[1,1]) / cm_train.sum()))

print("Train F1-Score : {}".format(f1_score(Y.iloc[train_idx], y_pre_train)))

print('-----------------------------')

#Test test

y_pred_test = good_model.predict(X.iloc[valid_idx])

cm_test = confusion_matrix(Y.iloc[valid_idx], y_pred_test)

print('Test ACC : {}'.format((cm_test[0,0] + cm_test[1,1]) / cm_test.sum()))

print("Test F1-Score : {}".format(f1_score(Y.iloc[valid_idx], y_pred_test)))
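Accuracy and F1 can disagree sharply when the classes are imbalanced, which is presumably why both are printed. The sketch below recomputes them by hand from 2x2 confusion-matrix counts; the counts are hypothetical, and the layout assumed is the one scikit-learn uses (rows are true labels, columns are predictions).

```python
def accuracy_and_f1(tn, fp, fn, tp):
    """Accuracy and positive-class F1 from 2x2 confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1

# Hypothetical counts: 90 true negatives, 2 false positives,
# 6 false negatives, 2 true positives
acc, f1 = accuracy_and_f1(tn=90, fp=2, fn=6, tp=2)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

In this made-up case the accuracy is high while the F1 score is low, because most of the rare positive class is missed.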

4. Key hyperparameter tuning

-max_depth, n_estimators, learning_rate, tree_method, colsample_bynode, reg_lambda

-max_depth : 5 ~ 15 # higher values raise the risk of overfitting

-n_estimators : 50 ~ 2000 # set together with learning_rate; the total number of boosting rounds

-learning_rate : 0.3 ~ 0.01 # scales each pseudo-residual (actual value - predicted value) correction

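-tree_method : 'exact', 'approx', 'gpu_hist' # how the data is used when searching for the best split ('gpu_hist' is deprecated since XGBoost 2.0 in favor of 'hist' with device='cuda')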

-colsample_bynode : 0.5 ~ 0.8 # local randomization

-reg_lambda : 0.5 ~ 5.0 # L2 norm
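One simple way to explore these ranges is random search. The sketch below only draws candidate settings from the ranges listed above; the sampling scheme (uniform for most parameters, log-uniform for learning_rate) and the helper name sample_xgb_params are assumptions made for illustration, not something prescribed in these notes.

```python
import math
import random

def sample_xgb_params(rng):
    """Draw one candidate from the ranges above (hypothetical helper)."""
    return {
        "max_depth": rng.randint(5, 15),
        "n_estimators": rng.randint(50, 2000),
        # log-uniform between 0.01 and 0.3, so small rates are explored too
        "learning_rate": 10 ** rng.uniform(-2.0, math.log10(0.3)),
        "colsample_bynode": rng.uniform(0.5, 0.8),
        "reg_lambda": rng.uniform(0.5, 5.0),
    }

rng = random.Random(42)
candidates = [sample_xgb_params(rng) for _ in range(3)]
for params in candidates:
    print(params)
```

Each dictionary could then be passed as XGBClassifier(**params) and scored on the validation split, keeping the setting with the best F1.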

LightGBM

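-Maximizes efficiency in how split points are found (low memory use, fast tree building)

-Likely to overfit when n_rows <= 10,000

-max_depth : 15 ~ 25 # higher values raise the risk of overfitting

-n_estimators : 50 ~ # set together with learning_rate; the total number of boosting rounds

-learning_rate : 0.3 ~ 0.00 # scales each pseudo-residual (actual value - predicted value) correction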

-num_leaves : keep at or below 2^max_depth, since 2^depth >= max(# leaf nodes)

-min_child_samples : 1~100

-colsample_bynode : 0.5 ~ 0.8 # the same setting has to apply at every node split, so it provides global randomization

-reg_lambda : 0.5 ~ 5.0 # L2 norm

-silent : -1
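LightGBM grows trees leaf-wise, so num_leaves is the main control on model complexity and has to stay consistent with max_depth. The helper below is a hypothetical sanity check for the num_leaves rule listed above, not part of the LightGBM API.

```python
def max_num_leaves(max_depth):
    """Upper bound on the leaves of a binary tree of the given depth."""
    return 2 ** max_depth

def check_num_leaves(num_leaves, max_depth):
    """True when num_leaves is reachable under the depth limit."""
    return 2 <= num_leaves <= max_num_leaves(max_depth)

print(max_num_leaves(15))          # 32768
print(check_num_leaves(31, 15))    # True
print(check_num_leaves(64, 5))     # False, since 2 ** 5 = 32
```

In practice num_leaves is usually set well below the bound, because a tree that fills every leaf at depth 15 or more would almost certainly overfit.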

CatBoost

-Specialized for categorical variables

-iterations : 50 ~ 50000 # number of trees

-learning_rate : 0.3 ~ 0.01 # scales each pseudo-residual (actual value - predicted value) correction

-max_depth : 5 ~ 16 # CatBoost supports a depth of at most 16

-rsm : random subspace method

-border_count : 254 or 32 or 128 # histogram-based

-silent : -1
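The idea behind border_count is that a numeric feature is cut into a limited number of buckets, and splits are only searched at the bucket borders. The sketch below is a simplified illustration using evenly spaced quantiles; it is an assumption made for clarity and not CatBoost's own border-selection logic, and both function names are hypothetical.

```python
def quantile_borders(values, border_count):
    """Pick up to border_count split candidates at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    borders = [ordered[k * n // (border_count + 1)]
               for k in range(1, border_count + 1)]
    return sorted(set(borders))   # drop duplicate borders

def to_bin(value, borders):
    """Bucket index: the number of borders strictly below the value."""
    return sum(value > b for b in borders)

values = list(range(100))                 # toy feature, for illustration only
borders = quantile_borders(values, 3)
print(borders)                            # [25, 50, 75]
print([to_bin(v, borders) for v in (10, 25, 60, 99)])   # [0, 0, 2, 3]
```

A larger border_count gives finer buckets and more candidate splits at the cost of more memory and computation.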