Boostcamp AI Tech (P2 - Day01)

토론 게시판에 첫 게시물 작성

제목: RandomForestClassifier를 한 번 돌려봤습니다

AUC score는 0.7289 밖에 안 나왔지만, 시도한 방법을 공유하고 제가 놓친 것과 새로 시도할 수 있는 방법들에 대해 토론하면 좋을 것 같아서 적어봅니다.

예제 코드에 있는 Gradient Boosting 방식과 Random Forest 방식에는 각각 장단점이 있는데, 전자보다 후자가 (noisy한 데이터에의 경우) 일반화 성능에 있어서 나을 수 있다는 정보를 바탕으로 시도하게 되었습니다.

참고 자료
- Gradient Boosting vs Random Forest
- In Depth: Parameter tuning for Random Forest

성능이 더 떨어진 것에 대해 제가 확인, 혹은 추측한 바로는 다음 몇 가지가 있습니다.

output.csv를 확인해보면 예측된 확률값이 고르게 퍼져 있지 않고, 대부분 ‘0.0’이며 가끔 ‘1.0’, ‘0.4’ 등이 등장하는 등 매우 극단적으로 분포하고 있다.
애초에 Overfitting이 큰 문제가 되고 있는 상황이 아니었기 때문에, 일반화 성능에 초점을 맞출 필요가 없었다.
feature extraction 등 유의미한 전처리가 뒷받침 되지 않은 상태에서 모델만 바꾸는 것으로는 성능 향상을 기대하기 힘들다.
주어진 예제 코드를 변경하는 과정에서, 기존에 적용되던 (어느 정도 성능을 보이던) 하이퍼 파라미터와 feature engineering 등의 효과를 기대할 수 없게 되었다.
하이퍼 파라미터 최적화가 적용되지 않았다.

코드는 간단하게 from sklearn.ensemble import RandomForestClassifier 로 모델을 불러온 뒤, baseline3에서 호출하는 make_lgb_oof_prediction 함수의 몇몇 부분을 rfc 모델로 대체해서 baseline3를 실행했습니다.

def make_lgb_oof_prediction(train, y, test, features, categorical_features='auto', model_params=None, folds=10):
    x_train = train[features]
    x_test = test[features]
    
    # 테스트 데이터 예측값을 저장할 변수
    test_preds = np.zeros(x_test.shape[0])
    
    # Out Of Fold Validation 예측 데이터를 저장할 변수
    y_oof = np.zeros(x_train.shape[0])
    
    # 폴드별 평균 Validation 스코어를 저장할 변수
    score = 0
    
    # 피처 중요도를 저장할 데이터 프레임 선언
    fi = pd.DataFrame()
    fi['feature'] = features
    
    # Stratified K Fold 선언
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=SEED)

    for fold, (tr_idx, val_idx) in enumerate(skf.split(x_train, y)):
        # train index, validation index로 train 데이터를 나눔
        x_tr, x_val = x_train.loc[tr_idx, features], x_train.loc[val_idx, features]
        y_tr, y_val = y[tr_idx], y[val_idx]
        
        print(f'fold: {fold+1}, x_tr.shape: {x_tr.shape}, x_val.shape: {x_val.shape}')
        
        # 모델 훈련 (randomforest)
        rfc = RandomForestClassifier(n_estimators = 1000, max_depth=32)
        rfc.fit(x_tr, y_tr)

        # Validation 데이터 예측
        val_preds = rfc.predict(x_val)
        
        # Validation index에 예측값 저장 
        y_oof[val_idx] = val_preds
        
        # 폴드별 Validation 스코어 측정
        print(f"Fold {fold + 1} | AUC: {roc_auc_score(y_val, val_preds)}")
        print('-'*80)

        # score 변수에 폴드별 평균 Validation 스코어 저장
        score += roc_auc_score(y_val, val_preds) / folds
        
        # 테스트 데이터 예측하고 평균해서 저장
        test_preds += rfc.predict(x_test) / folds
    
        # 폴드별 피처 중요도 저장
        fi[f'fold_{fold+1}'] = rfc.feature_importances_

        del x_tr, x_val, y_tr, y_val
        gc.collect()
        
    print(f"\nMean AUC = {score}") # 폴드별 Validation 스코어 출력
    print(f"OOF AUC = {roc_auc_score(y, y_oof)}") # Out Of Fold Validation 스코어 출력
        
    # 폴드별 피처 중요도 평균값 계산해서 저장 
    fi_cols = [col for col in fi.columns if 'fold_' in col]
    fi['importance'] = fi[fi_cols].mean(axis=1)
    
    return y_oof, test_preds, fi