Credit Default Prediction Pipeline

📅 January 2024 · Completed

Overview

This project builds a production-ready binary classification pipeline for predicting the probability of credit default within the next 12 months.

Problem Statement

Given bureau data and application features for a loan applicant, predict the probability of default (PD) to inform underwriting decisions.

Approach

  1. Data Ingestion: Pulling from multiple bureau data sources
  2. Feature Engineering: WoE encoding, lag features, aggregation across accounts
  3. Model Training: XGBoost with Optuna hyperparameter search
  4. Validation: Walk-forward validation preserving temporal ordering
  5. Deployment: Batch scoring via MLflow model registry
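
As an illustration of step 4, here is a minimal sketch of an expanding-window walk-forward split. It is not the project's actual code: the function name walk_forward_splits, the min_train and horizon parameters, and the monthly "YYYY-MM" period labels are all assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    periods: Sequence[str], min_train: int = 3, horizon: int = 1
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_periods, test_periods) with training strictly before testing.

    Hypothetical helper: assumes periods are ISO "YYYY-MM" strings, which
    sort chronologically when sorted as plain text.
    """
    ordered = sorted(set(periods))
    # Grow the training window by one period per fold (expanding window).
    for end in range(min_train, len(ordered) - horizon + 1):
        yield ordered[:end], ordered[end:end + horizon]


months = ["2023-01", "2023-02", "2023-03", "2023-04", "2023-05"]
splits = list(walk_forward_splits(months, min_train=3, horizon=1))
for train, test in splits:
    print("train up to", train[-1], "-> test on", test)
```

Each fold only ever tests on periods later than everything it trained on. With a 12-month default target, a real setup would probably also need a gap between the training and test windows so that training labels are fully observed before the test period starts; that gap is left out here for brevity.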

Results

Metric          Value
AUC-ROC         0.831
Gini            0.662
KS Statistic    0.51
PSI (month-3)   0.08
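
For readers less familiar with these metrics, the sketch below shows one common way to compute them, using only the standard library. The function names and the toy inputs are assumptions for illustration, not the project's evaluation code. The Gini figure follows from the AUC via Gini = 2 * AUC - 1, which matches the table (2 * 0.831 - 1 = 0.662).

```python
import math
from typing import Sequence


def gini_from_auc(auc: float) -> float:
    """Gini coefficient derived from AUC-ROC."""
    return 2.0 * auc - 1.0


def ks_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Largest gap between the cumulative score distributions of
    defaulters (label 1) and non-defaulters (label 0)."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all observations tied at the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index over matching score bins.

    Inputs are per-bin population shares; eps guards against empty bins.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


gini = gini_from_auc(0.831)
ks_toy = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])  # perfectly separated toy data
psi_same = psi([0.25, 0.25, 0.5], [0.25, 0.25, 0.5])
psi_shift = psi([0.5, 0.5], [0.6, 0.4])
```

A widely used rule of thumb treats a PSI below 0.1 as a stable population, which is how the month-3 value of 0.08 would typically be read.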

Key Learnings

  • Temporal leakage is the #1 enemy in credit risk models
  • WoE encoding significantly improved model stability
  • Monotonicity constraints improved trust with the business
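
To make the WoE point concrete, here is a minimal sketch of Weight of Evidence encoding for a pre-binned feature. The helper name woe_table, the additive smoothing term, and the sign convention (ln of non-defaulter share over defaulter share) are assumptions; the project's own implementation may differ on any of them.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def woe_table(
    bins: Sequence[Hashable], labels: Sequence[int], smoothing: float = 0.5
) -> Dict[Hashable, float]:
    """Map each bin to ln(share of non-defaulters / share of defaulters).

    Hypothetical helper: smoothing adds a small pseudo-count to every bin
    so that bins with no defaulters (or no non-defaulters) stay finite.
    """
    good: Dict[Hashable, float] = defaultdict(float)
    bad: Dict[Hashable, float] = defaultdict(float)
    for b, y in zip(bins, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    keys = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(keys)
    total_bad = sum(bad.values()) + smoothing * len(keys)
    return {
        k: math.log(
            ((good[k] + smoothing) / total_good) / ((bad[k] + smoothing) / total_bad)
        )
        for k in keys
    }


# Toy data: the "low" utilisation bin defaults less often than the "high" bin.
toy_bins = ["low", "low", "low", "high", "high", "high"]
toy_labels = [0, 0, 1, 1, 1, 0]
table = woe_table(toy_bins, toy_labels)
encoded = [table[b] for b in toy_bins]
```

To stay clear of the temporal leakage noted above, a table like this would be fitted on the training window only and then applied unchanged to later periods.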

Code

Full code available on GitHub.