Credit Default Prediction Pipeline

📅 January 2024 · Completed

⌥ GitHub 🚀 Demo 📄 Paper

Python · XGBoost · Credit Risk · Feature Engineering · MLflow

Overview

This project builds a production-ready binary classification pipeline for predicting the probability of credit default within the next 12 months.

Problem Statement

Given bureau data and application features for a loan applicant, predict the probability of default (PD) to inform underwriting decisions.

Approach

Data Ingestion — Pulling from multiple bureau data sources
Feature Engineering — WoE encoding, lag features, aggregation across accounts
Model Training — XGBoost with Optuna hyperparameter search
Validation — Walk-forward validation preserving temporal ordering
Deployment — Batch scoring via MLflow model registry
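
The walk-forward validation step can be sketched as follows. This is a minimal illustration, not the project's actual splitter: the function name `walk_forward_splits`, its parameters, and the monthly period labels are all assumptions made for the example. It uses an expanding training window and guarantees that every training row comes from a period strictly before every test row. With a 12-month outcome window, training rows would also need their outcome window to close before the test period starts, which this sketch does not enforce.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    periods: Sequence[str],
    min_train_periods: int = 3,
    test_periods: int = 1,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    `periods` holds one sortable period label per row (hypothetical format:
    "YYYY-MM"). Training rows always precede test rows in time.
    """
    unique = sorted(set(periods))
    last_start = len(unique) - test_periods
    for start in range(min_train_periods, last_start + 1, test_periods):
        train_set = set(unique[:start])
        test_set = set(unique[start:start + test_periods])
        train_idx = [i for i, p in enumerate(periods) if p in train_set]
        test_idx = [i for i, p in enumerate(periods) if p in test_set]
        yield train_idx, test_idx


# Illustrative data only: one label per application row.
periods = ["2023-01", "2023-01", "2023-02", "2023-03", "2023-04", "2023-05"]
splits = list(walk_forward_splits(periods, min_train_periods=3, test_periods=1))
```

Each fold trains on all earlier periods and evaluates on the next one, so the model is never scored on data older than what it was trained on.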

Results

Metric	Value
AUC-ROC	0.831
Gini	0.662
KS Statistic	0.51
PSI (month-3)	0.08
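
For readers unfamiliar with these metrics, the sketch below shows one way to compute them in plain Python. The function names and the toy data are assumptions for illustration and are not taken from the project code. Gini is derived from AUC as 2 × AUC − 1 (which matches the table: 2 × 0.831 − 1 = 0.662), KS is the largest gap between the score distributions of defaulters and non-defaulters, and PSI compares the share of the population in each score bin between a reference sample and a later one.

```python
import math
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random defaulter outscores a random non-defaulter.

    Pairwise O(n^2) version for clarity; assumes both classes are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc: float) -> float:
    """Gini coefficient as a linear rescaling of AUC."""
    return 2.0 * auc - 1.0


def ks_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Maximum gap between the empirical score CDFs of the two classes."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s <= t) / n_pos
        cdf_neg = sum(1 for y, s in zip(y_true, scores) if y == 0 and s <= t) / n_neg
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def population_stability_index(
    expected_shares: Sequence[float],
    actual_shares: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI over pre-binned population shares; eps guards against empty bins."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


# Toy data for illustration only (1 = default).
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_roc(y, scores)
toy_gini = gini_from_auc(toy_auc)
toy_ks = ks_statistic(y, scores)
```

A commonly cited rule of thumb treats a PSI below 0.1 as a stable population, so the month-3 value of 0.08 would fall in that range under that convention.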

Key Learnings

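Temporal leakage is the #1 enemy in credit risk models: any feature or label built from information dated after the scoring point inflates validation metrics that will not hold in production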
WoE encoding significantly improved model stability
Monotonicity constraints improved trust with the business
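
To make the WoE point concrete, here is a minimal sketch of Weight of Evidence encoding for a single pre-binned feature. The helper names `fit_woe` and `apply_woe`, the additive smoothing term, and the sign convention (log of the share of non-defaulters over the share of defaulters) are assumptions for this example; conventions differ between teams, and the project's own implementation may not match. The mapping is fitted on training rows only and then applied to later data, which keeps the encoding free of temporal leakage.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def fit_woe(
    bins: Sequence[Hashable],
    defaults: Sequence[int],
    smoothing: float = 0.5,
) -> Dict[Hashable, float]:
    """Learn a WoE value per bin from training data.

    WoE(bin) = ln( share of non-defaulters in bin / share of defaulters in bin ).
    `smoothing` is added to every count so empty cells do not produce log(0).
    """
    goods = Counter(b for b, d in zip(bins, defaults) if d == 0)
    bads = Counter(b for b, d in zip(bins, defaults) if d == 1)
    levels = set(bins)
    total_good = sum(goods.values()) + smoothing * len(levels)
    total_bad = sum(bads.values()) + smoothing * len(levels)
    return {
        b: math.log(
            ((goods[b] + smoothing) / total_good)
            / ((bads[b] + smoothing) / total_bad)
        )
        for b in levels
    }


def apply_woe(
    bins: Sequence[Hashable],
    woe_map: Dict[Hashable, float],
    default: float = 0.0,
) -> List[float]:
    """Encode bins with a fitted map; unseen bins get a neutral WoE of 0."""
    return [woe_map.get(b, default) for b in bins]


# Toy training data for illustration only (1 = default).
train_bins = ["low", "low", "low", "high", "high", "high"]
train_defaults = [0, 0, 1, 1, 1, 0]
woe_map = fit_woe(train_bins, train_defaults)
encoded = apply_woe(["low", "unseen"], woe_map)
```

Because each bin maps to a single log-odds-like value, the encoded feature is ordered by risk, which pairs naturally with monotonicity constraints in the downstream model.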

Code

The full code is available on GitHub.