Nitish John Toppo

Dec 01 2025

Problem Statement

Student loan providers struggle to secure timely repayments and keep defaults low. Traditional collection strategies often lack personalization and predictive capability, which leads to poorly targeted collection effort and higher credit risk exposure. This calls for an AI-driven solution that predicts borrower repayment behavior, optimizes collection strategies, and gives early warning of potential defaults.

Project Objectives

Develop a predictive AI system to assess the likelihood of student loan repayment or default.
Enhance collection efficiency by segmenting borrowers based on risk categories.
Leverage historical and vendor-provided data to identify behavioral patterns.
Automate reporting and model validation against past performance.
Continuously re-evaluate models to ensure optimal performance in changing economic conditions.
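
The segmentation objective above can be illustrated with a minimal sketch that maps a model's predicted default probability to a risk category. The cut-offs (0.20 and 0.50), the category names, and the function names are hypothetical and chosen only for illustration; the project's actual thresholds are not stated.

```python
# Hypothetical cut-offs; a real system would calibrate these on validation data.
RISK_THRESHOLDS = [(0.20, "low"), (0.50, "medium")]


def risk_category(default_probability):
    """Map a predicted default probability to a risk segment."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper_bound, label in RISK_THRESHOLDS:
        if default_probability < upper_bound:
            return label
    return "high"


def segment_borrowers(scores):
    """Group borrower IDs (keys of `scores`) by risk segment."""
    segments = {"low": [], "medium": [], "high": []}
    for borrower_id, probability in scores.items():
        segments[risk_category(probability)].append(borrower_id)
    return segments
```

Collection teams could then prioritize the "high" segment, while the "low" segment might only receive automated reminders.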

Scope of Work

Data Ingestion:
- Collect external vendor data for new borrowers.
- Use internal historical repayment and borrower data for existing customers.
Data Preprocessing:
- Perform data cleaning, missing value imputation, and feature engineering.
- Identify key data points such as income, credit score, repayment history, demographics, and loan details.
Model Development:
- Train multiple ML models (Random Forest, LightGBM, XGBoost) using historical and vendor data.
- Apply hyperparameter tuning to optimize model performance.
Validation & Evaluation:
- Validate models against past repayment data.
- Compare accuracy, recall, and precision metrics across models.
- Select the best-performing models for production.
Deployment & Reporting:
- Deploy AI pipelines for real-time predictions on Databricks.
- Generate risk reports and dashboards for business stakeholders.
- Re-evaluate models periodically to adapt to borrower behavior changes.

Approach Followed

Data Collection & Integration
- Vendor data used for new borrower onboarding.
- Internal loan and repayment data used for existing customers.
- Data stored and managed using MySQL.
Data Cleaning & Feature Engineering
- Removed duplicates, handled missing values, and standardized formats.
- Identified critical features (e.g., repayment history, credit utilization, income-to-loan ratio).
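
As a rough illustration of the cleaning and feature engineering steps above, the sketch below removes duplicate borrower records, fills missing incomes with the median, and derives an income-to-loan ratio. It uses plain Python for readability rather than the PySpark pipeline the project ran on, and the field names (`borrower_id`, `income`, `loan_amount`) and the choice of median imputation are assumptions.

```python
from statistics import median


def clean_and_engineer(records):
    """Deduplicate by borrower_id, impute missing income, add a ratio feature."""
    # Keep the first record seen for each borrower (duplicate removal).
    unique = {}
    for record in records:
        unique.setdefault(record["borrower_id"], dict(record))
    rows = list(unique.values())

    # Median imputation: fill missing incomes with the median of known ones.
    known_incomes = [r["income"] for r in rows if r["income"] is not None]
    fill_value = median(known_incomes) if known_incomes else 0.0

    for row in rows:
        if row["income"] is None:
            row["income"] = fill_value
        # Engineered feature: income relative to loan size.
        # Left as None when the loan amount is missing or zero.
        loan = row.get("loan_amount")
        row["income_to_loan"] = row["income"] / loan if loan else None
    return rows
```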
Model Training & Validation
- Used PySpark on Databricks for distributed processing of large datasets.
- Implemented Random Forest, LightGBM, and XGBoost models.
- Conducted hyperparameter tuning to optimize performance.
- Validated models using historical repayment outcomes.
Performance Evaluation
- Compared models using AUC, F1-score, and recall (focus on identifying potential defaulters).
- Best model selected based on predictive accuracy and business interpretability.
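
To make the comparison metrics concrete, here is a minimal pure-Python sketch of precision, recall, F1, and ROC AUC for the positive class. It assumes labels are encoded as 1 for default and 0 for repaid, which the write-up does not specify. The AUC uses the pairwise-ranking definition, which is quadratic in the number of rows and only suitable for small samples; a production pipeline would rely on a library implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Probability that a random defaulter is scored above a random non-defaulter."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

Recall is the metric tied most directly to the stated focus on identifying potential defaulters, since it measures the share of actual defaulters the model flags.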
Reporting & Re-Evaluation
- Generated detailed reports highlighting borrower risk categories.
- Conducted re-evaluation cycles to improve performance with updated datasets.
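
One simple way to drive a re-evaluation cycle is to recompute recall on newly labelled repayment outcomes and flag the model when it falls too far below a baseline. The sketch below assumes that approach; the 0.05 tolerance, the choice of recall as the trigger, and the function names are illustrative assumptions, not the project's documented procedure.

```python
def needs_retraining(baseline_recall, current_recall, tolerance=0.05):
    """Flag the model when recall drops by more than `tolerance`."""
    return (baseline_recall - current_recall) > tolerance


def reevaluate(baseline_recall, y_true, y_pred, tolerance=0.05):
    """Recompute recall on fresh outcomes (1 = default) and report the result."""
    actual_defaults = sum(1 for t in y_true if t == 1)
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    current_recall = caught / actual_defaults if actual_defaults else 0.0
    return {
        "current_recall": current_recall,
        "retrain": needs_retraining(baseline_recall, current_recall, tolerance),
    }
```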

Tech Stack

Data Storage: MySQL
Data Processing: PySpark, Databricks
Programming Language: Python
Machine Learning Models: Random Forest, LightGBM, XGBoost
Model Optimization: Hyperparameter Tuning
Deployment & Reporting: Databricks Notebooks, Automated Dashboards

Model Performance Metrics

After testing and validation against historical repayment outcomes, LightGBM emerged as the best-performing of the three models.

Accuracy: 86%
ROC AUC: 0.72
Precision: 48%
F1 Score: 38%
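
Recall is not listed among the reported figures, but it can be estimated from precision and F1, because F1 = 2PR / (P + R) rearranges to R = F1 * P / (2P - F1). The sketch below does that arithmetic, assuming the precision and F1 figures are percentages (0.48 and 0.38) and both refer to the default class. Treat the result as a back-of-the-envelope estimate, not a reported metric.

```python
def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for recall R."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denominator


recall = implied_recall(precision=0.48, f1=0.38)
print(f"implied recall: {recall:.3f}")  # prints "implied recall: 0.314"
```

Under these assumptions the model would flag roughly a third of actual defaulters. Accuracy well above F1 is typical when defaulters are a small share of borrowers, though the class balance is not stated here.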
