CAPSTONE TORONTO METROPOLITAN UNIVERSITY

DATA ANALYSIS • MACHINE LEARNING • 2026

ROLE

Researcher & Data Analyst TMU Student

Predictive

Modelling of

Cervical Cancer

Risk Factors



01 - ABSTRACT

PYTHON

XG BOOST

DATA MINING

Cervical cancer kills an estimated 94% of its victims in low- and middle-income countries — not because it's undetectable, but because screening access is limited. This project asked: can machine learning models trained on routinely collected patient history data meaningfully identify who is at risk, before they reach a clinic?

WHY CERVICAL CANCER?

I chose cervical cancer as the focus of my data analysis project because it is deeply personal to me. My mom passed away from cervical cancer, and that experience shaped my interest in understanding the disease beyond a clinical perspective. Through this project, I wanted to explore how data can be used to identify risk factors, improve early detection, and ultimately contribute to better outcomes. It reflects both my analytical skill set and my motivation to apply data-driven insights to meaningful, real-world health challenges.

858

Patients

6.41%

Positive Cases

27

Features Used

5

Models Trained

02 - METHODOLOGY

How it was built


01

03

04

05

Data Preprocessing

Feature Importance

Stability & Analysis

Standardized missing value encodings across 36 variables. Removed features with >90% missingness. Excluded post-screening diagnostic columns (Hinselmann, Schiller, Citology, Dx variants) to eliminate information leakage. Final feature set: 27 predictors.


02

Model Selection

Selected 5 models spanning interpretability vs. performance: unweighted logistic regression (baseline), class-weighted logistic regression, decision tree (max depth=5), balanced random forest, and class-weighted XGBoost.


Training & Evaluation

75/25 stratified train-test split to preserve positive class ratio. Evaluated on accuracy, precision, recall, F1, and ROC-AUC. Recall was the primary metric — missing a true positive in a screening context is the most ethically costly error.


Permutation importance computed on balanced random forest (highest ROC-AUC). Top predictors: hormonal contraceptive duration, number of pregnancies, number of sexual partners, and age at first sexual intercourse.


Stratified cross-validation and repeated train-test splits assessed robustness. Class-weighted logistic regression showed highest mean recall (0.327). Random forest had highest mean ROC-AUC (0.671).


03 - RESULTS

Model Performance


MODEL

ACCURACY

0.9349

PRECISION

0.0000

RECALL

0.0000

F1

0.0000

ROC-AUC

0.6542


Logistic Regression (Class-Weighted)

BEST RECALL

0.7442

0.1273

0.5000

0.2029

0.6624


Decision Tree (max depth = 5)

0.9116

0.0000

0.0000

0.0000

0.4534


Random Forest (Balanced)

BEST ROC-AUC

0.9349

0.5000

0.0714

0.1250

0.6943


XGBoost (Class-Weighted)

0.8558

0.0952

0.1429

0.1143

0.5437


Primary evaluation metrics. Accuracy alone is misleading under severe class imbalance.

Logistic Regression (Unweighted)

Class-Weighted Logistic Regression

Highest recall on test set (0.50) and highest mean recall in cross-validation (0.327). Detects the most true positives — aligned with the goal of not missing high-risk patients.

BEST FOR SCREENING

Balanced Random Forest

Highest ROC-AUC (0.6953 on test, 0.671 mean). Best for ranking patients by risk when follow-up resources are limited. Feature importance via permutation analysis.

BEST DISCRIMINATION

Unweighted Models

LR (unweighted) and Decision Tree both achieved 0 recall despite high accuracy. Classic class-imbalance failure — high accuracy, zero clinical utility.

BASELINE TRAP

04 - FEATURE IMPORTANCE

What actually predicts risk

Permutation importance was computed on the balanced random forest — the model with the highest ROC-AUC. Each feature was shuffled in isolation; the drop in performance measures how much the model relied on it.

Reproductive and sexual history variables dominated. STD indicators were surprisingly weak predictors — consistent with prior literature, likely due to systematic underreporting of sensitive items.

04 - ETHICS & LIMITATIONS

What the model can’t do

Extreme Class Imbalance

Only 55 positive cases in 858. Recall and F1 scores swing significantly depending on which positives land in training vs. test. The models are not yet reliable enough for standalone clinical deployment.

LIMITATION

LIMITATION

Missing & Self-Reported Data

Several key predictors (STD subtypes, sexual history) have substantial missingness — partly because they are sensitive and underreported. This directly weakens the predictors the model most needs.


ETHICS

False Negatives > False Positives

A missed biopsy-positive case means a delayed diagnosis and potentially worse outcomes. The evaluation framework was deliberately oriented around sensitivity for this reason — accuracy is not the right metric here.

ETHICS

Decision-Support, Not Diagnosis

These models should only ever be used as risk-flagging tools with human clinical oversight. They must not be applied outside the population they were validated on, and thresholds must be documented and monitored.

05 - STACK & ACCESS

Tools & Resources

PYTHON

PANDAS

SCIKIT-LEARN

XGBOOST

All Projects

JUPYTER NOTEBOOK

UCI ML REPOSITORY

FULL CASE STUDY ↗
Next Project