posted on 2023-08-02, 14:20 authored by Holly A. Hill, Preetesh Jain, Chi Young Ok, Koji Sasaki, Han Chen, Michael L. Wang, Ken Chen
<p>Selection process and workflow of prognostic models. <b>A,</b> Flowchart depicting the MCL patient selection process for inclusion in the models. <b>B,</b> Flowchart showing data availability for the patient cohort (<i>n</i> = 794). All patients had clinicopathologic data. Most patients (<i>n</i> = 642) had cytogenetic and/or genomic data. <b>C,</b> Workflow of machine learning (ML) modeling with XGBoost. The dataset containing all 794 patients was split into a training/validation set and a test set. The test set was held out from all initial preprocessing and hyperparameter tuning to avoid data leakage. Data preprocessing included removing zero-variance and near-zero-variance (NZV) features, dummy encoding categorical features, and collapsing low-frequency levels of categorical variables into an “other” category. The training set was then split into 10 cross-validation folds, on which the hyperparameters of the XGBoost model were tuned. The hyperparameter set with the highest mean ROC AUC was chosen for the final fit, and performance was evaluated on the test set to check the model's fit. Variable importance was visualized with a variable importance plot (VIP) and SHapley Additive exPlanations (SHAP) values. The fitted XGBoost model was deployed through a REST API to demonstrate clinical utility.</p>
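The preprocessing and tuning steps described for panel C can be illustrated with a minimal, standard-library Python sketch. It is not the authors' pipeline: every function name, the 20% test fraction, the 95% dominance threshold used as a stand-in for the NZV rule, and the `score_fold` callback (a hypothetical function that would fit a model on the training indices and return the ROC AUC on the validation indices) are assumptions made for illustration. Model fitting itself is left out.

```python
import random
from collections import Counter
from statistics import mean


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle once and set the test rows aside before any preprocessing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def collapse_rare(values, min_count):
    """Map categories seen fewer than min_count times to 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]


def is_near_zero_variance(values, max_top_share=0.95):
    """Simplified NZV check (assumed rule): the most common value dominates."""
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= max_top_share


def dummy_encode(values):
    """Build one 0/1 indicator column per category level."""
    levels = sorted(set(values))
    return {f"is_{lvl}": [int(v == lvl) for v in values] for lvl in levels}


def kfold_indices(n, k=10):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, valid
        start += size


def pick_best(candidates, score_fold, n, k=10):
    """Return the hyperparameter set with the highest mean fold score."""
    def mean_score(params):
        return mean(score_fold(params, tr, va) for tr, va in kfold_indices(n, k))
    return max(candidates, key=mean_score)
```

In this sketch the test rows are separated first, so the rare-level collapsing, the NZV filter, and the fold-based selection only ever see training data, which is the leakage safeguard the caption describes.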
Our model is the first to integrate a dynamic algorithm with multiple clinical and molecular features, allowing for accurate predictions of MCL disease outcomes in a large patient cohort.