William Qian's Portfolio

Research

Completed

A Causal Approach to Fair Predictive Modeling via Penalized Maximum Likelihood Estimation

Brown University

Instructor: Prof. Alice Paul

Keywords: Causal Inference, Optimization

Automated decision-making systems are increasingly prevalent across various domains, including hiring, loan approval, and criminal justice. However, the widespread use of these systems has raised significant concerns about the fairness of the models they employ. These systems are often trained on historical data that may contain inherent biases, leading to the amplification or perpetuation of these biases and resulting in unfair treatment of certain groups.

Various approaches have been developed to mitigate unfairness in automated decision-making systems. These approaches are typically categorized into three groups: pre-processing, in-processing, and post-processing methods. However, a common challenge across these methods is the trade-off between fairness and accuracy. Many methods require decision-makers to define what constitutes an optimal balance, which can introduce subjective biases into the process. There is also a need for continuous monitoring and human intervention to ensure that models remain fair over time.

To address this problem, a causal approach focused on in-processing was proposed. This approach leverages path-specific effects (PSEs) and penalized maximum likelihood estimation. By incorporating PSEs into the model, the approach can identify the degree of discrimination within the data and generate new predictions that meet fairness constraints through an optimized semi-parametric likelihood function. This method not only simplifies the optimization problem but also produces more interpretable results and holds decision-makers accountable for the consequences of their predictions.
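As a rough illustration of what a penalized likelihood objective looks like, the sketch below adds a fairness penalty to a logistic negative log-likelihood. It is not the project's method: the path-specific effect is replaced by a much simpler stand-in (the gap in mean predicted probability between two groups), the model is fully parametric, and all function names and the `lam` penalty weight are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, row):
    """Predicted probability from a logistic model (assumed model form)."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)))

def negative_log_likelihood(weights, rows, labels):
    """Standard logistic negative log-likelihood over all observations."""
    total = 0.0
    for row, y in zip(rows, labels):
        p = predict(weights, row)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def group_gap(weights, rows, groups):
    """Stand-in unfairness measure (NOT a path-specific effect):
    mean predicted probability in group 1 minus that in group 0."""
    p1 = [predict(weights, r) for r, g in zip(rows, groups) if g == 1]
    p0 = [predict(weights, r) for r, g in zip(rows, groups) if g == 0]
    return sum(p1) / len(p1) - sum(p0) / len(p0)

def penalized_objective(weights, rows, labels, groups, lam):
    """Negative log-likelihood plus lam times the squared group gap.
    Larger lam puts more weight on fairness relative to fit."""
    nll = negative_log_likelihood(weights, rows, labels)
    return nll + lam * group_gap(weights, rows, groups) ** 2
```

Minimizing `penalized_objective` over `weights` with any numerical optimizer would trade goodness of fit against the stand-in gap; setting `lam` to 0 recovers ordinary maximum likelihood.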

[GitHub]

Completed

Emulators for cohort-based, Markov state, simulation models: an application to the RESPOND model for Opioid Use Disorder (OUD)

Brown University, Syndemics Lab at Boston Medical Center

Project: HEAL Data2Action (D2A) Program

Instructor: Prof. Stavroula Chrysanthopoulou

Keywords: GEE, GLME, MERF, LSTM, Model Tuning

Purpose: In this study we present an innovative idea and unique example of building emulators for cohort-based Markov simulation models.

Methods: We have developed the Researching Effective Strategies to Prevent Opioid Death (RESPOND) simulation model to characterize Opioid Use Disorder (OUD) dynamics, make projections, evaluate interventions, and inform decision making in this area. Due to its complex structure, running RESPOND can be computationally intensive. The objective of this study is to build an emulator (metamodel), namely a simulation model of a simpler structure, that efficiently maps model inputs to outputs, based on a version of the RESPOND model calibrated to observed OUD trends in Massachusetts. We explore three statistical approaches for an emulator of the RESPOND model: 1) a generalized linear model (GLM) for longitudinal data, 2) a Mixed-Effects Random Forest (MERF), and 3) a Long Short-Term Memory (LSTM) recurrent neural network model. We describe model selection procedures for determining, separately for each model category, the best-fitting model (set of parameter values) for the available simulated RESPOND data. We compare the three approaches in terms of their accuracy (weighted Root Mean Squared Error (wRMSE) and Mean Absolute Error (MAE)) and efficiency (running time) using simulated data (overall and fatal overdose counts) from the calibrated RESPOND model.
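The two accuracy metrics can be sketched as follows. The abstract does not state how the wRMSE weights are chosen, so the weighting scheme here (a weighted mean of squared errors, normalized by the sum of weights) and the function names are assumptions for illustration only.

```python
import math

def weighted_rmse(observed, predicted, weights):
    """Weighted RMSE: sqrt(sum(w * err^2) / sum(w)) (assumed definition)."""
    if not (len(observed) == len(predicted) == len(weights)):
        raise ValueError("inputs must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(w * (o - p) ** 2
                  for o, p, w in zip(observed, predicted, weights))
    return math.sqrt(squared / total_weight)

def mean_absolute_error(observed, predicted):
    """MAE: average absolute difference between observed and predicted."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

For example, an emulator's predicted overdose counts could be passed as `predicted` and the simulated RESPOND outputs as `observed`, with `weights` emphasizing whichever time points or outcomes matter most.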

Results/Conclusions: Our findings show that the LSTM model provides more accurate predictions, while GLMs can provide insightful details about observed trends and associations of key factors over time. We also discuss how the efficiency of the underlying algorithms allows emulators to be used in online, user-friendly applications that evaluate health care alternatives in real time.

[Demo]

Completed

Research on Malicious URL Detection Based on Random Forest

University of Chicago (Remote)

Instructor: Prof. Nick Feamster

Keywords: Malicious URL Detection, Machine Learning, SVM, KNN, XGBoost, Decision Tree, Random Forest

Malicious URLs have become a serious threat to cybersecurity and an incubator for online criminal activity. Visitors to malicious URLs may be exposed to spam, phishing, and drive-by downloads, which seriously threaten their privacy and security and cause billions of dollars in losses every year. Traditional methods such as URL blacklists can classify most known URLs but perform poorly on newly generated ones. To forestall greater economic losses, a method that can classify URLs in a timely manner is imperative. To improve the timeliness of malicious URL detection, we use machine learning algorithms to classify URLs automatically. In this article, we take the results of several common machine learning models on our data set as baselines and compare them side by side with the results of a random forest classifier. We then tune the random forest classifier to achieve the best results at the lowest complexity.
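Before any classifier can be trained, each URL has to be turned into numeric features. The abstract does not say which features the paper uses, so the sketch below shows a few lexical features that are plausible for this kind of task; the feature set and the `url_features` function are hypothetical, not the paper's.

```python
import ipaddress
from urllib.parse import urlsplit

def url_features(url):
    """Turn a URL string into a dict of simple lexical features
    (illustrative feature set, assumed rather than taken from the paper)."""
    # urlsplit only recognizes the host when a scheme is present
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1
    except ValueError:
        host_is_ip = 0
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len([s for s in parts.path.split("/") if s]),
        "has_at_symbol": int("@" in url),
        "host_is_ip": host_is_ip,
        "uses_https": int(parts.scheme == "https"),
    }
```

Feature dictionaries like these could be stacked into a table and passed to a random forest or any of the baseline models listed in the keywords.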

[Paper]