Team members
Ana Gordon, Aimee Tran Ba Huy, Le Thuy Duong Nguyen, Jada Thompson, Mitia Andrieux
Project summary
Sex disparities in pharmaceutical clinical trials lead to an increased risk of adverse drug reactions (ADRs) in women. By leveraging machine learning (ML), unbIAsed.Rx provides data-driven insights to address this issue, promoting safer and more inclusive healthcare practices.
Keywords
Pharmaceutical Drugs, Machine Learning (ML), Artificial Intelligence (AI), Healthcare, Medication, Pharmaceutical, Regression Model, Bias, Prediction Model, Adverse Reactions, Convolutional Neural Network (CNN), ShuffleNet, Optical character recognition (OCR)
Inspiration
The glaring underrepresentation of women in pharmaceutical clinical trials is not only an urgent issue but a fundamental injustice with significant implications for health outcomes. Despite the prevalence of certain diseases being higher in women, over 60% of drug trials include a disproportionately low number of female participants compared to the affected population (National Academies of Sciences, Engineering, and Medicine). This disparity is compounded by the fact that as of 2016, 70% of biomedical experiments failed to report sex as a variable of interest (Sugimoto CR, Ahn Y-Y, Smith E, Macaluso B, Larivière V). The oversight of not analyzing or considering sex differences in research data can lead to significant consequences, such as an increased risk of ADRs for women. Specifically, female patients are found to have a 1.5- to 1.7-fold greater risk of experiencing negative side effects from medications (Rademaker M.).
This situation highlights the urgent need for more inclusive and sex-specific approaches in clinical research to ensure the safety and efficacy of treatments for all patients. Addressing this issue is crucial not only for the well-being of women but also for advancing medical knowledge and improving healthcare outcomes for everyone. Ensuring equitable representation in clinical trials is a fundamental step toward achieving true equality in healthcare, where treatments are developed and tested in ways that benefit all individuals, regardless of sex.
The goal of our project, unbIAsed.Rx, is to address the significant underrepresentation of women in pharmaceutical clinical trials by raising awareness about this critical issue among both physicians and patients. We aim to bridge the knowledge gap by making complex medical and pharmaceutical information accessible to the general public, thereby empowering patients to advocate for their own health. This initiative not only enables patients to make informed decisions but also equips physicians with the insights needed to make more educated healthcare decisions. By highlighting the underrepresentation of women in clinical trials, we seek to foster meaningful changes in trial design, promoting more inclusive and representative research practices. Ultimately, our project is dedicated to advancing equity in healthcare, ensuring that medical treatments are developed and tested in ways that are safe, effective, and beneficial for everyone, regardless of sex.
List of technologies
Machine Learning and Data Processing:
Python, Pandas, PyTorch, NumPy, Seaborn, Matplotlib, Pytrials, Pickle, scikit-learn (RandomForestRegressor), TorchVision, Google Colab, ShuffleNet, Optical Character Recognition.
Web Development (Frontend, Backend, Deployment):
Python, Flask, HTML, CSS, JavaScript, SQLite, DigitalOcean, GitHub.
Project development
Brainstorming
The brainstorming phase of our project was a significant time investment, setting the foundation for our subsequent work. In the first week, equipped with our initial dataset (MedEffect), we focused on answering key questions crucial to our project development:
- Literature review: We investigated existing projects to draw inspiration and refine our focus on including sex as a factor in our risk assessment analysis.
- Target audience: We identified our primary users (practitioners, patients, the general public, and the pharmaceutical industry).
- Dataset evaluation: We assessed whether our dataset was sufficient for our project’s scope and explored ways to make it more comprehensive.
Fortunately, our mentors and TAs, Tyler Jackson and Prakhar Ganesh, were essential in this process. They helped us clarify our ideas and structure our thoughts, notably by connecting us with other experts in the field.
By examining past applications, websites, and research, we sharpened our project’s focus. Our aim became clearer: to identify sex-specific ADRs and develop a sex-sensitive predictive model. This model would enhance drug safety monitoring through a user-friendly interface, empowering patients to have meaningful conversations with their practitioners about their health.
Data
A crucial part of the project was sourcing meaningful data. To tailor our efforts toward Canadian users, we prioritized Canadian data. The first database used was the MedEffect Canada Vigilance Adverse Reaction Online Database. This database contains information about suspected adverse reactions to health products reported by users, health professionals, manufacturers, or distributors. More information is available on the MedEffect Canada website. The second database we leveraged was ClinicalTrials.gov, which lists clinical research studies conducted in over 200 countries. It relies on sponsors and investigators to submit and update study information, and it complies with laws and regulations requiring the public sharing of clinical trial data, including results. More information is available on the ClinicalTrials.gov website.
Once the data was obtained, we worked primarily with the MedEffect database: we filtered the reports by the name of the medication and the condition it was used to treat, then counted the number of reports by sex. We then queried the ClinicalTrials database with the same medication and condition keywords to find matching clinical research results. Finally, we merged and preprocessed the two sources to prepare the data for training the ML model.
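The filtering and counting step can be sketched as follows. This is a minimal illustration and not our actual pipeline: the records, the field names (`drug`, `condition`, `sex`), and the helper `count_reports_by_sex` are hypothetical, and the sketch uses only the Python standard library, whereas the project itself relied on Pandas.

```python
from collections import Counter

# Hypothetical, simplified adverse-reaction reports; the real MedEffect
# extract has different column names and many more fields.
reports = [
    {"drug": "DrugA", "condition": "asthma", "sex": "Female"},
    {"drug": "DrugA", "condition": "asthma", "sex": "Female"},
    {"drug": "DrugA", "condition": "asthma", "sex": "Male"},
    {"drug": "DrugA", "condition": "diabetes", "sex": "Male"},
    {"drug": "DrugB", "condition": "asthma", "sex": "Female"},
]


def count_reports_by_sex(reports, drug, condition):
    """Count reports per sex for one drug-condition pair (case-insensitive)."""
    drug, condition = drug.lower(), condition.lower()
    return Counter(
        r["sex"]
        for r in reports
        if r["drug"].lower() == drug and r["condition"].lower() == condition
    )


counts = count_reports_by_sex(reports, "DrugA", "asthma")
# Share of reports for this pair that were filed for female patients.
female_share = counts["Female"] / sum(counts.values())
```

Counts like these, computed per drug-condition pair, are the kind of quantity that can then be joined with trial participation figures for the same pair.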
Implementation
Our project involves the implementation of three ML modules, a daunting task for the three-week timeline.
The first model allows users to input a picture of their medication label. We implemented Optical Character Recognition (OCR) with Python-Tesseract (Pytesseract), a Python wrapper for the open-source Tesseract engine. The engine reads the image as a bitmap, locates the regions that contain text, and recognizes the characters in those regions using a neural network. Because the OCR engine is already trained, it can read the image and convert it into text without any training on our side. Pytesseract is also commonly paired with OpenCV for image processing steps such as image manipulation and augmentation before recognition. All of the words detected in the image are then compared to the names of the drugs in our dataset, and a match fills the search bar with the name of the detected drug.
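A minimal sketch of this label-to-drug matching is shown below. The function names, the sample label text, and the drug list are hypothetical. `extract_text` assumes that Pytesseract, Pillow, and the Tesseract binary are installed, so the sketch exercises only the matching step on a hard-coded string.

```python
import re


def extract_text(image_path):
    """Run Tesseract OCR on an image file and return the recognized text."""
    # Imported lazily so the matching logic below works without OCR installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def find_drug_names(ocr_text, known_drugs):
    """Return the known drug names that appear as whole words in the OCR text."""
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    return [drug for drug in known_drugs if drug.lower() in words]


# Hypothetical label text and drug list, standing in for real OCR output.
label_text = "TAKE ONE TABLET DAILY\nAtorvastatin 20 mg\nRefills: 2"
matches = find_drug_names(label_text, ["Metformin", "Atorvastatin"])
```

This simple word lookup only handles single-word drug names and exact spellings. Multi-word names or OCR misreads would need phrase or fuzzy matching.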
The second model allows users to input a picture of their pill. We used a Kaggle dataset with over 9,500 labels to train a ShuffleNet V2 model. We also considered the ResNet50 and ResNeXt models; however, ShuffleNet achieved a higher accuracy, at 90%. This model categorizes the detected pill into one of ten types: Alaxan, Bactidol, Biogesic, Lamictal, DayZinc, Rivaroxabanm, Fish Oil, Kremil S, Medicol, or Neozep. The model uses a convolutional neural network (CNN) to classify the input picture. A CNN slides small filters (also called kernels) over the image to learn visual features, which the later layers of the network combine to produce a classification. We used the ShuffleNet_V2_X0_5_Weights pre-trained weights to train the model; ShuffleNet V2 is an architecture designed around measured speed rather than FLOPs alone. With time, we would like to train on a larger dataset to include more variety in the categories, as well as integrate more generic medical terminology to properly relate the categories to the drugs in our dataset.
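To illustrate what a single convolutional filter does, here is a toy sketch in plain Python. It is not the ShuffleNet implementation: the image and kernel values are made up, and the filter is written by hand, whereas a CNN learns its filter values during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so this is technically a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the kernel with the image patch under it and sum.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-written vertical-edge filter (hypothetical values).
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
```

The resulting feature map is nonzero only where the dark and bright halves meet, which is how a filter "detects" a feature such as an edge.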
Oftentimes, medical labels contain a lot of additional information, which makes it quite difficult to understand and find the name of the drug prescribed. Additionally, managing different pills can sometimes cause confusion, especially when most pill bottles look very similar. Our picture label and picture pill features enable simpler and quicker access to vital information, making the tool a more accessible and user-friendly information provider.
The third model allows users to select or input a drug as well as their physical condition, to predict the risk of an ADR. We trained a Random Forest Regressor model on the biases found in the participant distribution of clinical trials for drug-condition pairs. This type of model works by constructing many decision trees, each built on a randomly selected subset of the data. It then averages the outputs of all the decision trees to make predictions for new, unseen data. We chose this model because it handles nonlinear relationships better than simpler models such as linear regression. This feature allows users to select a drug that is already in our dataset, or input their own drug that may not be included yet. They can then select one of the conditions from our dataset, which generates a predicted percentage risk of having an ADR. This feature is the cornerstone of our project and plays a crucial role in our website. It has the potential to effectively address sex biases in pharmaceuticals by raising awareness of this issue.
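The sketch below shows how such a regressor can be trained and queried with scikit-learn. Everything about the data is an assumption made for illustration: the two features, the synthetic target, and the numbers are invented and are not our dataset, our feature set, or our results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a training table (NOT the project's data):
# each row is a drug-condition pair described by two made-up features.
female_trial_share = rng.uniform(0.1, 0.9, size=200)    # women among trial participants
female_patient_share = rng.uniform(0.3, 0.7, size=200)  # women among affected patients
X = np.column_stack([female_trial_share, female_patient_share])

# Made-up target: risk (in %) grows as women are underrepresented in trials.
gap = np.clip(female_patient_share - female_trial_share, 0, None)
y = 10 + 60 * gap + rng.normal(0, 2, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Predict for two unseen pairs: well represented vs. underrepresented.
balanced, skewed = model.predict([[0.5, 0.5], [0.15, 0.6]])

# The forest's prediction is the average of its individual trees.
per_tree = [tree.predict([[0.15, 0.6]])[0] for tree in model.estimators_]
```

Averaging many trees, each fit on a different bootstrap sample, is what makes the forest less sensitive to noise than any single decision tree.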
Impact/Innovation
Our solution, unbIAsed.Rx, stands out by addressing a critical and often overlooked issue: the underrepresentation of women in clinical trials and the subsequent disparities in drug safety and efficacy. While there are a few initiatives that focus on personalized medicine and pharmacogenomics, most do not specifically target sex-based differences in ADRs. Competitors in this space typically offer broader pharmacovigilance tools or generalized predictive models that do not adequately consider sex as a significant variable. Our solution uniquely addresses the root cause of disparities in drug reactions—the historical neglect of sex as a biological variable in clinical research. By bringing attention to the underrepresentation of women in clinical trials and highlighting the resulting disparities in drug reactions, we aim to promote greater awareness and advocacy for more inclusive and effective clinical practices. This awareness is critical for driving policy changes and improving clinical trial designs to ensure equity in healthcare. By making this information easily accessible through a user-friendly interface, we help bridge the knowledge gap that exists between complex medical data and everyday users. Our solution has the potential to significantly enhance drug safety, reduce healthcare costs by mitigating the risk of adverse reactions, and ultimately, improve patient outcomes.
The potential impact of unbIAsed.Rx on its intended audience and the broader field is significant. For patients, it means safer medication usage and a greater ability to advocate for their own health needs. For healthcare providers, it offers a tool to better assess the risks associated with prescribing medications to diverse patients, ultimately leading to more personalized and effective treatments. In the broader context, unbIAsed.Rx has the potential to drive systemic change in how clinical trials are designed and conducted, encouraging the inclusion of sex as a critical factor in research and ultimately promoting more equitable healthcare practices.
Limitations
As with any predictive model, there is potential for harm if the tool is misused or if its output is misinterpreted. To mitigate these risks, we have implemented safeguards such as clear disclaimers, educational resources for users, and ongoing updates to the model to reflect the most current research. We emphasize that unbIAsed.Rx is not a substitute for professional medical advice, but an informational resource to raise awareness. Although every effort has been made to ensure that the information provided is accurate, up-to-date, and complete, the absence of a warning for a given drug or combination thereof should in no way be construed to indicate that the drug or combination is safe, effective, or appropriate for any given patient.
We also prioritize data privacy and security with personal information kept confidential, recognizing the sensitivity of health data and the need to protect users' personal information. Users further retain full autonomy to make their own informed decisions and we strongly encourage patients to consult with a healthcare professional before making any medical decisions or changes to their treatment plan.
Challenges we ran into & how we overcame them
Data availability / working with existing data
One significant challenge we faced was the availability and quality of data. The MedEffect database, which served as our primary data source, required extensive cleaning and filtering before it could be effectively utilized for our project. Initially, we intended to focus exclusively on cardiovascular conditions, given their prevalence and the well-documented sex-based differences in outcomes. However, we soon realized that limiting our project to cardiovascular data might not address the broader issues faced by a more diverse population. Our goal was to create a platform that could help as many people as possible by providing insights across a wide range of health conditions.
Moreover, the MedEffect dataset did not contain information about cases where drug use did not result in an adverse reaction, which limited our ability to comprehensively assess drug safety. To overcome this, we conducted a thorough literature review and sought additional datasets, such as the Canadian Chronic Disease Surveillance System and ClinicalTrials.gov, to supplement our primary data. This approach allowed us to extend our analysis beyond cardiovascular conditions, incorporating other significant health issues such as respiratory conditions (asthma, chronic obstructive pulmonary disorders), diabetes, schizophrenia, and neurological conditions (dementia, epilepsy, multiple sclerosis, parkinsonism).
Consultations with experts in the field were crucial in guiding our data selection process and ensuring that our data was robust enough to support our predictive model.
Machine learning problem
One of the main challenges we encountered was defining a clear ML problem. As our project developed, cleaning and parsing through our data consumed a significant portion of our time. We soon realized the risk of limiting our work to merely performing a statistical analysis on existing data, rather than making predictions based on unseen data, such as clinical trials or volunteer side effect reports. Our specific question became: for a particular drug-condition pair, how likely is an individual to experience an ADR, considering their sex?
To address this, we leveraged statistical tendencies and explored various ML models, including categorical approaches for predicting risk levels. Ultimately, we opted for a regression model to provide more meaningful and informative percentage predictions rather than just low, medium, or high categories.
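A small sketch of why percentages carry more information than categories: the cutoffs and the example values below are hypothetical and only illustrate the information lost by bucketing.

```python
def risk_category(risk_percent, low_cutoff=33.0, high_cutoff=66.0):
    """Bucket a percentage into a coarse label (cutoffs are hypothetical)."""
    if risk_percent < low_cutoff:
        return "low"
    if risk_percent < high_cutoff:
        return "medium"
    return "high"


# Two predicted risks that differ by 30 points collapse into the same label.
predictions = [34.0, 64.0]
labels = [risk_category(p) for p in predictions]
```

Both values are reported as "medium", whereas a regression output keeps the difference between them visible to the user.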
In our effort to improve the accessibility of our platform, we decided to incorporate the CNN image recognition feature and the Optical Character Recognition (OCR) to allow users to take photos of their medication boxes or labels for easy identification and risk assessment.
What we learned & accomplishments we’re proud of
During the course of this program, we gained many valuable insights and knowledge. The first few weeks allowed us to learn the foundations of Artificial Intelligence and ML through lectures, workshops, TA sessions, and more.
Our project allowed us to strengthen the foundations learned in the first four weeks of the lab and apply these concepts to real-life problems. A key takeaway from working on the project is the importance of clean data. Without it, it is difficult to formulate a proper ML problem and obtain the expected results from the ML model. Overall, working on the project taught us the whole pipeline of setting up an ML problem, from an idea to training a model and integrating it into a user-friendly platform.
We are proud of everything we’ve accomplished over the span of 7 weeks, from keeping up with the ambitious and fast-paced curriculum to developing our project. We are especially proud of creating a complete project in the context of using AI for good.
We would not have been able to accomplish this without the support of our TAs, mentors, advisers, and of course our fellow peers at the AI4Good Lab. These connections were definitely one of the highlights of our participation in the program.
What’s next for the project
As we advance with our project unbIAsed.Rx, we are focused on enhancing the technical capabilities of our models. For our primary model, which assesses the risk of developing ADRs based on sex, our goal is to improve its accuracy. Achieving this requires acquiring more data for training. We are actively pursuing additional data sources, including collaboration with the Canada Vigilance Program to access complementary data in the MedEffect database, which offers crucial information on patients' medical histories and test results. This enriched dataset will enable us to enhance the precision of our model. In parallel, we are refining our CNN to accurately identify a wider range of pharmaceutical pills. Our aim is to ensure that the system can not only recognize a broad spectrum of pills but also accurately indicate when an unidentifiable pill is input, avoiding incorrect matches.
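One possible way to flag unidentifiable pills is to reject predictions whose top softmax probability falls below a threshold. The sketch below is an assumption about how this could be done, not something we have implemented: the function names, the threshold of 0.8, and the example scores are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_or_reject(logits, class_names, threshold=0.8):
    """Return the top class, or None when the model is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best] if probs[best] >= threshold else None


names = ["Alaxan", "Biogesic", "Neozep"]  # subset of the classes, for brevity
confident = classify_or_reject([6.0, 1.0, 0.5], names)  # one score dominates
uncertain = classify_or_reject([1.2, 1.0, 0.9], names)  # scores nearly tied
```

Softmax confidence is an imperfect signal, since networks can be confidently wrong on images unlike anything in their training data, so a threshold like this would be a first step rather than a complete solution.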
Our ultimate goal is to develop unbIAsed.Rx into a fully-fledged product and launch a comprehensive website to effectively raise awareness about sex disparities in pharmaceutical trials and promote fairness in healthcare. We are grateful to participate in the fall cohort of the Mila Entrepreneurship Lab, which will provide us with the tools and support needed to realize these objectives and bring our vision to life.
Acknowledgements & References
Acknowledgements:
We sincerely appreciate the guidance and mentorship provided by our TA, Prakhar Ganesh, and mentor Tyler Jackson, along with the valuable insights from Eptehal Nashoush, Khaoula Chehbouni, Maryam Molamohammadi, Nicole Osayande, and Padideh Nouri. We are also deeply thankful to the AI4Good Lab for offering us the opportunity to learn, grow, and develop our project, and for their unwavering support throughout this journey.
References:
National Academies of Sciences, Engineering, and Medicine. 2022. Improving Representation in Clinical Trials and Research: Building Research Equity for Women and Underrepresented Groups. Washington, DC: The National Academies Press. https://doi.org/10.17226/26479.
Rademaker M. Do women have more adverse drug reactions? Am J Clin Dermatol. 2001;2(6):349-51. doi: 10.2165/00128071-200102060-00001. PMID: 11770389.
Sugimoto CR, Ahn YY, Smith E, Macaluso B, Larivière V. Factors affecting sex-related reporting in medical research: a cross-disciplinary bibliometric analysis. Lancet. 2019 Feb 9;393(10171):550-559. doi: 10.1016/S0140-6736(18)32995-7. PMID: 30739690.