AI/LLM Generated gene alteration and expression based subtyping for each tumor type #114

inodb · 2024-03-20T19:21:24Z

Background:

Cancer Classification: Cancer manifests in various forms across different tissues and organs of the body. The classification of cancer plays a pivotal role in understanding its behavior, prognosis, and treatment strategies. Over the years, advancements in medical research and technology have led to a deeper understanding of the molecular and cellular mechanisms underlying cancer development, thereby refining the classification systems used by oncologists and researchers. At its core, cancer classification categorizes malignancies based on a multitude of factors, including their tissue of origin, histological characteristics, genetic alterations, and clinical behavior.
OncoTree: OncoTree is a dynamic and flexible community-driven cancer classification platform encompassing rare and common cancers that provides clinically relevant and appropriately granular cancer classification for clinical decision support systems and oncology research.
cBioPortal: cBioPortal is an open-source platform for cancer genomics data analysis and visualization. It provides a centralized resource for exploring and analyzing large-scale cancer genomic data sets, including genomic alterations, gene expression, and clinical information. The platform integrates data from multiple sources, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), and makes it available through a web interface for researchers, clinicians, and the general public. All samples in cBioPortal are assigned a particular cancer type based on OncoTree.
The Challenge: In cBioPortal there are many pages where it would be useful to list a set of default genes when we know which cancer type the user is looking at. For example, when exploring a breast cancer dataset, it probably makes sense to look at BRCA1, BRCA2 and EGFR alterations. Similarly, for Glioblastoma you'll want to look at IDH1 and IDH2. We can use an LLM (or another method) to generate these recommended genes for each OncoTree code, e.g. by constructing a prompt like "Which genes are relevant for subtype x?"

Goal:

Generate a list of recommended default genes for each OncoTree code that are often used for molecular classification of that subtype

Approach:

Try different prompts on any LLM of choice and script a way to do this semi-automatically
Some example prompts:

Which genes and pathways are relevant for classifying Breast Cancer? Could you give your answer in a JSON structure like:
{
    "Breast Cancer": {
        "Mutation-Based": ["Gene1", "Gene2", "Gene3"],
        "Expression-Based": ["GeneX", "GeneY", "GeneZ"],
        "Pathways": ["PathwayA", "PathwayB"]
    }
}

The LLM of choice can be vanilla ChatGPT, Gemini, something you train yourself, etc
Start with just the main OncoTree Types, e.g. "Breast Cancer", "Lung Cancer", etc
Explore ways to validate the proposed genes. One way would be to leverage the cBioPortal API to see if samples with this OncoTree code have any alterations in those genes
For the 350h project we can try to do the same for the more detailed subtypes like e.g. 'Breast Lobular Carcinoma In Situ'.
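
A minimal sketch of how the prompting step could be scripted semi-automatically. This is only an illustration, not an agreed design: the function names (`build_prompt`, `parse_response`, `recommend`) and the `ask_llm` callable are hypothetical, and `ask_llm` stands in for whichever LLM client is chosen. The JSON keys mirror the example prompt above.

```python
import json

# Categories taken from the example JSON structure in the prompt above.
CATEGORIES = ("Mutation-Based", "Expression-Based", "Pathways")


def build_prompt(cancer_type):
    """Build a prompt that asks for genes/pathways in a fixed JSON shape."""
    template = {cancer_type: {key: [] for key in CATEGORIES}}
    return (
        f"Which genes and pathways are relevant for classifying {cancer_type}? "
        "Answer with JSON only, using exactly this structure:\n"
        + json.dumps(template, indent=4)
    )


def parse_response(cancer_type, raw):
    """Pull the outermost JSON object out of an LLM reply and normalize it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    data = json.loads(raw[start:end + 1])
    entry = data[cancer_type]  # raises KeyError if the model renamed the key
    # Missing categories become empty lists; entries are whitespace-trimmed.
    return {
        key: [str(item).strip() for item in entry.get(key, [])]
        for key in CATEGORIES
    }


def recommend(cancer_types, ask_llm):
    """Run one prompt per cancer type; `ask_llm` is a hypothetical callable
    that takes a prompt string and returns the model's text reply."""
    return {
        cancer_type: parse_response(cancer_type, ask_llm(build_prompt(cancer_type)))
        for cancer_type in cancer_types
    }
```

Passing the LLM call in as a plain function keeps the script independent of any particular provider, so the same loop could be tried against different models.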

Need skills:
Prompt Engineering, Python or similar scripting language

Possible mentors:
@inodb Chris Fong @ao508

Steveolas · 2024-03-21T19:24:49Z

Super interesting. I don't know much about genetics, so I may be off here, but I think it may be worth your while to check out the Mamba model. Although it is a smaller model, it can be fine-tuned, and it has a different architecture from the transformers used by most other LLMs, which makes it better suited to large-context problems. And genomic problems (from my understanding) can generally have large contexts.

Ilan

sohamchatterjee50 · 2024-03-22T23:54:49Z

Hey @inodb

I am currently pursuing my Master's in Artificial Intelligence at the University of Amsterdam. Prior to this, I worked at Amazon as an Applied Scientist, where I was responsible for fine-tuning models for reranking purposes. As part of both professional and research activities, I have delved into LLMs for prompt engineering. As part of the ongoing research at UvA, I am exposed to a lot of cutting-edge research in AI for Science. I would love to contribute to an area which involves both AI and biology, especially bioinformatics.

Mail: sohamchatterjee50@gmail.com
LinkedIn: https://www.linkedin.com/in/soham-chatterjee-3410abb8/
It would be great if we could connect on a call sometime.

SumitdevelopAI · 2024-03-23T03:51:57Z

Hello @inodb

My name is Sumit Sharma. I am currently pursuing my B.Tech in AI at Parul University. Please let me know how to apply for this project.

wuyuqing0327 · 2024-03-26T22:07:56Z

Hi @inodb

My name is Yuqing Wu, and I go by Chelsea. I'm currently a master's student in Data Science at the University of Chicago. Before this program, I worked as a machine learning engineer for over 5 years. I currently work part-time as a Research Assistant at the Institute of Population and Precision and Health, applying machine learning algorithms to identify the impact of the microbiome and Bacteroides on blood pressure, with the aim of preventing disease.

I'm really interested in this project. I can leverage LLMs to process text-based information and identify genes that are relevant for the molecular classification of cancer subtypes.

Linkedin: https://www.linkedin.com/in/chelsea-uchi0327
E-mail: wuyuqing0327@gmail.com / yuqingw1@uchicago.edu

RainieFu · 2024-03-27T05:09:21Z

Hi @inodb,

I'm Rainie, a third-year undergraduate student at the University of British Columbia, majoring in Computer Science and Statistics. Currently, I am doing a full-time internship as a Bioinformatics Research Assistant at Vancouver Prostate Cancer, where my focus lies in genomic and statistical analysis within the realm of Prostate Cancer. Here, I'm deeply engaged in identifying potential biomarkers to introduce new treatment arms in clinical trials, leveraging a diverse set of technologies encompassing both biological methodologies and data-driven analytics.

My academic journey has equipped me with a solid foundation in data mining and machine learning, as well as proficiency in scripting languages like Python and R, enabling me to grasp intricate machine learning algorithms and well-known libraries with ease. I am genuinely passionate about contributing my skills and knowledge to projects that make a tangible impact. This project is the perfect intersection of my academic interests and practical skills. I'm eager to join forces, brainstorm solutions, and ultimately contribute to advancements in cancer research.

Looking forward to the possibility of working together on this exciting endeavor, and I am happy to discuss further about my application. Feel free to connect with me via the following:

Linkedin: https://www.linkedin.com/in/rainie-fu/
Email: rainiefu0813@gmail.com

inodb · 2024-03-28T16:11:35Z

Hi @SelahattinAksoy @Steveolas @sohamchatterjee50 @SumitdevelopAI @wuyuqing0327 @RainieFu!

Thanks so much for your interest! Unfortunately, I'm not able to meet with everyone, but want to encourage you all to try and submit a proposal for this project if you're interested. Make sure to submit it thru the https://summerofcode.withgoogle.com/ website before 4/2!

If you're able to share a proposal in a Google Doc before then as well, I'll do my best to provide some feedback. You can send it via a DM on https://slack.cbioportal.org

Thanks so much all!

manheraa · 2024-03-28T20:52:46Z

Hi @inodb, I am Sachetan Heralagi, a student currently pursuing my undergraduate degree at KLE Institute of Technology.
In my academic journey, I have had the opportunity to explore various neural network architectures, from simple ANNs to more complex models like VGG19. One of the projects I take pride in is leading a team to develop an automated irrigation system using ANN technology for weather prediction and integration with Internet of Things (IoT) devices. Our project even made it to the finals of IEEE YESIST12, which was a significant achievement for us.
I have also recently worked on ovarian carcinoma subtype classification, in which we used transfer learning (i.e. VGG19).

I am particularly drawn to this project because I have worked with natural language processing and deep learning, and I am enthusiastic about the prospect of collaborating with like-minded individuals. I am also proficient in various ML and DL libraries such as TensorFlow, Keras, and PyTorch.

LinkedIn: www.linkedin.com/in/sachetan-heralagi
Email: manuheralagi4@gmail.com

TheMightyRaider · 2024-12-15T06:14:05Z

@inodb Nice explanation of the problem, but being non-technical with respect to the dataset, I have a little trouble understanding the objective.

To my understanding,

OncoTree is a classification system which assigns unique codes to various cancer types.
cBioPortal contains cancer datasets for various types [breast, lung, and so on] extracted from multiple sources. Each sample in the datasets is mapped to an OncoTree cancer type.

The problem statement would be,

For each OncoTree code, we would need to identify the genes that prominently help in classifying that code? For example, if IDC is the code, then BRCA1 and BRCA2 would be the prominent genes for it?
Find a way to validate the proposed genes?

It would be helpful if you could verify whether my understanding is right.

gowrimatadh5783 · 2024-12-16T19:22:50Z

@inodb Hey my name is Gowri Matadh, and I am interested in working on the project to generate recommended gene lists for OncoTree cancer types using AI. With a strong background in artificial intelligence and several IEEE-published papers, I am confident in my ability to contribute effectively to this project.

I have experience working with LLMs and data validation techniques, which I believe will be valuable in implementing and refining the proposed solution. I would love the opportunity to contribute to this important work.

Thank you for considering my request. I look forward to hearing from you.
mail: gowrimatadh@gmail.com

arkhamHack · 2025-03-03T04:16:37Z

Hi @inodb, I am Avigyan Sinha, an ML data software engineer. I graduated last year with a degree in computer science. I have experience in building production-grade ML data systems and would love to contribute to this project. Do let me know how to connect with you and the other mentors; I would love to discuss ideas with you.

CAT-ROM · 2025-03-06T16:00:35Z

Hi @inodb , I’m Roshini! I’m pursuing a dual degree in Data Science (IITM) and Biotechnology (GITAM). I have experience working with AI, LLMs, and bioinformatics, including prompt engineering and leveraging models for information extraction. The idea of using LLMs to generate gene recommendations for OncoTree classifications is fascinating, and I would love to collaborate on this project. Looking forward to contributing and learning from the team!

jhaayush2004 · 2025-03-10T20:19:00Z

Hi @inodb, I'm Ayush, a pre-final-year student passionate about AI at IIIT Ranchi, pursuing a B.Tech in CSE (specialization in AI and DS). I loved the problem statement on fine-tuning LLMs to generate recommended gene lists for OncoTree cancer types. With my background in RAG, LLM fine-tuning, prompt engineering, and API integrations, I believe I can contribute effectively to developing a robust gene recommendation system for cBioPortal.

Would it be effective to go with a hybrid strategy for gene suggestions, i.e. integrating RAG knowledge into the process to validate the response of the fine-tuned, prompted LLM? Alternatively, could we go with a multi-prompt ensemble method, i.e. instead of a single prompt per cancer type, we could experiment with multiple prompts, e.g. causality-based, classification-based, and so on?
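
One possible reading of the multi-prompt ensemble idea above, sketched under assumptions: each prompt variant yields its own gene list, and a gene is kept only if enough prompts agree on it. The function name `ensemble_genes` and the `min_votes` default are hypothetical placeholders, not something defined by the project.

```python
from collections import Counter


def ensemble_genes(gene_lists, min_votes=2):
    """Keep genes proposed by at least `min_votes` prompts, most-voted first.

    `gene_lists` holds one list of gene symbols per prompt variant.
    """
    votes = Counter()
    for genes in gene_lists:
        votes.update(set(genes))  # each prompt votes at most once per gene
    kept = [(gene, n) for gene, n in votes.items() if n >= min_votes]
    # Sort by descending vote count, then alphabetically for stable output.
    return [gene for gene, n in sorted(kept, key=lambda pair: (-pair[1], pair[0]))]
```

Simple vote counting is only one way to combine prompts; weighting prompts differently or keeping the per-gene vote count as a confidence score would be equally reasonable variations.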

Would love to hear your suggestions on these ideas! Also, is there any specific area where a contribution would be most impactful at this stage of the project?

Regards,
Ayush Shaurya Jha
LinkedIn: www.linkedin.com/in/shauryasphinx
Email: shauryasphinx@gmail.com

Bhavyaadusu · 2025-03-16T13:21:39Z

Hi @inodb, my name is Adusumilli Bhavya, and I am currently pursuing a B.Tech in Computational Biology at Mahindra University. My academic journey has provided me with a strong foundation in bioinformatics, computational genomics, structural biology, and machine learning applications in healthcare.

My interest in cancer genomics and molecular classification stems from my work on computational drug discovery and genomic variant annotation.
I am particularly drawn to this GSoC project because it lies at the intersection of computational genomics, AI-driven biomarker discovery, and clinical oncology research. The opportunity to work with LLMs for molecular subtyping of cancer directly aligns with my skills in bioinformatics, prompt engineering, and machine learning pipelines. Moreover, my experience with Linux, pipeline development, and HPC for large-scale genomic data analysis will help me implement automated validation workflows for this project.

My long-term goal is to advance computational approaches in cancer research, focusing on biomarker discovery, transcriptomic analysis, and AI-driven precision medicine. Through this project, I aim to contribute to the open-source cBioPortal community while refining my expertise in LLM-powered genomic data analysis. I am eager to collaborate with experienced mentors and researchers, learn from the community, and build a tool that enhances cancer classification and clinical decision-making.

Comfortade · 2025-03-16T22:15:52Z

Hi @inodb,
I'm Comfort, a biochemistry student with experience in applying machine learning to genomics, particularly in analyzing PPI data and multi-omics for predictive modeling using Python.

I'm really interested in contributing to this project. I do have a question - Is there an existing validation framework or feedback mechanism to assess the accuracy of AI-generated gene recommendations? Also, is there a defined confidence threshold for determining whether a gene should be included in the final list?

nomrat09 · 2025-03-18T19:24:29Z

Hi @inodb

I’m Nimrat, a biotech freshman with relevant experience in epigenetics, genome sequencing, and AI/ML applications in bioinformatics. I’ve previously worked on projects involving genetic data analysis, and this project caught my attention as it aligns closely with my interests in leveraging machine learning for gene alteration and expression-based subtyping.

The intersection of computational biology and AI is something I’m passionate about, especially in understanding complex biological patterns through data-driven approaches. I’m excited to contribute to this project and explore how LLMs can enhance the classification and subtyping of gene expression patterns.

Looking forward to collaborating with the team!

linkedin: https://www.linkedin.com/in/nimrat-k-191a6b25a/
email: nimratkr256@gmail.com

ABiiitH · 2025-03-21T10:50:56Z

Hi @inodb

I’m very interested in the project on integrating gnomAD variant annotations into cBioPortal, as it aligns well with my background in cancer genomics and interest in improving the interpretability of somatic mutation data. I was wondering if you see this integration being limited to adding gnomAD allele frequencies as annotations in mutation tables and OncoPrints, or is there scope to support additional functionality such as filtering out common germline variants during data processing or analysis? I’m curious about how deeply the gnomAD data is intended to influence mutation interpretation within the portal.

I’m Aryaman Bahl, a dual-degree student at IIIT Hyderabad, currently researching DNA methylation in breast cancer at the CCNSB Lab under Prof. Nita Parekh. I work extensively with RRBS data, DMR identification, and bioinformatics tools like Python, R, MethylKit, and Bismark. I’ve also contributed to projects at the intersection of machine learning and genomics, and I’m passionate about building tools that bridge biological data with meaningful insights. I’d be excited to contribute to this project and the broader cBioPortal ecosystem.

Looking forward to hearing from you!

Warm regards,
Aryaman Bahl
aryaman.bahl@research.iiit.ac.in

RayRishika · 2025-03-23T20:08:13Z

Hey @inodb , I’m Rishika, an engineering student passionate about computational biology and AI.

I have experience in Gen AI, LLMs, LangChain, and RAG techniques and have done courses on bioinformatics.

I am also working on some research initiatives, especially in genomics, precision medicine, and AI for biology. If there’s any opportunity where my skills in Python, SQL, RAG, and bioinformatics could be useful, I’d love to contribute!

We could build a Graph Neural Network (GNN)-based recommendation system where:
OncoTree cancer types, genes, and pathways form nodes in a heterogeneous knowledge graph.
We leverage RAG-based retrieval from cBioPortal and TCGA to validate LLM-generated outputs.

In this way, the final gene list is refined using GNNs or graph embeddings to predict new high-confidence gene associations.

Thank you for this opportunity, and I look forward to collaborating with the team!

Email: rayrishika07@gmail.com
LinkedIn: https://www.linkedin.com/in/rishika-ray-481a09289/

belatrix450666 · 2025-03-24T01:03:39Z

Hi @inodb,

I am Markéta, a Software Engineering student at Masaryk University, and I am very interested in contributing to cBioPortal. I have a strong passion for AI-driven solutions in precision medicine, and coming from a family with a medical background, I am particularly motivated to develop open-source AI tools that enhance clinical decision-making for medical professionals.

With experience in Gen AI, LLMs, and prompt engineering, I propose the following approach:

fine-tune an LLM or use a Graph Neural Network (GNN) to generate relevant gene lists for each OncoTree code
validate outputs against real patient data using the cBioPortal API
add a feedback loop with domain experts to refine AI-generated recommendations
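
A minimal sketch of what the validation step above might compute, assuming the altered sample IDs per gene have already been fetched from the cBioPortal API (the fetch itself is not shown here). The function names and the 5% threshold are hypothetical placeholders; no confidence threshold has been defined in this thread.

```python
def alteration_frequencies(altered_samples_by_gene, all_samples):
    """Fraction of samples of one OncoTree code that are altered in each gene.

    `altered_samples_by_gene` maps a gene symbol to the sample IDs in which
    it is altered; `all_samples` is the set of all sample IDs for that code.
    """
    total = len(all_samples)
    if total == 0:
        raise ValueError("no samples for this OncoTree code")
    return {
        gene: len(set(samples) & all_samples) / total  # ignore unknown/duplicate IDs
        for gene, samples in altered_samples_by_gene.items()
    }


def filter_supported(frequencies, min_frequency=0.05):
    """Return the genes whose alteration frequency reaches the threshold."""
    return sorted(gene for gene, freq in frequencies.items() if freq >= min_frequency)
```

Genes that an LLM proposes but that show no alterations in the matching samples would drop out here, which gives domain experts a shorter list to review in the feedback loop.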

I look forward to submitting my final proposal and hopefully working with you.

Best regards,
Markéta Schlemmerová
Email: marketaschlemmerova@gmail.com

MelikaaDastranj · 2025-03-24T01:23:11Z

Hi @inodb
I am Melika, a second year PhD student in Computer Science with a strong background in programming, algorithms, and research-driven problem solving. I’m excited about the opportunity to contribute to the gene fusion visualization project at cBioPortal, as it aligns with my technical skills and my long-term interest in health-focused applications.

My current research involves hyperspectral imaging (HSI), which is increasingly being used in medical diagnostics such as cancer detection. While my work has primarily focused on signal processing and high-dimensional data analysis, I see this project as a valuable step toward expanding my expertise into computational biology and genomic data visualization.

I am proficient in Python and have experience with JavaScript, and although I haven’t worked with React.js yet, I’m a quick learner and confident in my ability to adapt. I'm especially drawn to the challenge of enhancing the patient view to clearly communicate complex fusion events, and I’m eager to contribute to a platform that has real-world impact in cancer research.

This project offers a meaningful opportunity to grow technically while building toward a future career where computing and healthcare intersect.

Best,
Melika
Email: mdastra1@binghamton.edu

Comfortade · 2025-03-25T10:06:35Z

Hi @ao508
I'm Comfort, a biochemistry student with experience in applying machine learning to genomics, particularly in analyzing PPI data and multi-omics for predictive modeling using Python.

I'm really interested in contributing to this project. I do have a question - Is there an existing validation framework or feedback mechanism to assess the accuracy of AI-generated gene recommendations? Also, is there a defined confidence threshold for determining whether a gene should be included in the final list?

BhagyasriUddandam · 2025-03-25T16:12:56Z

Hi! I'm Bhagyasri, a recent CS grad with a passion for AI/ML. My projects in machine learning and data analysis make me excited about this research. I'm a recent AI/ML graduate passionate about leveraging large language models for scientific research. My background in machine learning and prompt engineering positions me to potentially enhance gene subtyping methodologies.

I propose exploring:

Systematic evaluation frameworks for LLM-generated gene lists
Performance comparisons across different AI models
Robust validation strategies for computational biology

Interested in collaboratively refining this innovative approach to cancer subtyping. Curious about potential contributions and next steps.

More details about my technical background: https://bhagii.vercel.app/

https://www.linkedin.com/in/bhagyasri-u/

Contact: bhagyasriuddandam@gmail.com

inodb added Python Size: Medium (175h) Size: Large (350h) Difficulty: Medium GSoC-2024 GSoC 2024 Candidate Projects Prompt Engineering labels Mar 20, 2024

inodb added GSoC-2025 GSoC 2025 Candidate Projects and removed GSoC-2024 GSoC 2024 Candidate Projects labels Jan 24, 2025

inodb added the cBioPortal label Feb 11, 2025
