Knowledge Vault 2/33 - ICLR 2014-2023
Chloé-Agathe Azencott ICLR 2017 - Invited Talk - High-dimensional feature selection in precision medicine
[Resume Image]

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:

```mermaid
graph LR
    classDef precision fill:#f9d4d4, font-weight:bold, font-size:14px;
    classDef example fill:#d4f9d4, font-weight:bold, font-size:14px;
    classDef data fill:#d4d4f9, font-weight:bold, font-size:14px;
    classDef knowledge fill:#f9f9d4, font-weight:bold, font-size:14px;
    classDef multitask fill:#f9d4f9, font-weight:bold, font-size:14px;
    classDef challenges fill:#d4f9f9, font-weight:bold, font-size:14px;
    classDef resources fill:#f9d4d4, font-weight:bold, font-size:14px;

    A[Chloé-Agathe Azencott ICLR 2017] --> B[Precision medicine: tailored treatments, genetic focus. 1]
    B --> C[Trastuzumab: precision medicine example, HER2 breast cancers. 2]
    A --> D[Data-driven biology/medicine: patient similarities, data selection. 3]
    D --> E[Cheaper sequencing, limited sample sizes vs features. 4]
    D --> F[Missing heritability: unidentified genetic factors, high-dimensional data. 5]
    A --> G[Prior knowledge constraints reduce dimensionality, improve interpretability. 6]
    G --> H[DNA info: linear, pathway, interaction, 3D structure. 7]
    G --> I[Binary feature selection: relevance scores, structured regularization. 8]
    I --> J[SCONES: constrained SNP selection, biological networks, minimum cut. 9]
    A --> K[Multitask approaches increase sample size, related traits. 10]
    K --> L[Multitask SCONES: feature similarity across tasks, minimum cut. 11]
    K --> M[Task similarity incorporates task relationships in feature selection. 12]
    K --> N[Multitask LASSO: task-independent and -specific weights, task descriptors. 13]
    A --> O[Feature stability crucial for interpretability, often overlooked. 14]
    A --> P[Complex patterns challenging with limited genomic samples. 15]
    A --> Q[P-values for selected features in complex models: open problem. 16]
    A --> R[Privacy concerns sharing genetic data, learning challenges. 17]
    A --> S[Heterogeneity complicates feature selection, requires data alignment, subgroups. 18]
    A --> T[Diverse data integration challenges interpretable predictive modeling. 19]
    A --> U[Polygenic risk scores common but limited, slow ML adoption. 20]
    A --> V[Microscopy challenges: cell segmentation, classification, automated analysis. 21]
    A --> W[EHRs: valuable but incomplete, time-series, multimodal genetic data. 22]
    A --> X[Resources for non-geneticists applying ML to genetics, precision medicine. 23]
    X --> Y[ML-geneticist collaboration needed for cancer research, other diseases. 24]
    A --> Z[GWAS emphasizes correlation over causation in biomarker identification. 25]
    A --> AA[Feature selection formalisms resemble Dempster-Shafer theory, incidence algebra. 26,27]
    A --> AB[Data uncertainty challenges in integrative models, varying error probabilities. 28,29]
    AB --> AC[Speaker invites input on methods for data uncertainty. 30]

    class A,B,C precision;
    class D,E,F data;
    class G,H,I,J knowledge;
    class K,L,M,N multitask;
    class O,P,Q,R,S,T,U,V,W challenges;
    class X,Y,Z,AA,AB,AC resources;
```

Resume:

1.-Precision medicine aims to adapt treatments to patient specifics like genetics, lifestyle, and environment, particularly focusing on genetic factors.

2.-Trastuzumab is an early example of precision medicine, working effectively against HER2-overexpressing breast cancers but not benefiting non-overexpressing patients.

3.-Data-driven biology and medicine look for what patients with similar phenotypes or outcomes have in common, which requires both data and suitable feature selection methods.

4.-Sequencing costs are decreasing, enabling larger-scale genome sequencing, but sample sizes remain limited compared to the number of features.

5.-Missing heritability refers to the inability to identify most genetic factors underlying inheritable traits, partly due to high-dimensional, low-sample statistics.

6.-Integrating prior biological knowledge as constraints on the feature space can help reduce dimensionality and improve interpretability of models.

7.-DNA has linear, group (pathway), interaction (gene/protein), and 3D structural information that can be used to constrain feature selection.

8.-Binary feature selection using relevance scores and structured regularization can efficiently incorporate large biological networks and handle noisy data.
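
A generic way to write this kind of objective (the notation below is chosen for illustration and is not necessarily the talk's exact formulation): select the subset of features that trades association strength against sparsity and network connectivity,

$$
\max_{S \subseteq \{1,\dots,p\}} \;\; \sum_{i \in S} c_i \;-\; \eta\,|S| \;-\; \lambda \sum_{i \in S} \sum_{j \notin S} W_{ij},
$$

where $c_i$ is the relevance score of feature $i$, $\eta$ controls sparsity, and $W$ is the adjacency matrix of a biological network, so the last term penalizes separating selected features from their network neighbours.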

9.-SConES (Selecting Connected Explanatory SNPs) performs constrained SNP selection on biological networks, solving a minimum cut problem.
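
Because the objective sketched above is a sum of per-feature terms and pairwise cut terms over binary indicators, it can be optimized exactly with an s/t minimum cut. The sketch below illustrates that general reduction with networkx; it is an illustration under my own graph-construction assumptions, not the authors' implementation.

```python
import networkx as nx
import numpy as np

def select_connected_features(scores, adjacency, eta, lam):
    """Select features by solving an s/t minimum cut on an augmented graph.

    scores    : length-p array of per-feature relevance scores c_i
    adjacency : symmetric (p, p) array W of biological-network edge weights
    eta       : sparsity penalty
    lam       : connectivity (graph-cut) penalty
    Returns the sorted indices of the selected features.
    """
    p = len(scores)
    gains = np.asarray(scores, dtype=float) - eta    # net benefit of keeping feature i

    G = nx.DiGraph()
    G.add_nodes_from(["s", "t"])
    for i in range(p):
        if gains[i] > 0:
            G.add_edge("s", i, capacity=gains[i])    # cutting i away from s forfeits its gain
        elif gains[i] < 0:
            G.add_edge(i, "t", capacity=-gains[i])   # keeping i with s costs |gain|

    for i in range(p):
        for j in range(i + 1, p):
            if adjacency[i, j] > 0:
                w = lam * adjacency[i, j]
                G.add_edge(i, j, capacity=w)         # separating network neighbours is penalized
                G.add_edge(j, i, capacity=w)

    _, (source_side, _) = nx.minimum_cut(G, "s", "t")
    return sorted(n for n in source_side if n != "s")

# Toy usage: three features on a chain network; the weakly associated middle
# feature is pulled in because it connects two strong features.
scores = np.array([2.0, 0.4, 2.0])
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(select_connected_features(scores, W, eta=0.5, lam=1.0))  # -> [0, 1, 2]
```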

10.-Multitask approaches can effectively increase sample size when multiple related traits or outcomes are available, such as in plant genetics.

11.-Multitask SConES enforces similarity of selected features across related tasks by extending the regularizer and solving via minimum cut.
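
One illustrative way to write such an extension (again, the notation is chosen here rather than taken from the talk): select one binary indicator vector per task and penalize related tasks that disagree on a feature,

$$
\max_{s^1,\dots,s^T \in \{0,1\}^p} \; \sum_{t=1}^{T} \left( \sum_{i} c_i^t s_i^t \;-\; \eta \sum_{i} s_i^t \;-\; \lambda \sum_{i,j:\,W_{ij}>0} W_{ij}\,\big[s_i^t \neq s_j^t\big] \right) \;-\; \mu \sum_{t \sim t'} \sum_{i} \big[s_i^t \neq s_i^{t'}\big],
$$

where $[\cdot]$ equals 1 when its argument holds. Every term is either unary or a pairwise disagreement penalty, so the whole problem can still be solved as a single minimum cut on a graph that contains one copy of each feature per task.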

12.-Task similarity can be further incorporated into multitask feature selection based on prior knowledge of task relationships.

13.-Multitask LASSO decomposes model weights into task-independent and task-specific components, enabling the use of task descriptors to drive the decomposition.
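
A common way to formalize this kind of decomposition (given here for illustration; the exact model in the talk may differ):

$$
\min_{w_0,\,v_1,\dots,v_T} \;\sum_{t=1}^{T} \big\lVert y_t - X_t\,(w_0 + v_t) \big\rVert_2^2 \;+\; \lambda_0 \lVert w_0 \rVert_1 \;+\; \lambda_1 \sum_{t=1}^{T} \lVert v_t \rVert_1,
$$

where $w_0$ is shared across all tasks and $v_t$ is specific to task $t$; when a descriptor vector is available for each task, $v_t$ can be parameterized as a function of that descriptor, which is how task descriptors can drive the decomposition.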

14.-Stability of selected features across data subsets is crucial for model interpretability and is often overlooked in feature selection.
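
As a concrete illustration of measuring stability, the sketch below resamples the data, reruns a selector, and reports the mean pairwise Jaccard overlap of the selected sets. The Lasso and the Jaccard criterion are choices made for this example only; the talk does not prescribe a particular selector or stability index.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def selection_stability(X, y, n_resamples=20, alpha=0.05, seed=0):
    """Estimate feature-selection stability as the mean pairwise Jaccard
    similarity of the feature sets selected on random half-subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected_sets = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)           # random half of the samples
        coef = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx]).coef_
        selected_sets.append(frozenset(np.flatnonzero(coef)))     # indices of nonzero weights
    jaccards = [
        len(a & b) / len(a | b) if (a | b) else 1.0
        for a, b in combinations(selected_sets, 2)
    ]
    return float(np.mean(jaccards))
```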

15.-Moving beyond additive models to capture more complex patterns is challenging with limited sample sizes in genomic data.

16.-Computing p-values for selected features in complex models is an open problem important to the statistical genetics community.

17.-Privacy is a major concern in sharing genetic data, and learning from privacy-protected data is a significant challenge.

18.-Heterogeneity in sample populations and data sources complicates feature selection and requires data alignment, normalization, and modeling subgroup differences.

19.-Integrating diverse data types like gene expression, methylation, images, and text poses challenges for interpretable predictive modeling.

20.-Risk prediction using polygenic risk scores is common but limited, with slow adoption of more complex machine learning models.
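
For context, a polygenic risk score is essentially a weighted allele count,

$$
\mathrm{PRS}_i \;=\; \sum_{j=1}^{m} \hat{\beta}_j \, x_{ij},
$$

where $x_{ij} \in \{0, 1, 2\}$ is the number of risk alleles individual $i$ carries at SNP $j$ and $\hat{\beta}_j$ is the marginal effect size estimated in a GWAS; the additivity and the reliance on per-SNP marginal estimates keep the score simple but also limit it relative to richer machine learning models.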

21.-Microscopy imaging data is increasingly available but presents unique challenges for automated analysis, such as cell segmentation and classification.

22.-Electronic health records contain valuable but incomplete, time-series, and multimodal data that could be combined with genetic information.

23.-The speaker provides resources for non-geneticists interested in applying machine learning to problems in genetics and precision medicine.

24.-Collaboration between machine learning experts and geneticists is needed to solve important problems in cancer research and other diseases.

25.-In genome-wide association studies, correlation is often emphasized over causation when identifying biomarkers for treatment selection or disease prognosis.

26.-The formalism for optimizing relevance scores and structured regularizers in feature selection resembles Dempster-Shafer theory, with potential for further exploration.

27.-The high dimensionality of genomic data poses challenges for optimization techniques in feature selection, potentially benefiting from incidence algebra methods.

28.-Uncertainty in different data sources, such as base call errors in genomic data or dynamic range issues in mass spectrometry, is not well-addressed in current integrative models.

29.-Integrating data sources with varying error probabilities across features (e.g., nucleotides, proteins) remains an open problem in precision medicine.

30.-The speaker invites input from the audience on methods for handling uncertainty in different data sources when integrating them for analysis.

Knowledge Vault built by David Vivancos 2024