Paper List

Bioinformatics

Pharmacophore-based design by learning on voxel grids

Unknown

This paper addresses the computational bottleneck and limited novelty in conventional pharmacophore-based virtual screening by introducing a voxel cap...
Bioinformatics

CONFIDE: Hallucination Assessment for Reliable Biomolecular Structure Prediction and Design

Unknown

This paper addresses the critical limitation of current protein structure prediction models (like AlphaFold3) where high-confidence scores (pLDDT) can...
Bioinformatics

On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks

Unknown

This paper addresses the core challenge of developing computationally efficient and scalable neural network architectures that can learn accurate phyl...
Bioinformatics

EcoCast: A Spatio-Temporal Model for Continual Biodiversity and Climate Risk Forecasting

Unknown

This paper addresses the critical bottleneck in conservation: the lack of timely, high-resolution, near-term forecasts of species distribution shifts ...
Machine Learning

Training Dynamics of Learning 3D-Rotational Equivariance

Unknown

This work addresses the core dilemma of whether to use computationally expensive equivariant architectures or faster symmetry-agnostic models with dat...
Bioinformatics

Fast and Accurate Node-Age Estimation Under Fossil Calibration Uncertainty Using the Adjusted Pairwise Likelihood

Unknown

This paper addresses the dual challenge of computational inefficiency and sensitivity to fossil calibration errors in Bayesian divergence time estimat...
Bioinformatics

Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training

Unknown

This paper addresses the core challenge of accurately predicting protein fitness with only a handful of experimental observations, where data collecti...
Bioinformatics

scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing

Unknown

This paper addresses the critical gap of fragmented and non-standardized benchmarking in single-cell RNA-seq clustering, which hinders objective compa...

Journal: ArXiv Preprint

Published: Unknown

Bioinformatics, Computational Biology

STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings

Department of Computer Engineering, Bogazici University, Istanbul, Turkiye

Mehmet Efe Akca, Gökçe Uludoğan, Arzucan Özgür, İnci M. Baytaş

The 30-Second View

IN SHORT: This paper addresses the core challenge of generalizing protein function prediction to unseen or newly introduced Gene Ontology (GO) terms by overcoming the limitations of existing models that either prioritize graph structure at the expense of semantic meaning or vice versa.

Innovation (TL;DR)

Methodology: Introduces a novel GO embedding module that integrates textual definitions (via SBERT-BioBERT) with ontology graph structure through a multi-task autoencoder, learning unified representations that preserve both semantic similarity and hierarchical dependencies.
Methodology: Proposes a hierarchical Transformer decoder that processes GO terms in topological order (ancestors to descendants) using causal self-attention, enabling information propagation across ontology levels and capturing functional dependencies.
Biology: Demonstrates superior zero-shot generalization to unseen GO terms, particularly for Molecular Function and Biological Process terms, by effectively leveraging semantic information from textual definitions, which transfers better to novel ontology concepts than purely structural embeddings.
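
The ancestors-to-descendants ordering and the causal attention pattern described above can be illustrated with a minimal sketch. The toy ontology, the term names, and the helper functions `topological_order` and `causal_mask` are hypothetical and are not taken from the STAR-GO code. The sketch also assumes a plain lower-triangular mask over the ordered terms; whether the paper further restricts attention to true ancestors is not stated in this summary.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy ontology: each term maps to its direct parents.
parents = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "nuclease": {"catalysis", "dna_binding"},
}


def topological_order(parent_map):
    """Return the terms so that every ancestor precedes its descendants."""
    # TopologicalSorter expects node -> predecessors, which matches term -> parents.
    return list(TopologicalSorter(parent_map).static_order())


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


order = topological_order(parents)
mask = causal_mask(len(order))

for term, row in zip(order, mask):
    visible = [order[j] for j, allowed in enumerate(row) if allowed]
    print(f"{term:12s} attends to {visible}")
```

Because the sequence is topologically sorted, every ancestor of a term sits at an earlier position, so a causal mask guarantees that a term can read from all of its ancestors and never from its descendants.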

Key conclusions

STAR-GO achieves state-of-the-art or competitive performance across all three GO subontologies (BP, CC, MF), with the highest AUC scores (e.g., 0.989 for BP, 0.988 for CC, 0.995 for MF), indicating strong term-level discriminability.
In zero-shot evaluation on 16 held-out GO terms, STAR-GO variants achieve the highest AUCs in 13 cases, significantly outperforming baselines like DeepGOZero and DeepGO-SE, demonstrating superior generalization to unseen functions.
Ablation studies reveal that semantic embeddings (STAR_T) achieve the best zero-shot results for most MF and BP terms (e.g., AUC of 0.949 for GO:0001228), while structural embeddings (STAR_S) perform best for a few terms but poorly for MF, highlighting the critical role of semantic information for generalization.
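
As a rough illustration of what a term-level AUC measures, the sketch below computes AUC separately for each GO term and then averages the results. It assumes the reported scores follow this per-term convention, which the summary does not confirm. The term names, labels, and scores are invented for illustration, and `roc_auc` and `term_centric_auc` are hypothetical helpers rather than functions from the paper's code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def term_centric_auc(labels_by_term, scores_by_term):
    """Compute one AUC per term across proteins, plus the unweighted mean."""
    aucs = {t: roc_auc(labels_by_term[t], scores_by_term[t]) for t in labels_by_term}
    return aucs, sum(aucs.values()) / len(aucs)


# Invented data: four proteins scored against two hypothetical terms.
labels_by_term = {"term_a": [0, 0, 1, 1], "term_b": [1, 0, 1, 0]}
scores_by_term = {"term_a": [0.1, 0.4, 0.35, 0.8], "term_b": [0.9, 0.2, 0.7, 0.3]}

per_term, mean_auc = term_centric_auc(labels_by_term, scores_by_term)
print(per_term, mean_auc)
```

Under this reading, an AUC near 1.0 for a term means the model almost always scores proteins annotated with that term above proteins that lack it, independent of any decision threshold.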

Background and Gap: Existing protein function prediction models often treat GO terms as independent labels or rely on GO embedding techniques that prioritize either graph structure (e.g., anc2Vec) or semantic content (e.g., SBERT-BioBERT), limiting their ability to generalize to unseen terms and adapt to evolving ontologies.

Abstract:

Motivation: Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and making the previously trained models outdated.

Results: We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence–function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction.

Availability: Code and pre-trained models are available at https://github.com/boun-tabi-lifelu/stargo.
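
One way to picture why aligning GO representations with protein embeddings permits zero-shot prediction is the sketch below. It assumes proteins and GO terms live in a shared vector space and that a term is scored with a sigmoid over a dot product. That scoring rule, the vectors, the term names, and the helpers `score` and `rank_terms` are illustrative assumptions, not the scoring function or API of STAR-GO.

```python
import math


def sigmoid(x):
    """Map a real-valued similarity to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score(protein_vec, term_vec):
    """Assumed scoring rule: sigmoid of the protein/term dot product."""
    if len(protein_vec) != len(term_vec):
        raise ValueError("protein and term embeddings must share a dimension")
    return sigmoid(sum(p * t for p, t in zip(protein_vec, term_vec)))


def rank_terms(protein_vec, term_vecs):
    """Score every term for one protein and sort from most to least likely."""
    scored = {name: score(protein_vec, vec) for name, vec in term_vecs.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Invented embeddings. "unseen_term" stands for a term added after training:
# it only needs an embedding (for example, from its definition and parents)
# to be scored, with no change to the protein side.
protein = [1.0, 0.5, -0.5]
term_vecs = {
    "seen_term": [0.8, 0.2, 0.0],
    "unseen_term": [1.0, 1.0, -1.0],
    "unrelated_term": [-1.0, 0.0, 1.0],
}

ranking = rank_terms(protein, term_vecs)
for name, prob in ranking:
    print(f"{name:15s} {prob:.3f}")
```

The point of the design is that a new GO term requires only a new term embedding, so the set of predictable functions can grow with the ontology without retraining a fixed-size classification head.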
