Meetings

Recent preprints

  • on2vec: Ontology Embeddings with Graph Neural Networks and Sentence Transformers

    Ontologies provide structured vocabularies and relationships essential for organizing biological knowledge, yet their symbolic nature limits integration with modern machine learning methods. Leveraging recent advances in graph neural networks (GNNs) and transformer-based language models, we present on2vec, a toolkit developed during the DBCLS BioHackathon 2025 for generating vector embeddings from OWL ontologies. on2vec integrates structural information from ontology hierarchies with semantic features from textual annotations using HuggingFace Sentence Transformers, producing domain-aware embeddings suitable for downstream biomedical applications and ontology-based reasoning tasks.
  • AI in Practice: Insights from a Community Survey of Biohackathon Participants

    Understanding the practical application of artificial intelligence (AI) in research is increasingly important as it becomes embedded in life sciences and bioinformatics. This paper reports on a multilingual survey, developed through community discussions at the 2025 BioHackathon in Japan and distributed through its networks, to capture current practices, successes, and challenges in AI adoption. The survey, offered in English, Japanese, and Thai, received 105 responses spanning diverse demographics, regions, and professional backgrounds. Findings reveal that most participants are frequent AI users; tools such as ChatGPT, Gemini, and Claude are widely adopted, with ChatGPT the most common response. AI is primarily used to assist or draft tasks in coding, research, and writing, while full task automation remains uncommon, reflecting a preference for AI as a collaborative aid rather than a replacement. Successes were noted in efficiency, coding support, and proposal writing, whereas challenges centered on accuracy and reliability. Institutional support emerged as a key factor: respondents in Japan, Thailand, and the private sector reported stronger support and higher satisfaction than English-speaking or academic counterparts. By documenting real-world practices and concerns, this survey provides a valuable community-driven resource to guide responsible AI development and foster international collaboration in bioinformatics.
  • Translating and Formalizing the MIRAGE Guidelines to a Prototype MIRAGE Ontology and DCAT3 Extension Vocabulary for Glycomics Data Management

    The Minimum Information Required for A Glycomics Experiment (MIRAGE) guidelines have established comprehensive reporting standards for glycomics research, yet their implementation in semantic web technologies remains limited. We present the first comprehensive semantic formalization of MIRAGE guidelines through an integrated RDF ontology framework comprising the MIRAGE Ontology and MIRAGE-DCAT3 vocabulary. The MIRAGE Ontology models glycan structures, biological specimens, analytical instruments, and experimental processes with formal OWL semantics and SHACL validation constraints. The complementary MIRAGE-DCAT3 vocabulary extends W3C DCAT3 with glycomics-specific metadata properties for dataset cataloging and discovery. Our implementation addresses critical challenges in glycomics data interoperability through comprehensive mappings to established ontologies including GlycoRDF, PSI-MS, and DCTERMS. This semantic framework enables automated quality assessment, federated data querying, and enhanced reproducibility in glycomics research, supporting broader adoption of FAIR principles in the glycobiology community. The framework demonstrates comprehensive coverage of MIRAGE reporting requirements across multiple analytical platforms including mass spectrometry, liquid chromatography, capillary electrophoresis, NMR spectroscopy, and lectin microarray analysis.
  • DBCLS BioHackathon 2025 report: Creation and Publication Analytical Workflow of Creators' Interests

    At the DBCLS BioHackathon 2025, we converted metatranscriptomic analytical shell scripts into Common Workflow Language (CWL) containerized with Docker. Sub-workflows were created for metagenomic assembly, read mapping, and gene annotation, and validated with test datasets. The workflows, released on GitHub and WorkflowHub, improve reproducibility and address issues of reusability and software environment dependency. We also evaluated CWL best practices from the perspective of life scientists, classifying them by difficulty, importance, and applicability to promote FAIR principles and software quality. In parallel, we established a benchmarking framework for pangenome-based structural variant (SV) calling using data from the Dai population. Graph-based references from the Human and Chinese Pangenome Consortia were compared with linear references using minimap2 and vg giraffe. Results showed improved alignment accuracy and variant detection with pangenomes, demonstrating their value for reducing mapping bias and enhancing SV discovery.
  • A Standards-Compliant, Multi-Modal Platform for Offline Access to SRA Metadata

    The SRAmetaDBB project, presented at BioHackathon Japan 2023, introduced an experimental JavaScript pipeline for creating SQLite databases from NCBI SRA (Sequence Read Archive) metadata dumps, with a vision for offline analysis and integration with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems. While promising, the prototype faced significant challenges in performance, memory management, and production readiness when scaling to the full SRA dataset of over 45 million records. This paper presents SRAKE (SRA Knowledge Engine), a complete reimplementation in Go that not only addresses these limitations but also extends the original vision with semantic search capabilities, quality control mechanisms, and multiple access interfaces. SRAKE achieves a 20-fold improvement in ingestion speed, maintains constant memory usage through zero-copy streaming, and provides standards-compliant interfaces following clig.dev guidelines. The platform introduces biomedical-specific semantic search using SapBERT embeddings via ONNX Runtime, implements comprehensive quality control thresholds for search results, and offers multiple access modalities including a CLI, REST API, MCP server for AI integration, and a simple web interface. Our development implementation demonstrates that SRAKE successfully transforms the experimental SRAmetaDBB concept into a production-ready platform that integrates seamlessly with modern AI workflows while maintaining the core vision of providing offline-capable, LLM-ready access to SRA metadata.
  • A Lightweight PURL Resolver for Linked Life Science Data

    Knowledge graphs in the life sciences are increasingly published using the Resource Description Framework (RDF) and queried via SPARQL endpoints. While these technologies enable powerful data integration, the identifiers returned in SPARQL results often do not resolve to meaningful resources, leaving users with non-actionable links. To address this issue, we developed a lightweight Persistent Uniform Resource Locator (PURL) resolver during BioHackathon Japan 2025. The resolver is implemented in PHP, chosen for its ubiquity on standard web servers and its compatibility with the EasyRDF library for RDF handling. It is easy to configure, requires minimal maintenance, and supports both database redirects and ontology term rendering with content negotiation for RDF serializations. The system is available as open-source software (https://github.com/JKoblitz/purl-resolver) and deployed at https://purl.dsmz.de, where it now resolves most identifiers from the DSMZ Digital Diversity SPARQL endpoint (https://sparql.dsmz.de). Database IRIs lead to the corresponding web interfaces, ontology IRIs from the DSMZ Digital Diversity Ontology render directly as term pages, and unmapped entities are delegated to database-side resolvers. This approach enhances the usability of knowledge graphs by ensuring that all identifiers remain actionable for both humans and machines.
  • AI for Computational Biology: Highlights from the first BioAI Hackathon at University of Warsaw

    The BioAI Hackathon at the Centre of New Technologies at the University of Warsaw convened 43 international researchers to collaboratively explore artificial intelligence (AI) approaches for solving complex challenges in computational biology. Nine interdisciplinary and multi-institutional teams addressed the following problems: disease-gene prioritization, microbiome analysis, drug-protein interactions, alternative splicing prediction, chromatin architecture study, and toxicological profiling. Using cutting-edge tools such as graph neural networks (GNNs), large language models (LLMs), and multi-omics integration frameworks, participants developed scalable and reproducible analytical pipelines. The results include a disease-gene prioritization framework using GNNs, a microbiome dynamics analysis for poultry health prediction, and the construction of chromatin structure-aware regulatory networks. All projects follow open science principles and display translational potential. This hackathon underscores the transformative role of AI in biomedicine and the value of collaborative, time-bounded innovation for accelerating discovery in life sciences. All projects are publicly available on GitHub: https://github.com/SFGLab