Meetings
BioHackSWAT4HCLS 2025
BioHackathon Europe 2025
4th BioHackathon Germany
DBCLS BioHackathon 2025
ELIXIR INTOXICOM
Recent preprints
BioHackEU24 report: Creating user benefit from ARC-ISA RO-Crate machine-actionability & Increasing FAIRness of digital agrosystem resources by extending Bioschemas
As part of BioHackathon Europe 2024, we report here on the progress that projects 19 and 24 made during the event. For the purpose of this report we present the abstract of each project and then go into more detail on the work carried out during the BioHackathon.
Enhancing multi-omic analyses through a federated microbiome analysis service
Multi-omics datasets are an increasingly prevalent and necessary resource for achieving scientific advances in microbial ecosystem research. However, they present twin challenges to research infrastructures: firstly, the utility of multi-omics datasets relies entirely on interoperability of omics layers, i.e. on formalised data linking. Secondly, microbiome-derived data typically lead to computationally expensive analyses, and so rely on the availability of high performance computing (HPC) or cloud infrastructures. These challenges can be better met by combining the resources of multiple groups, services and infrastructures. In this BioHackathon Europe 2024 project, we envisioned a “federated microbiome analysis service” and worked on three tracks of development towards it: mapping metagenomics metadata standards to Schema.org and Bioschemas terms, rendering Nextflow workflow executions as RO-Crates, and tooling for creating, viewing and interlinking human-readable RO-Crate previews.
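To make the RO-Crate track more concrete, here is a minimal sketch of an RO-Crate metadata file describing a hypothetical workflow run output. The file names, identifiers, and dataset contents are illustrative assumptions, not the crates actually produced during the project.

```python
import json

# Minimal sketch of an RO-Crate metadata file (ro-crate-metadata.json)
# describing a hypothetical metagenomics workflow run with one output file.
# All identifiers and file names below are illustrative, not the project's crates.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example metagenomics workflow run",
            "hasPart": [{"@id": "results/taxonomy.tsv"}],
        },
        {
            "@id": "results/taxonomy.tsv",
            "@type": "File",
            "name": "Taxonomic assignments",
            "encodingFormat": "text/tab-separated-values",
        },
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```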
1st SpatialData Developer Workshop
This pre-print is aimed at sharing the results of the “1st SpatialData workshop,” an in-person event organized by the SpatialData team and funded by the Chan Zuckerberg Initiative (CZI) that brought together expertise from different fields, including methods developers of a variety of tools for single-cell and spatial omics. The purpose is to explore new directions to advance the field of spatial omics. By leveraging multiple programming languages, including Python, R, and JavaScript, the event focuses on four central hackathon tracks:
- R interoperability: This track aims to enhance the integration and compatibility of R and Python with the SpatialData Python framework by using the language-agnostic SpatialData Zarr file format (which follows, when possible, the NGFF specification).
- Visualization interoperability: This track is dedicated to improving the seamless integration of visualization tools across different systems and programming languages via a tool-agnostic view configuration.
- Scalability and benchmarking: Participants will identify, benchmark, and address computational bottlenecks within the SpatialData framework.
- Ergonomics and user-friendliness: This track focuses on enhancing the usability and accessibility of the SpatialData framework for both first-time users and third-party developers.
These tracks aim to foster collaboration and innovation, driving advancements in the analysis and infrastructure of spatial omics and imaging data.
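As a small illustration of the language-agnostic Zarr storage that the interoperability tracks build on, the sketch below assumes the scverse `spatialdata` Python package is installed and that a SpatialData-formatted Zarr store exists at a hypothetical path; it is not code from the workshop itself.

```python
# Minimal sketch, assuming the scverse `spatialdata` package is available and a
# SpatialData Zarr store exists at the hypothetical path "example_dataset.zarr".
import spatialdata as sd

# Read a language-agnostic SpatialData Zarr store (NGFF-aligned where possible).
sdata = sd.read_zarr("example_dataset.zarr")

# Inspect the elements (images, labels, shapes, points, tables) it contains.
print(sdata)
```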
On the value of data
What is the value of a dataset? This is a key question for many data management decisions. It is a difficult question to answer, as “it depends”: on the needs, the consumer, other data, and many other factors. This work aims at sketching an approach to evaluating data that is based on trade-off decisions between different aspects of the data, with respect to different usage scenarios. This work has been developed as a “collaborative paper” (on the value of data, a collaborative experiment). The version reported here is the result of the discussions at the BioHackathon in Fukushima, 2024.
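As a purely illustrative sketch of trade-off-based evaluation (not the method proposed in the preprint), one could score a dataset on a few aspects and weight those scores differently per usage scenario; all aspect names and weights below are assumptions.

```python
# Purely illustrative sketch of trade-off-based data valuation (not the
# preprint's actual method): score a dataset on a few aspects, then weight
# those scores differently depending on the usage scenario.
aspects = {"completeness": 0.8, "timeliness": 0.4, "licensing_openness": 1.0}

# Hypothetical per-scenario weights expressing different trade-offs.
scenarios = {
    "exploratory_research": {"completeness": 0.5, "timeliness": 0.2, "licensing_openness": 0.3},
    "operational_monitoring": {"completeness": 0.2, "timeliness": 0.7, "licensing_openness": 0.1},
}

for name, weights in scenarios.items():
    value = sum(aspects[a] * w for a, w in weights.items())
    print(f"{name}: weighted value = {value:.2f}")
```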
Exploring Bioinformatics in the Wild: Insights from Real LLM Conversations
The intersection of artificial intelligence (AI) and conversational data offers promising opportunities for advancing research in specialized fields such as biology and health sciences. The WildChat dataset, comprising over one million user-ChatGPT interactions, serves as a valuable resource for analyzing how advanced language models engage with complex topics. This work aims to explore how conversational AI models interpret and manage bioinformatics-related queries, assessing their effectiveness and identifying areas for improvement. By filtering and analyzing bioinformatics-related interactions within WildChat, the study highlights the current capabilities and limitations of these models, providing insights into their potential roles in supporting and enhancing research, education, and practical applications in bioinformatics and biology. Key findings include that GPT-3.5 Turbo can save both time and money while still providing satisfactory performance in handling bioinformatics-related queries, making it a cost-effective option for many applications. However, models like Llama 3 8B Instruct and Mistral 7B Instruct were found to underperform in comparison, struggling with the specialized vocabulary and nuanced contexts inherent in bioinformatics. Additionally, it was observed that Anthropic’s Claude model is notably harder to jailbreak, suggesting stronger safeguards against misuse, which is crucial for maintaining the integrity of conversational AI in sensitive domains. Expanding the scope of conversational datasets to include a broader range of detailed interactions is crucial for developing more robust, context-aware bioinformatics tools. This investigation not only underscores the strengths and weaknesses of current conversational AI systems but also offers a roadmap for future improvements, ultimately contributing to the evolving interface between AI technology and bioinformatics.
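A simplified, self-contained sketch of the kind of keyword-based filtering such a study might use is shown below; the keyword list, conversation structure, and toy data are assumptions and do not reproduce the preprint's actual pipeline over WildChat.

```python
# Simplified sketch of keyword-based filtering of chat conversations for
# bioinformatics-related content. The keyword list and conversation structure
# are illustrative assumptions, not the study's pipeline.
BIOINFO_KEYWORDS = {
    "fastq", "bam", "vcf", "blast", "alignment", "genome", "rna-seq",
    "bioinformatics", "samtools", "bwa", "variant calling", "phylogenetic",
}

conversations = [
    {"id": 1, "turns": ["How do I convert a SAM file to BAM with samtools?"]},
    {"id": 2, "turns": ["Write me a poem about autumn."]},
    {"id": 3, "turns": ["Explain how variant calling works from a VCF file."]},
]

def is_bioinformatics(conv):
    """Return True if any turn mentions a bioinformatics keyword."""
    text = " ".join(conv["turns"]).lower()
    return any(kw in text for kw in BIOINFO_KEYWORDS)

bioinfo_convs = [c for c in conversations if is_bioinformatics(c)]
print([c["id"] for c in bioinfo_convs])  # -> [1, 3]
```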
BioHack24 report: Toward improving mechanisms for extracting RDF shapes from large inputs
RDF shapes have proven to be effective mechanisms for describing and validating RDF content. Typically, shapes are written by domain experts. However, writing and maintaining these shapes can be challenging when dealing with large and complex schemas. To address this issue, automatic shape extractors have been proposed. These tools are designed to analyze existing RDF content and generate shapes that conform with the underlying schemas. Nevertheless, extracting shapes from large datasets presents significant scalability challenges. In this document, we describe our work during the 2024 BioHackathon held in Fukushima, Japan, to tackle this problem. Our approach is based on slicing the input data, performing parallelized shape extraction processes, and merging the resulting partial outputs. By refining our software and methods, we successfully extracted shapes from a subset of UniProt, containing an estimated 15.9 billion triples.
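The slice / parallel-extract / merge pattern described above can be sketched as follows; the toy triples and the reduction of a "shape" to per-predicate counts are simplifying assumptions, not the project's actual software.

```python
# Illustrative sketch of the slice / extract-in-parallel / merge pattern for
# shape discovery over N-Triples input. The "shape" is reduced to a count of
# predicates seen per chunk, a simplification of what real shape extractors compute.
from collections import Counter
from multiprocessing import Pool

def extract_partial(lines):
    """Count predicate occurrences within one chunk of N-Triples lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts

def chunk(seq, size):
    """Slice the input into fixed-size chunks."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    # Hypothetical toy input; a real run would stream slices of the dataset.
    triples = [
        '<s1> <http://purl.uniprot.org/core/mnemonic> "A" .',
        '<s1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.uniprot.org/core/Protein> .',
        '<s2> <http://purl.uniprot.org/core/mnemonic> "B" .',
    ]
    with Pool(processes=2) as pool:
        partials = pool.map(extract_partial, list(chunk(triples, 2)))
    merged = sum(partials, Counter())  # merge the partial outputs
    print(merged.most_common())
```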
BioHack24 report: Using discovered RDF schemes: a compilation of potential use cases for shapes reusage
RDF shapes are formal expressions of schema structures in RDF data. Their primary purpose is twofold: describing and validating RDF data. However, as machine-readable representations of the expected structures in a given data source, RDF shapes can be applied to various tasks that require automatic comprehension of data schemas. In this paper, we present our work conducted during the DBCLS BioHackathon 2024 in Fukushima, Japan, to harness the potential of RDF shapes. The identified and partially implemented use cases include the generation and validation of SPARQL queries, data and schema visualization, mappings to other formal syntaxes, and applications in data modeling scenarios.
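A minimal sketch of one such use case, generating a SPARQL query from a shape, is shown below; it uses a simplified dict-based shape representation rather than parsed ShEx or SHACL, and the class and predicate IRIs are illustrative.

```python
# Minimal sketch of the "generate SPARQL from a shape" use case. The shape is
# represented as a plain dict (target class plus expected predicates) instead
# of parsed ShEx/SHACL, so this illustrates the idea rather than a real tool.
shape = {
    "target_class": "http://purl.uniprot.org/core/Protein",
    "predicates": [
        "http://purl.uniprot.org/core/mnemonic",
        "http://purl.uniprot.org/core/organism",
    ],
}

def shape_to_select_query(shape, limit=10):
    """Build a SPARQL SELECT retrieving instances with the shape's properties."""
    var_names = [f"?v{i}" for i in range(len(shape["predicates"]))]
    patterns = "\n  ".join(
        f"?s <{p}> {v} ." for p, v in zip(shape["predicates"], var_names)
    )
    return (
        f"SELECT ?s {' '.join(var_names)} WHERE {{\n"
        f"  ?s a <{shape['target_class']}> .\n"
        f"  {patterns}\n"
        f"}} LIMIT {limit}"
    )

print(shape_to_select_query(shape))
```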