nf-duckdbcache: DuckDB cache for Nextflow

This document provides an overview of the DuckDB Cache Plugin for Nextflow, which uses DuckDB as a backend for storing execution cache data.

Introduction

Nextflow’s DuckDB Cache Plugin enables the storage of task execution metadata in a DuckDB database, ensuring data persistence and scalability. This plugin is particularly useful for workflows that require robust, centralized caching mechanisms across distributed environments.

Changelog

2025-04-21 Version 1.0.0-rc2 released
2025-03-01 Initial version 0.0.1

Features

Centralized Cache Storage
Scalability
Fault Tolerance
Compatibility

Prerequisites

Before using the Cache Plugin, ensure the following requirements are met:

Installation

nextflow.config

plugins {
    id "nf-duckdbcache@1.0.0-rc2"
}

Configuration

nextflow.config

duckdbcache{
}

File

Specify the file where the cache is stored:

nextflow.config

duckdbcache{
    file = "nfduckdb.cache"
}

TIP If you have the duckdb command-line client installed, you can inspect the file with SQL commands such as select * from CACHE_ENTRIES or select * from INDEX_ENTRIES.
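The tip above can be sketched as a small shell helper. This is a minimal sketch, assuming the duckdb command-line client is on your PATH and that the cache file is named as in the configuration above. The function name inspect_duckdb_cache is hypothetical and is not part of the plugin.

```shell
#!/bin/sh
# Print the first rows of both cache tables from a DuckDB cache file.
# Assumption: the default file name matches the `file` setting shown above.
inspect_duckdb_cache() {
    cache_file="${1:-nfduckdb.cache}"
    if [ ! -f "$cache_file" ]; then
        echo "cache file not found: $cache_file" >&2
        return 1
    fi
    if ! command -v duckdb >/dev/null 2>&1; then
        echo "duckdb CLI not found on PATH" >&2
        return 2
    fi
    # -readonly opens the database without write access.
    for table in CACHE_ENTRIES INDEX_ENTRIES; do
        duckdb -readonly "$cache_file" "select * from $table limit 10;"
    done
}
```

Call it as inspect_duckdb_cache nfduckdb.cache from the directory that holds the cache file. Running it while a pipeline is still writing to the file may fail, because DuckDB restricts concurrent access from separate processes.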

PostgreSQL

Specify the PostgreSQL connection details:

nextflow.config

duckdbcache{
    postgre{
        enabled = true
        host = "127.0.0.1"
        user = "user1"
        password = "password1"
        database = "postgres"
    }
}

Use the enabled option to switch the PostgreSQL configuration on or off without removing it.

TIP With any database tool, you can query the cache using SQL commands as in the previous section.
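As one possible database tool, the psql client can run the query from the command line. This is a sketch only: the connection values are copied from the configuration above, the table name CACHE_ENTRIES is assumed to carry over from the File section, and the function name query_postgres_cache is hypothetical.

```shell
#!/bin/sh
# Run a SQL statement against the PostgreSQL cache backend.
# Assumptions: host, user, password and database come from the
# nextflow.config above; CACHE_ENTRIES is the table named in the File section.
query_postgres_cache() {
    sql="${1:-select count(*) from CACHE_ENTRIES}"
    if ! command -v psql >/dev/null 2>&1; then
        echo "psql not found on PATH" >&2
        return 2
    fi
    # psql reads the password from PGPASSWORD; prefer ~/.pgpass
    # outside of local testing.
    PGPASSWORD="password1" psql -h 127.0.0.1 -U user1 -d postgres -c "$sql"
}
```

Calling query_postgres_cache with no argument counts the cache rows; pass any other SQL statement as the first argument.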

SQLite

Specify the SQLite connection details:

nextflow.config

duckdbcache{
    sqlite{
        enabled = true
        file = "sqlite-cache.db"
    }
}

Use the enabled option to switch the SQLite configuration on or off without removing it.

TIP With the sqlite3 tool, you can query the cache using SQL commands as in the previous section.
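A minimal sketch of that sqlite3 query follows. It assumes the file name from the configuration above and that the SQLite backend uses the same table names as the File section. The function name inspect_sqlite_cache is hypothetical.

```shell
#!/bin/sh
# List the tables in the SQLite cache file, then count the cache rows.
# Assumption: CACHE_ENTRIES exists here as it does in the DuckDB file.
inspect_sqlite_cache() {
    cache_file="${1:-sqlite-cache.db}"
    if [ ! -f "$cache_file" ]; then
        echo "cache file not found: $cache_file" >&2
        return 1
    fi
    if ! command -v sqlite3 >/dev/null 2>&1; then
        echo "sqlite3 not found on PATH" >&2
        return 2
    fi
    # -readonly prevents accidental writes to the cache.
    sqlite3 -readonly "$cache_file" ".tables"
    sqlite3 -readonly "$cache_file" "select count(*) from CACHE_ENTRIES;"
}
```

The .tables output is a quick way to confirm which table names the plugin actually created before running further queries.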

Usage

Once the plugin is installed and configured, it will automatically handle caching for your Nextflow workflows.

Example

Here is an example of a simple workflow that uses the DuckDB Cache Plugin with the PostgreSQL backend:

process sayHello {
  input:
  val x
  output:
  stdout
  script:
  """
    echo $x
  """
}

workflow {
  Channel.of('Bonjour', 'Ciao', 'Hello', 'Hola') | sayHello | view
}

nextflow.config

duckdbcache{
    postgre{
        enabled = true
        host = "127.0.0.1"
        user = "user1"
        password = "password1"
        database = "postgres"
    }
}

Run the workflow with:

nextflow run myWorkflow.nf -resume

If you run the pipeline multiple times, you will see Nextflow reuse cached tasks and the pipeline finish faster.
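To make the reuse visible, the sketch below runs the workflow twice and counts the tasks reported as cached. It assumes that Nextflow, when run with -ansi-log false, prints one "Cached process" line per reused task; check this against the output of your Nextflow version. The function names count_cached and run_twice are hypothetical.

```shell
#!/bin/sh
# Count the "Cached process" lines read from standard input.
# grep -c exits non-zero when the count is 0, so the status is masked.
count_cached() {
    grep -c 'Cached process' || true
}

# Run the example workflow twice with -resume and report cached tasks.
# Assumption: myWorkflow.nf and nextflow.config are in the current directory.
run_twice() {
    for attempt in 1 2; do
        cached=$(nextflow run myWorkflow.nf -resume -ansi-log false | count_cached)
        echo "run $attempt: $cached cached task(s)"
    done
}
```

With an empty cache, the first run should report no cached tasks, and the second run should report one cached task per input value.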

Console log

You can check the cache status using the log command:

nextflow plugin nf-duckdbcache:log

TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                              COMMAND
2025-01-28 10:06:38     11.9s           elated_noyce            ERR     5a90025ed9      5276a9cb-1999-4b95-bf8d-563a22373710    nextflow run main.nf -resume -c local.config
2025-01-28 10:07:02     11.6s           distracted_mcnulty      ERR     5a90025ed9      5276a9cb-1999-4b95-bf8d-563a22373710    nextflow run main.nf -resume -c local.config
2025-01-28 10:20:30     11.5s           crazy_wright            ERR     5a90025ed9      5276a9cb-1999-4b95-bf8d-563a22373710    nextflow run main.nf -resume -c local.config

If you use a recent Nextflow version, you can pass parameters to the log command:

nextflow plugin nf-duckdbcache:log --s '<->'

nextflow plugin nf-duckdbcache:log --fields timestamp,runame

The idea is to support the same standard parameters as the nextflow log command.

Troubleshooting

If you encounter issues, ensure the following:

License

This plugin is licensed under the MIT License.

Contributing

Contributions are welcome! Please submit issues or pull requests to the project’s GitHub repository.

Contact

For support, contact the Incremental Steps Software Solutions team or refer to the plugin documentation at https://incsteps.github.io/nf-duckdbcache/index.html