This document provides an overview of the DuckDB Cache Plugin for Nextflow, which uses DuckDB as a backend for storing execution cache data.
Introduction
Nextflow’s DuckDB Cache Plugin enables the storage of task execution metadata in a DuckDB database, ensuring data persistence and scalability. This plugin is particularly useful for workflows that require robust, centralized caching mechanisms across distributed environments.
Changelog
- 2025-04-21 Version 1.0.0-rc2 released
- 2025-03-01 Initial version 0.0.1
Features
- Centralized Cache Storage
- Scalability
- Fault Tolerance
- Compatibility
Prerequisites
Before using the Cache Plugin, ensure the following requirements are met:
Installation
plugins {
    id "nf-duckdbcache@1.0.0-rc2"
}
Configuration
duckdbcache{
}
File
Specify the file where the cache is stored:
duckdbcache{
    file = "nfduckdb.cache"
}
TIP If you have the `duckdb` command-line client installed, you can inspect the file with SQL commands such as `select * from CACHE_ENTRIES` or `select * from INDEX_ENTRIES`.
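As a rough illustration of the tip above, the following Python sketch shells out to the `duckdb` command-line client and returns the rows of one cache table as CSV text. It assumes that the `duckdb` executable is on the PATH and that it accepts the `-readonly` and `-csv` options followed by the database file and an SQL statement. The helper names `build_duckdb_command` and `dump_table` are invented for this example and are not part of the plugin.

```python
import shutil
import subprocess
from pathlib import Path

# Table names taken from the tip above. Their columns are not documented here,
# so the sketch only ever uses "select *".
CACHE_TABLES = ("CACHE_ENTRIES", "INDEX_ENTRIES")


def build_duckdb_command(db_file, table):
    """Return the argv list for a read-only duckdb CLI query (hypothetical helper)."""
    if table not in CACHE_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    return ["duckdb", "-readonly", "-csv", str(db_file), f"select * from {table}"]


def dump_table(db_file, table):
    """Run the query and return its CSV output, or None if it cannot be run."""
    if shutil.which("duckdb") is None or not Path(db_file).is_file():
        return None  # CLI not installed, or the cache file has not been created yet
    result = subprocess.run(
        build_duckdb_command(db_file, table),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "nfduckdb.cache" is the file name used in the configuration above.
    for name in CACHE_TABLES:
        print(name, dump_table("nfduckdb.cache", name), sep="\n")
```

DuckDB normally lets only one process hold a database file open for writing, so a check like this is probably best run while no pipeline is using the cache.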
PostgreSQL
Specify the PostgreSQL connection details:
duckdbcache{
    postgre{
        enabled = true
        host = "127.0.0.1"
        user = "user1"
        password = "password1"
        database = "postgres"
    }
}
The `enabled` option lets you switch the PostgreSQL configuration on or off easily.
TIP With any database tool that can connect to PostgreSQL, you can query the cache using the same SQL commands as in the previous section.
SQLite
Specify the SQLite connection details:
duckdbcache{
    sqlite{
        enabled = true
        file = "sqlite-cache.db"
    }
}
The `enabled` option lets you switch the SQLite configuration on or off easily.
TIP With the `sqlite3` tool you can query the cache using the same SQL commands as in the previous section.
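The same inspection can be done without the `sqlite3` tool. The sketch below uses Python's standard `sqlite3` module to count the rows in the cache tables. It assumes that the SQLite backend uses the same table names as the DuckDB file (`CACHE_ENTRIES` and `INDEX_ENTRIES`), which the tip above suggests but does not state explicitly. `count_rows` is a hypothetical helper name.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path

TABLES = ("CACHE_ENTRIES", "INDEX_ENTRIES")  # assumed table names


def count_rows(db_file):
    """Return {table: row_count} for each assumed cache table found in db_file."""
    counts = {}
    # mode=ro opens the file read-only, so the cache cannot be modified by accident.
    uri = Path(db_file).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        existing = {
            row[0].upper()
            for row in conn.execute("select name from sqlite_master where type = 'table'")
        }
        for table in TABLES:
            if table in existing:
                counts[table] = conn.execute(f"select count(*) from {table}").fetchone()[0]
    return counts


if __name__ == "__main__":
    # "sqlite-cache.db" is the file name used in the configuration above.
    if os.path.exists("sqlite-cache.db"):
        print(count_rows("sqlite-cache.db"))
```

Tables that are missing are simply left out of the result, so the function can also serve as a quick check of whether the assumed names match what the plugin actually created.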
Usage
Once the plugin is installed and configured, it will automatically handle caching for your Nextflow workflows.
Example
Here is an example of a simple workflow using the DuckDB Cache Plugin with the PostgreSQL backend:
process sayHello {
    input:
    val x

    output:
    stdout

    script:
    """
    echo $x
    """
}

workflow {
    Channel.of('Bonjour', 'Ciao', 'Hello', 'Hola') | sayHello | view
}
with the following configuration:
duckdbcache{
    postgre{
        enabled = true
        host = "127.0.0.1"
        user = "user1"
        password = "password1"
        database = "postgres"
    }
}
Run the workflow with:
nextflow run myWorkflow.nf -resume
If you run the pipeline several times, you will see Nextflow reuse cached tasks and the pipeline finish faster.
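To make the speed-up visible, one option is to time two consecutive runs. The following Python sketch does that for the command shown above. `time_command` is a hypothetical helper, and the script assumes that `nextflow` is on the PATH and that `myWorkflow.nf` is in the current directory.

```python
import shutil
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    command = ["nextflow", "run", "myWorkflow.nf", "-resume"]
    if shutil.which(command[0]) is None:
        print("nextflow not found on PATH; nothing to time")
    else:
        for attempt in (1, 2):
            code, seconds = time_command(command)
            print(f"run {attempt}: exit code {code}, {seconds:.1f}s")
```

The second run would be expected to report cached tasks and finish sooner, although the size of the difference depends on the workload.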
Console log
You can check the cache status using the `log` command:
nextflow plugin nf-duckdbcache:log
TIMESTAMP            DURATION  RUN NAME            STATUS  REVISION ID  SESSION ID                            COMMAND
2025-01-28 10:06:38  11.9s     elated_noyce        ERR     5a90025ed9   5276a9cb-1999-4b95-bf8d-563a22373710  nextflow run main.nf -resume -c local.config
2025-01-28 10:07:02  11.6s     distracted_mcnulty  ERR     5a90025ed9   5276a9cb-1999-4b95-bf8d-563a22373710  nextflow run main.nf -resume -c local.config
2025-01-28 10:20:30  11.5s     crazy_wright        ERR     5a90025ed9   5276a9cb-1999-4b95-bf8d-563a22373710  nextflow run main.nf -resume -c local.config
If you use a recent Nextflow version, you can pass parameters to the `log` command:
`nextflow plugin nf-duckdbcache:log --s '<→'`
`nextflow plugin nf-duckdbcache:log --fields timestamp,runame`
The idea is to support the same standard parameters as the `nextflow log` command.
Troubleshooting
If you encounter issues, ensure the following:
License
This plugin is licensed under the MIT License.
Contributing
Contributions are welcome! Please submit issues or pull requests to the project’s GitHub repository.
Contact
For support, contact the Incremental Steps Software Solutions team or refer to the plugin documentation at https://incsteps.github.io/nf-duckdbcache/index.html