Using Bacalhau with DuckDB

Introduction

DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.

DuckDB is suited for the following use cases:

  • Processing and storing tabular datasets, e.g. from CSV or Parquet files

  • Interactive data analysis, e.g. joining and aggregating multiple large tables

  • Concurrent large changes to multiple large tables, e.g. appending rows, adding/removing/updating columns

  • Large result set transfers to the client

In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don't need to install DuckDB locally, and there is no need to download the datasets since they are already available on the server where you run the query.

Prerequisites

To get started, you need to install the Bacalhau client; see more information here.
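
If you don't have the client installed yet, the one-line installer from the documentation is typically all you need:

curl -sL https://get.bacalhau.org/install.sh | bash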

We will be using a Bacalhau network set up by following the standard setting up Bacalhau network guide.

We will also need to have a server with some data to run the query on. In this example, we will use a server with the Yellow Taxi Trips dataset.

If you do not already have this data on your server, you can download it using the scripts in the prep_data directory. The command to download the data is ./prep_data/run_download_jobs.sh, and you must have a /bacalhau_data directory on your server.
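
Once the download completes, you can confirm the dataset is in place; the filename below is the one used later in this tutorial:

ls /bacalhau_data/yellow_tripdata_2020-02.parquet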

Running the Query

To submit a job, run the following Bacalhau command:

bacalhau docker run -e QUERY="select 1" docker.io/bacalhauproject/duckdb:latest

This is a simple query that will return a single row with a single column, but the query will be executed in DuckDB on a remote server.

Structure of the command

Let's look closely at the command above:

  • bacalhau docker run: the Bacalhau command to submit a Docker-based job

  • -e QUERY="select 1": the query to execute

  • docker.io/bacalhauproject/duckdb:latest: the name and tag of the Docker image we are using

When a job is submitted, Bacalhau runs the query in DuckDB, and returns the results to the client.

After we run it and describe the job, we can see the following in standard output:

Standard Output
β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚   1   β”‚
β”‚ int32 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚     1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜
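
To see this output yourself, take the job ID that bacalhau docker run prints and describe the job:

bacalhau job describe <job-id>

If you only need the job's output, bacalhau job logs <job-id> prints just that.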

Running with a YAML file

What if you didn't want to run everything on the command line? You can use a YAML file to define the job. In simple_query.sql, we have a simple query that will return the number of rows in the dataset.

-- simple_query.sql
SELECT COUNT(*) AS row_count FROM yellow_taxi_trips;

To run this query, we can use the following YAML file:

# duckdb_query_job.yaml
Tasks:
  - Engine:
      Params:
        Image: docker.io/bacalhauproject/duckdb:latest
        WorkingDirectory: ""
        EnvironmentVariables:
          - QUERY=WITH yellow_taxi_trips AS (SELECT * FROM read_parquet('{{ .filename }}')) {{ .query }}
      Type: docker
    Name: duckdb-query-job
    InputSources:
      - Source:
          Type: "localDirectory"
          Params:
            SourcePath: "/bacalhau_data"
            ReadWrite: true
        Target: "/bacalhau_data"
    Publisher:
      Type: "local"
      Params:
        TargetPath: "/bacalhau_data"
    Network:
      Type: Full
    Resources:
      CPU: 2000m
      Memory: 2048Mi
    Timeouts: {}
Type: batch
Count: 1

Though this looks like a lot of code, it is actually quite simple. The Tasks section defines the task to run, and the InputSources section defines the input dataset. The Publisher section defines where the results will be published, and the Resources section defines the resources required for the job.

All the work is done in the environment variables, which are passed to the Docker image, and handed to DuckDB to execute the query.
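
Inside the container, the entrypoint only needs to hand that variable to DuckDB. Here is a minimal sketch of what such an entrypoint could look like (illustrative only, not the actual entrypoint of the bacalhauproject/duckdb image):

#!/bin/sh
# Hypothetical entrypoint: execute the SQL passed in via the QUERY env var
if [ -z "$QUERY" ]; then
  echo "QUERY environment variable is empty" >&2
  exit 1
fi
exec duckdb -c "$QUERY"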

To run this query, we can use the following command:

bacalhau job run duckdb_query_job.yaml --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet" --template-vars="query=$(cat simple_query.sql)"

This breaks down into the following steps:

  1. bacalhau job run: the Bacalhau command to run a job from a job spec

  2. duckdb_query_job.yaml: the YAML file we are using

  3. --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet": the file to read

  4. --template-vars="query=$(cat simple_query.sql)": the query to execute
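
After template substitution, the QUERY environment variable the container receives is the fully assembled statement:

QUERY=WITH yellow_taxi_trips AS (SELECT * FROM read_parquet('/bacalhau_data/yellow_tripdata_2020-02.parquet')) SELECT COUNT(*) AS row_count FROM yellow_taxi_trips;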

When we run this, we get back the following simple output:

Standard Output
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ row_count β”‚
β”‚   int64   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   6299367 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

More complex queries

Let's say we want to run a more complex query. In window_query_simple.sql, we have a query that will return the number of rides in each 5-minute interval.

-- window_query_simple.sql
SELECT
    DATE_TRUNC('hour', tpep_pickup_datetime) + INTERVAL (FLOOR(EXTRACT(MINUTE FROM tpep_pickup_datetime) / 5) * 5) MINUTE AS interval_start,
    COUNT(*) AS ride_count
FROM
    yellow_taxi_trips
GROUP BY
    interval_start
ORDER BY
    interval_start;
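
To run it, we can reuse the same job spec, collapsing the SQL to a single line to sidestep the templating caveat noted below:

bacalhau job run duckdb_query_job.yaml --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet" --template-vars="query=$(tr '\n' ' ' < window_query_simple.sql)"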

When we run this, we get back the following output:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   interval_start    β”‚ ride_count β”‚
β”‚      timestamp      β”‚   int64    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2008-12-31 22:20:00 β”‚          1 β”‚
β”‚ 2008-12-31 23:00:00 β”‚          1 β”‚
β”‚ 2008-12-31 23:05:00 β”‚          1 β”‚
β”‚ 2008-12-31 23:10:00 β”‚          1 β”‚
β”‚ 2008-12-31 23:15:00 β”‚          1 β”‚
β”‚ 2008-12-31 23:30:00 β”‚          1 β”‚
β”‚ 2009-01-01 00:00:00 β”‚          3 β”‚
β”‚ 2009-01-01 00:05:00 β”‚          3 β”‚
β”‚ 2009-01-01 00:15:00 β”‚          1 β”‚
β”‚ 2009-01-01 00:40:00 β”‚          1 β”‚
β”‚ 2009-01-01 01:15:00 β”‚          1 β”‚
β”‚ 2009-01-01 01:20:00 β”‚          1 β”‚
β”‚ 2009-01-01 01:35:00 β”‚          1 β”‚
β”‚ 2009-01-01 01:40:00 β”‚          1 β”‚
β”‚ 2009-01-01 02:00:00 β”‚          2 β”‚
β”‚ 2009-01-01 02:15:00 β”‚          1 β”‚
β”‚ 2009-01-01 04:05:00 β”‚          1 β”‚
β”‚ 2009-01-01 04:15:00 β”‚          2 β”‚
β”‚ 2009-01-01 04:45:00 β”‚          1 β”‚
β”‚ 2009-01-01 06:30:00 β”‚          1 β”‚
β”‚          Β·          β”‚          Β· β”‚
β”‚          Β·          β”‚          Β· β”‚
β”‚          Β·          β”‚          Β· β”‚
β”‚ 2020-03-05 12:15:00 β”‚          1 β”‚

The SQL file needs to be collapsed to a single line; otherwise, the line breaks will cause issues with the templating. We're working on improving this!

With this structure, you can now run virtually any query you want on remote servers, without ever having to download the data. Welcome to compute over data by Bacalhau!

Need Help?

If you get stuck or have questions:

  • Check out the official Bacalhau Documentation

  • Open an issue in our GitHub repository

  • Join our Slack