Using Bacalhau with DuckDB
DuckDB is a relational (table-oriented) database management system that supports SQL queries for producing analytical results. It also comes with a variety of features that are useful for data analytics.
DuckDB is suited for the following use cases:
Processing and storing tabular datasets, e.g. from CSV or Parquet files
Interactive data analysis, e.g. joining and aggregating multiple large tables
Concurrent large changes to multiple large tables, e.g. appending rows, adding/removing/updating columns
Transferring large result sets to the client
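For example, DuckDB can query a Parquet file in place with `read_parquet`, with no import step. Here is a minimal sketch (the `passenger_count` column assumes the Yellow Taxi schema used later in this tutorial):

```sql
-- Count trips per passenger count, reading the Parquet file directly
SELECT passenger_count, COUNT(*) AS trips
FROM read_parquet('yellow_tripdata_2020-02.parquet')
GROUP BY passenger_count
ORDER BY passenger_count;
```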
In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don't need to install anything locally, and there is no need to download the datasets, since they are already available on the server where you want to run the query.
To get started, you need to install the Bacalhau client, see more information here
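If you don't have the client yet, the install script from the Bacalhau documentation is one way to get it:

```bash
# Install the Bacalhau CLI (see the documentation for other options)
curl -sL https://get.bacalhau.org/install.sh | bash
```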
We will be using a Bacalhau network set up following the standard Setting Up a Bacalhau Network guide.
We will also need to have a server with some data to run the query on. In this example, we will use a server with the Yellow Taxi Trips dataset.
If you do not already have this data on your server, you can download it using the scripts in the `prep_data` directory. The command to download the data is `./prep_data/run_download_jobs.sh`, and you must have the `/bacalhau_data` directory on your server.
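That is, from the directory containing `prep_data`:

```bash
# Requires the /bacalhau_data directory to exist on your server
./prep_data/run_download_jobs.sh
```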
To submit a job, run the following Bacalhau command:
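```bash
bacalhau docker run \
  -e QUERY="select 1" \
  docker.io/bacalhauproject/duckdb:latest
```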
This is a simple query that will return a single row with a single column - but the query will be executed in DuckDB, on a remote server.
Let's look closely at the command above:
- `bacalhau docker run`: call to Bacalhau
- `-e QUERY="select 1"`: the query to execute
- `docker.io/bacalhauproject/duckdb:latest`: the name and tag of the Docker image we are using
When a job is submitted, Bacalhau runs the query in DuckDB, and returns the results to the client.
After we run it, we can describe the job to see the query result in its standard output.
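For example (the job ID here is a placeholder; use the one printed when you submitted the job):

```bash
bacalhau job describe j-a1b2c3d4
```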
What if you didn't want to run everything on the command line? You can use a YAML file to define the job. In `simple_query.sql`, we have a simple query that will return the number of rows in the dataset.
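A row-count query of roughly this shape does the trick; the hard-coded path is an assumption for illustration, since in practice the filename is supplied through a template variable, as shown below:

```sql
SELECT COUNT(*) AS row_count
FROM read_parquet('/bacalhau_data/yellow_tripdata_2020-02.parquet');
```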
To run this query, we can use the following YAML file:
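Here is a sketch of what `duckdb_query_job.yaml` can look like. The structure follows Bacalhau's v1 job specification; the publisher, resource values, and environment variable names are illustrative assumptions:

```yaml
Name: duckdb-query
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: docker.io/bacalhauproject/duckdb:latest
        EnvironmentVariables:
          # Filled in via --template-vars at submission time
          - QUERY={{.QUERY}}
          - INPUT_FILE={{.filename}}
    InputSources:
      # Mount the server-side dataset directory into the container
      - Target: /bacalhau_data
        Source:
          Type: localDirectory
          Params:
            SourcePath: /bacalhau_data
    Publisher:
      Type: local
    Resources:
      CPU: "1"
      Memory: 4Gb
```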
Though this looks like a lot of code, it is actually quite simple. The `Tasks` section defines the task to run, and the `InputSources` section defines the input dataset. The `Publisher` section defines where the results will be published, and the `Resources` section defines the resources required for the job.
All the work is done in the environment variables, which are passed to the Docker image, and handed to DuckDB to execute the query.
To run this query, we can use the following command:
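```bash
bacalhau job run duckdb_query_job.yaml \
  --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet" \
  --template-vars="QUERY=$(cat simple_query.sql)"
```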
This breaks down into the following steps:
- `bacalhau job run`: call to Bacalhau
- `duckdb_query_job.yaml`: the YAML file we are using
- `--template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet"`: the file to read
- `--template-vars="QUERY=$(cat simple_query.sql)"`: the query to execute
When we run this, we get back the following simple output:
Let's say we want to run a more complex query. In `window_query_simple.sql`, we have a query that returns the average number of rides per 5-minute interval.
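A query along these lines produces that result. The `time_bucket` function is standard DuckDB and `tpep_pickup_datetime` is the standard Yellow Taxi column, but the exact file in the repository may differ, and remember it must be collapsed to a single line, as noted below:

```sql
-- Average rides per 5-minute interval across the whole month
WITH rides_per_interval AS (
    SELECT
        time_bucket(INTERVAL '5 minutes', tpep_pickup_datetime) AS interval_start,
        COUNT(*) AS rides
    FROM read_parquet('/bacalhau_data/yellow_tripdata_2020-02.parquet')
    GROUP BY interval_start
)
SELECT AVG(rides) AS avg_rides_per_interval
FROM rides_per_interval;
```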
When we run this, we get back the following output:
The SQL file needs to be written on a single line; otherwise, the line breaks will cause issues with the templating. We're working on improving this!
With this structure, you can now run virtually any query you want on remote servers without ever having to download the data. Welcome to compute over data with Bacalhau!
If you get stuck or have questions:
Check out the official Bacalhau Documentation
Open an issue in our GitHub repository
Join our Slack
We recommend using Expanso Cloud to create your network! But if you'd like to set up a cluster on your own, you can use our tool Andaime to do this too.
If you have any questions about the platform - please contact us on Slack or Email us!