Ad-hoc log query using DuckDB
Remote Log Analysis using DuckDB and Bacalhau
This guide provides an overview of using DuckDB with Bacalhau for remote log analysis. By leveraging these tools, you can perform detailed analyses without the need to download datasets locally.
Overview
DuckDB is an in-process SQL database management system built for analytical queries. Bacalhau facilitates decentralized job execution, meaning you can run jobs remotely without having to log in or build complicated services. Together, they are well suited to remote, ad-hoc analysis of data that stays on the server.
Prerequisites
You will need a Bacalhau cluster with the following configuration:
1. Run a DuckDB job on Bacalhau
To run a DuckDB job on Bacalhau, all you need is a container image that includes DuckDB. The submission command has the following structure.
Structure of the Command
bacalhau docker run: command to run a Docker container on Bacalhau
davidgasquez/datadex:v0.2.0: name and tag of the Docker image
duckdb -s "select 1": the DuckDB CLI command to execute
When a job is submitted, Bacalhau prints out the related job_id. We store that ID in an environment variable (JOB_ID) so that we can reuse it later on.
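The pieces listed above can be assembled into a single submission command. The following is a minimal sketch rather than the exact command from this guide: the helper name submit_duckdb_job is hypothetical, and the --id-only and --wait flags (print only the job ID, and block until the job finishes) are assumed to be available in your Bacalhau CLI version.

```shell
# Hypothetical helper: submits the "select 1" DuckDB job described above.
# Assumption: --id-only makes the CLI print just the job ID, and --wait
# blocks until the job finishes; confirm with `bacalhau docker run --help`.
submit_duckdb_job() {
  bacalhau docker run \
    --id-only \
    --wait \
    davidgasquez/datadex:v0.2.0 \
    -- duckdb -s "select 1"
}

# Usage (requires a reachable Bacalhau cluster):
#   export JOB_ID="$(submit_duckdb_job)"
```

Wrapping the call in a function keeps the command in one place and makes it easy to capture the printed ID in JOB_ID for the later steps.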
3. Declarative Job Description
The same job can be submitted in a declarative format. Create a YAML file (e.g., duckdb1.yaml) with the following content:
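As an illustration of what such a file might contain, the sketch below writes duckdb1.yaml from the shell. It assumes the field names of the Bacalhau v1.x job specification (Name, Type, Count, Tasks, Engine) and that the image provides /bin/bash; the job name and task name are hypothetical. Check the result against the job specification for your CLI version before relying on it.

```shell
# Write a declarative spec for the same "select 1" query.
# Assumptions: Bacalhau v1.x job spec field names; /bin/bash exists in the
# image; "DuckDB Hello World" and "main" are illustrative names only.
cat > duckdb1.yaml <<'EOF'
Name: DuckDB Hello World
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: davidgasquez/datadex:v0.2.0
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select 1"
EOF
```

The quoted heredoc delimiter ('EOF') stops the shell from expanding anything inside the spec, so the file is written exactly as shown.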
Then submit the job with the command bacalhau job run duckdb1.yaml.
4. Checking the State of Your Jobs
Job Status: check the status of the job. When it says Published or Completed, the job is done, and we can fetch the results.
Job Information: get more details about your job.
Job Download: download your job results.
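The three checks above map onto subcommands of bacalhau job. The helpers below are a hedged sketch: their names are hypothetical, and the --id-filter flag on job list is an assumption carried over from older CLI releases. If your version lacks it, bacalhau job describe also reports the job state.

```shell
# Hypothetical helpers for the three checks above. Each takes a job ID.

job_status() {
  # Assumption: --id-filter narrows the listing to the given job ID.
  bacalhau job list --id-filter "$1"
}

job_info() {
  # Print the detailed description of the job.
  bacalhau job describe "$1"
}

job_download() {
  # Fetch the job results into ./results (created if it does not exist).
  mkdir -p results
  bacalhau job get "$1" --output-dir results
}

# Usage:
#   job_status "$JOB_ID"
#   job_info "$JOB_ID"
#   job_download "$JOB_ID"
```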
5. Viewing Your Job Output
Each job creates 3 subfolders in your results directory:
combined_results
per_shard
raw
To view the output file:
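Where the output file sits inside the results directory depends on the Bacalhau version, so the sketch below searches for it instead of hard-coding a path. It assumes the job's standard output is saved in a file named stdout and that results were downloaded to ./results; the helper name show_job_stdout is hypothetical.

```shell
# Hypothetical helper: print every file named "stdout" under a results
# directory (defaults to ./results).
# Assumption: the job's standard output is saved under that file name;
# the subfolder it lands in may differ between CLI versions.
show_job_stdout() {
  find "${1:-results}" -type f -name stdout -exec cat {} +
}

# Usage:
#   show_job_stdout results
```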
Expected output:
6. Running Arbitrary SQL Commands
Below is an example command to run arbitrary SQL queries against the NYC Yellow Taxi Trips dataset. This dataset is hosted on IPFS for demonstration purposes.
Structure of the Command
bacalhau docker run: command to run a Docker container on Bacalhau
-i ipfs://...: specifies the IPFS CID of the data so Bacalhau can mount it at /inputs
--workdir /inputs: sets the working directory inside the container to /inputs
davidgasquez/duckdb:latest: Docker image with DuckDB installed
duckdb -s: the DuckDB CLI command to execute
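The parts listed above can be combined as in the following sketch. The dataset CID and the SQL text are deliberately left as arguments, because this guide does not spell them out; the helper name run_duckdb_query is hypothetical, and --id-only and --wait are assumed to exist in your Bacalhau CLI version.

```shell
# Hypothetical helper: run one SQL statement against data mounted from IPFS.
#   $1 = IPFS CID of the dataset (supply your own; none is assumed here)
#   $2 = SQL text, with file paths relative to /inputs
# Assumption: --id-only and --wait are supported by your Bacalhau CLI.
run_duckdb_query() {
  bacalhau docker run \
    -i "ipfs://$1" \
    --workdir /inputs \
    --id-only \
    --wait \
    davidgasquez/duckdb:latest \
    -- duckdb -s "$2"
}

# Usage (values in angle brackets are placeholders to fill in):
#   export JOB_ID="$(run_duckdb_query "<dataset-cid>" "select count(*) from '<file>.parquet'")"
```

Because the working directory is /inputs, the query can refer to the mounted files by relative path.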
7. Declarative Job Description for Arbitrary SQL
You can also present the above job in YAML format. For example, duckdb2.yaml:
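A possible shape for this file is sketched below, written from the shell. It assumes the Bacalhau v1.x job specification, including the InputSources block and the WorkingDirectory engine parameter. The job and task names are hypothetical, and the values in angle brackets are placeholders that must be replaced with your dataset CID and file name before submitting.

```shell
# Write a declarative spec for a SQL query over IPFS-hosted data.
# Assumptions: Bacalhau v1.x field names (InputSources, WorkingDirectory);
# /bin/bash exists in the image; names are illustrative only.
# Replace <dataset-cid> and <file> before submitting the job.
cat > duckdb2.yaml <<'EOF'
Name: DuckDB Arbitrary SQL
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: davidgasquez/duckdb:latest
        WorkingDirectory: /inputs
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select count(*) from '<file>.parquet'"
    InputSources:
      - Target: /inputs
        Source:
          Type: ipfs
          Params:
            CID: <dataset-cid>
EOF
```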
Then submit the job with the command bacalhau job run duckdb2.yaml.
8. Checking Job Status, Describing, Downloading Results
Check the job status, get the job information, download the results, and view the job output in the same way as in sections 4 and 5, using the ID of this job.
Sample output might look like this:
Need Support?
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (look for the #general channel).