Log Management

This section covers the following guides:

- Execute an ad-hoc log query using DuckDB
- Process logs on hundreds of servers using DuckDB
- Perform remote user analysis using log windowing
- Upload logs from servers to MotherDuck
- Upload logs from servers to Snowflake
- Upload logs from servers to MongoDB
- Upload logs from servers to Redshift
This guide provides an overview of using DuckDB with Bacalhau for remote log analysis. By leveraging these tools, you can perform detailed analyses without the need to download datasets locally.
DuckDB is a fast in-process SQL database management system designed for data analytics. Bacalhau facilitates decentralized job execution, meaning you can run jobs remotely without having to log in to each machine or build complicated services. Together, they make a powerful toolkit for remote, ad-hoc server interaction.
You will need a running Bacalhau cluster, along with the Bacalhau CLI installed and configured to submit jobs to it.
To run a DuckDB job on Bacalhau, all you need to do is use the DuckDB container. To submit a job, run the following command:
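A minimal sketch of that command, reconstructed from the flag breakdown below (the `--id-only` flag, which makes Bacalhau print only the job ID, is added here so the ID can be captured directly):

```bash
# Submit the job and keep only the job ID for the later steps.
export JOB_ID=$(bacalhau docker run \
  --id-only \
  davidgasquez/datadex:v0.2.0 \
  -- duckdb -s "select 1")
echo "${JOB_ID}"
```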
- `bacalhau docker run`: command to run a Docker container on Bacalhau
- `davidgasquez/datadex:v0.2.0`: the name and tag of the Docker image
- `duckdb -s "select 1"`: the DuckDB CLI command to execute
When a job is submitted, Bacalhau prints out the related `job_id`. We store that ID in an environment variable (`JOB_ID`) so that we can reuse it later on.
The same job can be submitted in a declarative format. Create a YAML file (e.g., `duckdb1.yaml`) with the following content:
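A sketch of such a spec, using Bacalhau's declarative batch-job format (the job and task names are arbitrary, and wrapping the query in a bash entrypoint is an assumption; field casing can vary between Bacalhau versions):

```yaml
name: DuckDB Hello World
type: batch
count: 1
tasks:
  - name: main
    engine:
      type: docker
      params:
        Image: davidgasquez/datadex:v0.2.0
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select 1"
```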
Then run the command:
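Assuming the spec was saved as `duckdb1.yaml`:

```bash
bacalhau job run duckdb1.yaml
```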
Job Status

Check the status of the job:
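With older Bacalhau CLIs this is done via `list`; newer versions report the same state through `bacalhau job describe`:

```bash
bacalhau list --id-filter ${JOB_ID}
```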
When it says `Published` or `Completed`, the job is done, and we can fetch the results.
Job Information

Get more details about your job:
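With the older CLI this is `describe` (newer versions use `bacalhau job describe`):

```bash
bacalhau describe ${JOB_ID}
```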
Job Download

Download your job results:
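A typical pattern is to download into a clean directory; the `--output-dir` flag names the destination:

```bash
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```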
Each job creates 3 subfolders in your results directory:

- `combined_results`
- `per_shard`
- `raw`
To view the output file:
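Assuming results were downloaded into `results/` as above, the query's stdout lands in the combined results folder:

```bash
cat results/combined_results/stdout
```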
Expected output:
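DuckDB's boxed table output for `select 1` looks roughly like this:

```
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘
```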
Below is an example command to run arbitrary SQL queries against the NYC Yellow Taxi Trips dataset. This dataset is hosted on IPFS for demonstration purposes.
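A sketch of the command (the dataset CID is elided here, so substitute the real one; the parquet filename is an assumption for illustration):

```bash
# Mount the IPFS dataset at /inputs and run the query from there.
export JOB_ID=$(bacalhau docker run \
  -i ipfs://<dataset-CID> \
  --workdir /inputs \
  --id-only \
  davidgasquez/duckdb:latest \
  -- duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'")
```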
- `bacalhau docker run`: command to run a Docker container on Bacalhau
- `-i ipfs://...`: specifies an IPFS CID so Bacalhau can mount the data at `/inputs`
- `--workdir /inputs`: sets the working directory inside the container to `/inputs`
- `davidgasquez/duckdb:latest`: the Docker image with DuckDB installed
- `duckdb -s`: the DuckDB CLI command to execute
You can also express the above job in YAML format, for example in `duckdb2.yaml`:
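A sketch of the spec, mirroring the imperative command above (the input-source layout and field casing may vary slightly between Bacalhau versions, and the CID placeholder must be replaced):

```yaml
name: DuckDB Parquet Query
type: batch
count: 1
tasks:
  - name: main
    engine:
      type: docker
      params:
        Image: davidgasquez/duckdb:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - cd /inputs && duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'"
    inputSources:
      - target: /inputs
        source:
          type: ipfs
          params:
            cid: <dataset-CID>
```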
Then run:
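```bash
bacalhau job run duckdb2.yaml
```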
Job Status
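Check the status of the job, as before:

```bash
bacalhau list --id-filter ${JOB_ID}
```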
Job Information
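Get more details about the job:

```bash
bacalhau describe ${JOB_ID}
```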
Job Download
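Download the results into a fresh directory:

```bash
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```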
Sample output might look like this:
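For a `count(*)` over the taxi trips parquet file, expect a single-cell table; the exact count depends on the dataset snapshot you mount:

```
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     24648499 │
└──────────────┘
```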
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (look for the #general channel).