Bacalhau is an open-source distributed compute orchestration framework designed to bring compute to the data. Instead of moving large datasets around networks, Bacalhau makes it easy to execute jobs close to the data's location, drastically reducing latency and resource overhead.
Highly Distributed Architecture: Deploy compute networks that span regions, cloud providers, and on-premises datacenters—all working together as a unified system.
Resilient Operation: Compute nodes operate effectively even with intermittent connectivity to orchestrators, maintaining service availability during network partitioning or isolation.
Data Sovereignty & Security: Process sensitive data within security boundaries without requiring it to leave your premises, enabling computation while preserving data control.
Cross-Organizational Computation: Allow specific vetted computations on protected datasets without exposing raw data, breaking data silos between organizations.
Resource Efficiency: By minimizing data transfers, Bacalhau saves bandwidth costs and ensures jobs run faster.
High Scalability: As your data and processing needs grow, simply add more compute nodes on demand—whether on-premises or in the cloud.
Ease of Integration: Bacalhau works with existing container images (Docker, etc.), meaning you can leverage your current workflows without major rewrites.
Single Binary Simplicity: Bacalhau is a single self-contained binary that functions as a client, orchestrator, and compute node—making it incredibly easy to set up and scale your distributed compute network.
Modular Architecture: Bacalhau's design supports multiple execution engines (Docker, WebAssembly) and storage providers through clean interfaces, allowing for easy extension.
Orchestrator-Compute Model: A dedicated orchestrator coordinates job scheduling, while compute nodes run tasks—all from the same binary with different runtime modes.
Flexible Storage Integrations: Bacalhau integrates with S3, HTTP/HTTPS, and other storage systems, letting you pull data from various sources.
Multiple Job Types: Support for batch, ops, daemon, and service job types to accommodate different workflow requirements.
Declarative & Imperative Submissions: Define jobs in a YAML spec (declarative) or pass all arguments via CLI (imperative).
Publisher Support: Output results to local volumes, S3, or other storage backends—so your artifacts are readily accessible.
Bacalhau's distributed compute framework enables a wide range of applications across different industries.
Bacalhau's architecture enables you to create compute networks that bridge traditional infrastructure boundaries. When you submit a job, Bacalhau intelligently determines which compute nodes are best positioned to process the data based on locality, availability, and your defined constraints—without requiring data movement or constant connectivity.
This approach is particularly valuable for:
Organizations with data that cannot leave certain security boundaries
Multi-region operations where data transfer is expensive or impractical
Scenarios where multiple parties need to collaborate on analysis without sharing raw data
Edge computing environments with intermittent connectivity
Bacalhau has a very friendly community and we are always happy to help you get started:
GitHub Discussions – ask anything about the project, give feedback, or answer questions that will help other users.
Join the Slack Community and head to the #bacalhau channel – it is the easiest way to engage with other members of the community and get help.
Contributing – learn how to contribute to the Bacalhau project.
This Quick Start guide shows how to run your first Bacalhau job with minimal setup. Bacalhau's design as a single self-contained binary makes it incredibly easy to set up your own distributed compute network in minutes.
Docker installed on any machine that runs a compute node
Bacalhau CLI installed (see below)
Once installed, verify with:
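```bash
# prints the local binary version and, if connected, the orchestrator version
bacalhau version
```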
Open a terminal and run:
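```bash
bacalhau serve --orchestrator --compute
```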
This command launches both an orchestrator and a compute node in one process
Keep it running; you'll see logs indicating it's ready
Bacalhau supports two primary methods of job submission: Imperative (CLI) and Declarative (YAML). We'll demonstrate a word count job on the classic novel Moby Dick.
Create a word-count.yaml file:
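A minimal sketch of what this spec can look like (field names follow the declarative job format and may vary slightly by version; the dataset URL is a placeholder for the text you want to analyze):

```yaml
Name: word-count
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - tr -s '[:space:]' '\n' < /inputs/moby-dick.txt | sort | uniq -c | sort -rn | head -n 20
    InputSources:
      - Target: /inputs/moby-dick.txt
        Source:
          Type: urlDownload
          Params:
            URL: https://example.com/moby-dick.txt   # placeholder URL
```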
Then run the job using:
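```bash
bacalhau job run word-count.yaml
```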
The job downloads a sample dataset and processes it locally
Bacalhau will display job progress until completion
You'll receive a Job ID once the job is submitted
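To track progress, describe the job:

```bash
bacalhau job describe <jobID>
```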
Replace <jobID> with the actual ID printed in step 2.
You can run bacalhau job logs <job-id> to get just the execution logs.
Download and view your job results:
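```bash
bacalhau job get <jobID>
```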
Note: You should see a word frequency analysis of the Moby Dick text file!
You've just:
Started a local Bacalhau network
Submitted a job using both imperative and declarative methods
Tracked job progress with detailed descriptions
Retrieved and viewed job results
Install Bacalhau using the one-liner below (Linux/macOS), or see the installation guide for Windows and Docker options.
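```bash
# commonly documented install script; see the installation guide if this changes
curl -sL https://get.bacalhau.org/install.sh | bash
```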
Bacalhau employs a distributed, node-based architecture that brings compute operations closer to data. Built around a single self-contained binary that serves multiple roles, Bacalhau makes it remarkably simple to deploy and scale a distributed compute network.
User Submits a Job: The user, through the Bacalhau CLI or API, sends a job definition to the Orchestrator. Jobs can be submitted in two ways:
Imperative: bacalhau docker run ... with command-line arguments
Declarative: bacalhau job run <job-spec> using a YAML specification file
Orchestrator Schedules Tasks: Based on resource availability, data location, and job requirements, the Orchestrator assigns tasks to Compute Nodes.
Compute Nodes Execute Tasks: Each Compute Node pulls the necessary image(s), mounts or fetches input data (local, S3, etc.), and runs the task in an isolated environment.
Results Publication: Once the task completes, outputs are published to configured storage. The Orchestrator updates the job's status accordingly.
All of these components run from the same Bacalhau binary, just in different modes, making deployment remarkably simple.
Core Role: Receives job submissions, maintains job state, and coordinates scheduling
NATS Server: Acts as a messaging infrastructure hub
Scalability: Support for multiple orchestrator nodes for high availability
Communication: Broadcasts scheduling decisions and listens for status updates
Primary Function: Execute containerized or WASM-based workloads
Resource Management: Advertise available CPU, memory, GPU, and storage capacity
Data Handling: Fetch or mount input data from various sources and publish results
Isolation: Run tasks in Docker containers or WASM environments
Bacalhau features a pluggable architecture with well-defined interfaces that enable extension without modifying core code:
Execution Engine Interface: Currently supports Docker and WebAssembly (WASM) workloads, with a clean API for adding new execution environments
Storage Provider Interface: Plug in various storage backends (S3, HTTP/HTTPS, local paths, IPFS) for both input and output handling
Publisher Interface: Easily add new ways to publish and share computation results
A key differentiator of Bacalhau is its data-centric approach:
Data Locality: The system intelligently schedules jobs on nodes with local access to data
Minimal Transfer: Moves computation to data rather than moving large datasets
Data Sovereignty: Process sensitive data within security boundaries without requiring it to leave premises
Cross-Organizational Computation: Enable collaborative analysis on protected datasets without exposing raw data
Bacalhau's architecture is designed to maintain operations even during network disruptions:
Event-Driven State: All system events are stored in local ledgers and shared during normal operation
Independent Operation: Nodes continue functioning during network outages
State Reconciliation: When network partitions heal, nodes exchange missed events
Local Decision Making: Orchestrators can make scheduling decisions with available information
Bacalhau's single-binary architecture supports flexible deployment configurations:
Single Node: Run orchestrator and compute services on one machine (ideal for development)
Regional Cluster: Distributed compute nodes within a single geographic region
Global Cluster: Compute network spanning multiple regions and data centers
Execution Environments: Tasks run in Docker containers or WASM environments with appropriate resource limits
Access Control: Each node requires valid credentials for accessing private data sources
Data Boundaries: Computation happens within defined security perimeters, protecting sensitive information
Metrics & Logging: Each node can expose metrics on resource usage and job performance
Event Tracking: Orchestrators record job lifecycle events for monitoring and auditing
This guide explains how to set up Bacalhau networks for various deployment scenarios, from development environments to production deployments.
Bacalhau's architecture consists of two primary node types:
Orchestrator nodes that schedule and manage jobs
Compute nodes that execute workloads
Compute nodes connect to orchestrators, but don't need to be reachable by orchestrators or other compute nodes, making deployment simpler.
Choose the setup option that best matches your needs:
| Option | Best For | Description |
| --- | --- | --- |
| Expanso Cloud | Production deployments | Fully managed orchestrator service |
| DevStack | Development & testing | Quick local setup with minimal configuration |
| Self-Hosted Network | Custom infrastructure requirements | Complete control over all components |

Expanso Cloud (Recommended for Production)
Expanso Cloud provides a managed orchestrator service, eliminating the need to set up and maintain your own orchestrator.
Setting Up with Expanso Cloud
Sign up for Expanso Cloud to receive:
An orchestrator endpoint
Authentication credentials
A configuration file
Start your compute node:
Submit jobs to the Expanso Cloud orchestrator:
This is the simplest way to run a Bacalhau network with minimal setup and maintenance.
If you need to host your own orchestrator, follow these steps for a custom deployment.
Setting Up an Orchestrator Node
On your designated orchestrator machine:
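```bash
bacalhau serve --orchestrator
```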
Take note of this machine's IP address or hostname - you'll need it to connect compute nodes.
Adding Compute Nodes
On each machine that will execute jobs:
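A hedged sketch; the config key and NATS port below are assumptions, so check the configuration reference for your version:

```bash
bacalhau serve --compute -c Compute.Orchestrators=nats://<orchestrator-ip>:4222
```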
Replace <orchestrator-ip> with the actual IP address or hostname of your orchestrator.
Verifying Your Cluster
Check that all nodes are connected:
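```bash
bacalhau node list
```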
You should see your orchestrator and all compute nodes listed.
Note: The setup described above creates an open network suitable for testing in trusted environments. For securing your network, refer to the Security Best Practices in the Reference section.
DevStack provides a pre-configured local environment perfect for development and testing.
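Start it with:

```bash
bacalhau devstack
```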
This pre-configures a transient orchestrator and compute nodes by default, giving you a complete environment for testing with minimal setup.
You can submit jobs to your DevStack just like any other Bacalhau network:
These methods provide additional ways to set up Bacalhau for specific use cases.
For the simplest local setup, you can run a single node that acts as both orchestrator and compute:
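```bash
bacalhau serve --orchestrator --compute
```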
This starts Bacalhau in "hybrid mode" where:
The orchestrator handles job scheduling
The compute service executes containers
Both components run in the same process
This option is useful for initial testing or for very small deployments.
Run Bacalhau in Docker for easier management:
The bacalhau:latest-dind image includes Docker-in-Docker capabilities required for compute nodes.
For a quick multi-node setup, Bacalhau provides Docker Compose examples that create a complete network suitable for testing:
Clone the Network Setups Repository
Clone the repository containing the network setups:
Navigate to a Specific Setup
Change directory to your desired setup under docker-compose:
Start the Network
Use Docker Compose to bring up the network:
These setups enable deployment and testing of Bacalhau across multiple nodes, including an orchestrator and persistent data storage.
Secure your network with our Security Guide
Learn how to submit jobs to your network
Explore common workflows for different use cases
This section explains how to install Bacalhau on your machine, verify it's working, and understand basic requirements. Bacalhau is distributed as a single self-contained binary that can function as a client, orchestrator node, and compute node—greatly simplifying deployment and management of your distributed compute network.
To install the CLI, choose your environment, and run the command(s) below.
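For Linux/macOS, the commonly documented install one-liner is:

```bash
curl -sL https://get.bacalhau.org/install.sh | bash
```

Then verify the installation:

```bash
bacalhau version
```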
This should print:
The local binary version
The version of the orchestrator this client is connected to, if any
The latest available version of bacalhau in case you are running an outdated version.
If you get command not found, verify your PATH includes the Bacalhau binary.
To upgrade Bacalhau to the latest version, run the installation script. If Bacalhau is already installed, this will update it to the most recent version available.
Docker:
Must be installed and running on any compute node to handle Docker-based jobs.
AWS Credentials (if you’re using S3):
For S3 inputs or outputs, the node needs valid AWS credentials (e.g., environment variables).
Running an Orchestrator & Compute:
bacalhau devstack: Perfect for local development or running tests.
Head over to Basic CLI Usage to learn how to submit, describe, and stop jobs.
Check Common Workflows for steps on mounting data (S3, local folders) and publishing outputs.
Explore References for advanced node management (Docker Compose, devstack, multi-node clusters).
This page explains how the Bacalhau CLI is structured and which global flags are most commonly used. Understanding these fundamentals will help you work efficiently with all Bacalhau commands.
The general structure and organization of Bacalhau commands
How global flags affect command behavior regardless of the specific command
How to customize output formats and control connection settings
How to specify configuration files and data directories
Where to find more detailed command references
Bacalhau commands follow a consistent pattern that makes them intuitive and predictable:
Bacalhau's CLI groups commands into logical categories:
agent: Client-side commands for checking health, version, and node information
job: Core job management (create, list, describe, stop, retrieve logs, etc.)
node: Cluster node management and inspection
config: Client configuration management
docker: Imperative command for running Docker-based jobs
These flags work with any command and provide consistent behavior across the CLI. They're especially useful for scripting and automation.
Examples:
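```bash
# choose machine-readable output
bacalhau job list --output json --pretty

# point the CLI at a remote orchestrator (hostname is a placeholder)
bacalhau job list --api-host orchestrator.example.com --api-port 1234
```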
For detailed information about any command's available flags:
This will show all available options, including both global flags and command-specific flags.
Tip: For full details on each command's available flags, see the CLI Reference or type bacalhau <command> --help.
Create a config file (or use the one provided).
See the network setup guide for how to run a local or hybrid node with bacalhau serve --orchestrator --compute.
| Flag | Description | Default |
| --- | --- | --- |
| --api-host string | Hostname for the Bacalhau API | localhost |
| --api-port int | Port for the Bacalhau API | 1234 |
| -c, --config string | Config file(s) or dot separated path(s) to config values | - |
| --data-dir string | The filesystem path where Bacalhau stores its data | ~/.bacalhau |
| --output format | Output format style (json, yaml, table) | - |
| --pretty | Format JSON or YAML output for readability | - |
Bacalhau is built around a few core ideas and terminologies. If you're new to Bacalhau, here's what you need to know:
Bacalhau coordinates computing workloads across a network of machines, intelligently matching jobs to resources.
Bacalhau acts as a dispatcher: You submit jobs (e.g., container workloads), and it finds the best node to run them based on available resources, data location, and constraints.
Instead of moving data to compute, Bacalhau moves compute to where data lives, reducing network overhead and improving efficiency.
Traditionally, big data solutions shuffle large datasets across networks to a central compute cluster.
Bacalhau inverts this approach: it places compute tasks where the data already resides—whether in local storage, an S3 bucket, or other storage providers—reducing unnecessary data movement.
Bacalhau organizes work in a hierarchy that enables efficient resource allocation and parallelization.
A Job defines the overall workflow (e.g., "run a Docker image with these arguments").
A job can be broken into multiple Executions that run in parallel across different compute nodes.
Bacalhau optimizes these executions based on data locality and available resources.
Bacalhau supports various execution patterns to accommodate different workload requirements:
Batch Jobs: One-time execution of a workload, typically for data processing tasks that run to completion.
Ops Jobs: Administrative or operational tasks, often for system maintenance or monitoring.
Daemon Jobs: Long-running background processes that perform ongoing work.
Service Jobs: Web services or APIs that need to remain available and respond to requests.
The Bacalhau network consists of specialized components, each with specific responsibilities:
Orchestrator Node: Receives job submissions, schedules executions, and monitors state. Started with bacalhau serve --orchestrator.
Compute Node: Executes workloads locally, typically requiring Docker or another runtime. Started with bacalhau serve --compute.
Hybrid Node: Serves both roles at once—often used for local dev or small setups. Started with bacalhau serve --orchestrator --compute.
Bacalhau runs your code through pluggable runtime environments:
Bacalhau supports multiple execution engines through its modular architecture:
Docker: For container-based workloads
WebAssembly (WASM): For lightweight, sandboxed execution
The framework is designed to accommodate additional engines as needed.
Bacalhau can access data from various sources through a clean, extensible interface:
Bacalhau can mount data from various sources through its flexible storage provider interface:
S3-compatible storage
HTTP/HTTPS URLs
Local filesystems
IPFS
And more via storage provider plugins
After execution, Bacalhau ensures your results are accessible where you need them:
After a job finishes, its results can be published to a specific backend—like local disk, S3 or IPFS—so they're easy to retrieve.
A reliable messaging system allows Bacalhau components to coordinate effectively:
Bacalhau uses NATS.io as its communication backbone:
Orchestrators act as NATS servers
Compute nodes connect as NATS clients
This provides reliable, scalable messaging between components
After submitting a job to Bacalhau, you'll typically need to inspect its execution logs to monitor progress and troubleshoot issues. This guide explains how to access and manage logs from your Bacalhau jobs.
How to view job execution logs
How to stream logs in real-time during job execution
How to filter logs for specific executions
Execution logs contain the standard output (stdout) and standard error (stderr) from your job, which are invaluable for monitoring and debugging.
To view the logs for a completed or running job:
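```bash
bacalhau job logs <jobID>
```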
This displays stdout/stderr from the container execution, showing you exactly what your job printed during its run.
For long-running jobs, you can stream logs as they're generated:
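```bash
# --follow streams new log entries as they arrive
bacalhau job logs <jobID> --follow
```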
This is similar to tail -f and will continuously show new log entries until you press Ctrl+C or the job completes.
If your job has multiple parallel executions, you can focus on a specific one:
You can find execution IDs by running bacalhau job describe <jobID>.
To view only the most recent log entries:
When submitting a new job, you can immediately follow the logs by adding the --follow flag to your job run command:
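```bash
# job.yaml is whatever spec you are submitting
bacalhau job run --follow job.yaml
```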
This is convenient as it combines job submission and log following into a single command, eliminating the need to run a separate job logs command.
For docker run commands, you can similarly use:
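```bash
bacalhau docker run --follow ubuntu:latest -- echo "hello"
```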
This guide introduces the basics of submitting jobs to Bacalhau. Whether you're running a quick task or setting up a more complex job, you'll learn the essential approaches.
How to run quick jobs with simple commands
How to create reusable job specifications
Basic job configuration options
The fastest way to run a job is using the bacalhau docker run command. This is perfect for simple tasks or when you're just getting started.
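```bash
bacalhau docker run ubuntu:latest -- echo "Hello from Bacalhau"
```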
By default, this runs a batch job (one-time execution). You can also run ops jobs using --target all.
--cpu 0.5: Request half a CPU core
--memory 512mb: Request 512MB of memory
--id-only: Show just the job ID (useful for scripts)
Pro Tip: Everything after the -- is executed inside the container.
For jobs you'll run multiple times or want to save, create a YAML specification file:
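A minimal sketch (field names follow the declarative job format and may vary slightly by version):

```yaml
Name: hello-bacalhau
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Parameters:
          - echo
          - Hello from a job spec
```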
Submit it with:
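```bash
# assuming you saved the spec above as job.yaml
bacalhau job run job.yaml
```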
This approach helps you:
Save job configurations for later use
Share job definitions with teammates
Make small changes without retyping everything
Bacalhau supports several job types:
batch: One-time execution (default for command line)
ops: Administrative tasks targeting specific nodes (use --target all to run on all nodes)
service: Long-running services that run on any N nodes
daemon: Background processes that run continuously on all nodes
Important: Service and daemon jobs can only be created using YAML specifications as they're designed for repeatable or updatable workloads.
Use the command line for:
Quick, one-time batch jobs
Simple ops jobs with --target all
Use YAML files when:
Running service or daemon jobs
Creating repeatable job configurations
Sharing job definitions with teammates
This guide shows you how to view and filter the jobs in your Bacalhau environment. Being able to list jobs is essential for monitoring your workloads and finding specific jobs to inspect further.
How to list all your jobs
How to filter jobs by various criteria
How to customize the output format
To see your recent jobs, use:
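```bash
bacalhau job list
```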
By default, this shows your 10 most recent jobs with basic information.
The output columns show:
CREATED: When the job was created (time)
ID: The job's unique identifier
JOB: The job engine type (usually docker)
TYPE: The job type (batch, service, etc.)
STATE: Current job state (Completed, Running, Pending, Failed, etc.)
You can refine your job list using various flags:
Labels help organize and categorize your jobs:
More complex label filtering:
Order by creation time or job ID:
Reverse the order (newest last):
When you have many jobs, the output will include a pagination token:
By default, results appear in a table format. You can choose other formats:
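```bash
bacalhau job list --output json
bacalhau job list --output yaml
```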
For more readable JSON:
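```bash
bacalhau job list --output json --pretty
```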
Useful for importing into spreadsheets:
Additional options for table output:
This page explains how to feed external data into Bacalhau jobs from various sources. Bacalhau's modular architecture enables flexible data mounting from multiple storage providers, with S3-compatible storage, local directories, IPFS, and HTTP/HTTPS URLs supported out of the box.
How to mount data from different sources to your Bacalhau jobs
The syntax and options for each data source type
Best practices for efficient data handling
Bacalhau jobs often need access to input data. The general syntax for mounting input data is:
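With the docker run --input flag, the pattern looks like this (exact flag syntax can vary by version):

```bash
bacalhau docker run --input <URI>://<SOURCE>:<TARGET> <image> -- <command>
```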
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
SOURCE specifies the path to the data
TARGET is the path where the data will be mounted in the container
This pattern is consistent across all input types, making it easy to understand and use regardless of the data source.
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
TARGET is the path where the data will be mounted in the container
PARAMS are key-value configuration options that depend on the input type
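For example, a hedged sketch of a local directory mount:

```bash
bacalhau docker run \
  --input file:///path/to/local/data:/data \
  ubuntu:latest -- ls /data
```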
This mounts the directory /path/to/local/data from the host machine to /data inside the container.
S3 integration connects to storage solutions compatible with the S3 API, such as AWS S3, Google Cloud Storage, and locally deployed MinIO
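A hedged example (bucket and key are placeholders):

```bash
bacalhau docker run \
  --input s3://my-bucket/datasets/data.csv:/inputs/data.csv \
  ubuntu:latest -- head -n 5 /inputs/data.csv
```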
This downloads and mounts the S3 object to the specified path in the container.
URL-based inputs provide access to web-hosted resources.
IPFS provides content-addressable, peer-to-peer storage for decentralized data sharing.
The IPFS CID (Content Identifier) points to the specific content you want to mount.
You can combine multiple inputs from different sources in a single job:
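A hedged example combining two placeholder sources:

```bash
bacalhau docker run \
  --input s3://my-bucket/reference:/inputs/reference \
  --input https://example.com/config.json:/inputs/config.json \
  ubuntu:latest -- ls -R /inputs
```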
For very large datasets, consider these optimization strategies:
Best practices:
Increase resource allocations as needed
Use data locality to minimize transfer costs
Process data in chunks when possible
Choose efficient data formats (Parquet, Arrow, etc.)
Credentials: Some mount sources (S3) require proper credentials or connectivity
Data Locality: Use Bacalhau label selectors to run jobs on nodes that have the data or are close to it
IPFS Network: Compute nodes must be connected to an IPFS daemon to support this storage type
Size Limits: Very large inputs may require increased disk allocations using --disk
Learn how to retrieve and publish outputs from jobs
See a complete example workflow that includes input data
Explore resource constraints for jobs with large data processing needs
Sometimes you need to terminate a running job before it completes naturally. Bacalhau provides a straightforward way to stop jobs in progress.
To stop a job that's currently running:
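```bash
bacalhau job stop <jobID>
```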
When you issue a stop command:
The Bacalhau orchestrator marks the job for termination
A signal is sent to all compute nodes running tasks for that job
The compute nodes terminate the running containers
Resources allocated to the job are released
The job's state is updated to Stopped
To confirm a job has been properly stopped:
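```bash
bacalhau job describe <jobID>
```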
Look for the State field, which should show Stopped once the termination is complete.
Common scenarios where stopping a job is necessary:
Stuck or Misconfigured Jobs: Jobs that are stuck in a loop, using incorrect data, or producing errors
Resource Optimization: When a job is too resource-intensive or taking too long
Prioritization Changes: When higher-priority work arrives and you need to free up resources
Service Jobs: For jobs designed to run continuously, the stop command is especially useful when the service is no longer needed
The Bacalhau agent is the process your client directly communicates with. By default, it runs on localhost:1234, but this can be changed using the --api-host and --api-port flags. For local testing or small clusters, you'll frequently need to check the agent's health and examine its configuration.
When troubleshooting connectivity or verifying your setup:
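```bash
bacalhau agent alive
```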
This returns a simple health check response, confirming your client can communicate with the agent.
To check which version you're running:
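```bash
bacalhau agent version
```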
This displays version information helpful when verifying installations, troubleshooting issues, or reporting bugs.
During setup or when diagnosing issues:
This returns the complete configuration in YAML format, showing network parameters, resource limits, and admission control settings. Use this when jobs aren't being accepted or resources aren't properly allocated.
To get detailed information about the agent's node:
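```bash
bacalhau agent node
```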
This shows information about node identity, available resources, and supported features. Use this when setting up a new node or troubleshooting job scheduling issues.
To connect to a remote agent:
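```bash
# any agent subcommand accepts these connection flags
bacalhau agent version --api-host <remote-host> --api-port 1234
```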
This pattern works with all agent commands and is useful for monitoring production clusters or diagnosing connectivity issues between network components.
While agent commands let you interact with your local Bacalhau process, node commands allow you to manage the broader network of compute resources. The orchestrator tracks these nodes, and this guide covers common operations you'll need for monitoring and managing your compute infrastructure.
To get a quick overview of all nodes in your network:
This displays a table of nodes with essential information about IDs, types, approval status, and connection state. Use this command for monitoring cluster health and identifying nodes that need attention.
When planning job deployments or troubleshooting resource constraints:
This enhanced view shows version information and supported execution engines. This helps you identify nodes with specific capabilities for your workloads.
Filter the list to show only nodes with specific characteristics:
This filtering capability helps you find nodes in specific regions or with specialized hardware.
When you need comprehensive information about a specific node:
This provides extensive details on the node's identity, resources, and capabilities. Use this when investigating specific issues or verifying a node's configuration.
Once you've submitted jobs to Bacalhau and identified them through job listing, you'll often need to dig deeper into specific jobs. This guide covers the commands for getting detailed information about your jobs.
How to view comprehensive details about a specific job
How to track a job's history and state changes
How to examine individual job executions
To see complete details about a specific job, use:
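```bash
bacalhau job describe <job-id>
```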
Replace <job-id> with your actual job ID. You can use the full ID or just the first few characters (if they uniquely identify the job).
The output shows you:
Basic job information (ID, name, type, state)
Summary of job completion status
Job history timeline
Execution details on which nodes ran the job
Execution history showing state changes
Standard output from the job execution
Change the output format for easier parsing or integration with other tools:
To see how a job's state has changed over time:
The history shows important events like state transitions and execution updates.
Filter history by event type:
Filter by a specific execution:
For jobs that run on multiple nodes or have multiple attempts, check the executions:
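```bash
# lists each execution (instance) of the job
bacalhau job executions <job-id>
```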
Each execution represents an instance of your job running on a specific node.
This guide explains how to configure output publishing and retrieve results from Bacalhau jobs across different storage systems. Proper output handling is essential for building effective data pipelines and workflows.
How Bacalhau's Publishers mechanism works
How to configure different output destination types
How to retrieve outputs from various storage systems
How to choose the right publisher for your use case
In Bacalhau, you need to configure two key components for handling outputs:
A Publisher defines where your job's output files are stored after execution
Result Paths specify which directories should be captured as job results.
Retrieving Local Outputs
After your job completes, retrieve outputs using the bacalhau job get command:
This will download all published outputs to your current directory.
Important Notes:
If you define a publisher without specifying result paths, only stdout and stderr will be uploaded to the chosen publisher
If you define result paths without a publisher, the job will fail
You can have multiple result paths, each capturing different directories
Bacalhau supports multiple publisher types to accommodate different needs and infrastructure requirements.
The S3 Publisher uploads outputs to an Amazon S3 bucket or any S3-compatible storage service, such as MinIO. The compute node must have permission to write to the bucket, and the orchestrator must have permission to provide pre-signed URLs to download the results.
The IPFS Publisher uploads outputs to the InterPlanetary File System. Both the client (downloading the result) and the compute node must be connected to an IPFS daemon.
Local Publisher
The Local Publisher saves outputs to the local filesystem of the compute node that ran your job. This is intended for local testing only, as it requires the client downloading the results to be on the same network as the compute node.
If you are using the local publisher, make sure the path is available to your job.
For example, in your config file for your node, you probably want to mount in the local file system:
If you don't see expected outputs:
Check that your job wrote to the directories specified in your ResultPaths
Verify the job completed successfully with bacalhau job describe <jobID>
Check for errors in the logs with bacalhau job logs <jobID>
For S3 publisher problems:
Ensure compute nodes have proper IAM roles or credentials to write to the bucket
Check that the orchestrator has permissions to generate pre-signed URLs
For IPFS publisher issues:
Ensure IPFS daemon is running on both compute node and client
Check for network connectivity between nodes
Verify you have enough disk space for pinning
After a Bacalhau job completes, you'll need to retrieve the output files generated by your job. This guide explains the basics of downloading job results.
How to specify output paths in your jobs
How to retrieve job results using the CLI
To download the results of a completed job:
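```bash
bacalhau job get <job-id>
```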
This command downloads all outputs from the job to your current directory.
You can specify where to save the downloaded results:
For larger downloads, you can adjust the timeout:
When submitting a job, you need to define which files or directories should be collected as outputs, and where those outputs should be published.
For Docker jobs, use the --output flag to define outputs and the --publisher flag to specify where to publish the results:
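A hedged sketch; the output name, bucket, and exact flag value formats are illustrative:

```bash
bacalhau docker run \
  --publisher s3://my-bucket/results \
  --output outputs:/outputs \
  ubuntu:latest -- sh -c 'echo "analysis complete" > /outputs/report.txt'
```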
This tells Bacalhau to:
Collect everything in the /outputs directory of the container
Publish it to the specified S3 bucket and path
Make it available for download with bacalhau job get
You can also define outputs in a job specification file:
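A hedged sketch of an equivalent spec (field names may vary slightly by version; the bucket and key are placeholders):

```yaml
Name: publish-results
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/sh
        Parameters:
          - -c
          - echo "analysis complete" > /outputs/report.txt
    Publisher:
      Type: s3
      Params:
        Bucket: my-bucket
        Key: results/
    ResultPaths:
      - Name: outputs
        Path: /outputs
```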
Submit this job using:
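```bash
# assuming the spec above is saved as publish-results.yaml
bacalhau job run publish-results.yaml
```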
You can specify multiple output paths in a single job:
After running bacalhau job get, the results will be organized in a directory structure like this:
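The layout looks roughly like this (the top-level folder name is based on the job ID):

```
job-<jobID>/
├── exitCode
├── outputs/
├── stderr
└── stdout
```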
The directory structure includes:
exitCode: Contains the exit code of the job
outputs: Contains all the files from the job's specified output directories
stderr: Captures any error output from the job
stdout: Captures the standard output from the job
This fetches the latest Bacalhau release and places it in /usr/local/bin or a similar path.
You may need sudo or root access to install the binary at the desired path.
Windows users can download the latest release tarball from GitHub and extract bacalhau.exe to any location available in the PATH environment variable.
| Image | Purpose |
| --- | --- |
| Base Image | Suitable for running orchestrators, clients, or compute nodes with no Docker support |
| Docker in Docker | Suitable for running compute nodes that can run Docker-based jobs; requires --privileged mode when running the container |
Configure CPU, memory, disk, and GPU requirements for your Bacalhau jobs to ensure efficient resource utilization.
Specify CPU cores and memory allocation for your jobs. Default values are CPU: 500m and Memory: 512MB.
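```bash
bacalhau docker run --cpu 2 --memory 4gb ubuntu:latest -- sh -c 'echo "crunching numbers"'
```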
Request GPU resources for machine learning, deep learning, and other GPU-accelerated tasks.
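```bash
# request a single GPU and print its status (image and command are illustrative)
bacalhau docker run --gpu 1 nvidia/cuda:12.2.0-base-ubuntu22.04 -- nvidia-smi
```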
Note: The compute node must have available GPUs with proper drivers, and your container image should include necessary GPU libraries (e.g., CUDA).
Control how much disk space your job can use:
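```bash
bacalhau docker run --disk 10gb ubuntu:latest -- sh -c 'df -h /'
```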
Job Stuck Pending: You may be requesting resources that aren't available. Check available resources with bacalhau node list or reduce requirements.
Out of Memory (OOM): Increase memory allocation or process data in smaller batches.
Disk Space Issues: Increase disk allocation or clean up temporary files during processing.
Start with conservative resource requests and scale up as needed
For memory-intensive tasks, add a 20-30% buffer to your estimated peak usage
Check if your framework can effectively use multiple GPUs before requesting them
To select specific GPU types, use label selectors:
This page provides a quick reference for common issues encountered by Bacalhau users and their solutions. Identifying and avoiding these pitfalls will help you create more reliable jobs and workflows.
How to diagnose and resolve common Bacalhau job issues
Strategies for debugging stuck, failed, or misbehaving jobs
Best practices to prevent common problems
One of the most common issues is jobs remaining in the "Pending" state and never executing.
No available nodes: No compute nodes are connected to the orchestrator
Resource constraints too high: Requesting more CPU, memory, or GPU than any available node can provide
Mismatched node selector: Job requirements don't match available node capabilities
Network partitioning: Orchestrator can't communicate with compute nodes
Check the job status and specifications for clues:
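```bash
bacalhau job describe <job-id>
```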
Look for status messages that might indicate scheduling issues.
Check available compute nodes:
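```bash
bacalhau node list
```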
Ensure there are active compute nodes with sufficient resources.
Reduce resource requests: Lower CPU, memory, or GPU requirements
Add more compute nodes: Add capacity to your cluster
Check network connectivity: Ensure nodes can communicate with each other
Modify job requirements: Adjust constraints to match available resources
Problems accessing or mounting input data are another common source of failures.
Wrong path or URL: Incorrect or inaccessible source location
Missing credentials: No or invalid authentication for S3 or private URLs
Network limitations: Compute node can't reach data source
Path mapping errors: Incorrect source-to-destination mapping
Check job specs and status:
If the job started but failed during execution, check logs:
Look for messages like "file not found" or "access denied".
Validate paths: Double-check that source paths, URLs, or S3 buckets exist and are accessible
Check credentials: Ensure proper environment variables or configuration for authenticated sources
Test connectivity: Verify the compute node can reach the data source
Local testing: Test data access locally before running on Bacalhau
Example of corrected input mounting:
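```bash
# verify the bucket/key exists and mount it to an absolute path inside the container
bacalhau docker run \
  --input s3://my-bucket/datasets/input.csv:/inputs/input.csv \
  ubuntu:latest -- head -n 5 /inputs/input.csv
```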
Jobs complete successfully, but expected output files are missing.
Wrong output path: Not writing to the /outputs directory
Command errors: The job ran but the command failed to produce output
Permission issues: Container user can't write to output location
Publisher configuration: Publisher not configured correctly
Check job specification and execution details:
If the job executed, check logs for clues about what the job did:
Verify your job actually wrote to the /outputs directory.
Use absolute paths: Always use absolute paths in your commands
Write to /outputs: Ensure your job writes to the /outputs directory specifically
Add debugging: Add commands to list directories and print current working directory
Check permissions: Ensure your process has permission to write to the output location
Example of corrected output writing:
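```bash
# write results to the /outputs directory using an absolute path
bacalhau docker run ubuntu:latest -- sh -c 'mkdir -p /outputs && echo "done" > /outputs/result.txt'
```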
Issues with container execution or container image availability.
Image not found: The specified container image doesn't exist or is inaccessible
Command errors: The command specified doesn't exist in the container
Resource limitations: The container runs out of resources during execution
Exit codes: The container process exits with a non-zero code
Check job specification for container configuration:
If the container started, check logs for execution errors:
Look for messages about image pulling or command execution.
Verify image exists: Check that the image name is correct and accessible
Test locally: Try running the container locally with Docker first
Check command: Ensure the command exists in the container and has correct syntax
Adjust resources: Provide sufficient CPU, memory, and disk for your workload
Example of corrected container image:
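```bash
# use a fully qualified, existing image and tag
bacalhau docker run docker.io/library/python:3.11-slim -- python --version
```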
Jobs fail because they run out of resources during execution.
Out of memory (OOM): Job exceeds allocated memory
Disk space exhaustion: Job writes more data than allocated disk space
CPU thrashing: Insufficient CPU allocation causes extreme slowdown
GPU memory errors: CUDA out of memory errors for GPU jobs
Check job specification and status:
If the job executed, check logs for error messages:
Look for error messages about memory, disk space, or resource limits.
Increase resources: Allocate more memory, CPU, or disk space
Optimize code: Reduce resource usage in your application
Process in batches: Break large workloads into smaller chunks
Clean up temporary files: Remove unneeded files during processing
Example of increased resource allocation:
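```bash
bacalhau docker run --cpu 4 --memory 8gb --disk 20gb ubuntu:latest -- sh -c 'echo "heavy workload here"'
```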
Problems related to how commands and arguments are passed to containers.
Missing separator: No -- between Bacalhau flags and container command
Quote handling: Issues with shell quotes and argument passing
Special characters: Problems with special characters in commands
Check the exact command being executed:
Look at the command fields to see what was actually executed.
Use the separator: Always use -- between Bacalhau flags and the container command
Quote properly: Be careful with nested quotes in shell commands
Use bash -c: For complex commands, wrap them in bash -c '...'
Use YAML specs: For very complex commands, use declarative YAML specifications
Example of corrected command syntax:
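```bash
# note the -- separator and the single-quoted shell command
bacalhau docker run ubuntu:latest -- bash -c 'ls /inputs && echo "processing complete"'
```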