Bacalhau is an open-source distributed compute orchestration framework designed to bring compute to the data. Instead of moving large datasets around networks, Bacalhau makes it easy to execute jobs close to the data's location, drastically reducing latency and resource overhead.
Highly Distributed Architecture: Deploy compute networks that span regions, cloud providers, and on-premises datacenters—all working together as a unified system.
Resilient Operation: Compute nodes operate effectively even with intermittent connectivity to orchestrators, maintaining service availability during network partitioning or isolation.
Data Sovereignty & Security: Process sensitive data within security boundaries without requiring it to leave your premises, enabling computation while preserving data control.
Cross-Organizational Computation: Allow specific vetted computations on protected datasets without exposing raw data, breaking data silos between organizations.
Resource Efficiency: By minimizing data transfers, Bacalhau saves bandwidth costs and ensures jobs run faster.
High Scalability: As your data and processing needs grow, simply add more compute nodes on demand—whether on-premises or in the cloud.
Ease of Integration: Bacalhau works with existing container images (Docker, etc.), meaning you can leverage your current workflows without major rewrites.
Single Binary Simplicity: Bacalhau is a single self-contained binary that functions as a client, orchestrator, and compute node—making it incredibly easy to set up and scale your distributed compute network.
Modular Architecture: Bacalhau's design supports multiple execution engines (Docker, WebAssembly) and storage providers through clean interfaces, allowing for easy extension.
Orchestrator-Compute Model: A dedicated orchestrator coordinates job scheduling, while compute nodes run tasks—all from the same binary with different runtime modes.
Flexible Storage Integrations: Bacalhau integrates with S3, HTTP/HTTPS, and other storage systems, letting you pull data from various sources.
Multiple Job Types: Support for batch, ops, daemon, and service job types to accommodate different workflow requirements.
Declarative & Imperative Submissions: Define jobs in a YAML spec (declarative) or pass all arguments via CLI (imperative).
Publisher Support: Output results to local volumes, S3, or other storage backends—so your artifacts are readily accessible.
Bacalhau's distributed compute framework enables a wide range of applications across different industries.
Bacalhau's architecture enables you to create compute networks that bridge traditional infrastructure boundaries. When you submit a job, Bacalhau intelligently determines which compute nodes are best positioned to process the data based on locality, availability, and your defined constraints—without requiring data movement or constant connectivity.
This approach is particularly valuable for:
Organizations with data that cannot leave certain security boundaries
Multi-region operations where data transfer is expensive or impractical
Scenarios where multiple parties need to collaborate on analysis without sharing raw data
Edge computing environments with intermittent connectivity
Bacalhau has a very friendly community and we are always happy to help you get started:
GitHub Discussions – ask anything about the project, give feedback, or answer questions that will help other users.
Join the Slack Community and head to the #bacalhau channel – it is the easiest way to engage with other members of the community and get help.
Contributing – learn how to contribute to the Bacalhau project.
This Quick Start guide shows how to run your first Bacalhau job with minimal setup. Bacalhau's design as a single self-contained binary makes it incredibly easy to set up your own distributed compute network in minutes.
Docker installed on any machine that runs a compute node
Bacalhau CLI installed (see below)
Once installed, verify with:
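```bash
# prints the local binary version and, if connected, the orchestrator version
bacalhau version
```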
Open a terminal and run:
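```bash
bacalhau serve --orchestrator --compute
```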
This command launches both an orchestrator and a compute node in one process
Keep it running; you'll see logs indicating it's ready
Bacalhau supports two primary methods of job submission: Imperative (CLI) and Declarative (YAML). We'll demonstrate a word count job on the classic novel Moby Dick.
Create a word-count.yaml file:
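A minimal sketch of what this spec can look like (field names follow the declarative job format and may vary slightly by version; the dataset URL is a placeholder for the text you want to analyze):

```yaml
Name: word-count
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - tr -s '[:space:]' '\n' < /inputs/moby-dick.txt | sort | uniq -c | sort -rn | head -n 20
    InputSources:
      - Target: /inputs/moby-dick.txt
        Source:
          Type: urlDownload
          Params:
            URL: https://example.com/moby-dick.txt   # placeholder URL
```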
Then run the job using:
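```bash
bacalhau job run word-count.yaml
```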
The job downloads a sample dataset and processes it locally
Bacalhau will display job progress until completion
You'll receive a Job ID once the job is submitted
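To track progress, describe the job:

```bash
bacalhau job describe <jobID>
```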
Replace <jobID> with the actual ID printed in step 2.
You can run bacalhau job logs <job-id> to get just the execution logs.
Download and view your job results:
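```bash
bacalhau job get <jobID>
```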
Note: You should see a word frequency analysis of the Moby Dick text file!
You've just:
Started a local Bacalhau network
Submitted a job using both imperative and declarative methods
Tracked job progress with detailed descriptions
Retrieved and viewed job results
Install Bacalhau using the one-liner below (Linux/macOS), or see the installation guide for Windows and Docker options.
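```bash
# commonly documented install script; see the installation guide if this changes
curl -sL https://get.bacalhau.org/install.sh | bash
```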
Bacalhau employs a distributed, node-based architecture that brings compute operations closer to data. Built around a single self-contained binary that serves multiple roles, Bacalhau makes it remarkably simple to deploy and scale a distributed compute network.
User Submits a Job: The user, through the Bacalhau CLI or API, sends a job definition to the Orchestrator. Jobs can be submitted in two ways:
Imperative: bacalhau docker run ... with command-line arguments
Declarative: bacalhau job run <job-spec> using a YAML specification file
Orchestrator Schedules Tasks: Based on resource availability, data location, and job requirements, the Orchestrator assigns tasks to Compute Nodes.
Compute Nodes Execute Tasks: Each Compute Node pulls the necessary image(s), mounts or fetches input data (local, S3, etc.), and runs the task in an isolated environment.
Results Publication: Once the task completes, outputs are published to configured storage. The Orchestrator updates the job's status accordingly.
All of these components run from the same Bacalhau binary, just in different modes, making deployment remarkably simple.
Core Role: Receives job submissions, maintains job state, and coordinates scheduling
NATS Server: Acts as a messaging infrastructure hub
Scalability: Support for multiple orchestrator nodes for high availability
Communication: Broadcasts scheduling decisions and listens for status updates
Primary Function: Execute containerized or WASM-based workloads
Resource Management: Advertise available CPU, memory, GPU, and storage capacity
Data Handling: Fetch or mount input data from various sources and publish results
Isolation: Run tasks in Docker containers or WASM environments
Bacalhau features a pluggable architecture with well-defined interfaces that enable extension without modifying core code:
Execution Engine Interface: Currently supports Docker and WebAssembly (WASM) workloads, with a clean API for adding new execution environments
Storage Provider Interface: Plug in various storage backends (S3, HTTP/HTTPS, local paths, IPFS) for both input and output handling
Publisher Interface: Easily add new ways to publish and share computation results
A key differentiator of Bacalhau is its data-centric approach:
Data Locality: The system intelligently schedules jobs on nodes with local access to data
Minimal Transfer: Moves computation to data rather than moving large datasets
Data Sovereignty: Process sensitive data within security boundaries without requiring it to leave premises
Cross-Organizational Computation: Enable collaborative analysis on protected datasets without exposing raw data
Bacalhau's architecture is designed to maintain operations even during network disruptions:
Event-Driven State: All system events are stored in local ledgers and shared during normal operation
Independent Operation: Nodes continue functioning during network outages
State Reconciliation: When network partitions heal, nodes exchange missed events
Local Decision Making: Orchestrators can make scheduling decisions with available information
Bacalhau's single-binary architecture supports flexible deployment configurations:
Single Node: Run orchestrator and compute services on one machine (ideal for development)
Regional Cluster: Distributed compute nodes within a single geographic region
Global Cluster: Compute network spanning multiple regions and data centers
Execution Environments: Tasks run in Docker containers or WASM environments with appropriate resource limits
Access Control: Each node requires valid credentials for accessing private data sources
Data Boundaries: Computation happens within defined security perimeters, protecting sensitive information
Metrics & Logging: Each node can expose metrics on resource usage and job performance
Event Tracking: Orchestrators record job lifecycle events for monitoring and auditing
This guide explains how to set up Bacalhau networks for various deployment scenarios, from development environments to production deployments.
Bacalhau's architecture consists of two primary node types:
Orchestrator nodes that schedule and manage jobs
Compute nodes that execute workloads
Compute nodes connect to orchestrators, but don't need to be reachable by orchestrators or other compute nodes, making deployment simpler.
Choose the setup option that best matches your needs:
| Option | Best For | Description |
| --- | --- | --- |
| Expanso Cloud | Production deployments | Fully managed orchestrator service |
| DevStack | Development & testing | Quick local setup with minimal configuration |
| Self-Hosted Network | Custom infrastructure requirements | Complete control over all components |

Expanso Cloud (Recommended for Production)
Expanso Cloud provides a managed orchestrator service, eliminating the need to set up and maintain your own orchestrator.
Setting Up with Expanso Cloud
Sign up for Expanso Cloud to receive:
An orchestrator endpoint
Authentication credentials
A configuration file
Start your compute node:
Submit jobs to the Expanso Cloud orchestrator:
This is the simplest way to run a Bacalhau network with minimal setup and maintenance.
If you need to host your own orchestrator, follow these steps for a custom deployment.
Setting Up an Orchestrator Node
On your designated orchestrator machine:
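```bash
bacalhau serve --orchestrator
```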
Take note of this machine's IP address or hostname - you'll need it to connect compute nodes.
Adding Compute Nodes
On each machine that will execute jobs:
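A hedged sketch; the config key and NATS port below are assumptions, so check the configuration reference for your version:

```bash
bacalhau serve --compute -c Compute.Orchestrators=nats://<orchestrator-ip>:4222
```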
Replace <orchestrator-ip> with the actual IP address or hostname of your orchestrator.
Verifying Your Cluster
Check that all nodes are connected:
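```bash
bacalhau node list
```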
You should see your orchestrator and all compute nodes listed.
Note: The setup described above creates an open network suitable for testing in trusted environments. For securing your network, refer to the Security Best Practices in the Reference section.
DevStack provides a pre-configured local environment perfect for development and testing.
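Start it with:

```bash
bacalhau devstack
```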
This pre-configures a transient orchestrator and compute nodes by default, giving you a complete environment for testing with minimal setup.
You can submit jobs to your DevStack just like any other Bacalhau network:
These methods provide additional ways to set up Bacalhau for specific use cases.
For the simplest local setup, you can run a single node that acts as both orchestrator and compute:
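```bash
bacalhau serve --orchestrator --compute
```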
This starts Bacalhau in "hybrid mode" where:
The orchestrator handles job scheduling
The compute service executes containers
Both components run in the same process
This option is useful for initial testing or for very small deployments.
Run Bacalhau in Docker for easier management:
The bacalhau:latest-dind image includes Docker-in-Docker capabilities required for compute nodes.
For a quick multi-node setup, Bacalhau provides Docker Compose examples that create a complete network suitable for testing:
Clone the Network Setups Repository
Clone the repository containing the network setups:
Navigate to a Specific Setup
Change directory to your desired setup under docker-compose:
Start the Network
Use Docker Compose to bring up the network:
These setups enable deployment and testing of Bacalhau across multiple nodes, including an orchestrator and persistent data storage.
Secure your network with our Security Guide
Learn how to submit jobs to your network
Explore common workflows for different use cases
This section explains how to install Bacalhau on your machine, verify it's working, and understand basic requirements. Bacalhau is distributed as a single self-contained binary that can function as a client, orchestrator node, and compute node—greatly simplifying deployment and management of your distributed compute network.
To install the CLI, choose your environment, and run the command(s) below.
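For Linux/macOS, the commonly documented install one-liner is:

```bash
curl -sL https://get.bacalhau.org/install.sh | bash
```

Then verify the installation:

```bash
bacalhau version
```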
This should print:
The local binary version
The version of the orchestrator this client is connected to, if any
The latest available version of bacalhau in case you are running an outdated version.
If you get command not found, verify your PATH includes the Bacalhau binary.
To upgrade Bacalhau to the latest version, run the installation script. If Bacalhau is already installed, this will update it to the most recent version available.
Docker:
Must be installed and running on any compute node to handle Docker-based jobs.
AWS Credentials (if you’re using S3):
For S3 inputs or outputs, the node needs valid AWS credentials (e.g., environment variables).
Running an Orchestrator & Compute:
bacalhau devstack: Perfect for local development or running tests.
Head over to Basic CLI Usage to learn how to submit, describe, and stop jobs.
Check Common Workflows for steps on mounting data (S3, local folders) and publishing outputs.
Explore References for advanced node management (Docker Compose, devstack, multi-node clusters).
This page explains how the Bacalhau CLI is structured and which global flags are most commonly used. Understanding these fundamentals will help you work efficiently with all Bacalhau commands.
The general structure and organization of Bacalhau commands
How global flags affect command behavior regardless of the specific command
How to customize output formats and control connection settings
How to specify configuration files and data directories
Where to find more detailed command references
Bacalhau commands follow a consistent pattern that makes them intuitive and predictable:
Bacalhau's CLI groups commands into logical categories:
agent: Client-side commands for checking health, version, and node information
job: Core job management (create, list, describe, stop, retrieve logs, etc.)
node: Cluster node management and inspection
config: Client configuration management
docker: Imperative command for running Docker-based jobs
These flags work with any command and provide consistent behavior across the CLI. They're especially useful for scripting and automation.
Examples:
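```bash
# choose machine-readable output
bacalhau job list --output json --pretty

# point the CLI at a remote orchestrator (hostname is a placeholder)
bacalhau job list --api-host orchestrator.example.com --api-port 1234
```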
For detailed information about any command's available flags:
This will show all available options, including both global flags and command-specific flags.
Tip: For full details on each command's available flags, see the CLI Reference or type bacalhau <command> --help.
Create a config file (or use the one provided).
See the network setup guide for how to run a local or hybrid node with bacalhau serve --orchestrator --compute.
| Flag | Description | Default |
| --- | --- | --- |
| --api-host string | Hostname for the Bacalhau API | localhost |
| --api-port int | Port for the Bacalhau API | 1234 |
| -c, --config string | Config file(s) or dot separated path(s) to config values | - |
| --data-dir string | The filesystem path where Bacalhau stores its data | ~/.bacalhau |
| --output format | Output format style (json, yaml, table) | - |
| --pretty | Format JSON or YAML output for readability | - |
Bacalhau is built around a few core ideas and terminologies. If you're new to Bacalhau, here's what you need to know:
Bacalhau coordinates computing workloads across a network of machines, intelligently matching jobs to resources.
Bacalhau acts as a dispatcher: You submit jobs (e.g., container workloads), and it finds the best node to run them based on available resources, data location, and constraints.
Instead of moving data to compute, Bacalhau moves compute to where data lives, reducing network overhead and improving efficiency.
Traditionally, big data solutions shuffle large datasets across networks to a central compute cluster.
Bacalhau inverts this approach: it places compute tasks where the data already resides—whether in local storage, an S3 bucket, or other storage providers—reducing unnecessary data movement.
Bacalhau organizes work in a hierarchy that enables efficient resource allocation and parallelization.
A Job defines the overall workflow (e.g., "run a Docker image with these arguments").
A job can be broken into multiple Executions that run in parallel across different compute nodes.
Bacalhau optimizes these executions based on data locality and available resources.
Bacalhau supports various execution patterns to accommodate different workload requirements:
Batch Jobs: One-time execution of a workload, typically for data processing tasks that run to completion.
Ops Jobs: Administrative or operational tasks, often for system maintenance or monitoring.
Daemon Jobs: Long-running background processes that perform ongoing work.
Service Jobs: Web services or APIs that need to remain available and respond to requests.
The Bacalhau network consists of specialized components, each with specific responsibilities:
Orchestrator Node: Receives job submissions, schedules executions, and monitors state. Started with bacalhau serve --orchestrator.
Compute Node: Executes workloads locally, typically requiring Docker or another runtime. Started with bacalhau serve --compute.
Hybrid Node: Serves both roles at once—often used for local dev or small setups. Started with bacalhau serve --orchestrator --compute.
Bacalhau runs your code through pluggable runtime environments:
Bacalhau supports multiple execution engines through its modular architecture:
Docker: For container-based workloads
WebAssembly (WASM): For lightweight, sandboxed execution
The framework is designed to accommodate additional engines as needed.
Bacalhau can access data from various sources through a clean, extensible interface:
Bacalhau can mount data from various sources through its flexible storage provider interface:
S3-compatible storage
HTTP/HTTPS URLs
Local filesystems
IPFS
And more via storage provider plugins
After execution, Bacalhau ensures your results are accessible where you need them:
After a job finishes, its results can be published to a specific backend—like local disk, S3 or IPFS—so they're easy to retrieve.
A reliable messaging system allows Bacalhau components to coordinate effectively:
Bacalhau uses NATS.io as its communication backbone:
Orchestrators act as NATS servers
Compute nodes connect as NATS clients
This provides reliable, scalable messaging between components
After submitting a job to Bacalhau, you'll typically need to inspect its execution logs to monitor progress and troubleshoot issues. This guide explains how to access and manage logs from your Bacalhau jobs.
How to view job execution logs
How to stream logs in real-time during job execution
How to filter logs for specific executions
Execution logs contain the standard output (stdout) and standard error (stderr) from your job, which are invaluable for monitoring and debugging.
To view the logs for a completed or running job:
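```bash
bacalhau job logs <jobID>
```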
This displays stdout/stderr from the container execution, showing you exactly what your job printed during its run.
For long-running jobs, you can stream logs as they're generated:
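```bash
# --follow streams new log entries as they arrive
bacalhau job logs <jobID> --follow
```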
This is similar to tail -f and will continuously show new log entries until you press Ctrl+C or the job completes.
If your job has multiple parallel executions, you can focus on a specific one:
You can find execution IDs by running bacalhau job describe <jobID>.
To view only the most recent log entries:
When submitting a new job, you can immediately follow the logs by adding the --follow flag to your job run command:
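```bash
# job.yaml is whatever spec you are submitting
bacalhau job run --follow job.yaml
```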
This is convenient as it combines job submission and log following into a single command, eliminating the need to run a separate job logs command.
For docker run commands, you can similarly use:
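```bash
bacalhau docker run --follow ubuntu:latest -- echo "hello"
```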
This guide introduces the basics of submitting jobs to Bacalhau. Whether you're running a quick task or setting up a more complex job, you'll learn the essential approaches.
How to run quick jobs with simple commands
How to create reusable job specifications
Basic job configuration options
The fastest way to run a job is using the bacalhau docker run command. This is perfect for simple tasks or when you're just getting started.
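```bash
bacalhau docker run ubuntu:latest -- echo "Hello from Bacalhau"
```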
By default, this runs a batch job (one-time execution). You can also run ops jobs using --target all.
--cpu 0.5: Request half a CPU core
--memory 512mb: Request 512MB of memory
--id-only: Show just the job ID (useful for scripts)
Pro Tip: Everything after the -- is executed inside the container.
For jobs you'll run multiple times or want to save, create a YAML specification file:
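A minimal sketch (field names follow the declarative job format and may vary slightly by version):

```yaml
Name: hello-bacalhau
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Parameters:
          - echo
          - Hello from a job spec
```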
Submit it with:
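```bash
# assuming you saved the spec above as job.yaml
bacalhau job run job.yaml
```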
This approach helps you:
Save job configurations for later use
Share job definitions with teammates
Make small changes without retyping everything
Bacalhau supports several job types:
batch: One-time execution (default for command line)
ops: Administrative tasks targeting specific nodes (use --target all to run on all nodes)
service: Long-running services that run on any N nodes
daemon: Background processes that run continuously on all nodes
Important: Service and daemon jobs can only be created using YAML specifications as they're designed for repeatable or updatable workloads.
Use the command line for:
Quick, one-time batch jobs
Simple ops jobs with --target all
Use YAML files when:
Running service or daemon jobs
Creating repeatable job configurations
Sharing job definitions with teammates
This guide shows you how to view and filter the jobs in your Bacalhau environment. Being able to list jobs is essential for monitoring your workloads and finding specific jobs to inspect further.
How to list all your jobs
How to filter jobs by various criteria
How to customize the output format
To see your recent jobs, use:
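```bash
bacalhau job list
```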
By default, this shows your 10 most recent jobs with basic information.
The output columns show:
CREATED: When the job was created (time)
ID: The job's unique identifier
JOB: The job engine type (usually docker)
TYPE: The job type (batch, service, etc.)
STATE: Current job state (Completed, Running, Pending, Failed, etc.)
You can refine your job list using various flags:
Labels help organize and categorize your jobs:
More complex label filtering:
Order by creation time or job ID:
Reverse the order (newest last):
When you have many jobs, the output will include a pagination token:
By default, results appear in a table format. You can choose other formats:
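```bash
bacalhau job list --output json
bacalhau job list --output yaml
```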
For more readable JSON:
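```bash
bacalhau job list --output json --pretty
```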
Useful for importing into spreadsheets:
Additional options for table output:
This page explains how to feed external data into Bacalhau jobs from various sources. Bacalhau's modular architecture enables flexible data mounting from multiple storage providers, with S3-compatible storage, local directories, IPFS, and HTTP/HTTPS URLs supported out of the box.
How to mount data from different sources to your Bacalhau jobs
The syntax and options for each data source type
Best practices for efficient data handling
Bacalhau jobs often need access to input data. The general syntax for mounting input data is:
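With the docker run --input flag, the pattern looks like this (exact flag syntax can vary by version):

```bash
bacalhau docker run --input <URI>://<SOURCE>:<TARGET> <image> -- <command>
```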
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
SOURCE specifies the path to the data
TARGET is the path where the data will be mounted in the container
This pattern is consistent across all input types, making it easy to understand and use regardless of the data source.
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
TARGET is the path where the data will be mounted in the container
PARAMS are key-value configuration options that depend on the input type
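For example, a hedged sketch of a local directory mount:

```bash
bacalhau docker run \
  --input file:///path/to/local/data:/data \
  ubuntu:latest -- ls /data
```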
This mounts the directory /path/to/local/data from the host machine to /data inside the container.
S3 integration connects to storage solutions compatible with the S3 API, such as AWS S3, Google Cloud Storage, and locally deployed MinIO
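A hedged example (bucket and key are placeholders):

```bash
bacalhau docker run \
  --input s3://my-bucket/datasets/data.csv:/inputs/data.csv \
  ubuntu:latest -- head -n 5 /inputs/data.csv
```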
This downloads and mounts the S3 object to the specified path in the container.
URL-based inputs provide access to web-hosted resources.
IPFS provides content-addressable, peer-to-peer storage for decentralized data sharing.
The IPFS CID (Content Identifier) points to the specific content you want to mount.
You can combine multiple inputs from different sources in a single job:
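A hedged example combining two placeholder sources:

```bash
bacalhau docker run \
  --input s3://my-bucket/reference:/inputs/reference \
  --input https://example.com/config.json:/inputs/config.json \
  ubuntu:latest -- ls -R /inputs
```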
For very large datasets, consider these optimization strategies:
Best practices:
Increase resource allocations as needed
Use data locality to minimize transfer costs
Process data in chunks when possible
Choose efficient data formats (Parquet, Arrow, etc.)
Credentials: Some mount sources (S3) require proper credentials or connectivity
Data Locality: Use Bacalhau label selectors to run jobs on nodes that have the data or are close to it
IPFS Network: Compute nodes must be connected to an IPFS daemon to support this storage type
Size Limits: Very large inputs may require increased disk allocations using --disk
Learn how to retrieve and publish outputs from jobs
See a complete example workflow that includes input data
Explore resource constraints for jobs with large data processing needs
Sometimes you need to terminate a running job before it completes naturally. Bacalhau provides a straightforward way to stop jobs in progress.
To stop a job that's currently running:
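```bash
bacalhau job stop <jobID>
```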
When you issue a stop command:
The Bacalhau orchestrator marks the job for termination
A signal is sent to all compute nodes running tasks for that job
The compute nodes terminate the running containers
Resources allocated to the job are released
The job's state is updated to Stopped
To confirm a job has been properly stopped:
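```bash
bacalhau job describe <jobID>
```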
Look for the State field, which should show Stopped once the termination is complete.
Common scenarios where stopping a job is necessary:
Stuck or Misconfigured Jobs: Jobs that are stuck in a loop, using incorrect data, or producing errors
Resource Optimization: When a job is too resource-intensive or taking too long
Prioritization Changes: When higher-priority work arrives and you need to free up resources
Service Jobs: For jobs designed to run continuously, the stop command is especially useful when the service is no longer needed
The Bacalhau agent is the process your client directly communicates with. By default, it runs on localhost:1234, but this can be changed using the --api-host and --api-port flags. For local testing or small clusters, you'll frequently need to check the agent's health and examine its configuration.
When troubleshooting connectivity or verifying your setup:
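```bash
bacalhau agent alive
```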
This returns a simple health check response, confirming your client can communicate with the agent.
To check which version you're running:
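```bash
bacalhau agent version
```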
This displays version information helpful when verifying installations, troubleshooting issues, or reporting bugs.
During setup or when diagnosing issues:
This returns the complete configuration in YAML format, showing network parameters, resource limits, and admission control settings. Use this when jobs aren't being accepted or resources aren't properly allocated.
To get detailed information about the agent's node:
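```bash
bacalhau agent node
```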
This shows information about node identity, available resources, and supported features. Use this when setting up a new node or troubleshooting job scheduling issues.
To connect to a remote agent:
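```bash
# any agent subcommand accepts these connection flags
bacalhau agent version --api-host <remote-host> --api-port 1234
```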
This pattern works with all agent commands and is useful for monitoring production clusters or diagnosing connectivity issues between network components.
While agent commands let you interact with your local Bacalhau process, node commands allow you to manage the broader network of compute resources. The orchestrator tracks these nodes, and this guide covers common operations you'll need for monitoring and managing your compute infrastructure.
To get a quick overview of all nodes in your network:
This displays a table of nodes with essential information about IDs, types, approval status, and connection state. Use this command for monitoring cluster health and identifying nodes that need attention.
When planning job deployments or troubleshooting resource constraints:
This enhanced view shows version information and supported execution engines. This helps you identify nodes with specific capabilities for your workloads.
Filter the list to show only nodes with specific characteristics:
This filtering capability helps you find nodes in specific regions or with specialized hardware.
When you need comprehensive information about a specific node:
This provides extensive details on the node's identity, resources, and capabilities. Use this when investigating specific issues or verifying a node's configuration.
Once you've submitted jobs to Bacalhau and identified them through job listing, you'll often need to dig deeper into specific jobs. This guide covers the commands for getting detailed information about your jobs.
How to view comprehensive details about a specific job
How to track a job's history and state changes
How to examine individual job executions
To see complete details about a specific job, use:
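```bash
bacalhau job describe <job-id>
```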
Replace <job-id> with your actual job ID. You can use the full ID or just the first few characters (if they uniquely identify the job).
The output shows you:
Basic job information (ID, name, type, state)
Summary of job completion status
Job history timeline
Execution details on which nodes ran the job
Execution history showing state changes
Standard output from the job execution
Change the output format for easier parsing or integration with other tools:
To see how a job's state has changed over time:
The history shows important events like state transitions and execution updates.
Filter history by event type:
Filter by a specific execution:
For jobs that run on multiple nodes or have multiple attempts, check the executions:
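```bash
# lists each execution (instance) of the job
bacalhau job executions <job-id>
```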
Each execution represents an instance of your job running on a specific node.
This guide explains how to configure output publishing and retrieve results from Bacalhau jobs across different storage systems. Proper output handling is essential for building effective data pipelines and workflows.
How Bacalhau's Publishers mechanism works
How to configure different output destination types
How to retrieve outputs from various storage systems
How to choose the right publisher for your use case
In Bacalhau, you need to configure two key components for handling outputs:
A Publisher defines where your job's output files are stored after execution
Result Paths specify which directories should be captured as job results.
Retrieving Local Outputs
After your job completes, retrieve outputs using the bacalhau job get command:
This will download all published outputs to your current directory.
Important Notes:
If you define a publisher without specifying result paths, only stdout and stderr will be uploaded to the chosen publisher
If you define result paths without a publisher, the job will fail
You can have multiple result paths, each capturing different directories
Bacalhau supports multiple publisher types to accommodate different needs and infrastructure requirements.
The S3 Publisher uploads outputs to an Amazon S3 bucket or any S3-compatible storage service, such as MinIO. The compute node must have permission to write to the bucket, and the orchestrator must have permission to provide pre-signed URLs to download the results.
The IPFS Publisher uploads outputs to the InterPlanetary File System. Both the client (downloading the result) and the compute node must be connected to an IPFS daemon.
Local Publisher
The Local Publisher saves outputs to the local filesystem of the compute node that ran your job. This is intended for local testing only, as it requires the client downloading the results to be on the same network as the compute node.
If you are using the local publisher, make sure the path is available to your job.
For example, in your config file for your node, you probably want to mount in the local file system:
If you don't see expected outputs:
Check that your job wrote to the directories specified in your ResultPaths
Verify the job completed successfully with bacalhau job describe <jobID>
Check for errors in the logs with bacalhau job logs <jobID>
For S3 publisher problems:
Ensure compute nodes have proper IAM roles or credentials to write to the bucket
Check that the orchestrator has permissions to generate pre-signed URLs
For IPFS publisher issues:
Ensure IPFS daemon is running on both compute node and client
Check for network connectivity between nodes
Verify you have enough disk space for pinning
After a Bacalhau job completes, you'll need to retrieve the output files generated by your job. This guide explains the basics of downloading job results.
How to specify output paths in your jobs
How to retrieve job results using the CLI
To download the results of a completed job:
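```bash
bacalhau job get <job-id>
```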
This command downloads all outputs from the job to your current directory.
You can specify where to save the downloaded results:
For larger downloads, you can adjust the timeout:
When submitting a job, you need to define which files or directories should be collected as outputs, and where those outputs should be published.
For Docker jobs, use the --output flag to define outputs and the --publisher flag to specify where to publish the results:
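A hedged sketch; the output name, bucket, and exact flag value formats are illustrative:

```bash
bacalhau docker run \
  --publisher s3://my-bucket/results \
  --output outputs:/outputs \
  ubuntu:latest -- sh -c 'echo "analysis complete" > /outputs/report.txt'
```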
This tells Bacalhau to:
Collect everything in the /outputs directory of the container
Publish it to the specified S3 bucket and path
Make it available for download with bacalhau job get
You can also define outputs in a job specification file:
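A hedged sketch of an equivalent spec (field names may vary slightly by version; the bucket and key are placeholders):

```yaml
Name: publish-results
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/sh
        Parameters:
          - -c
          - echo "analysis complete" > /outputs/report.txt
    Publisher:
      Type: s3
      Params:
        Bucket: my-bucket
        Key: results/
    ResultPaths:
      - Name: outputs
        Path: /outputs
```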
Submit this job using:
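```bash
# assuming the spec above is saved as publish-results.yaml
bacalhau job run publish-results.yaml
```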
You can specify multiple output paths in a single job:
After running bacalhau job get, the results will be organized in a directory structure like this:
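The layout looks roughly like this (the top-level folder name is based on the job ID):

```
job-<jobID>/
├── exitCode
├── outputs/
├── stderr
└── stdout
```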
The directory structure includes:
exitCode: Contains the exit code of the job
outputs: Contains all the files from the job's specified output directories
stderr: Captures any error output from the job
stdout: Captures the standard output from the job
This fetches the latest Bacalhau release and places it in /usr/local/bin or a similar path.
You may need sudo or root access to install the binary at the desired path.
Windows users can download the latest release tarball from GitHub and extract bacalhau.exe to any location available in the PATH environment variable.
| Image | Purpose |
| --- | --- |
| Base Image | Suitable for running orchestrators, clients, or compute nodes with no Docker support |
| Docker in Docker | Suitable for running compute nodes that can run Docker-based jobs; requires --privileged mode when running the container |
Configure CPU, memory, disk, and GPU requirements for your Bacalhau jobs to ensure efficient resource utilization.
Specify CPU cores and memory allocation for your jobs. Default values are CPU: 500m and Memory: 512MB.
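```bash
bacalhau docker run --cpu 2 --memory 4gb ubuntu:latest -- sh -c 'echo "crunching numbers"'
```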
Request GPU resources for machine learning, deep learning, and other GPU-accelerated tasks.
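```bash
# request a single GPU and print its status (image and command are illustrative)
bacalhau docker run --gpu 1 nvidia/cuda:12.2.0-base-ubuntu22.04 -- nvidia-smi
```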
Note: The compute node must have available GPUs with proper drivers, and your container image should include necessary GPU libraries (e.g., CUDA).
Control how much disk space your job can use:
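```bash
bacalhau docker run --disk 10gb ubuntu:latest -- sh -c 'df -h /'
```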
Job Stuck Pending: You may be requesting resources that aren't available. Check available resources with bacalhau node list or reduce requirements.
Out of Memory (OOM): Increase memory allocation or process data in smaller batches.
Disk Space Issues: Increase disk allocation or clean up temporary files during processing.
Start with conservative resource requests and scale up as needed
For memory-intensive tasks, add a 20-30% buffer to your estimated peak usage
Check if your framework can effectively use multiple GPUs before requesting them
To select specific GPU types, use label selectors:
This page provides a quick reference for common issues encountered by Bacalhau users and their solutions. Identifying and avoiding these pitfalls will help you create more reliable jobs and workflows.
How to diagnose and resolve common Bacalhau job issues
Strategies for debugging stuck, failed, or misbehaving jobs
Best practices to prevent common problems
One of the most common issues is jobs remaining in the "Pending" state and never executing.
No available nodes: No compute nodes are connected to the orchestrator
Resource constraints too high: Requesting more CPU, memory, or GPU than any available node can provide
Mismatched node selector: Job requirements don't match available node capabilities
Network partitioning: Orchestrator can't communicate with compute nodes
Check the job status and specifications for clues:
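```bash
bacalhau job describe <job-id>
```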
Look for status messages that might indicate scheduling issues.
Check available compute nodes:
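```bash
bacalhau node list
```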
Ensure there are active compute nodes with sufficient resources.
Reduce resource requests: Lower CPU, memory, or GPU requirements
Add more compute nodes: Add capacity to your cluster
Check network connectivity: Ensure nodes can communicate with each other
Modify job requirements: Adjust constraints to match available resources
Problems accessing or mounting input data are another common source of failures.
Wrong path or URL: Incorrect or inaccessible source location
Missing credentials: No or invalid authentication for S3 or private URLs
Network limitations: Compute node can't reach data source
Path mapping errors: Incorrect source-to-destination mapping
Check job specs and status:
If the job started but failed during execution, check logs:
Look for messages like "file not found" or "access denied".
Validate paths: Double-check that source paths, URLs, or S3 buckets exist and are accessible
Check credentials: Ensure proper environment variables or configuration for authenticated sources
Test connectivity: Verify the compute node can reach the data source
Local testing: Test data access locally before running on Bacalhau
Example of corrected input mounting:
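```bash
# verify the bucket/key exists and mount it to an absolute path inside the container
bacalhau docker run \
  --input s3://my-bucket/datasets/input.csv:/inputs/input.csv \
  ubuntu:latest -- head -n 5 /inputs/input.csv
```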
Jobs complete successfully, but expected output files are missing.
Wrong output path: Not writing to the /outputs directory
Command errors: The job ran but the command failed to produce output
Permission issues: Container user can't write to output location
Publisher configuration: Publisher not configured correctly
Check job specification and execution details:
If the job executed, check logs for clues about what the job did:
Verify your job actually wrote to the /outputs directory.
Use absolute paths: Always use absolute paths in your commands
Write to /outputs: Ensure your job writes to the /outputs directory specifically
Add debugging: Add commands to list directories and print current working directory
Check permissions: Ensure your process has permission to write to the output location
Example of corrected output writing:
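```bash
# write results to the /outputs directory using an absolute path
bacalhau docker run ubuntu:latest -- sh -c 'mkdir -p /outputs && echo "done" > /outputs/result.txt'
```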
Issues with container execution or container image availability.
Image not found: The specified container image doesn't exist or is inaccessible
Command errors: The command specified doesn't exist in the container
Resource limitations: The container runs out of resources during execution
Exit codes: The container process exits with a non-zero code
Check job specification for container configuration:
If the container started, check logs for execution errors:
Look for messages about image pulling or command execution.
Verify image exists: Check that the image name is correct and accessible
Test locally: Try running the container locally with Docker first
Check command: Ensure the command exists in the container and has correct syntax
Adjust resources: Provide sufficient CPU, memory, and disk for your workload
Example of corrected container image:
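```bash
# use a fully qualified, existing image and tag
bacalhau docker run docker.io/library/python:3.11-slim -- python --version
```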
Jobs fail because they run out of resources during execution.
Out of memory (OOM): Job exceeds allocated memory
Disk space exhaustion: Job writes more data than allocated disk space
CPU thrashing: Insufficient CPU allocation causes extreme slowdown
GPU memory errors: CUDA out of memory errors for GPU jobs
Check job specification and status:
If the job executed, check logs for error messages:
Look for error messages about memory, disk space, or resource limits.
Increase resources: Allocate more memory, CPU, or disk space
Optimize code: Reduce resource usage in your application
Process in batches: Break large workloads into smaller chunks
Clean up temporary files: Remove unneeded files during processing
Example of increased resource allocation:
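```bash
bacalhau docker run --cpu 4 --memory 8gb --disk 20gb ubuntu:latest -- sh -c 'echo "heavy workload here"'
```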
Problems related to how commands and arguments are passed to containers.
Missing separator: No -- between Bacalhau flags and container command
Quote handling: Issues with shell quotes and argument passing
Special characters: Problems with special characters in commands
Check the exact command being executed:
Look at the command fields to see what was actually executed.
Use the separator: Always use -- between Bacalhau flags and the container command
Quote properly: Be careful with nested quotes in shell commands
Use bash -c: For complex commands, wrap them in bash -c '...'
Use YAML specs: For very complex commands, use declarative YAML specifications
Example of corrected command syntax:
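```bash
# note the -- separator and the single-quoted shell command
bacalhau docker run ubuntu:latest -- bash -c 'ls /inputs && echo "processing complete"'
```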