This directory contains examples relating to performing common tasks with Bacalhau.
Well done on deploying your Bacalhau cluster! Now that the deployment is finished, this document will help with the next steps. It provides important information on how to interact with and manage the cluster. You'll find details on the outputs from the deployment, including how to set up and connect a Bacalhau client, and how to authorize and connect a Bacalhau compute node to the cluster. This guide gives you everything you need to start using your Bacalhau setup.
After completing the deployment, several outputs will be presented. Below is a description of each output and instructions on how to configure your Bacalhau node using them.
Description: The IP address of the Requester node for the deployment and the endpoint where the Bacalhau API is served.
Usage: Configure the Bacalhau Client to connect to this IP address in the following ways:
Setting the --api-host CLI flag
Setting the BACALHAU_API_HOST environment variable
Modifying the Bacalhau configuration file
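A minimal sketch of all three approaches; the requester IP is a placeholder, and the API.Host configuration key name is an assumption:

```bash
# 1. Per-command CLI flag
bacalhau --api-host <requester-ip> job list

# 2. Environment variable
export BACALHAU_API_HOST=<requester-ip>

# 3. Configuration file (YAML) -- the API.Host key name is assumed
# API:
#   Host: <requester-ip>
```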
Description: The token used to authorize a client when accessing the Bacalhau API.
Usage: The Bacalhau client prompts for this token the first time a command is issued to the Bacalhau API.
Description: The token used to authorize a Bacalhau Compute node to connect to the Requester Node.
Usage: A Bacalhau Compute node can be connected to the Requester Node using the following command:
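A hedged sketch of what that command might look like, assuming the -c flag is used to set the Compute.Orchestrators and Compute.Auth.Token configuration keys; the address and token are placeholders:

```bash
bacalhau serve --compute \
  -c Compute.Orchestrators=nats://<requester-ip>:4222 \
  -c Compute.Auth.Token=<compute-auth-token>
```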
Welcome to the guide for setting up your own Bacalhau cluster across multiple Azure regions! This guide will walk you through creating a robust, distributed compute cluster that's perfect for running your Bacalhau workloads.
Think of this as building your own distributed supercomputer! Your cluster will provision compute nodes spread across different Azure regions for global coverage.
You'll need a few things ready:
Terraform (version 1.0.0 or newer)
A running Bacalhau orchestrator node
Azure CLI installed and set up
An active Azure subscription
Your subscription ID handy
An SSH key pair for securely accessing your nodes
First, create a terraform.tfvars.json file and fill in your Azure details:
Update your config/config.yaml with your orchestrator information. Specifically, these lines:
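A sketch of the relevant lines; the orchestrator address and token are placeholders for the values from your own deployment:

```yaml
Compute:
  Orchestrators:
    - nats://<your-orchestrator-ip>:4222
  Auth:
    Token: "<your-network-auth-token>"
```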
Let Terraform get everything ready:
Launch your cluster:
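These two steps follow the standard Terraform workflow:

```bash
terraform init    # download providers and set up the working directory
terraform plan    # optional: preview the resources that will be created
terraform apply   # provision the cluster
```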
The infrastructure is organized into modules:
Network: Creates VNets and subnets in each region
Security Group: Sets up NSGs with rules for SSH, HTTP, and NATS
Instance: Provisions VMs with cloud-init configuration
Once everything's up and running, let's make sure it works!
First, make sure you have the Bacalhau CLI installed. You can read more about installing the CLI here.
Setup your configuration to point at your orchestrator node:
Check on the health of your nodes:
Run a simple test job:
Check on your jobs:
Get your results:
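A sketch of those verification steps using the Bacalhau CLI; the orchestrator address and job ID are placeholders:

```bash
# Check on the health of your nodes
bacalhau --api-host <orchestrator-ip> node list

# Run a simple test job
bacalhau --api-host <orchestrator-ip> docker run ubuntu -- echo "Hello from Bacalhau"

# Check on your jobs
bacalhau --api-host <orchestrator-ip> job list

# Get your results
bacalhau --api-host <orchestrator-ip> job get <job-id>
```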
Having issues? Here are some common solutions:
Double-check your Azure permissions
Make sure your subscription is active
Verify that all needed resource providers are registered
Look at the logs on a node: journalctl -u bacalhau-startup.service
Check Docker logs on a node: docker logs <container-id>
Make sure that port 4222 isn't blocked
Verify your NATS connection settings
Check if nodes are properly registered
Make sure compute is enabled in your config
When you're done, clean everything up with:
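With Terraform-managed infrastructure, teardown is the usual:

```bash
terraform destroy
```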
If you need to peek under the hood, here's how:
Find your node IPs:
SSH into a node:
Check on Docker:
Go into the container on the node:
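A sketch of those debugging steps; output names, usernames, key paths and container IDs are placeholders:

```bash
# Find your node IPs from the Terraform outputs
terraform output

# SSH into a node
ssh -i ~/.ssh/<your-key> <username>@<node-ip>

# Check on Docker
sudo docker ps

# Go into the Bacalhau container on the node
sudo docker exec -it <container-id> /bin/bash
```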
Here's what each important file does in your setup:
main.tf: Your main Terraform configuration
variables.tf: Where input variables are defined
outputs.tf: What information Terraform will show you
modules/network: Handles VNet and subnet creation
modules/securityGroup: Manages network security groups
modules/instance: Provisions VMs with cloud-init
cloud-init/init-vm.yml: Sets up your VM environment, installs packages, and gets services running
config/docker-compose.yml: Runs Bacalhau in a privileged container with all the right volumes and health checks
For ensuring that you have configured your Azure CLI correctly, here are some commands you can use:
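For example, to confirm you are logged in and pointed at the right subscription:

```bash
az account show                                   # current login state and subscription
az account list --output table                    # all subscriptions you can access
az account set --subscription <subscription-id>   # switch to the subscription you want to use
```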
If you get stuck or have questions:
Open an issue in our GitHub repository
Join our Slack
We're here to help you get your cluster running smoothly! 🌟
Welcome to the guide for setting up your own Bacalhau cluster across multiple Google Cloud Platform (GCP) regions! This guide will walk you through creating a robust, distributed compute cluster that's perfect for running your Bacalhau workloads.
Think of this as building your own distributed supercomputer! Your cluster will provision compute nodes spread across different GCP regions for global coverage.
You'll need a few things ready:
Terraform (version 1.0.0 or newer)
A running Bacalhau orchestrator node
Google Cloud SDK installed and set up
An active GCP billing account
Your organization ID handy
An SSH key pair for securely accessing your nodes
Make sure you are logged in with GCP. This could involve both of the following commands:
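Typically that means:

```bash
gcloud auth login                        # log in to GCP
gcloud auth application-default login    # provide credentials that Terraform can use
```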
Clone the examples repo to your machine and go into the GCP directory.
Now, make a copy of the example environment file:
Open up env.json and fill in your GCP details (more on this below!)
Update your config/config.yaml with your orchestrator information. Specifically, these lines:
Let Terraform get everything ready:
Launch your cluster:
The entire process takes about 8 minutes, but should end with something like the below:
You're good to go!
The env.json file is where all the magic happens. Here's what you'll need to fill in:
bootstrap_project_id: Your existing GCP project (just used for setup)
base_project_name: What you want to call your new project
gcp_billing_account_id: Where the charges should go
gcp_user_email: Your GCP email address
org_id: Your organization's ID
app_tag: A friendly name for your resources (like "bacalhau-demo")
bacalhau_data_dir: Where job data should be stored
bacalhau_node_dir: Where node configs should live
username: Your SSH username
public_key: Path to your SSH public key
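A sketch of a filled-in env.json; the field names are the ones listed above, and all values are placeholders:

```json
{
  "bootstrap_project_id": "my-existing-project",
  "base_project_name": "bacalhau-cluster",
  "gcp_billing_account_id": "000000-AAAAAA-BBBBBB",
  "gcp_user_email": "you@example.com",
  "org_id": "123456789012",
  "app_tag": "bacalhau-demo",
  "bacalhau_data_dir": "/bacalhau_data",
  "bacalhau_node_dir": "/bacalhau_node",
  "username": "my-ssh-user",
  "public_key": "~/.ssh/id_rsa.pub"
}
```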
You can set up nodes in different regions with custom configurations:
Once everything's up and running, let's make sure it works!
First configure the CLI to use your cluster:
Check on the health of your nodes:
Run a simple test job:
Check on your jobs:
Get your results:
Having issues? Here are some common solutions:
Double-check your GCP permissions
Make sure your billing account is active
Verify that all needed APIs are turned on in GCP
Look at the logs on a node: journalctl -u bacalhau-startup.service
Check Docker logs on a node: docker logs <container-id>
Make sure that port 4222 isn't blocked
Verify your NATS connection settings
Check if nodes are properly registered
Make sure compute is enabled in your config
When you're done, clean everything up with:
If you need to peek under the hood, here's how:
Find your node IPs:
SSH into a node:
Check on Docker:
Go into the container on the node:
Here's what each important file does in your setup:
main.tf: Your main Terraform configuration
variables.tf: Where input variables are defined
outputs.tf: What information Terraform will show you
config/config.yaml: How your Bacalhau nodes are configured
scripts/startup.sh: Gets your nodes ready to run
scripts/bacalhau-startup.service: Manages the Bacalhau service
cloud-init/init-vm.yml: Sets up your VM environment, installs packages, and gets services running
config/docker-compose.yml: Runs Bacalhau in a privileged container with all the right volumes and health checks
The neat thing is that most of your configuration happens in just one file: env.json. Though if you want to get fancy, there's lots more you can customize!
If you get stuck or have questions:
We're here to help you get your cluster running smoothly! 🌟
This tutorial describes how to add new nodes to an existing private network. Two basic scenarios will be covered:
Adding a machine as a new node.
Adding a as a new node.
You should have an established private network consisting of at least one requester node.
You should have a new host (physical/virtual machine, cloud instance or docker container) with Bacalhau installed.
Let's assume that you already have a private network with at least one requester node. You will need to:
Set the token in the Compute.Auth.Token configuration key
Set the orchestrator's IP address in the Compute.Orchestrators configuration key
Execute bacalhau serve, specifying the node type via the --compute flag
To automate the process using Terraform, follow these steps:
Determine the IP address of your requester node
Write a terraform script, which does the following:
Adds a new instance
Installs bacalhau on it
Launches a compute node
Execute the script
When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.
If you want more control over making the decision to take on jobs, you can use the JobAdmissionControl.ProbeExec and JobAdmissionControl.ProbeHTTP configuration keys.
These are external programs that are passed the following data structure so that they can make a decision about whether to take on a job:
The exec probe is a script to run that will be given the job data on stdin, and must exit with status code 0 if the job should be run.
The http probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns an error status code (e.g. >= 400).
For example, the following response will reject the job:
If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.
First, make sure you have the Bacalhau CLI installed. You can read more about installing the CLI here.
If you're using the Expanso Cloud hosted orchestrator (Recommended!), you can look at your nodes on the dashboard in real-time.
Open an issue in our GitHub repository
Join our Slack
Let's assume you already have all the necessary cloud infrastructure set up with a private network with at least one requester node. In this case, you can add new nodes manually or use a tool like Terraform to automatically create and add any number of nodes to your network.
Configure Terraform for your cloud provider
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
If the HTTP response is a JSON blob, it should match the and will be used to respond to the bid directly:
How to enable GPU support on your Bacalhau node
Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.
Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.
In general for GPUs from any vendor, the Bacalhau client requires:
For NVIDIA GPUs: nvidia-smi installed and functional. You can verify the installation by running a sample workload.
For AMD GPUs: the rocm-smi tool installed and functional. See Running ROCm Docker containers for guidance on how to run Docker workloads on AMD GPUs.
For Intel GPUs: the xpu-smi tool installed and functional. See Running on GPU under docker for guidance on how to run Docker workloads on Intel GPUs.
Access to GPUs can be controlled using resource limits. To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.
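For example, a per-job limit can be requested with the --gpu flag (a sketch; the image is the CUDA container used later in this documentation), while a node-wide cap can be set with the Compute.AllocatedCapacity.GPU configuration key described in the capacity section below:

```bash
# Ask for a single GPU for this job
bacalhau docker run --gpu 1 nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04 -- nvidia-smi
```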
How to configure authentication and authorization on your Bacalhau node.
Bacalhau includes a flexible auth system that supports multiple methods of auth that are appropriate for different deployment environments.
With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.
In anonymous mode, Bacalhau will allow:
Users identified by a self-generated private key to submit any job and cancel their own jobs.
Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.
Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.
Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:
Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.
Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:
Then, modify the allowed_clients variable in challenge_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau agent node.
Once the node is restarted, only keys in the allowed list will be able to access any API.
Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.
Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt. To generate a salted password, the helper script in pkg/authn/ask/gen_password can be used:
This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into ask_ns_password.rego.
In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.
Policies are written in a language called Rego, also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.
Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input variable. The exact information depends on the type of authentication used.
challenge authentication
challenge authentication identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.
Policies used for challenge authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.
Policies for this type will need to implement these rules:
bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, this should be undefined.
They should expect as fields on the input variable:
clientId: an ID derived from the user's private key that identifies them uniquely
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned
The simplest possible policy might therefore be this policy that returns the same opaque token for all users:
A more realistic example that returns a signed JWT is in challenge_ns_anon.rego.
ask authentication
ask authentication uses credentials supplied manually by the user as identification. For example, an ask policy could require a username and password as input and check these against a known list. ask policies do all the verification of the supplied credentials.
Policies for this type will need to implement these rules:
bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, this should be undefined.
bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.
They should expect as fields on the input variable:
ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned
The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:
A more realistic example that returns a signed JWT is in ask_ns_example.rego.
Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.
Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthenticated users.
Policies will need to implement these rules:
bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.
bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.
They should expect as fields on the input variable for both rules:
http: details of the user's HTTP request:
host: the hostname used in the HTTP request
method: the HTTP method (e.g. GET, POST)
path: the path requested, as an array of path components without slashes
query: a map of URL query parameters to their values
headers: a map of HTTP header names to arrays representing their values
body: a blob of any content submitted as the body
constraints: details about the receiving node that should be used to validate any supplied tokens:
cert: keys that the input token should have been signed with
iss: the name of a node that this node will recognize as the issuer of any signed tokens
aud: the name of this node that is receiving the request
Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method, which will validate the access token as a JWT against the given constraints.
The simplest possible authz policy might be this one that allows all users to access all endpoints:
A more realistic example (which is the Bacalhau "anonymous mode" default) is in policy_ns_anon.rego.
Welcome to the guide for setting up your own Bacalhau cluster across multiple AWS regions! This guide will walk you through creating a robust, distributed compute cluster that's perfect for running your Bacalhau workloads.
Think of this as building your own distributed supercomputer! Your cluster will provision compute nodes spread across different AWS regions for global coverage.
You'll need a few things ready:
Terraform (version 1.0.0 or newer)
AWS CLI installed and configured
An active AWS account with appropriate permissions
Your AWS credentials configured
An SSH key pair for securely accessing your nodes
A Bacalhau network
First, set up an orchestrator node. We recommend using Expanso Cloud for this! But you can always set up your own.
Create your environment configuration file:
Fill in your AWS details in env.tfvars.json:
Configure your desired regions in locations.yaml. Here's an example (we have a full list of these in all_locations.yaml):
Make sure the AMI exists in the region you need it to! You can confirm this by executing the following command:
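A sketch using the AWS CLI; the AMI ID and region are placeholders:

```bash
aws ec2 describe-images --image-ids <ami-id> --region <region>
```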
Update your Bacalhau config/config.yaml (the defaults are mostly fine, just update the Orchestrator, and Token lines):
Deploy your cluster using the Python deployment script:
Terraform on AWS requires switching to different workspaces when deploying to different availability zones. As a result, we had to set up a separate deploy.py script which switches to each workspace for you under the hood, to make it easier.
env.tfvars.json: Your main configuration file containing AWS-specific settings
locations.yaml: Defines which regions to deploy to and instance configurations
config/config.yaml: Bacalhau node configuration
app_name: Name for your cluster resources
app_tag: Tag for resource management
bacalhau_installation_id: Unique identifier for your cluster
username: SSH username for instances
public_key_path: Path to your SSH public key
private_key_path: Path to your SSH private key
bacalhau_config_file_path: Path to the config file for this compute node (should point at the orchestrator and have the right token)
Each region entry requires:
region: AWS region (e.g., us-west-2)
zone: Availability zone (e.g., us-west-2a)
instance_type: EC2 instance type (e.g., t3.medium)
instance_ami: AMI ID for the region
node_count: Number of instances to deploy
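A sketch of one region entry using those fields; the exact top-level layout of locations.yaml may differ, and all values are placeholders:

```yaml
- region: us-west-2
  zone: us-west-2a
  instance_type: t3.medium
  instance_ami: ami-0123456789abcdef0
  node_count: 2
```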
Once everything's up and running, let's make sure it works!
First, make sure you have the Bacalhau CLI installed. You can read more about installing the CLI here.
Configure your Bacalhau client:
List your compute nodes:
Run a test job:
Check job status:
Verify AWS credentials are properly configured:
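For example:

```bash
aws sts get-caller-identity   # confirms which account and principal your credentials resolve to
```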
Check IAM permissions
Ensure you have quota available in target regions
SSH into a node:
Check Bacalhau service logs:
Check Docker container status:
Verify security group rules (ports 22, 80, and 4222 should be open)
Check VPC and subnet configurations
Ensure internet gateway is properly attached
If nodes aren't joining the network:
Check NATS connection string in config.yaml
Verify security group allows port 4222
Ensure nodes can reach the orchestrator
If jobs aren't running:
Check compute is enabled in node config
Verify Docker is running properly
Check available disk space
If deployment fails:
Look for errors in Terraform output
Check AWS service quotas
Verify AMI availability in chosen regions
Remove all resources:
Check node health:
If you get stuck or have questions:
Open an issue in our GitHub repository
Join our Slack
We're here to help you get your cluster running smoothly! 🌟
How to use docker containers with Bacalhau
Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.
This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.
You can check out this example tutorial on how to work with custom containers in Bacalhau to see how we used all these steps together.
Here are few things to note before getting started:
Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.
Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.
You can check to see a list of example public containers used by the Bacalhau team
Note: Only about a third of examples have their containers here. The rest are under random docker hub registries.
To help provide a safe, secure network for all users, we add the following runtime restrictions:
Limited Ingress/Egress Networking: All ingress/egress networking is limited as described in the networking documentation. You won't be able to pull data/code/weights, etc. from an external source.
Data Passing with Docker Volumes: A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:
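A sketch of such a job; the image and the command inside the container are hypothetical, and the exact -i/-o mount syntax should be checked against the --input and --output flag documentation:

```bash
bacalhau docker run \
  -i s3://mybucket/logs-2023-04*:/input \
  -o apples:/output_folder \
  ubuntu -- bash -c 'ls /input > /output_folder/files.txt'
```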
The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with the logs-2023-04 prefix within the docker container at location /input (root).
Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.
Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder, which is then published to a remote location by the publisher.
If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.
We make the assumption that you are reading from a directory called /inputs, which is set as the default.
If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.
We make the assumption that you are writing to a directory called /outputs, which is set as the default.
At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build your image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the given task.
Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.
For example:
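A sketch of building and publishing an x86_64 image; the registry, image name and tag are placeholders:

```bash
docker build --platform linux/amd64 -t <registry>/<image>:<tag> .
docker push <registry>/<image>:<tag>
```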
To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:
Let's see what each command will be used for:
Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint flag to bacalhau docker run.
For example:
The result of the commands' execution is shown below:
To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command:
To check the status of your job, run the following command:
To get more information on your job, run:
To download your job, run:
For example, running:
outputs:
The --input flag does not support CID subpaths for ipfs:// content.
Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:
The --input flag does not support URL directories.
If you run into this compute error while running your docker image
This can often be resolved by re-tagging your docker image
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel)
Requester nodes store job state and history in a boltdb-backed store (pkg/jobstore/boltdb).
The location of the database file can be specified using the BACALHAU_JOB_STORE_PATH environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-requester.db.
By default, compute nodes store their execution information in a boltdb-backed store (pkg/compute/store/boltdb).
The location of the database file (for a single node) can be specified using the BACALHAU_COMPUTE_STORE_PATH environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-compute.db.
As compute nodes restart, they will find they have existing state in the boltdb database. At startup the database currently iterates the executions to calculate the counters for each state. This will be a good opportunity to do some compaction of the records in the database, and cleanup items no longer in use.
Currently only batch jobs are possible, and so for each of the listed states below, no action is taken at restart. In future it would make sense to remove records older than a certain age, or move them to failed, depending on their current state. For other job types (to be implemented) this may require restarting or resetting jobs.
ExecutionStateCreated: No action
ExecutionStateBidAccepted: No action
ExecutionStateRunning: No action
ExecutionStateWaitingVerification: No action
ExecutionStateResultAccepted: No action
ExecutionStatePublishing: No action
ExecutionStateCompleted: No action
ExecutionStateFailed: No action
ExecutionStateCancelled: No action
The databases can be inspected using the bbolt tool. The bbolt tool can be installed to $GOBIN with:
Once installed, and assuming the database file is stored in $FILE you can use bbolt to:
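A sketch of the install command and a few typical inspections:

```bash
# Install bbolt into $GOBIN
go install go.etcd.io/bbolt/cmd/bbolt@latest

# Inspect the database file
bbolt buckets "$FILE"          # list the top-level buckets
bbolt keys "$FILE" <bucket>    # list the keys in a bucket
bbolt stats "$FILE"            # print database statistics
```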
JobAdmissionControl.Locality (default: Anywhere): Only accept jobs that reference data we have locally ("local") or anywhere ("anywhere").
JobAdmissionControl.ProbeExec (default: unused): Use the result of an external program to decide if we should take on the job.
JobAdmissionControl.ProbeHTTP (default: unused): Use the result of a HTTP POST to decide if we should take on the job.
JobAdmissionControl.RejectStatelessJobs (default: False): Reject jobs that don't specify any input data.
JobAdmissionControl.AcceptNetworkedJobs (default: False): Accept jobs that require network connections.
These are the configuration keys that control the capacity of the Bacalhau node, and the limits for jobs that might be run.
Compute.AllocatedCapacity.CPU: Specifies the amount of CPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string.
Compute.AllocatedCapacity.Disk: Specifies the amount of disk space a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 10Gi).
Compute.AllocatedCapacity.GPU: Specifies the amount of GPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1). Note: when using percentages, the result is always rounded up to the nearest whole GPU.
Compute.AllocatedCapacity.Memory: Specifies the amount of memory a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1Gi).
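A sketch of these keys in a node configuration file; the values are illustrative:

```yaml
Compute:
  AllocatedCapacity:
    CPU: "85%"
    Memory: "85%"
    Disk: "85%"
    GPU: "100%"
```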
It is also possible to additionally specify the number of resources allocated to each job by default, if the required number of resources is not specified in the job itself. The JobDefaults.<Job Type>.Task.Resources.<Resource Type> configuration keys are used for this purpose. For example, to provide each ops job with 2Gb of RAM, the JobDefaults.Ops.Task.Resources.Memory key is used:
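A sketch of that key in YAML form, with the 2Gb from the example expressed as the Kubernetes-style resource string 2Gi:

```yaml
JobDefaults:
  Ops:
    Task:
      Resources:
        Memory: 2Gi
```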
See the complete configuration keys list for more details.
Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.
Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like resource limits) are not present in Windows-based nodes.
Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and configured appropriately to support Linux containers, e.g. using the WSL-based backend.
Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled, and no results will be published.
By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.
Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that result in their jobs waiting indefinitely.
Job submitters can pass the --timeout flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.
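For example, to allow a job at most ten minutes of execution time (a sketch):

```bash
bacalhau docker run --timeout 600 ubuntu -- sleep 300
```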
The timeout can also be added to an existing job spec by adding the Timeout property to the Spec.
Node operators can use configuration keys to specify default and maximum job execution time limits. The supplied values should be a numeric value followed by a time unit (one of s for seconds, m for minutes or h for hours).
Here is a list of the relevant properties:
JobDefaults.Batch.Task.Timeouts.ExecutionTimeout: Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined.
JobDefaults.Ops.Task.Timeouts.ExecutionTimeout: Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined.
JobDefaults.Batch.Task.Timeouts.TotalTimeout: Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on.
JobDefaults.Ops.Task.Timeouts.TotalTimeout: Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on.
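A sketch of these keys in a node configuration file; the durations are illustrative:

```yaml
JobDefaults:
  Batch:
    Task:
      Timeouts:
        ExecutionTimeout: 30m
        TotalTimeout: 2h
  Ops:
    Task:
      Timeouts:
        ExecutionTimeout: 30m
        TotalTimeout: 2h
```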
Note that timeouts cannot be configured for Daemon and Service jobs.
How to run the WebUI.
The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up and utilizing the WebUI.
For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.
Ensure you have Bacalhau v1.5.0 or later installed.
To enable the WebUI, use the WebUI.Enabled configuration key:
By default, the WebUI uses host=0.0.0.0 and port=8438. This can be configured via the WebUI.Listen configuration key:
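A sketch of both keys in a node configuration file; the listen address shown is the default:

```yaml
WebUI:
  Enabled: true
  Listen: 0.0.0.0:8438
```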
Once started, the WebUI is accessible at the specified address, localhost:8438 by default.
The updated WebUI allows you to view a list of jobs, including job status, run time, type, and a message in case the job failed.
Clicking on the id of a job in the list opens the job details page, where you can see the history of events related to the job, the list of nodes on which the job was executed and the real-time logs of the job.
On the Nodes page you can see a list of nodes connected to your network, including node type, membership and connection statuses, amount of resources - total and currently available, and a list of labels of the node.
Clicking on the node id opens the node details page, where you can see the status and settings of the node, the number of running and scheduled jobs.
Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources are storage resources consumed as inputs to jobs, and Publishers are storage resources created with the results of jobs.
Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the for more details.
To use the S3 source, you will have to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChecksumSHA256.
Below is an example of how to define an S3 input source in YAML format:
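A sketch, under the assumption that the declarative spec nests these parameters under Source.Params with a Target mount path; all values are placeholders, and the S3 source reference holds the authoritative schema:

```yaml
InputSources:
  - Target: /data
    Source:
      Type: s3
      Params:
        Bucket: my-bucket
        Key: logs/
        Region: us-east-1
```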
To start, you'll need to connect the Bacalhau node to an IPFS server so that you can run jobs that consume CIDs as inputs. You can either install IPFS and run it locally, or you can connect to a remote IPFS server.
In both cases, you should have an IPFS multiaddress for the IPFS server that should look something like this:
The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.
You can then configure your Bacalhau node to use this IPFS server by adding the address to the InputSources.Types.IPFS.Endpoint configuration key:
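A sketch of that key in YAML form; the multiaddress is a placeholder for your own IPFS server:

```yaml
InputSources:
  Types:
    IPFS:
      Endpoint: /ip4/127.0.0.1/tcp/5001
```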
See the for more details.
Below is an example of how to define an IPFS input source in YAML format:
To use a local data source, you will have to:
Enable the use of local data when configuring the node itself by using the Compute.AllowListedLocalPaths configuration key, specifying the file path and access mode. For example:
In the job description, specify the parameters SourcePath (the absolute path on the compute node where your data is located) and ReadWrite (the access mode).
Below is an example of how to define a Local input source in YAML format:
To use a URL data source, you will have to specify only URL parameter, as in the part of the declarative job description below:
Bacalhau's S3 Publisher provides users with a secure and efficient method to publish job results to any S3-compatible storage service. To use an S3 publisher you will have to specify required parameters Bucket and Key and optional parameters Region, Endpoint, VersionID, ChecksumSHA256. See the for more details.
Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:
The IPFS publisher works using the same setup as above - you'll need to have an IPFS server running and a multiaddress for it. Then you'll configure that multiaddress using the InputSources.Types.IPFS.Endpoint configuration key. Then you can use bacalhau job get <job-ID> with no further arguments to download the results.
To use the IPFS publisher you will have to specify CID which can be used to access the published content. See the for more details.
And part of the declarative job description with an IPFS publisher will look like this:
The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.
Another possibility to store the results of a job execution is on a compute node. In this case the results will be published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. To use the Local publisher you will have to specify only the URL parameter, with an HTTP URL pointing to the location where you would like to save the result. See the for more details.
Here is an example of part of the declarative job description with a local publisher:
In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.
In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations
Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.
Computations like matrix multiplication could be done much faster on GPU than on CPU
To get started, you need to install the Bacalhau client, see more information
You'll need to have the following installed:
NVIDIA GPU
CUDA drivers installed
nvcc installed
Checking if nvcc is installed:
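A quick check:

```bash
nvcc --version
```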
Downloading the programs:
00-hello-world.cu:
This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.
02-cuda-hello-world-faster.cu:
In this example we utilize Vector addition using CUDA and allocate the memory in advance and copy the memory to the GPU using cudaMemcpy so that it can utilize the HBM (High Bandwidth memory of the GPU). Compilation and execution occur faster (1.39 seconds) compared to the previous example (8.67 seconds).
To submit a job, run the following Bacalhau command:
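A sketch of the full command, assembled from the pieces explained below; the --gpu 1 resource request is an assumption not shown in the breakdown:

```bash
bacalhau docker run \
  --gpu 1 \
  -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
  nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04 \
  -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello'
```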
bacalhau docker run: call to Bacalhau
-i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu: URL of the input data volume downloaded from a URL source.
nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04: Docker container for executing CUDA programs (you need to choose the right CUDA docker container). The container should have the "devel" tag.
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu: compilation using the nvcc compiler, saving the binary to the outputs directory as hello.
Note that there is a ; between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello'. The ";" symbol allows executing multiple commands sequentially in a single line.
./outputs/hello: execution of the hello binary. You can combine compilation and execution commands.
Note that the CUDA version will need to be compatible with the graphics card on the host machine
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
Bacalhau supports running programs that are compiled to WebAssembly (Wasm). With the Bacalhau client, you can upload Wasm programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.
Supported WebAssembly System Interface (WASI): Bacalhau can run compiled Wasm programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions: All ingress/egress networking is disabled; you won't be able to pull data/code/weights, etc. from an external source. Wasm jobs can say what data they need using URLs or CIDs (Content IDentifiers) and can then access the data by reading from the filesystem.
Single-Threading: There is no multi-threading, as WASI does not expose any interface for it.
If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:
Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.
Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of Wasm jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.
By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.
You can specify additional or different output mounts using the -o flag.
You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.
You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.
Run Locally Compiled Program:
If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:
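A sketch, assuming main.wasm is in the current directory:

```bash
bacalhau wasm run main.wasm
```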
The program you specify will be uploaded to a Bacalhau storage node and will be publicly available if you are using the public demo network.
Consider creating your own private network.
Alternative Program Specification:
You can use a Content IDentifier (CID) for a specific WebAssembly program.
Input Data Specification:
Make sure to specify any input data using the --input flag. This ensures the necessary data is available for the program's execution.
You can give the Wasm program arguments by specifying them after the program path or CID. If the Wasm program is already compiled and located in the current directory, you can run it by adding arguments after the file name:
For a specific WebAssembly program, run:
Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.
For example, instead of hard-coding your program to read from /inputs/data.txt, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run:
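A sketch, assuming the program treats its first argument as the input path (as described above, arguments follow the program path or CID):

```bash
bacalhau wasm run main.wasm /inputs/data.txt
```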
Your language of choice should contain a standard way of reading program arguments that will work with WASI.
You can also specify environment variables using the -e flag.
How to use Bacalhau Docker Image for task management
This documentation explains how to use the Bacalhau Docker image for task management with Bacalhau client.
To get started, you need to install the Bacalhau client (see more information ) and Docker.
The first step is to pull the Bacalhau Docker image from the GitHub container registry.
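A sketch, assuming the image is published as ghcr.io/bacalhau-project/bacalhau:

```bash
docker pull ghcr.io/bacalhau-project/bacalhau:latest
```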
Expected output:
You can also pull a specific version of the image, e.g.:
The output is similar to:
For example to run an Ubuntu-based job that prints the message 'Hello from Docker Bacalhau':
--id-only: Output only the job id
--wait: Wait for the job to finish
ubuntu:latest: Ubuntu container
--: Separate Bacalhau parameters from the command to be executed inside the container
sh -c 'uname -a && echo "Hello from Docker Bacalhau!"': The command executed inside the container
The command execution in the terminal is similar to:
j-6ffd54b8-e992-498f-9ee9-766ab09d5daa is a job ID, which represents the result of executing a command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store that in an environment variable so that we can reuse it later on (env: JOB_ID=j-6ffd54b8-e992-498f-9ee9-766ab09d5daa).
To print the content of the Job ID, execute the following command:
The output is similar to:
You always need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine.
The first part of this example should look familiar, except for the Docker commands.
When a job is submitted, Bacalhau prints the related job_id (j-da29a804-3960-4667-b6e5-73f05e120117):
Job status: You can check the status of the job using bacalhau job list.
When it reads Completed, that means the job is done, and you can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result directory.
After the download is complete, you should see the following contents in the results directory.
This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
We'll be using a very simple Python script that displays the "Hello, World!" message. Create a file called hello-world.py:
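A minimal version of the script:

```python
print("Hello, world!")
```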
Running the script to print out the output:
After the script has run successfully locally we can now run it on Bacalhau.
To submit a workload to Bacalhau, you can use the bacalhau docker run command. This command allows passing input data into the container using volumes; we will be using the --input URL:path argument for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.
Bacalhau overwrites the default entrypoint of the container, so we must run the full command after the -- argument.
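The full command, assembled from the flags explained below:

```bash
bacalhau docker run \
  --id-only \
  --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
  python:3.10-slim \
  -- python3 /inputs/hello-world.py
```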
bacalhau docker run: call to Bacalhau
--id-only: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output
--input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".
python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:
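A sketch of running the saved spec:

```bash
bacalhau job run helloworld.yaml
```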
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
How to configure TLS for the requester node APIs
By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure them to use Transport Layer Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.
Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API
Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Certificate Authority.
Using the --autocert [hostname] parameter to the CLI (in the serve and devstack commands), a certificate is obtained automatically from Let's Encrypt. The provided hostname should be a comma-separated list of hostnames, but they should all be publicly resolvable as Let's Encrypt will attempt to connect to the server to verify ownership (using the challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards they are cached in the bacalhau repository.
Alternatively, you may set these options via the environment variable BACALHAU_AUTO_TLS. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.AutoCert instead.
As a result of the Let's Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.
A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache, and this will be used to manage renewals to avoid rate limits.
Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:
Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.
Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.
Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.
Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.
Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.
Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.
Once you have obtained your certificates, you will need to put two files in a location that bacalhau can read. You need the server certificate, often called something like server.cert or server.cert.pem, and the server key, which is often called something like server.key or server.key.pem.
Once you have these two files available, you must start bacalhau serve with two new flags. These are the tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:
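A sketch, assuming the flags are spelled --tlscert and --tlskey as described above; the paths are placeholders:

```bash
bacalhau serve --tlscert=/path/to/server.cert --tlskey=/path/to/server.key
```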
Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.
Once you have generated the necessary files, the steps are much like above: you must start bacalhau serve with two new flags. These are the tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:
Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.
If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:
Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities.
Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.
Instruct the software library you are using not to verify HTTPS requests.
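For the first of these options, for example, an HTTP client such as curl can be pointed at the CA certificate file when calling the API (host, port and endpoint path below are placeholders):

```bash
curl --cacert ./bacalhau-ca.crt https://bacalhau.example.com:1234/api/v1/agent/alive
```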
Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.
To get started, you need to install the Bacalhau client, see more information
This example requires Docker. If you don't have Docker installed, you can install it from . Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.
You're likely familiar with executing Docker commands to start a container:
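A sketch of that command, reconstructed from the description that follows:

```bash
docker run docker/whalesay cowsay sup old fashioned container run
```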
This command runs a container from the docker/whalesay
image. The container executes the cowsay sup old fashioned container run
command:
This command also runs a container from the docker/whalesay
image, using Bacalhau. We use the bacalhau docker run
command to start a job in a Docker container. It contains additional flags such as --wait
to wait for job completion and --id-only
to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run'
command is executed.
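A sketch of the equivalent Bacalhau submission, reconstructed from the description above:

```bash
bacalhau docker run --wait --id-only docker/whalesay -- bash -c 'cowsay hello web3 uber-run'
```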
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
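A sketch of those steps, assuming the job ID was stored in JOB_ID and that your CLI version supports the --output-dir flag:

```bash
mkdir -p results
bacalhau job get $JOB_ID --output-dir results
```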
Viewing your job output
Both commands execute cowsay in the docker/whalesay
container, but Bacalhau provides additional features for working with jobs at scale.
Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data is passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.
The --wait
flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau job list
command to check on their status.
Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run
command after the --
flag.
To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.
To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distributed it across the Bacalhau network.
Next, the Dockerfile adds the script and sets the entry point.
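A hypothetical Dockerfile along those lines; the base image and script name are assumptions rather than the original example:

```dockerfile
FROM debian:stable-slim
# Install cowsay; on Debian it lives under /usr/games
RUN apt-get update && apt-get install -y cowsay && rm -rf /var/lib/apt/lists/*
ENV PATH="/usr/games:${PATH}"
# Add the wrapper script and make it the entry point
COPY cowsay.sh /cowsay.sh
RUN chmod +x /cowsay.sh
ENTRYPOINT ["/cowsay.sh"]
```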
Now let's build and test the container locally.
Once your container is working as expected, you should push it to a public container registry. In this example, I'm pushing to GitHub's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64 architecture.
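A hedged sketch of building for linux/amd64 and pushing to GitHub's container registry; the image name is a placeholder and pushing requires your own credentials:

```bash
docker buildx build --platform linux/amd64 -t ghcr.io/YOUR_ORG/cowsay:latest --push .
```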
Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.
The bacalhau docker run
command strips the default entry point, so don't forget to run your entry point in the command line arguments.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Download your job results directly by using bacalhau job get
command.
View your job output
Bacalhau supports running jobs as a program. This example demonstrates how to compile a project into WebAssembly and run the program on Bacalhau.
To get started, you need to install the Bacalhau client, see more information .
A working Rust installation with the wasm32-wasi
target. For example, you can use rustup to install Rust and configure it to build WASM targets. For those using the notebook, these are installed in hidden cells below.
We can use cargo
(which will have been installed by rustup
) to start a new project (my-program
) and compile it:
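A minimal sketch of those steps (exact commands may differ from the original example):

```bash
rustup target add wasm32-wasi   # make sure the WASM/WASI target is installed
cargo new my-program            # create a new binary project called my-program
```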
We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.
The program below will use the Rust imageproc
crate to resize an image through seam carving, based on .
In the main function main()
an image is loaded, the original is saved, and then a loop is performed to reduce the width of the image by removing "seams." The results of the process are saved, including the original image with drawn seams and a gradient image with highlighted seams.
We also need to install the imageproc
and image
libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false
). After disabling the default features, you need to explicitly specify only the features that you need:
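A hypothetical Cargo.toml dependency section along those lines; the crate versions are assumptions and the feature list should be trimmed to what your program actually uses:

```toml
[dependencies]
image = { version = "0.24", default-features = false, features = ["png", "jpeg"] }
imageproc = { version = "0.23", default-features = false }
```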
We can now build the Rust program into a WASM blob using cargo
:
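A sketch of that build step:

```bash
cd my-program
cargo build --target wasm32-wasi --release
```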
This command navigates to the my-program
directory and builds the project using Cargo with the target set to wasm32-wasi
in release mode.
This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm
which can now be run on Bacalhau.
Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.
The -i
flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.
For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.
bacalhau wasm run: call to Bacalhau
./my-program/target/wasm32-wasi/release/my-program.wasm: the path to the WASM file that will be executed
_start: the entry point of the WASM program, where its execution begins
--id-only: this flag indicates that only the identifier of the executed job should be returned
-i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs: input data volume that will be accessible within the job at the specified destination path
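Putting those pieces together, the full submission looks roughly like this:

```bash
bacalhau wasm run \
  ./my-program/target/wasm32-wasi/release/my-program.wasm _start \
  --id-only \
  -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs
```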
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results
) and downloaded our job output to be stored in that directory.
We can now get the results.
When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.
Reject jobs that don't specify any .
The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution. See the for more details.
The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the job's execution environment, enabling direct and efficient data utilization. See the for more details.
If you have questions or need support or guidance, please reach out to the (#general channel).
For example, Rust users can specify the wasm32-wasi
target to rustup
and cargo
to get programs compiled for WASI WebAssembly. See for more information on this.
See for a workload that leverages WebAssembly support.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here although there is which should provide a good starting point.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Bacalhau has an update checking service to automatically detect whether a newer version of the software is available.
Users who are both running CLI commands and operating nodes will be regularly informed that a new release can be downloaded and installed.
Bacalhau will run an update check regularly when client commands are executed. If an update is available, explanatory text will be printed at the end of the command.
To force a manual update check, run the bacalhau version
command, which will explicitly list the latest software release alongside the server and client versions.
Bacalhau will run an update check regularly as part of the normal operation of the node.
If an update is available, an INFO level message will be printed to the log.
Bacalhau has some configuration options for controlling how often checks are performed. By default, an update check will run no more than once every 24 hours. Users can opt out of automatic update checks using the configuration described below.
UpdateConfig.Interval (environment variable: BACALHAU_UPDATE_CHECKFREQUENCY; default: 24h0m0s): the minimum amount of time between automated update checks. Set as any duration of hours, minutes or seconds, e.g. 24h or 10m. When set to 0, update checks are not performed.
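For example, a minimal sketch of opting out in the configuration file, using the UpdateConfig.Interval key above:

```yaml
UpdateConfig:
  Interval: 0   # disable automated update checks
```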
It's important to note that disabling the automatic update checks may lead to potential issues, arising from mismatched versions of different actors within Bacalhau.
To output update check config, run bacalhau config list
:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
By default, Bacalhau jobs do not have any access to the internet. This is to keep both compute providers and users safe from malicious activities.
To run Docker jobs on Bacalhau to access the internet, you'll need to specify one of the following:
full: unfiltered networking for any protocol --network=full
http: HTTP(S)-only networking to a specified list of domains --network=http
none: no networking at all, the default --network=none
Specifying none
will still allow Bacalhau to download and upload data before and after the job using a Publisher.
Jobs using http
must specify the domains they want to access when the job is submitted.
So, putting it together the job run should look like this:
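A hedged example of an HTTP-networked job; the image, command and domain are placeholders:

```bash
bacalhau docker run --network=http --domain=example.com \
  alpine -- wget -qO- https://example.com
```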
Jobs will be provided with http_proxy
and https_proxy
environment variables which contain a TCP address of an HTTP proxy to connect through. Most tools and libraries will use these environment variables by default. If not, they must be used by user code to configure HTTP proxy usage.
The required networking can be specified using the --network
flag. For http
networking, the required domains can be specified using the --domain
flag, multiple times for as many domains as required. Specifying a domain starting with a .
means that all sub-domains will be included. For example, specifying .example.com
will cover some.thing.example.com
as well as example.com
.
If you are seeing the following (or any DNS) error, you likely forgot the --network flag!
Bacalhau jobs are explicitly prevented from starting other Bacalhau jobs, even if a Bacalhau requester node is specified on the HTTP allowlist.
Submitting a networked job is only the first part; the second part is ensuring that the nodes in your network will accept networked jobs. You can set nodes up to accept them using an Admission Controller setting in the node config. For example:
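A minimal sketch of such a setting, assuming the JobAdmissionControl key used by recent Bacalhau versions (check bacalhau config list on your version for the exact key name):

```yaml
JobAdmissionControl:
  AcceptNetworkedJobs: true
```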
Bacalhau clusters are composed of requester nodes, and compute nodes. The requester nodes are responsible for managing the compute nodes that make up the cluster. This functionality is only currently available when using NATS for the network transport.
The two main areas of functionality for the requester nodes are, managing the membership of compute nodes that require approval to take part in the cluster, and monitoring the health of the compute nodes. They are also responsible for collecting information provided by the compute nodes on a regular schedule.
As compute nodes start, they register their existence with the requester nodes. Once registered, they will maintain a sentinel file to note that they are already registered, this avoids unnecessary registration attempts.
Once registered, the requester node will need to approve the compute node before it can take part in the cluster. This is to ensure that the requester node is aware of all the compute nodes that are part of the cluster. In future, we may provide mechanisms for auto-approval of nodes joining the cluster, but currently all compute nodes registering default to the PENDING state.
Listing the current nodes in the system will show requester nodes automatically APPROVED, and compute nodes in the PENDING state.
Nodes can be rejected using their node id, and optionally specifying a reason with the -m flag.
Nodes can be approved using their node id.
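Hedged examples of these commands; the node IDs and message are placeholders:

```bash
bacalhau node list                                  # view nodes and their approval state
bacalhau node reject n-1234abcd -m "unknown host"   # reject a pending compute node
bacalhau node approve n-5678efgh                    # approve a pending compute node
```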
There is currently no support for auto-eviction of nodes, but they can be manually removed from the cluster using the node delete
command. Note, if they are manually removed, they are able to manually re-register, so this is most useful when you know the node will not be coming back.
After all of these actions, the node list looks like
Compute nodes will provide information about themselves to the requester nodes on a regular schedule. This information is used to help the requester nodes make decisions about where to schedule workloads.
These updates are broken down into:
Node Information: This is the information about the node itself, such as the hostname, CPU architecture, and any labels associated with the node. This information is persisted to the Node Info Store.
Resource Information: This is the information about the resources available on the node, such as the amount of memory, storage and CPU available. This information is held in memory and used to make scheduling decisions. It is not persisted to disk as it is considered transient.
Health Information: This heartbeat is used to determine if the node is still healthy, and if it is not, the requester node will mark the node as unhealthy. Eventually, the node will be marked as Unknown if it does not recover. This information is held in memory and used to make scheduling decisions. Like the resource information, it is not persisted to disk as it is considered transient.
Various configuration options are available to control the frequency of these updates, and the timeout for the health check. These can be set in the configuration file.
For the compute node, these settings are:
Node Information: InfoUpdateFrequency
- The interval between updates of the node information.
Resource Information: ResourceUpdateFrequency
- The interval between updates of the resource information.
Heartbeat: HeartbeatFrequency
- The interval between heartbeats sent by the compute node.
Heartbeat: HeartbeatTopic
- The name of the pubsub topic that heartbeat messages are sent via.
For the requester node, these settings are:
Heartbeat HeartbeatFrequency
- How often the heartbeat server will check the priority queue of node heartbeats.
Heartbeat HeartbeatTopic
- The name of the pubsub topic that heartbeat messages are sent via. Should be the same as the compute node value.
Node health NodeDisconnectedAfter
- The interval after which the node will be considered disconnected if a heartbeat has not been received.
As compute nodes are added and removed from the cluster, the requester nodes will emit events to the NATS PubSub system. These events can be consumed by other systems to react to changes in the cluster membership.
Securing node-to-node communication with TLS
Secure communication between Bacalhau Compute Nodes and Orchestrators is crucial, especially when operating across untrusted networks. This guide demonstrates how to implement TLS encryption to protect inter-node communication and ensure data security.
Bacalhau Compute Nodes initiate communication with the orchestrator through NATS, a high-performance messaging system. The orchestrator node hosts the NATS server, which compute nodes automatically connect to upon startup.
As a distributed system, Bacalhau supports TLS encryption to secure these communication channels. While this guide demonstrates the implementation using self-signed certificates, the same principles apply when using company-issued or publicly trusted certificates.
In this step, we'll guide you through generating the required certificates, focusing on self-signed certificate creation.
First, we need to generate a self-signed root certificate authority (CA) certificate, which will be used to sign all subsequent certificates. You can use standard tools like openssl
or mkcert
for this process. We recommend setting a long expiration date for the root CA and securely backing up both the certificate and its private key.
This step will produce two essential components: the self-signed root CA certificate and its corresponding private key.
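A minimal sketch using openssl; file names, subject and validity period are placeholders:

```bash
# Generate the root CA private key
openssl genrsa -out bacalhau-ca.key 4096
# Create a long-lived, self-signed root CA certificate from that key
openssl req -x509 -new -nodes -key bacalhau-ca.key -sha256 -days 3650 \
  -subj "/CN=Bacalhau Root CA" -out bacalhau-ca.crt
```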
In this step, we'll generate the certificate that enables TLS connections for the NATS server.
First, identify the DNS name or IP address used to connect to the orchestrator. This is typically found in the compute nodes' configuration under the "Orchestrators" field. For example:
If your config specifies nats://10.0.5.16:4222,
use the IP address 10.0.5.16
If your config specifies nats://my-bacalhau-orchestrator-node:4222
, use the DNS name my-bacalhau-orchestrator-node
Next, generate a server certificate signed by the Root CA (created in step 1). This certificate must include your chosen IP address or DNS name in its Subject Alternative Name field. Additionally, always include the IP address "127.0.0.1" in the Subject Alternative Names to support communications initiated from the orchestrator node itself.
This step will produce two critical files: the server certificate and its corresponding private key. Store both files securely in a protected location.
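A sketch of signing such a server certificate with openssl; substitute your own DNS name or IP address in the Subject Alternative Name list:

```bash
# Key and certificate signing request for the orchestrator
openssl genrsa -out orchestrator.key 4096
openssl req -new -key orchestrator.key \
  -subj "/CN=my-bacalhau-orchestrator-node" -out orchestrator.csr
# Sign the CSR with the root CA from step 1, including the SANs discussed above
openssl x509 -req -in orchestrator.csr \
  -CA bacalhau-ca.crt -CAkey bacalhau-ca.key -CAcreateserial \
  -days 825 -sha256 -out orchestrator.crt \
  -extfile <(printf "subjectAltName=DNS:my-bacalhau-orchestrator-node,IP:10.0.5.16,IP:127.0.0.1")
```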
In this step, we'll configure both orchestrator nodes and compute nodes with the generated certificates.
First, copy the following files to the orchestrator node:
The root certificate from step 1 (certificate file only, not the private key)
The server certificate from step 2
The server's private key from step 2
The orchestrator node should now have three files: the root certificate, server certificate, and server key file. Next, enable TLS support by adding the TLS configuration section to the orchestrator's configuration file. Example:
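A hedged sketch of what that section might look like; the key names under Orchestrator.TLS are assumptions and may differ between versions, so verify them with bacalhau config list:

```yaml
Orchestrator:
  TLS:
    CACert: /etc/bacalhau/tls/bacalhau-ca.crt      # root certificate from step 1
    ServerCert: /etc/bacalhau/tls/orchestrator.crt # server certificate from step 2
    ServerKey: /etc/bacalhau/tls/orchestrator.key  # server private key from step 2
```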
Next, prepare each compute node by copying the root certificate file (excluding the private key) to the node. Then, update each compute node's configuration to trust this certificate authority for secure server connections. Example:
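A sketch for the compute node, using the Compute.TLS keys listed in the configuration reference later in this guide (the file path is a placeholder):

```yaml
Compute:
  TLS:
    CACert: /etc/bacalhau/tls/bacalhau-ca.crt
    RequireTLS: true
```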
After restarting the Bacalhau processes on all nodes, secure TLS communication will be established for all node-to-node interactions.
Bacalhau supports GPU workloads. In this tutorial, learn how to run a job using GPU workloads with the Bacalhau client.
The Bacalhau network must have an executor node with a GPU exposed
Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node
To submit a job request, use the --gpu
flag under the docker run
command to select the number of GPUs your job requires. For example:
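A hedged example; the image is only illustrative and must ship a CUDA runtime compatible with the node:

```bash
bacalhau docker run --gpu 1 nvidia/cuda:12.2.0-base-ubuntu22.04 -- nvidia-smi
```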
The following limitations currently exist within Bacalhau. Bacalhau supports:
NVIDIA, Intel or AMD GPUs only
GPUs for the Docker executor only
Bacalhau authenticates and authorizes users in a multi-step flow.
We know our potential users have many possible requirements around auth and exist across the entire spectrum from "no auth needed because its a simple local deployment" to "enterprise-grade security for publicly accessible nodes". Hence, the auth system needs to be unopinionated about how authentication and authorization gets achieved.
The auth system has therefore been designed with a few goals in mind:
Flexible authentication: it should be easy for users to add their own authentication method, including simple methods like using shared secrets and more complex methods up to OAuth and OIDC.
Flexible authorization: it should be possible for users to be authorized based on a number of different modes, including group-based auth, RBAC and ABAC. The exact permissions of each should be customizable. The system should not require, for example, a particular model of "namespaces" or "workspaces" because these don't necessarily fit all use cases.
Future proofing: the auth system should not require core-level upgrades to support advancements in cryptography. The hash functions and key sizes that are considered "secure" change over time, so the Bacalhau core should not be forced to have an opinion on this by the auth system and should not have to play "whack-a-mole" with supporting different configurations for different customers. Instead, it should be possible for customers to apply a policy that makes sense for them and upgrade security at their own pace.
Performance: any calls to remote servers or complex algorithms to decide logic should happen once in the authentication process, and then subsequent calls to the API should introduce little overhead from authorization.
Auth server is a set of API endpoints that are trusted to make auth decisions. This is something built into the requester node and doesn't need to be a separate service, but could also be implemented as an external service if desired.
User agent is a tool that acts on behalf of the user, running in a trusted way locally to them. The user agent submits API calls to the requester node on their behalf – so the CLI, Web UI and SDK are all user agents. We use the term "user agent" to differentiate from a "client", which in the OAuth sense means a third-party service that the user does not have complete trust in.
Bacalhau implements flexible authentication and authorization using policies which are written using a machine-executable policy format called Rego.
Each authentication policy receives authentication credentials as input and outputs access tokens that will be supplied to future API calls.
Each authorization policy receives access tokens as input and outputs decisions about allowable access to APIs and job submission.
These two policies work together to define the entire authentication and authorization scheme.
The basic list of steps is:
Get the list of acceptable authn methods
Pick one and execute it, collecting any credentials from the user
Submit the credentials to the authn API
Receive an access token and use it in all future requests
User agents make a request to their configured auth server to retrieve a list of authentication methods, keyed by name.
Each authentication method object describes:
a type of authentication, identified by a specific key
parameters to be used in running the authentication method, specific to that type
Each "type" can be used to implement a number of different authentication methods. The types broadly correlate with behavior that the user agent needs to take to run the authentication flow, such that there can be a single piece of user agent code that is capable of running each type, with different input parameters.
The supported types are:
challenge authentication: This method is used to identify users via a private key that they hold. The authentication response contains an InputPhrase that the user should sign and return to the endpoint.
ask authentication: This method requires the user to manually input some information. This method can be used to implement username and password authentication, shared secret authentication, and even 2FA or security question auth.
The required information is represented by a JSON Schema in the object itself. The implementation should parse the JSON Schema and ask the user questions to populate an object that is valid by it.
The user agent decides which authentication method to use (e.g. by asking the user, or by knowing it has an appropriate key) and operates the flow.
Once all the data for the method has been successfully collected, the user agent POSTs the data to the auth endpoint for the method. The endpoint is the base auth endpoint plus the name of the method, e.g. /api/v1/auth/<method>
. So to submit data for a "userpass" method, the user agent would POST to /api/v1/auth/userpass
.
The auth server processes the request by inputting the auth credentials into an auth policy. If the auth policy finds the passed data acceptable, it returns an access token that the user can use in subsequent calls.
(Aside: there is actually no specification on the structure of the access token. The user agent should treat it as an opaque blob that it receives from the auth server and submits to the API server. Currently, all of the core Bacalhau code also does not have any opinion of the auth token – it is not assumed to be any specific type of object, and all parsing and handling is handled by the Rego policies. However, all of the currently implemented Rego policies output and expect JWTs, and it is recommended that users continue to use this convention. The rest of this document will assume access tokens are JWTs.)
The signed JWT is returned to the user agent. The user agent takes appropriate steps to keep the access token secret.
In principle, the auth policy can return any JWT it wishes, which will be interpreted later in the API auth policy – it is up to the authn policy and the authz policy to work together to apply auth. The policy to run is identified by the Node.Auth.Methods
variable, which is a map of method names to policy paths.
However, the default authn and authz policies make decisions using namespaces. Here, the authn policy returns a set of namespaces with associated access permissions, and the authz policy controls access based on them.
In this default case, the JWT includes the fields:
iss (issuer): The node ID of the auth server.
sub (subject): A network-unique user ID, derived from the auth credentials. The sub does not need to identify the same user across different authentication methods, but should ideally be the same if the user logs in via the same auth method again.
ist (issued at): The timestamp when the token was issued.
exp (expires at): The timestamp after which the token is no longer valid.
ns (namespaces): A map of namespaces to permission bits.
The key in the map is a namespace name that the user has some level of access to. Namespace names are ephemeral – i.e. there does not need to be a persistent or coordinated store of namespaces shared across the whole cluster. Instead, the format of namespace names is an interface for the network operator to decide.
For example, the default policy will just give the user access to a namespace identified by the sub field (e.g. their username). But in principle, more complex setups involving groups could be used.
Namespace names can be a *, which by convention will match any set of characters, like a filesystem glob. But it is up to the various auth policies to actually implement this. So a JWT claim containing "*" would give default permissions for all namespaces.
The value in the map is an unsigned integer encoding permission bits. If the following bits are set:
0b00000001: user can describe jobs in the namespace
0b00000010: user can create jobs in the namespace
0b00000100: user can download results from the namespace
0b00001000: user can cancel jobs in the namespace
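Putting the fields and permission bits together, a hypothetical decoded JWT payload might look like this (all values are illustrative):

```json
{
  "iss": "node-id-of-the-auth-server",
  "sub": "alice",
  "ist": 1700000000,
  "exp": 1700086400,
  "ns": {
    "alice": 15,
    "*": 1
  }
}
```

Here 15 (0b1111) grants all four permissions on the user's own namespace, while 1 (0b0001) allows describing jobs in any other namespace.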
The user agent includes an Authorization
header with the access token it wishes to use passed as a bearer token:
Note that the Authorization
header is strictly optional – access for unauthorized users is controlled using the policy, and may be allowed. The API call is allowed to proceed if the authorization policy returns a positive decision.
The requester node executes the API authorization policy and passes details of the API call. The default policy is one where the namespaces of the token are checked if present, and non-namespaced APIs require a valid signed token.
As above, custom policies are allowed. The policy to execute is defined by the Node.Auth.AccessPolicyPath
config variable. For non-namespaced APIs, such as node APIs, the policy may make a blanket decision simply using whether the user has an authorization token or not, or may choose to make a decision depending on the type of authorization. For namespaced APIs, such as job APIs, the policy should examine the namespaces in the JWT token and respond accordingly.
The authz server will return a 403 Forbidden
error if the user is not allowed to carry out the requested action. It will also return a 401 Unauthorized
error if the token the user passed is not valid for any future request. In the latter case, the user agent should discard the token and execute the above flow again to get a new one.
There are a number of roadmap items that will enhance the auth system:
The Web UI currently does not have any authn/z capability, and so can only work with the default Bacalhau configuration which does not limit unauthenticated users from querying read-only API endpoints.
To upgrade the Web UI to work in authenticated cases, it will be necessary to implement the algorithms noted above. In short:
The Web UI will need to query the auth API endpoint for available authn methods.
It should then pick an appropriate authn method, either by asking the user, choosing based on known available data (e.g. existing presence of a private key), or by picking the only available option.
It should then run the authn flow for that type:
For challenge
types, it will need a private key. It should probably generate and store one persistently rather than asking the user to upload theirs.
For ask
types, it will need to parse the input JSON Schema and present a web form to collect the necessary authn credentials.
Once it has successfully authenticated, it should persistently store the access token and add it to all subsequent API requests.
external authentication type: This type will power future OAuth2/OIDC authentication. The principle is that:
The type will specify a remote endpoint to redirect the user to. The CLI will open a browser to this endpoint (or otherwise advise the user to do this) and the Web UI will just issue a redirect to this endpoint.
The user completes authentication at the remote service and is then redirected back to a supplied endpoint with valid credentials.
The CLI may need to run a temporary web server to receive the redirect (this is how CLI tools like gcloud
currently handle the OIDC flow). The Web UI will need to specify a redirect that it can subsequently decode credentials for.
Also specified in the authentication method data will be any query parameters that the CLI/WebUI needs to populate with the redirect path. E.g. the specific OIDC scheme might specify the return location as a ?redirect
url query parameter, and the authentication type should specify the name of this parameter.
There doesn't need to be an optional step where the user exchanges the identity token they received from the remote auth server for a Bacalhau auth token. Instead, the system could just use the returned credential directly.
However, this may be a beneficial step for mapping OIDC credentials into e.g. a JWT that specifies available namespaces. So there should probably be a step where the token received from the OIDC flow is passed to the authn method endpoint, and a policy has the chance to return a different token. In the basic case, it can check the validity of the token and return it unchanged.
The returned credential will be a JWT or similar access token. The user agent should use this credential to query the API as above. The authz policy should be configured to recognize these access tokens and apply authz control based on their content, as for the other methods.
Different jobs may require different amounts of resources to execute. Some jobs may have specific hardware requirements, such as GPU. This page describes how to specify hardware requirements for your job.
Please bear in mind that each executor is implemented independently and these docs might be slightly out of date. Double check the man page for the executor you are using with bacalhau [executor] --help
.
The following table describes how to specify hardware requirements for the Docker executor.
--cpu (default: 500m): job CPU cores (e.g. 500m, 2, 8)
--memory (default: 1Gb): job memory requirement (e.g. 500Mb, 2Gb, 8Gb)
--gpu (default: 0): job GPU requirement (e.g. 1)
When you specify hardware requirements, the job will be offered out to the network to see if there are any nodes that can satisfy the requirements. If there are, the job will be scheduled on the node and the executor will be started.
Bacalhau supports GPU workloads. Learn how to run a job using GPU workloads with the Bacalhau client.
The Bacalhau network must have an executor node with a GPU exposed
Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node
Use the following command to see the amount of available resources:
To submit a request for a job that requires more than the standard set of resources, add the --cpu
and --memory
flags. For example, for a job that requires 2 CPU cores and 4Gb of RAM, use --cpu=2 --memory=4Gb
, e.g.:
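A hedged sketch of such a submission (image and command are placeholders):

```bash
bacalhau docker run --cpu=2 --memory=4Gb ubuntu -- echo "resource-limited job"
```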
To submit a GPU job request, use the --gpu
flag under the docker run
command to select the number of GPUs your job requires. For example:
The following limitations currently exist within Bacalhau.
Maximum CPU and memory limits depend on the participants in the network
For GPU:
NVIDIA, Intel or AMD GPUs only
Only the Docker Executor supports GPUs
This guide provides a comprehensive overview of Bacalhau's label and constraint system, which enables fine-grained control over job scheduling and resource allocation.
Labels in Bacalhau are key-value pairs attached to nodes that describe their characteristics, capabilities, and properties. Constraints are rules you define when submitting jobs to ensure they run on nodes with specific labels.
Labels are defined when starting a Bacalhau node using the -c Labels
flag:
You can also define labels in a YAML configuration file:
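A hedged sketch of the equivalent YAML, assuming a top-level Labels key matching the -c Labels flag above:

```yaml
Labels:
  region: us-east
  env: prod
  gpu: "true"
```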
Then start the node with:
Check node labels using:
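For example (output flags vary by version):

```bash
bacalhau node list
```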
Bacalhau supports various operators for precise node selection:
= (e.g. region=us-east): exact match
!= (e.g. env!=staging): not equal
exists (e.g. gpu): key exists
! (e.g. !temporary): key does not exist
in (e.g. zone in (a,b,c)): value is in the set
gt (e.g. mem-gb gt 32): greater than
lt (e.g. cpu-cores lt 16): less than
Here are common patterns for submitting jobs with constraints:
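Hedged sketches; the --constraints flag name and the image are assumptions, so check bacalhau docker run --help on your version:

```bash
# Run only on nodes labelled region=us-east
bacalhau docker run --constraints "region=us-east" ubuntu -- echo "hello"

# Combine operators: nodes with a gpu label and more than 32 GB of memory
bacalhau docker run --constraints "gpu exists,mem-gb gt 32" ubuntu -- echo "hello"
```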
Follow these patterns for consistent label naming:
Use lowercase alphanumeric characters
Separate words with hyphens
Use descriptive prefixes for categorization
Examples:
Organize labels hierarchically for better management:
If your job fails with no matching nodes:
Check available nodes and their labels:
Verify your constraints aren't too restrictive:
Ensure required nodes are online:
Remember that label changes require node restarts. After updating labels:
Gracefully stop the node
Apply new configuration
Restart the node
Verify labels with bacalhau node list
Effective use of Bacalhau's label and constraint system enables precise control over workload placement and resource utilization. Follow these best practices:
Use consistent naming conventions
Document your label taxonomy
Regularly audit and clean up unused labels
Test constraints before production deployment
Monitor constraint patterns for optimization opportunities
For additional support, consult the Bacalhau documentation or community resources.
Definitions and usage for Bacalhau terminology
A Compute Node in the Bacalhau platform is responsible for executing jobs and producing results. These nodes are part of a private network that allows workload distribution and communication between computers. Compute Nodes handle various types of jobs based on their capabilities and resources. They work in tandem with Requester Nodes, which manage user requests, discover and rank Compute Nodes and monitor job lifecycles.
A CLI (Command Line Interface) in the Bacalhau platform is a tool that allows users to interact with Bacalhau through text-based commands entered into a terminal or command prompt. The CLI provides a set of commands for managing and executing various tasks on the platform, including submitting jobs, monitoring job status, managing nodes and configuring the environment.
A Data Source in Bacalhau refers to the origin of the data used in jobs. This can include various types of storage such as IPFS, S3, local files or URLs. Data sources are specified in the job configuration and are essential for providing the necessary input data for job execution.
Docker in Bacalhau refers to the use of Docker containers to package and run applications. Docker provides a standardized unit of software, enabling users to create and manage containers efficiently. Bacalhau supports running Docker workloads, allowing users to utilize containerized applications seamlessly on the platform.
The InterPlanetary File System (IPFS) is a protocol and peer-to-peer network for storing and sharing data in a distributed file system. In Bacalhau, IPFS is used as a data source and a way to distribute job inputs and outputs, leveraging its decentralized nature for efficient data management.
A Job in the Bacalhau platform is a unit of work that a user submits for execution. Jobs can be simple tasks or complex workflows involving multiple steps. They are defined by specifications that include the job type, resources required and input/output data. Jobs are managed by Requester Nodes, which ensure they are distributed to appropriate Compute Nodes for execution.
Job Results are the output generated after a job has been executed on a Compute Node. These results can include processed data, logs and any other relevant output files. Results are often stored in specified locations such as IPFS or S3, allowing users to retrieve and utilize them after job completion.
A Node in Bacalhau is a fundamental component of the network, responsible for executing and managing jobs. Nodes can be classified into different types based on their roles, such as Compute Nodes and Requester Nodes. Each node operates as part of a decentralized network, allowing distributed processing and resource management.
Node Management in Bacalhau involves configuring and maintaining the nodes within the network, including both Compute Nodes and Requester Nodes. This includes tasks like onboarding new nodes, managing node resources, setting access controls and ensuring nodes meet operational standards for job execution.
In the context of Bacalhau, a Network refers to the interconnected system of nodes that collaborate to execute jobs, manage data and maintain communication. This network is decentralized, meaning it does not rely on a central authority, which enhances its robustness, scalability and efficiency.
The Network Specification in Bacalhau defines the network requirements and settings for job execution. This includes configurations for network access, data transfer protocols and connectivity between nodes. Proper network specification ensures that jobs can communicate effectively and access necessary resources.
Workload Onboarding in Bacalhau is the process of preparing and integrating different types of workloads for execution on the platform. This involves setting up environments for various programming languages, configuring containers and ensuring workloads are optimized for execution across the distributed network of Compute Nodes.
WebAssembly (WASM) in Bacalhau is a binary instruction format for a stack-based virtual machine. WASM is designed for safe and efficient execution, making it a suitable target for compilation from high-level languages. Bacalhau supports running WASM workloads, enabling efficient execution of lightweight and portable code.
A Requester Node in the Bacalhau platform is responsible for handling user requests, discovering and ranking Compute Nodes, forwarding jobs to these nodes and monitoring the lifecycle of the jobs. Requester Nodes play a crucial role in managing the flow of tasks and ensuring they are executed efficiently by the appropriate Compute Nodes in the network.
Amazon Simple Storage Service (S3) is a scalable object storage service. Bacalhau supports S3 as a data source, allowing users to store and retrieve input and output data for jobs. S3's integration with Bacalhau provides robust and reliable storage options for large-scale data processing tasks.
How to run a Bacalhau devstack locally
You can run a stand-alone Bacalhau network on your computer with the following guide.
The devstack command of bacalhau will start a 4 node cluster with 3 compute nodes and 1 requester node.
This is useful for kicking the tires and/or developing on the codebase. It's also the tool used by some tests.
x86_64
or ARM64
architecture
Ubuntu 20.0+ has most often been used for development and testing
Latest Bacalhau release
You can install the Bacalhau CLI by running this command in a terminal:
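The commonly documented installer one-liner is sketched below; check the installation guide for the current command:

```bash
curl -sL https://get.bacalhau.org/install.sh | bash
```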
See the installation guide for more installation options.
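Then start the devstack using the command named above:

```bash
bacalhau devstack
```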
This will start a 4 node Bacalhau cluster.
Once everything has started up - you will see output like the following:
Open an additional terminal window to be used for submitting jobs. Copy and paste environment variables from previous message into this window, e.g.:
You are now ready to submit a job to your local devstack.
This will submit a simple job to a single node:
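A minimal sketch of such a job, matching the output described later in this section:

```bash
bacalhau docker run ubuntu -- echo hello devstack test
```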
This should output something like the following:
This should output info about job execution and results:
Use the bacalhau job get command to download job results:
Results will be downloaded to the current directory. Job results should have the following structure:
If you execute cat stdout
it should read hello devstack test
. If you write any files in your job, they will appear in volumes/output
.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
By default, for security purposes, Bacalhau jobs run with networking turned off. In order for your compute node to accept (and run) networked jobs, you need to enable it on a per compute node basis. To do so, you need to set the following:
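A hedged sketch of that setting, assuming the same JobAdmissionControl key discussed in the networking section (verify the key name for your version):

```yaml
JobAdmissionControl:
  AcceptNetworkedJobs: true
```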
You will then be able to run jobs with the following criteria:
First, you need to describe each node with labels in a key=value format. Later, labels can be used by the job as conditions for choosing the node to run on. For example:
If you want to assign multiple targets, you can do so with key=value,key=value.
The Compute.Orchestrator field in the config tells the Bacalhau compute node which orchestrator to connect to.
You can add the protocol and port, and it will apply this inline. E.g.
Or:
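A hedged sketch using the Compute.Orchestrators key from the configuration reference later in this guide; the addresses are placeholders and the exact key name may differ in your version:

```yaml
Compute:
  Orchestrators:
    - nats://10.0.5.16:4222          # with protocol and port spelled out
    - my-bacalhau-orchestrator-node  # bare hostname, protocol and port applied inline
```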
By default, the WebUI for Bacalhau is disabled for security reasons. To enable the WebUI, run the Bacalhau requester node with the following configuration:
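A sketch using the WebUI.Enabled key described later in this guide:

```yaml
WebUI:
  Enabled: true
```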
You can use the --input
or -i
flag multiple times with multiple different CIDs, URLs or S3 objects, and give each of them a path to be mounted at.
For example, doing bacalhau run cat/main.wasm -i ipfs://CID1:/input1 -i ipfs://CID2:/input2
will result in both the input1
and input2
folders being available to your running WASM with the CID contents. You can use -i
as many times as you need.
Yes! We offer a Bacalhau Docker-in-Docker container. You can set something like this up by running the following container:
These two files are for configuring the Bacalhau node. The first is the orchestrator configuration, and will look something like this:
And the second is just a list of arbitrary key-value pairs for labeling the node. For example:
We recommend building all requirements into your container or WASM before running it. However, if you need to download and install after starting the run, make sure you have the following configuration setting set:
Type the following:
When downloading content to run your code against, it is written to a read-only directory. Unfortunately, by default, SQLite requires the directory to be writable so that it can create utility files during its use.
If you run your command with the immutable
setting set to 1, then it will work. From the sqlite3 command you can use .open 'file:/inputs/database.db?immutable=1'
where you should replace "database.db" with your downloaded database filename.
How to write the config.yaml file to configure your nodes
On installation, Bacalhau creates a .bacalhau
directory that includes a config.yaml
file tailored for your specific settings. This configuration file is the central repository for custom settings for your Bacalhau nodes.
When initializing a Bacalhau node, the system determines its configuration by following a specific hierarchy. First, it checks the default settings, then the config.yaml
file, followed by environment variables, and finally, any command line flags specified during execution. Configurations are set and overridden in that sequence. This layered approach allows the default Bacalhau settings to provide a baseline, while environment variables and command-line flags offer added flexibility. However, the config.yaml
file offers a reliable way to predefine all necessary settings before node creation across environments, ensuring consistency and ease of management.
Modifications to the config.yaml
file are not dynamically applied to existing nodes. A restart of the Bacalhau node is required for any changes to take effect.
Your config.yaml
file starts off empty. However, you can see all available settings using the following command
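That command is bacalhau config list, which is described in more detail below:

```bash
bacalhau config list
```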
This command showcases over a hundred configuration parameters related to users, security, metrics, updates, and node configuration, providing a comprehensive overview of the customization options available for your Bacalhau setup.
Let’s go through the different options and how your configuration file is structured.
The bacalhau config list
command displays your configuration paths, segmented with periods to indicate each part you are configuring.
Consider these configuration settings: NameProvider
and Labels
. These settings help set name and labels for your Bacalhau node.
In your config.yaml
, these settings will be formatted like this:
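A hedged sketch; the values shown are placeholders:

```yaml
NameProvider: hostname
Labels:
  region: us-east
  env: prod
```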
Here are your Bacalhau configuration options in alphabetical order:
Efficient job management and resource optimization are significant considerations. In our continued effort to support scalable distributed computing and data processing, we are excited to introduce job queuing in Bacalhau v1.4.0
.
The Job Queuing feature was only added to Bacalhau in version 1.4 and is not supported in previous versions. Consider upgrading to the latest version to optimize resource usage with Job Queuing.
Job Queuing deals with the situation when there are no suitable nodes available on the network to execute a job. In this case, a user-defined period of time can be configured for the job, during which the job will wait for suitable nodes to become available or free up in the network. This feature enables better flexibility and reliability in managing your distributed workloads.
The job queuing feature is not automatically enabled; it needs to be explicitly set in your job specification or requester node configuration using the QueueTimeout parameter. This parameter activates the queuing feature and defines the amount of time your job should wait for available nodes in the network.
Node availability in your network is determined by capacity as well as job constraints such as label selectors, engines or publishers. For example, jobs will be queued if all nodes are currently busy, as well as if idle nodes do not match parameters in your job specification.
Bacalhau compute nodes regularly update their node, resource and health information every 30 seconds to the requester nodes in the network. During this update period, multiple jobs may be allocated to a node, oversubscribing and potentially exceeding its immediate available capacity. A local job queue is created at the compute node, efficiently handling the high demand as resources become available over time.
At the requester node level, you can set default queuing behavior for all jobs by defining the QueueTimeout
parameter in the node's configuration file. Alternatively, within the job specification, you can include the QueueTimeout
parameter directly in the configuration YAML. This flexibility allows you to tailor the queuing behavior to meet the specific needs of your distributed computing environment, ensuring that jobs are efficiently managed and resources are optimally utilized.
Here’s an example requester node configuration that sets the default job queuing time for an hour
The QueueBackoff
parameter determines the interval between retry attempts by the requester node to assign queued jobs.
Here’s a sample job specification setting the QueueTimeout
for this specific job, overwriting any node defaults.
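A hedged sketch of such a specification; the overall layout follows the Bacalhau job YAML format, but the exact placement and units of the Timeouts fields may differ between versions:

```yaml
Name: queued-job-example
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Parameters: ["echo", "hello"]
    Timeouts:
      QueueTimeout: 1800   # seconds to wait for a suitable node
      TotalTimeout: 3600   # must be at least as large as QueueTimeout (see below)
```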
You can also define timeouts for your jobs directly through the CLI using the --queue-timeout
flag. This method provides a convenient way to specify queuing behavior on a per-job basis, allowing you to manage job execution dynamically without modifying configuration files.
For example, here is how you can submit a job with a specified queue timeout using the CLI:
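For instance (the value is assumed to be in seconds, mirroring the --timeout flag):

```bash
bacalhau docker run --queue-timeout 1800 ubuntu -- echo hello
```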
Timeouts in Bacalhau are generally governed by the TotalTimeout
value for your yaml specifications and the --timeout
flag for your CLI commands. The default total timeout value is 30 minutes. Declaring any queue timeout that is larger than that without changing the total timeout value will result in a validation error.
Jobs will be queued when all available nodes are busy and when there is no node that matches your job specifications. Let’s take a look at how queuing will be executed within your network.
Queued Jobs will initially display the Queued
status. Using the bacalhau job describe
command will showcase both the state of the job and the reason behind queuing.
For busy nodes:
For no matching nodes in the network:
Once appropriate node resources become available, these jobs will transition to either a Running
or Completed
status, allowing more jobs to be assigned to matching nodes.
As Bacalhau continues to evolve, our commitment to making distributed computing and data processing more accessible and efficient remains strong. We want to hear what you think about this feature so that we can make Bacalhau better and meet all the diverse needs and requirements of you, our users.
For questions, feedback, please reach out in our Slack.
To view the configuration that bacalhau will receive when a command is executed against it, users can run the bacalhau config list command. Users who wish to see Bacalhau's config represented as YAML may run bacalhau config list --output=yaml.
In Bacalhau v1.5.0, there have been changes to how Bacalhau handles configuration:
The bacalhau repo ~/.bacalhau is no longer the default location for the Bacalhau config file.
Bacalhau searches for a default config file. The location is OS-dependent:
Linux: ~/.config/bacalhau/config.yaml
OSX: ~/.config/Application\ Support/bacalhau/config.yaml
Windows: $AppData\bacalhau\config.yaml. Usually, this is something like C:\Users\username\AppData\Roaming\bacalhau\config.yaml
As described above, bacalhau still has the concept of a default config file, which, for the sake of simplicity, we’ll say lives in ~/.config/bacalhau/config.yaml
. There are two ways this file can be modified:
A text editor, e.g. vim ~/.config/bacalhau/config.yaml.
The bacalhau config set command.
The --config
(or -c
) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources. To specify a config file to bacalhau, users may use the --config
flag, passing a path to a config file for bacalhau to use. When this flag is provided bacalhau will not search for a default config, and will instead use the configuration provided to it by the --config
flag.
In Bacalhau, configuration keys are structured identifiers used to configure and customize the behavior of the application. They represent specific settings that control various aspects of Bacalhau's functionality, such as network parameters, API endpoints, node operations, and user interface options. The configuration file is organized in a tree-like structure using nested mappings (dictionaries) in YAML format. Each level of indentation represents a deeper level in the hierarchy.
Example: part of the config file
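A hedged reconstruction based on the keys discussed below; the values are placeholders:

```yaml
API:
  Host: 0.0.0.0
  Port: 1234
NameProvider: puuid
DataDir: /home/username/.bacalhau
Orchestrator:
  Host: 0.0.0.0
  Port: 4222
  NodeManager:
    DisconnectTimeout: 1m
```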
In this YAML configuration file:
Top-Level Keys (Categories): API, Orchestrator
Sub-Level Keys (Subcategories): under API we have Host and Port; under Orchestrator we have Host, Port and NodeManager
Leaf Nodes (Settings): Host, Port, NameProvider, DataDir, DisconnectTimeout — these contain the actual configuration values.
Config keys use dot notation to represent the path from the root of the configuration hierarchy down to a specific leaf node. Each segment in the key corresponds to a level in the hierarchy. Syntax is Category.Subcategory(s)...LeafNode
config set, config list and --config
The bacalhau config list command returns all keys and their corresponding values. The bacalhau config set command accepts a key and a value to set it to. The --config flag accepts a key and a value that will be applied to Bacalhau when it runs.
How to Modify the API Host Using bacalhau config set in the Default Config File:
Run bacalhau config list to find the appropriate key
Run the bacalhau config set command
Observe how bacalhau config list reflects the new setting
Observe that the change has been reflected in the default config file
How to Modify the API Host Using bacalhau config set With a Custom Config File:
Run the config set command with the --config flag
Observe the created config file
Observe that the default config and the output of bacalhau config list do not reflect this change
How to Start Bacalhau With a Custom Config File
The --config Flag
The --config (or -c) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources.
or using the short form:
YAML Config Files: Specify paths to YAML configuration files. Example:
Key-Value Pairs: Set specific configuration values using dot notation. Example:
Boolean Flags: Enable boolean options by specifying the key alone. Example:
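Hedged sketches of each form; paths and values are placeholders:

```bash
# YAML config files
bacalhau serve --config /etc/bacalhau/config.yaml

# Key-value pairs in dot notation
bacalhau serve -c API.Host=0.0.0.0 -c API.Port=1234

# Boolean flag: the key alone enables the option
bacalhau serve -c WebUI.Enabled
```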
When multiple configuration options are provided, they are applied in the following order of precedence (highest to lowest):
Command-line key-value pairs and boolean flags
YAML configuration files
Default values
Within each category, options specified later override earlier ones.
Using a single config file:
Merging multiple config files:
Overriding specific values:
Combining file and multiple overrides:
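Hedged sketches of the four scenarios above; the file names are placeholders and the final command matches the explanation that follows:

```bash
# Using a single config file
bacalhau serve --config config.yaml

# Merging multiple config files (later files override earlier ones)
bacalhau serve -c base.yaml -c override.yaml

# Overriding specific values
bacalhau serve -c API.Port=9999

# Combining a file and multiple overrides
bacalhau serve -c config.yaml -c WebUI.Enabled -c API.Host=192.168.1.5
```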
In the last example, WebUI.Enabled will be set to true, API.Host will be 192.168.1.5, and other values will be loaded from config.yaml if present.
Remember, later options override earlier ones, allowing for flexible configuration management.
The bacalhau completion Command
The bacalhau completion command will generate shell completion for your shell. You can use the command like:
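For example, for bash (other shells such as zsh and fish follow the same pattern):

```bash
# Load completions in the current shell session
source <(bacalhau completion bash)

# Or install them permanently (path varies by distribution)
bacalhau completion bash > /etc/bash_completion.d/bacalhau
```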
After running the above command, commands like bacalhau config set and bacalhau --config will have auto-completion for all possible configuration values along with their descriptions.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
The message above contains the environment variables you need for a new terminal window. You can paste these into a new terminal so that bacalhau
will use your local devstack. Execute the command to see the devstack cluster structure:
Use command to view results:
Yes! You can run programs using WebAssembly instead. See the for information on how to do that.
If your job writes to stdout, or stderr, while it is running, you can also view the output with the command.
Yes. Given a valid job ID
, you can use the to cancel the job, and stop it from running.
API.Auth.AccessPolicyPath
AccessPolicyPath is the path to a file or directory that will be loaded as the policy to apply to all inbound API requests. If unspecified, a policy that permits access to all API endpoints to both authenticated and unauthenticated users (the default as of v1.2.0) will be used.
API.Auth.Methods
Methods maps "method names" to authenticator implementations. A method name is a human-readable string chosen by the person configuring the system; it is shown to users to help them pick the authentication method they want to use. There can be multiple uses of the same authenticator type, each with different configuration and parameters and each identified by a unique method name. For example, an implementation that wants to allow users to log in with GitHub or Bitbucket might use an authenticator of type "oidc" for both, with each appearing once in this map under the method name "github" or "bitbucket" respectively. By default, only a single authentication method that accepts authentication via client keys will be enabled.
API.Host
Host specifies the hostname or IP address on which the API server listens or the client connects.
API.Port
Port specifies the port number on which the API server listens or the client connects.
API.TLS.AutoCert
AutoCert specifies the domain for automatic certificate generation.
API.TLS.AutoCertCachePath
AutoCertCachePath specifies the directory to cache auto-generated certificates.
API.TLS.CAFile
CAFile specifies the path to the Certificate Authority file.
API.TLS.CertFile
CertFile specifies the path to the TLS certificate file.
API.TLS.Insecure
Insecure allows insecure TLS connections (e.g., self-signed certificates).
API.TLS.KeyFile
KeyFile specifies the path to the TLS private key file.
API.TLS.SelfSigned
SelfSigned indicates whether to use a self-signed certificate.
API.TLS.UseTLS
UseTLS indicates whether to use TLS for client connections.
Compute.AllocatedCapacity.CPU
CPU specifies the amount of CPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string (e.g., "100m").
Compute.AllocatedCapacity.Disk
Disk specifies the amount of Disk space a compute node allocates for running jobs. It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string (e.g., "10Gi").
Compute.AllocatedCapacity.GPU
GPU specifies the amount of GPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string (e.g., "1"). Note: When using percentages, the result is always rounded up to the nearest whole GPU.
Compute.AllocatedCapacity.Memory
Memory specifies the amount of Memory a compute node allocates for running jobs. It can be expressed as a percentage (e.g., "85%") or a Kubernetes resource string (e.g., "1Gi").
Compute.AllowListedLocalPaths
AllowListedLocalPaths specifies a list of local file system paths that the compute node is allowed to access.
Compute.Auth.Token
Token specifies the key for compute nodes to be able to access the orchestrator.
Compute.Enabled
Enabled indicates whether the compute node is active and available for job execution.
Compute.Heartbeat.InfoUpdateInterval
InfoUpdateInterval specifies the time between updates of non-resource information to the orchestrator.
Compute.Heartbeat.Interval
Interval specifies the time between heartbeat signals sent to the orchestrator.
Compute.Heartbeat.ResourceUpdateInterval
ResourceUpdateInterval specifies the time between updates of resource information to the orchestrator.
Compute.Orchestrators
Orchestrators specifies a list of orchestrator endpoints that this compute node connects to.
Compute.TLS.CACert
CACert specifies the CA file path that the compute node trusts when connecting to orchestrator.
Compute.TLS.RequireTLS
RequireTLS specifies if the compute node enforces encrypted communication with orchestrator.
DataDir
DataDir specifies a location on disk where the bacalhau node will maintain state.
DisableAnalytics
DisableAnalytics, when true, disables sharing anonymous analytics data with the Bacalhau development team.
Engines.Disabled
Disabled specifies a list of engines that are disabled.
Engines.Types.Docker.ManifestCache.Refresh
Refresh specifies the refresh interval for cache entries.
Engines.Types.Docker.ManifestCache.Size
Size specifies the size of the Docker manifest cache.
Engines.Types.Docker.ManifestCache.TTL
TTL specifies the time-to-live duration for cache entries.
InputSources.Disabled
Disabled specifies a list of storages that are disabled.
InputSources.MaxRetryCount
MaxRetryCount specifies the maximum number of attempts for reading from a storage.
InputSources.ReadTimeout
ReadTimeout specifies the maximum time allowed for reading from a storage.
InputSources.Types.IPFS.Endpoint
Endpoint specifies the multi-address to connect to for IPFS, e.g. /ip4/127.0.0.1/tcp/5001.
JobAdmissionControl.AcceptNetworkedJobs
AcceptNetworkedJobs indicates whether to accept jobs that require network access.
JobAdmissionControl.Locality
Locality specifies the locality of the job input data.
JobAdmissionControl.ProbeExec
ProbeExec specifies the command to execute for probing job submission.
JobAdmissionControl.ProbeHTTP
ProbeHTTP specifies the HTTP endpoint for probing job submission.
JobAdmissionControl.RejectStatelessJobs
RejectStatelessJobs indicates whether to reject stateless jobs, i.e. jobs without inputs.
JobDefaults.Batch.Priority
Priority specifies the default priority allocated to a batch or ops job. This value is used when the job hasn't explicitly set its priority requirement.
JobDefaults.Batch.Task.Publisher.Params
Params specifies the publisher configuration data.
JobDefaults.Batch.Task.Publisher.Type
Type specifies the publisher type, e.g. "s3", "local", "ipfs", etc.
JobDefaults.Batch.Task.Resources.CPU
CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when the task hasn't explicitly set its CPU requirement.
JobDefaults.Batch.Task.Resources.Disk
Disk specifies the default amount of disk space allocated to a task. It uses Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is used when the task hasn't explicitly set its disk space requirement.
JobDefaults.Batch.Task.Resources.GPU
GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes resource string format (e.g., "1" for 1 GPU). This value is used when the task hasn't explicitly set its GPU requirement.
JobDefaults.Batch.Task.Resources.Memory
Memory specifies the default amount of memory allocated to a task. It uses Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value is used when the task hasn't explicitly set its memory requirement.
JobDefaults.Batch.Task.Timeouts.ExecutionTimeout
ExecutionTimeout is the maximum time allowed for task execution.
JobDefaults.Batch.Task.Timeouts.TotalTimeout
TotalTimeout is the maximum total time allowed for a task.
JobDefaults.Daemon.Priority
Priority specifies the default priority allocated to a service or daemon job. This value is used when the job hasn't explicitly set its priority requirement.
JobDefaults.Daemon.Task.Resources.CPU
CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when the task hasn't explicitly set its CPU requirement.
JobDefaults.Daemon.Task.Resources.Disk
Disk specifies the default amount of disk space allocated to a task. It uses Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is used when the task hasn't explicitly set its disk space requirement.
JobDefaults.Daemon.Task.Resources.GPU
GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes resource string format (e.g., "1" for 1 GPU). This value is used when the task hasn't explicitly set its GPU requirement.
JobDefaults.Daemon.Task.Resources.Memory
Memory specifies the default amount of memory allocated to a task. It uses Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value is used when the task hasn't explicitly set its memory requirement.
JobDefaults.Ops.Priority
Priority specifies the default priority allocated to a batch or ops job. This value is used when the job hasn't explicitly set its priority requirement.
JobDefaults.Ops.Task.Publisher.Params
Params specifies the publisher configuration data.
JobDefaults.Ops.Task.Publisher.Type
Type specifies the publisher type, e.g. "s3", "local", "ipfs", etc.
JobDefaults.Ops.Task.Resources.CPU
CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when the task hasn't explicitly set its CPU requirement.
JobDefaults.Ops.Task.Resources.Disk
Disk specifies the default amount of disk space allocated to a task. It uses Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is used when the task hasn't explicitly set its disk space requirement.
JobDefaults.Ops.Task.Resources.GPU
GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes resource string format (e.g., "1" for 1 GPU). This value is used when the task hasn't explicitly set its GPU requirement.
JobDefaults.Ops.Task.Resources.Memory
Memory specifies the default amount of memory allocated to a task. It uses Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value is used when the task hasn't explicitly set its memory requirement.
JobDefaults.Ops.Task.Timeouts.ExecutionTimeout
ExecutionTimeout is the maximum time allowed for task execution.
JobDefaults.Ops.Task.Timeouts.TotalTimeout
TotalTimeout is the maximum total time allowed for a task.
JobDefaults.Service.Priority
Priority specifies the default priority allocated to a service or daemon job. This value is used when the job hasn't explicitly set its priority requirement.
JobDefaults.Service.Task.Resources.CPU
CPU specifies the default amount of CPU allocated to a task. It uses Kubernetes resource string format (e.g., "100m" for 0.1 CPU cores). This value is used when the task hasn't explicitly set its CPU requirement.
JobDefaults.Service.Task.Resources.Disk
Disk specifies the default amount of disk space allocated to a task. It uses Kubernetes resource string format (e.g., "1Gi" for 1 gibibyte). This value is used when the task hasn't explicitly set its disk space requirement.
JobDefaults.Service.Task.Resources.GPU
GPU specifies the default number of GPUs allocated to a task. It uses Kubernetes resource string format (e.g., "1" for 1 GPU). This value is used when the task hasn't explicitly set its GPU requirement.
JobDefaults.Service.Task.Resources.Memory
Memory specifies the default amount of memory allocated to a task. It uses Kubernetes resource string format (e.g., "256Mi" for 256 mebibytes). This value is used when the task hasn't explicitly set its memory requirement.
Labels
Labels are key-value pairs used to describe and categorize the nodes.
Logging.Level
Level sets the logging level. One of: trace, debug, info, warn, error, fatal, panic.
Logging.LogDebugInfoInterval
LogDebugInfoInterval specifies the interval for logging debug information.
Logging.Mode
Mode specifies the logging mode. One of: default, json.
NameProvider
NameProvider specifies the method used to generate names for the node. One of: hostname, aws, gcp, uuid, puuid.
Orchestrator.Advertise
Advertise specifies the URL to advertise to other servers.
Orchestrator.Auth.Token
Token specifies the key for compute nodes to be able to access the orchestrator.
Orchestrator.Cluster.Advertise
Advertise specifies the address to advertise to other cluster members.
Orchestrator.Cluster.Host
Host specifies the hostname or IP address for cluster communication.
Orchestrator.Cluster.Name
Name specifies the unique identifier for this orchestrator cluster.
Orchestrator.Cluster.Peers
Peers is a list of other cluster members to connect to on startup.
Orchestrator.Cluster.Port
Port specifies the port number for cluster communication.
Orchestrator.Enabled
Enabled indicates whether the orchestrator node is active and available for job submission.
Orchestrator.EvaluationBroker.MaxRetryCount
MaxRetryCount specifies the maximum number of times an evaluation can be retried before being marked as failed.
Orchestrator.EvaluationBroker.VisibilityTimeout
VisibilityTimeout specifies how long an evaluation can be claimed before it's returned to the queue.
Orchestrator.Host
Host specifies the hostname or IP address on which the Orchestrator server listens for compute node connections.
Orchestrator.NodeManager.DisconnectTimeout
DisconnectTimeout specifies how long to wait before considering a node disconnected.
Orchestrator.NodeManager.ManualApproval
ManualApproval, if true, requires manual approval for new compute nodes joining the cluster.
Orchestrator.Port
Port specifies the port number on which the Orchestrator server listens for compute node connections.
Orchestrator.Scheduler.HousekeepingInterval
HousekeepingInterval specifies how often to run housekeeping tasks.
Orchestrator.Scheduler.HousekeepingTimeout
HousekeepingTimeout specifies the maximum time allowed for a single housekeeping run.
Orchestrator.Scheduler.QueueBackoff
QueueBackoff specifies the time to wait before retrying a failed job.
Orchestrator.Scheduler.WorkerCount
WorkerCount specifies the number of concurrent workers for job scheduling.
Orchestrator.SupportReverseProxy
SupportReverseProxy configures the orchestrator node to run behind a reverse proxy.
Orchestrator.TLS.CACert
CACert specifies the CA file path that the orchestrator node trusts when connecting to NATS server.
Orchestrator.TLS.ServerCert
ServerCert specifies the certificate file path given to NATS server to serve TLS connections.
Orchestrator.TLS.ServerKey
ServerKey specifies the private key file path given to NATS server to serve TLS connections.
Orchestrator.TLS.ServerTimeout
ServerTimeout specifies the TLS timeout, in seconds, set on the NATS server.
Publishers.Disabled
Disabled specifies a list of publishers that are disabled.
Publishers.Types.IPFS.Endpoint
Endpoint specifies the multi-address to connect to for IPFS, e.g. /ip4/127.0.0.1/tcp/5001.
Publishers.Types.Local.Address
Address specifies the endpoint the publisher serves on.
Publishers.Types.Local.Port
Port specifies the port the publisher serves on.
Publishers.Types.S3.PreSignedURLDisabled
PreSignedURLDisabled specifies whether pre-signed URLs are enabled for the S3 provider.
Publishers.Types.S3.PreSignedURLExpiration
PreSignedURLExpiration specifies the duration before a pre-signed URL expires.
ResultDownloaders.Disabled
Disabled is a list of downloaders that are disabled.
ResultDownloaders.Timeout
Timeout specifies the maximum time allowed for a download operation.
ResultDownloaders.Types.IPFS.Endpoint
Endpoint specifies the multi-address to connect to for IPFS, e.g. /ip4/127.0.0.1/tcp/5001.
StrictVersionMatch
StrictVersionMatch indicates whether to enforce strict version matching.
UpdateConfig.Interval
Interval specifies the time between update checks. When set to 0, update checks are not performed.
WebUI.Backend
Backend specifies the address and port of the backend API server. If empty, the Web UI will use the same address and port as the API server.
WebUI.Enabled
Enabled indicates whether the Web UI is enabled.
WebUI.Listen
Listen specifies the address and port on which the Web UI listens.
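To tie several of these keys together, here is a minimal, hedged sketch of a config.yaml for a node acting as both orchestrator and compute node; every value is an illustrative placeholder rather than a recommendation:

```bash
# Write an illustrative config.yaml (values are placeholders)
cat <<'EOF' > config.yaml
NameProvider: hostname
DataDir: /var/lib/bacalhau
Labels:
  region: eu-west-1
API:
  Host: 0.0.0.0
  Port: 1234
Orchestrator:
  Enabled: true
  Port: 4222
Compute:
  Enabled: true
  Orchestrators:
    - nats://127.0.0.1:4222
  AllocatedCapacity:
    CPU: 85%
    Memory: 85%
    Disk: 85%
WebUI:
  Enabled: true
  Listen: 0.0.0.0:8438
EOF

# Start Bacalhau with it
bacalhau serve --config config.yaml
```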