1 of 10

Running Nodes

Node Onboarding

Introduction

This tutorial describes how to add new nodes to an existing private network. Two basic scenarios will be covered:

Adding a machine as a new node.
Adding a as a new node.

Pre-Prerequisites

You should have an established private network consisting of at least one requester node.
You should have a new host (physical/virtual machine, cloud instance or docker container) with installed.

Add Host/Virtual Machine as a New Node

Let's assume that you already have a private network with at least one requester node. You will need to:

Set the token in the Compute.Auth.Token configuration key
Set the orchestrators IP address in the Compute.Orchestrators configuration key
Execute bacalhau serve specifying the node type via --orchestrator flag

Add a Cloud Instance as a New Node

To automate the process using Terraform follow these steps:

Determine the IP address of your requester node
Write a terraform script, which does the following:
1. Adds a new instance
2. Installs bacalhau on it
3. Launches a compute node
Execute the script

Support

GPU Installation

How to enable GPU support on your Bacalhau node

Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.

Prerequisites

Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.

In general for GPUs from any vendor, the Bacalhau client requires:

Nvidia

NVIDIA GPU Drivers
NVIDIA Container Toolkit (nvidia-docker2)
Verify installation by Running a Sample Workload
nvidia-smi installed and functional

AMD

AMD GPU drivers
rocm-smi tool installed and functional

See the Running ROCm Docker containers for guidance on how to run Docker workloads on AMD GPU.

Intel

Intel GPU drivers
xpu-smi tool installed and functional

See the Running on GPU under docker for guidance on how to run Docker workloads on Intel GPU.

GPU Node Configuration

Access to GPUs can be controlled using resource limits. To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.

Job selection policy

When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.

Job selection probes

If you want more control over making the decision to take on jobs, you can use the JobAdmissionControl.ProbeExec and JobAdmissionControl.ProbeHTTP configuration keys.

These are external programs that are passed the following data structure so that they can make a decision about whether to take on a job:

{
  "node_id": "XXX",
  "job_id": "XXX",
  "spec": {
    "engine": "docker",
    "verifier": "ipfs",
    "job_spec_vm": {
      "image": "ubuntu:latest",
      "entrypoint": ["cat", "/file.txt"]
    },
    "inputs": [{
      "engine": "ipfs",
      "cid": "XXX",
      "path": "/file.txt"
    }]
  }
}

The exec probe is a script to run that will be given the job data on stdin, and must exit with status code 0 if the job should be run.

The http probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns a non-positive status code (e.g. >= 400).

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "shouldBid": {
      "description": "If the job should be accepted",
      "type": "boolean"
    },
    "shouldWait": {
      "description": "If the node should wait for an async response that will come later. `shouldBid` will be ignored",
      "type": "boolean",
      "default": false,
    },
    "reason": {
      "description": "Human-readable string explaining why the job should be accepted or rejected, or why the wait is required",
      "type": "string"
    }
  },
  "required": [
    "shouldBid",
    "reason"
  ]
}

For example, the following response will reject the job:

{
  "shouldBid": false,
  "reason": "The job did not pass this specific validation: ...",
}

If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.

Access Management

How to configure authentication and authorization on your Bacalhau node.

Access Management

Bacalhau includes a flexible auth system that supports multiple methods of auth that are appropriate for different deployment environments.

By default

With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.

In anonymous mode, Bacalhau will allow:

Users identified by a self-generated private key to submit any job and cancel their own jobs.
Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.

Restricting anonymous access

Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.

Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authz/policies/policy_ns_anon.rego -o ~/.bacalhau/no-anon.rego
bacalhau config set API.Auth.AccessPolicyPath ~/.bacalhau/no-anon.rego

Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.

Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/challenge/challenge_ns_no_anon.rego -o ~/.bacalhau/challenge_ns_no_anon.rego
bacalhau config set API.Auth.Methods '\{Method: ClientKey, Policy: \{Type: challenge, PolicyPath: ~/.bacalhau/challenge_ns_no_anon.rego\}\}'

Then, modify the allowed_clients variable in challange_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau agent node.

bacalhau agent node | jq -rc .ClientID

Once the node is restarted, only keys in the allowed list will be able to access any API.

Username and password access

Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/ask/ask_ns_password.rego -o ~/.bacalhau/ask_ns_password.rego
bacalhau config set API.Auth.Methods '\{Method: Password, Policy: \{Type: ask, PolicyPath: ~/.bacalhau/ask_ns_password.rego\}\}'

Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt. To generate a salted password, the helper script in pkg/authn/ask/gen_password can be used:

cd pkg/authn/ask/gen_password && go run .

This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego.

Writing custom policies

In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.

Policies are written in a language called Rego, also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.

Custom authentication policies

Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input variable. The exact information depends on the type of authentication used.

`challenge` authentication

challenge authentication uses identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.

Policies used for challenge authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

They should expect as fields on the input variable:

clientId: an ID derived from the user's private key that identifies them uniquely
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be this policy that returns the same opaque token for all users:

package bacalhau.authn

token := "anything"

A more realistic example that returns a signed JWT is in challenge_ns_anon.rego.

`ask` authentication

ask authentication uses credentials supplied manually by the user as identification. For example, an ask policy could require a username and password as input and check these against a known list. ask policies do all the verification of the supplied credentials.

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.
bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.

They should expect as fields on the input variable:

ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:

package bacalhau.authn

schema := {}
token := "anything"

A more realistic example that returns a signed JWT is in ask_ns_example.rego.

Custom authorization policies

Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.

Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthorized users.

Policies will need to implement these rules:

bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.
bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.

They should expect as fields on the input variable for both rules:

http: details of the user's HTTP request:
- host: the hostname used in the HTTP request
- method: the HTTP method (e.g. GET, POST)
- path: the path requested, as an array of path components without slashes
- query: a map of URL query parameters to their values
- headers: a map of HTTP header names to arrays representing their values
- body: a blob of any content submitted as the body
constraints: details about the receiving node that should be used to validate any supplied tokens:
- cert: keys that the input token should have been signed with
- iss: the name of a node that this node will recognize as the issuer of any signed tokens
- aud: the name of this node that is receiving the request

Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method which will validate the access token as a JWT against the given constraints.

The simplest possible authz policy might be this one that allows all users to access all endpoints:

package bacalhau.authz

allow := true
token_valid := true

A more realistic example (which is the Bacalhau "anonymous mode" default) is in policy_ns_anon.rego.

Persistent State

Requester node database (job store)

Requester nodes store job state and history in a boltdb-backed store (pkg/jobstore/boltdb).

The location of the database file can be specified using the BACALHAU_JOB_STORE_PATH environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-requester.db.

Compute node database (execution store)

By default, compute nodes store their execution information in an bolddb-backed store (pkg/compute/store/boltdb).

The location of the database file (for a single node) can be specified using the BACALHAU_COMPUTE_STORE_PATH environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-compute.db.

Compute node restarts

As compute nodes restart, they will find they have existing state in the boltdb database. At startup the database currently iterates the executions to calculate the counters for each state. This will be a good opportunity to do some compaction of the records in the database, and cleanup items no longer in use.

Currently only batch jobs are possible, and so for each of the listed states below, no action is taken at restart. In future it would make sense to remove records older than a certain age, or moved them to failed, depending on their current state. For other job types (to be implemented) this may require restarting jobs, resetting jobs,

State

Batch jobs

ExecutionStateCreated

No action

ExecutionStateBidAccepted

No action

ExecutionStateRunning

No action

ExecutionStateWaitingVerification

No action

ExecutionStateResultAccepted

No action

ExecutionStatePublishing

No action

ExecutionStateCompleted

No action

ExecutionStateFailed

No action

ExecutionStateCancelled

No action

Inspecting the databases

The databases can be inspected using the bbolt tool. The bbolt tool can be installed to $GOBIN with:

go install go.etcd.io/bbolt/cmd/bbolt

Once installed, and assuming the database file is stored in $FILE you can use bbolt to:

Check the database integrity

$ bbolt check $FILE
OK

List all buckets

$ bbolt buckets $FILE
execution
execution-history
execution-index

Compact the database (by copying it)

$ bolt compact -o DESTINATION_FILE $FILE
262144 -> 262144 bytes (gain=1.00x)

Get some DB stats

$ bbolt stats $FILE
Aggregate statistics for 3 buckets

Page count statistics
        Number of logical branch pages: 0
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 3
        Number of physical leaf overflow pages: 0
Tree statistics
        Number of keys/value pairs: 29
        Number of levels in B+tree: 2
Page size utilization
        Bytes allocated for physical branch pages: 0
        Bytes actually used for branch data: 0 (0%)
        Bytes allocated for physical leaf pages: 49152
        Bytes actually used for leaf data: 8991 (18%)
Bucket statistics
        Total number of buckets: 11
        Total number on inlined buckets: 8 (72%)
        Bytes used for inlined buckets: 2743 (30%)

List keys in a bucket

$ bbolt keys $FILE execution
e-42bdab56-bb27-42ca-a4c2-62aeb0f76342
e-9a937556-ac83-49ec-b794-a94c3aa068b2
e-d81034b8-c7c6-422d-af3f-87d680862a58
e-fb54458c-5bad-496e-a1cb-4fc091d882a1

View a single key

bbolt get $FILE execution e-9a937556-ac83-49ec-b794-a94c3aa068b2
{"ID":"e-9a937556-ac83-49ec-b794-a94c3aa068b2","Job":{"APIVersion":"V1beta2","Metadata": .... more JSON}

Configuring Your Input Sources

Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources are storage resources consumed as inputs to jobs. And Publishers are storage resources created with the results of jobs.

Sources

Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the for more details.

To use the S3 source, you will have to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChechsumSHA256.

Below is an example of how to define an S3 input source in YAML format:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
  - Target: "/data"

To start, you'll need to connect the Bacalhau node to an IPFS server so that you can run jobs that consume CIDs as inputs. You can either and run it locally, or you can connect to a remote IPFS server.

In both cases, you should have an for the IPFS server that should look something like this:

export IPFS_CONNECT=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC

The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.

You can then configure your Bacalhau node to use this IPFS server by adding the address to the InputSources.Types.IPFS.Endpoint configuration key:

bacalhau config set InputSources.Types.IPFS.Endpoint=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC

See the for more details.

Below is an example of how to define an IPFS input source in YAML format:

InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
  - Target: "/data"

To use a local data source, you will have to:

Enable the use of local data when configuring the node itself by using the Compute.AllowListedLocalPaths configuration key, specifying the file path and access mode. For example

bacalhau config set Compute.AllowListedLocalPaths="/etc/config:rw,/etc/*.conf:ro".

In the job description specify parameters SourcePath - the absolute path on the compute node where your data is located and ReadWrite - the access mode.

Below is an example of how to define a Local input source in YAML format:

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

To use a URL data source, you will have to specify only URL parameter, as in the part of the declarative job description below:

InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    Target: "/data"

Publishers

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish job results to any S3-compatible storage service. To use an S3 publisher you will have to specify required parameters Bucket and Key and optional parameters Region, Endpoint, VersionID, ChecksumSHA256. See the for more details.

Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

The IPFS publisher works using the same setup as - you'll need to have an IPFS server running and a multiaddress for it. Then you'll configure that multiaddress using the InputSources.Types.IPFS.Endpoint configuration key. Then you can use bacalhau job get <job-ID> with no further arguments to download the results.

To use the IPFS publisher you will have to specify CID which can be used to access the published content. See the for more details.

And part of the declarative job description with an IPFS publisher will look like this:

Publisher:
  Type: ipfs
PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"

The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.

Another possibility to store the results of a job execution is on a compute node. In such case the results will be published to the local compute node, and stored as compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. To use the Local publisher you will have to specify the only URL parameter with a HTTP URL to the location where you would like to save the result. See the for more details.

Here is an example of part of the declarative job description with a local publisher:

Publisher:
    Type: local
PublishedResult:
  Type: local
  Params:
    URL: "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"

Configuring Transport Level Security

How to configure TLS for the requester node APIs

By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure it to use Transport Level Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.

Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API

Getting a certificate from Let's Encrypt with ACME

Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Let's Encrypt Certificate Authority.

Using the --autocert [hostname] parameter to the CLI (in the serve and devstack commands), a certificate is obtained automatically from Lets Encrypt. The provided hostname should be a comma-separated list of hostnames, but they should all be publicly resolvable as Lets Encrypt will attempt to connect to the server to verify ownership (using the ACME HTTP-01 challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards they are then cached in the bacalhau repository.

Alternatively, you may set these options via the environment variable, BACALHAU_AUTO_TLS. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.AutoCert instead.

As a result of the Lets Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.

sudo setcap CAP_NET_BIND_SERVICE+ep $(which bacalhau)

A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache, and this will be used to manage renewals to avoid rate limits.

Getting a certificate from a Certificate Authority

Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:

Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.
Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.
Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.
Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.
Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.
Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.

Once you have obtained your certificates, you will need to put two files in a location that bacalhau can read them. You need the server certificate, often called something like server.cert or server.cert.pem, and the server key which is often called something like server.key or server.key.pem.

Once you have these two files available, you must start bacalhau serve which two new flags. These are tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

Self-signed certificates

If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here although there is a helpful script in the Bacalhau github repository which should provide a good starting point.

Once you have generated the necessary files, the steps are much like above, you must start bacalhau serve which two new flags. These are tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:

Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities.
Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.
Instruct the software library you are using not to verify HTTPS requests.

Limits and Timeouts

Resource Limits

These are the configuration keys that control the capacity of the Bacalhau node, and the limits for jobs that might be run.

Configuration key

Description

Compute.AllocatedCapacity.CPU

Specifies the amount of CPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string

Compute.AllocatedCapacity.Disk

Specifies the amount of Disk space a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 10Gi)

Compute.AllocatedCapacity.GPU

Specifies the amount of GPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1).

Note: When using percentages, the result is always rounded up to the nearest whole GPU

Compute.AllocatedCapacity.Memory

Specifies the amount of Memory a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1Gi)

It is also possible to additionally specify the number of resources to be allocated to each job by default, if the required number of resources is not specified in the job itself. JobDefaults.<>.Task.Resources.<Resource Type> configuration keys are used for this purpose. E.g. to provide each job with 2Gb of RAM the following key is used: JobDefaults.Ops.Task.Resources.Memory:

bacalhau config set JobDefaults.Ops.Task.Resources.Memory=2Gi

See the complete configuration keys list for more details.

Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.

Windows Support

Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like resource limits) are not present in Windows-based nodes.

Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and configured appropriately to support Linux containers, e.g. using the WSL-based backend.

Timeouts

Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled, and no results will be published.

By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.

Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that result in their jobs waiting indefinitely.

Configuring Execution Time Limits

Job submitters can pass the --timeout flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.

The timeout can also be added to an existing job spec by adding the Timeout property to the Spec.

Node operators can use configuration keys to specify default and maximum job execution time limits. The supplied values should be a numeric value followed by a time unit (one of s for seconds, m for minutes or h for hours).

Here is a list of the relevant properties:

JobDefaults.Batch.Task.Timeouts.ExecutionTimeout

Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined

JobDefaults.Ops.Task.Timeouts.ExecutionTimeout

Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined

JobDefaults.Batch.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on

JobDefaults.Ops.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on

Note, that timeouts can not be configured for Daemon and Service jobs.

Bacalhau WebUI

How to run the WebUI.

Overview

The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up and utilizing the WebUI.

For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.

WebUI Setup

Prerequisites

Ensure you have a Bacalhau v1.5.0 or later installed.

Configuration

To enable the WebUI, use the WebUI.Enabled configuration key:

bacalhau config set webui.enabled=true

By default, WebUI uses host=0.0.0.0 and port=8438. This can be configured via WebUI.Listen configuration key:

bacalhau config set webui.listen=<ip-address>:<port>

Accessing the WebUI

Once started, the WebUI is accessible at the specified address, localhost:8438 by default.

WebUI Features

Jobs

The updated WebUI allows you to view a list of jobs, including job status, run time, type, and a message in case the job failed.

Clicking on the id of a job in the list opens the job details page, where you can see the history of events related to the job, the list of nodes on which the job was executed and the real-time logs of the job.

Nodes

On the Nodes page you can see a list of nodes connected to your network, including node type, membership and connection statuses, amount of resources - total and currently available, and a list of labels of the node.

Clicking on the node id opens the node details page, where you can see the status and settings of the node, the number of running and scheduled jobs.

Access Management

How to configure authentication and authorization on your Bacalhau node.

Access Management

Bacalhau includes a flexible auth system that supports multiple methods of auth that are appropriate for different deployment environments.

By default

In anonymous mode, Bacalhau will allow:

Users identified by a self-generated private key to submit any job and cancel their own jobs.
Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.

Restricting anonymous access

Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.

Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authz/policies/policy_ns_anon.rego -o ~/.bacalhau/no-anon.rego
bacalhau config set API.Auth.AccessPolicyPath ~/.bacalhau/no-anon.rego

Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/challenge/challenge_ns_no_anon.rego -o ~/.bacalhau/challenge_ns_no_anon.rego
bacalhau config set API.Auth.Methods '\{Method: ClientKey, Policy: \{Type: challenge, PolicyPath: ~/.bacalhau/challenge_ns_no_anon.rego\}\}'

Then, modify the allowed_clients variable in challange_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau agent node.

bacalhau agent node | jq -rc .ClientID

Once the node is restarted, only keys in the allowed list will be able to access any API.

Username and password access

Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.

curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/ask/ask_ns_password.rego -o ~/.bacalhau/ask_ns_password.rego
bacalhau config set API.Auth.Methods '\{Method: Password, Policy: \{Type: ask, PolicyPath: ~/.bacalhau/ask_ns_password.rego\}\}'

cd pkg/authn/ask/gen_password && go run .

This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego.

Writing custom policies

In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.

Policies are written in a language called Rego, also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.

Custom authentication policies

`challenge` authentication

challenge authentication uses identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

They should expect as fields on the input variable:

clientId: an ID derived from the user's private key that identifies them uniquely
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be this policy that returns the same opaque token for all users:

package bacalhau.authn

token := "anything"

A more realistic example that returns a signed JWT is in challenge_ns_anon.rego.

`ask` authentication

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.
bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.

They should expect as fields on the input variable:

ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:

package bacalhau.authn

schema := {}
token := "anything"

A more realistic example that returns a signed JWT is in ask_ns_example.rego.

Custom authorization policies

Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.

Policies will need to implement these rules:

bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.
bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.

They should expect as fields on the input variable for both rules:

http: details of the user's HTTP request:
- host: the hostname used in the HTTP request
- method: the HTTP method (e.g. GET, POST)
- path: the path requested, as an array of path components without slashes
- query: a map of URL query parameters to their values
- headers: a map of HTTP header names to arrays representing their values
- body: a blob of any content submitted as the body
constraints: details about the receiving node that should be used to validate any supplied tokens:
- cert: keys that the input token should have been signed with
- iss: the name of a node that this node will recognize as the issuer of any signed tokens
- aud: the name of this node that is receiving the request

Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method which will validate the access token as a JWT against the given constraints.

The simplest possible authz policy might be this one that allows all users to access all endpoints:

package bacalhau.authz

allow := true
token_valid := true

A more realistic example (which is the Bacalhau "anonymous mode" default) is in policy_ns_anon.rego.

Running Nodes

Node Onboarding

Introduction

Pre-Prerequisites

Add Host/Virtual Machine as a New Node

Add a Cloud Instance as a New Node

Support

GPU Installation

Prerequisites

Nvidia

AMD

Intel

GPU Node Configuration

Job selection policy

Job selection probes

Access Management

Access Management

By default

Restricting anonymous access

Username and password access

Writing custom policies

Custom authentication policies

challenge authentication

ask authentication

Custom authorization policies

Persistent State

Requester node database (job store)

Compute node database (execution store)

Compute node restarts

Inspecting the databases

Check the database integrity

List all buckets

Compact the database (by copying it)

Get some DB stats

List keys in a bucket

View a single key

Configuring Your Input Sources

Sources

Publishers

Configuring Transport Level Security

Getting a certificate from Let's Encrypt with ACME

Getting a certificate from a Certificate Authority

Self-signed certificates

Limits and Timeouts

Resource Limits

Windows Support

Timeouts

Configuring Execution Time Limits

Bacalhau WebUI

Overview

WebUI Setup

Prerequisites

Configuration

Accessing the WebUI

WebUI Features

Jobs

Nodes

Persistent State

Requester node database (job store)

Compute node database (execution store)

Compute node restarts

Inspecting the databases

Check the database integrity

List all buckets

Compact the database (by copying it)

Get some DB stats

List keys in a bucket

View a single key

Configuring Your Input Sources

Sources

Publishers

Node Onboarding

Introduction

Pre-Prerequisites

Add Host/Virtual Machine as a New Node

Add a Cloud Instance as a New Node

Support

Job selection policy

Job selection probes

GPU Installation

`challenge` authentication

`ask` authentication

`challenge` authentication

`ask` authentication