
v1.5.x

Getting Started

Container Onboarding

Setting Up

Running Nodes

Workload Onboarding

This directory contains examples relating to performing common tasks with Bacalhau.

Container

Python

R (language)

Data Ingestion

Networking Instructions

Marketplace Deployments

Guides

Examples

Data Engineering

This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.

Model Inference

Model Training

Molecular Dynamics

References

Jobs Guide

Engines

Publishers

Sources
Installation

Install the Bacalhau CLI

In this tutorial, you'll learn how to install and run a job with the Bacalhau client using the Bacalhau CLI or Docker.

Step 1 - Install the Bacalhau Client

The Bacalhau client is a command-line interface (CLI) that allows you to submit jobs to the Bacalhau network. The client is available for Linux, macOS, and Windows. You can also run the Bacalhau client in a Docker container.

Step 1.1 - Install the Bacalhau CLI

Step 1.2 - Verify the Installation

To verify installation and check the version of the client and server, use the version command. To run a Bacalhau client command with Docker, prefix it with docker run ghcr.io/bacalhau-project/bacalhau:latest.

bacalhau version
docker run -it ghcr.io/bacalhau-project/bacalhau:latest version

If you're wondering which server is being used, the Bacalhau Project has a demo network that's shared with the community. This network allows you to familiarize yourself with Bacalhau's capabilities and launch jobs from your computer without maintaining a compute cluster of your own.

Step 2 - Submit a Hello World job

bacalhau docker run [flags] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
bacalhau docker run \
--config api.host=bootstrap.production.bacalhau.org \
alpine echo helloWorld

Let's take a look at the results of the command execution in the terminal:

Job successfully submitted. Job ID: j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 15:32:50.323              Submission       Job submitted 
 15:32:50.332  e-6e4f2db9  Scheduling       Requested execution on n-f1c579e2 
 15:32:50.410  e-6e4f2db9  Execution        Running 
 15:32:50.986  e-6e4f2db9  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-de72aeff-0f18-4f70-a07c-1366a0edcb64

To get more details about the run executions, execute:
	bacalhau job executions j-de72aeff-0f18-4f70-a07c-1366a0edcb64

After the above command is run, the job is submitted to the selected network, which processes the job and Bacalhau prints out the related job id:

Job successfully submitted. Job ID: j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Checking job status...

The job_id above is shown in its full form. For convenience, you can use the shortened version, in this case: j-de72aeff.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
                docker run \
                --id-only \
                --wait \
                ubuntu:latest -- \
                sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Let's take a look at the results of the command execution in the terminal:

14:02:25.992 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
19b105c9-4cb5-43bd-a12f-d715d738addd

Step 3 - Checking the State of your Jobs

Now that the job is deployed, we can use the CLI to interact with the network. The job was sent to the public demo network, where it was processed, and we can call the following commands. The job_id will differ for every submission.

Step 3.1 - Job information:

bacalhau job describe j-de72aeff

Let's take a look at the results of the command execution in the terminal:

ID            = j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Name          = j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-10-07 13:32:50
Modified Time = 2024-10-07 13:32:50
Version       = 0

Summary
Completed = 1

Job History
 TIME                 TOPIC         EVENT         
 2024-10-07 15:32:50  Submission    Job submitted 
 2024-10-07 15:32:50  State Update  Running       
 2024-10-07 15:32:50  State Update  Completed     

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED    MODIFIED   COMMENT 
 e-6e4f2db9  n-f1c579e2  Completed  Stopped  6     4m18s ago  4m17s ago          

Execution e-6e4f2db9 History
 TIME                 TOPIC       EVENT                             
 2024-10-07 15:32:50  Scheduling  Requested execution on n-f1c579e2 
 2024-10-07 15:32:50  Execution   Running                           
 2024-10-07 15:32:50  Execution   Completed successfully            

Standard Output
helloWorld

This outputs all information about the job, including stdout, stderr, where the job was scheduled, and so on.

Step 3.2 - Job download:

bacalhau job get j-de72aeff
Fetching results of job 'j-de72aeff'...
Results for job 'j-de72aeff' have been written to...
/home/username/.bacalhau/job-j-de72aeff

After the download has finished, you should see the following contents in the results directory.

job-j-de72aeff
├── exitCode
├── outputs
├── stderr
└── stdout

Step 4 - Viewing your Job Output

cat j-de72aeff/stdout

That should print out the string helloWorld.

helloWorld

Step 5 - Where to go next?

Here are a few resources that provide a deeper dive into running jobs with Bacalhau:

  • How Bacalhau Works

  • Create your Private Network

  • Examples & Use Cases

Support

By default, you will submit to the Bacalhau public network, but the same CLI can be configured to submit to a private Bacalhau network. For more information, please read Running Bacalhau on a Private Network.

You can install or update the Bacalhau CLI by running the commands in a terminal. You may need sudo mode or root password to install the local Bacalhau binary to /usr/local/bin:

curl -sL https://get.bacalhau.org/install.sh | bash

Windows users can download the latest release tarball from GitHub and extract bacalhau.exe to any location available in the PATH environment variable.

docker image rm -f ghcr.io/bacalhau-project/bacalhau:latest # Remove old image if it exists
docker pull ghcr.io/bacalhau-project/bacalhau:latest

To run a specific version of Bacalhau using Docker, use the command docker run -it ghcr.io/bacalhau-project/bacalhau:v1.0.3, where v1.0.3 is the version you want to run; note that the latest tag will not re-download the image if you have an older version. For more information on running the Docker image, check out the Bacalhau docker image example.

To submit a job in Bacalhau, we will use the bacalhau docker run command. The command runs a job using the Docker executor on the node. Let's take a quick look at its syntax:

To run the job, you will need to connect to a public demo network or set up your own private network. In the following example, we will use the public demo network by using the --config flag.

We will use the bacalhau docker run command to submit a Hello World job that runs an echo program within an Alpine container.

While this command is designed to resemble Docker's run command, which you may be familiar with, Bacalhau introduces a whole new set of flags to support its computing model.

You can find out more information about your job by using bacalhau job describe.

You can download your job results directly by using bacalhau job get.

Depending on the selected publisher, this may result in:

While executing this command, you may encounter warnings regarding receive and send buffer sizes: failed to sufficiently increase receive buffer size. These warnings can arise due to limitations in the UDP buffer used by Bacalhau to process tasks. Additional information can be found at https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes.

With that, you have just successfully run a job on Bacalhau!


If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).


Hardware Setup

Different jobs may require different amounts of resources to execute. Some jobs may have specific hardware requirements, such as GPU. This page describes how to specify hardware requirements for your job.

Please bear in mind that each executor is implemented independently and these docs might be slightly out of date. Double check the man page for the executor you are using with bacalhau [executor] --help.

Docker Executor

The following table describes how to specify hardware requirements for the Docker executor.

 Flag      Default  Description
 --cpu     500m     Job CPU cores (e.g. 500m, 2, 8)
 --memory  1Gb      Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb)
 --gpu     0        Job GPU requirement (e.g. 1)

How it Works

When you specify hardware requirements, the job will be offered out to the network to see if there are any nodes that can satisfy the requirements. If there are, the job will be scheduled on the node and the executor will be started.
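For example, a job that needs 2 CPU cores, 4Gb of memory and one GPU could be requested as follows (the image and command are placeholders borrowed from the GPU example below):

bacalhau docker run --cpu=2 --memory=4Gb --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi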

GPU Setup

Bacalhau supports GPU workloads. Learn how to run a job using GPU workloads with the Bacalhau client.

Prerequisites

  • The Bacalhau network must have an executor node with a GPU exposed

  • Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node

Usage

Use the following command to see the amount of available resources:

bacalhau node list --show=capacity

To submit a request for a job that requires more than the standard set of resources, add the --cpu and --memory flags. For example, for a job that requires 2 CPU cores and 4Gb of RAM, use --cpu=2 --memory=4Gb, e.g.:

bacalhau docker run ubuntu echo Hello World --cpu=2 --memory=4Gb

To submit a GPU job request, use the --gpu flag under the docker run command to select the number of GPUs your job requires. For example:

bacalhau docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Limitations

The following limitations currently exist within Bacalhau.

  1. Maximum CPU and memory limits depend on the participants in the network

  2. For GPU:

    1. NVIDIA, Intel or AMD GPUs only

    2. Only the Docker Executor supports GPUs

Welcome

Welcome to the Bacalhau documentation!

In Bacalhau v1.5.0, a couple of things changed.

  • There is now a built-in WebUI.

What is Bacalhau?

Why Bacalhau?

Bacalhau simplifies the process of managing compute jobs by providing a unified platform for managing jobs across different regions, clouds, and edge devices.

How it works

Bacalhau consists of a network of nodes that enables orchestration between every compute resource, whether it is a cloud VM, an on-premises server, or an edge device. The network consists of two types of nodes:

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

Data ingestion

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. Here are some options that can help you mount your data:

The options are not limited to the ones mentioned above. You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data.

Security in Bacalhau

All workloads run under restricted Docker or WASM permissions on the node. Additionally, you can use existing (locked down) binaries that are pre-installed through Pluggable Executors.

Finally, endpoints (such as vaults) can also be used to provide secure access to Bacalhau. This way, the client can authenticate with Bacalhau using the token without exposing their credentials.

Use Cases

Bacalhau can be used for a variety of data processing workloads, including machine learning, data analytics, and scientific computing. It is well-suited for workloads that require processing large amounts of data in a distributed and parallelized manner.

Here are some example tutorials on how you can process your data with Bacalhau:

Community

Bacalhau has a very friendly community and we are always happy to help you get started:

Next Steps

Create Network

In this tutorial, you will set up your own private network.

Introduction

Bacalhau allows you to create your own private network so you can securely run private workloads without the risks inherent in working on public nodes or inadvertently distributing data outside your organization.

This tutorial describes the process of creating your own private network from multiple nodes, configuring the nodes and running demo jobs.

TLDR

  1. Create and apply auth token

  2. Configure the auth token and orchestrators list on the other hosts

  3. Copy and paste the environment variables it outputs under the "To connect to this node from the client, run the following commands in your shell" line to a client machine

  4. Done! You can run an example, like:

bacalhau docker run alpine echo hello

Prerequisites

  1. Prepare the hosts on which the nodes are going to be set up. They could be:

    1. Physical Hosts

    2. Local Hypervisor VMs

  2. Ensure that all nodes are connected to the same network and that the necessary ports are open for communication between them.

    1. Ensure your nodes have an internet connection in case you have to download or upload any data (docker images, input data, results)

Start Initial Requestor Node

The Bacalhau network consists of nodes of two types: compute and requester. Compute Node is responsible for executing jobs and producing results. Requester Node is responsible for handling user requests, forwarding jobs to compute nodes and monitoring the job lifecycle.

The first step is to start up the initial Requester node. This node will connect to nothing but will listen for connections.

Start by creating a secure token. This token will be used for authentication between the orchestrator and compute nodes during their communications. Any string can be used as a token, preferably one that is not easy to guess or brute-force. In addition, new authentication methods will be introduced in future releases.

Create and Set Up a Token

Let's use the uuidgen tool to create our token, then add it to the Bacalhau configuration and run the requester node:

# Create token and write it into the 'my_token' file
uuidgen > my_token

# Add the token to the Bacalhau configuration
bacalhau config set orchestrator.auth.token=$(cat my_token)
# Start the Requester node
bacalhau serve --orchestrator

This will produce output similar to this, indicating that the node is up and running:

17:27:42.273 | INF cmd/cli/serve/serve.go:102 > Config loaded from: [/home/username/.bacalhau/config.yaml], and with data-dir /home/username/.bacalhau
17:27:42.322 | INF cmd/cli/serve/serve.go:228 > Starting bacalhau...
17:27:42.405 | WRN pkg/nats/logger.go:49 > Filestore [KV_node_v1] Stream state too short (0 bytes) [Server:n-0f29f45c-c894-4f8f-8a0a-8f2f1f64d96d]
17:27:42.479 | INF cmd/cli/serve/serve.go:300 > bacalhau node running [address:0.0.0.0:1234] [compute_enabled:false] [name:n-0f29f45c-c894-4f8f-8a0a-8f2f1f64d96d] [orchestrator_address:0.0.0.0:4222] [orchestrator_enabled:true] [webui_enabled:true]

To connect to this node from the local client, run the following commands in your shell:
export BACALHAU_API_HOST=127.0.0.1
export BACALHAU_API_PORT=1234

17:27:42.479 | INF webui/webui.go:65 > Starting UI server [listen:0.0.0.0:8438]
A copy of these variables have been written to: /home/username/.bacalhau/bacalhau.run

Note that for security reasons, the output of the command contains the localhost 127.0.0.1 address instead of your real IP. To connect to this node, you should replace it with your real public IP address yourself. The method for obtaining your public IP address may vary depending on the type of instance you're using. Windows and Linux instances can be queried for their public IP using the following command:

curl https://api.ipify.org

Create and Connect Compute Node

Now let's move to another host from the prerequisites, start a compute node on it and connect it to the requester node. Here you will also need to add the same token to the configuration as on the requester.

# Add the token to the Bacalhau configuration
bacalhau config set compute.auth.token=$(cat my_token)

Then execute the serve command to connect to the requester node:

bacalhau serve --compute --orchestrators=<Public-IP-of-Requester-Node>

This will produce output similar to this, indicating that the node is up and running:

# formatting has been adjusted for better readability
16:23:33.386 | INF cmd/cli/serve/serve.go:256 > bacalhau node running 
[address:0.0.0.0:1235] 
[capacity:"{CPU: 1.40, Memory: 2.9 GB, Disk: 13 GB, GPU: 0}"]
[compute_enabled:true] [engines:["docker","wasm"]]
[name:n-7a510a5b-86de-41db-846f-8c6a24b67482] [orchestrator_enabled:false]
[orchestrators:["127.0.0.1","0.0.0.0"]] [publishers:["local","noop"]]
[storages:["urldownload","inline"]] [webui_enabled:false]

To ensure that the nodes are connected to the network, run the following command, specifying the public IP of the requester node:

bacalhau --api-host <Public-IP-of-Requester-Node> node list

This will produce output similar to this, indicating that the nodes belong to the same network:

bacalhau --api-host 10.0.2.15 node list
 ID          TYPE       STATUS    LABELS                                              CPU     MEMORY      DISK         GPU  
 n-7a510a5b  Compute              Architecture=amd64 Operating-System=linux           0.8 /   1.5 GB /    12.3 GB /    0 /  
                                  git-lfs=true                                        0.8     1.5 GB      12.3 GB      0    
 n-b2ab8483  Requester  APPROVED  Architecture=amd64 Operating-System=linux                                                 

Submitting Jobs

To connect to the requester node, find the following lines in the requester node logs:

To connect to this node from the local client, run the following commands in your shell:
export BACALHAU_API_HOST=<Public-IP-of-the-Requester-Node>
export BACALHAU_API_PORT=1234

The exact list of commands will be different for each node and is output by the bacalhau serve command.

Note that by default these commands contain 127.0.0.1 or 0.0.0.0 instead of the actual public IP. Make sure to replace it before executing them.

Now you can submit your jobs using the bacalhau docker run, bacalhau wasm run and bacalhau job run commands. For example, submit a hello-world job:

bacalhau docker run alpine echo hello
Job successfully submitted. Job ID: j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 16:34:16.467              Submission       Job submitted 
 16:34:16.484  e-1e9dca31  Scheduling       Requested execution on n-d41eeae7 
 16:34:16.550  e-1e9dca31  Execution        Running 
 16:34:17.506  e-1e9dca31  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff

To get more details about the run executions, execute:
	bacalhau job executions j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff

You will be able to see the job execution logs on the compute node:

16:34:16.571 | INF pkg/executor/docker/executor.go:119 > starting execution [NodeID:n-d41eeae7] [execution:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [executionID:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [job:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff] [jobID:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff]

...

16:34:17.496 | INF pkg/executor/docker/executor.go:221 > received results from execution [executionID:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5]
16:34:17.505 | INF pkg/compute/executor.go:196 > cleaning up execution [NodeID:n-d41eeae7] [execution:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [job:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff]

Publishers and Sources Configuration

By default, only the local publisher and the URL and local sources are available on the compute node. The following describes how to configure the appropriate sources and publishers:

Your chosen publisher can be set for your Bacalhau compute nodes declaratively using the configuration YAML file or imperatively within the job execution commands:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

Or within your imperative job execution commands:

bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu …
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
    Target: "/data"
Publisher:
  Type: ipfs

Or within your imperative job execution commands:

bacalhau docker run --publisher ipfs ubuntu ...
InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"

Or imperative format:

bacalhau docker run --input QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ:/data ...

Bacalhau allows you to publish job results directly to the compute node. Please note that this method is not a reliable storage option and is recommended mainly for introductory purposes.

Publisher:
  Type: local

Or within your imperative job execution commands:

bacalhau docker run --publisher local ubuntu ...

To allow jobs to access local files when starting a node, allow-list the relevant paths with the Compute.AllowListedLocalPaths configuration key, specifying the path to the data and the access mode:

bacalhau config set Compute.AllowListedLocalPaths=/etc/config:rw,/etc/*.conf:ro

Further, the path to the local data must be specified in the job, in either declarative or imperative form. Declarative example of the local input source:

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

Imperative example of the local input source:

bacalhau docker run --input file:///etc/config:/config ubuntu ...

Bacalhau Configuration Keys Overview

  1. JobAdmissionControl.AcceptNetworkedJobs: Allows the node to accept jobs that require network access.

bacalhau config set JobAdmissionControl.AcceptNetworkedJobs=true

  2. Labels: Describes the node with labels in a key=value format. Labels can later be used by a job as conditions for choosing the node to run on. For example:

bacalhau config set Labels=NodeType=WebServer

  3. Compute.Orchestrators: Specifies the list of orchestrators to connect to. Applies to compute nodes.

bacalhau config set Compute.Orchestrators=127.0.0.1

  4. DataDir: Specifies the path to the directory where the Bacalhau node will maintain its state. The default value is /home/username/.bacalhau. Helpful when the repo should be initialized in a non-default path, or when more than one node should be started on a single machine.

bacalhau config set DataDir=/path/to/new/directory

  5. WebUI.Enabled: Enables the WebUI, providing up-to-date information about the jobs and nodes on your network.

bacalhau config set WebUI.Enabled=true

Best Practices for Production Use Cases

Your private cluster can be quickly set up for testing packaged jobs and tweaking data processing pipelines. However, when using a private cluster in production, here are a few considerations to note.

  1. Ensure separation of concerns in your cloud deployments by mounting the Bacalhau repository on a separate non-boot disk. This prevents instability on shutdown or restarts and improves performance within your host instances.
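For example, assuming the extra disk is mounted at /data/bacalhau (a placeholder path), you could point the node's state directory at it with the DataDir key described above:

bacalhau config set DataDir=/data/bacalhau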

Docker Workloads

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

Requirements

Here are a few things to note before getting started:

  1. Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.

  2. Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.

  3. Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.
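As a rough sketch, the three flag forms look like this (the CID, URL and bucket name are placeholders, and each source is typically mounted at the default /inputs path):

bacalhau docker run --input ipfs://<CID> ubuntu ls /inputs
bacalhau docker run --input https://example.com/data.csv ubuntu ls /inputs
bacalhau docker run --input s3://my-bucket/logs-2023-04* ubuntu ls /inputs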

Note: Only about a third of the examples have their containers published. If you can't find one, feel free to contact the team.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

  1. Limited Ingress/Egress Networking:

  2. Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:
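A minimal sketch of such a job might look like this (the bucket, image and command are placeholders):

bacalhau docker run \
  -i src=s3://mybucket/logs-2023-04*,dst=/input \
  -o apples:/output_folder \
  ubuntu -- sh -c 'ls /input && cp -r /input/. /output_folder/'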

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.

Step 3 - Build and Push Your Image To a Registry

Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.

For example:
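A sketch of building and pushing an amd64 image; myuser/myimage is a placeholder repository name:

docker buildx build --platform linux/amd64 --tag myuser/myimage:v1.0 --push .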

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

Let's see what each command will be used for:

Exports the current working directory of the host system to the LOCAL_INPUT_DIR variable. This variable will be used for binding a volume and transferring data into the container.

Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.

Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.

Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.

For example:
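A sketch of those commands, assuming the image built in the previous step (myuser/myimage:v1.0 is a placeholder):

export LOCAL_INPUT_DIR=$PWD
export LOCAL_OUTPUT_DIR=$PWD
CMD=(sh -c 'ls /inputs; echo "Test output" > /outputs/stdout')
docker run --rm \
  -v "$LOCAL_INPUT_DIR:/inputs" \
  -v "$LOCAL_OUTPUT_DIR:/outputs" \
  myuser/myimage:v1.0 "${CMD[@]}"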

The result of the commands' execution (in this case, the stdout file written by the container) will appear in your local output directory.

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can choose to mount your data anywhere on your machine, and Bacalhau will be able to run against that data.

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command.
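For example (the CID and image name are placeholders):

bacalhau docker run --input ipfs://<CID> myuser/myimage:v1.0 -- sh -c 'ls /inputs'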

To check the status of your job, run the following command.
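For example, to list your recent jobs:

bacalhau job list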

To get more information on your job, you can run the following command.
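For example, using the job ID returned when the job was submitted:

bacalhau job describe <job-id>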

To download your job, run.
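For example:

bacalhau job get <job-id>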

To put this all together: submit the job with bacalhau docker run, note the job ID that is returned, and then describe and download the results using the commands above.

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:
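A sketch with a placeholder URL and image name:

bacalhau docker run --input https://example.com/dataset.csv myuser/myimage:v1.0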

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

This can often be resolved by re-tagging your docker image

Support

Connect Storage

Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources are storage resources consumed as inputs to jobs, and Publishers are storage resources created with the results of jobs.

Sources

Publishers

How Bacalhau Works

In this tutorial we will go over the components and the architecture of Bacalhau. You will learn how it is built, what components are used, and how you can interact with and use Bacalhau.

Chapter 1 - Architecture

Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes, which can communicate with each other.

The requester and compute nodes together form a peer-to-peer network and use gossiping to discover each other and share information about node capabilities, available resources and health status.

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

To interact with the Bacalhau network, users can use the Bacalhau CLI (command-line interface) to send requests to a requester node in the network. These requests are sent using the JSON format over HTTP, a widely-used protocol for transmitting data over the internet. Bacalhau's architecture involves two main sections which are the core components and interfaces.

Components overview

Core Components

The core components are responsible for handling requests and connecting different nodes. The network includes two different components:

Requester node

In the Bacalhau network, the requester node is responsible for handling requests from clients using JSON over HTTP. This node serves as the main custodian of jobs that are submitted to it. When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and coordinates the job execution.

Compute node

In the Bacalhau network, it is the compute node that is responsible for determining whether it can execute a job or not. This model allows for a more decentralized approach to job orchestration as the network will function properly even if the requester nodes have a stale view of the network, or if concurrent requesters are allocating jobs to the same compute nodes. Once the compute node has run the job and produced results, it will publish the results to a remote destination as specified in the job specification (e.g. S3), and notify the requester of the job completion. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Interfaces

The interfaces handle the distribution, execution, storage and publishing of jobs. In the following all the different components are described and their respective protocols are shown.

Transport

The transport interface is responsible for sending messages about jobs that are created, accepted, and executed to other compute nodes. It also manages the identity of individual Bacalhau nodes to ensure that messages are only delivered to authorized nodes, which improves network security. To achieve this, the transport interface uses a protocol, which is a point-to-point scheduling protocol that runs securely and is used to distribute job messages efficiently to other nodes on the network. This is our upgrade to previous handlers as it ensures that messages are delivered to the right nodes without causing network congestion, thereby making communication between nodes more scalable and efficient.

Executor

The executor is a critical component of the Bacalhau network that handles the execution of jobs and ensures that the storage used by the job is local. One of its main responsibilities is to present the input and output storage volumes into the job when it is run. The executor performs two primary functions: presenting the storage volumes in a format that is suitable for the executor and running the job. When the job is completed, the executor will merge the stdout, stderr and named output volumes into a results folder that is then published to a remote location. Overall, the executor plays a crucial role in the Bacalhau network by ensuring that jobs are executed properly, and their results are published accurately.

Storage Provider

In a peer-to-peer network like Bacalhau, storage providers play a crucial role in presenting an upstream storage source. There can be different storage providers available in the network, each with its own way of manifesting the CID (Content IDentifier) to the executor. For instance, there can be a POSIX storage provider that presents the CID as a POSIX filesystem, or a library storage provider that streams the contents of the CID via a library call. Therefore, the storage providers and Executor implementations are loosely coupled, allowing the POSIX and library storage providers to be used across multiple executors, wherever it is deemed appropriate.

Publisher

The publisher is responsible for uploading the final results of a job to a remote location where clients can access them, such as S3 or IPFS.

Chapter 2 - Job cycle

Job preparation

Advanced job preparation

Job Submission

You should use the Bacalhau client to send a task to the network. The client transmits the job information to the Bacalhau network via established protocols and interfaces. Jobs submitted via the Bacalhau CLI are forwarded to a Bacalhau network node at (http://bootstrap.production.bacalhau.org/) via port 1234 by default. This Bacalhau node will act as the requester node for the duration of the job lifecycle.

Bacalhau provides an interface to interact with the server via a REST API. Bacalhau uses 127.0.0.1 as the localhost and 1234 as the port by default.

Bacalhau Docker CLI commands
Bacalhau WASM CLI commands

Job Acceptance

When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and communicate with them directly. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Job execution

Results publishing

When the compute node completes the job, it publishes the results to the remote destination specified in the job specification, such as S3 or IPFS.

Chapter 3 - Returning Information

The Bacalhau client receives updates on the task execution status and results. A user can access the results and manage tasks through the command line interface.

Get Job Results

To get the results of a job, you can run the following command.
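For example, using the shortened job ID from the earlier run:

bacalhau job get j-de72aeff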

One can choose from a wide range of flags, from which a few are shown below.

Describe a Job

To describe a specific job, insert its ID into the CLI or API to get back an overview of the job.
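For example:

bacalhau job describe j-de72aeff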

List of Jobs

If you run more than one job, or you want to find a specific job ID, you can list all jobs.
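For example, to list recent jobs:

bacalhau job list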

Job Executions

To list executions, run the following command.
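For example:

bacalhau job executions j-de72aeff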

Chapter 4 - Monitoring and Management

Stop a Job

Job History

Job Logs
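The corresponding CLI commands are sketched below with a placeholder job ID:

bacalhau job stop j-de72aeff
bacalhau job history j-de72aeff
bacalhau job logs j-de72aeff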

Job selection policy

When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.

Job selection probes

If you want more control over making the decision to take on jobs, you can use the JobAdmissionControl.ProbeExec and JobAdmissionControl.ProbeHTTP configuration keys.

These are external programs that are passed the following data structure so that they can make a decision about whether to take on a job:

The exec probe is a script to run that will be given the job data on stdin, and must exit with status code 0 if the job should be run.

The http probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns an error status code (e.g. >= 400).

For example, the following response will reject the job:

If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.

GPU Installation

How to enable GPU support on your Bacalhau node

Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.

Prerequisites

Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.

In general for GPUs from any vendor, the Bacalhau client requires:

Nvidia

  1. nvidia-smi installed and functional

AMD

  1. rocm-smi tool installed and functional

Intel

  1. xpu-smi tool installed and functional

GPU Node Configuration

Access Management

How to configure authentication and authorization on your Bacalhau node.

Access Management

Bacalhau includes a flexible auth system that supports multiple authentication methods appropriate for different deployment environments.

By default

With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.

In anonymous mode, Bacalhau will allow:

  1. Users identified by a self-generated private key to submit any job and cancel their own jobs.

  2. Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.

Restricting anonymous access

Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.

Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:

Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.

Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:

Then, modify the allowed_clients variable in challenge_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau agent node.

Once the node is restarted, only keys in the allowed list will be able to access any API.

Username and password access

Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.

Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt. To generate a salted password, the helper script in pkg/authn/ask/gen_password can be used:
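A sketch of invoking the helper, assuming a local checkout of the Bacalhau source tree and that the helper is a runnable Go package at the path mentioned above:

cd pkg/authn/ask/gen_password && go run .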

This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego.

Writing custom policies

In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.

Custom authentication policies

Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input variable. The exact information depends on the type of authentication used.

challenge authentication

challenge authentication identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.

Policies used for challenge authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.

Policies for this type will need to implement these rules:

  • bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

They should expect as fields on the input variable:

  • clientId: an ID derived from the user's private key that identifies them uniquely

  • nodeId: the ID of the requester node that this user is authenticating with

  • signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be this policy that returns the same opaque token for all users:

ask authentication

ask authentication uses credentials supplied manually by the user as identification. For example, an ask policy could require a username and password as input and check these against a known list. ask policies do all the verification of the supplied credentials.

Policies for this type will need to implement these rules:

  • bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

  • bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.

They should expect as fields on the input variable:

  • ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.

  • nodeId: the ID of the requester node that this user is authenticating with

  • signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:

Custom authorization policies

Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.

Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthorized users.

Policies will need to implement these rules:

  • bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.

  • bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.

They should expect as fields on the input variable for both rules:

  • http: details of the user's HTTP request:

    • host: the hostname used in the HTTP request

    • method: the HTTP method (e.g. GET, POST)

    • path: the path requested, as an array of path components without slashes

    • query: a map of URL query parameters to their values

    • headers: a map of HTTP header names to arrays representing their values

    • body: a blob of any content submitted as the body

  • constraints: details about the receiving node that should be used to validate any supplied tokens:

    • cert: keys that the input token should have been signed with

    • iss: the name of a node that this node will recognize as the issuer of any signed tokens

    • aud: the name of this node that is receiving the request

Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method which will validate the access token as a JWT against the given constraints.

The simplest possible authz policy might be this one that allows all users to access all endpoints:

Node Onboarding

Introduction

This tutorial describes how to add new nodes to an existing private network. Two basic scenarios will be covered:

Prerequisites

Add Host/Virtual Machine as a New Node

  1. Set the token in the Compute.Auth.Token configuration key

  2. Set the orchestrator's IP address in the Compute.Orchestrators configuration key

  3. Execute bacalhau serve, specifying the node type via the --compute flag, as sketched below
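A sketch of these three steps, with <token> and <Public-IP-of-Requester-Node> as placeholders:

bacalhau config set Compute.Auth.Token=<token>
bacalhau config set Compute.Orchestrators=<Public-IP-of-Requester-Node>
bacalhau serve --compute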

Add a Cloud Instance as a New Node

To automate the process using Terraform follow these steps:

  1. Determine the IP address of your requester node

  2. Write a terraform script, which does the following:

    1. Adds a new instance

    2. Installs bacalhau on it

    3. Launches a compute node

  3. Execute the script

Support

Node persistence

How to configure compute/requester persistence

Both compute nodes and requester nodes maintain state. How that state is maintained is configurable, although the defaults are likely adequate for most use cases. This page describes how to configure the persistence of compute and requester nodes should the defaults not be suitable.

Compute node persistence

The compute nodes maintain information about the work that has been allocated to them, including:

  • The current state of the execution, and

  • The original job that resulted in this allocation

This information is used by the compute and requester nodes to ensure allocated jobs are completed successfully. By default, compute nodes store their state in a bolt-db database and this is located in the bacalhau repository along with configuration data. For a compute node whose ID is "abc", the database can be found in ~/.bacalhau/abc-compute/executions.db.

In some cases, it may be preferable to maintain the state in memory, with the caveat that should the node restart, all state will be lost. This can be configured using the environment variables in the table below.

Requester node persistence

When running a requester node, it maintains state about the jobs it has been requested to orchestrate and schedule, the evaluation of those jobs, and the executions that have been allocated. By default, this state is stored in a bolt db database that, with a node ID of "xyz" can be found in ~/.bacalhau/xyz-requester/jobs.db.

Bacalhau has updated configurations. Please check out .

There is no anymore to give users even more control over their network from the start.

For more information, check out the .

Bacalhau is a platform designed for fast, cost-efficient, and secure computation by running jobs directly where the data is generated and stored. Bacalhau helps you streamline existing workflows without extensive rewriting, as it allows you to run arbitrary Docker containers and WebAssembly (WASM) images as tasks. This approach is also known as Compute Over Data (or CoD). The name comes from the Portuguese word for salted cod fish.

Bacalhau aims to revolutionize data processing for large-scale datasets by enhancing cost-efficiency and accessibility, making data processing available to a broader audience. Our goal is to build an open, collaborative compute ecosystem that fosters unmatched collaboration. We offer a demo network where you can try running jobs without any installation. Give it a try!

For a more detailed tutorial, check out our .

The best practice is to use environment variables to store sensitive data such as access tokens, API keys, or passwords. These variables can be accessed by Bacalhau at runtime and are not visible to anyone who has access to the code or the server. Alternatively, you can pre-provision credentials to the nodes and access those on a node-by-node basis.

Once you have more than 10 devices generating or storing around 100GB of data, you're likely to face challenges with processing that data efficiently. Traditional computing approaches may struggle to handle such large volumes, and that's where distributed computing solutions like Bacalhau can be extremely useful. Bacalhau can be used in various industries, including security, web serving, financial services, IoT, Edge, Fog, and multi-cloud. Bacalhau shines when it comes to data-intensive applications such as data engineering, model inference, model training, and molecular dynamics.


For more tutorials, visit our Examples section.

– ask anything about the project, give feedback, or answer questions that will help other users.

Join the Bacalhau Slack and go to the #bacalhau channel – it is the easiest way to engage with other members in the community and get help.

– learn how to contribute to the Bacalhau project.

👉 Continue with Bacalhau's Getting Started guide to learn how to install and run a job with the Bacalhau client.

👉 Or jump directly to try out the different examples that showcase Bacalhau's abilities.

Install Bacalhau with curl -sL https://get.bacalhau.org/install.sh | bash on every host

Start the Requester node: bacalhau serve --orchestrator

Cloud VMs (from any cloud provider)

Install Bacalhau on each host

Ensure that Docker is installed in case you are going to run Docker workloads

Bacalhau is designed to be versatile in its deployment, capable of running on various environments: physical hosts, virtual machines or cloud instances. Its resource requirements are modest, ensuring compatibility with a wide range of hardware configurations. However, for certain workloads, such as machine learning, it's advisable to consider hardware configurations optimized for computational tasks, including GPUs.

If you are using a cloud deployment, you can find your public IP through your provider's console.

To set up access to S3, you need to specify environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, populate a credentials file located on your compute node, i.e. ~/.aws/credentials, or create an IAM role for your compute nodes if you are utilizing cloud instances.

S3-compatible publishers can also be used as input sources for your jobs, with a similar configuration.

By default, Bacalhau does not connect to or create its own IPFS network. Consider creating your own IPFS network and connecting to it using the InputSources.Types.IPFS.Endpoint configuration key.

can be set for your Bacalhau compute nodes declaratively or imperatively using either configuration yaml file:

Data pinned to the IPFS network can be used as a source for your jobs. To do this, you will need to specify the CID in declarative or imperative format:

can be set for your Bacalhau compute nodes declaratively or imperatively using configuration yaml file:

The local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. To allow jobs to access local files when starting a node, the Compute.AllowListedLocalPaths configuration key should be used, specifying the path to the data and access mode :rw for Read-Write access or :ro for Read-Only (used by default). For example:

Optimize your private network nodes' performance and functionality with these most useful configuration keys related to node management:

JobAdmissionControl.AcceptNetworkedJobs: Allows the node to accept jobs that require network access

Ensure you are running the Bacalhau process from a dedicated system user with limited permissions. This enhances security and reduces the risk of unauthorized access to critical system resources. If you are using an orchestrator such as , utilize a service file to manage the Bacalhau process, ensuring the correct user is specified and consistently used. Here’s a

Create an authentication file for your clients. A can ease the process of maintaining secure data transmission within your network. With this, clients can authenticate themselves, and you can limit the Bacalhau API endpoints unauthorized users have access to.

Consistency is a key consideration when deploying decentralized tools such as Bacalhau. You can use an to affix a specific version of Bacalhau or specify deployment actions, ensuring that each host instance has all the necessary resources for efficient operations.

That's all folks! 🎉 Please contact us on #bacalhau channel for questions and feedback!

You can check out this example tutorial on to see how we used all these steps together.

You can check to see a used by the Bacalhau team

All ingress/egress networking is limited as described in the documentation. You won't be able to pull data, code, weights, etc. from an external source.

You can specify which directory the data is written to with the CLI flag.


At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build the image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the task.

Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint flag to bacalhau docker run.


If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the for more details.

To use the S3 source, you will have to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChecksumSHA256.

Below is an example of how to define an S3 input source in YAML format:

To start, you'll need to connect the Bacalhau node to an IPFS server so that you can run jobs that consume CIDs as inputs. You can either install and run a local IPFS node, or you can connect to a remote IPFS server.

In both cases, you should have a multiaddress for the IPFS server that should look something like this:

The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.

You can then configure your Bacalhau node to use this IPFS server by adding the address to the InputSources.Types.IPFS.Endpoint configuration key:
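A sketch of the configuration command, using a placeholder multiaddress:

bacalhau config set InputSources.Types.IPFS.Endpoint=/ip4/<IPFS-server-IP>/tcp/5001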

See the for more details.

Below is an example of how to define an IPFS input source in YAML format:

To use a local data source, you will have to:

  1. Enable the use of local data when configuring the node itself by using the Compute.AllowListedLocalPaths configuration key, specifying the file path and access mode. For example

  2. In the job description specify parameters SourcePath - the absolute path on the compute node where your data is located and ReadWrite - the access mode.

Below is an example of how to define a Local input source in YAML format:

To use a URL data source, you will have to specify only the URL parameter, as in the part of the declarative job description below:

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish job results to any S3-compatible storage service. To use an S3 publisher you will have to specify the required parameters Bucket and Key and the optional parameters Region, Endpoint, VersionID and ChecksumSHA256. See the S3 publisher specification for more details.

Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:

The IPFS publisher works using the same setup as above - you'll need to have an IPFS server running and a multiaddress for it. You'll then configure that multiaddress using the InputSources.Types.IPFS.Endpoint configuration key. Then you can use bacalhau job get <job-ID> with no further arguments to download the results.

To use the IPFS publisher you will have to specify the CID, which can be used to access the published content. See the IPFS publisher specification for more details.

And part of the declarative job description with an IPFS publisher will look like this:

The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.

Another option is to store the results of a job execution on the compute node itself. In this case, the results will be published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. To use the Local publisher you will only have to specify the URL parameter with an HTTP URL pointing to the location where you would like to save the result. See the Local publisher specification for more details.

Here is an example of part of the declarative job description with a local publisher:

You can create jobs in the Bacalhau network using the various job types introduced in version 1.2. Each job may need specific variables, resource requirements and data details that are described in the Job Specification.

Prepare data with Bacalhau by copying from URLs, pinning to public storage, or copying from an S3 bucket. Mount data anywhere for Bacalhau to run against. Refer to the IPFS, Local, S3, and URL Source Specifications for data source usage.

Optimize workflows without completely redesigning them. Run arbitrary tasks using Docker containers and WebAssembly images. Follow the Onboarding guides for Docker and WebAssembly workloads.

Explore GPU workload support with Bacalhau. Learn how to run GPU workloads using the Bacalhau client in the GPU Workloads section. Integrate Python applications with Bacalhau using the Bacalhau Python SDK.

For node operation, refer to the Running a Node section for configuring and running a Bacalhau node. If you prefer an isolated environment, explore the Private Cluster guide for performing tasks without connecting to the main Bacalhau network.

You can use the bacalhau job run command with appropriate flags to create a job in Bacalhau using JSON and YAML formats.

You can use the Create Job API to submit a new job for execution.

You can use the bacalhau docker run command to start a job in a Docker container. Below, you can see an excerpt of the commands:

You can also use the bacalhau wasm run command to run a job compiled into the WebAssembly (WASM) format. Below, you can find an excerpt of the commands in the Bacalhau CLI:

The selected compute node receives the job and starts its execution inside a container. The container can use different executors to work with the data and perform the necessary actions. A job can use the Docker executor or the WASM executor together with storage volumes. Use the Docker Engine Specification to view the parameters for configuring the Docker Engine. If you want tasks to be executed in a WebAssembly environment, pay attention to the WebAssembly Engine Specification.

Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure. View the IPFS Publisher Specification for detailed information.

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. View the S3 Publisher Specification for detailed information.

You can use the bacalhau job describe command with appropriate flags to get a full description of a job in YAML format.

You can use the Describe Job API to retrieve the specification and current status of a particular job.

You can use the bacalhau job list command with appropriate flags to list jobs on the network in YAML format.

You can use the List Jobs API to retrieve a list of jobs.

You can use the bacalhau job executions command with appropriate flags to list all executions associated with a job, identified by its ID, in YAML format.

You can use the Job Executions API to retrieve all executions for a particular job.

The Bacalhau client provides the user with tools to monitor and manage the execution of jobs. You can get information about status and progress and decide on next steps. View the Bacalhau Agent APIs if you want to know the node's health, capabilities, and deployed Bacalhau version. To get information about the status and characteristics of the nodes in the cluster, use the Nodes API.

You can use the bacalhau job stop command with appropriate flags to cancel a job that was previously submitted and stop it running if it has not yet completed.

You can use the Stop Job API to terminate a specific job asynchronously.

You can use the bacalhau job history command with appropriate flags to enumerate the historical events related to a job, identified by its ID.

You can use the Job History API to retrieve historical events for a specific job.

You can use the bacalhau job logs command to retrieve the log output (stdout and stderr) from a job. If the job is still running, it is possible to follow the logs after the previously generated logs are retrieved.

To familiarize yourself with all the commands used in Bacalhau, please view the CLI Commands guide.

If the HTTP response is a JSON blob, it should match the following schema and will be used to respond to the bid directly:

To run GPU workloads, the compute node needs Docker installed, permission to access Docker, the NVIDIA GPU drivers, and the NVIDIA Container Toolkit (nvidia-docker2). Verify the installation by running a sample workload.

See the AMD GPU drivers and Running ROCm Docker containers documentation for guidance on how to run Docker workloads on AMD GPUs.

See the Intel GPU drivers and Running on GPU under docker documentation for guidance on how to run Docker workloads on Intel GPUs.

Access to GPUs can be controlled using resource limits. To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.

Policies are written in a language called Rego, also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.

A more realistic example that returns a signed JWT is in challenge_ns_anon.rego.

A more realistic example that returns a signed JWT is in ask_ns_example.rego.

A more realistic example (which is the Bacalhau "anonymous mode" default) is in policy_ns_anon.rego.

Adding a machine as a new node.

Adding a cloud instance as a new node.

You should have an established private network consisting of at least one requester node. See the Create Private Network guide to set one up.

You should have a new host (physical/virtual machine, cloud instance or docker container) with Bacalhau installed.

Let's assume that you already have a private network with at least one requester node. In this case, the process of adding new nodes follows the Create And Connect Compute Node section. You will need to:

Let's assume you already have all the necessary cloud infrastructure set up, along with a private network with at least one requester node. In this case, you can add new nodes manually (AWS, Azure, GCP) or use a tool like Terraform to automatically create and add any number of nodes to your network. The process of adding new nodes manually follows the Create And Connect Compute Node section.

Configure Terraform for your cloud provider.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).


The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution. See the Local source specification for more details.

The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the job's execution environment, enabling direct and efficient data utilization. See the URL source specification for more details.

bacalhau config set Compute.AllowListedLocalPaths="/etc/config:rw,/etc/*.conf:ro"
InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"
InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    Target: "/data"
bacalhau job run [flags]
Endpoint: `PUT /api/v1/orchestrator/jobs` 
bacalhau docker run [flags] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entrypoint strings               Override the default ENTRYPOINT of the image
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                    Examples:
                                    # Mount IPFS CID to /inputs directory
                                    -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                    # Mount S3 object to a specific path
                                    -i s3://bucket/key,dst=/my/input/path
                                    # Mount S3 object with specific endpoint and region
                                    -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)
-w, --workdir string                   Working directory inside the container. Overrides the working directory shipped with the image (e.g. via WORKDIR in Dockerfile).
bacalhau wasm run {cid-of-wasm | <local.wasm>} [--entry-point <string>] [wasm-args ...] [flags]
Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entry-point string               The name of the WASM function in the entry module to call. This should be a zero-parameter zero-result function that
                                will execute the job. (default "_start")
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-U, --import-module-urls url           URL of the WASM modules to import from a URL source. URL accept any valid URL supported by the 'wget' command, and supports both HTTP and HTTPS.
-I, --import-module-volumes cid:path   CID:path of the WASM modules to import from IPFS, if you need to set the path of the mounted data.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                        Examples:
                                        # Mount IPFS CID to /inputs directory
                                        -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                        # Mount S3 object to a specific path
                                        -i s3://bucket/key,dst=/my/input/path
                                        # Mount S3 object with specific endpoint and region
                                        -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)
bacalhau job get [id] [flags]
Usage:
  bacalhau job get [id] [flags]

Flags:
      --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
  -h, --help                             help for get
      --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
      --ipfs-serve-path string           path local Ipfs node will persist data to
      --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
      --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
      --output-dir string                Directory to write the output to.
      --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
      --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
bacalhau job describe [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID`
bacalhau job list [flags]
Endpoint: `GET /api/v1/orchestrator/jobs`
bacalhau job executions [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/executions`
bacalhau job stop [id] [flags]
Endpoint: `DELETE /api/v1/orchestrator/jobs/:jobID`
bacalhau job history [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/history`
bacalhau job logs [flags] [id]
{
  "node_id": "XXX",
  "job_id": "XXX",
  "spec": {
    "engine": "docker",
    "verifier": "ipfs",
    "job_spec_vm": {
      "image": "ubuntu:latest",
      "entrypoint": ["cat", "/file.txt"]
    },
    "inputs": [{
      "engine": "ipfs",
      "cid": "XXX",
      "path": "/file.txt"
    }]
  }
}
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "shouldBid": {
      "description": "If the job should be accepted",
      "type": "boolean"
    },
    "shouldWait": {
      "description": "If the node should wait for an async response that will come later. `shouldBid` will be ignored",
      "type": "boolean",
      "default": false
    },
    "reason": {
      "description": "Human-readable string explaining why the job should be accepted or rejected, or why the wait is required",
      "type": "string"
    }
  },
  "required": [
    "shouldBid",
    "reason"
  ]
}
{
  "shouldBid": false,
  "reason": "The job did not pass this specific validation: ..."
}
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authz/policies/policy_ns_anon.rego -o ~/.bacalhau/no-anon.rego
bacalhau config set API.Auth.AccessPolicyPath ~/.bacalhau/no-anon.rego
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/challenge/challenge_ns_no_anon.rego -o ~/.bacalhau/challenge_ns_no_anon.rego
bacalhau config set API.Auth.Methods '{Method: ClientKey, Policy: {Type: challenge, PolicyPath: ~/.bacalhau/challenge_ns_no_anon.rego}}'
bacalhau agent node | jq -rc .ClientID
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/ask/ask_ns_password.rego -o ~/.bacalhau/ask_ns_password.rego
bacalhau config set API.Auth.Methods '{Method: Password, Policy: {Type: ask, PolicyPath: ~/.bacalhau/ask_ns_password.rego}}'
cd pkg/authn/ask/gen_password && go run .
package bacalhau.authn

token := "anything"
package bacalhau.authn

schema := {}
token := "anything"
package bacalhau.authz

allow := true
token_valid := true

Environment Variable        | Flag alternative               | Value                  | Effect
BACALHAU_COMPUTE_STORE_TYPE | --compute-execution-store-type | boltdb                 | Uses the bolt db execution store (default)
BACALHAU_COMPUTE_STORE_PATH | --compute-execution-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is ~/.bacalhau/{NODE-ID}-compute/executions.db if not set

Environment Variable    | Flag alternative           | Value                  | Effect
BACALHAU_JOB_STORE_TYPE | --requester-job-store-type | boltdb                 | Uses the bolt db job store (default)
BACALHAU_JOB_STORE_PATH | --requester-job-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is ~/.bacalhau/{NODE-ID}-requester/jobs.db if not set
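For example, a compute node could be pointed at a custom execution store location before the daemon is started; the path below is purely illustrative:

# illustrative location - any path writable by the Bacalhau user works
export BACALHAU_COMPUTE_STORE_TYPE=boltdb
export BACALHAU_COMPUTE_STORE_PATH=/var/lib/bacalhau/executions.db
bacalhau serve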

An example on how to build your own ETL pipeline with Bacalhau and MongoDB.

Test Network Locally

Before you join the main Bacalhau network, you can test locally.

To test, you can use the bacalhau devstack command, which offers a way to get a 3 node cluster running locally.

export PREDICTABLE_API_PORT=1
bacalhau devstack

By setting PREDICTABLE_API_PORT=1, the first node of our 3-node cluster will always listen on port 20000

In another window, export the following environment variables so that the Bacalhau client binary connects to our local development cluster:

export BACALHAU_API_HOST=127.0.0.1
export BACALHAU_API_PORT=20000

You can now interact with Bacalhau - all jobs are run by the local devstack cluster.

bacalhau docker run ubuntu echo hello
bacalhau job list
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
    Target: "/data"
export IPFS_CONNECT=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC
bacalhau config set InputSources.Types.IPFS.Endpoint=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC
InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"
Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
Publisher:
  Type: ipfs
PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"
Publisher:
    Type: local
PublishedResult:
  Type: local
  Params:
    URL: "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"

Configuration key                       | Default value | Meaning
JobAdmissionControl.Locality            | Anywhere      | Only accept jobs that reference data we have locally ("local") or anywhere ("anywhere").
JobAdmissionControl.ProbeExec           | unused        | Use the result of an external program to decide if we should take on the job.
JobAdmissionControl.ProbeHTTP           | unused        | Use the result of a HTTP POST to decide if we should take on the job.
JobAdmissionControl.RejectStatelessJobs | False         | Reject jobs that do not reference any input data.
JobAdmissionControl.AcceptNetworkedJobs | False         | Accept jobs that require network access.
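These keys can be set with bacalhau config set, in the same way as the other configuration keys in this guide. For instance, a sketch of a compute node that only accepts jobs whose data is already available locally and rejects stateless jobs (the chosen values are illustrative):

bacalhau config set JobAdmissionControl.Locality=local
bacalhau config set JobAdmissionControl.RejectStatelessJobs=true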

Docker Workload Onboarding

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

Requirements

Here are a few things to note before getting started:

  1. Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.

  2. Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so arm64 containers will not run on them.

  3. Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.
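For illustration, here is roughly how each of these input forms looks on the command line (the CID, URL and bucket below are placeholders):

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ls /inputs
bacalhau docker run -i https://example.com/data.csv ubuntu ls /inputs
bacalhau docker run -i s3://my-bucket/logs-2023-04* ubuntu ls /inputs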

Note: Only about a third of the examples have their containers in the list of example public containers; the rest are hosted under various Docker Hub registries.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

  1. Limited Ingress/Egress Networking:

  2. Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:

bacalhau docker run \
  -i s3://mybucket/logs-2023-04*:/input \
  -o apples:/output_folder \
  ubuntu \
  bash -c 'ls /input > /output_folder/file.txt'

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.
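As a minimal sketch, a task can be as simple as a shell command that reads whatever lands in /inputs and writes its results under /outputs (the output file name is a placeholder):

# count the lines of every input file and save the summary as a job result
wc -l /inputs/* > /outputs/line_counts.txt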

Step 3 - Build and Push Your Image To a Registry

For example:

$ export IMAGE=myuser/myimage:latest
$ docker build -t ${IMAGE} .
$ docker image push ${IMAGE}

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

$ export LOCAL_INPUT_DIR=$PWD
$ export LOCAL_OUTPUT_DIR=$PWD
$ export CMD=(sh -c 'ls /inputs; echo do something useful > /outputs/stdout')
$ docker run --rm \
  -v ${LOCAL_INPUT_DIR}:/inputs  \
  -v ${LOCAL_OUTPUT_DIR}:/outputs \
  ${IMAGE} \
  ${CMD}

Let's see what each command will be used for:

$ export LOCAL_INPUT_DIR=$PWD
Exports the current working directory of the host system to the LOCAL_INPUT_DIR variable. This variable will be used for binding a volume and transferring data into the container.

$ export LOCAL_OUTPUT_DIR=$PWD
Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.

$ export CMD=(sh -c 'ls /inputs; echo do something useful > /outputs/stdout')
Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.

$ docker run ... ${IMAGE} ${CMD}
Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.

For example:

$ export LOCAL_INPUT_DIR=$PWD
$ export LOCAL_OUTPUT_DIR=$PWD
$ export CMD=(sh -c 'ls /inputs; echo "do something useful" > /outputs/stdout')
$ export IMAGE=ubuntu
$ docker run --rm \
  -v ${LOCAL_INPUT_DIR}:/inputs  \
  -v ${LOCAL_OUTPUT_DIR}:/outputs \
  ${IMAGE} \
  ${CMD}
$ cat stdout

The result of the commands' execution is shown below:

do something useful

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

  1. Copy data from a URL to public storage

  2. Pin Data to public storage

  3. Copy Data from S3 Bucket to public storage

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command:

$ bacalhau docker run --input ipfs://${CID} ${IMAGE} ${CMD}

To check the status of your job, run the following command:

$ bacalhau job list --id-filter JOB_ID

To get more information on your job, run:

$ bacalhau job describe JOB_ID

To download your job, run:

$ bacalhau job get JOB_ID

For example, running:

JOB_ID=$(bacalhau docker run ubuntu echo hello | grep 'Job ID:' | sed 's/.*Job ID: \([^ ]*\).*/\1/')
echo "The job ID is: $JOB_ID"
bacalhau job list --id-filter $JOB_ID
sleep 5

bacalhau job list --id-filter $JOB_ID
bacalhau job get $JOB_ID

ls shards

outputs:

CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 10:26:00  24440f0d  Docker ubuntu echo h...  Verifying
 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 10:26:00  24440f0d  Docker ubuntu echo h...  Published            /ipfs/bafybeiflj3kha...
11:26:09.107 | INF bacalhau/get.go:67 > Fetching results of job '24440f0d-3c06-46af-9adf-cb524aa43961'...
11:26:10.528 | INF ipfs/downloader.go:115 > Found 1 result shards, downloading to temporary folder.
11:26:13.144 | INF ipfs/downloader.go:195 > Combining shard from output volume 'outputs' to final location: '/Users/phil/source/filecoin-project/docs.bacalhau.org'
job-24440f0d-3c06-46af-9adf-cb524aa43961-shard-0-host-QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:

$ export URL=https://download.geofabrik.de/antarctica-latest.osm.pbf
$ bacalhau docker run --input ${URL} ${IMAGE} ${CMD}

$ bacalhau job list

$ bacalhau job get JOB_ID

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

Creating job for submission ... done ✅
Finding node(s) for the job ... done ✅
Node accepted the job ... done ✅
Error while executing the job.

This can often be resolved by re-tagging your docker image

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau WebUI

How to run the WebUI.

Note that in version v1.5.0 the WebUI was completely reworked.

Overview

The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up and utilizing the WebUI.

WebUI Setup

Prerequisites

  • Ensure you have Bacalhau v1.5.0 or later installed.

Configuration

To enable the WebUI, use the WebUI.Enabled configuration key:

bacalhau config set webui.enabled=true

By default, the WebUI uses host=0.0.0.0 and port=8438. This can be configured via the WebUI.Listen configuration key:

bacalhau config set webui.listen=<ip-address>:<port>
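For example, to expose the WebUI on port 9000 on all interfaces (the address and port here are arbitrary):

bacalhau config set webui.listen=0.0.0.0:9000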

Accessing the WebUI

Once started, the WebUI is accessible at the specified address, localhost:8438 by default.

WebUI Features

Jobs

The updated WebUI allows you to view a list of jobs, including job status, run time, type, and a message in case the job failed.

Clicking on the id of a job in the list opens the job details page, where you can see the history of events related to the job, the list of nodes on which the job was executed and the real-time logs of the job.

Nodes

On the Nodes page you can see a list of nodes connected to your network, including node type, membership and connection statuses, amount of resources - total and currently available, and a list of labels of the node.

Clicking on the node id opens the node details page, where you can see the status and settings of the node, as well as the number of running and scheduled jobs.

Configuring Transport Layer Security

How to configure TLS for the requester node APIs

By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure them to use Transport Layer Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.

Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API

Getting a certificate from Let's Encrypt with ACME
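The usual approach is to hand the requester node its public hostname when starting it, so that it can complete the ACME challenge and obtain a certificate automatically. The sketch below assumes your Bacalhau version exposes an --autocert flag for this - check bacalhau serve --help for the exact flag on your version, and substitute your own domain:

bacalhau serve --node-type=requester --autocert=bacalhau.example.com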

Alternatively, you may set this option via the environment variable BACALHAU_AUTO_TLS. If you are using a configuration file, you can set the value in Node.ServerAPI.TLS.AutoCert instead.

As a result of the Let's Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.

sudo setcap CAP_NET_BIND_SERVICE+ep $(which bacalhau)

A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache, and this will be used to manage renewals to avoid rate limits.

Getting a certificate from a Certificate Authority

Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:

  1. Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.

  2. Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.

  3. Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.

  4. Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.

  5. Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.

  6. Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.

Once you have obtained your certificates, you will need to put two files in a location that Bacalhau can read. You need the server certificate, often called something like server.cert or server.cert.pem, and the server key, which is often called something like server.key or server.key.pem.

Once you have these two files available, you must start bacalhau serve with two new flags, --tlscert and --tlskey, whose arguments should point to the relevant files. An example of how they are used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

Self-signed certificates

Once you have generated the necessary files, the steps are much like above: you must start bacalhau serve with two new flags, --tlscert and --tlskey, whose arguments should point to the relevant files. An example of how they are used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:

  1. Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities (see the example after this list).

  2. Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.

  3. Instruct the software library you are using not to verify HTTPS requests.
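As an illustration of the first option, a client can pass the CA certificate explicitly when calling the API; the file name, host and agent endpoint below are assumptions for the sketch:

# ca.crt is the certificate used to sign the self-signed server certificate
curl --cacert ca.crt https://bacalhau.example.com:1234/api/v1/agent/alive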

Private IPFS Network Setup

Set up private IPFS network

Note that Bacalhau v1.4.0 supports IPFS v0.27 and below.

Starting from v1.5.0, Bacalhau supports the latest IPFS versions.

Consider this when selecting versions of Bacalhau and IPFS when setting up your own private network.

Introduction

  1. Install and configure IPFS

  2. Create Private IPFS network

  3. Pin your data to private IPFS network

TL;DR

  1. Initialize Private IPFS network

  2. Connect all nodes to the same private network

  3. Connect Bacalhau network to use private IPFS network

Download and Install

  1. Download the Go archive (the version below is the one used at the time of writing):

wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz

  2. Remove any previous Go installation by deleting the /usr/local/go folder (if it exists), then extract the archive you downloaded into /usr/local, creating a fresh Go tree in /usr/local/go:

rm -rf /usr/local/go && tar -C /usr/local -xzf go1.23.0.linux-amd64.tar.gz
  3. Add /usr/local/go/bin to the PATH environment variable. You can do this by adding the following line to your $HOME/.profile or /etc/profile (for a system-wide installation):

export PATH=$PATH:/usr/local/go/bin

Changes made to a profile file may not apply until the next time you log into the system. To apply the changes immediately, just run the shell commands directly or execute them from the profile using a command such as source $HOME/.profile.

  4. Verify that Go is installed correctly by checking its version:

go version
Next, download and install IPFS (Kubo):

wget https://dist.ipfs.tech/kubo/v0.30.0/kubo_v0.30.0_linux-amd64.tar.gz
tar -xvzf kubo_v0.30.0_linux-amd64.tar.gz
sudo bash kubo/install.sh

Verify that IPFS is installed correctly by checking its version:

ipfs --version

Configure Bootstrap IPFS Node

A bootstrap node is used by client nodes to connect to the private IPFS network. The bootstrap node connects clients to the other nodes available on the network.

Execute the ipfs init command to initialize an IPFS node:

ipfs init
# example output

generating ED25519 keypair...done
peer identity: 12D3KooWQqr8BLHDUaZvYG59KnrfYJ1PbbzCq3pzfpQ6QrKP5yz7
initializing IPFS node at /home/username/.ipfs

The next step is to generate the swarm key - a cryptographic key that is used to control access to an IPFS network - and export the key into a swarm.key file located in the ~/.ipfs folder.

echo -e "/key/swarm/psk/1.0.0/\n/base16/\n$(tr -dc 'a-f0-9' < /dev/urandom | head -c64)" > ~/.ipfs/swarm.key
# example swarm.key content:

/key/swarm/psk/1.0.0/
/base16/
k51qzi5uqu5dli3yce3powa8pme8yc2mcwc3gpfwh7hzkzrvp5c6l0um99kiw2

Now the default entries of bootstrap nodes should be removed. Execute the command on all nodes:

ipfs bootstrap rm --all

Check that bootstrap config does not contain default values:

ipfs config show | grep Bootstrap
# expected output:

  "Bootstrap": null,

Configure IPFS to listen for incoming connections on specific network addresses and ports, making the IPFS Gateway and API services accessible. Consider changing addresses and ports depending on the specifics of your network.

ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080
ipfs config Addresses.API /ip4/0.0.0.0/tcp/5001

Start the IPFS daemon:

ipfs daemon

Configure Client Nodes

Copy the swarm.key file from the bootstrap node to client nodes into the ~/.ipfs/ folder and initialize IPFS:

ipfs init

Apply same config as on bootstrap node and start the daemon:

ipfs bootstrap rm --all

ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080

ipfs config Addresses.API /ip4/0.0.0.0/tcp/5001

ipfs daemon

Done! Now you can check that private IPFS network works properly:

  1. List peers on the bootstrap node. It should list all connected nodes:

ipfs swarm peers
# example output for single connected node

/ip4/10.0.2.15/tcp/4001/p2p/12D3KooWQqr8BLHDUaZvYG59KnrfYJ1PbbzCq3pzfpQ6QrKP5yz7
  2. Pin some files and check their availability across the network:

# Create a sample text file and pin it
echo "Hello from the private IPFS network!" > sample.txt
# Pin file:
ipfs add sample.txt
# example output:

added QmWQeYip3JuwhDFmkDkx9mXG3p83a3zMFfiMfhjS2Zvnms sample.txt
 25 B / 25 B [=========================================] 100.00%
# Retrieve and display the content of a pinned file
# Execute this on any node of your private network
ipfs cat QmWQeYip3JuwhDFmkDkx9mXG3p83a3zMFfiMfhjS2Zvnms
# expected output:

Hello from the private IPFS network!

Configure the IPFS Daemon as systemd Service

Finally, make the IPFS daemon run at system startup. To do this:

  1. Create new service unit file in the /etc/systemd/system/

sudo nano /etc/systemd/system/ipfs.service
  2. Add the following content to the file, replacing /path/to/your/ipfs/executable with the actual path:

[Unit]
Description=IPFS Daemon
After=network.target

[Service]
User=username
ExecStart=/path/to/your/ipfs/executable daemon
Restart=on-failure

[Install]
WantedBy=multi-user.target

Use which ipfs command to locate the executable.

Usually path to the executable is /usr/local/bin/ipfs

For security purposes, consider creating a separate user to run the service. In this case, specify its name in the User= line. Without specifying a user, the ipfs service will be launched as root, which means that you will need to copy the ipfs binary to the /root directory.

  3. Reload and enable the service:

sudo systemctl daemon-reload
sudo systemctl enable ipfs
  4. Done! Now reboot the machine to ensure that the daemon starts correctly. Use the systemctl status ipfs command to check that the service is running:

sudo systemctl status ipfs

#example output

● ipfs.service - IPFS Daemon
     Loaded: loaded (/etc/systemd/system/ipfs.service; enabled; preset: enabled)
     Active: active (running) since Wed 2024-09-10 13:24:09 CEST; 16min ago

Configure Bacalhau Nodes

Now to connect your private Bacalhau network to the private IPFS network, the IPFS API address should be specified using the --ipfs-connect flag. It can be found in the ~/.ipfs/api file:

# add any other flags your node needs
bacalhau serve \
--ipfs-connect /ip4/0.0.0.0/tcp/5001

Done! Now your private Bacalhau network is connected to the private IPFS network!

Test Configured Networks

To verify that everything works correctly:

  1. Pin the file to the private IPFS network

  2. Run the job, which takes the pinned file as input and publishes result to the private IPFS network

  3. View and download job results

Create and Pin Sample File

Create any file and pin it. Use the ipfs add command:

# create file
echo "Hello from private IPFS network!" > file.txt

# pin the file
ipfs add file.txt
# example output:

added QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ file.txt
 33 B / 33 B

Run a Bacalhau Job

Run a simple job, which fetches the pinned file via its CID, lists its content and publishes results back into the private IPFS network:

bacalhau docker run \
-i ipfs://QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ \
--publisher ipfs \
alpine cat inputs
# example output

Job successfully submitted. Job ID: j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 15:54:35.767              Submission       Job submitted 
 15:54:35.780  e-a498daaf  Scheduling       Requested execution on n-0f29f45c 
 15:54:35.859  e-a498daaf  Execution        Running 
 15:54:36.707  e-a498daaf  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

To get more details about the run executions, execute:
	bacalhau job executions j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

To download the results, execute:
	bacalhau job get j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

View and Download Job Results

bacalhau job describe j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
# example output (was truncated for brevity)

...
Standard Output
Hello from private IPFS network!
bacalhau job get j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
# example output

Fetching results of job 'j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa'...
No supported downloader found for the published results. You will have to download the results differently.
[
    {
        "Type": "ipfs",
        "Params": {
            "CID": "QmSskRNnbbw8rNtkLdcJrUS2uC2mhiKofVJsahKRPgbGGj"
        }
    }
]

Use the ipfs ls command to view the results:

ipfs ls QmSskRNnbbw8rNtkLdcJrUS2uC2mhiKofVJsahKRPgbGGj
# example output

QmS6mcrMTFsZnT3wAptqEb8NpBPnv1H6WwZBMzEjT8SSDv 1  exitCode
QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH 0  stderr
QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ 33 stdout

Use the ipfs cat command to view the file content. In our case, the file of interest is the stdout:

ipfs cat QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ
# example output

Hello from private IPFS network!

Use the ipfs get command to download the file using its CID:

ipfs get --output stdout QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ
# example output
Saving file(s) to stdout
 33 B / 33 B [===============================================] 100.00% 0s

How To Work With Custom Containers in Bacalhau

Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.

Prerequisite

1. Running Containers

Docker Command

You're likely familiar with executing Docker commands to start a container:

docker run docker/whalesay cowsay sup old fashioned container run

This command runs a container from the docker/whalesay image. The container executes the cowsay sup old fashioned container run command:

_________________________________
< sup old fashioned container run >
 ---------------------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Bacalhau Command

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    docker/whalesay -- bash -c 'cowsay hello web3 uber-run')

This command also runs a container from the docker/whalesay image, using Bacalhau. We use the bacalhau docker run command to start a job in a Docker container. It contains additional flags such as --wait to wait for job completion and --id-only to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run' command is executed.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

j-7e41b9b9-a9e2-4866-9fce-17020d8ec9e0

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get \
--output-dir results \
${JOB_ID}

Viewing your job output

cat ./results/stdout

 _____________________
< hello web3 uber-run >
 ---------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Both commands execute cowsay in the docker/whalesay container, but Bacalhau provides additional features for working with jobs at scale.

Bacalhau Syntax

Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data is passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.

The --wait flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau job list command to check on their status.

Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run command after the -- flag.
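
For example, a job can be submitted without --wait and then checked on later by its ID. This is an illustrative sketch reusing the whalesay image from above:

# submit in the background, then poll the job status later
export JOB_ID=$(bacalhau docker run --id-only docker/whalesay -- bash -c 'cowsay checking in later')
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}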

2. Building Your Own Custom Container For Bacalhau

To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.

To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distributed it across the Bacalhau network.

# write to the cod.cow
$the_cow = <<"EOC";
   $thoughts
    $thoughts
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `\░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜
EOC

Next, the Dockerfile adds the script and sets the entry point.

# write the Dockerfile
FROM debian:stretch
RUN apt-get update && apt-get install -y cowsay
# "cowsay" installs to /usr/games
ENV PATH $PATH:/usr/games
RUN echo '#!/bin/bash\ncowsay "${@:1}"' > /usr/bin/codsay && \
    chmod +x /usr/bin/codsay
COPY cod.cow /usr/share/cowsay/cows/default.cow

Now let's build and test the container locally.

docker build -t ghcr.io/bacalhau-project/examples/codsay:latest . 2> /dev/null
docker run --rm ghcr.io/bacalhau-project/examples/codsay:latest codsay I like swimming in data

Once your container is working as expected, you should push it to a public container registry. In this example, I'm pushing to GitHub's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64 architecture.

docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/bacalhau-project/examples/codsay:latest .

3. Running Your Custom Container on Bacalhau

Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.

The bacalhau docker run command strips the default entry point, so don't forget to run your entry point in the command line arguments.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/codsay:v1.0.0 \
    -- bash -c 'codsay Look at all this data')

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Download your job results directly by using bacalhau job get command.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID}  --output-dir results

View your job output

cat ./results/stdout

_______________________
< Look at all this data >
 -----------------------
   \
    \
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜

Support


Bacalhau Docker Image

How to use Bacalhau Docker Image for task management

This documentation explains how to use the Bacalhau Docker image for task management with the Bacalhau client.

Prerequisites

1. Check the version of Bacalhau client

docker run -t ghcr.io/bacalhau-project/bacalhau:latest version

The output is similar to:

12:00:32.427 | INF pkg/repo/fs.go:93 > Initializing repo at '/root/.bacalhau' for environment 'production'
CLIENT  SERVER  UPDATE MESSAGE 
v1.3.0  v1.4.0                 
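
If you prefer to use the image as a drop-in replacement for a locally installed client, one option (purely illustrative, not required) is a shell alias. Note that any results the client downloads will stay inside the container unless you also mount a local directory:

# wrap the Docker image so it can be invoked like a local CLI
alias bacalhau='docker run -it --rm ghcr.io/bacalhau-project/bacalhau:latest'
bacalhau version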

2. Run a Bacalhau Job

For example, to run an Ubuntu-based job that prints the message 'Hello from Docker Bacalhau':

bacalhau docker run \
        --id-only \
        --wait \
        ubuntu:latest \
        -- sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Structure of the command

  1. --id-only: Output only the job id

  2. --wait: Wait for the job to finish

  3. ubuntu:latest: the Ubuntu container image

  4. --: Separate Bacalhau parameters from the command to be executed inside the container

  5. sh -c 'uname -a && echo "Hello from Docker Bacalhau!"': The command executed inside the container

The command execution in the terminal is similar to:

j-6ffd54b8-e992-498f-9ee9-766ab09d5daa

j-6ffd54b8-e992-498f-9ee9-766ab09d5daa is a job ID, which represents the result of executing a command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store that in an environment variable so that we can reuse it later on (env: JOB_ID=j-6ffd54b8-e992-498f-9ee9-766ab09d5daa)

To view the details of the job, execute the following command:

bacalhau job describe j-6ffd54b8-e992-498f-9ee9-766ab09d5daa

The output is similar to:

ID            = j-6ffd54b8-e992-498f-9ee9-766ab09d5daa
Name          = j-6ffd54b8-e992-498f-9ee9-766ab09d5daa
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-09-08 14:33:19
Modified Time = 2024-09-08 14:33:20
Version       = 0

Summary
Completed = 1

Job History
 TIME                 REV.  STATE      TOPIC       EVENT         
 2024-09-08 14:33:19  1     Pending    Submission  Job submitted 
 2024-09-08 14:33:19  2     Running                              
 2024-09-08 14:33:20  3     Completed                            

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED     MODIFIED    COMMENT      
 e-bd5746b8  n-e002001e  Completed  Stopped  6     27m21s ago  27m21s ago  Accepted job 

Execution e-bd5746b8 History
 TIME                 REV.  STATE              TOPIC            EVENT        
 2024-09-08 14:33:19  1     New                                              
 2024-09-08 14:33:19  2     AskForBid                                        
 2024-09-08 14:33:19  3     AskForBidAccepted  Requesting Node  Accepted job 
 2024-09-08 14:33:19  4     AskForBidAccepted                                
 2024-09-08 14:33:19  5     BidAccepted                                      
 2024-09-08 14:33:20  6     Completed                                        

Standard Output
Linux 7d5c3dcc7fc2 6.5.0-1024-gcp #26~22.04.1-Ubuntu SMP Fri Jun 14 18:48:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Hello from Docker Bacalhau!

3. Submit a Job With Output Files

You always need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine.

The first part of this example should look familiar, except for the Docker commands.

bacalhau docker run \
        --id-only \
        --wait \
        --gpu 1 \
        ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 -- \
            python main.py --o ./outputs --p "A Docker whale and a cod having a conversation about the state of the ocean"

When a job is submitted, Bacalhau prints the related job_id (j-da29a804-3960-4667-b6e5-73f05e120117):

j-da29a804-3960-4667-b6e5-73f05e120117

4. Check the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list

When it reads Completed, that means the job is done, and you can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe j-da29a804-3960-4667-b6e5-73f05e120117

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result directory.

bacalhau job get ${JOB_ID} --output-dir result

After the download is complete, you should see the job output in the result directory.

Support

WebAssembly (Wasm) Workloads

Prerequisites and Limitations

  1. Supported WebAssembly System Interface (WASI) Bacalhau can run compiled Wasm programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.

  2. Networking Restrictions All ingress/egress networking is disabled; you won't be able to pull data/code/weights etc. from an external source. Wasm jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.

  3. Single-Threading There is no multi-threading as WASI does not expose any interface for it.

Onboarding Your Workload

Step 1: Replace network operations with filesystem reads and writes

If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:

  1. Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.

  2. Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.

  3. Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of Wasm jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.

By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.

You can specify additional or different output mounts using the -o flag.

Step 2: Configure your compiler to output WASI-compliant WebAssembly

You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.
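
For example, if your program happens to be written in Rust (an assumption for illustration only; any WASI-capable toolchain works), the build might look like this:

# add the WASI target and build a release .wasm
# (the output lands under target/wasm32-wasi/release/)
rustup target add wasm32-wasi
cargo build --release --target wasm32-wasi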

Step 3: Upload the input data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 4: Run your program

You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.

bacalhau wasm run

Run Locally Compiled Program:

If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:

bacalhau wasm run main.wasm

The program you specify will be uploaded to a Bacalhau storage node and will be publicly available if you are using the public demo network.

Alternative Program Specification:

You can use a Content IDentifier (CID) for a specific WebAssembly program.

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ

Input Data Specification:

Make sure to specify any input data using the --input flag.

bacalhau wasm run --input http://example.com

This ensures the necessary data is available for the program's execution.

Program arguments

You can give the Wasm program arguments by specifying them after the program path or CID. If the Wasm program is already compiled and located in the current directory, you can run it by adding arguments after the file name:

bacalhau wasm run echo.wasm hello world

For a specific WebAssembly program, run:

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ hello world

Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.

For example, instead of hard-coding your program to read from /inputs/data.txt, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run:

bacalhau wasm run prog.wasm /inputs/data.txt

Your language of choice should contain a standard way of reading program arguments that will work with WASI.

Environment variables

You can also specify environment variables using the -e flag.

bacalhau wasm run prog.wasm -e HELLO=world

Examples

Support

Running a Python Script

This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.

Prerequisites​

1. Running Python Locally​

# hello-world.py
print("Hello, world!")

Running the script to print out the output:

python3 hello-world.py

After the script has run successfully locally, we can now run it on Bacalhau.

2. Running a Bacalhau Job​

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py)

Structure of the command​

  1. bacalhau docker run: call to Bacalhau

  2. --id-only: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output

  3. --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".

  4. python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).

  5. --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  6. python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Running Trivial Python
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.10-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 /inputs/hello-world.py
    InputSources:
      - Target: /inputs
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py
            Path: /inputs/hello-world.py

The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:

bacalhau job run helloworld.yaml

3. Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

4. Viewing your Job Output​

To view the file, run the following command:

cat results/stdout

Support​

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

1. Sample Recommendation Dataset

We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat and movies.dat. After the dataset is downloaded, extract the zip and place ratings.dat and movies.dat into a folder called input:

# Extracting the downloaded zip file
unzip ml-1m.zip
#moving  ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

Create a requirements.txt file listing the Python libraries we'll be using:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and in it paste the following script

# content of the similar-movies.py

# Imports
import numpy as np
import pandas as pd
import argparse
from distutils.dir_util import mkpath
import warnings
warnings.filterwarnings("ignore")
# Read the files with pandas
data = pd.io.parsers.read_csv('input/ratings.dat',
names=['user_id', 'movie_id', 'rating', 'time'],
engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.io.parsers.read_csv('input/movies.dat',
names=['movie_id', 'title', 'genre'],
engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users

ratings_mat = np.ndarray(
shape=((np.max(data.movie_id.values)), np.max(data.user_id.values)),
dtype=np.uint8)
ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.values

# Normalise matrix (subtract mean off)

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute SVD

normalised_mat = ratings_mat - np.matrix(np.mean(ratings_mat, 1)).T
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)

# Calculate cosine similarity, sort by most similar, and return the top N.

def top_cosine_similarity(data, movie_id, top_n=10):
    # Movie id starts from 1
    index = movie_id - 1
    movie_row = data[index, :]
    magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
    similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
    sort_indexes = np.argsort(-similarity)
    return sort_indexes[:top_n]

# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
    print('Recommendations for {0}: \n'.format(
        movie_data[movie_data.movie_id == movie_id].title.values[0]))
    for id in top_indexes + 1:
        print(str(id), ' : ', movie_data[movie_data.movie_id == id].title.values[0])


parser = argparse.ArgumentParser(description='Personal information')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies',default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie',default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations',default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id # Grab an id from movies.dat
top_n = args.n

# k = 50
# # Grab an id from movies.dat
# movie_id = 1
# top_n = 10

sliced = evecs[:, :k] # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

  1. Read the files with pandas. The code uses Pandas to read data from the files ratings.dat and movies.dat.

  2. Create the ratings matrix of shape (m×u) with rows as movies and columns as user

  3. Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.

  4. Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.

  5. Calculate cosine similarity, sort by most similar, and return the top N.

  6. Select k principal components to represent the movies, a movie_id to find recommendations, and print the top_n results.

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.
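
For example, the flags defined in the script can be combined to ask for recommendations for a different movie (the values below are chosen arbitrarily and match the custom-parameter Bacalhau run later in this tutorial):

# recommendations for movie id 10, using 50 principal components, top 10 results
python similar-movies.py --k 50 --id 10 --n 10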

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

We use the python:3.8 Docker image as the base, add our similar-movies.py script and the input dataset directory to the image, copy in the requirements file, and then run the command to install the dependencies in the image.

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

hub-user with your Docker Hub username

repo-name with the name of the container; you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done using the same Docker Hub username, repo name, and tag that were used to build the image.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. jsace/python-similar-movies: the name of the Docker image we are using

  3. -- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. jsace/python-similar-movies: the name of the docker image we are using

  3. -- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

Running Pandas on Bacalhau

Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

In this tutorial example, we will run Pandas script on Bacalhau.

Prerequisite

1. Running Pandas Locally

To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.

To get started, you need to install the Pandas library from pip:

pip install pandas

Importing data from CSV to DataFrame

Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv() to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.

# read_csv.py
import pandas as pd

print(pd.read_csv("transactions.csv"))

The overall purpose of the command above is to read data from a CSV file (transactions.csv) using Pandas and print the resulting DataFrame.

To download the transactions.csv file, run:

wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

To output the content of the transactions.csv file, run:

cat transactions.csv

Running the script

Now let's run the script to read in the CSV file. The output will be a DataFrame object.

python3 read_csv.py

2. Ingesting Data

To run Pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.
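
For example, with a local IPFS node running, the script and dataset could be pinned together so that they can later be mounted as a single directory. This is an illustrative sketch; any IPFS pinning service works just as well:

# pin the script and dataset together so they can be mounted as one directory
mkdir -p files && cp read_csv.py transactions.csv files/
ipfs add -r files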

3. Running a Bacalhau Job

Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run command to do this:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
    -w /files \
    amancevice/pandas \
    -- python read_csv.py)

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. amancevice/pandas : Docker image with pandas installed.

  3. -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files: Mounting the uploaded dataset into the container. The -i flag allows us to mount a file or directory from IPFS into the container. It takes two arguments: the first is the IPFS CID (QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz) and the second is the path where it is mounted inside the container (/files). The -i flag can be used multiple times to mount multiple directories.

  4. -w /files: Sets the working directory to /files, the folder where the script and data are mounted, so that the relative path in the script resolves correctly.

  5. python read_csv.py: runs the read_csv.py Pandas script inside the container

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID}  --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

Running a Prolog Script

Introduction

Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

This tutorial is a quick guide on how to run a hello world script on Bacalhau.

Prerequisites

1. Running Locally​

To get started, install swipl

Create a file called helloworld.pl. The following script prints ‘Hello World’ to the stdout:
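
The script itself is not included in this export. A minimal version consistent with the hello_world goal used later in this tutorial (the exact message text is an assumption) could be written like this:

# write a minimal helloworld.pl (assumed content)
cat > helloworld.pl <<'EOF'
hello_world :- write('Hello World'), nl.
EOF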

Running the script to print out the output:

After the script has run successfully locally, we can now run it on Bacalhau.

Before running it on Bacalhau we need to upload it to IPFS.

Using the IPFS cli:

Run the command below to check if our script has been uploaded.

This command outputs the CID. Copy the CID of the file, which in our case is QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii

2. Running a Bacalhau Job

We will mount the script to the container using the -i flag: -i: ipfs://< CID >:/< name-of-the-script >.

To submit a job, run the following Bacalhau command:

Structure of the Command

  1. -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl : Sets the input data for the container.

  2. QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii is our CID which points to the helloworld.pl file on the IPFS network. This file will be accessible within the container.

  3. -- swipl -q -s helloworld.pl -g hello_world: instructs SWI-Prolog to load the program from the helloworld.pl file and execute the hello_world function in quiet mode:

    1. -q: running in quiet mode

    2. -s: load file as a script. In this case we want to run the helloworld.pl script

    3. -g: is the name of the function you want to execute. In this case its hello_world
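
The full submission command is not included in this export. Reconstructed from the pieces above, it would look roughly like this; the swipl Docker image and the --wait/--id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl \
    swipl \
    -- swipl -q -s helloworld.pl -g hello_world)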

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output

To view the file, run the following command:

Support

Building and Running your Custom R Containers on Bacalhau

Introduction

Quick script to run custom R container on Bacalhau:

Prerequisites

1. Running Prophet in R Locally

Open R studio or R-supported IDE. If you want to run this on a notebook server, then make sure you use an R kernel. Prophet is a CRAN package, so you can use install.packages to install the prophet package:

After installation is finished, you can download the example data that is stored in IPFS:

The code below instantiates the library and fits a model to the data.

Create a new file called Saturating-Forecasts.R and in it paste the following script:

This script performs time series forecasting using the Prophet library in R, taking input data from a CSV file, applying the forecasting model, and generating plots for analysis.

Let's have a look at the command below:

This command uses Rscript to execute the script that was created and written to the Saturating-Forecasts.R file.

The input parameters provided in this case are the names of input and output files:

example_wp_log_R.csv - the example data that was previously downloaded.

outputs/output0.pdf - the name of the file to save the first forecast plot.

outputs/output1.pdf - the name of the file to save the second forecast plot.
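
Putting those parameters together, the local run would look roughly like this (reconstructed from the description above):

# create the output folder and run the forecasting script
mkdir -p outputs
Rscript Saturating-Forecasts.R example_wp_log_R.csv outputs/output0.pdf outputs/output1.pdf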

2. Running R Prophet on Bacalhau

3. Containerize Script with Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

These commands specify how the image will be built, and what extra requirements will be included. We use r-base as the base image and then install the prophet package. We then copy the Saturating-Forecasts.R script into the container and set the working directory to the R folder.

Build the container

We will run docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

In our case:

4. Running a Job on Bacalhau

The following command runs the containerized R script on Bacalhau and generates the results in the outputs directory. It takes approximately 2 minutes to run.

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv: Mounting the uploaded dataset into the container. It takes two arguments: the first is the IPFS CID (QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt) and the second is the path where it is mounted inside the container (/example_wp_log_R.csv)

  3. ghcr.io/bacalhau-project/examples/r-prophet:0.0.2: the name and the tag of the docker image we are using

  4. /example_wp_log_R.csv : path to the input dataset

  5. /outputs/output0.pdf, /outputs/output1.pdf: paths to the output

  6. Rscript Saturating-Forecasts.R: execute the R script
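
Reconstructed from the structure above, the submission command would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R /example_wp_log_R.csv /outputs/output0.pdf /outputs/output1.pdf)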

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

6. Viewing your Job Output

To view the file, run the following command:

You can't natively display PDFs in notebooks, so here are some static images of the PDFs:

output0.pdf

output1.pdf

Support

Run CUDA programs on Bacalhau

What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.

In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations.

Advantage of GPU over CPU

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

Computations like matrix multiplication could be done much faster on GPU than on CPU

Prerequisite

1. Running CUDA locally

You'll need to have the following installed:

  1. NVIDIA GPU

  2. CUDA drivers installed

  3. nvcc installed

Checking if nvcc is installed:
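
The check itself is simply the compiler's version query (the exact output depends on your CUDA installation):

nvcc --version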

Downloading the programs:

Viewing the programs

  1. 00-hello-world.cu:

This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.

  2. 02-cuda-hello-world-faster.cu:

In this example we utilize Vector addition using CUDA and allocate the memory in advance and copy the memory to the GPU using cudaMemcpy so that it can utilize the HBM (High Bandwidth memory of the GPU). Compilation and execution occur faster (1.39 seconds) compared to the previous example (8.67 seconds).

2. Running a Bacalhau Job

To submit a job, run the following Bacalhau command:

Structure of the Commands

  1. bacalhau docker run: call to Bacalhau

  2. -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu: URL of the input file, which is downloaded from the URL source and mounted as an input volume.

  3. nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04: Docker container for executing CUDA programs (you need to choose the right CUDA docker container). The container should have the "devel" tag.

  4. nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu: compile with the nvcc compiler and save the binary to the outputs directory as hello.

  5. Note that there is a ; between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello'. The ";" symbol allows executing multiple commands sequentially in a single line.

  6. ./outputs/hello: execute the hello binary. You can combine compilation and execution commands.

Note that the CUDA version will need to be compatible with the graphics card on the host machine
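
Reconstructed from the structure above, the full command would look roughly like this; the --gpu 1, --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --wait \
    --id-only \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello')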

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output

To view the file, run the following command:

Support

Running Jupyter Notebooks on Bacalhau

Introduction

Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

  1. Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.

  2. Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

  1. /inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.

  2. -i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.

  3. https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.

  4. jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.

  5. --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  6. jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.

  7. --execute: This flag tells nbconvert to execute the notebook and store the results in the output file.

  8. --to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.

  9. --output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.
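
Reconstructed from the breakdown above, the submission would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb /inputs/hello.ipynb)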

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

After the download has finished you can see the contents in the results directory, running the command below:

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

  1. Install Docker on your local machine.

  2. Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

This Dockerfile creates a Docker image based on the official TensorFlow GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb) and a requirements.txt file. It then upgrades pip and installs Python packages from the requirements.txt file, along with scikit-learn. The resulting image provides an environment ready for running the mnist.ipynb notebook with TensorFlow and scikit-learn, as well as other specified dependencies.
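
As a sketch of what such a Dockerfile could contain (the exact base image tag is an assumption; pin whichever TensorFlow GPU image you prefer):

# write a Dockerfile matching the description above (illustrative only)
cat > Dockerfile <<'EOF'
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /
RUN apt-get update
COPY mnist.ipynb requirements.txt ./
RUN pip install --upgrade pip && \
    pip install -r requirements.txt scikit-learn
EOF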

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it with your DockerHub username and the name "your-dockerhub-username/jupyter-mnist-tensorflow".

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

Structure of the command

  1. --gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.

  2. -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.

  3. jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.

    --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  4. jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.
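
Reconstructed from the breakdown above, the submission would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --wait \
    --id-only \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)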

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

After the download has finished you can see the contents in the results directory, running the command below:

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

Running a Simple R Script on Bacalhau

You can use official Docker containers for each language, like R or Python. In this example, we will use the official R container and run it on Bacalhau.

In this tutorial example, we will run a "hello world" R script on Bacalhau.

Prerequisites​

1. Running an R Script Locally​

Run the script:
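
The original code blocks are not included in this export. A minimal hello.R and the local run command (the exact script content is an assumption) might look like this:

# write a minimal hello.R and run it locally
cat > hello.R <<'EOF'
print("hello world")
EOF
Rscript hello.R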

2. Running a Job on Bacalhau​

Now it's time to run the script on Bacalhau:

Structure of the command​

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R: Mounting the uploaded script into the container at /hello.R. It takes two arguments, the first is the IPFS CID (QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk) and the second is the file path within the container (/hello.R)

  3. r-base: docker official image we are using

  4. Rscript hello.R: execute the R script
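
Putting the pieces above together, the submission command would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)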

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

Declarative job description​

The job description should be saved in .yaml format, e.g. rhello.yaml, and then run with the command:

3. Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output​

To view the file, run the following command:

Futureproofing your R Scripts​

You can generate the job request using bacalhau job describe with the --spec flag. This will allow you to re-run that job in the future:

Support​

Scripting Bacalhau with Python

Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.

This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.

Prerequisite

Executing Bacalhau Jobs with Python Scripts

To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.

Now let's create a file called bacalhau.py. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.

This code has a few interesting features:

  1. Change the value in the main call (main("hashes.txt", 10)) to change the number of jobs to execute.

  2. Because all jobs are complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True loop is used to monitor the status of jobs and wait for them to complete.

  3. When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5) loop in the getResultsFromJob function involves retrying the bacalhau job get operation if it fails to complete successfully.

Let's run it!

Hopefully, the results directory contains all the combined results from the jobs we just executed. Here we're expecting to see CSV files:

Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.

Next Steps

You might also be interested in other examples, such as Analysing Data with Python Pandas.

Support


Reject jobs that don't specify any input data.

Accept jobs that require network connections.

You can check out this example tutorial on how to work with custom containers in Bacalhau to see how we used all these steps together.

You can also check out a list of example public containers used by the Bacalhau team.

All ingress/egress networking is limited as described in the networking documentation. You won't be able to pull data/code/weights/etc. from an external source.

You can specify which directory the data is written to with the CLI flag.

You can specify which directory the data is written to with the CLI flag.

At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build your image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the given task.

Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.

Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint flag to bacalhau docker run.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Check out the release notes to learn about all the changes to the WebUI and more.

For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.

Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Let's Encrypt Certificate Authority.

Using the --autocert [hostname] parameter to the CLI (in the serve and devstack commands), a certificate is obtained automatically from Let's Encrypt. The provided hostname should be a comma-separated list of hostnames, and they should all be publicly resolvable, as Let's Encrypt will attempt to connect to the server to verify ownership (using the ACME HTTP-01 challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards certificates are cached in the bacalhau repository.
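For example, a requester node that should serve its API over TLS could be started as in the sketch below; the hostname is a placeholder and must be replaced with a domain that resolves to your server:

# bacalhau.example.com is a placeholder hostname for your publicly resolvable domain
bacalhau serve --node-type=requester --autocert bacalhau.example.com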

If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here, although there is a helpful script in the Bacalhau GitHub repository which should provide a good starting point.

Support for the embedded IPFS node was removed in v1.4.0 to streamline communication and reduce overhead. Therefore, in order to use a private IPFS network, it is now necessary to create it yourself and then connect your Bacalhau nodes to it. This manual describes how to:

Configure your Bacalhau network to use the private IPFS network

Install on all nodes

Install

In this manual, Kubo (the earliest and most widely used implementation of IPFS) will be used, so first of all, Go should be installed.

See the Go Downloads page for the latest Go version.

The next step is to download and install Kubo. Select and download the appropriate version for your system. It is recommended to use the latest stable version.

Use the bacalhau job describe command to view job execution results:

Use the bacalhau job get command to download job results. In this particular case, the ipfs publisher was used, so the get command will print the CID of the job results:

Need Support?

For questions and feedback, please reach out in our Slack.

To get started, you need to install the Bacalhau client; see more information here.

This example requires Docker. If you don't have Docker installed, you can install it from here. Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau supports running programs that are compiled to WebAssembly (Wasm). With the Bacalhau client, you can upload Wasm programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

For example, Rust users can specify the wasm32-wasi target with rustup and cargo to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.
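As a minimal sketch of that toolchain setup (standard rustup and cargo invocations, independent of Bacalhau):

# add the WASI target once, then build the project for it
rustup target add wasm32-wasi
cargo build --target wasm32-wasi --release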


See the Rust example for a workload that leverages WebAssembly support.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Install the Bacalhau CLI in Docker.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau supports running programs that are compiled to WebAssembly (Wasm). With the Bacalhau client, you can upload Wasm programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

For example, Rust users can specify the wasm32-wasi target with rustup and cargo to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.

Consider creating your own private network.

See the Rust example for a workload that leverages WebAssembly support.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

We'll be using a very simple Python script that displays the traditional first greeting. Create a file called hello-world.py:

To submit a workload to Bacalhau you can use the bacalhau docker run command. This command allows passing input data into the container using volumes, we will be using the --input URL:path for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument.
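For illustration, such a job could be submitted as in the sketch below; the URL is a placeholder for wherever you host hello-world.py, and python:3.10-slim is just one suitable public image:

# <your-host> is a placeholder; the file is mounted at /inputs/hello-world.py by default
bacalhau docker run \
    --input https://<your-host>/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py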

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

Download the MovieLens 1M dataset from this link: https://files.grouplens.org/datasets/movielens/ml-1m.zip

For further reading on how the script works, see Simple Movie Recommender Using SVD (Alyssa).

See more information on how to containerize your script/app here.

Replace hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you are interested in finding out more about how to ingest your data into IPFS, please see the data ingestion guide.

We've already uploaded the script and data to IPFS to the following CID: QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.tech or w3s.link.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

Since the data uploaded to IPFS isn't pinned, we will need to do that manually. Check this information on how to pin your data. We recommend using NFT.Storage.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

This example will walk you through building Time Series Forecasting using Prophet. Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

To get started, you need to install the Bacalhau client; see more information here.

To use Bacalhau, you need to package your code in an appropriate format. The developers have already pushed a container for you to use, but if you want to build your own, you can follow the steps below. You can view a dedicated container example in the documentation.

Replace hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

To install R, follow these instructions from Hands-On Programming with R (Appendix A: Installing R and RStudio). After R and RStudio are installed, create and run a script called hello.R:

Next, upload the script to your public storage (in our case, IPFS). We've already uploaded the script to IPFS and the CID is: QmVHSWhAL7fNkRiHfoEJGeMYjaYZUsKHvix7L54SptR8ie. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

sudo add-apt-repository ppa:swi-prolog/stable
sudo apt-get update
sudo apt-get install swi-prolog
# helloworld.pl
hello_world :- write('Hello World'), nl,
               halt.
swipl -q -s helloworld.pl -g hello_world
wget https://dist.ipfs.io/go-ipfs/v0.4.2/go-ipfs_v0.4.2_linux-amd64.tar.gz
tar xvfz go-ipfs_v0.4.2_linux-amd64.tar.gz
mv go-ipfs/ipfs /usr/local/bin/ipfs
ipfs init
ipfs cat /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme
ipfs config Addresses.Gateway /ip4/127.0.0.1/tcp/8082
ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
nohup ipfs daemon > startup.log &
ipfs add helloworld.pl
export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl \
    --wait \
    --id-only \
    swipl \
    -- swipl -q -s helloworld.pl -g hello_world)
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
bacalhau docker run \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf"
R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"
wget https://w3s.link/ipfs/QmZiwZz7fXAvQANKYnt7ya838VPpj4agJt5EDvRYp3Deeo/example_wp_log_R.csv
mkdir -p outputs
mkdir -p R
# content of the Saturating-Forecasts.R

# Library Inclusion
library('prophet')


# Command Line Arguments:
args = commandArgs(trailingOnly=TRUE)
args

input = args[1]
output = args[2]
output1 = args[3]


# File Path Processing:
I <- paste("", input, sep ="")

O <- paste("", output, sep ="")

O1 <- paste("", output1 ,sep ="")


# Read CSV Data:
df <- read.csv(I)


# Forecasting 1:
df$cap <- 8.5
m <- prophet(df, growth = 'logistic')

future <- make_future_dataframe(m, periods = 1826)
future$cap <- 8.5
fcst <- predict(m, future)
pdf(O)
plot(m, fcst)
dev.off()

# Forecasting 2:
df$y <- 10 - df$y
df$cap <- 6
df$floor <- 1.5
future$cap <- 6
future$floor <- 1.5
m <- prophet(df, growth = 'logistic')
fcst <- predict(m, future)
pdf(O1)
plot(m, fcst)
dev.off()
Rscript Saturating-Forecasts.R "example_wp_log_R.csv" "outputs/output0.pdf" "outputs/output1.pdf"
FROM r-base
RUN R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN mkdir /R
RUN mkdir /outputs
COPY Saturating-Forecasts.R R
WORKDIR /R
docker build -t <hub-user>/<repo-name>:<tag> .
docker buildx build --platform linux/amd64 -t ghcr.io/bacalhau-project/examples/r-prophet:0.0.1 .
docker push <hub-user>/<repo-name>:<tag>
docker push ghcr.io/bacalhau-project/examples/r-prophet:0.0.1
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf")
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results
ls results/outputs
nvcc --version
mkdir inputs outputs
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/00-hello-world.cu
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu
# View the contents of the standard C++ program
cat inputs/00-hello-world.cu

# Measure the time it takes to compile and run the program
nvcc -o ./outputs/hello ./inputs/00-hello-world.cu; ./outputs/hello
# View the contents of the CUDA program with vector addition
cat inputs/02-cuda-hello-world-faster.cu

# Remove any previous output
rm -rf outputs/hello

# Measure the time for compilation and execution
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello
export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    --id-only \
    --wait \
    nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr  -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello ')
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs

hello_output.nbconvert.ipynb
# Use the TensorFlow GPU image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

RUN apt-get update -y

COPY mnist.ipynb .
# Copy the requirements file
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install -U scikit-learn
docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .
docker push your-dockerhub-username/jupyter-mnist-tensorflow
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs
# hello.R
print("hello world")
Rscript hello.R
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)
name: Running a Simple R Script
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: r-base:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c        
          - Rscript /hello.R
    InputSources:
      - Target: "/"
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/main/scripts/hello.R
            Path: /hello.R
bacalhau job run rhello.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe  ${JOB_ID}
rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results
cat results/stdout
bacalhau job describe ${JOB_ID} --spec > job.yaml
cat job.yaml
# write the following to the file hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm
# write the following to the file bacalhau.py
import json, glob, os, multiprocessing, shutil, subprocess, tempfile, time

# checkStatusOfJob checks the status of a Bacalhau job
def checkStatusOfJob(job_id: str) -> str:
    assert len(job_id) > 0
    p = subprocess.run(
        ["bacalhau", "list", "--output", "json", "--id-filter", job_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    r = parseJobStatus(p.stdout)
    if r == "":
        print("job status is empty! %s" % job_id)
    elif r == "Completed":
        print("job completed: %s" % job_id)
    else:
        print("job not completed: %s - %s" % (job_id, r))

    return r


# submitJob submits a job to the Bacalhau network
def submitJob(cid: str) -> str:
    assert len(cid) > 0
    p = subprocess.run(
        [
            "bacalhau",
            "docker",
            "run",
            "--id-only",
            "--wait=false",
            "--input",
            "ipfs://" + cid + ":/inputs/data.tar.gz",
            "ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if p.returncode != 0:
        print("failed (%d) job: %s" % (p.returncode, p.stdout))
    job_id = p.stdout.strip()
    print("job submitted: %s" % job_id)

    return job_id


# getResultsFromJob gets the results from a Bacalhau job
def getResultsFromJob(job_id: str) -> str:
    assert len(job_id) > 0
    temp_dir = tempfile.mkdtemp()
    print("getting results for job: %s" % job_id)
    for i in range(0, 5): # try 5 times
        p = subprocess.run(
            [
                "bacalhau",
                "get",
                "--output-dir",
                temp_dir,
                job_id,
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if p.returncode == 0:
            break
        else:
            print("failed (exit %d) to get job: %s" % (p.returncode, p.stdout))

    return temp_dir


# parseJobStatus parses the status of a Bacalhau job
def parseJobStatus(result: str) -> str:
    if len(result) == 0:
        return ""
    r = json.loads(result)
    if len(r) > 0:
        return r[0]["State"]["State"]
    return ""


# parseHashes splits lines from a text file into a list
def parseHashes(filename: str) -> list:
    assert os.path.exists(filename)
    with open(filename, "r") as f:
        hashes = f.read().splitlines()
    return hashes


def main(file: str, num_files: int = -1):
    # Use multiprocessing to work in parallel
    count = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=count) as pool:
        hashes = parseHashes(file)[:num_files]
        print("submitting %d jobs" % len(hashes))
        job_ids = pool.map(submitJob, hashes)
        assert len(job_ids) == len(hashes)

        print("waiting for jobs to complete...")
        while True:
            job_statuses = pool.map(checkStatusOfJob, job_ids)
            total_finished = sum(map(lambda x: x == "Completed", job_statuses))
            if total_finished >= len(job_ids):
                break
            print("%d/%d jobs completed" % (total_finished, len(job_ids)))
            time.sleep(2)

        print("all jobs completed, saving results...")
        results = pool.map(getResultsFromJob, job_ids)
        print("finished saving results")

        # Do something with the results
        shutil.rmtree("results", ignore_errors=True)
        os.makedirs("results", exist_ok=True)
        for r in results:
            path = os.path.join(r, "outputs", "*.csv")
            csv_file = glob.glob(path)
            for f in csv_file:
                print("moving %s to results" % f)
                shutil.move(f, "results")

if __name__ == "__main__":
    main("hashes.txt", 10)
python3 bacalhau.py
ls results

transactions_00000000_00049999.csv  transactions_00150000_00199999.csv
transactions_00050000_00099999.csv  transactions_00200000_00249999.csv
transactions_00100000_00149999.csv  transactions_00250000_00299999.csv

GPU Workloads Setup

Bacalhau supports GPU workloads. In this tutorial, learn how to run a job using GPU workloads with the Bacalhau client.

Prerequisites

  • The Bacalhau network must have an executor node with a GPU exposed

  • Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node

Usage

To submit a job request, use the --gpu flag under the docker run command to select the number of GPUs your job requires. For example:

bacalhau docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Limitations

The following limitations currently exist within Bacalhau. Bacalhau supports:

  • NVIDIA, Intel or AMD GPUs only

  • GPUs for the Docker executor only

Google Cloud Marketplace

Introduction

Well done on deploying your Bacalhau cluster! Now that the deployment is finished, this document will help with the next steps. It provides important information on how to interact with and manage the cluster. You'll find details on the outputs from the deployment, including how to set up and connect a Bacalhau client, and how to authorize and connect a Bacalhau compute node to the cluster. This guide gives you everything needed to start using your Bacalhau setup.

Deployment Outputs

After completing the deployment, several outputs will be presented. Below is a description of each output and instructions on how to configure your Bacalhau node using them.

Requester Public IP

Description: The IP address of the Requester node for the deployment and the endpoint where the Bacalhau API is served.

Usage: Configure the Bacalhau Client to connect to this IP address in the following ways:

  1. Setting the --api-host CLI Flag:

    $ bacalhau --api-host $REQUESTER_IP [CMD]
  2. Setting the BACALHAU_API_HOST environment variable:

    $ export BACALHAU_API_HOST=$REQUESTER_IP
    $ bacalhau [CMD]
  3. Modifying the Bacalhau Configuration File:

    $ bacalhau config set node.clientapi.host $REQUESTER_IP
    $ bacalhau [CMD]

Requester API Token

Description: The token used to authorize a client when accessing the Bacalhau API.

Usage: The Bacalhau client prompts for this token when a command is first issued to the Bacalhau API. For example:

$ bacalhau agent version
token: $REQUESTER_API_TOKEN

Compute API Token

Description: The token used to authorize a Bacalhau Compute node to connect to the Requester Node.

Usage: A Bacalhau Compute node can be connected to the Requester Node using the following command:

bacalhau serve --node-type=compute --orchestrators=nats://$COMPUTE_API_TOKEN@$REQUESTER_IP

Write a config.yaml

How to write the config.yaml file to configure your nodes

On installation, Bacalhau creates a .bacalhau directory that includes a config.yaml file tailored for your specific settings. This configuration file is the central repository for custom settings for your Bacalhau nodes.

When initializing a Bacalhau node, the system determines its configuration by following a specific hierarchy. First, it checks the default settings, then the config.yaml file, followed by environment variables, and finally, any command line flags specified during execution. Configurations are set and overridden in that sequence. This layered approach allows the default Bacalhau settings to provide a baseline, while environment variables and command-line flags offer added flexibility. However, the config.yaml file offers a reliable way to predefine all necessary settings before node creation across environments, ensuring consistency and ease of management.
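As a quick illustration of that precedence (the IP addresses are placeholders, and the dotted key follows the configuration options listed below), the command-line flag wins over the environment variable, which in turn wins over the value stored in config.yaml:

bacalhau config set API.Host 10.0.0.5       # persisted in config.yaml
export BACALHAU_API_HOST=10.0.0.6           # overrides the config.yaml value
bacalhau --api-host 10.0.0.7 agent version  # the flag overrides both for this invocation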

Modifications to the config.yaml file are not dynamically applied to existing nodes. A restart of the Bacalhau node is required for any changes to take effect.

Your config.yaml file starts off empty. However, you can see all available settings using the following command:

bacalhau config list

This command showcases over a hundred configuration parameters related to users, security, metrics, updates, and node configuration, providing a comprehensive overview of the customization options available for your Bacalhau setup.

Let’s go through the different options and how your configuration file is structured.

Config.yaml Structure

The bacalhau config list command displays your configuration paths, segmented with periods to indicate each part you are configuring.

Consider these configuration settings: NameProvider and Labels. These settings set the name and the labels for your Bacalhau node.

In your config.yaml, these settings will be formatted like this:

labels:
    NodeType: WebServer
    OS: Linux
nameprovider: puuid
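You can achieve the same result without editing the file by hand. Assuming bacalhau config set accepts the dotted keys shown by bacalhau config list, and that map values are passed as comma-separated key=value pairs, a sketch would be:

bacalhau config set NameProvider puuid
bacalhau config set Labels "NodeType=WebServer,OS=Linux"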

Configuration Options

Here are your Bacalhau configuration options in alphabetical order:

Configuration Option
Description

API.Auth.AccessPolicyPath

String path to where your security policy is stored

API.Auth.Methods

Set authentication method for your Bacalhau network

API.Host

The host for the client and server to communicate on (via REST). Ignored if BACALHAU_API_HOST environment variable is set

API.Port

The port for the client and server to communicate on (via REST). Ignored if BACALHAU_API_PORT environment variable is set

API.TLS.AutoCert

Domain for automatic certificate generation

API.TLS.AutoCertCachePath

The directory where the autocert process will cache certificates to avoid rate limits

API.TLS.CAFile

The path to the Certificate Authority certificate file used by your node clients when self-signed certificates are in play

API.TLS.KeyFile

Specifies path to the TLS Private Key file

API.TLS.Insecure

Boolean binary indicating if the client TLS is insecure, when true instructs the client to use HTTPS (TLS), but not to attempt to verify the certificate

API.TLS.SelfSigned

Boolean indicating if a self-signed security certificate is being used

API.TLS.UseTLS

Boolean indicating if TLS should be used for client connections

Compute.Auth.Token

Token specifies the key for compute nodes to be able to access the orchestrator.

Compute.AllocatedCapacity.CPU

Total amount of CPU the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.Disk

Total amount of disk the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.GPU

Total amount of GPU the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.Memory

Total amount of memory the system can use at one time in aggregate for all jobs

Compute.AllowListedLocalPaths

AllowListedLocalPaths specifies a list of local file system paths that the compute node is allowed to access

Compute.Heartbeat.Interval

How often the compute node will send a heartbeat to the requester node to let it know that the compute node is still alive. This should be less than the requester's configured heartbeat timeout to avoid flapping.

Compute.Heartbeat.InfoUpdateInterval

The frequency with which the compute node will send node info (including current labels) to the controlling requester node

Compute.Heartbeat.ResourceUpdateInterval

How often the compute node will send current resource availability to the requester node

Compute.Orchestrators

Comma-separated list of orchestrators to connect to. Applies to compute nodes

DataDir

DataDir specifies a location on disk where the bacalhau node will maintain state

DisableAnalytics

When set to true disables Bacalhau from sharing anonymous user data with Expanso

JobDefaults.Batch.Priority

Priority specifies the default priority allocated to a batch job. This value is used when the job hasn't explicitly set its priority requirement

JobDefaults.Batch.Task.Publisher.Params

Params specifies the publisher configuration data

JobDefaults.Batch.Task.Publisher.Type

Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.

JobDefaults.Batch.Task.Resources.CPU

Sets default CPU resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.CPU

Sets default CPU resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.CPU

Sets default CPU resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.CPU

Sets default CPU resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.Disk

Sets default disk resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.Disk

Sets default disk resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.Disk

Sets default disk resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.Disk

Sets default disk resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.GPU

Sets default GPU resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.GPU

Sets default GPU resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.GPU

Sets default GPU resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.GPU

Sets default GPU resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.Memory

Sets default memory resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.Memory

Sets default memory resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.Memory

Sets default memory resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.Memory

Sets default memory resource limits for service jobs on your Compute node

JobDefaults.Ops.Task.Publisher.Params

Params specifies the publisher configuration data

JobDefaults.Ops.Task.Publisher.Type

Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.

JobDefaults.Service.Priority

Priority specifies the default priority allocated to a service job

JobDefaults.Daemon.Priority

Priority specifies the default priority allocated to a daemon job

JobDefaults.Ops.Priority

Priority specifies the default priority allocated to an ops job

JobAdmissionControl.Locality

Sets job selection policy based on where the data for the job is located. ‘local’ or ‘anywhere’

JobAdmissionControl.ProbeExec

Use the result of an executed external program to decide if a job should be accepted. Overrides data locality settings

JobAdmissionControl.ProbeHTTP

Use the result of a HTTP POST to decide if a job should be accepted. Overrides data locality settings

JobAdmissionControl.RejectStatelessJobs

Boolean signifying if jobs that don’t specify any data should be rejected

JobDefaults.Batch.Task.Timeouts.ExecutionTimeout

Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined

JobDefaults.Ops.Task.Timeouts.ExecutionTimeout

Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined

JobDefaults.Batch.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on

JobDefaults.Ops.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on

Publishers.Types.Local.Address

The address for the local publisher's server to bind to

Publishers.Types.Local.Port

The port for the local publisher's server to bind to (default: 6001)

Logging.LogDebugInfoInterval

The duration interval your compute node should generate logs on the running job executions

Logging.Mode

Mode specifies the logging mode. One of: default, json.

Logging.Level

Level sets the logging level. One of: trace, debug, info, warn, error, fatal, panic.

Engines.Disabled

List of Engine types to disable

Engines.Types.Docker.ManifestCache.TTL

The default time-to-live for each record in the manifest cache

Engines.Types.Docker.ManifestCache.Refresh

Refresh specifies the refresh interval for cache entries.

Engines.Types.Docker.ManifestCache.Size

Specifies the number of items that can be held in the manifest cache

FeatureFlags.ExecTranslation

ExecTranslation enables the execution translation feature

Publishers.Disabled

List of Publisher types to disable

Publishers.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS

InputSources.Disabled

List of Input Source types to disable

InputSources.MaxRetryCount

MaxRetryCount specifies the maximum number of attempts for reading from a storage

InputSources.ReadTimeout

ReadTimeout specifies the maximum time allowed for reading from a storage

InputSources.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS - to be used as input source

ResultDownloaders.Timeout

Timeout specifies the maximum time allowed for a download operation.

ResultDownloaders.Disabled

Disabled is a list of downloaders that are disabled

ResultDownloaders.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS

Labels

List of labels to apply to the node that can be used for node selection and filtering

NameProvider

The name provider to use to generate the node name

Orchestrator.Auth.Token

Token specifies the key, which Orchestrator node expects from the Compute node to use to connect to it

Orchestrator.Advertise

Address to advertise to compute nodes to connect to

Orchestrator.Cluster.Advertise

Address to advertise to other orchestrators to connect to

Orchestrator.Cluster.Name

Name of the cluster to join

Orchestrator.Cluster.Peers

Comma-separated list of other orchestrators to connect to form a cluster

Orchestrator.Cluster.Port

Port to listen for connections from other orchestrators to form a cluster

Orchestrator.Port

Port to listen for connections from other nodes. Applies to orchestrator nodes

Orchestrator.NodeManager.DisconnectTimeout

This is the time period after which a compute node is considered to be disconnected. If the compute node does not deliver a heartbeat every DisconnectTimeout then it is considered disconnected

Orchestrator.EvaluationBroker.MaxRetryCount

Maximum retry count for the evaluation broker

Orchestrator.EvaluationBroker.VisibilityTimeout

Visibility timeout for the evaluation broker

Orchestrator.Scheduler.HousekeepingInterval

Duration between Bacalhau housekeeping runs

Orchestrator.Scheduler.HousekeepingTimeout

Specifies the maximum time allowed for a single housekeeping run

JobAdmissionControl.AcceptNetworkedJobs

Boolean signifying if jobs that specify networking should be accepted

Orchestrator.NodeManager.ManualApproval

Boolean signifying if new nodes should only be manually approved to your network. Default is false

Orchestrator.Scheduler.QueueBackoff

QueueBackoff specifies the time to wait before retrying a failed job.

Publishers.Types.S3.PreSignedURLDisabled

Boolean deciding if a secure S3 URL should be generated and used. Default false, Disabled if true.

Publishers.Types.S3.PreSignedURLExpiration

Defined expiration interval for your secure S3 urls

FeatureFlags.ExecTranslation

Whether jobs should be translated at the requester node or not. Default: false

Orchestrator.Scheduler.WorkerCount

Number of workers that should be generated under your requester node

Orchestrator.Host

Host specifies the hostname or IP address on which the Orchestrator server listens for compute node connections

Orchestrator.Port

Port specifies the port number on which the Orchestrator server listens for compute node connections.

Orchestrator.Enabled

Enabled indicates whether the orchestrator node is active and available for job submission.

Compute.Enabled

Enabled indicates whether the compute node is active and available for job execution.

StrictVersionMatch

StrictVersionMatch indicates whether to enforce strict version matching

UpdateConfig.Interval

The frequency with which your system checks for version updates. When set to 0 update checks are not performed.

WebUI.Backend

Backend specifies the address and port of the backend API server. If empty, the Web UI will use the same address and port as the API server

WebUI.Enabled

Enabled indicates whether the Web UI is enabled

WebUI.Listen

Listen specifies the address and port on which the Web UI listens


Pinning Data

How to pin data to public storage

If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

Basic steps

To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access your files. Then you need to upload the files: services usually provide a web interface, a CLI and code samples for integration into your application. Once you upload a file, you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9. Now you can access the pinned data from your jobs via this CID.
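Once the data is pinned, mounting it into a job works like any other CID. A minimal sketch, reusing the illustrative CID above and the public alpine image:

bacalhau docker run \
    --input ipfs://QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9:/inputs/data \
    alpine -- ls -l /inputs/data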

Copy Data from URL to Public Storage

To upload a file from a URL we will use the bacalhau docker run command.

bacalhau docker run \
    --id-only \
    --wait \
    --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md \
    ghcr.io/bacalhau-project/examples/upload:v1

The job has been submitted and Bacalhau has printed out the related job id.

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau using docker executor

  2. --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md: URL path of the input data volumes downloaded from a URL source.

  3. ghcr.io/bacalhau-project/examples/upload:v1: the name and tag of the docker image we are using

The bacalhau docker run command takes advantage of the --input parameter. This will download a file from a public URL and place it in the /inputs directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.

Job status: You can check the status of the job using bacalhau job list, processing the JSON output with jq:

bacalhau job list $JOB_ID --output=json | jq '.[0].Status.JobState.Nodes[] | .Shards."0" | select(.RunOutput)'

When the job status is Published or Completed, that means the job is done, and we can get the results using the job ID.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe  $JOB_ID 

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.

rm -rf results && mkdir ./results
bacalhau job get --output-dir ./results $JOB_ID 

Each job result contains an outputs subfolder and exitCode, stderr and stdout files with relevant content. To view the execution logs, execute the following:

head -n 15 ./results/stdout

And to view the job execution result (README.md file in the example case), which was saved as a job output, execute:

tail ./results/outputs/README.md

To get the output CID from a completed job, run the following command:

bacalhau job list $JOB_ID --output=json | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID'

The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.
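For example, you can wrap the command above in an export (this assumes jq is installed and the job has already published its results):

export CID=$(bacalhau job list $JOB_ID --output=json \
    | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID')
echo $CID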

Now that we have the CID, we can use it in a new job. This time we will use the --input parameter to tell Bacalhau to use the CID we just uploaded.

In this case, the only goal of our job is just to list the contents of the /inputs directory. You can see that the "input" data is located under /inputs/outputs/README.md.

bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://$CID \
    ubuntu -- \
    bash -c "set -x; ls -l /inputs; ls -l /inputs/outputs; cat /inputs/outputs/README.md"

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Running a Job over S3 data

Here is a quick tutorial on how to copy data from S3 to public storage. In this tutorial, we will scrape all the links from a public AWS S3 bucket and then copy the data to IPFS using Bacalhau.

bacalhau docker run \
    -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
    --id-only \
    --wait \
    alpine \
    -- sh -c "cp -r /inputs/* /outputs/"

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01 from the bucket noaa-goes16 in us-east-1 region, and mount the objects under /inputs path inside the docker job.

  3. -- sh -c "cp -r /inputs/* /outputs/": copies all files under /inputs to /outputs, which by default is the results output directory; all of its content will be published to the specified destination, which is IPFS by default.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get $JOB_ID --output-dir results # Download the results

When the download is completed, the results of the job will be present in the directory. To view them, run the following command:

ls -1 results/outputs

{
  "NextToken": "",
  "Results": [
    {
      "Type": "s3PreSigned",
      "Params": {
        "PreSignedURL": "https://bacalhau-test-datasets.s3.eu-west-1.amazonaws.com/integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUEMPQ7JFSLGEPHJG%2F20240129%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240129T060142Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=cea00578ae3b03a1b52dba2d65a1bab40f1901fb7cd4ee1a0a974dc05b595f2e",
        "SourceSpec": {
          "Bucket": "bacalhau-test-datasets",
          "ChecksumSHA256": "1tlbgo+q0TlQhJi8vkiWnwTwPu1zenfvTO4qW1D5yvI=",
          "Endpoint": "",
          "Filter": "",
          "Key": "integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz",
          "Region": "eu-west-1",
          "VersionID": "oS7n.lY5BYHPMNOfbBS1n5VLl4ppVS4h"
        }
      }
    }
  ]
}

First you need to install jq (if it is not already installed) to process JSON:

sudo apt update
sudo apt install jq

To extract the CIDs from the output JSON, execute the following:

bacalhau job describe ${JOB_ID} --json \
| jq -r '.State.Executions[].PublishedResults.CID | select (. != null)'

The extracted CID will look like this:

QmYFhG668yJZmtk84SMMdbrz5Uvuh78Q8nLxTgLDWShkhR

You can publish your results to Amazon S3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.

To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.

For S3-compatible destinations, the configuration is as follows:

type PublisherSpec struct {
    Type   Publisher              `json:"Type,omitempty"`
    Params map[string]interface{} `json:"Params,omitempty"`
}

For Amazon S3, you can specify the PublisherSpec configuration as shown below:

PublisherSpec:
  Type: S3
  Params:
    Bucket: <bucket>              # Specify the bucket where results will be stored
    Key: <object-key>             # Define the object key (supports dynamic naming using placeholders)
    Compress: <true/false>        # Specify whether to publish results as a single gzip file (default: false)
    Endpoint: <optional>          # Optionally specify the S3 endpoint
    Region: <optional>            # Optionally specify the S3 region

Let's explore some examples to illustrate how you can use this:

  1. Publishing results to S3 using default settings

bacalhau docker run -p s3://<bucket>/<object-key> ubuntu ...
  2. Publishing results to S3 with a custom endpoint and region:

bacalhau docker run \
-p s3://<bucket>/<object-key>,opt=endpoint=http://s3.example.com,opt=region=us-east-1 \
ubuntu ...
  3. Publishing results to S3 as a single compressed file

bacalhau docker run -p s3://<bucket>/<object-key>,opt=compress=true ubuntu ...
  4. Utilizing naming placeholders in the object key

bacalhau docker run -p s3://<bucket>/result-{date}-{jobID} ubuntu ...

Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.

Here's an example of a sample result:

{
    "NodeID": "QmYJ9QN9Pbi6gBKNrXVk5J36KSDGL5eUT6LMLF5t7zyaA7",
    "Data": {
        "StorageSource": "S3",
        "Name": "s3://<bucket>/run3.tar.gz",
        "S3": {
            "Bucket": "<bucket>",
            "Key": "run3.tar.gz",
            "Checksum": "e0uDqmflfT9b+rMfoCnO5G+cy+8WVTOPUtAqDMnXWbw=",
            "VersionID": "hZoNdqJsZxE_bFm3UGJuJ0RqkITe9dQ1"
        }
    }
}

To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:

  • Environment variables, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

  • Credentials file ~/.aws/credentials

  • IAM Roles for Amazon EC2 Instances
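For example, exporting credentials into the environment of the shell that starts the Bacalhau node (or runs the CLI) is enough for the default chain to pick them up; the values below are placeholders:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
# start the node or submit jobs from this shell so the credentials are inherited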

Running Rust programs as WebAssembly (WASM)

Prerequisites

1. Develop a Rust Program Locally

We can use cargo (which will have been installed by rustup) to start a new project (my-program) and compile it:

cargo init my-program

We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.

// ./my-program/src/main.rs
use image::{open, GrayImage, Luma, Pixel};
use imageproc::definitions::Clamp;
use imageproc::gradients::sobel_gradient_map;
use imageproc::map::map_colors;
use imageproc::seam_carving::*;
use std::path::Path;

fn main() {
    let input_path = "inputs/image0.JPG";
    let output_dir = "outputs/";

    let input_path = Path::new(&input_path);
    let output_dir = Path::new(&output_dir);

    // Load image and convert to grayscale
    let input_image = open(input_path)
        .expect(&format!("Could not load image at {:?}", input_path))
        .to_rgb8();

    // Save original image in output directory
    let original_path = output_dir.join("original.png");
    input_image.save(&original_path).unwrap();

    // We will reduce the image width by this amount, removing one seam at a time.
    let seams_to_remove: u32 = input_image.width() / 6;

    let mut shrunk = input_image.clone();
    let mut seams = Vec::new();

    // Record each removed seam so that we can draw them on the original image later.
    for i in 0..seams_to_remove {
        if i % 100 == 0 {
            println!("Removing seam {}", i);
        }
        let vertical_seam = find_vertical_seam(&shrunk);
        shrunk = remove_vertical_seam(&mut shrunk, &vertical_seam);
        seams.push(vertical_seam);
    }

    // Draw the seams on the original image.
    let gray_image = map_colors(&input_image, |p| p.to_luma());
    let annotated = draw_vertical_seams(&gray_image, &seams);
    let annotated_path = output_dir.join("annotated.png");
    annotated.save(&annotated_path).unwrap();

    // Draw the seams on the gradient magnitude image.
    let gradients = sobel_gradient_map(&input_image, |p| {
        let mean = (p[0] + p[1] + p[2]) / 3;
        Luma([mean as u32])
    });
    let clamped_gradients: GrayImage = map_colors(&gradients, |p| Luma([Clamp::clamp(p[0])]));
    let annotated_gradients = draw_vertical_seams(&clamped_gradients, &seams);
    let gradients_path = output_dir.join("gradients.png");
    clamped_gradients.save(&gradients_path).unwrap();
    let annotated_gradients_path = output_dir.join("annotated_gradients.png");
    annotated_gradients.save(&annotated_gradients_path).unwrap();

    // Save the shrunk image.
    let shrunk_path = output_dir.join("shrunk.png");
    shrunk.save(&shrunk_path).unwrap();
}

In the main function main() an image is loaded, the original is saved, and then a loop is performed to reduce the width of the image by removing "seams." The results of the process are saved, including the original image with drawn seams and a gradient image with highlighted seams.

We also need to install the imageproc and image libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false). After disabling the default features, you need to explicitly specify only the features that you need:

// ./my-program/Cargo.toml
[package]
name = "my-program"
version = "0.1.0"
edition = "2021"

[dependencies.image]
version = "0.24.4"
default-features = false
features = ["png", "jpeg", "bmp"]

[dependencies.imageproc]
version = "0.23.0"
default-features = false

We can now build the Rust program into a WASM blob using cargo:

cd my-program && cargo build --target wasm32-wasi --release

This command navigates to the my-program directory and builds the project using Cargo with the target set to wasm32-wasi in release mode.

This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm which can now be run on Bacalhau.

2. Running WASM on Bacalhau

Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.

The -i flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.

For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.

export JOB_ID=$(bacalhau wasm run \
    ./my-program/target/wasm32-wasi/release/my-program.wasm _start \
    --id-only \
    -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs)

Structure of the Commands

  1. bacalhau wasm run: call to Bacalhau

  2. ./my-program/target/wasm32-wasi/release/my-program.wasm: the path to the WASM file that will be executed

  3. _start: the entry point of the WASM program, where its execution begins

  4. --id-only: this flag indicates that only the identifier of the executed job should be returned

  5. -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs: input data volume that will be accessible within the job at the specified destination path

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results) and downloaded our job output to be stored in that directory.

We can now get the results.

rm -rf wasm_results && mkdir -p wasm_results
bacalhau job get ${JOB_ID} --output-dir wasm_results

Viewing Job Output

When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.

./wasm_results/outputs/original.png
./wasm_results/outputs/annotated_gradients.png
./wasm_results/outputs/shrunk.png

Support

Accessing the Internet from Jobs

By default, Bacalhau jobs do not have any access to the internet. This is to keep both compute providers and users safe from malicious activities.

However, by using data volumes, you can read and access your data from within jobs and write back results.

Using Data Volumes

To use these features, the data to be downloaded has to be known before the job starts. For some workloads, the required data is computed as part of the job, for example when the purpose of the job is to process web results. In these cases, networking may be required during job execution.

Specifying Jobs to Access the Internet

To run Docker jobs on Bacalhau that need to access the internet, you'll need to specify one of the following networking modes:

  1. full: unfiltered networking for any protocol --network=full

  2. http: HTTP(S)-only networking to a specified list of domains --network=http

  3. none: no networking at all, the default --network=none

Specifying none will still allow Bacalhau to download and upload data before and after the job.

Jobs using http must specify the domains they want to access when the job is submitted. When the job runs, only HTTP requests to those domains will be possible, and data transfer will be rate-limited to 10Mbit/sec in either direction to prevent DDOS.

The required networking can be specified using the --network flag. For http networking, the required domains can be specified using the --domain flag, multiple times for as many domains as required. Specifying a domain starting with a . means that all sub-domains will be included. For example, specifying .example.com will cover some.thing.example.com as well as example.com.
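For example, an HTTP-only job that is allowed to reach example.com and all of its sub-domains might be submitted like this (a sketch only; the image, domain and command are placeholders rather than part of the original guide):

bacalhau docker run \
    --network=http \
    --domain=.example.com \
    alpine \
    -- wget -qO- https://api.example.com/data.json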

Bacalhau jobs are explicitly prevented from starting other Bacalhau jobs, even if a Bacalhau requester node is specified on the HTTP allowlist.

Support for networked jobs on the public network

Bacalhau has support for describing jobs that can access the internet during job execution. The ability for compute nodes to run jobs that require internet access depends on what compute nodes are currently part of the network.

Compute nodes that join the Bacalhau network do not accept networked jobs by default (i.e. they only accept jobs that specify --network=none, which is also the default).

(Updated) Configuration Management

Introduction

There have been some changes made to how Bacalhau handles configuration:

  1. The bacalhau repo ~/.bacalhau no longer contains a config file.

  2. Bacalhau no longer looks in the repo ~/.bacalhau for a config file.

  3. Bacalhau never writes a config file to disk unless instructed by a user to do so.

  4. A config file is not required to operate Bacalhau.

  5. Bacalhau searches for a default config file. The location is OS-dependent:

    1. Linux: ~/.config/bacalhau/config.yaml

    2. OSX: ~/Library/Application\ Support/bacalhau/config.yaml

    3. Windows: %AppData%\bacalhau\config.yaml. Usually, this is something like C:\Users\username\AppData\Roaming\bacalhau\config.yaml

Summary

Bacalhau no longer relies on the ~/.bacalhau directory for configuration and only creates a config file when instructed. While not required, it will look for a default config file in OS-specific locations.

Inspecting the Current Configuration of Bacalhau

Making Changes to the Default Config File

As described above, bacalhau still has the concept of a default config file, which, for the sake of simplicity, we’ll say lives in ~/.config/bacalhau/config.yaml. There are two ways this file can be modified:

  1. A text editor: vim ~/.config/bacalhau/config.yaml.

  2. The bacalhau config set command, described in the following sections.

Using a Non-Default Config File.

Bacalhau Configuration Keys

In Bacalhau, configuration keys are structured identifiers used to configure and customize the behavior of the application. They represent specific settings that control various aspects of Bacalhau's functionality, such as network parameters, API endpoints, node operations, and user interface options. The configuration file is organized in a tree-like structure using nested mappings (dictionaries) in YAML format. Each level of indentation represents a deeper level in the hierarchy.

Example: part of the config file

API:
  Host: 0.0.0.0
  Port: 1234
  Auth:
    Methods:
      ClientKey:
        Type: challenge
NameProvider: puuid
DataDir: /home/frrist/.bacalhau
Orchestrator:
  Host: 0.0.0.0
  Port: 4222
  NodeManager:
    DisconnectTimeout: 1m0s

In this YAML configuration file:

  1. Top-Level Keys (Categories): API, Orchestrator

  2. Sub-Level Keys (Subcategories): Under API, we have Host and Port; Under Orchestrator we have Host, Port and NodeManager

  3. Leaf Nodes (Settings): Host, Port, NameProvider, DataDir, DisconnectTimeout — these contain the actual configuration values.

Config keys use dot notation to represent the path from the root of the configuration hierarchy down to a specific leaf node. Each segment in the key corresponds to a level in the hierarchy. Syntax is Category.Subcategory(s)...LeafNode
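For example, taking the config file above, the DisconnectTimeout setting nested under Orchestrator and NodeManager is addressed by the key Orchestrator.NodeManager.DisconnectTimeout, which can then be used with the commands described below (the timeout value here is purely illustrative):

# Dot-notation key: Orchestrator -> NodeManager -> DisconnectTimeout
bacalhau config set Orchestrator.NodeManager.DisconnectTimeout 1m30s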

Using Keys With config set, config list and --config

The bacalhau config list command returns all keys and their corresponding values. The bacalhau config set command accepts a key and a value to set it to. The --config flag accepts a key and a value that will be applied to Bacalhau when it runs.

Example Interaction With the Bacalhau Configuration System

How to Modify the API Host Using bacalhau config set in the Default Config File:

  1. Run bacalhau config list to find the appropriate key

bacalhau config list
 KEY VALUE DESCRIPTION
 ... ... ...
 api.host 0.0.0.0 Host specifies the hostname or IP address o
 ... ... ...
  1. Run the bacalhau config set command

bacalhau config set api.host 192.168.0.1
  1. Observe how bacalhau config list reflects the new setting

bacalhau config list
 KEY VALUE DESCRIPTION
 ... ... ...
 api.host 192.168.0.1 Host specifies the hostname or IP address
 ... ... ...
  1. Observe the change has been reflected in the default config file

cat ~/.config/bacalhau/config.yaml
api:
    host: 192.168.0.1

How to Modify the API Host Using bacalhau config set With a Custom Config File

  1. Run the config set command with the flag

bacalhau config set --config=custom.yaml api.host 10.0.0.1
  1. Observe the created config file

cat custom.yaml
api:
 host: 10.0.0.1

Observe that the default config file and the output of bacalhau config list do not reflect this change.

How to Start Bacalhau With a Custom Config File

bacalhau --config=custom.yaml serve

Usage of the --config Flag

The --config (or -c) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources.

Usage

bacalhau [command] --config <option> [--config <option> ...] 

or using the short form:

bacalhau [command] -c <option> [-c <option> ...]

Configuration Options

  1. YAML Config Files: Specify paths to YAML configuration files. Example:

--config path/to/config.yaml
  1. Key-Value Pairs: Set specific configuration values using dot notation. Example:

--config WebUI.Enabled=true
  1. Boolean Flags: Enable boolean options by specifying the key alone. Example:

--config WebUI.Enabled

Precedence

When multiple configuration options are provided, they are applied in the following order of precedence (highest to lowest):

  1. Command-line key-value pairs and boolean flags

  2. YAML configuration files

  3. Default values

Within each category, options specified later override earlier ones.

Examples

Using a single config file:

bacalhau serve --config my-config.yaml

Merging multiple config files:

bacalhau serve -c base-config.yaml -c override-config.yaml

Overriding specific values:

bacalhau serve \
-c config.yaml \
-c WebUI.Listen=0.0.0.0:9999 \
-c NameProvider=hostname

Combining file and multiple overrides:

bacalhau serve \
-c config.yaml \
-c WebUI.Enabled \
-c API.Host=192.168.1.5

In the last example, WebUI.Enabled will be set to true, API.Host will be 192.168.1.5, and other values will be loaded from config.yaml if present.

Remember, later options override earlier ones, allowing for flexible configuration management.

Usage of the bacalhau completion Command

The bacalhau completion command will generate shell completion for your shell. You can use the command like:

bacalhau completion <bash|fish|powershell|zsh> > /tmp/bacalhau_completion && source /tmp/bacalhau_completion 

After running the above command, commands like bacalhau config set and bacalhau --config will have auto-completion for all possible configuration values along with their descriptions.

Support

Utilizing NATS.io within Bacalhau

NATS.io networking fundamentals in Bacalhau

Support for libp2p was discontinued in version v1.5.0.

Our initial NATS integration focuses on simplifying communication between orchestrator and compute nodes. By embedding NATS within orchestrators, we streamline the network. Now, compute nodes need only connect to one or a few orchestrators and dynamically discover others at runtime, dramatically cutting down on configuration complexity.

How This Benefits Users

  1. Easier Setup: Compute nodes no longer need to be directly accessible by orchestrators, removing deployment barriers in diverse environments (on-premises, edge locations, etc.).

  2. Increased Reliability: Network changes are less disruptive, as compute nodes can easily switch between orchestrators if needed.

  3. Future-Proof: This sets the stage for more advanced NATS features like global clusters and multi-orchestrator setups.

How This Affects Current Users

The aim of integrating NATS into Bacalhau is to keep user experience with Bacalhau's HTTP APIs and CLIs for job submission and queries consistent. This ensures a smooth transition, allowing you to continue your work without any disruptions.

Getting Started with NATS

1. Generate an Authentication Token:

bacalhau config set Compute.Auth.Token=<your_secure_token>
bacalhau config set Orchestrator.Auth.Token=<your_secure_token>

Make sure to securely store this token and share it only with authorized parties in your network.

2. Running an Orchestrator Node with Authentication:

With the authentication token set, launch your orchestrator node as follows:

bacalhau serve --orchestrator

This command sets up an orchestrator node with an embedded NATS server, using the given auth token to secure communications. It defaults to port 4222, but you can customize this using the Orchestrator.Port configuration key if needed.
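For example, to move the embedded NATS server to a non-default port, you could set this key before starting the orchestrator (the port value is purely illustrative):

# 4222 is the default; any free port can be used
bacalhau config set Orchestrator.Port 4223
bacalhau serve --orchestrator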

3. Initiating Compute Nodes with Authentication:

Compute nodes can authenticate using one of the following methods, depending on your preferred configuration setup:

Option 1: Read the Auth.Token Value from the Config:

bacalhau serve --compute --config Compute.Orchestrators=<HOST>

This method assumes the Compute.Auth.Token is already configured on the compute node, allowing for a seamless authentication process.

Option 2: Pass Auth.Token Value Directly in the Orchestrator URI:

bacalhau serve \
--compute \
--config Compute.Orchestrators=<your_secure_token>@<HOST>

Here, the Auth.Token is directly included in the command line, providing an alternative for instances where it's preferable to specify the token explicitly rather than rely on the configuration file.

Both methods ensure that compute nodes, acting as NATS clients, securely authenticate with the orchestrator node(s), establishing a trusted communication channel within your Bacalhau network.

More Authentication Options

We're committed to providing a secure and flexible distributed computing environment. Future Bacalhau versions will expand authentication choices, including TLS certificates and JWT, catering to varied security needs and further strengthening network security.

Looking Ahead with NATS

Global Connectivity and Scalability: NATS opens avenues for Bacalhau to operate smoothly across all scales, from local deployments to international networks. Its self-healing capabilities and dynamic mesh networking form the foundation for a future of resilient and flexible distributed computing.

Unlocking New Possibilities: The integration heralds a new era of possibilities for Bacalhau, from global clusters to multiple orchestrator nodes, tackling the complexities of distributed computing with innovation and community collaboration.

Conclusion

The shift to NATS is a step toward making distributed computing more accessible, resilient, and scalable. As we start this new chapter, we're excited to explore the advanced features NATS brings to Bacalhau and welcome our community to join us on this transformative journey.

Support

Automatic Update Checking

Bacalhau has an update checking service to automatically detect whether a newer version of the software is available.

Users who are both running CLI commands and operating nodes will be regularly informed that a new release can be downloaded and installed.

For clients

Bacalhau will run an update check regularly when client commands are executed. If an update is available, explanatory text will be printed at the end of the command.

To force a manual update check, run the bacalhau version command, which will explicitly list the latest software release alongside the server and client versions.

bacalhau version

# expected output
# might show client version only if client is not connected to any orchestrator

CLIENT  SERVER  LATEST  UPDATE MESSAGE 
 v1.5.1  v1.5.1  1.5.1

For node operators

Bacalhau will run an update check regularly as part of the normal operation of the node.

If an update is available, an INFO level message will be printed to the log.

Configuring checks

Bacalhau has some configuration options for controlling how often checks are performed. By default, an update check will run no more than once every 24 hours. Users can opt out of automatic update checks using the configuration described below.

Config property: UpdateConfig.Interval

Environment variable: BACALHAU_UPDATE_CHECKFREQUENCY

Default value: 24h0m0s

Meaning: The minimum amount of time between automated update checks. Set as any duration of hours, minutes or seconds, e.g. 24h or 10m. When set to 0, update checks are not performed.
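For example, to opt out of automatic update checks entirely, the interval can be set to 0 with the config set command shown earlier (a minimal sketch):

# Disable automatic update checks
bacalhau config set UpdateConfig.Interval 0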

It's important to note that disabling the automatic update checks may lead to potential issues, arising from mismatched versions of different actors within Bacalhau.

To output update check config, run bacalhau config list:

bacalhau config list | grep UpdateConfig
 UpdateConfig.Interval   24h0m0s   Interval specifies the time between update checks, when set to 0 update checks are not performed.

Support

Generate Synthetic Data using Sparkov Data Generation technique

Introduction

A synthetic dataset is generated by algorithms or simulations and has characteristics similar to real-world data. Collecting real-world data, especially data that contains sensitive user information like credit card details, is often not possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit card fraud, they can use synthetically generated data instead of real data without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

Prerequisite

1. Running Sparkov Locally​

To run Sparkov locally, you'll need to clone the repo and install dependencies:

git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt

Go to the Sparkov_Data_Generation directory:

cd Sparkov_Data_Generation

Create a temporary directory (outputs) to store the outputs:

mkdir ../outputs

2. Running the script

python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"

The command above executes the Python script datagen.py, passing the following arguments to it:

  1. -n 1000: Number of customers to generate

  2. -o ../outputs: path to store the outputs

  3. "01-01-2022": Start date

  4. "10-01-2022": End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from 01-01-2022 to 10-01-2022 and saves the results in the ../outputs directory.

To see the full list of options, use:

python datagen.py -h

3. Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image:

FROM python:3.8

RUN apt update && apt install -y git

RUN git clone https://github.com/js-ts/Sparkov_Data_Generation/

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt

These commands specify how the image will be built and what extra requirements will be included. We use python:3.8 as the base image, install git, clone the Sparkov_Data_Generation repository from GitHub, set the working directory inside the container to /Sparkov_Data_Generation/, and install the Python dependencies listed in the requirements.txt file.

Build the container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

docker build -t jsacex/sparkov-data-generation .

Push the container

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/sparkov-data-generation

After the repo image has been pushed to Docker Hub, we can now use the container to run jobs on Bacalhau.

4. Running a Bacalhau Job

Now we're ready to run a Bacalhau job:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsacex/sparkov-data-generation \
    --  python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022")

Structure of the command:

  1. bacalhau docker run: call to Bacalhau

  2. jsacex/sparkov-data-generation: the name of the docker image we are using

  3. -- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022": the arguments passed into the container, specifying the execution of the Python script datagen.py with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results

6. Viewing your Job Output

To view the contents of the current directory, run the following command:

ls results/outputs

Support

Simple Image Processing

Introduction

In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.

Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.

Prerequisite​

Running a Bacalhau Job​

Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1: Specifies the input data, which is stored in the S3 storage.

  3. --entrypoint mogrify: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.

  4. dpokidov/imagemagick:7.1.0-47-ubuntu: The name and the tag of the docker image we are using

  5. -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg': These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. image.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Display the image​

To view the images, open the results/outputs/ folder:

Support​

Using Bacalhau with DuckDB

DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.

DuckDB is suited for the following use cases:

  1. Processing and storing tabular datasets, e.g. from CSV or Parquet files

  2. Interactive data analysis, e.g. Joining & aggregate multiple large tables

  3. Concurrent large changes to multiple large tables, e.g. appending rows, adding/removing/updating columns

  4. Large result set transfer to client

In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install it locally, and there is no need to download the datasets, since they are already available on IPFS or on the web.

Overview

  • How to run a relational database (like DuckDB) on Bacalhau

Prerequisites

Containerize Script using Docker

You can skip this entirely and directly go to running on Bacalhau.

If you want any additional dependencies to be installed along with DuckDB, you need to build your own container.

To build your own docker container, create a Dockerfile, which contains instructions to build your DuckDB docker container.

Build the container

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:

Structure of the command

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. davidgasquez/datadex:v0.2.0 : the name and the tag of the docker image we are using

  3. duckdb -s "select 1": execute DuckDB

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. duckdb1.yaml, and then run with the command:

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Viewing your Job Output

Each job creates 3 subfolders: the combined_results, per_shard files, and the raw directory. To view the file, run the following command:

Expected output:

Running Arbitrary SQL commands

Below is the bacalhau docker run command to run arbitrary SQL commands over the yellow taxi trips dataset.

Structure of the command

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. davidgasquez/duckdb:latest: the name and the tag of the docker image we are using

  4. /inputs: path to input dataset

  5. duckdb -s: execute DuckDB

Declarative job description​

The job description should be saved in .yaml format, e.g. duckdb2.yaml, and then run with the command:

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Job status: You can check the status of the job using bacalhau job list.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Viewing your Job Output

Each job creates 3 subfolders: the combined_results, per_shard files, and the raw directory. To view the file, run the following command:

Need Support?

Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau

Introduction

Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even longer in my experience), and importing and exporting data from the Ethereum node is slow.

For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can now access the data without having to run another Ethereum node.

But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.

In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.

Prerequisite​

Analysing Ethereum Data Locally​

First let's download one of the IPFS files and inspect it locally:

You can see the full list of IPFS CIDs in the appendix at the bottom of the page.

If you don't already have the Pandas library, let's install it:

The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.

This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.

Analysing Ethereum Data With Bacalhau​

To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.

But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.

Next, let's make sure the file works as expected:

And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs and /outputs directories. The /inputs directory is where the data will be read from and the /outputs directory is where the results will be saved to.

We've already pushed the container, but for posterity, the following command pushes this container to GHCR.

Running a Bacalhau Job​

To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run command.

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.

Declarative job description​

The job description should be saved in .yaml format, e.g. blockchain.yaml, and then run with the command:


Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Display the image​

To view the images, we will use glob to return all file paths that match a specific pattern.

Massive Scale Ethereum Analysis​

Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.

See the appendix for the hashes.txt file.

Now take a look at the job id's. You can use these to check the status of the jobs and download the results:

You might want to double-check that the jobs ran ok by doing a bacalhau job list.

Wait until all of these jobs have been completed. Then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There have also been some issues in the past communicating with IPFS, so if you get an error, try again.

Display the image​

To view the images, we will use glob to return all file paths that match a specific pattern.

That's it! There are several years of Ethereum transaction volume data.

Appendix: List Ethereum Data CIDs​

The following is a list of IPFS CIDs for the Ethereum data that we used in this tutorial. You can use these CIDs to download the rest of the chain if you so desire. The CIDs are ordered by block number and increase 50,000 blocks at a time. Here's the ordered list of CIDs:

Support​

Oceanography - Data Conversion

Introduction

In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.

Prerequisites​

Running Locally​

Downloading the dataset​

Installing dependencies​

Next let's write the requirements.txt. This file will also be used by the Dockerfile to install the dependencies.

Reading and Viewing Data​

We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.

Data Conversion​

Writing the Script​

Let's create a new file called main.py and paste the following script in it:

This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.

Upload the Data to IPFS​

This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y.

Setting up Docker Container​

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

Build the container​

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

Push the container​

Now you can push this repository to the registry designated by its name or tag.

Running a Bacalhau Job​

Now that we have the data in IPFS and the Docker image pushed, the next step is to run a job using the bacalhau docker run command.
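The exact command is not reproduced in this export, but based on the components listed below it might look roughly like this (a sketch; the --id-only and --wait flags follow the pattern of the other examples in this guide):

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
    ghcr.io/bacalhau-project/examples/socat:0.0.11 \
    -- python main.py)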

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. ghcr.io/bacalhau-project/examples/socat:0.0.11: the name and the tag of the image we are using

  4. python main.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. ocean.yaml, and then run with the command:

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

To view the file, run the following command:

Write a SpecConfig

SpecConfig provides a unified structure to specify configurations for various components in Bacalhau, including engines, publishers, and input sources. Its flexible design allows seamless integration with multiple systems like Docker, WebAssembly (Wasm), AWS S3, and local directories, among others.

SpecConfig Parameters

  1. Type (string : <required>): Specifies the type of the configuration. Examples include docker and wasm for execution engines, S3 for input sources and publishers, etc.

  2. Params (map[string]any : <optional>): A set of key-value pairs that provide the specific configurations for the chosen type. The keys and values are flexible and depend on the Type. For instance, parameters for a Docker engine might include image name and version, while an S3 publisher would require configurations like the bucket name and AWS region. If not provided, it defaults to nil.

Usage Examples

Here are a few hypothetical examples to demonstrate how you might define SpecConfig for different components:

Docker Engine

S3 Publisher

Local Directory Input Source
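The original example snippets are not included in this export, so below is a minimal sketch of what such definitions might look like in the YAML job-spec style used elsewhere in this guide. The parameter names for the S3 publisher and the local directory source are assumptions and should be checked against the respective component documentation:

# Docker engine (Image/Entrypoint/Parameters follow the job specs shown in this guide)
Engine:
  Type: docker
  Params:
    Image: ubuntu:latest
    Entrypoint:
      - /bin/bash
    Parameters:
      - -c
      - echo hello

# S3 publisher (Bucket/Key/Region mirror the S3 source params used in this guide; exact keys may differ)
Publisher:
  Type: s3
  Params:
    Bucket: my-results-bucket
    Key: results/
    Region: us-east-1

# Local directory input source (type and parameter names are assumptions)
InputSources:
  - Target: /inputs
    Source:
      Type: localDirectory
      Params:
        SourcePath: /data/files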

Remember, the exact keys and values in the Params map will vary depending on the specific requirements of the component being configured. Always refer to the individual component's documentation to understand the available parameters.

Convert CSV To Parquet Or Avro

Introduction​

Converting from CSV to parquet or avro reduces the size of the file and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on ipfs or on the web without the need to download files and install dependencies locally.

In this example tutorial, we will convert a CSV file from a URL to Parquet format and save the converted Parquet file to IPFS.

Prerequisites​

Running CSV to Avro or Parquet Locally​​

Downloading the CSV file​

Let's download the transactions.csv file:

Writing the Script​

Write the converter.py Python script, that serves as a CSV converter to Avro or Parquet formats:

Installing Dependencies​

Converting CSV file to Parquet format​

In our case:
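The original command is not shown in this export; based on the argument order used in the Bacalhau job later in this example (input file, output file, output format), a local conversion might look like this sketch:

python3 converter.py transactions.csv transactions.parquet parquet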

Viewing the parquet file:​

Containerize Script with Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

Build the container​

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job​

With the command below, we are mounting the CSV file for transactions from IPFS
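The original command is not reproduced in this export; based on the components listed below, it might look roughly like this (a sketch; the --id-only and --wait flags follow the pattern of the other examples in this guide):

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W \
    jsacex/csv-to-arrow-or-parquet \
    -- python3 src/converter.py ../inputs/transactions.csv /outputs/transactions.parquet parquet)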

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. jsacex/csv-to-arrow-or-parquet: the name and the tag of the docker image we are using

  4. ../inputs/transactions.csv : path to input dataset

  5. /outputs/transactions.parquet parquet: the path to the output file, followed by the output format

  6. python3 src/converter.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. convertcsv.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Alternatively, you can do this:

Support​


Data source can be specified via --input flag, see the for more details

Structure of the command

You can find out more about the which is designed to simplify the data uploading process.

For more details, see the

Checking the State of Your Jobs

Viewing your Job Output

Get the CID From the Completed Job

Use the CID in a New Bacalhau Job

Need Support?

For questions and feedback, please reach out in our

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Running a Bacalhau Job

Structure of the Command

This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the for more details.

Checking the State of your Jobs

Viewing your Job Output

Extract Result CID

Publishing Results to S3-Compatible Destinations

Publisher Spec

Example Usage

Content Identification

Support for the S3-compatible storage provider

Need Support?

For questions, feedback, please reach out in our

Bacalhau supports running jobs as a WebAssembly (WASM) program. This example demonstrates how to compile a Rust project into WebAssembly and run the program on Bacalhau.

To get started, you need to install the Bacalhau client, see more information .

A working Rust installation with the wasm32-wasi target. For example, you can use rustup to install Rust and configure it to build WASM targets. For those using the notebook, these are installed in hidden cells below.

The program below will use the Rust imageproc crate to resize an image through seam carving, based on an example from their repository.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

When you submit a Bacalhau job, you must specify the internet locations from which to download data and write results. Both Docker and WebAssembly jobs support these features.

When submitting a Bacalhau job, you can specify the CID (Content IDentifier) or HTTP(S) URL from which to download data. The data will be retrieved before the job starts and made available to the job as a directory on the filesystem. When running Bacalhau jobs, you can specify as many CIDs or URLs as needed using the --input flag, which is accepted by both bacalhau docker run and bacalhau wasm run. See the command line flags for more information.

You can write back results from your Bacalhau jobs to your public storage location. By default, jobs will write results to the storage provider using the --publisher command line flag. See the command line flags on how to configure this.

Jobs will be provided with http_proxy and https_proxy environment variables, which contain a TCP address of an HTTP proxy to connect through. Most tools and libraries will use these environment variables by default. If not, they must be used by user code to configure HTTP proxy usage.

The public compute nodes provided by the Bacalhau network will accept jobs that require HTTP networking as long as the domains are from this allowlist.

If you need to access a domain that isn't on the allowlist, you can make a request to the Bacalhau Project team to include your required domains. You can also set up your own compute node that implements the allowlist you need.

To view the configuration that bacalhau will receive when a command is executed against it, users can run the bacalhau config list command. Users who wish to see Bacalhau’s config represented as YAML may run bacalhau config list --output=yaml.

The bacalhau config set command.

The --config (or -c) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources. To specify a config file to bacalhau, users may use the --config flag, passing a path to a config file for bacalhau to use. When this flag is provided, bacalhau will not search for a default config, and will instead use the configuration provided to it by the --config flag.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Starting from v1.3.0, Bacalhau uses NATS.io, a powerful open-source messaging system designed to streamline communication across complex network environments, to communicate with other nodes on the network.

Start by creating a secure token. This token will be used for authentication between the orchestrator and compute nodes during their communications:

If you’re interested in learning more about distributed computing and how it can benefit your work, there are several ways to connect with us. Visit our website, sign up to our bi-weekly office hour, join our Slack, or send us a message.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

To submit a workload to Bacalhau, we will use the bacalhau docker run command. This command allows one to pass an input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

The same job can be presented in the declarative format. In this case, the description will look like this:

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

The bacalhau docker run command allows passing an input data volume with the --input or -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the fugacity of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.

To get started, you need to install the Bacalhau client, see more information

For the purposes of this example we will use the dataset in the "Gridded" format from the and long-term global sea surface temperature data from - information about that dataset can be found .

To convert the data from fugacity of CO2 (fCO2) to partial pressure of CO2 (pCO2) we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the pyseaflux package.

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

For more information about working with custom containers, see the .

The same job can be presented in the declarative format. In this case, the description will look like this:

Checking the State of your Jobs

Viewing your Job Output

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Full Docker spec can be found .

Full S3 Publisher can be found .

Full local source can be found .

To get started, you need to install the Bacalhau client, see more information

You can use the CSV files from

You can find out more information about converter.py

You can skip this section entirely and directly go to

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1 \
    --publisher ipfs \
    --entrypoint mogrify \
    dpokidov/imagemagick:7.1.0-47-ubuntu \
    -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg')
name: Simple Image Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: dpokidov/imagemagick:7.1.0-47-ubuntu
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - magick mogrify -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/input_images"
      Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"
bacalhau job run image.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results
FROM mcr.microsoft.com/vscode/devcontainers/python:3.9

RUN apt-get update && apt-get install -y nodejs npm g++

# Install dbt
RUN pip3 --disable-pip-version-check --no-cache-dir install duckdb==0.4.0 dbt-duckdb==1.1.4 \
    && rm -rf /tmp/pip-tmp

# Install duckdb cli
RUN wget https://github.com/duckdb/duckdb/releases/download/v0.4.0/duckdb_cli-linux-amd64.zip \
    && unzip duckdb_cli-linux-amd64.zip -d /usr/local/bin \
    && rm duckdb_cli-linux-amd64.zip

# Configure Workspace
ENV DBT_PROFILES_DIR=/workspaces/datadex
WORKDIR /workspaces/datadex
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t davidgasquez/datadex:v0.2.0
docker push <hub-user>/<repo-name>:<tag>
docker push davidgasquez/datadex:v0.2.0
export JOB_ID=$(bacalhau docker run \
davidgasquez/datadex:v0.2.0 \
--  duckdb -s "select 1")
name: DuckDB Hello World
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: davidgasquez/datadex:v0.2.0
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select 1"
bacalhau job run duckdb1.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout  # displays the contents of the file
┌───┐
│ 1 │
├───┤
│ 1 │
└───┘
export JOB_ID=$(bacalhau docker run \
 -i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \
  --workdir /inputs \
  --id-only \
  --wait \
  davidgasquez/duckdb:latest \
  -- duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'")
name: DuckDB Parquet Query
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: davidgasquez/duckdb:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'"
    InputSources:
    - Target: "/inputs"
      Source:
        Type: "s3"
        Params:
          Bucket: "bacalhau-duckdb"
          Key: "*"
          Region: "us-east-1"
bacalhau job run duckdb2.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     24648499 │
└──────────────┘
wget -q -O file.tar.gz https://w3s.link/ipfs/bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
tar -xvf file.tar.gz
pip install pandas
# Use pandas to read in transaction data and clean up the columns
import pandas as pd
import glob

file = glob.glob('output_*/transactions/start_block=*/end_block=*/transactions*.csv')[0]
print("Loading file %s" % file)
df = pd.read_csv(file)
df['value'] = df['value'].astype('float')
df['from_address'] = df['from_address'].astype('string')
df['to_address'] = df['to_address'].astype('string')
df['hash'] = df['hash'].astype('string')
df['block_hash'] = df['block_hash'].astype('string')
df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
df.info()

# Total volume per day
df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum().plot()
# main.py
import glob, os, sys, shutil, tempfile
import pandas as pd

def main(input_dir, output_dir):
    search_path = os.path.join(input_dir, "output*", "transactions", "start_block*", "end_block*", "transactions_*.csv")
    csv_files = glob.glob(search_path)
    if len(csv_files) == 0:
        print("No CSV files found in %s" % search_path)
        sys.exit(1)
    for transactions_file in csv_files:
        print("Loading %s" % transactions_file)
        df = pd.read_csv(transactions_file)
        df['value'] = df['value'].astype('float')
        df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
        
        print("Processing %d blocks" % (df.shape[0]))
        results = df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum()
        print("Finished processing %d days worth of records" % (results.shape[0]))

        save_path = os.path.join(output_dir, os.path.basename(transactions_file))
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        print("Saving to %s" % (save_path))
        results.to_csv(save_path)

def extractData(input_dir, output_dir):
    search_path = os.path.join(input_dir, "*.tar.gz")
    gz_files = glob.glob(search_path)
    if len(gz_files) == 0:
        print("No tar.gz files found in %s" % search_path)
        sys.exit(1)
    for f in gz_files:
        shutil.unpack_archive(filename=f, extract_dir=output_dir)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print('Must pass arguments. Format: [command] input_dir output_dir')
        sys.exit()
    with tempfile.TemporaryDirectory() as tmp_dir:
        extractData(sys.argv[1], tmp_dir)
        main(tmp_dir, sys.argv[2])
python main.py . outputs/
FROM python:3.11-slim-bullseye
WORKDIR /src
RUN pip install pandas==1.5.1
ADD main.py .
CMD ["python", "main.py", "/inputs", "/outputs"]
docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.1 .
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input ipfs://bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6)
name: Ethereum Blockchain Analysis with Ethereum-ETL
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs/data.tar.gz"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq"
bacalhau job run blockchain.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls -lah results/outputs
import glob
import pandas as pd

# Get CSV files list from a folder
csv_files = glob.glob("results/outputs/*.csv")
df = pd.read_csv(csv_files[0], index_col='block_datetime')
df.plot()
printf "" > job_ids.txt
for h in $(cat hashes.txt); do \
    bacalhau docker run \
    --id-only \
    --wait=false \
    --input=ipfs://$h:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6 >> job_ids.txt 
done
cat job_ids.txt

d840df7b-9318-4e5b-ab06-adb72dd95394
09d01f9c-9409-42b9-829d-92e22fcdd062
0072758f-3575-44d7-b193-da4a22f6bc86
2043dee4-fc82-4768-92cb-4d23dd2514b1
36ef8e9e-9eae-4218-81e6-15883d0a5b8d
932aa406-cd29-4933-b09f-c8cea4d77164
1f3e5273-bdd4-4ef0-b7ed-b83591fab64e
8bfabe96-54e3-4fee-b344-a0517c683268
1cd588a1-5c76-4f91-ba90-af7931bca596
b9c29531-e1b4-4520-b03d-7406a22bbdb3
8665b8be-24a9-4c78-9913-803d3e3c9a65
06115147-bc83-49e8-bb71-7b447c8ad1bc
84afed3e-831c-462b-a3e3-9a23bc7d6fb8
ed6e55e6-98d3-4bde-8ece-1f05838d489e
...
bacalhau job list -n 50
for id in $(cat job_ids.txt); do \
    rm -rf results_$id && mkdir results_$id
    bacalhau job get --output-dir results_$id $id &
done
wait
import os, glob
import pandas as pd

# Get CSV files list from a folder
path = os.path.join("results_*", "outputs", "*.csv")
csv_files = glob.glob(path)

# Read each CSV file into a list of DataFrames
df_list = (pd.read_csv(file, index_col='block_datetime') for file in csv_files)

# Concatenate all DataFrames
df_unsorted = pd.concat(df_list, ignore_index=False)

# Some files will cross days, so group by day and sum the values
df = df_unsorted.groupby(level=0).sum()

# Plot
df.plot(figsize=(16,9))
rm -rf results_* output_* outputs results temp # Remove temporary results
# hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm
bafybeicgo3iofo3sw73wenc3nkdhi263yytjnds5cxjwvypwekbz4sk7ra
bafybeihvep5xsvxm44lngmmeysihsopcuvcr34an4idz45ixl5slsqzy3y
bafybeigmt2zwzrbzwb4q2kt2ihlv34ntjjwujftvabrftyccwzwdypama4
bafybeiciwui7sw3zqkvp4d55p4woq4xgjlstrp3mzxl66ab5ih5vmeozci
bafybeicpmotdsj2ambf666b2jkzp2gvg6tadr6acxqw2tmdlmsruuggbbu
bafybeigefo3esovbveavllgv5wiheu5w6cnfo72jxe6vmfweco5eq5sfty
bafybeigvajsumnfwuv7lp7yhr2sr5vrk3bmmuhhnaz53waa2jqv3kgkvsu
bafybeih2xg2n7ytlunvqxwqlqo5l3daykuykyvhgehoa2arot6dmorstmq
bafybeihnmq2ltuolnlthb757teihwvvw7wophoag2ihnva43afbeqdtgi4
bafybeibb34hzu6z2xgo6nhrplt3xntpnucthqlawe3pmzgxccppbxrpudy
bafybeigny33b4g6gf2hrqzzkfbroprqrimjl5gmb3mnsqu655pbbny6tou
bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
bafybeibryqj62l45pxjhdyvgdc44p3suhvt4xdqc5jpx474gpykxwgnw2e
bafybeidme3fkigdjaifkjfbwn76jk3fcqdogpzebtotce6ygphlujaecla
bafybeig7myc3eg3h2g5mk2co7ybte4qsuremflrjneer6xk3pghjwmcwbi
bafybeic3x2r5rrd3fdpdqeqax4bszcciwepvbpjl7xdv6mkwubyqizw5te
bafybeihxutvxg3bw7fbwohq4gvncrk3hngkisrtkp52cu7qu7tfcuvktnq
bafybeicumr67jkyarg5lspqi2w4zqopvgii5dgdbe5vtbbq53mbyftduxy
bafybeiecn2cdvefvdlczhz6i4afbkabf5pe5yqrcsgdvlw5smme2tw7em4
bafybeiaxh7dhg4krgkil5wqrv5kdsc3oewwy6ym4n3545ipmzqmxaxrqf4
bafybeiclcqfzinrmo3adr4lg7sf255faioxjfsolcdko3i4x7opx7xrqii
bafybeicjmeul7c2dxhmaudawum4ziwfgfkvbgthgtliggfut5tsc77dx7q
bafybeialziupik7csmhfxnhuss5vrw37kmte7rmboqovp4cpq5hj4insda
bafybeid7ecwdrw7pb3fnkokq5adybum6s5ok3yi2lw4m3edjpuy65zm4ji
bafybeibuxwnl5ogs4pwa32xriqhch24zbrw44rp22hrly4t6roh6rz7j4m
bafybeicxvy47jpvv3fi5umjatem5pxabfrbkzxiho7efu6mpidjpatte54
bafybeifynb4mpqrbsjbeqtxpbuf6y4frrtjrc4tm7cnmmui7gbjkckszrq
bafybeidcgnbhguyfaahkoqbyy2z525d3qfzdtbjuk4e75wkdbnkcafvjei
bafybeiefc67s6hpydnsqdgypbunroqwkij5j26sfmc7are7yxvg45uuh7i
bafybeiefwjy3o42ovkssnm7iihbog46k5grk3gobvvkzrqvof7p6xbgowi
bafybeihpydd3ivtza2ql5clatm5fy7ocych7t4czu46sbc6c2ykrbwk5uu
bafybeiet7222lqfmzogur3zlxqavlnd3lt3qryw5yi5rhuiqeqg4w7c3qu
bafybeihwomd4ygoydvj5kh24wfwk5kszmst5vz44zkl6yibjargttv7sly
bafybeidbjt2ckr4oooio3jsfk76r3bsaza5trjvt7u36slhha5ksoc5gv4
bafybeifyjrmopgtfmswq7b4pfscni46doy3g3z6vi5rrgpozc6duebpmuy
bafybeidsrowz46yt62zs64q2mhirlc3rsmctmi3tluorsts53vppdqjj7e
bafybeiggntql57bw24bw6hkp2yqd3qlyp5oxowo6q26wsshxopfdnzsxhq
bafybeidguz36u6wakx4e5ewuhslsfsjmk5eff5q7un2vpkrcu7cg5aaqf4
bafybeiaypwu2b45iunbqnfk2g7bku3nfqveuqp4vlmmwj7o7liyys42uai
bafybeicaahv7xvia7xojgiecljo2ddrvryzh2af7rb3qqbg5a257da5p2y
bafybeibgeiijr74rcliwal3e7tujybigzqr6jmtchqrcjdo75trm2ptb4e
bafybeiba3nrd43ylnedipuq2uoowd4blghpw2z7r4agondfinladcsxlku
bafybeif3semzitjbxg5lzwmnjmlsrvc7y5htekwqtnhmfi4wxywtj5lgoe
bafybeiedmsig5uj7rgarsjans2ad5kcb4w4g5iurbryqn62jy5qap4qq2a
bafybeidyz34bcd3k6nxl7jbjjgceg5eu3szbrbgusnyn7vfl7facpecsce
bafybeigmq5gch72q3qpk4nipssh7g7msk6jpzns2d6xmpusahkt2lu5m4y
bafybeicjzoypdmmdt6k54wzotr5xhpzwbgd3c4oqg6mj4qukgvxvdrvzye
bafybeien55egngdpfvrsxr2jmkewdyha72ju7qaaeiydz2f5rny7drgzta
mkdir -p inputs
curl -L --output ./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip https://www.socat.info/socat_files/v2022/SOCATv2022_tracks_gridded_monthly.nc.zip
curl --output ./inputs/sst.mnmean.nc https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/sst.mnmean.nc
# requirements.txt
Bottleneck==1.3.5
dask==2022.2.0
fsspec==2022.5.0
netCDF4==1.6.0
numpy==1.21.6
pandas==1.3.5
pip==22.1.2
pyseaflux==2.2.1
scipy==1.7.3
xarray==0.20.2
zarr>=2.0.0
pip install -r requirements.txt > /dev/null
import fsspec # for reading remote files
import xarray as xr

# Open the zip archive using fsspec and load the data into xarray.Dataset
with fsspec.open("./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip", compression='zip') as fp:
    ds = xr.open_dataset(fp)

# Display information about the dataset    
ds.info()
time_slice = slice("2010", "2020") # select a decade
res = ds['sst_ave_unwtd'].sel(tmnth=time_slice).mean(dim='tmnth') # compute the mean for this period
res.plot() # plot the result
# main.py
import fsspec
import xarray as xr
import pandas as pd
import numpy as np
import pyseaflux


def lon_360_to_180(ds=None, lonVar=None):
    lonVar = "lon" if lonVar is None else lonVar
    return (ds.assign_coords({lonVar: (((ds[lonVar] + 180) % 360) - 180)})
            .sortby(lonVar)
            .astype(dtype='float32', order='C'))


def center_dates(ds):
    # start and end date
    start_date = str(ds.time[0].dt.strftime('%Y-%m').values)
    end_date = str(ds.time[-1].dt.strftime('%Y-%m').values)

    # monthly dates centered on 15th of each month
    dates = pd.date_range(start=f'{start_date}-01T00:00:00.000000000',
                          end=f'{end_date}-01T00:00:00.000000000',
                          freq='MS') + np.timedelta64(14, 'D')

    return ds.assign(time=dates)


def get_and_process_sst(url=None):
    # get noaa sst
    if url is None:
        url = ("/inputs/sst.mnmean.nc")

    with fsspec.open(url) as fp:
        ds = xr.open_dataset(fp)
        ds = lon_360_to_180(ds)
        ds = center_dates(ds)
        return ds


def get_and_process_socat(url=None):
    if url is None:
        url = ("/inputs/SOCATv2022_tracks_gridded_monthly.nc.zip")

    with fsspec.open(url, compression='zip') as fp:
        ds = xr.open_dataset(fp)
        ds = ds.rename({"xlon": "lon", "ylat": "lat", "tmnth": "time"})
        ds = center_dates(ds)
        return ds


def main():
    print("Load SST and SOCAT data")
    ds_sst = get_and_process_sst()
    ds_socat = get_and_process_socat()

    print("Merge datasets together")
    time_slice = slice("1981-12", "2022-05")
    ds_out = xr.merge([ds_sst['sst'].sel(time=time_slice),
                       ds_socat['fco2_ave_unwtd'].sel(time=time_slice)])

    print("Calculate pco2 from fco2")
    ds_out['pco2_ave_unwtd'] = xr.apply_ufunc(
        pyseaflux.fCO2_to_pCO2,
        ds_out['fco2_ave_unwtd'],
        ds_out['sst'])

    print("Add metadata")
    ds_out['pco2_ave_unwtd'].attrs['units'] = 'uatm'
    ds_out['pco2_ave_unwtd'].attrs['notes'] = ("calculated using "
                                               "NOAA OI SST V2 "
                                               "and pyseaflux package")

    print("Save data")
    ds_out.to_zarr("/processed.zarr")
    import shutil
    shutil.make_archive("/outputs/processed.zarr", 'zip', "/processed.zarr")
    print("Zarr file written to disk, job completed successfully")

if __name__ == "__main__":
    main()
FROM python:slim

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y --no-install-recommends \
    g++ \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /project

COPY ./requirements.txt /project

RUN pip3 install -r requirements.txt

COPY ./main.py /project

CMD ["python","main.py"]
docker build -t <hub-user>/<repo-name>:<tag> .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
    --memory 10Gb \
    ghcr.io/bacalhau-project/examples/socat:0.0.11 \
    -- python main.py)
name: Oceanography
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/socat:0.0.11
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python main.py
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y"
    Resources:
        Memory: 10gb
bacalhau job run ocean.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results
mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results
ls results/outputs

processed.zarr.zip
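If you want to inspect the processed data locally, the zipped Zarr store can be unpacked and opened with xarray. A minimal sketch, assuming unzip is available and your Python environment has xarray and zarr installed (the paths follow the download step above):

unzip -q results/outputs/processed.zarr.zip -d results/outputs/processed.zarr
python3 -c "import xarray as xr; print(xr.open_zarr('results/outputs/processed.zarr'))"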
{
  "Type": "docker",
  "Params": {
    "Image": "my_app_image",
    "Entrypoint": "my_app_entrypoint",
  }
}
{
  "Type": "s3",
  "Params": {
    "Bucket": "my_bucket",
    "Region": "us-west-1"
  }
}
{
  "Type": "localDirectory",
  "Params": {
    "SourcePath": "/path/to/local/directory",
    "ReadWrite": true,
  }
}
wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv
# converter.py
import os
import sys
from abc import ABCMeta, abstractmethod

import fastavro
import numpy as np
import pandas as pd
from pyarrow import Table, parquet


class BaseConverter(metaclass=ABCMeta):
    """
    Base class for converters.

    Validate received parameters for future use.
    """
    def __init__(
        self,
        csv_file_path: str,
        target_file_path: str,
    ) -> None:
        self.csv_file_path = csv_file_path
        self.target_file_path = target_file_path

    @property
    def csv_file_path(self):
        return self._csv_file_path

    @csv_file_path.setter
    def csv_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        _, extension = os.path.splitext(path)
        if not os.path.isfile(path) or extension != '.csv':
            raise FileNotFoundError(
                f'No such csv file: {path}'
            )
        self._csv_file_path = path

    @property
    def target_file_path(self):
        return self._target_file_path

    @target_file_path.setter
    def target_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        target_dir = os.path.dirname(path)
        if not os.path.isdir(target_dir):
            raise FileNotFoundError(
                f'No such directory: {target_dir}\n'
                'Choose an existing directory or create one for the result file.'
            )
        if os.path.isfile(path):
            raise FileExistsError(
                f'File {path} already exists. '
                'Usage of an existing file may result in data loss.'
            )
        self._target_file_path = path

    def get_csv_reader(self):
        """Return csv reader which read csv file as a stream"""
        return pd.read_csv(
            self.csv_file_path,
            iterator=True,
            chunksize=100000
        )

    @abstractmethod
    def convert(self):
        """Should be implemented in child class"""
        pass


class ParquetConverter(BaseConverter):
    """
    Convert received csv file to parquet file.

    Take path to csv file and path to result file.
    """
    def convert(self):
        """Read csv file as a stream and write data to parquet file."""
        csv_reader = self.get_csv_reader()
        writer = None
        for chunk in csv_reader:
            if not writer:
                table = Table.from_pandas(chunk)
                writer = parquet.ParquetWriter(
                    self.target_file_path, table.schema
                )
            table = Table.from_pandas(chunk)
            writer.write_table(table)
        writer.close()


class AvroConverter(BaseConverter):
    """
    Convert received csv file to avro file.

    Take path to csv file and path to result file.
    """
    NUMPY_TO_AVRO_TYPES = {
        np.dtype('?'): 'boolean',
        np.dtype('int8'): 'int',
        np.dtype('int16'): 'int',
        np.dtype('int32'): 'int',
        np.dtype('uint8'): 'int',
        np.dtype('uint16'): 'int',
        np.dtype('uint32'): 'int',
        np.dtype('int64'): 'long',
        np.dtype('uint64'): 'long',
        np.dtype('O'): ['null', 'string', 'float'],
        np.dtype('unicode_'): 'string',
        np.dtype('float32'): 'float',
        np.dtype('float64'): 'double',
        np.dtype('datetime64'): {
            'type': 'long',
            'logicalType': 'timestamp-micros'
        },
    }

    def get_avro_schema(self, pandas_df):
        """Generate avro schema."""
        column_dtypes = pandas_df.dtypes
        schema_name = os.path.basename(self.target_file_path)
        schema = {
            'type': 'record',
            'name': schema_name,
            'fields': [
                {
                    'name': name,
                    'type': AvroConverter.NUMPY_TO_AVRO_TYPES[dtype]
                } for (name, dtype) in column_dtypes.items()
            ]
        }
        return fastavro.parse_schema(schema)

    def convert(self):
        """Read csv file as a stream and write data to avro file."""
        csv_reader = self.get_csv_reader()
        schema = None
        with open(self.target_file_path, 'a+b') as f:
            for chunk in csv_reader:
                if not schema:
                    schema = self.get_avro_schema(chunk)
                fastavro.writer(
                    f,
                    schema=schema,
                    records=chunk.to_dict('records')
                )


if __name__ == '__main__':
    converters = {
        'parquet': ParquetConverter,
        'avro': AvroConverter
    }
    csv_file, result_path, result_type = sys.argv[1], sys.argv[2], sys.argv[3]
    if result_type.lower() not in converters:
        raise ValueError(
            'Invalid target type. Available types: avro, parquet.'
        )
    converter = converters[result_type.lower()](csv_file, result_path)
    converter.convert()
pip install fastavro numpy pandas pyarrow
python converter.py <path_to_csv> <path_to_result_file> <extension>
python3 converter.py transactions.csv transactions.parquet parquet
import pandas as pd
pd.read_parquet('transactions.parquet').head()
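The same converter can also produce Avro. As a quick sanity check, here is a minimal sketch that converts again and reads the first record back with fastavro (installed in the pip step above; the output file name is arbitrary):

python3 converter.py transactions.csv transactions.avro avro
python3 -c "import fastavro; print(next(iter(fastavro.reader(open('transactions.avro', 'rb')))))"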
FROM python:3.8

RUN apt update && apt install -y git

RUN git clone https://github.com/bacalhau-project/Sparkov_Data_Generation

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsacex/csv-to-arrow-or-parquet .
docker push <hub-user>/<repo-name>:<tag>
docker push jsacex/csv-to-arrow-or-parquet
export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W  \
    --wait \
    --id-only \
    --output outputs:/outputs \
    --publisher ipfs \
    jsacex/csv-to-arrow-or-parquet \
    -- python3 src/converter.py ../inputs/transactions.csv  /outputs/transactions.parquet parquet)
name: Convert CSV To Parquet Or Avro
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/csv-to-arrow-or-parquet
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 src/converter.py ../inputs/transactions.csv  ../outputs/transactions.parquet parquet
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W"
bacalhau job run convertcsv.yaml
bacalhau job list --id-filter ${JOB_ID} 
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs

transactions.parquet
import pandas as pd
import os
pd.read_parquet('results/outputs/transactions.parquet')

Speech Recognition using Whisper

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Creators are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. In this example, we will transcribe an audio clip locally, containerize the script and then run the container on Bacalhau.

The advantage of using Bacalhau over managed Automatic Speech Recognition services is that you can run your own containers, which can scale to batch process petabytes of video or audio for automatic speech recognition.

bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

To get started, you need to install:

  1. Whisper.

  2. PyTorch.

  3. Python Pandas.

pip install git+https://github.com/openai/whisper.git
pip install torch==1.10.1
pip install --upgrade  pandas
sudo apt update && sudo apt install ffmpeg

Before we create and run the script we need a sample audio file to test the code. For that we download a sample audio clip:

wget https://github.com/js-ts/hello/raw/main/hello.mp3

We will create a script that accepts parameters (input file path, output file path, temperature, etc.) and sets sensible defaults. If the input file is in mp4 format, the script first converts it to wav. The transcript can be saved in various formats. Finally, the large model is loaded and the required parameters are passed to it.

This model is not limited to English or to transcription only; it supports many other languages and can also translate them into English.

Next, let's create an openai-whisper script:

#content of the openai-whisper.py file

import argparse
import os
import sys
import warnings
import whisper
from pathlib import Path
import subprocess
import torch
import shutil
import numpy as np
parser = argparse.ArgumentParser(description="OpenAI Whisper Automatic Speech Recognition")
parser.add_argument("-l",dest="audiolanguage", type=str,help="Language spoken in the audio, use Auto detection to let Whisper detect the language. Select from the following languages['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba'] ",default="English")
parser.add_argument("-p",dest="inputpath", type=str,help="Path of the input file",default="/hello.mp3")
parser.add_argument("-v",dest="typeverbose", type=str,help="Whether to print out the progress and debug messages. ['Live transcription', 'Progress bar', 'None']",default="Live transcription")
parser.add_argument("-g",dest="outputtype", type=str,help="Type of file to generate to record the transcription. ['All', '.txt', '.vtt', '.srt']",default="All")
parser.add_argument("-s",dest="speechtask", type=str,help="Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`). ['transcribe', 'translate']",default="transcribe")
parser.add_argument("-n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("-t",dest="decodingtemperature", type=int,help="Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.",default=0.15 )
parser.add_argument("-b",dest="beamsize", type=int,help="Number of Images",default=5)
parser.add_argument("-o",dest="output", type=str,help="Output Folder where to store the outputs",default="")

args=parser.parse_args()
device = torch.device('cuda:0')
print('Using device:', device, file=sys.stderr)

Model = 'large'
whisper_model =whisper.load_model(Model)
video_path_local = os.getcwd()+args.inputpath
file_name=os.path.basename(video_path_local)
output_file_path=args.output

if os.path.splitext(video_path_local)[1] == ".mp4":
    video_path_local_wav =os.path.splitext(file_name)[0]+".wav"
    result  = subprocess.run(["ffmpeg", "-i", str(video_path_local), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local_wav)])

# add language parameters
# Language spoken in the audio, use Auto detection to let Whisper detect the language.
#  ['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba']
language = args.audiolanguage
# Whether to print out the progress and debug messages.
# ['Live transcription', 'Progress bar', 'None']
verbose = args.typeverbose
#  Type of file to generate to record the transcription.
# ['All', '.txt', '.vtt', '.srt']
output_type = args.outputtype
# Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`).
# ['transcribe', 'translate']
task = args.speechtask
# Temperature to use for sampling.
temperature = args.decodingtemperature
#  Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.
temperature_increment_on_fallback = 0.2
#  Number of candidates when sampling with non-zero temperature.
best_of = 5
#  Number of beams in beam search, only applicable when temperature is zero.
beam_size = args.beamsize
# Optional patience value to use in beam decoding, as in [*Beam Decoding with Controlled Patience*](https://arxiv.org/abs/2204.05424), the default (1.0) is equivalent to conventional beam search.
patience = 1.0
# Optional token length penalty coefficient (alpha) as in [*Google's Neural Machine Translation System*](https://arxiv.org/abs/1609.08144); set to a negative value to use simple length normalization.
length_penalty = -0.05
# Comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations.
suppress_tokens = "-1"
# Optional text to provide as a prompt for the first window.
initial_prompt = ""
# if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.
condition_on_previous_text = True
#  whether to perform inference in fp16.
fp16 = True
#  If the gzip compression ratio is higher than this value, treat the decoding as failed.
compression_ratio_threshold = 2.4
# If the average log probability is lower than this value, treat the decoding as failed.
logprob_threshold = -1.0
# If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence.
no_speech_threshold = 0.6

verbose_lut = {
    'Live transcription': True,
    'Progress bar': False,
    'None': None
}

args = dict(
    language = (None if language == "Auto detection" else language),
    verbose = verbose_lut[verbose],
    task = task,
    temperature = temperature,
    temperature_increment_on_fallback = temperature_increment_on_fallback,
    best_of = best_of,
    beam_size = beam_size,
    patience=patience,
    length_penalty=(length_penalty if length_penalty>=0.0 else None),
    suppress_tokens=suppress_tokens,
    initial_prompt=(None if not initial_prompt else initial_prompt),
    condition_on_previous_text=condition_on_previous_text,
    fp16=fp16,
    compression_ratio_threshold=compression_ratio_threshold,
    logprob_threshold=logprob_threshold,
    no_speech_threshold=no_speech_threshold
)

temperature = args.pop("temperature")
temperature_increment_on_fallback = args.pop("temperature_increment_on_fallback")
if temperature_increment_on_fallback is not None:
    temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))
else:
    temperature = [temperature]

if Model.endswith(".en") and args["language"] not in {"en", "English"}:
    warnings.warn(f"{Model} is an English-only model but receipted '{args['language']}'; using English instead.")
    args["language"] = "en"

video_transcription = whisper.transcribe(
    whisper_model,
    str(video_path_local),
    temperature=temperature,
    **args,
)

# Save output
writing_lut = {
    '.txt': whisper.utils.write_txt,
    '.vtt': whisper.utils.write_vtt,
    '.srt': whisper.utils.write_srt,
}

if output_type == "All":
    for suffix, write_suffix in writing_lut.items():
        transcript_local_path =os.getcwd()+output_file_path+'/'+os.path.splitext(file_name)[0] +suffix
        with open(transcript_local_path, "w", encoding="utf-8") as f:
            write_suffix(video_transcription["segments"], file=f)
        try:
            transcript_drive_path =file_name
        except:
            print(f"**Transcript file created: {transcript_local_path}**")
else:
    transcript_local_path =output_file_path+'/'+os.path.splitext(file_name)[0] +output_type

    with open(transcript_local_path, "w", encoding="utf-8") as f:
        writing_lut[output_type](video_transcription["segments"], file=f)

Let's run the script with the default parameters:

python openai-whisper.py

To view the outputs, execute following:

cat hello.srt
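Because the default output type is All, the script also writes .txt and .vtt transcripts next to the .srt file, so you can inspect those in the same way:

cat hello.txt
cat hello.vtt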

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

FROM  pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

WORKDIR /

RUN apt-get -y update

RUN apt-get -y install git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow

RUN pip install git+https://github.com/openai/whisper.git

ADD hello.mp3 hello.mp3

ADD openai-whisper.py openai-whisper.py

RUN python openai-whisper.py

We choose pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime as our base image.

Then we install all the dependencies. After that, we add the test audio file and our openai-whisper script to the container, and run a test command to check that the script works inside the container and that the container builds successfully.

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

  1. repo-name with the name of the container, you can name it anything you want

  2. tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/whisper .

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/whisper

After the dataset has been uploaded, copy the CID:

bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu
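The upload step itself is not shown here. As a minimal sketch, assuming you have a local IPFS node (any pinning service works just as well), you could add the media file and note the CID it prints; -w wraps the file in a directory so it can be mounted at /inputs/<filename>:

ipfs add -w Apollo_11_moonwalk_montage_720p.mp4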

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu flag mounts the CID which contains our file to the container at the path /inputs

  4. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  5. jsacex/whisper: the name and the tag of the docker image we are using

  6. python openai-whisper.py: execute the script with the following parameters:

    1. -p inputs/Apollo_11_moonwalk_montage_720p.mp4 : the input path of our file

    2. -o outputs: the path where to store the outputs

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

name: Speech Recognition using Whisper
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/whisper:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c   
          - python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs
    Resources:
      GPU: "1"

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Now you can find the file in the results/outputs folder. To view it, run the following command:

cat results/outputs/Apollo_11_moonwalk_montage_720p.vtt

Video Processing

Introduction

Many data engineering workloads are embarrassingly parallel: you want to run the same simple operation over a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.

Prerequisite​

Upload the Data to IPFS​

This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j.
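The upload itself is not shown here. A minimal sketch, assuming a local IPFS node and a folder of .mp4 files called ./videos (both are assumptions); the final line printed is the directory CID to pass to -i ipfs://<CID>:

ipfs add -r ./videos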

Running a Bacalhau Job​

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
    linuxserver/ffmpeg \
    -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}' )

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. linuxserver/ffmpeg: the name of the docker image we are using to resize the videos

  4. -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}': the command that will be executed inside the container. It uses find to locate all files with the extension ".mp4" within /inputs and then uses ffmpeg to resize each found file to 72 pixels in height, saving the results in the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Video Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: linuxserver/ffmpeg
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/inputs"
      Source:
        Type: "s3"
        Params:
          Bucket: "bacalhau-video-processing"
          Key: "*"
          Region: "us-east-1"

The job description should be saved in .yaml format, e.g. video.yaml, and then run with the command:

bacalhau job run video.yaml

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results

Viewing your Job Output​

To view the results open the results/outputs/ folder.
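For example, you can list the resized videos that were written there:

ls -lh results/outputs/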

Support​

Running Inference on Dolly 2.0 Model with Hugging Face

Introduction​

Dolly 2.0 is a groundbreaking, open-source, instruction-following Large Language Model (LLM) that has been fine-tuned on a human-generated instruction dataset and is licensed for both research and commercial purposes. Developed using the EleutherAI Pythia model family, this 12-billion-parameter language model is built exclusively on a high-quality, human-generated instruction-following dataset contributed by Databricks employees.

The Dolly 2.0 package is open source, including the training code, dataset, and model weights, all available for commercial use. This empowers organizations to create, own, and customize robust LLMs capable of engaging in human-like interactions, without paying API access fees or sharing data with third parties.

Running locally​

Prerequisites​

  1. A NVIDIA GPU

  2. Python

Installing dependencies​

pip -q install git+https://github.com/huggingface/transformers # need to install from github
pip -q install --upgrade accelerate # ensure you are using version 0.12.0 or higher

Create an inference.py file with following code:

# content of the inference.py file
import argparse
import torch
from transformers import pipeline

def main(prompt_string, model_version):

    # use dolly-v2-12b if you're using Colab Pro+, using pythia-2.8b for Free Colab
    generate_text = pipeline(model=model_version, 
                            torch_dtype=torch.bfloat16, 
                            trust_remote_code=True,
                            device_map="auto")

    print(generate_text(prompt_string))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", type=str, required=True, help="The prompt to be used in the GPT model")
    parser.add_argument("--model_version", type=str, default="./databricks/dolly-v2-12b", help="The model version to be used")
    args = parser.parse_args()
    main(args.prompt, args.model_version)

Building the container (optional)​

FROM huggingface/transformers-pytorch-deepspeed-nightly-gpu
RUN apt-get update -y
RUN pip -q install git+https://github.com/huggingface/transformers
RUN pip -q install "accelerate>=0.12.0"
COPY ./inference.py .
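If you do build the container yourself, you can tag and push it following the same placeholder pattern used in the other examples (replace <hub-user>, <repo-name> and <tag> with your own values; this step is optional, since a prebuilt image is used below):

docker build -t <hub-user>/<repo-name>:<tag> .
docker push <hub-user>/<repo-name>:<tag>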

Running Inference on Bacalhau​

Prerequisite​

Structure of the command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. bacalhau docker run: Run a job using docker executor.

  3. --gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.

  4. -w /inputs: Flag to set the working directory inside the container to /inputs.

  5. -i gitlfs://huggingface.co/databricks/dolly-v2-3b.git: Flag to clone the Dolly V2-3B model from Hugging Face's repository using Git LFS. The files will be mounted to /inputs/databricks/dolly-v2-3b.

  6. -i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py: Flag to download the inference.py script from the provided URL. The file will be mounted to /inputs/inference.py.

  7. jsacex/dolly_inference:latest: The name and the tag of the Docker image.

  8. The command to run inference on the model: python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b". It consists of:

    1. inference.py: The Python script that runs the inference process using the Dolly V2-3B model.

    2. --prompt "Where is Earth located ?": Specifies the text prompt to be used for the inference.

    3. --model_version "./databricks/dolly-v2-3b": Specifies the path to the Dolly V2-3B model. In this case, the model files are mounted to /inputs/databricks/dolly-v2-3b.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --id-only \
    -w /inputs \
    -i gitlfs://huggingface.co/databricks/dolly-v2-3b.git \
    -i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py \
    jsacex/dolly_inference:latest \
    -- python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b")

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished, we can see the results in the results/outputs folder.
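Since inference.py prints the generated text to standard output rather than writing a file, the response will typically be in the captured stdout that bacalhau job get downloads alongside the outputs (the path follows the layout used in the other examples):

cat results/stdout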

EasyOCR (Optical Character Recognition) on Bacalhau

Introduction​

TL;DR​

bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True

Running Easy OCR Locally​​

Install the required dependencies

pip install --upgrade easyocr

Load the different example images

npx degit JaidedAI/EasyOCR/examples -f

List all the images. You'll see an output like this:

ls -l

total 3508
-rw-r--r-- 1 root root   59898 Jun 16 22:36 chinese.jpg
-rw-r--r-- 1 root root   97910 Jun 16 22:36 easyocr_framework.jpeg
-rw-r--r-- 1 root root 1740957 Jun 16 22:36 english.png
-rw-r--r-- 1 root root  487995 Jun 16 22:36 example2.png
-rw-r--r-- 1 root root  127454 Jun 16 22:36 example3.png
-rw-r--r-- 1 root root  488641 Jun 16 22:36 example.png
-rw-r--r-- 1 root root  168376 Jun 16 22:36 french.jpg
-rw-r--r-- 1 root root   42159 Jun 16 22:36 japanese.jpg
-rw-r--r-- 1 root root  225531 Jun 16 22:36 korean.png
drwxr-xr-x 1 root root    4096 Jun 15 13:37 sample_data
-rw-r--r-- 1 root root   82229 Jun 16 22:36 thai.jpg
-rw-r--r-- 1 root root   34706 Jun 16 22:36 width_ths.png

Next, we create a reader to perform OCR. For each piece of detected text it returns the coordinates of a bounding rectangle along with the recognized text itself:

import easyocr
reader = easyocr.Reader(['th','en'])
# Doing OCR. Get bounding boxes.
bounds = reader.readtext('thai.jpg')
bounds

Containerize your Script using Docker​

git clone https://github.com/JaidedAI/EasyOCR
cd EasyOCR

Build the Container​

The docker build command builds Docker images from a Dockerfile.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

  1. repo-name with the name of the container, you can name it anything you want

  2. tag this is not required but you can use the latest tag

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

docker push <hub-user>/<repo-name>:<tag>

Running a Bacalhau Job to Generate Easy OCR output​

Prerequisite​

Now that we have an image on Docker Hub (either your own or the example image from this guide), we can use the container to run a job on Bacalhau.

Structure of the imperative command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. -i ipfs://bafybeibvc...... Mounts the model from IPFS

  6. -i https://raw.githubusercontent.com... Mounts the Input Image from a URL

  7. jsacex/easyocr the name and the tag of the docker image we are using

  8. -- easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True: execute the script with the following parameters:

    1. -l ch_sim: the name of the model

    2. -f ./inputs/chinese.jpg: path to the input Image or directory

    3. --detail=1: level of detail

    4. --gpu=True: we set this flag to true since we are running inference on a GPU. If you run this on a CPU, set this flag to false

export JOB_ID=$(bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: EasyOCR
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/easyocr" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True
    InputSources:
    - Source:
        Type: "urlDownload"
        Params:
          URL: "https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg"
      Target: "/inputs/chinese.jpg"
    - Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"
      Target: "/root/.EasyOCR/model/zh_sim_g2.pth"
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. easyocr.yaml, and then run with the command:

bacalhau job run easyocr.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder. You can view results by running following commands:

ls results # list the contents of the results directory
cat results/stdout # display the job's standard output

Stable Diffusion on a CPU

Introduction

This example demonstrates how to run Stable Diffusion on a CPU and submit the job to the Bacalhau demo network. The first section describes the development of the code and the container. The second section demonstrates how to run the job using Bacalhau.

This model generated the images presented on this page.

TL;DR​

bacalhau docker run ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
  -- python demo.py \
  --prompt "cod in space" \
  --output ../outputs/cod.png

Development​

Heads up! This example takes about 10 minutes to generate an image on an average CPU. Whilst this demonstrates it is possible, it might not be practical.

Prerequisites​

In order to run this example you need:

  1. A Debian-flavoured Linux (although you might be able to get it working on the newest machines)

Converting Stable Diffusion to a CPU Model Using OpenVINO​

Install Dependencies​

Note that these dependencies are only known to work on Ubuntu-based x64 machines.

sudo apt-get update
sudo apt-get install -y libgl1 libglib2.0-0 git-lfs

Clone the Repository and Dependencies​

The following commands clone the example repository, and other required repositories, and install the Python dependencies.

git clone https://github.com/js-ts/stable_diffusion.openvino
cd stable_diffusion.openvino
git lfs install
git clone https://huggingface.co/openai/clip-vit-large-patch14
git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino
pip3 install -r requirements.txt

Generate an Image​

Now that we have all the dependencies installed, we can call the demo.py wrapper, which is a simple CLI, to generate an image from a prompt.

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "hello" \
  --output hello.png

When the generation is complete, you can open the generated hello.png and see something like this:

Let's try another prompt and see what we get:

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "cat driving a car" \
  --output cat.png

Running Stable Diffusion (CPU) on Bacalhau​

Now we have a working example, we can convert it into a format that allows us to perform inference in a distributed environment.

FROM python:3.9.9-bullseye

WORKDIR /src

RUN apt-get update && \
    apt-get install -y \
    libgl1 libglib2.0-0 git-lfs

RUN git lfs install

COPY requirements.txt /src/

RUN pip3 install -r requirements.txt

COPY stable_diffusion_engine.py demo.py demo_web.py /src/
COPY data/ /src/data/

RUN git clone https://huggingface.co/openai/clip-vit-large-patch14
RUN git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino

# download models
RUN python3 demo.py --num-inference-steps 1 --prompt "test" --output /tmp/test.jpg

This container is using the python:3.9.9-bullseye image and the working directory is set. Next, the Dockerfile installs the same dependencies from earlier in this notebook. Then we add our custom code and pull the dependent repositories.

We've already pushed this image to GHCR, but for posterity, you'd use a command like this to update it:

docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 .

Prerequisites​

Generating an Image Using Stable Diffusion on Bacalhau​

To submit a job, you can use the Bacalhau CLI. The following command passes a prompt to the model and generates an image in the outputs directory.

This will take about 10 minutes to complete. Go grab a coffee. Or a beer. Or both. If you want to block and wait for the job to complete, add the --wait flag.

Furthermore, the container itself is about 15GB, so it might take a while to download on the node if it isn't cached.

Structure of the command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. bacalhau docker run: Run a job using docker executor.

  3. --id-only: Flag to print out only the job id

  4. ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1: The name and the tag of the Docker image.

  5. The command to run inference on the model: python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png. It consists of:

    1. demo.py: The Python script that runs the inference process.

    2. --prompt "First Humans On Mars": Specifies the text prompt to be used for the inference.

    3. --output ../outputs/mars.png: Specifies the path to the output image.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
    -- python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png)

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished we can see the results in the results/outputs folder.

Stable Diffusion on a GPU

TL;DR​

bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow"

Prerequisite​

Quick Test​

Here is an example of an image generated by this model.

bacalhau docker run \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "cod swimming through data"

Development​

Installing dependencies​

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet
pip install tensorflow tensorflow_addons ftfy --upgrade --quiet
pip install tqdm --upgrade
apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2

Testing the Code​

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image

generator = Text2Image( 
    img_height=512,
    img_width=512,
    jit_compile=False,  # You can try True as well (different performance profile)
)
img = generator.generate(
    "DSLR photograph of an astronaut riding a horse",
    num_steps=50,
    unconditional_guidance_scale=7.5,
    temperature=1,
    batch_size=1,
)
pil_img = Image.fromarray(img[0])
display(pil_img)

When running this code, if you check the GPU RAM usage, you'll see that it's sucked up many GBs, and depending on what GPU you're running, it may OOM (Out of memory) if you run this again.

You can try and reduce RAM usage by playing with batch sizes (although it is only set to 1 above!) or more carefully controlling the TensorFlow session.

To clear the GPU memory we will use numba. This won't be required when running in a single-shot manner.

pip install numba --upgrade
# clearing the GPU memory 
from numba import cuda 
device = cuda.get_current_device()
device.reset()

Write the Script​

#content of the main.py file

import argparse
from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image
import os
parser = argparse.ArgumentParser(description="Stable Diffusion")
parser.add_argument("--h",dest="height", type=int,help="height of the image",default=512)
parser.add_argument("--w",dest="width", type=int,help="width of the image",default=512)
parser.add_argument("--p",dest="prompt", type=str,help="Description of the image you want to generate",default="cat")
parser.add_argument("--n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("--u",dest="unconditionalGuidanceScale", type=float,help="Number of Steps",default=7.5)
parser.add_argument("--t",dest="temperature", type=int,help="Number of Steps",default=1)
parser.add_argument("--b",dest="batchSize", type=int,help="Number of Images",default=1)
parser.add_argument("--o",dest="output", type=str,help="Output Folder where to store the Image",default="./")

args=parser.parse_args()
height=args.height
width=args.width
prompt=args.prompt
numSteps=args.numSteps
unconditionalGuidanceScale=args.unconditionalGuidanceScale
temperature=args.temperature
batchSize=args.batchSize
output=args.output

generator = Text2Image(
    img_height=height,
    img_width=width,
    jit_compile=False,  # You can try True as well (different performance profile)
)

img = generator.generate(
    prompt,
    num_steps=numSteps,
    unconditional_guidance_scale=unconditionalGuidanceScale,
    temperature=temperature,
    batch_size=batchSize,
)
for i in range(0,batchSize):
  pil_img = Image.fromarray(img[i])
  pil_img.save(f"{output}/image{i}.png")

Run the Script​

After writing the code the next step is to run the script.

python3 main.py

As a result, you will get something like this:

The following presents additional parameters you can try:

  1. python main.py --p "cat with three eyes" - to set the prompt

  2. python main.py --p "cat with three eyes" --n 100 - to set the number of iterations to 100

  3. python main.py --p "cat with three eyes" --b 2 - to set the batch size to 2 (number of images to generate)

Containerize Script using Docker​

FROM tensorflow/tensorflow:2.10.0-gpu

RUN apt-get -y update

RUN apt-get -y install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2 git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow tensorflow tensorflow_addons ftfy  --upgrade --quiet

RUN pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet

ADD main.py main.py

# Run once so it downloads and caches the pre-trained weights
RUN python main.py --n 1

Build the container​

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace the following:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case:

docker build -t ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1

Running a Bacalhau Job​

Structure of the command​

To submit a job run the Bacalhau command with following structure:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. The --id-only flag is set to print only job id

  4. ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1: the name and the tag of the docker image we are using

  5. -- python main.py --o ./outputs --p "meme about tensorflow": The command to run inference on the model. It consists of:

    1. main.py path to the script

    2. --o ./outputs specifies the output directory

    3. --p "meme about tensorflow" specifies the prompt

export JOB_ID=$(
    bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow")

This will take about 5 minutes to complete and is mainly due to the cold-start GPU setup time. This is faster than the CPU version, but you might still want to grab some fruit or plan your lunchtime run.

Furthermore, the container itself is about 10GB, so it might take a while to download on the node if it isn't cached.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder:

Object Detection with YOLOv5 on Bacalhau

Introduction​

Traditionally, models like YOLO required enormous amounts of training data to yield reasonable results. People might not have access to such high-quality labeled data. Thankfully, open-source communities and researchers have made it possible to utilize pre-trained models to perform inference. In other words, you can use models that have already been trained on large datasets to perform object detection on your own data.

TL;DR​

Load your dataset into S3/IPFS, specify it and pre-trained weights via the --input flags, choose a suitable container, specify the command and path to save the results - done!
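A minimal sketch of what that looks like in practice (this mirrors the test-run command explained later in this guide; the dataset CID is a placeholder you replace with your own):

bacalhau docker run \
  --gpu 1 \
  --input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
  --input ipfs://<CID-OF-YOUR-DATASET>:/datasets \
  ultralytics/yolov5:v6.2 \
  -- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source /datasets --project /outputs'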

Prerequisite​

Test Run with Sample Data​

To get started, let's run a test job with a small sample dataset that is included in the YOLOv5 Docker Image. This will give you a chance to familiarise yourself with the process of running a job on Bacalhau.

In addition to the usual Bacalhau flags, you will also see an example of using the --gpu 1 flag in order to specify the use of a GPU.

The model requires pre-trained weights to run and by default downloads them from within the container. Bacalhau jobs don't have network access so we will pass in the weights at submission time, saving them to /usr/src/app/yolov5s.pt. You may also provide your own weights here.

The container has its own options that we must specify:

  1. --project specifies the output volume that the model will save its results to. Bacalhau defaults to using /outputs as the output directory, so we save it there.

One final additional hack that we have to do is move the weights file to a location with the standard name. As of writing this, Bacalhau downloads the file to a UUID-named file, which the model is not expecting. This is because GitHub 302 redirects the request to a random file in its backend.

Structure of the command​

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. The --timeout flag is set to make sure that if the job is not completed in the specified time, it will be terminated

  4. The --wait flag is set to wait for the job to complete before returning

  5. The --wait-timeout-secs flag is set together with --wait to define how long the application should wait for the job to complete

  6. The --id-only flag is set to print only job id

  7. The --input flags are used to specify the sources of input data

  8. -- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs' tells the model where to find input data and where to write output

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs')

This should output a UUID (like 59c59bfb-4ef8-45ac-9f4b-f0e9afd26e70), which will be stored in the environment variable JOB_ID. This is the ID of the job that was created. You can check the status of the job using the commands below.

Declarative job description​

name: Object Detection with YOLOv5
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ultralytics/yolov5:v6.2
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "find /inputs -type f -exec cp {} /outputs/yolov5s.pt \\; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Source:
          Type: "urlDownload"
          Params:
            URL: "https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt"
        Target: "/inputs"

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing Output​

After the download has finished we can see the results in the results/outputs/exp folder.

Using Custom Images as an Input​

Let's run the same job again, but this time use the images above.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
--input ipfs://bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm:/datasets \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source /datasets --project /outputs')

Just as in the example above, this should output a UUID, which will be stored in the environment variable JOB_ID. You can check the status of the job using the commands below.

Support​

Stable Diffusion Checkpoint Inference

Introduction​

This example demonstrates how to run inference on a finetuned Stable Diffusion model. The first section describes the development of the code and the container - it is optional, as users don't need to build their own containers to use their own custom model. The second section demonstrates how to convert your model weights to ckpt. The third section demonstrates how to run the job using Bacalhau.

TL;DR​

  1. Convert your existing model weights to the ckpt format and upload to the IPFS storage.

  2. Create a job using bacalhau docker run, relevant docker image, model weights and any prompt.

  3. Download results using bacalhau job get and the job id.

Prerequisite​

To get started, you need to install:

  1. NVIDIA GPU

  2. CUDA drivers

  3. NVIDIA docker

Running Locally​

Containerize your Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions to containerize the code for inference.

FROM  pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime

WORKDIR /

RUN apt update &&  apt install -y git

RUN git clone https://github.com/runwayml/stable-diffusion.git

WORKDIR /stable-diffusion

RUN conda env create -f environment.yaml

SHELL ["conda", "run", "-n", "ldm", "/bin/bash", "-c"]

RUN pip install opencv-python

RUN apt update

RUN apt-get install ffmpeg libsm6 libxext6 libxrender-dev  -y

This container is using the pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime image and the working directory is set. Next the Dockerfile installs required dependencies. Then we add our custom code and pull the dependent repositories.

Build the container​

We will run docker build command to build the container.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

So in our case, the command will look like this:

docker build -t jsacex/stable-diffusion-ckpt .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

Thus, in this case, the command would look this way:

docker push jsacex/stable-diffusion-ckpt

After the repo image has been pushed to Docker Hub, you can now use the container for running on Bacalhau. But before that, you need to check whether your model is a ckpt file. If it is, you can skip to running on Bacalhau; if not, the next section describes how to convert your model into the ckpt format.

Converting model weights to CKPT​

To download the convert script:

wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

To convert the model weights into ckpt format, the --half flag cuts the size of the output model from 4GB to 2GB:

python3 convert_diffusers_to_original_stable_diffusion.py \
    --model_path <path-to-the-model-weights>  \
    --checkpoint_path <path-to-save-the-checkpoint>/model.ckpt \
    --half

Running a Bacalhau Job​

After the checkpoint file has been uploaded, copy its CID.
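If you haven't uploaded the checkpoint yet, one way to do it, assuming you have the IPFS CLI installed and a daemon running (a pinning service works just as well), is:

ipfs add model.ckpt
# prints a line like: added <CID> model.ckpt
# copy the CID and use it in the -i flag of the job below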

Structure of the command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. -i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt: Path to mount the checkpoint

  4. -- conda run --no-capture-output -n ldm: since we are using conda we need to specify the name of the environment which we are going to use, in this case it is ldm

  5. scripts/txt2img.py: running the python script

  6. --prompt "a photo of a person drinking coffee": the prompt you need to specify the session name in the prompt.

  7. --plms: the sampler you want to use. In this case we will use the plms sampler

  8. --ckpt ../model.ckpt: here we specify the path to our checkpoint

  9. --n_samples 1: the number of samples we want to produce

  10. --skip_grid: skip creating a grid of images

  11. --outdir ../outputs: path to store the outputs

  12. --seed $RANDOM: with a fixed seed, the output generated for the same prompt will always be the same; to get different outputs for the same prompt, set the seed parameter to a random value

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait-timeout-secs 3600 \
--wait \
--id-only \
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt \
jsacex/stable-diffusion-ckpt \
-- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of a person drinking coffee" --plms --ckpt ../model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs) 

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished we can see the results in the results/outputs folder. We received the following image for our prompt:

Generate Realistic Images using StyleGAN3 and Bacalhau

Introduction

TL;DR​

bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl

Prerequisite​

Running StyleGAN3 locally​

To run StyleGAN3 locally, you'll need to clone the repo, install dependencies and download the model weights.

git clone https://github.com/NVlabs/stylegan3
cd stylegan3
conda env create -f environment.yml
conda activate stylegan3
wget https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl

Now you can generate an image using a pre-trained AFHQv2 model. Here is an example of the image we generated:
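For reference, the generation command looks along these lines (the same flags are used in the Bacalhau job later in this guide; the local output directory name is arbitrary):

python gen_images.py --outdir=out --trunc=1 --seeds=2 \
    --network=stylegan3-r-afhqv2-512x512.pkl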

Containerize Script with Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

FROM nvcr.io/nvidia/pytorch:21.08-py3

COPY . /scratch

WORKDIR /scratch

ENV HOME /scratch

Build the container​

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case:

docker build -t jsacex/stylegan3 .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/stylegan3 

Running a Bacalhau Job​

Structure of the command​

To submit a job run the Bacalhau command with following structure:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to Bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. jsacex/stylegan3: the name and the tag of the docker image we are using

  6. python gen_images.py: execute the script with following parameters:

    1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: --trunc sets the truncation psi, --seeds selects the random seed(s) used to generate the image(s), and --network points to the pre-trained network pickle downloaded earlier.

    2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl)

Declarative job description​

name: StyleGAN3
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/stylegan3" 
        Parameters:
          - python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. stylegan3.yaml, and then run with the command:

bacalhau job run stylegan3.yaml

Render a latent vector interpolation video​

You can also run variations of this command to generate videos and other things. In the following command below, we will render a latent vector interpolation video. This will render a 4x2 grid of interpolations for seeds 0 through 31.

Structure of the command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. jsacex/stylegan3 the name and the tag of the docker image we are using

  6. python gen_images.py: execute the script with following parameters:

    1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: The animation length is either determined based on the --seeds value or explicitly specified using the --num-keyframes option. When num keyframes is specified with --num-keyframes, the output video length will be num_keyframes * w_frames frames. If --num-keyframes is not specified, the number of seeds given with --seeds must be divisible by grid size W*H (--grid). In this case, the output video length will be # seeds/(w*h)*w_frames frames.

    2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    jsacex/stylegan3 \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -- python gen_video.py --output=../outputs/lerp.mp4 --trunc=1 \
        --seeds=0-31 \
        --grid=4x2 \
        --network=stylegan3-r-afhqv2-512x512.pkl)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder.

Support

Running Inference on a Model stored on S3

Introduction​

In this example, we will demonstrate how to run inference on a model stored on Amazon S3. We will use a PyTorch model trained on the MNIST dataset.

Running Locally​

Prerequisites​

Consider using the latest versions or use the docker method listed below in the article.

  1. Python

  2. PyTorch

Downloading the Datasets​

Use the following commands to download the model and test image:

wget https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz
wget https://raw.githubusercontent.com/js-ts/mnist-test/main/digit.png

Creating the Inference Script​

This script is designed to load a pretrained PyTorch model for MNIST digit classification from a tar.gz file, extract it, and use the model to perform inference on a given input image. Ensure you have all required dependencies installed:

pip install Pillow torch torchvision
# content of the inference.py file
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.autograd import Variable
import argparse
import tarfile

class CustomModel(torch.nn.Module):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, 5)
        self.conv2 = torch.nn.Conv2d(10, 20, 5)
        self.fc1 = torch.nn.Linear(320, 50)
        self.fc2 = torch.nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        output = torch.log_softmax(x, dim=1)
        return output

def extract_tar_gz(file_path, output_dir):
    with tarfile.open(file_path, 'r:gz') as tar:
        tar.extractall(path=output_dir)

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--tar_gz_file_path', type=str, required=True, help='Path to the tar.gz file')
parser.add_argument('--output_directory', type=str, required=True, help='Output directory to extract the tar.gz file')
parser.add_argument('--image_path', type=str, required=True, help='Path to the input image file')
args = parser.parse_args()

# Extract the tar.gz file
tar_gz_file_path = args.tar_gz_file_path
output_directory = args.output_directory
extract_tar_gz(tar_gz_file_path, output_directory)

# Load the model
model_path = f"{output_directory}/model.pth"
model = CustomModel()
model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))
model.eval()

# Transformations for the MNIST dataset
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Function to run inference on an image
def run_inference(image, model):
    image_tensor = transform(image).unsqueeze(0)  # Apply transformations and add batch dimension
    input = Variable(image_tensor)

    # Perform inference
    output = model(input)
    _, predicted = torch.max(output.data, 1)
    return predicted.item()

# Example usage
image_path = args.image_path
image = Image.open(image_path)
predicted_class = run_inference(image, model)
print(f"Predicted class: {predicted_class}")

Running the Inference Script​

To use this script, you need to provide the paths to the tar.gz file containing the pre-trained model, the output directory where the model will be extracted, and the input image file for which you want to perform inference. The script will output the predicted digit (class) for the given input image.

python3 inference.py --tar_gz_file_path ./model.tar.gz --output_directory ./model --image_path ./digit.png

Running Inference on Bacalhau​

Prerequisite​

Structure of the Command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. -w /inputs Set the current working directory at /inputs in the container

  3. -i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1: Mount the s3 bucket at the destination path provided - /model/ and specifying the region where the bucket is located opt=region=us-east-1

  4. -i git://github.com/js-ts/mnist-test.git: Flag to mount the source code repo from GitHub. It would mount the repo at /inputs/js-ts/mnist-test in this case it also contains the test image

  5. pytorch/pytorch: The name of the Docker image

  6. -- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png: The command to run inference on the model. It consists of:

    1. /model/model.tar.gz is the path to the model file

    2. /model-pth is the output directory for the model

    3. /inputs/js-ts/mnist-test/image.png is the path to the input image

export JOB_ID=$(bacalhau docker run \
--wait \
--id-only \
--timeout 3600 \
--wait-timeout-secs 3600 \
-w /inputs \
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1 \
-i git://github.com/js-ts/mnist-test.git \
pytorch/pytorch \
 -- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png)

When the job is submitted Bacalhau prints out the related job id. We store that in an environment variable JOB_ID so that we can reuse it later on.

Viewing the Output​

Use the bacalhau job logs command to view the job output, since the script prints the result of execution to the stdout:

bacalhau job logs ${JOB_ID}

Predicted class: 0

You can also use bacalhau job get to download job results:

bacalhau job get ${JOB_ID}

Support

Training Pytorch Model with Bacalhau

Introduction

In this example tutorial, we will show you how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.

TL;DR​

bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model

Prerequisite​

Training the Model Locally​

git clone https://github.com/pytorch/examples

Install the following:

pip install --upgrade torch torchvision

Next, we run the command below to begin the training of the mnist_rnn model. We added the --save-model flag to save the model

python ./examples/mnist_rnn/main.py --save-model

The downloaded MNIST dataset will be saved in the data folder.

Uploading Dataset to IPFS​
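The upload commands are not shown here; one common approach, assuming you have the IPFS CLI installed (any pinning service works just as well), is to add the data folder recursively and note the CID printed for the top-level directory:

ipfs add -r ./data
# the last line printed looks like: added <CID> data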

Running a Bacalhau Job​

Now that the dataset is uploaded to IPFS, we can run the training job on Bacalhau. To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model)

Structure of the command​

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. pytorch/pytorch: Using the official pytorch Docker image

  5. The -i ipfs://QmdeQjz1HQQd.....: flag is used to mount the uploaded dataset

  6. -w /outputs: Our working directory is /outputs. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs

  7. python ../inputs/main.py --save-model: the training script, which the URL input mounts into the /inputs folder of the container, is run from the /outputs working directory with the --save-model flag

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Training Pytorch Model
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "pytorch/pytorch" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python ../inputs/main.py --save-model
    InputSources:
      - Source:
          Type: "ipfs"
          Params:
            CID: "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
        Target: /data
      - Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py
        Target: /inputs  
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. torch.yaml, and then run with the command:

bacalhau job run torch.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find results in the results/outputs folder. To view them, run the following command:

ls results/ # list the contents of the current directory 
cat results/stdout # displays the contents of the file given to it as a parameter.
ls results/outputs/ # list the successfully trained model

Training Tensorflow Model

Introduction

Training TensorFlow models Locally​

TensorFlow 2 quickstart for beginners​

  1. Load a prebuilt dataset.

  2. Build a neural network machine learning model that classifies images.

  3. Train this neural network.

  4. Evaluate the accuracy of the model.

Set up TensorFlow​

Import TensorFlow into your program to check whether it is installed

import tensorflow as tf
import os
print("TensorFlow version:", tf.__version__)

# Download the dataset to /inputs first (shell commands, run outside Python):
#   mkdir /inputs
#   wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz -O /inputs/mnist.npz
mnist = tf.keras.datasets.mnist

CWD = '' if os.getcwd() == '/' else os.getcwd()
(x_train, y_train), (x_test, y_test) = mnist.load_data('/inputs/mnist.npz')
x_train, x_test = x_train / 255.0, x_test / 255.0

Build a machine-learning model​

Build a tf.keras.Sequential model by stacking layers.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
predictions

The tf.nn.softmax function converts these logits to probabilities for each class:

tf.nn.softmax(predictions).numpy()

Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.

loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

Train and evaluate your model​

Use the Model.fit method to adjust your model parameters and minimize the loss:

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test,  y_test, verbose=2)

If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:

probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])
probability_model(x_test[:5])
os.makedirs('/outputs', exist_ok=True)  # create the output directory

The following method can be used to save the model as a checkpoint

model.save_weights('/outputs/checkpoints/my_checkpoint')
print(os.listdir('/outputs'))  # verify the checkpoint files were written

Running on Bacalhau​

The dataset and the script are mounted into the TensorFlow container using a URL; we then run the script inside the container.
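An imperative version of this job might look like the sketch below; it is assembled from the declarative description that follows (the dataset and train.py URLs, the tensorflow/tensorflow image and the /inputs working directory are all taken from that YAML):

bacalhau docker run \
    --gpu 1 \
    -w /inputs \
    -i https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz \
    -i https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py \
    tensorflow/tensorflow \
    -- python train.py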

Declarative job description​

name: Training ML model using tensorflow
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: "tensorflow/tensorflow" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python train.py
    InputSources:
      - Source:
          Type: urlDownload
          Params:
            URL: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
        Target: /inputs
      - Source:
          Type: urlDownload
          Params:
            URL: https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py
        Target: /inputs
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. tensorflow.yaml, and then run with the command:

bacalhau job run tensorflow.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the files in the results/outputs folder. To list them, run the following command:

ls results/outputs/

Support​

Stable Diffusion Dreambooth (Finetuning)

Stable Diffusion has revolutionized text-to-image models by producing high-quality images from a prompt. Dreambooth is an approach for personalizing text-to-image diffusion models: given a handful of images of a subject, we can fine-tune a pretrained text-to-image model on that subject.

Dreambooth makes Stable Diffusion even more powerful, with the ability to generate realistic-looking pictures of humans, animals or any other subject after training on just 20-30 images.

In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.

The following command generates the following:

  • Subject: SBF

  • Prompt: a photo of SBF without hair

bacalhau docker run \
 --gpu 1 \
 --timeout 3600 \
 --wait-timeout-secs 3600 \
  -i ipfs://QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy  \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of sbf man" "a photo of man" 3000 "/man" "/model"
bacalhau docker run \
 --gpu 1 \
  -i ipfs://QmUEJPr5pfV6tRzWQuNSSb3wdcN6tRQS5tdk3dYSCJ55Xs:/SBF.ckpt \
   jsacex/stable-diffusion-ckpt \
   -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of sbf without hair" --plms --ckpt ../SBF.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs 

Output:

Building this container requires you to have a supported GPU with 16GB+ of memory, since the process can be resource intensive.

We will create a Dockerfile and add the desired configuration to the file. The following commands specify how the image will be built and what extra requirements will be included:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel

WORKDIR /

# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers

RUN apt update && apt install wget git unzip -y

RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py

RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py

RUN pip install -qq git+https://github.com/TheLastBen/diffusers

RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort

# Install xformers

RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip

RUN unzip -j woman.zip -d woman

RUN unzip -j man.zip -d man

RUN unzip -j mix.zip -d mix

This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and the working directory is set. Next, we add our custom code and pull the dependent repositories.

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:-/model} \
  --tokenizer_name=${7:-/model}/tokenizer/ \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

The shell script is there to make things much simpler, since the command to train the model needs many parameters and the model weights must later be converted to a checkpoint. You can edit this script and add in your own parameters.

To download the models and run a test job in the Docker file, copy the following:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
WORKDIR /
# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers
RUN apt update && apt install wget git unzip -y
RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py
RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py
RUN pip install -qq git+https://github.com/TheLastBen/diffusers
RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort
# Install xformers
RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl
# You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work.
# https://huggingface.co/settings/tokens
RUN mkdir -p ~/.huggingface
RUN echo -n "<your-hugging-face-token>" > ~/.huggingface/token
# copy the test dataset from a local file
# COPY jfk /jfk

# Download and extract the test dataset
RUN wget https://github.com/js-ts/test-images/raw/main/jfk.zip
RUN unzip -j jfk.zip -d jfk
RUN mkdir model
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip
RUN unzip -j woman.zip -d woman
RUN unzip -j man.zip -d man
RUN unzip -j mix.zip -d mix

RUN  accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_starting_step=5\
  --stop_text_encoder_training=31 \
  --class_data_dir=/man \
  --save_n_steps=5 \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/jfk" \
  --output_dir="/model" \
  --instance_prompt="a photo of jfk man" \
  --class_prompt="a photo of man" \
  --instance_prompt="" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=30

COPY finetune.sh /finetune.sh

RUN wget -q https://gist.githubusercontent.com/js-ts/624fecc3fff807d4948688cb28993a94/raw/fd69ac084debe26a815485c1f363b8a45566f1ba/clear_mem.py
# Removing your token
RUN rm -rf  ~/.huggingface/token

Then execute finetune.sh with the following commands:

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:-/model} \
  --tokenizer_name=${7:-/model}/tokenizer/ \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want.

  2. tag: this is not required, but you can use the latest tag

Now you can push this repository to the registry designated by its name or tag.

docker push <hub-user>/<repo-name>:<tag>

The optimal dataset size is between 20-30 images. You can choose the images of the subject in different positions, full body images, half body, pictures of the face etc.

Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:

Subject Name.jpg, Subject Name (2).jpg ... Subject Name (n).jpg
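If your source photos are not already 512x512, one way to center-crop and resize them, assuming ImageMagick is installed, is:

# resize every JPEG in the current directory to a centered 512x512 crop
mogrify -resize 512x512^ -gravity center -extent 512x512 *.jpg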

After the Subject dataset is created we upload it to IPFS.

To upload your dataset using NFTup, just drag and drop your directory and it will be uploaded to IPFS:

After the dataset has been uploaded, copy its CID, which will look like this:

bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a

Since there are a lot of combinations that you can try, processing a finetuned model can take 1hr+ to complete. Here are a few approaches that you can try based on your requirements:

  1. bacalhau docker run: call to bacalhau

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a Mounts the data from IPFS via its CID

  4. jsacex/dreambooth:latest Name and tag of the docker image we are using

  5. -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man": execute the script with the following parameters:

    1. /inputs Path to the subject Images

    2. /outputs Path to save the generated outputs

    3. "a photo of < name of the subject > < class >" -> "a photo of David Aronchick man" Subject name along with class

    4. "a photo of < class >" -> "a photo of man" Name of the class

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> man" "a photo of man" 3000 "/man" "/model"

The number of iterations is 3000. This number should be the number of subject images x 100, so if there are 30 images, it would be 3000. It takes around 32 minutes on a V100 for 3000 iterations, but you can increase or decrease the number based on your requirements.
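For example, a quick way to derive the step count from your dataset size:

NUM_IMAGES=30
MAX_TRAIN_STEPS=$((NUM_IMAGES * 100))  # 3000 for a 30-image dataset
echo $MAX_TRAIN_STEPS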

Here is our command with our parameters replaced:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a \
  --wait \
  --id-only \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man" "/model"

If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.

Use the /woman class images

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> woman" "a photo of woman" 3000 "/woman"  "/model"

Here you can provide your own regularization images or use the mix class.

Use the /mix class images if the class of the subject is mix

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> mix" "a photo of mix" 3000 "/mix"  "/model"

You can upload the model to IPFS and then create a gist, then mount the model and the script into the lightweight container:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a:/aronchick \
  -i ipfs://<CID-OF-THE-MODEL>:/model \
  -i https://gist.githubusercontent.com/js-ts/54b270a36aa3c35fdc270640680b3bd4/raw/7d8e8fa47bc3811ef63772f7fc7f4360aa9d51a8/finetune.sh \
  --wait \
  --id-only \
  jsacex/dreambooth:lite \
  -- bash /inputs/finetune.sh /aronchick /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"

When a job is submitted, Bacalhau prints out the related job_id. Use the export JOB_ID=$(bacalhau docker run ...) wrapper to store that in an environment variable so that we can reuse it later on.

name: Stable Diffusion Dreambooth Finetuning
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/dreambooth:full" 
        Parameters:
          - bash finetune.sh /inputs /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"
    InputSources:
      - Target: "/inputs/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy"
    Resources:
      GPU: "1"

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Now you can find the file in the results/outputs folder. You can view the results by running the following commands:

ls results # list the contents of the current directory 

In the next steps, we will be doing inference on the finetuned model

Bacalhau currently doesn't support mounting subpaths of a CID, so instead of mounting just the model.ckpt file we would have to mount the whole output CID, which is 6.4GB and might result in errors like FAILED TO COPY /inputs. So you have to manually copy the CID of the model.ckpt file, which is about 2GB.

To get the CID of the model.ckpt file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/. For example:

https://gateway.ipfs.io/ipfs/QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV/outputs/
ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Or you can use the IPFS CLI:

ipfs ls QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Copy the link of model.ckpt highlighted in the box:

https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt

Then extract the CID portion of the link and copy it.
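One way to pull the CID out of that gateway link, assuming a POSIX shell with sed available:

LINK="https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt"
echo "$LINK" | sed -E 's|.*/ipfs/([^/?]+).*|\1|'
# QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg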

To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run command.

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

If you are facing difficulties using the above method you can mount the whole output CID

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/outputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

We got an image like this as a result:

Running BIDS Apps on Bacalhau

Introduction

Prerequisite​

Downloading datasets​

Let's take a look at the structure of the data directory:

Uploading the datasets to IPFS​

When you pin your data, you'll get a CID which is in a format like this QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz. Copy the CID as it will be used to access your data

Running a Bacalhau Job​
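The full command is not reproduced in this extract; a minimal sketch reconstructed from the breakdown below (the dataset CID, image name and mriqc arguments are all taken from that list, and the mriqc entrypoint is an assumption) would be:

bacalhau docker run \
    -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data \
    nipreps/mriqc:latest \
    -- mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03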

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data: mount the CID of the dataset that is uploaded to IPFS and mount it to a folder called data on the container

  3. nipreps/mriqc:latest: the name and the tag of the docker image we are using

  4. ../data/ds005: path to input dataset

  5. ../outputs: path to the output

  6. participant --participant_label 01 02 03: run the mriqc on subjects with participant labels 01, 02, and 03

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​


The job description should be saved in .yaml format, e.g. bids.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Molecular Simulation with OpenMM and Bacalhau

Introduction

In this example tutorial, our focus will be on running OpenMM molecular simulation with Bacalhau.

Prerequisite​

Running Locally​

Downloading Datasets​

Writing the Script​

Running the Script​

This is only done to check whether your Python script is running. If there are no errors occurring, proceed further.

Uploading the Data to IPFS​

When you pin your data, you'll get a CID. Copy the CID as it will be used to access your data

Containerize Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

Build the container​

We will run docker build command to build the container:

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case, this will be:

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

Run a Bacalhau Job​

Now that we have the data in IPFS and the docker image pushed, we can run a job on the Bacalhau network.
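A minimal sketch of such a job, based on the breakdown below (the /inputs mount target is an assumption; the CID, image and script come from that list):

bacalhau docker run \
    -i ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4:/inputs \
    ghcr.io/bacalhau-project/examples/openmm:0.3 \
    -- python run_openmm_simulation.py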

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. --input ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4: here we mount the CID of the dataset we uploaded to IPFS to use in the job

  3. ghcr.io/bacalhau-project/examples/openmm:0.3: the name and the tag of the image we are using

  4. python run_openmm_simulation.py: the script that will be executed inside the container

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Gromacs for Analysis

Introduction​

GROMACS is a package for high-performance molecular dynamics and output analysis. Molecular dynamics is a computer simulation method for analyzing the physical movements of atoms and molecules.

In this example tutorial, our focus will be on running the GROMACS package with Bacalhau.

Prerequisites​

Downloading datasets​

Uploading the datasets to IPFS​

Copy the CID at the end of the output, which is QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9

Running Bacalhau Job​

Let's run a Bacalhau job that converts coordinate files to topology and FF-compliant coordinate files:

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input: here we mount the CID of the dataset we uploaded to IPFS to use on the job

  3. -f input/1AKI.pdb: input file

  4. -o outputs/1AKI_processed.gro: output file

  5. -water spc: the water model to use. In this case, we use SPC

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. gromacs.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Coresets On Bacalhau

Introduction

We construct a small coreset for arbitrary shapes of numerical data at a reasonable time cost. The implementation is mainly based on the coreset construction algorithm proposed by Braverman et al. (SODA 2021).

In this tutorial example, we will run the coreset construction on a compressed dataset with Bacalhau.

Prerequisite​

Running Locally​

Clone the repo which contains the code

Downloading the dataset​

To download the dataset, open OpenStreetMap, a public repository that aims to generate and distribute accessible geographic data for the whole world. It supplies detailed position information, including the longitude and latitude of places around the world.

Installing Dependencies​

The following command installs the Linux dependencies:

Ensure that the requirements.txt file contains the following dependencies:

The following command installs the Python dependencies:

Running the Script​

To run coreset locally, you need to convert from compressed pbf format to geojson format:

The following command runs the Python script to generate the coreset:

Containerize Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

We will use the python:3.8 image and run the same dependency installation commands that we used locally.

Build the container​

We will run docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is optional, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job​

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. --input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson: mount the monaco-latest.geojson file inside the container so it can be used by the script

  3. jsace/coreset: the name of the docker image we are using

  4. python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs: the script initializes cluster centers, creates a coreset using these centers, and saves the results to the specified folder.

Additional parameters:

-k: number of initialized centers (default=5)

-n: size of coreset (default=50)

-o: the output folder

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. coreset.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

To view the output as a CSV file, run:

Support​

Genomics Data Generation

Introduction​

In this tutorial example, we will run a genomics model on Bacalhau.

Prerequisite​

Running Locally​​

In our case this will be the following command:

Containerize Script using Docker​

To run Genomics on Bacalhau we need to set up a Docker container. To do this, you'll need to create a Dockerfile and add your desired configuration. The Dockerfile is a text document that contains the commands that specify how the image will be built.

We will use the kipoi/kipoi-veff2:py37 image and perform variant-centered effect prediction using the kipoi_veff2_predict tool.

Build the container​

The docker build command builds Docker images from a Dockerfile.

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is optional, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

Running a Bacalhau job​

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job for generating genomics data, run the following Bacalhau command:

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. jsacex/kipoi-veff2:py37: the name of the image we are using

  3. kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit": the command that will be executed inside the container. It performs variant-centered effect prediction using the kipoi_veff2_predict tool

  4. ./examples/input/test.vcf: the path to a Variant Call Format (VCF) file containing information about genetic variants

  5. ./examples/input/test.fa: the path to a FASTA file containing DNA sequences. FASTA files contain nucleotide sequences used for variant effect prediction

  6. ../outputs/output.tsv: the path to the output file where the prediction results will be stored. The output file format is Tab-Separated Values (TSV), and it will contain information about the predicted variant effects

  7. -m "DeepSEA/predict": specifies the model to be used for prediction

  8. -s "diff" -s "logit": indicates using two scoring functions for comparing prediction results. In this case, the "diff" and "logit" scoring functions are used. These scoring functions can be employed to analyze differences between predictions for the reference and alternative alleles.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. genomics.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Job Specification

A Job represents a discrete unit of work that can be scheduled and executed. It carries all the necessary information to define the nature of the work, how it should be executed, and the resources it requires.

job Parameters

  • Name (string : <optional>): A logical name to refer to the job. Defaults to job ID.

  • Namespace (string: "default"): The namespace in which the job is running. ClientID is used as a namespace in the public demo network.

  • Priority (int: 0): Determines the scheduling priority.

  • Count (int: <required>): Number of replicas to be scheduled. This is only applicable for jobs of type batch and service.

Server-Generated Parameters

The following parameters are generated by the server and should not be set directly.

  • ID (string): A unique identifier assigned to this job. It's auto-generated by the server and should not be set directly. Used for distinguishing between jobs with similar names.

  • Version (int): A monotonically increasing version number incremented on job specification update.

  • Revision (int): A monotonically increasing revision number incremented on each update to the job's state or specification.

  • CreateTime (int): Timestamp of job creation.

  • ModifyTime (int): Timestamp of last job modification.

Task Specification

A Task signifies a distinct unit of work within the broader context of a Job. It defines the specifics of how the task should be executed, where the results should be published, what environment variables are needed, among other configurations.

Task Parameters

  1. Name (string : <required>): A unique identifier representing the name of the task.

  2. Env (map[string]string : optional): A set of environment variables for the driver.

Job Types

The different job types available in Bacalhau

Bacalhau introduced different job types in v1.1, providing more control and flexibility over how jobs are orchestrated and scheduled, depending on their type.

Despite the differences in job types, all jobs benefit from core functionalities provided by Bacalhau, including:

  1. Node selection - the appropriate nodes are selected based on several criteria, including resource availability, priority and feedback from the nodes.

  2. Job monitoring - jobs are monitored to ensure they complete, and that they stay in a healthy state.

  3. Retries - within limits, Bacalhau will retry certain jobs a set number of times should they fail to complete successfully when requested.

Batch Jobs

Ideal for intermittent yet intensive data dives, for instance performing computation over large datasets before publishing the response. This approach eliminates the continuous processing overhead, focusing on specific, in-depth investigations and computation.

Ops Jobs

Ops jobs are similar to batch jobs, but have a broader reach: they are executed on all nodes that align with the job specification, but otherwise behave like batch jobs.

Ops jobs are perfect for urgent investigations, granting direct access to logs on host machines, where previously you may have had to wait for the logs to arrive at a central location before being able to query them. They can also be used for delivering configuration files for other systems should you wish to deploy an update to many machines at once.

Daemon Jobs

Daemon jobs run continuously on all nodes that meet the criteria given in the job specification. Should any new compute nodes join the cluster after the job was started, and should they meet the criteria, the job will be scheduled to run on that node too.

A good application of daemon jobs is to handle continuously generated data on every compute node. This might be from edge devices like sensors or cameras, or from logs where they are generated. The data can then be aggregated and compressed before being sent onwards. For logs, the aggregated data can be relayed at regular intervals to platforms like Kafka or Kinesis, or directly to other logging services, with edge devices potentially delivering results via MQTT.

Daemon Job Example

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed continuously on all suitable nodes

  3. Will be executed only on nodes with label = WebService

  4. Uses the docker engine

  5. Executes a query with manually specified parameters

  6. Has access to 2 local directories with logs

  7. Publishes the results to the IPFS, if any

  8. Has network access type Full in order to send data to the S3 storage

Service Jobs

Service jobs run continuously on a specified number of nodes that meet the criteria given in the job specification. Bacalhau's orchestrator selects the optimal nodes to run the job and continuously monitors their health and performance. If required, it will reschedule the job on other nodes.

This job type is good for long-running consumers such as streaming or queuing services, or real-time event listeners.

Service Job Example

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed continuously on the specified number of suitable nodes

  3. Will be executed only on nodes with architecture = arm64 and located in the us-west-2 region

  4. Uses the docker engine

  5. Executes a query with multiple parameters

  6. Has access to 2 local directories with logs

  7. Publishes the results to the IPFS, if any

  8. Has network access type Full in order to send data to the S3 storage

Limits and Timeouts

Note that in version v1.5.0 the configuration management approach was completely changed and certain limits were deprecated.

Resource Limits

These are the configuration keys that control the capacity of the Bacalhau node, and the limits for jobs that might be run.

Windows Support

Timeouts

Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled, and no results will be published.

By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.

Configuring Execution Time Limits

Job submitters can pass the --timeout flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.
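For example, to cap a job at one hour of execution time (the image and command here are arbitrary placeholders):

bacalhau docker run \
  --timeout 3600 \
  ubuntu echo "hello"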

The timeout can also be added to an existing job spec by adding the Timeout property to the Spec.

Node operators can use configuration keys to specify default and maximum job execution time limits. The supplied values should be a numeric value followed by a time unit (one of s for seconds, m for minutes or h for hours).

Here is a list of the relevant properties:

Note that timeouts cannot be configured for Daemon and Service jobs.

Introduction

TL;DR

Prerequisite

Bacalhau client, see more information .

Running whisper locally

Create the script

Containerize Script using Docker

See more information on how to containerize your script/app

Build the container

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

Push the container

Running a Bacalhau Job

We will transcribe the moon landing video, which can be found here: https://www.nasa.gov/multimedia/hd/apollo11_hdpage.html

Since the downloaded video is in MOV format, we convert it to MP4 format and then upload it to our public storage, in this case IPFS. To do this, a public IPFS network can be used, or you can create your own private IPFS network and use it to pin files.
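The conversion itself can be done with a standard ffmpeg invocation; the file names below are placeholders:

ffmpeg -i moon_landing.mov moon_landing.mp4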

Structure of the command

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

Checking the State of your Jobs

Job status

Job information

Job download

Viewing your Job Output

To get started, you need to install the Bacalhau client, see more information

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.Storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.

To submit a workload to Bacalhau, we will use the bacalhau docker run command. The command allows one to pass input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument. In this line you will list all of the mp4 files in the /inputs directory and execute ffmpeg against each instance.
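A sketch of what such a command can look like; the CID, image, and ffmpeg options below are placeholders rather than the exact ones used in this example:

bacalhau docker run \
  -i ipfs://<CID>:/inputs \
  linuxserver/ffmpeg \
  -- bash -c 'for f in /inputs/*.mp4; do ffmpeg -i "$f" -vf scale=-2:720 "/outputs/$(basename "$f")"; done'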

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

You may want to create your own container for this kind of task. In that case, use the instructions for creating and publishing your own image in the Docker Hub. Use huggingface/transformers-pytorch-deepspeed-nightly-gpu as the base image, install the dependencies listed above and copy the inference.py into it. So your Dockerfile will look like this:

To get started, you need to install the Bacalhau client, see more information

In this example tutorial, we use Bacalhau and EasyOCR to digitize paper records, recognize characters, and extract text data from images stored on IPFS, S3 or on the web. EasyOCR is a ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc. With EasyOCR, you can use the pre-trained models or your own fine-tuned model.

You can skip this step and go straight to running a Bacalhau job

We will use the Dockerfile that is already created in the Easy OCR repo. Use the command below to clone the repo

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

To get started, you need to install the Bacalhau client, see more information .

Since the model and the image aren't present in the container, we will mount the image from a URL and the model from IPFS. You can find models to download here. You can choose the model you want to use; in this case, we will be using the zh_sim_g2 model

The same job can be presented in the declarative format. In this case, the description will look like this:

Stable Diffusion is an open-source text-to-image model, which generates images from text. It's a cutting-edge alternative to DALL·E 2 and uses the Diffusion Probabilistic Model for image generation. At its core, the model generates graphics from text using a Transformer.

The text-to-image stable diffusion model was trained on a fleet of GPU machines, at great cost. To use this trained model for inference, you also need to run it on a GPU.

However, this isn't always desired or possible. One alternative is to use a project called OpenVINO from Intel that allows you to convert and optimize models from a variety of frameworks (and ONNX if your framework isn't directly supported) to run on an Intel CPU. This is what we will do in this example.

First we convert the trained stable diffusion models so that they work efficiently on a CPU with OpenVINO. Choose the fine-tuned version of Stable Diffusion you want to use. The example is quite complex, so we have created a separate repository to host the code. This is a fork of the original Github repository.

In summary, the code downloads a pre-optimized OpenVINO version of the pre-trained stable diffusion model. This model leverages OpenAI's CLIP transformer and is wrapped inside an OpenVINO runtime, which executes the model.

The core code representing these tasks can be found in the stable_diffusion_engine.py file. This is a mashup that creates a pipeline necessary to tokenize the text and run the stable diffusion model. This boilerplate could be simplified by leveraging the more recent version of the diffusers library. But let's continue.

First we will create a Dockerfile to containerize the inference code. The Dockerfile can be found in the repository, but is presented here to aid understanding.

To run this example you will need Docker installed and running

Bacalhau is a distributed computing platform that allows you to run jobs on a network of computers. It is designed to be easy to use and to run on a variety of hardware. In this example, we will use it to run the stable diffusion model on a CPU.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

This example tutorial demonstrates how to use Stable Diffusion on a GPU and run it on the demo network. Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

To get started, you need to install the Bacalhau client, see more information .

This stable diffusion example is based on the Keras/Tensorflow implementation. You might also be interested in the Pytorch-oriented diffusers library.

Based on the requirements, we will install the following:

We have sample code from the Stable Diffusion in TensorFlow/Keras repo which we will use to check if the code is working as expected. Our output for this code will be a DSLR photograph of an astronaut riding a horse.

You need a script to execute when we submit jobs. The code below is a slightly modified version of the code we ran above; however, it includes more things such as argument parsing to be able to customize the generator.

For a full list of arguments that you can pass to the script, see more information

Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver. To containerize the inference code, we will create a Dockerfile. The Dockerfile is a text document that contains the commands that specify how the image will be built.

The Dockerfile leverages the latest official TensorFlow GPU image and then installs other dependencies like git, CUDA packages, and other image-related necessities. See the original repository for the expected requirements.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The Bacalhau command passes a prompt to the model and generates an image in the outputs directory. The main difference in the example below compared to all the other examples is the addition of the --gpu X flag, which tells Bacalhau to only schedule the job on nodes that have X GPUs free. You can read more about GPU support in the documentation.
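A sketch of such a submission; the image name, script name and prompt below are placeholders for the container built in this guide:

bacalhau docker run \
  --gpu 1 \
  <hub-user>/<repo-name>:<tag> \
  -- python stable_diffusion.py --prompt "an astronaut riding a horse"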

The identification and localization of objects in images and videos is a computer vision task called object detection. Several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmond et al.

Bacalhau is a highly scalable decentralized computing platform and is well suited to running massive object detection jobs. In this example, you can take advantage of the GPUs available on the Bacalhau network and perform an end-to-end object detection inference, using the YOLOv5 Docker image developed by Ultralytics.

To get started, you need to install the Bacalhau client, see more information

Remember that by default Bacalhau does not provide any network connectivity when running a job. So you need to either provide all assets at job submission time, or use the --network=full or --network=http flags to access the data at task time. See the Internet Access page for more details

--input to select which pre-trained weights you desire, with details on the yolov5 release page

For more container flags refer to the yolov5/detect.py file in the YOLO repository.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The same job can be presented in the declarative format. In this case, the description will look like this:

Now let's use some custom images. First, you will need to ingest your images onto IPFS or S3 storage. For more information about how to do that see the data ingestion section.

This example will use the Cyclist Dataset for Object Detection | Kaggle dataset.

We have already uploaded this dataset to the IPFS storage under the CID: bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm. You can browse to this dataset via a HTTP IPFS proxy.

To check the state of the job and view the job output, refer to the guide above.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

The following guide is using the fine-tuned model, which was finetuned on Bacalhau. To learn how to finetune your own stable diffusion model refer to this guide.

Bacalhau client, see more information

This part of the guide is optional - you can skip it and proceed to running the Bacalhau job if you are not going to use your own custom image.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

To do inference on your own checkpoint on Bacalhau you need to first upload it to your public storage, which can be mounted anywhere on your machine. In this case, we will be using NFT.Storage (Recommended Option). To upload your dataset using NFTup, drag and drop your directory and it will upload it to IPFS.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

In this example tutorial, we will show you how to generate realistic images with StyleGAN3 and Bacalhau. StyleGAN is based on Generative Adversarial Networks (GANs), which include a generator and discriminator network that has been trained to differentiate images generated by the generator from real images. However, during the training, the generator tries to fool the discriminator, which results in the generation of realistic-looking images. With StyleGAN3 we can generate realistic-looking images or videos. It can generate not only human faces but also animals, cars, and landscapes.

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one (https://docs.docker.com/docker-id/), and use the username of the account you created

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

To train our model locally, we will start by cloning the Pytorch examples repo:

Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like Pinata or NFT.Storage. Once registered you can use their UI or API or SDKs to upload files.

Once you have uploaded your data, finish by copying the CID. Here is the dataset we have uploaded.

The -i https://raw.githubusercontent.com/py..........: flag is used to mount our training script. We will use the URL to this Pytorch example

The same job can be presented in the declarative format. In this case, the description will look like this:

TensorFlow is an open-source machine learning software library used to train neural networks. Expressed in the form of stateful dataflow graphs, each node in the graph represents the operations performed by neural networks on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training an MNIST model.

This section is from the TensorFlow 2 quickstart for beginners

This short introduction uses Keras to:

For each example, the model returns a vector of logits or log-odds scores, one for each class.

Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.
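A minimal sketch of that call, following the quickstart this section is based on (model and loss_fn are defined earlier in that tutorial):

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])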

The Model.evaluate method checks the model's performance, usually on a "Validation-set" or "Test-set".

The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Introduction

Although the dreambooth paper used Imagen to finetune the pre-trained model, since both the Imagen model and the Dreambooth code are closed source, several open-source projects have emerged using stable diffusion.

TL;DR

Inference

Prerequisites

To get started, you need to install the Bacalhau client, see more information

Setting up Docker Container

You can skip this section entirely and go directly to running a Bacalhau job

Downloading the models

Build the Docker container

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create a Docker account, and use the username of the account you create.

Create the Subject Dataset

You can view the Subject Image dataset of David Aronchick for reference.

Uploading the Subject Images to IPFS

In this case, we will be using NFT.Storage (Recommended Option) to upload files and directories with NFTUp.

Approaches to run a Bacalhau Job on a Finetuned Model

Case 1: If the subject is of class male

Structure of the command

Case 2 : If the subject is of class female

Case 3: If the subject is of class mix

Case 4: If you want a different tokenizer, model, and a different shell script with custom parameters

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this. Change the command in the Parameters section and the CID to suit your goals.

Checking the State of your Jobs

Job status

Job information

Job download

Viewing your Job Output

Inference on the Fine-Tuned Model

Refer to our guide on the CKPT model for more details on how to build an SD inference container

If you use the Brave browser, you can use the following:

Run the Bacalhau Job on the Fine-Tuned Model

To check the status of your job and download results refer back to the instructions above.

In this example tutorial, we will look at how to run BIDS App on Bacalhau. BIDS (Brain Imaging Data Structure) is an emerging standard for organizing and describing neuroimaging datasets. A BIDS App is a container image capturing a neuroimaging pipeline that takes a BIDS-formatted dataset as input. Each BIDS App has the same core set of command line arguments, making them easy to run and integrate into automated platforms. BIDS Apps are constructed in a way that does not depend on any software outside of the image other than the container engine.

To get started, you need to install the Bacalhau client, see more information

For this tutorial, download the file ds005.tar from this BIDS dataset and untar it in a directory:

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like NFT.Storage or Pinata. Once registered, you can use their UI or API or SDKs to upload files.

Alternatively, you can upload your dataset to IPFS using the IPFS CLI, but the recommended approach is to use a pinning service as we have mentioned above.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

In this tutorial example, we will showcase how to containerize an OpenMM workload so that it can be executed on the Bacalhau network and take advantage of the distributed storage & compute resources. OpenMM is a toolkit for molecular simulation. It is a physics-based library that is useful for refining the structure and exploring functional interactions with other molecules. It provides a combination of extreme flexibility (through custom forces and integrators), openness, and high performance (especially on recent GPUs) that make it truly unique among simulation codes.

To get started, you need to install the Bacalhau client, see more information

We use a processed 2DRI dataset that represents the ribose binding protein in bacterial transport and chemotaxis. The source organism is the Escherichia coli bacteria.

Protein data can be stored in a .pdb file; this is a human-readable format. It provides for the description and annotation of protein and nucleic acid structures, including atomic coordinates, secondary structure assignments, as well as atomic connectivity. See more information about the PDB format here. For the original, unprocessed 2DRI dataset, you can download it from the RCSB Protein Data Bank.

The relevant code of the processed 2DRI dataset can be found here. Let's print the first 10 lines of the 2dri-processed.pdb file. The output contains a number of ATOM records. These describe the coordinates of the atoms that are part of the protein.

To run the script above, all we need is a Python environment with the OpenMM library installed. This script makes sure that there are no empty cells and filters out potential error sources from the file.

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like NFT.Storage or Pinata. Once registered, you can use their UI or API or SDKs to upload files.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

In this example, we will make use of the gmx pdb2gmx program to add hydrogens to the molecules and generate coordinates in Gromacs (Gromos) format and topology in Gromacs format.

To get started, you need to install the Bacalhau client, see more information

Datasets can be found here. In this example, we use the 1AKI dataset. After downloading, place it in a folder called “input”

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.Storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.

Alternatively, you can upload your dataset to IPFS using the IPFS CLI:

gromacs/gromacs: we use the official Gromacs Docker image

gmx pdb2gmx: command in GROMACS that performs the conversion of molecular structural data from the Protein Data Bank (PDB) format to the GROMACS format, which is used for conducting Molecular Dynamics (MD) simulations and analyzing the results. Additional parameters could be found here

For a similar tutorial that you can try yourself, check out

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Coreset is a data subsetting method. Since compressed datasets can get very large when uncompressed, it becomes much harder to train on them, as training time increases with the dataset size. To reduce training time and cut costs, we employ the coreset method; the coreset method can also be applied to other datasets. In this case, we use the coreset method, which can lead to fast speeds in solving the k-means problem on big data while maintaining high accuracy.

For a deeper understanding of the core concepts, it's recommended to explore the coreset construction literature, for example Braverman et al. (SODA 2021).

To get started, you need to install the Bacalhau client, see more information

The dataset is an osm.pbf file (a compressed format for .osm files); the file can be downloaded from Geofabrik.

coreset.py contains the following script

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. We've already converted the monaco-latest.osm.pbf file from the compressed pbf format to the geojson format. To submit a job, run the following Bacalhau command:

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Kipoi (pronounce: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for genomics. It currently contains 2201 different models, covering canonical predictive tasks in transcriptional and post-transcriptional gene regulation. Kipoi's API is implemented as a Python package, and it is also accessible from the command line.

To get started, you need to install the Bacalhau client, see more information

To run locally you need to install kipoi-veff2. You can find information about installation and usage here

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

In this example, a model from github.com is downloaded during the job execution. In order to do this, use the --network=full flag when describing the job, and --job-selection-accept-networked when starting the compute node on which the job will be executed.

Note that in the demo network, nodes do not accept jobs that require full network access. Consider creating your own private network.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Type (string: <required>): The type of the job, such as batch, ops, daemon or service. You can learn more about the supported job types in the Job Types guide.

Meta ( : nil): Arbitrary metadata associated with the job.

Labels ([] : nil): Arbitrary labels associated with the job for filtering purposes.

Constraints ([] : nil): These are selectors which must be true for a compute node to run this job.

Tasks ([] : <required>): Tasks associated with the job, which define a unit of work within the job. Today we only support a single task per job, with future plans to extend this.

State (): Represents the current state of the job.

Engine ( : required): Configures the execution engine for the task, such as Docker or WASM.

Publisher ( : optional): Specifies where the results of the task should be published, such as the S3 and IPFS publishers. Only applicable for tasks of type batch and ops.

Meta ( : optional): Allows association of arbitrary metadata with this task.

InputSources ([] : optional): Lists remote artifacts that should be downloaded before task execution and mounted within the task, such as from S3 or IPFS.

ResultPaths ([] : optional): Indicates volumes within the task that should be included in the published result. Only applicable for tasks of type batch and ops.

Resources ( : optional): Details the resources that this task requires.

Network ( : optional): Configurations related to the networking aspects of the task.

Timeouts ( : optional): Configurations concerning any timeouts associated with the task.

Batch jobs are executed on demand, running on a specified number of Bacalhau nodes. These jobs either run until completion or until they reach a timeout. They are designed to carry out a single, discrete task before finishing. This is the only job type.

Batch Job Example

This example shows a sample Batch job description with all available parameters.

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed on 2 nodes

  3. Will be executed only on nodes with Linux OS

  4. Uses the docker engine

  5. Executes a python script with multiple arguments

  6. Preloads and mounts IPFS data as a local directory

  7. Publishes the results to the IPFS

  8. Has network access type HTTP and 2 allowed domains
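Since the full example is not reproduced here, the sketch below illustrates what such a description can look like. The Priority, Constraints and Network field names follow the job and task parameters described in this guide, while the image, script, CID, label key and domains are placeholders:

name: Batch Job Example
type: batch
count: 2
priority: 100
constraints:
  - key: "Operating-System"
    operator: "="
    values: ["linux"]
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.11-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python analyze.py --input /data --mode full
    InputSources:
      - Target: "/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "<dataset CID>"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    Network:
      Type: HTTP
      Domains:
        - example.com
        - data.example.com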

Ops Job Example

This example shows a sample Ops job description with all available parameters.

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed on all suitable nodes

  3. Will be executed only on nodes with label = WebService

  4. Uses the docker engine

  5. Executes a query with manually specified parameters

  6. Has access to a local directory

  7. Publishes the results to the IPFS, if any

  8. Has network access type HTTP and 2 allowed domains

This example shows a sample Daemon job description with all available parameters.

This example shows a sample Service job description with all available parameters.

Check out the guide to learn about all the changes in configuration management: CLI command syntax and configuration file management.

Configuration key
Description

It is also possible to specify the default amount of resources to be allocated to each job, if the required amount is not specified in the job itself. The JobDefaults.<Job Type>.Task.Resources.<Resource Type> configuration keys are used for this purpose. E.g. to provide each job with 2Gb of RAM, the following key is used: JobDefaults.Ops.Task.Resources.Memory:
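In a node's configuration file, that key path corresponds to the following structure (a sketch derived from the key above):

JobDefaults:
  Ops:
    Task:
      Resources:
        Memory: "2Gb"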

See the complete list of configuration keys for more details.

Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.

Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like resource limits) are not present in Windows-based nodes.

Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and configured to support Linux containers, e.g. using the WSL-based backend.

Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that result in their jobs waiting indefinitely.

mkdir data
tar -xf ds005.tar -C data 
data
└── ds005
    ├── CHANGES
    ├── dataset_description.json
    ├── participants.tsv
    ├── README
    ├── sub-01
    │   ├── anat
    │   │   ├── sub-01_inplaneT2.nii.gz
    │   │   └── sub-01_T1w.nii.gz
    │   └── func
    │       ├── sub-01_task-mixedgamblestask_run-01_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-01_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-02_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-02_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-03_bold.nii.gz
    │       └── sub-01_task-mixedgamblestask_run-03_events.tsv
    ├── sub-02
    │   ├── anat
    │   │   ├── sub-02_inplaneT2.nii.gz
    │   │   └── sub-02_T1w.nii.gz
    ...
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data \
    nipreps/mriqc:latest \
    -- mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03)
name: Running BIDS
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: nipreps/mriqc:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz"
bacalhau job run bids.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
ls results/ # list the contents of the results directory
cat results/stdout # display the job's standard output
head ./dataset/2dri-processed.pdb

Expected Output
    REMARK   1 CREATED WITH OPENMM 7.6, 2022-07-12
    CRYST1   81.309   81.309   81.309  90.00  90.00  90.00 P 1           1 
    ATOM      1  N   LYS A   1      64.731   9.461  59.430  1.00  0.00           N  
    ATOM      2  CA  LYS A   1      63.588  10.286  58.927  1.00  0.00           C  
    ATOM      3  HA  LYS A   1      62.707   9.486  59.038  1.00  0.00           H  
    ATOM      4  C   LYS A   1      63.790  10.671  57.468  1.00  0.00           C  
    ATOM      5  O   LYS A   1      64.887  11.089  57.078  1.00  0.00           O  
    ATOM      6  CB  LYS A   1      63.458  11.567  59.749  1.00  0.00           C  
    ATOM      7  HB2 LYS A   1      63.333  12.366  58.879  1.00  0.00           H  
    ATOM      8  HB3 LYS A   1      64.435  11.867  60.372  1.00  0.00           H  
# Import the packages
import os
from openmm import *
from openmm.app import *
from openmm.unit import *

# Specify the input files
input_path = 'inputs/2dri-processed.pdb'
if not os.path.exists(input_path):
    raise FileNotFoundError(f"Input file not found: {input_path}")

# Function to check and filter PDB file lines
def filter_valid_pdb_lines(input_path, output_path):
    with open(input_path, 'r') as infile, open(output_path, 'w') as outfile:
        lines = infile.readlines()
        for i, line in enumerate(lines):
            if line.startswith("ATOM") or line.startswith("HETATM"):
                if len(line) >= 54:
                    try:
                        float(line[30:38].strip())
                        float(line[38:46].strip())
                        float(line[46:54].strip())
                        outfile.write(line)
                    except ValueError:
                        print(f"Skipping line {i + 1} because it has invalid coordinates: {line.strip()}")
                else:
                    print(f"Skipping line {i + 1} because it is too short: {line.strip()}")
            else:
                outfile.write(line)

# Filter PDB file
filtered_pdb_path = 'inputs/filtered_2dri-processed.pdb'
filter_valid_pdb_lines(input_path, filtered_pdb_path)

# Load the filtered PDB file
try:
    pdb = PDBFile(filtered_pdb_path)
except ValueError as e:
    print(f"ValueError while reading filtered PDB file: {e}")
    raise

forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Output
output_path = 'outputs/final_state.pdbx'
if not os.path.exists(os.path.dirname(output_path)):
    os.makedirs(os.path.dirname(output_path))

# System Configuration
nonbondedMethod = PME
nonbondedCutoff = 1.0 * nanometers
ewaldErrorTolerance = 0.0005
constraints = HBonds
rigidWater = True
constraintTolerance = 0.000001
hydrogenMass = 1.5 * amu

# Integration Options
dt = 0.002 * picoseconds
temperature = 310 * kelvin
friction = 1.0 / picosecond
pressure = 1.0 * atmospheres
barostatInterval = 25

# Simulation Options
steps = 10
equilibrationSteps = 0
# platform = Platform.getPlatformByName('CUDA')
platform = Platform.getPlatformByName('CPU')
# platformProperties = {'Precision': 'single'}
platformProperties = {}
dcdReporter = DCDReporter('trajectory.dcd', 1000)
dataReporter = StateDataReporter('log.txt', 1000, totalSteps=steps,
                                 step=True, time=True, speed=True, progress=True, elapsedTime=True, remainingTime=True,
                                 potentialEnergy=True, kineticEnergy=True, totalEnergy=True, temperature=True,
                                 volume=True, density=True, separator='\t')
checkpointReporter = CheckpointReporter('checkpoint.chk', 1000)

# Prepare the Simulation
print('Building system...')
topology = pdb.topology
positions = pdb.positions
system = forcefield.createSystem(topology, nonbondedMethod=nonbondedMethod, nonbondedCutoff=nonbondedCutoff,
                                 constraints=constraints, rigidWater=rigidWater, ewaldErrorTolerance=ewaldErrorTolerance,
                                 hydrogenMass=hydrogenMass)
system.addForce(MonteCarloBarostat(pressure, temperature, barostatInterval))
integrator = LangevinMiddleIntegrator(temperature, friction, dt)
integrator.setConstraintTolerance(constraintTolerance)
simulation = Simulation(topology, system, integrator, platform, platformProperties)
simulation.context.setPositions(positions)

# Minimize and Equilibrate
print('Performing energy minimization...')
simulation.minimizeEnergy()
print('Equilibrating...')
simulation.context.setVelocitiesToTemperature(temperature)
simulation.step(equilibrationSteps)

# Simulate
print('Simulating...')
simulation.reporters.append(dcdReporter)
simulation.reporters.append(dataReporter)
simulation.reporters.append(checkpointReporter)
simulation.currentStep = 0
simulation.step(steps)

# Write a file with the final simulation state
state = simulation.context.getState(getPositions=True, enforcePeriodicBox=system.usesPeriodicBoundaryConditions())
with open(output_path, mode="w+") as file:
    PDBxFile.writeFile(simulation.topology, state.getPositions(), file)
print('Simulation complete, file written to disk at: {}'.format(output_path))
python run_openmm_simulation.py
FROM conda/miniconda3

RUN conda install -y -c conda-forge openmm

WORKDIR /project

COPY ./run_openmm_simulation.py /project

LABEL org.opencontainers.image.source https://github.com/bacalhau-project/examples

CMD ["python","run_openmm_simulation.py"]
docker build -t <hub-user>/<repo-name>:<tag> .
docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/openmm:0.3 .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --input ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4 \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/openmm:0.3 \
    -- python run_openmm_simulation.py)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
cat results/outputs/final_state.pdbx
input
└── 1AKI.pdb
ipfs add -r input/

added QmTCCqPzX3qSJHuMeSma9uCqUnriZ5eJX7MnxebxydL89f input/1AKI.pdb
added QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9 input
 113.59 KiB / 113.59 KiB [=================================] 100.00%
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input \
    gromacs/gromacs \
    -- /bin/bash -c 'echo 15 | gmx pdb2gmx -f input/1AKI.pdb -o outputs/1AKI_processed.gro -water spc')
name: Gromacs
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: gromacs/gromacs
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - echo 15 | gmx pdb2gmx -f input/1AKI.pdb -o outputs/1AKI_processed.gro -water spc
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs      
    InputSources:
      - Target: "/input"
        Source:
          Type: "s3"
          Params:
            Bucket: "bacalhau-gromacs"
            Key: "*"
            Region: "us-east-1"
bacalhau job run gromacs.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/outputs/1AKI_processed.gro  
git clone https://github.com/js-ts/Coreset
wget https://download.geofabrik.de/europe/monaco-latest.osm.pbf
sudo apt-get -y update
sudo apt-get -y install osmium-tool
sudo apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev
# requirements.txt
certifi==2020.12.5
chardet==4.0.0
cycler==0.10.0
idna==2.10
joblib
kiwisolver==1.3.1
lxml==4.6.2
matplotlib==3.3.3
numpy==1.19.4
overpy==0.4
pandas==1.1.4
Pillow==8.0.1
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.4
requests==2.25.1
scikit-learn
scipy
six==1.15.0
threadpoolctl
tqdm==4.56.0
urllib3==1.26.2
geopandas
pip3 install -r Coreset/requirements.txt
osmium export monaco-latest.osm.pbf -o monaco-latest.geojson
python Coreset/python/coreset.py -f monaco-latest.geojson
FROM python:3.8

RUN apt-get -y update && apt-get -y install osmium-tool && apt update && apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev

ADD Coreset Coreset

ADD monaco-latest.geojson .

RUN cd Coreset && pip3 install -r requirements.txt
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsace/coreset .
docker push <hub-user>/<repo-name>:<tag>
docker push jsace/coreset
bacalhau docker run \
    --input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson \
    jsace/coreset \
    -- /bin/bash -c 'python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs'
name: Coresets On Bacalhau
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsace/coreset" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "osmium export input/liechtenstein-latest.osm.pbf -o /liechtenstein-latest.geojson;python Coreset/python/coreset.py -f /liechtenstein-latest.geojson -o /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs      
    InputSources:
      - Source:
          Type: "s3"
          Params:
            Bucket: "coreset"
            Key: "*"
            Region: "us-east-1"
        Target: "/input"    
bacalhau job run coreset.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
ls results/outputs

centers.csv                       coreset-weights-monaco-latest.csv
coreset-values-monaco-latest.csv  ids.csv
cat results/outputs/centers.csv | head -n 10

lat,long
7.423843975787508,43.730621154072196
7.4252607,43.7399135
7.411026970571964,43.72937671121925
7.459404485446199,43.62065587026715
7.429551373022234,43.74042043301333
cat results/outputs/coreset-values-monaco-latest.csv | head -n 10

7.418849799999999384e+00,4.372759140000000144e+01
7.416779063194204547e+00,4.373053835217195484e+01
7.422073648233502574e+00,4.374059957604499971e+01
7.434173206469590234e+00,4.374591689556921636e+01
7.417540100000000081e+00,4.372501400000000160e+01
7.427359010538406636e+00,4.374324133692341121e+01
7.427839200000001085e+00,4.374025220000000758e+01
7.418834173612560257e+00,4.372760402368248833e+01
7.416381731248183229e+00,4.373708812663696932e+01
7.412050699999999992e+00,4.372842109999999849e+01
cat results/outputs/coreset-weights-monaco-latest.csv | head -n 10

7.704359156916230233e+01
2.090893934427382987e+02
1.560611140982714744e+02
2.516557569411126281e+02
7.714605094768158722e+01
2.640808776415075840e+02
2.326085291610944523e+02
7.704841021255269595e+01
2.089705263763523249e+02
1.728105655128551632e+02
kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
FROM kipoi/kipoi-veff2:py37

RUN kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsacex/kipoi-veff2:py37 .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --memory 20Gb \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --publisher ipfs \
    --network full \
    jsacex/kipoi-veff2:py37 \
    -- kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit")
name: Genomics
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/kipoi-veff2:py37
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
    Publisher:
      Type: ipfs
    Network:
      Type: full
    ResultPaths:
      - Name: outputs
        Path: /outputs
    Resources:
      Memory: 20gb
bacalhau job run genomics.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/outputs/output.tsv | head -n 10  
Type: batch
Count: 1
Priority: 50
Meta:
  version: "1.2.5"
Labels:
  project: "my-project"
Constraints:
  - Key: Architecture
    Operator: '='
    Values:
      - arm64
  - Key: region
    Operator: '='
    Values:
      - us-west-2
Tasks:
  #...
# This example shows a sample daemon job file. 
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/tutorial-save-25-m-yearly-by-managing is used

# Name of the job. Optional. Default value - job ID
Name: Logstash


# Type of the job
Type: daemon


# The namespace in which the job is running. Default value - “default”
Namespace: logging


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the daemon job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: daemon job
  Daemon job feature: To be executed continuously on all suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: service
    Operator: ==
    Values:
      - WebService


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: main


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: expanso/nginx-access-log-agent:1.0.0


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - --query
          - {{.query}}
          - --start-time
          - {{or (index . "start-time") ""}}
          - --end-time
          - {{or (index . "end-time") ""}}


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
          - OPENSEARCH_ENDPOINT={{.OpenSearchEndpoint}}
          - S3_BUCKET={{.AccessLogBucket}}
          - AWS_REGION={{.AWSRegion}}
          - AGGREGATE_DURATION=10
          - S3_TIME_FILE=60


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
      - Target: /app/state
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/state
            ReadWrite: true



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Type: Full


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301
# This example shows a sample service job file.
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/introducing-new-job-types-new-horizons is used

# Name of the job. Optional. Default value - job ID
Name: Kinesis Consumer


# Type of the job
Type: service


# The namespace in which the job is running. Default value - “default”
Namespace: service


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the service job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: service job
  Service job feature: To be executed continuously on a certain number of suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: Architecture
    Operator: '='
    Values:
      - arm64
  - Key: region
    Operator: '='
    Values:
      - us-west-2


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: main


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: my-kinesis-consumer:latest


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - -stream-arn
          - arn:aws:kinesis:us-west-2:123456789012:stream/my-kinesis-stream
          - -shard-iterator
          - TRIM_HORIZON


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
          - DEFAULT_USER_NAME=root
          - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
      - Target: /app/state
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/state
            ReadWrite: true



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 4Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Type: Full


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301

Compute.AllocatedCapacity.CPU

Specifies the amount of CPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 500m).

Compute.AllocatedCapacity.Disk

Specifies the amount of Disk space a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 10Gi)

Compute.AllocatedCapacity.GPU

Specifies the amount of GPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1).

Note: When using percentages, the result is always rounded up to the nearest whole GPU

Compute.AllocatedCapacity.Memory

Specifies the amount of Memory a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1Gi)
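
For example, these allocation limits can be adjusted with bacalhau config set. The values below are purely illustrative and should be tuned to the capacity of your node:

bacalhau config set Compute.AllocatedCapacity.CPU=80%
bacalhau config set Compute.AllocatedCapacity.Memory=85%
bacalhau config set Compute.AllocatedCapacity.Disk=10Gi
bacalhau config set Compute.AllocatedCapacity.GPU=100%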

bacalhau config set JobDefaults.Ops.Task.Resources.Memory=2Gi

JobDefaults.Batch.Task.Timeouts.ExecutionTimeout

Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined

JobDefaults.Ops.Task.Timeouts.ExecutionTimeout

Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined

JobDefaults.Batch.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on

JobDefaults.Ops.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on
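
As a sketch, the same bacalhau config set mechanism applies to these job defaults. The duration values below are illustrative assumptions, not recommendations:

bacalhau config set JobDefaults.Batch.Task.Timeouts.ExecutionTimeout=30m
bacalhau config set JobDefaults.Ops.Task.Timeouts.ExecutionTimeout=30m
bacalhau config set JobDefaults.Batch.Task.Timeouts.TotalTimeout=2h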

Local Publisher Specification

Bacalhau's Local Publisher provides a useful option for storing task results on the compute node, allowing for easy access and retrieval when testing or trying out Bacalhau.

The Local Publisher should not be used in production as it is not a reliable storage option. For production use, we recommend a more reliable option such as an S3-compatible storage service.

Local Publisher Parameters

The local publisher requires no specific parameters to be defined in the publisher specification. The user only needs to indicate the publisher type as "local", and Bacalhau handles the rest. Here is an example of how to set up a Local Publisher in a job specification.

Publisher:
  Type: local

Published Result Specification

Once the job is executed, the results are published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. This will download and extract the contents for the user from the remote compute node.
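
For example, results stored by the Local publisher are retrieved the same way as any other published output:

bacalhau job get <job_id> --output-dir results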

Result Parameters

URL (string): This is the HTTP URL to the results of the computation, which is hosted on the compute node where it ran. Here's a sample of how the published result might appear:

PublishedResult:
  Type: local
  Params:
    URL: "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"

In this example, the task results will be stored on the compute node, and can be referenced and retrieved using the specified URL.
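
Since the published result is simply a tarball served over HTTP, it can also be fetched and unpacked manually. This is a sketch using the sample URL above; in practice bacalhau job get is usually the more convenient route:

curl -o result.tgz "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"
tar -xzf result.tgz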

Caveats

  1. By default the compute node will attempt to use a public address for the HTTP server delivering task output, but there is no guarantee that the compute node is accessible on that address. If the compute node is behind a NAT or firewall, the user may need to manually specify the address to use for the HTTP server in the config.yaml file.

  2. There is no lifecycle management for the content stored on the compute node. The user is responsible for managing the content and ensuring that it is removed when no longer needed before the compute node runs out of disk space.

  3. If the address/port of the compute node changes, then previously stored content will no longer be accessible. The user will need to manually update the address in the config.yaml file and re-publish the content to make it accessible again.
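
If you need to pin the address and port the local publisher serves results on, a hedged sketch is shown below. The key names are assumptions - verify them against the configuration keys list for your Bacalhau version:

bacalhau config set Publishers.Types.Local.Address=192.168.0.11
bacalhau config set Publishers.Types.Local.Port=6001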

Local Source Specification

The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.

Source Specification Parameters

Here are the parameters that you can define for a Local input source:

  • SourcePath (string: <required>): The absolute path on the compute node where the local directory or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.

  • ReadWrite (bool: false): A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified local directory or file. If set to false, Bacalhau will have read-only access.

Allow-listing Local Paths

For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the Compute.AllowListedLocalPaths configuration key followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:

  • :rw - Read-Write access.

  • :ro - Read-Only access (default if no suffix is provided).

Check the default settings on your server, as this may be set to :ro, which can lead to an error when write access is required.

For instance:

bacalhau config set Compute.AllowListedLocalPaths=/etc/config:rw,/etc/*.conf:ro

Example

Below is an example of how to define a Local input source in YAML format.

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

In this example, Bacalhau is configured to access the local directory "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Adjusting the ReadWrite flag to false would enable read-only access, preventing modifications to the local data from within the Bacalhau task.

Example (Imperative/CLI)

When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:

  1. Mount readonly file to /config:

    bacalhau docker run -i file:///etc/config:/config ubuntu ...
  2. Mount a writable directory to /myCheckpoints:

    bacalhau docker run -i file:///var/checkpoints:/myCheckpoints,opt=rw=true ubuntu ...

Docker Engine Specification

Docker Engine is one of the execution engines supported in Bacalhau. It allows users to run tasks inside Docker containers, offering an isolated and consistent environment for execution. Below are the parameters to configure the Docker Engine.

Docker Engine Parameters

  • Image (string: <required>): Specifies the Docker image to use for task execution. It should be an image that can be pulled by Docker.

  • Entrypoint (string[]: <optional>): Allows overriding the default entrypoint set in the Docker image. Each string in the array represents a segment of the entrypoint command.

  • Parameters (string[]: <optional>): Additional command-line arguments to be included in the container’s startup command, appended after the entrypoint.

  • EnvironmentVariables (string[]: <optional>): Sets environment variables within the Docker container during task execution. Each string should be formatted as KEY=value.

  • WorkingDirectory (string: <optional>): Sets the path inside the container where the task executes. If not specified, it defaults to the working directory defined in the Docker image.

Example

Here’s an example of configuring the Docker Engine within a job or task using YAML:

Engine:
  Type: "Docker"
  Params:
    Image: "ubuntu:20.04"
    Entrypoint:
      - "/bin/bash"
      - "-c"
    Parameters:
      - "echo Hello, World!"
    EnvironmentVariables:
      - "MY_ENV_VAR=myvalue"
    WorkingDirectory: "/app"

In this example, the task will be executed inside an Ubuntu 20.04 Docker container. The entrypoint is overridden to execute a bash shell that runs an echo command. An environment variable MY_ENV_VAR is set with the value myvalue, and the working directory inside the container is set to /app.

# This example shows a sample job file. 
# Parameters marked as Optional can be skipped - the default values will be used


# Name of the job. Optional. Default value - job ID
Name: Batch Job Example


# Type of the job
Type: batch


# The namespace in which the job is running. Default value - “default”
Namespace: default


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Count - number of replicas to be scheduled. 
# This is only applicable for jobs of type batch and service.
Count: 2


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the batch job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Some option: Some text
  Some other option: Some other text


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
- Key: "Operating-System"
  Operator: "="
  Values: ["linux"]


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: Important Calculations


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: alek5eyk/batchjobexample:1.1


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - python
          - supercalc.py
          - "5"
          - /outputs/result.txt


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
        - DEFAULT_USER_NAME=root
        - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task
    # Optional
    InputSources:
      - Target: /data
        Source:
          Type: ipfs
          Params:
            CID: "QmSYE8dVx6RTdDFFhBu51JjFG1fwwPdUJoXZ4ZNXvfoK2V"



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Domains:
      - example.com
      - ghcr.io
      Type: HTTP


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301
# This example shows a sample ops job file. 
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/real-time-log-analysis-with-bacalhau is used


# Name of the job. Optional. Default value - job ID
Name: Live logs processing


# Type of the job
Type: ops


# The namespace in which the job is running. Default value - “default”
Namespace: logging


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the ops job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: ops job
  Ops job feature: To be executed on all suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: service
    Operator: ==
    Values:
      - WebService


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: LiveLogProcessing


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: expanso/nginx-access-log-processor:1.0.0


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - --query
          - {{.query}}
          - --start-time
          - {{or (index . "start-time") ""}}
          - --end-time
          - {{or (index . "end-time") ""}}


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
        - DEFAULT_USER_NAME=root
        - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Domains:
      - example.com
      - ghcr.io
      Type: HTTP


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301

WebAssembly (WASM) Engine Specification

The WASM Engine in Bacalhau allows tasks to be executed in a WebAssembly environment, offering compatibility and speed. This engine supports WASM and WASI (WebAssembly System Interface) jobs, making it highly adaptable for various use cases. Below are the parameters for configuring the WASM Engine.

WASM Engine Parameters

  • Entrypoint (string: <optional>): The name of the function within the EntryModule to execute. For WASI jobs, this should typically be _start. The entrypoint function should have zero parameters and zero results.

  • Parameters (string[]: <optional>): An array of strings containing arguments that will be supplied to the program as ARGV. This allows parameterized execution of the WASM task.

  • EnvironmentVariables (map[string]string: <optional>): A mapping of environment variable keys to their values, made available within the executing WASM environment.

Example

Here’s a sample configuration of the WASM Engine within a task, expressed in YAML:

Engine:
  Type: "WASM"
  Params:
    EntryModule:
      Source:
        Type: "s3"
        Params:
          Bucket: "my-bucket"
          Key: "entry.wasm"
    Entrypoint: "_start"
    Parameters:
      - "--option"
      - "value"
    EnvironmentVariables:
      VAR1: "value1"
      VAR2: "value2"
    ImportModules:
      - Source:
          Type: "localDirectory"
          Params:
            Path: "/local/path/to/module.wasm"

In this example, the task is configured to run in a WASM environment. The EntryModule is fetched from an S3 bucket, the entrypoint is _start, and parameters and environment variables are passed into the WASM environment. Additionally, an ImportModule is loaded from a local directory, making its exports available to the EntryModule.

S3 Publisher Specification

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.

Publisher Parameters

  1. Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.

  2. Key (string: <required>): The object key within the specified bucket where the task results will be stored.

  3. Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).

  4. Region (string: <optional>): The region where the S3 bucket is located.

Published Result Spec

  1. Bucket: Confirms the name of the bucket containing the stored results.

  2. Key: Identifies the unique object key within the specified bucket.

  3. Region: Notes the AWS region of the bucket.

  4. Endpoint: Records the endpoint URL for S3-compatible storage services.

  5. VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.

  6. ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.

Dynamic Naming

With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.

Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:

  1. {executionID}: Replaced with the specific execution ID.

  2. {jobID}: Replaced with the ID of the job.

  3. {nodeID}: Replaced with the ID of the node where the execution took place

  4. {date}: Replaced with the current date in the format YYYYMMDD.

  5. {time}: Replaced with the current time in the format HHMMSS.

Additionally, if you are publishing an archive and the object key does not end with .tar.gz, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /, a trailing slash will be added.

Example

Imagine you've specified the following object key pattern for publishing:

results/{jobID}/{date}/{time}/

Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:

results/abc123/20230926/140530/

This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.

Examples

Declarative Examples

Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint parameter with the appropriate URL.

The results will be compressed into a single object, and the published result specification will look like:

PublishedResult:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Region: "us-west-2"
    ChecksumSHA256: "0x9a3a..."
    VersionID: "3/L4kqtJlcpXroDTDmJ+rmDbwQaHWyOb..."

Imperative Examples

The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:

  1. Basic Docker job writing to S3 with default configurations:

    bacalhau docker run -p s3://bucket/key ubuntu ...

    This command writes to the S3 bucket using default endpoint and region settings.

  2. Docker job writing to S3 with a specific endpoint and region:

    bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

    This command specifies a unique endpoint and region for the S3 bucket.

  3. Using naming placeholders:

    bacalhau docker run -p s3://bucket/result-{date}-{jobID} ubuntu ...

    Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.

Remember to replace the placeholders like bucket, key, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.

Credential Requirements

To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:

  1. Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

  2. Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.

  3. IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
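
For example, when relying on environment variables (the first option above), the credentials only need to be present in the environment of the Bacalhau process before it starts; the values below are placeholders:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>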

Required IAM Policies

Compute Nodes

Compute nodes must run with the following policies to publish to S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
  • PutObject Permissions: The s3:PutObject permission is necessary to publish objects to the specified S3 bucket.

  • Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow publishing with any prefix within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow publishing to multiple buckets, or * to allow publishing to all buckets in the account.
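
For instance, to restrict publishing to a single prefix within the bucket, the Resource ARN in the policy above can be narrowed; the results/ prefix here is illustrative:

"Resource": "arn:aws:s3:::BUCKET_NAME/results/*"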

Requester Node

To enable downloading published results using bacalhau job get <job_id> command, the requester node must run with the following policies:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
  • GetObject Permissions: The s3:GetObject permission is necessary for the requester node to provide a pre-signed URL to download the published results by the client.

IPFS Publisher Specification

IPFS Publisher Parameters

For the IPFS publisher, no specific parameters need to be defined in the publisher specification. The user only needs to indicate the publisher type as IPFS, and Bacalhau handles the rest. Here is an example of how to set up an IPFS Publisher in a job specification.

Publisher:
  Type: ipfs

Published Result Specification

Once the job is executed, the results are published to IPFS, and a unique CID (Content Identifier) is generated for each file or piece of data. This CID acts as an address to the file in the IPFS network and can be used to access the file globally.

Result Parameters

  • CID (string): This is the unique content identifier generated by IPFS, which can be used to access the published content from anywhere in the world. Every data piece stored on IPFS has its unique CID. Here's a sample of how the published result might appear:

PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"

In this example, the task results will be stored in IPFS, and can be referenced and retrieved using the specified CID. This is indicative of Bacalhau's commitment to offering flexible, reliable, and decentralized options for result storage, catering to a diverse set of user needs and preferences.

IPFS Source Specification

Source Specification Parameters

Here are the parameters that you can define for an IPFS input source:

  • CID (string: <required>): The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.

Example

Below is an example of how to define an IPFS input source in YAML format.

InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"

In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.

Example (Imperative/CLI)

Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:

  1. Mount an IPFS CID to the default /inputs directory:

    bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ...
  2. Mount an IPFS CID to a custom /data directory:

    bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72:/data ubuntu ...

These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.

Reading Data from Multiple S3 Buckets using Bacalhau

Introduction

Bacalhau, a powerful and versatile data processing platform, has integrated Amazon Web Services (AWS) S3, allowing users to seamlessly access and process data stored in S3 buckets within their Bacalhau jobs. This integration not only simplifies data input, output, and processing operations but also streamlines the overall workflow by enabling users to store and manage their data effectively in S3 buckets. With Bacalhau, you can process several large S3 buckets in parallel. In this example, we will walk you through the process of reading data from multiple S3 buckets and converting TIFF images to JPEG format.

Advantages of Converting TIFF to JPEG

There are several advantages to converting images from TIFF to JPEG format:

  1. Reduced File Size: JPEG images use lossy compression, which significantly reduces file size compared to lossless formats like TIFF. Smaller file sizes lead to faster upload and download times, as well as reduced storage requirements.

  2. Efficient Processing: With smaller file sizes, image processing tasks tend to be more efficient and require less computational resources when working with JPEG images compared to TIFF images.

  3. Training Machine Learning Models: Smaller file sizes and reduced computational requirements make JPEG images more suitable for training machine learning models, particularly when dealing with large datasets, as they can help speed up the training process and reduce the need for extensive computational resources.

Running the job on Bacalhau

We will use the S3 mount feature to mount objects from S3 buckets. Let's have a look at the example below:

-i src=s3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif,dst=/sentinel-s1-rtc-indigo/,opt=region=us-west-2

It defines an S3 object as input to the job:

  1. sentinel-s1-rtc-indigo: the bucket's name

  2. tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif: the key of the object in that bucket. The object to be processed is called Gamma0_VH.tif and is located in the subdirectory with the specified path.

  3. If you want to select all objects under that path, simply add * to the end of the path (tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/*)

  4. dst=/sentinel-s1-rtc-indigo: the destination path at which to mount the S3 object

  5. opt=region=us-west-2: specifies the region in which the bucket is located

Prerequisite

1. Running the job on multiple buckets with multiple objects

In the example below, we will mount several bucket objects from public s3 buckets located in a specific region:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --publisher=ipfs \
    --memory=10Gb \
    --wait-timeout-secs 3600 \
    -i src=s3://bdc-sentinel-2/s2-16d/v1/075/086/2018/02/18/*,dst=/bdc-sentinel-2/,opt=region=us-west-2  \
    -i src=s3://sentinel-cogs/sentinel-s2-l2a-cogs/28/M/CV/2022/6/S2B_28MCV_20220620_0_L2A/*,dst=/sentinel-cogs/,opt=region=us-west-2 \
    jsacex/gdal-s3)

The job has been submitted and Bacalhau has printed out the related job_id. We store that in an environment variable so that we can reuse it later on.

2. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results

3. Viewing your Job Output

Display the image

To view the images, download the job results and open the folder:

results/outputs/S2-16D_V1_075086_20180218_B04_TCI.jpg
results/outputs/B04_TCI.jpg

Support

EntryModule (InputSource: required): Specifies the WASM module that contains the start function or the main execution code of the task. The InputSource should point to the location of the WASM binary.

ImportModules ([]InputSource: optional): An array of InputSources pointing to additional WASM modules. The exports from these modules will be available as imports to the EntryModule, enabling modular and reusable WASM code.

Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using the S3 Input Source. The published result specification includes the following parameters:

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

The IPFS Publisher in Bacalhau amplifies the versatility of task result storage by integrating with the InterPlanetary File System (IPFS). IPFS is a protocol and network designed to create a peer-to-peer method of storing and sharing hypermedia in a distributed file system. Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure.

The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.

To get started, you need to install the Bacalhau client, see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
