
v1.5.x

Getting Started

Container Onboarding

Setting Up

Running Nodes

Workload Onboarding

This directory contains examples relating to performing common tasks with Bacalhau.

Container

Python

R (language)

Data Ingestion

Networking Instructions

Marketplace Deployments

Guides

Examples

Data Engineering

This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.

Model Inference

Model Training

Molecular Dynamics

References

Jobs Guide

Engines

Publishers

Sources
Installation

Install the Bacalhau CLI

In this tutorial, you'll learn how to install and run a job with the Bacalhau client using the Bacalhau CLI or Docker.

Step 1 - Install the Bacalhau Client

The Bacalhau client is a command-line interface (CLI) that allows you to submit jobs to the Bacalhau network. The client is available for Linux, macOS, and Windows. You can also run the Bacalhau client in a Docker container.

Step 1.1 - Install the Bacalhau CLI

Step 1.2 - Verify the Installation

To verify installation and check the version of the client and server, use the version command. To run a Bacalhau client command with Docker, prefix it with docker run ghcr.io/bacalhau-project/bacalhau:latest.

bacalhau version
docker run -it ghcr.io/bacalhau-project/bacalhau:latest version

If you're wondering which server is being used, the Bacalhau Project has a demo network that's shared with the community. This network allows you to familiarize yourself with Bacalhau's capabilities and launch jobs from your computer without maintaining a compute cluster of your own.

Step 2 - Submit a Hello World job

bacalhau docker run [flags] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
bacalhau docker run \
--config api.host=bootstrap.production.bacalhau.org \
alpine echo helloWorld

Let's take a look at the results of the command execution in the terminal:

Job successfully submitted. Job ID: j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 15:32:50.323              Submission       Job submitted 
 15:32:50.332  e-6e4f2db9  Scheduling       Requested execution on n-f1c579e2 
 15:32:50.410  e-6e4f2db9  Execution        Running 
 15:32:50.986  e-6e4f2db9  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-de72aeff-0f18-4f70-a07c-1366a0edcb64

To get more details about the run executions, execute:
	bacalhau job executions j-de72aeff-0f18-4f70-a07c-1366a0edcb64

After the above command is run, the job is submitted to the selected network, which processes the job and Bacalhau prints out the related job id:

Job successfully submitted. Job ID: j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Checking job status...

The job_id above is shown in its full form. For convenience, you can use the shortened version, in this case: j-de72aeff.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
                docker run \
                --id-only \
                --wait \
                ubuntu:latest -- \
                sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Let's take a look at the results of the command execution in the terminal:

14:02:25.992 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
19b105c9-4cb5-43bd-a12f-d715d738addd

Step 3 - Checking the State of your Jobs

Now that the job is deployed, we can use the CLI to interact with the network. The job was sent to the public demo network, where it was processed, and we can call the following commands. The job_id will differ for every submission.

Step 3.1 - Job information:

bacalhau job describe j-de72aeff

Let's take a look at the results of the command execution in the terminal:

ID            = j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Name          = j-de72aeff-0f18-4f70-a07c-1366a0edcb64
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-10-07 13:32:50
Modified Time = 2024-10-07 13:32:50
Version       = 0

Summary
Completed = 1

Job History
 TIME                 TOPIC         EVENT         
 2024-10-07 15:32:50  Submission    Job submitted 
 2024-10-07 15:32:50  State Update  Running       
 2024-10-07 15:32:50  State Update  Completed     

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED    MODIFIED   COMMENT 
 e-6e4f2db9  n-f1c579e2  Completed  Stopped  6     4m18s ago  4m17s ago          

Execution e-6e4f2db9 History
 TIME                 TOPIC       EVENT                             
 2024-10-07 15:32:50  Scheduling  Requested execution on n-f1c579e2 
 2024-10-07 15:32:50  Execution   Running                           
 2024-10-07 15:32:50  Execution   Completed successfully            

Standard Output
helloWorld

This outputs all information about the job, including stdout, stderr, where the job was scheduled, and so on.

Step 3.2 - Job download:

bacalhau job get j-de72aeff
Fetching results of job 'j-de72aeff'...
Results for job 'j-de72aeff' have been written to...
/home/username/.bacalhau/job-j-de72aeff

After the download has finished, you should see the following contents in the results directory.

job-j-de72aeff
├── exitCode
├── outputs
├── stderr
└── stdout

Step 4 - Viewing your Job Output

cat j-de72aeff/stdout

That should print out the string helloWorld.

helloWorld

Step 5 - Where to go next?

Here are a few resources that provide a deeper dive into running jobs with Bacalhau:

  • How Bacalhau Works

  • Create your Private Network

  • Examples & Use Cases

Support

By default, you will submit to the Bacalhau public network, but the same CLI can be configured to submit to a private Bacalhau network. For more information, please read Running Bacalhau on a Private Network.

You can install or update the Bacalhau CLI by running the commands in a terminal. You may need sudo mode or root password to install the local Bacalhau binary to /usr/local/bin:

curl -sL https://get.bacalhau.org/install.sh | bash

Windows users can download the latest release tarball from GitHub and extract bacalhau.exe to any location available in the PATH environment variable.

docker image rm -f ghcr.io/bacalhau-project/bacalhau:latest # Remove old image if it exists
docker pull ghcr.io/bacalhau-project/bacalhau:latest

To run a specific version of Bacalhau using Docker, use the command docker run -it ghcr.io/bacalhau-project/bacalhau:v1.0.3, where v1.0.3 is the version you want to run; note that the latest tag will not re-download the image if you have an older version. For more information on running the Docker image, check out the Bacalhau docker image example.

To submit a job in Bacalhau, we will use the bacalhau docker run command. The command runs a job using the Docker executor on the node. Let's take a quick look at its syntax:

To run the job, you will need to connect to a public demo network or set up your own private network. In the following example, we will use the public demo network by using the --config flag.

We will use the bacalhau docker run command to submit a Hello World job that runs an echo program within an Alpine container.

While this command is designed to resemble Docker's run command, which you may be familiar with, Bacalhau introduces a whole new set of flags to support its computing model.

You can find out more information about your job by using bacalhau job describe.

You can download your job results directly by using bacalhau job get.

Depending on the selected publisher, this may result in:

While executing this command, you may encounter warnings regarding receive and send buffer sizes: failed to sufficiently increase receive buffer size. These warnings can arise due to limitations in the UDP buffer used by Bacalhau to process tasks. Additional information can be found at https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes.

With that, you have just successfully run a job on Bacalhau!


If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).


Hardware Setup

Different jobs may require different amounts of resources to execute. Some jobs may have specific hardware requirements, such as GPU. This page describes how to specify hardware requirements for your job.

Please bear in mind that each executor is implemented independently and these docs might be slightly out of date. Double check the man page for the executor you are using with bacalhau [executor] --help.

Docker Executor

The following table describes how to specify hardware requirements for the Docker executor.

 Flag      Default  Description
 --cpu     500m     Job CPU cores (e.g. 500m, 2, 8)
 --memory  1Gb      Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb)
 --gpu     0        Job GPU requirement (e.g. 1)

How it Works

When you specify hardware requirements, the job will be offered out to the network to see if there are any nodes that can satisfy the requirements. If there are, the job will be scheduled on the node and the executor will be started.
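For example, a job that needs 2 CPU cores, 4Gb of memory and one GPU could be requested as follows (the image and command are placeholders borrowed from the GPU example below):

bacalhau docker run --cpu=2 --memory=4Gb --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi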

GPU Setup

Bacalhau supports GPU workloads. Learn how to run a job using GPU workloads with the Bacalhau client.

Prerequisites

  • The Bacalhau network must have an executor node with a GPU exposed

  • Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node

Usage

Use the following command to see the amount of available resources:

bacalhau node list --show=capacity

To submit a request for a job that requires more than the standard set of resources, add the --cpu and --memory flags. For example, for a job that requires 2 CPU cores and 4Gb of RAM, use --cpu=2 --memory=4Gb, e.g.:

bacalhau docker run ubuntu echo Hello World --cpu=2 --memory=4Gb

To submit a GPU job request, use the --gpu flag under the docker run command to select the number of GPUs your job requires. For example:

bacalhau docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Limitations

The following limitations currently exist within Bacalhau.

  1. Maximum CPU and memory limits depend on the participants in the network

  2. For GPU:

    1. NVIDIA, Intel or AMD GPUs only

    2. Only the Docker Executor supports GPUs

Welcome

Welcome to the Bacalhau documentation!

In Bacalhau v1.5.0, a couple of things changed.

  • There is now a built-in WebUI.

What is Bacalhau?

Why Bacalhau?

Bacalhau simplifies the process of managing compute jobs by providing a unified platform for managing jobs across different regions, clouds, and edge devices.

How it works

Bacalhau consists of a network of nodes that enables orchestration between every compute resource, whether it is a cloud VM, an on-premises server, or an edge device. The network consists of two types of nodes:

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

Data ingestion

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. Here are some options that can help you mount your data:

The options are not limited to the ones mentioned above. You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data.

Security in Bacalhau

All workloads run under restricted Docker or WASM permissions on the node. Additionally, you can use existing (locked down) binaries that are pre-installed through Pluggable Executors.

Finally, endpoints (such as vaults) can also be used to provide secure access to Bacalhau. This way, the client can authenticate with Bacalhau using the token without exposing their credentials.

Use Cases

Bacalhau can be used for a variety of data processing workloads, including machine learning, data analytics, and scientific computing. It is well-suited for workloads that require processing large amounts of data in a distributed and parallelized manner.

Here are some example tutorials on how you can process your data with Bacalhau:

Community

Bacalhau has a very friendly community and we are always happy to help you get started:

Next Steps

Create Network

In this tutorial, you will set up your own private network.

Introduction

Bacalhau allows you to create your own private network so you can securely run private workloads without the risks inherent in working on public nodes or inadvertently distributing data outside your organization.

This tutorial describes the process of creating your own private network from multiple nodes, configuring the nodes and running demo jobs.

TLDR

  1. Create and apply auth token

  2. Configure the auth token and orchestrators list on the other hosts

  3. Copy and paste the environment variables it outputs under the "To connect to this node from the client, run the following commands in your shell" line to a client machine

  4. Done! You can run an example, like:

bacalhau docker run alpine echo hello

Prerequisites

  1. Prepare the hosts on which the nodes are going to be set up. They could be:

    1. Physical Hosts

    2. Local Hypervisor VMs

  2. Ensure that all nodes are connected to the same network and that the necessary ports are open for communication between them.

    1. Ensure your nodes have an internet connection in case you have to download or upload any data (docker images, input data, results)

Start Initial Requestor Node

The Bacalhau network consists of nodes of two types: compute and requester. Compute Node is responsible for executing jobs and producing results. Requester Node is responsible for handling user requests, forwarding jobs to compute nodes and monitoring the job lifecycle.

The first step is to start up the initial Requester node. This node will connect to nothing but will listen for connections.

Start by creating a secure token. This token will be used for authentication between the orchestrator and compute nodes during their communications. Any string can be used as a token, preferably one that is not easy to guess or brute-force. In addition, new authentication methods will be introduced in future releases.

Create and Set Up a Token

Let's use the uuidgen tool to create our token, then add it to the Bacalhau configuration and run the requester node:

# Create token and write it into the 'my_token' file
uuidgen > my_token

# Add the token to the Bacalhau configuration
bacalhau config set orchestrator.auth.token=$(cat my_token)
# Start the Requester node
bacalhau serve --orchestrator

This will produce output similar to this, indicating that the node is up and running:

17:27:42.273 | INF cmd/cli/serve/serve.go:102 > Config loaded from: [/home/username/.bacalhau/config.yaml], and with data-dir /home/username/.bacalhau
17:27:42.322 | INF cmd/cli/serve/serve.go:228 > Starting bacalhau...
17:27:42.405 | WRN pkg/nats/logger.go:49 > Filestore [KV_node_v1] Stream state too short (0 bytes) [Server:n-0f29f45c-c894-4f8f-8a0a-8f2f1f64d96d]
17:27:42.479 | INF cmd/cli/serve/serve.go:300 > bacalhau node running [address:0.0.0.0:1234] [compute_enabled:false] [name:n-0f29f45c-c894-4f8f-8a0a-8f2f1f64d96d] [orchestrator_address:0.0.0.0:4222] [orchestrator_enabled:true] [webui_enabled:true]

To connect to this node from the local client, run the following commands in your shell:
export BACALHAU_API_HOST=127.0.0.1
export BACALHAU_API_PORT=1234

17:27:42.479 | INF webui/webui.go:65 > Starting UI server [listen:0.0.0.0:8438]
A copy of these variables have been written to: /home/username/.bacalhau/bacalhau.run

Note that for security reasons, the output of the command contains the localhost 127.0.0.1 address instead of your real IP. To connect to this node, you should replace it with your real public IP address yourself. The method for obtaining your public IP address may vary depending on the type of instance you're using. Windows and Linux instances can be queried for their public IP using the following command:

curl https://api.ipify.org

Create and Connect Compute Node

Now let's move to another host from the prerequisites, start a compute node on it and connect it to the requester node. Here you will also need to add the same token to the configuration as on the requester.

# Add the token to the Bacalhau configuration
bacalhau config set compute.auth.token=$(cat my_token)

Then execute the serve command to connect to the requester node:

bacalhau serve --compute --orchestrators=<Public-IP-of-Requester-Node>

This will produce output similar to this, indicating that the node is up and running:

# formatting has been adjusted for better readability
16:23:33.386 | INF cmd/cli/serve/serve.go:256 > bacalhau node running 
[address:0.0.0.0:1235] 
[capacity:"{CPU: 1.40, Memory: 2.9 GB, Disk: 13 GB, GPU: 0}"]
[compute_enabled:true] [engines:["docker","wasm"]]
[name:n-7a510a5b-86de-41db-846f-8c6a24b67482] [orchestrator_enabled:false]
[orchestrators:["127.0.0.1","0.0.0.0"]] [publishers:["local","noop"]]
[storages:["urldownload","inline"]] [webui_enabled:false]

To ensure that the nodes are connected to the network, run the following command, specifying the public IP of the requester node:

bacalhau --api-host <Public-IP-of-Requester-Node> node list

This will produce output similar to this, indicating that the nodes belong to the same network:

bacalhau --api-host 10.0.2.15 node list
 ID          TYPE       STATUS    LABELS                                              CPU     MEMORY      DISK         GPU  
 n-7a510a5b  Compute              Architecture=amd64 Operating-System=linux           0.8 /   1.5 GB /    12.3 GB /    0 /  
                                  git-lfs=true                                        0.8     1.5 GB      12.3 GB      0    
 n-b2ab8483  Requester  APPROVED  Architecture=amd64 Operating-System=linux                                                 

Submitting Jobs

To connect to the requester node, find the following lines in the requester node logs:

To connect to this node from the local client, run the following commands in your shell:
export BACALHAU_API_HOST=<Public-IP-of-the-Requester-Node>
export BACALHAU_API_PORT=1234

The exact list of commands will be different for each node and is output by the bacalhau serve command.

Note that by default these commands contain 127.0.0.1 or 0.0.0.0 instead of the actual public IP. Make sure to replace it before executing them.

Now you can submit your jobs using the bacalhau docker run, bacalhau wasm run and bacalhau job run commands. For example, submit a hello-world job:

bacalhau docker run alpine echo hello
Job successfully submitted. Job ID: j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 16:34:16.467              Submission       Job submitted 
 16:34:16.484  e-1e9dca31  Scheduling       Requested execution on n-d41eeae7 
 16:34:16.550  e-1e9dca31  Execution        Running 
 16:34:17.506  e-1e9dca31  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff

To get more details about the run executions, execute:
	bacalhau job executions j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff

You will be able to see the job execution logs on the compute node:

16:34:16.571 | INF pkg/executor/docker/executor.go:119 > starting execution [NodeID:n-d41eeae7] [execution:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [executionID:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [job:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff] [jobID:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff]

...

16:34:17.496 | INF pkg/executor/docker/executor.go:221 > received results from execution [executionID:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5]
16:34:17.505 | INF pkg/compute/executor.go:196 > cleaning up execution [NodeID:n-d41eeae7] [execution:e-1e9dca31-7089-4cbf-a2f6-a584930bbae5] [job:j-5be2a5b2-567e-4f57-ac9e-8816e47ebeff]

Publishers and Sources Configuration

By default, only the local publisher and the URL and local sources are available on the compute node. The following describes how to configure the appropriate sources and publishers:

Your chosen publisher can be set for your Bacalhau compute nodes declaratively using the configuration YAML file or imperatively within the job execution commands:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

Or within your imperative job execution commands:

bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu …
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
    Target: "/data"
Publisher:
  Type: ipfs

Or within your imperative job execution commands:

bacalhau docker run --publisher ipfs ubuntu ...
InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"

Or imperative format:

bacalhau docker run --input QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ:/data ...

Bacalhau allows you to publish job results directly to the compute node. Please note that this method is not a reliable storage option and is recommended mainly for introductory purposes.

Publisher:
  Type: local

Or within your imperative job execution commands:

bacalhau docker run --publisher local ubuntu ...

To allow jobs to access local files when starting a node, allow-list the relevant paths with the Compute.AllowListedLocalPaths configuration key, specifying the path to the data and the access mode:

bacalhau config set Compute.AllowListedLocalPaths=/etc/config:rw,/etc/*.conf:ro

Further, the path to the local data must be specified in the job, in either declarative or imperative form. Declarative example of the local input source:

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

Imperative example of the local input source:

bacalhau docker run --input file:///etc/config:/config ubuntu ...

Bacalhau Configuration Keys Overview

  1. JobAdmissionControl.AcceptNetworkedJobs: Allows the node to accept jobs that require network access.

bacalhau config set JobAdmissionControl.AcceptNetworkedJobs=true

  2. Labels: Describes the node with labels in a key=value format. Labels can later be used by a job as conditions for choosing the node to run on. For example:

bacalhau config set Labels=NodeType=WebServer

  3. Compute.Orchestrators: Specifies the list of orchestrators to connect to. Applies to compute nodes.

bacalhau config set Compute.Orchestrators=127.0.0.1

  4. DataDir: Specifies the path to the directory where the Bacalhau node will maintain its state. The default value is /home/username/.bacalhau. Helpful when the repo should be initialized in a non-default path, or when more than one node should be started on a single machine.

bacalhau config set DataDir=/path/to/new/directory

  5. WebUI.Enabled: Enables the WebUI, providing up-to-date information about the jobs and nodes on your network.

bacalhau config set WebUI.Enabled=true

Best Practices for Production Use Cases

Your private cluster can be quickly set up for testing packaged jobs and tweaking data processing pipelines. However, when using a private cluster in production, here are a few considerations to note.

  1. Ensure separation of concerns in your cloud deployments by mounting the Bacalhau repository on a separate non-boot disk. This prevents instability on shutdown or restarts and improves performance within your host instances.
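For example, assuming the extra disk is mounted at /data/bacalhau (a placeholder path), you could point the node's state directory at it with the DataDir key described above:

bacalhau config set DataDir=/data/bacalhau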

Docker Workloads

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

Requirements

Here are a few things to note before getting started:

  1. Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.

  2. Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.

  3. Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.
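As a rough sketch, the three flag forms look like this (the CID, URL and bucket name are placeholders, and each source is typically mounted at the default /inputs path):

bacalhau docker run --input ipfs://<CID> ubuntu ls /inputs
bacalhau docker run --input https://example.com/data.csv ubuntu ls /inputs
bacalhau docker run --input s3://my-bucket/logs-2023-04* ubuntu ls /inputs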

Note: Only about a third of the examples have their containers published. If you can't find one, feel free to contact the team.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

  1. Limited Ingress/Egress Networking:

  2. Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:
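A minimal sketch of such a job might look like this (the bucket, image and command are placeholders):

bacalhau docker run \
  -i src=s3://mybucket/logs-2023-04*,dst=/input \
  -o apples:/output_folder \
  ubuntu -- sh -c 'ls /input && cp -r /input/. /output_folder/'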

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.

Step 3 - Build and Push Your Image To a Registry

Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.

For example:
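A sketch of building and pushing an amd64 image; myuser/myimage is a placeholder repository name:

docker buildx build --platform linux/amd64 --tag myuser/myimage:v1.0 --push .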

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

Let's see what each command will be used for:

Exports the current working directory of the host system to the LOCAL_INPUT_DIR variable. This variable will be used for binding a volume and transferring data into the container.

Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.

Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.

Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.

For example:
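A sketch of those commands, assuming the image built in the previous step (myuser/myimage:v1.0 is a placeholder):

export LOCAL_INPUT_DIR=$PWD
export LOCAL_OUTPUT_DIR=$PWD
CMD=(sh -c 'ls /inputs; echo "Test output" > /outputs/stdout')
docker run --rm \
  -v "$LOCAL_INPUT_DIR:/inputs" \
  -v "$LOCAL_OUTPUT_DIR:/outputs" \
  myuser/myimage:v1.0 "${CMD[@]}"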

The result of the commands' execution (in this case, the stdout file written by the container) will appear in your local output directory.

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can choose to mount your data anywhere on your machine, and Bacalhau will be able to run against that data.

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command.
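For example (the CID and image name are placeholders):

bacalhau docker run --input ipfs://<CID> myuser/myimage:v1.0 -- sh -c 'ls /inputs'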

To check the status of your job, run the following command.
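For example, to list your recent jobs:

bacalhau job list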

To get more information on your job, you can run the following command.
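For example, using the job ID returned when the job was submitted:

bacalhau job describe <job-id>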

To download your job, run.
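For example:

bacalhau job get <job-id>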

To put this all together: submit the job with bacalhau docker run, note the job ID that is returned, and then describe and download the results using the commands above.

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:
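A sketch with a placeholder URL and image name:

bacalhau docker run --input https://example.com/dataset.csv myuser/myimage:v1.0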

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

This can often be resolved by re-tagging your docker image

Support

Connect Storage

Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources are storage resources consumed as inputs to jobs, and Publishers are storage resources created with the results of jobs.

Sources

Publishers

How Bacalhau Works

In this tutorial we will go over the components and the architecture of Bacalhau. You will learn how it is built, what components are used, and how you can interact with and use Bacalhau.

Chapter 1 - Architecture

Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes, which can communicate with each other.

The requester and compute nodes together form a peer-to-peer network and use gossiping to discover each other and share information about node capabilities, available resources and health status.

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

To interact with the Bacalhau network, users can use the Bacalhau CLI (command-line interface) to send requests to a requester node in the network. These requests are sent using the JSON format over HTTP, a widely-used protocol for transmitting data over the internet. Bacalhau's architecture involves two main sections which are the core components and interfaces.

Components overview

Core Components

The core components are responsible for handling requests and connecting different nodes. The network includes two different components:

Requester node

In the Bacalhau network, the requester node is responsible for handling requests from clients using JSON over HTTP. This node serves as the main custodian of jobs that are submitted to it. When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and coordinates the job execution.

Compute node

In the Bacalhau network, it is the compute node that is responsible for determining whether it can execute a job or not. This model allows for a more decentralized approach to job orchestration as the network will function properly even if the requester nodes have a stale view of the network, or if concurrent requesters are allocating jobs to the same compute nodes. Once the compute node has run the job and produced results, it will publish the results to a remote destination as specified in the job specification (e.g. S3), and notify the requester of the job completion. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Interfaces

The interfaces handle the distribution, execution, storage and publishing of jobs. In the following all the different components are described and their respective protocols are shown.

Transport

The transport interface is responsible for sending messages about jobs that are created, accepted, and executed to other compute nodes. It also manages the identity of individual Bacalhau nodes to ensure that messages are only delivered to authorized nodes, which improves network security. To achieve this, the transport interface uses a protocol, which is a point-to-point scheduling protocol that runs securely and is used to distribute job messages efficiently to other nodes on the network. This is our upgrade to previous handlers as it ensures that messages are delivered to the right nodes without causing network congestion, thereby making communication between nodes more scalable and efficient.

Executor

The executor is a critical component of the Bacalhau network that handles the execution of jobs and ensures that the storage used by the job is local. One of its main responsibilities is to present the input and output storage volumes into the job when it is run. The executor performs two primary functions: presenting the storage volumes in a format that is suitable for the executor and running the job. When the job is completed, the executor will merge the stdout, stderr and named output volumes into a results folder that is then published to a remote location. Overall, the executor plays a crucial role in the Bacalhau network by ensuring that jobs are executed properly, and their results are published accurately.

Storage Provider

In a peer-to-peer network like Bacalhau, storage providers play a crucial role in presenting an upstream storage source. There can be different storage providers available in the network, each with its own way of manifesting the CID (Content IDentifier) to the executor. For instance, there can be a POSIX storage provider that presents the CID as a POSIX filesystem, or a library storage provider that streams the contents of the CID via a library call. Therefore, the storage providers and Executor implementations are loosely coupled, allowing the POSIX and library storage providers to be used across multiple executors, wherever it is deemed appropriate.

Publisher

The publisher is responsible for uploading the final results of a job to a remote location where clients can access them, such as S3 or IPFS.

Chapter 2 - Job cycle

Job preparation

Advanced job preparation

Job Submission

You should use the Bacalhau client to send a task to the network. The client transmits the job information to the Bacalhau network via established protocols and interfaces. Jobs submitted via the Bacalhau CLI are forwarded to a Bacalhau network node at (http://bootstrap.production.bacalhau.org/) via port 1234 by default. This Bacalhau node will act as the requester node for the duration of the job lifecycle.

Bacalhau provides an interface to interact with the server via a REST API. Bacalhau uses 127.0.0.1 as the localhost and 1234 as the port by default.

Bacalhau Docker CLI commands
Bacalhau WASM CLI commands

Job Acceptance

When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and communicate with them directly. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Job execution

Results publishing

When the compute node completes the job, it publishes the results to the remote destination specified in the job specification, such as S3 or IPFS.

Chapter 3 - Returning Information

The Bacalhau client receives updates on the task execution status and results. A user can access the results and manage tasks through the command line interface.

Get Job Results

To get the results of a job, you can run the following command.
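For example, using the shortened job ID from the earlier run:

bacalhau job get j-de72aeff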

One can choose from a wide range of flags, from which a few are shown below.

Describe a Job

To describe a specific job, insert its ID into the CLI or API to get back an overview of the job.
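For example:

bacalhau job describe j-de72aeff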

List of Jobs

If you run more than one job, or you want to find a specific job ID, you can list all jobs.
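For example, to list recent jobs:

bacalhau job list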

Job Executions

To list executions, run the following command.
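For example:

bacalhau job executions j-de72aeff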

Chapter 4 - Monitoring and Management

Stop a Job

Job History

Job Logs
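The corresponding CLI commands are sketched below with a placeholder job ID:

bacalhau job stop j-de72aeff
bacalhau job history j-de72aeff
bacalhau job logs j-de72aeff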

Job selection policy

When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.

Job selection probes

If you want more control over making the decision to take on jobs, you can use the JobAdmissionControl.ProbeExec and JobAdmissionControl.ProbeHTTP configuration keys.

These are external programs that are passed the following data structure so that they can make a decision about whether to take on a job:

The exec probe is a script to run that will be given the job data on stdin, and must exit with status code 0 if the job should be run.

The http probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns an error status code (e.g. >= 400).

For example, the following response will reject the job:

If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.

GPU Installation

How to enable GPU support on your Bacalhau node

Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.

Prerequisites

Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.

In general for GPUs from any vendor, the Bacalhau client requires:

Nvidia

  1. nvidia-smi installed and functional

AMD

  1. rocm-smi tool installed and functional

Intel

  1. xpu-smi tool installed and functional

GPU Node Configuration

Access Management

How to configure authentication and authorization on your Bacalhau node.

Access Management

Bacalhau includes a flexible auth system that supports multiple authentication methods appropriate for different deployment environments.

By default

With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.

In anonymous mode, Bacalhau will allow:

  1. Users identified by a self-generated private key to submit any job and cancel their own jobs.

  2. Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.

Restricting anonymous access

Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.

Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:

Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.

Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:

Then, modify the allowed_clients variable in challenge_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau agent node.

Once the node is restarted, only keys in the allowed list will be able to access any API.

Username and password access

Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.

Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt. To generate a salted password, the helper script in pkg/authn/ask/gen_password can be used:
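A sketch of invoking the helper, assuming a local checkout of the Bacalhau source tree and that the helper is a runnable Go package at the path mentioned above:

cd pkg/authn/ask/gen_password && go run .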

This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego.

Writing custom policies

In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.

Custom authentication policies

Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input variable. The exact information depends on the type of authentication used.

challenge authentication

challenge authentication identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.

Policies used for challenge authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.

Policies for this type will need to implement these rules:

  • bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

They should expect as fields on the input variable:

  • clientId: an ID derived from the user's private key that identifies them uniquely

  • nodeId: the ID of the requester node that this user is authenticating with

  • signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be this policy that returns the same opaque token for all users:

ask authentication

ask authentication uses credentials supplied manually by the user as identification. For example, an ask policy could require a username and password as input and check these against a known list. ask policies do all the verification of the supplied credentials.

Policies for this type will need to implement these rules:

  • bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

  • bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.

They should expect as fields on the input variable:

  • ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.

  • nodeId: the ID of the requester node that this user is authenticating with

  • signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:

Custom authorization policies

Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.

Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthorized users.

Policies will need to implement these rules:

  • bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.

  • bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.

They should expect as fields on the input variable for both rules:

  • http: details of the user's HTTP request:

    • host: the hostname used in the HTTP request

    • method: the HTTP method (e.g. GET, POST)

    • path: the path requested, as an array of path components without slashes

    • query: a map of URL query parameters to their values

    • headers: a map of HTTP header names to arrays representing their values

    • body: a blob of any content submitted as the body

  • constraints: details about the receiving node that should be used to validate any supplied tokens:

    • cert: keys that the input token should have been signed with

    • iss: the name of a node that this node will recognize as the issuer of any signed tokens

    • aud: the name of this node that is receiving the request

Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method which will validate the access token as a JWT against the given constraints.

The simplest possible authz policy might be this one that allows all users to access all endpoints:

Node Onboarding

Introduction

This tutorial describes how to add new nodes to an existing private network. Two basic scenarios will be covered:

Prerequisites

Add Host/Virtual Machine as a New Node

  1. Set the token in the Compute.Auth.Token configuration key

  2. Set the orchestrator's IP address in the Compute.Orchestrators configuration key

  3. Execute bacalhau serve, specifying the node type via the --compute flag, as sketched below
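A sketch of these three steps, with <token> and <Public-IP-of-Requester-Node> as placeholders:

bacalhau config set Compute.Auth.Token=<token>
bacalhau config set Compute.Orchestrators=<Public-IP-of-Requester-Node>
bacalhau serve --compute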

Add a Cloud Instance as a New Node

To automate the process using Terraform follow these steps:

  1. Determine the IP address of your requester node

  2. Write a terraform script, which does the following:

    1. Adds a new instance

    2. Installs bacalhau on it

    3. Launches a compute node

  3. Execute the script

Support

Node persistence

How to configure compute/requester persistence

Both compute nodes and requester nodes maintain state. How that state is maintained is configurable, although the defaults are likely adequate for most use cases. This page describes how to configure the persistence of compute and requester nodes should the defaults not be suitable.

Compute node persistence

The compute nodes maintain information about the work that has been allocated to them, including:

  • The current state of the execution, and

  • The original job that resulted in this allocation

This information is used by the compute and requester nodes to ensure allocated jobs are completed successfully. By default, compute nodes store their state in a bolt-db database and this is located in the bacalhau repository along with configuration data. For a compute node whose ID is "abc", the database can be found in ~/.bacalhau/abc-compute/executions.db.

In some cases, it may be preferable to maintain the state in memory, with the caveat that should the node restart, all state will be lost. This can be configured using the environment variables in the table below.

Requester node persistence

When running a requester node, it maintains state about the jobs it has been requested to orchestrate and schedule, the evaluation of those jobs, and the executions that have been allocated. By default, this state is stored in a bolt db database that, with a node ID of "xyz" can be found in ~/.bacalhau/xyz-requester/jobs.db.

Bacalhau has updated configurations. Please check out .

There is no anymore to give users even more control over their network from the start.

For more information, check out the .

Bacalhau is a platform designed for fast, cost-efficient, and secure computation by running jobs directly where the data is generated and stored. Bacalhau helps you streamline existing workflows without extensive rewriting, as it allows you to run arbitrary Docker containers and WebAssembly (WASM) images as tasks. This approach is also known as Compute Over Data (or CoD). The name comes from the Portuguese word for salted cod fish.

Bacalhau aims to revolutionize data processing for large-scale datasets by enhancing cost-efficiency and accessibility, making data processing available to a broader audience. Our goal is to build an open, collaborative compute ecosystem that fosters unmatched collaboration. We offer a demo network where you can try running jobs without any installation. Give it a try!

For a more detailed tutorial, check out our .

The best practice is to use environment variables to store sensitive data such as access tokens, API keys, or passwords. These variables can be accessed by Bacalhau at runtime and are not visible to anyone who has access to the code or the server. Alternatively, you can pre-provision credentials to the nodes and access those on a node-by-node basis.

Once you have more than 10 devices generating or storing around 100GB of data, you're likely to face challenges with processing that data efficiently. Traditional computing approaches may struggle to handle such large volumes, and that's where distributed computing solutions like Bacalhau can be extremely useful. Bacalhau can be used in various industries, including security, web serving, financial services, IoT, Edge, Fog, and multi-cloud. Bacalhau shines when it comes to data-intensive applications such as data engineering, model inference, model training, and molecular dynamics.


For more tutorials, visit our Examples section.

– ask anything about the project, give feedback, or answer questions that will help other users.

Join the Bacalhau Slack and go to the #bacalhau channel – it is the easiest way to engage with other members in the community and get help.

– learn how to contribute to the Bacalhau project.

👉 Continue with Bacalhau's Getting Started guide to learn how to install and run a job with the Bacalhau client.

👉 Or jump directly to try out the different examples that showcase Bacalhau's abilities.

Install Bacalhau with curl -sL https://get.bacalhau.org/install.sh | bash on every host

Start the Requester node: bacalhau serve --orchestrator

Cloud VMs (from any cloud provider)

Install Bacalhau on each host

Ensure that Docker is installed in case you are going to run Docker workloads

Bacalhau is designed to be versatile in its deployment, capable of running on various environments: physical hosts, virtual machines or cloud instances. Its resource requirements are modest, ensuring compatibility with a wide range of hardware configurations. However, for certain workloads, such as machine learning, it's advisable to consider hardware configurations optimized for computational tasks, including GPUs.

If you are using a cloud deployment, you can find your public IP through your provider's console.

To set up access to S3, you need to specify environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, populate a credentials file located on your compute node, i.e. ~/.aws/credentials, or create an IAM role for your compute nodes if you are utilizing cloud instances.

S3-compatible publishers can also be used as input sources for your jobs, with a similar configuration.

By default, Bacalhau does not connect to or create its own IPFS network. Consider creating your own IPFS network and connecting to it using the InputSources.Types.IPFS.Endpoint configuration key.

can be set for your Bacalhau compute nodes declaratively or imperatively using either configuration yaml file:

Data pinned to the IPFS network can be used as a source for your jobs. To do this, you will need to specify the CID in declarative or imperative format:

can be set for your Bacalhau compute nodes declaratively or imperatively using configuration yaml file:

The local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. To allow jobs to access local files when starting a node, the Compute.AllowListedLocalPaths configuration key should be used, specifying the path to the data and access mode :rw for Read-Write access or :ro for Read-Only (used by default). For example:

Optimize your private network nodes' performance and functionality with these most useful configuration keys related to node management:

JobAdmissionControl.AcceptNetworkedJobs: Allows the node to accept jobs that require network access

Ensure you are running the Bacalhau process from a dedicated system user with limited permissions. This enhances security and reduces the risk of unauthorized access to critical system resources. If you are using an orchestrator such as , utilize a service file to manage the Bacalhau process, ensuring the correct user is specified and consistently used. Here’s a

Create an authentication file for your clients. A can ease the process of maintaining secure data transmission within your network. With this, clients can authenticate themselves, and you can limit the Bacalhau API endpoints unauthorized users have access to.

Consistency is a key consideration when deploying decentralized tools such as Bacalhau. You can use an to affix a specific version of Bacalhau or specify deployment actions, ensuring that each host instance has all the necessary resources for efficient operations.

That's all folks! 🎉 Please contact us on #bacalhau channel for questions and feedback!

You can check out this example tutorial on to see how we used all these steps together.

You can check to see a used by the Bacalhau team

All ingress/egress networking is limited as described in the documentation. You won't be able to pull data, code, weights, etc. from an external source.

You can specify which directory the data is written to with the CLI flag.


At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build the image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the task.

Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint flag to bacalhau docker run.


If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the for more details.

To use the S3 source, you will have to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChecksumSHA256.

Below is an example of how to define an S3 input source in YAML format:

To start, you'll need to connect the Bacalhau node to an IPFS server so that you can run jobs that consume CIDs as inputs. You can either install and run a local IPFS node, or you can connect to a remote IPFS server.

In both cases, you should have a multiaddress for the IPFS server that should look something like this:

The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.

You can then configure your Bacalhau node to use this IPFS server by adding the address to the InputSources.Types.IPFS.Endpoint configuration key:
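A sketch of the configuration command, using a placeholder multiaddress:

bacalhau config set InputSources.Types.IPFS.Endpoint=/ip4/<IPFS-server-IP>/tcp/5001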

See the for more details.

Below is an example of how to define an IPFS input source in YAML format:

To use a local data source, you will have to:

  1. Enable the use of local data when configuring the node itself by using the Compute.AllowListedLocalPaths configuration key, specifying the file path and access mode. For example

  2. In the job description specify parameters SourcePath - the absolute path on the compute node where your data is located and ReadWrite - the access mode.

Below is an example of how to define a Local input source in YAML format:

To use a URL data source, you will have to specify only the URL parameter, as in the part of the declarative job description below:

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish job results to any S3-compatible storage service. To use an S3 publisher you will have to specify the required parameters Bucket and Key and the optional parameters Region, Endpoint, VersionID and ChecksumSHA256. See the S3 publisher specification for more details.

Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:

The IPFS publisher works using the same setup as above - you'll need to have an IPFS server running and a multiaddress for it. You'll then configure that multiaddress using the InputSources.Types.IPFS.Endpoint configuration key. Then you can use bacalhau job get <job-ID> with no further arguments to download the results.

To use the IPFS publisher you will have to specify the CID, which can be used to access the published content. See the IPFS publisher specification for more details.

And part of the declarative job description with an IPFS publisher will look like this:

The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.

Another option is to store the results of a job execution on the compute node itself. In this case, the results will be published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. To use the Local publisher you will only have to specify the URL parameter with an HTTP URL pointing to the location where you would like to save the result. See the Local publisher specification for more details.

Here is an example of part of the declarative job description with a local publisher:

You can create jobs in the Bacalhau network using the various job types introduced in version 1.2. Each job may need specific variables, resource requirements and data details that are described in the Job Specification.

Prepare data with Bacalhau by copying from URLs, pinning to public storage, or copying from an S3 bucket. Mount data anywhere for Bacalhau to run against. Refer to the IPFS, Local, S3, and URL Source Specifications for data source usage.

Optimize workflows without completely redesigning them. Run arbitrary tasks using Docker containers and WebAssembly images. Follow the Onboarding guides for Docker and WebAssembly workloads.

Explore GPU workload support with Bacalhau. Learn how to run GPU workloads using the Bacalhau client in the GPU Workloads section. Integrate Python applications with Bacalhau using the Bacalhau Python SDK.

For node operation, refer to the Running a Node section for configuring and running a Bacalhau node. If you prefer an isolated environment, explore the Private Cluster guide for performing tasks without connecting to the main Bacalhau network.

You can use the bacalhau job run command with appropriate flags to create a job in Bacalhau using JSON and YAML formats.

You can use the Create Job API to submit a new job for execution.

You can use the bacalhau docker run command to start a job in a Docker container. Below, you can see an excerpt of the commands:

You can also use the bacalhau wasm run command to run a job compiled into the WebAssembly (WASM) format. Below, you can find an excerpt of the commands in the Bacalhau CLI:

The selected compute node receives the job and starts its execution inside a container. The container can use different executors to work with the data and perform the necessary actions. A job can use the Docker executor or the WASM executor together with storage volumes. Use the Docker Engine Specification to view the parameters for configuring the Docker Engine. If you want tasks to be executed in a WebAssembly environment, pay attention to the WebAssembly Engine Specification.

Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure. View the IPFS Publisher Specification for detailed information.

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. View the S3 Publisher Specification for detailed information.

You can use the bacalhau job describe command with appropriate flags to get a full description of a job in YAML format.

You can use the Describe Job API to retrieve the specification and current status of a particular job.

You can use the bacalhau job list command with appropriate flags to list jobs on the network in YAML format.

You can use the List Jobs API to retrieve a list of jobs.

You can use the bacalhau job executions command with appropriate flags to list all executions associated with a job, identified by its ID, in YAML format.

You can use the Job Executions API to retrieve all executions for a particular job.

The Bacalhau client provides the user with tools to monitor and manage the execution of jobs. You can get information about status and progress and decide on next steps. View the Bacalhau Agent APIs if you want to know the node's health, capabilities, and deployed Bacalhau version. To get information about the status and characteristics of the nodes in the cluster, use the Nodes API.

You can use the bacalhau job stop command with appropriate flags to cancel a job that was previously submitted and stop it running if it has not yet completed.

You can use the Stop Job API to terminate a specific job asynchronously.

You can use the bacalhau job history command with appropriate flags to enumerate the historical events related to a job, identified by its ID.

You can use the Job History API to retrieve historical events for a specific job.

You can use the bacalhau job logs command to retrieve the log output (stdout and stderr) from a job. If the job is still running, it is possible to follow the logs after the previously generated logs are retrieved.

To familiarize yourself with all the commands used in Bacalhau, please view the CLI Commands guide.

If the HTTP response is a JSON blob, it should match the following schema and will be used to respond to the bid directly:

To run GPU workloads, the compute node needs Docker installed, permission to access Docker, the NVIDIA GPU drivers, and the NVIDIA Container Toolkit (nvidia-docker2). Verify the installation by running a sample workload.

See the AMD GPU drivers and Running ROCm Docker containers documentation for guidance on how to run Docker workloads on AMD GPUs.

See the Intel GPU drivers and Running on GPU under docker documentation for guidance on how to run Docker workloads on Intel GPUs.

Access to GPUs can be controlled using resource limits. To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.

Policies are written in a language called Rego, also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.

A more realistic example that returns a signed JWT is in challenge_ns_anon.rego.

A more realistic example that returns a signed JWT is in ask_ns_example.rego.

A more realistic example (which is the Bacalhau "anonymous mode" default) is in policy_ns_anon.rego.

Adding a machine as a new node.

Adding a cloud instance as a new node.

You should have an established private network consisting of at least one requester node. See the Create Private Network guide to set one up.

You should have a new host (physical/virtual machine, cloud instance or docker container) with Bacalhau installed.

Let's assume that you already have a private network with at least one requester node. In this case, the process of adding new nodes follows the Create And Connect Compute Node section. You will need to:

Let's assume you already have all the necessary cloud infrastructure set up, along with a private network with at least one requester node. In this case, you can add new nodes manually (AWS, Azure, GCP) or use a tool like Terraform to automatically create and add any number of nodes to your network. The process of adding new nodes manually follows the Create And Connect Compute Node section.

Configure Terraform for your cloud provider.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).


The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution. See the Local source specification for more details.

The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the job's execution environment, enabling direct and efficient data utilization. See the URL source specification for more details.

bacalhau config set Compute.AllowListedLocalPaths="/etc/config:rw,/etc/*.conf:ro"
InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"
InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    Target: "/data"
bacalhau job run [flags]
Endpoint: `PUT /api/v1/orchestrator/jobs` 
bacalhau docker run [flags] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entrypoint strings               Override the default ENTRYPOINT of the image
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                    Examples:
                                    # Mount IPFS CID to /inputs directory
                                    -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                    # Mount S3 object to a specific path
                                    -i s3://bucket/key,dst=/my/input/path
                                    # Mount S3 object with specific endpoint and region
                                    -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)
-w, --workdir string                   Working directory inside the container. Overrides the working directory shipped with the image (e.g. via WORKDIR in Dockerfile).
bacalhau wasm run {cid-of-wasm | <local.wasm>} [--entry-point <string>] [wasm-args ...] [flags]
Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entry-point string               The name of the WASM function in the entry module to call. This should be a zero-parameter zero-result function that
                                will execute the job. (default "_start")
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-U, --import-module-urls url           URL of the WASM modules to import from a URL source. URL accept any valid URL supported by the 'wget' command, and supports both HTTP and HTTPS.
-I, --import-module-volumes cid:path   CID:path of the WASM modules to import from IPFS, if you need to set the path of the mounted data.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                        Examples:
                                        # Mount IPFS CID to /inputs directory
                                        -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                        # Mount S3 object to a specific path
                                        -i s3://bucket/key,dst=/my/input/path
                                        # Mount S3 object with specific endpoint and region
                                        -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)
bacalhau job get [id] [flags]
Usage:
  bacalhau job get [id] [flags]

Flags:
      --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
  -h, --help                             help for get
      --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
      --ipfs-serve-path string           path local Ipfs node will persist data to
      --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
      --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
      --output-dir string                Directory to write the output to.
      --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
      --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
bacalhau job describe [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID`
bacalhau job list [flags]
Endpoint: `GET /api/v1/orchestrator/jobs`
bacalhau job executions [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/executions`
bacalhau job stop [id] [flags]
Endpoint: `DELETE /api/v1/orchestrator/jobs/:jobID`
bacalhau job history [id] [flags]
Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/history`
bacalhau job logs [flags] [id]
{
  "node_id": "XXX",
  "job_id": "XXX",
  "spec": {
    "engine": "docker",
    "verifier": "ipfs",
    "job_spec_vm": {
      "image": "ubuntu:latest",
      "entrypoint": ["cat", "/file.txt"]
    },
    "inputs": [{
      "engine": "ipfs",
      "cid": "XXX",
      "path": "/file.txt"
    }]
  }
}
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "shouldBid": {
      "description": "If the job should be accepted",
      "type": "boolean"
    },
    "shouldWait": {
      "description": "If the node should wait for an async response that will come later. `shouldBid` will be ignored",
      "type": "boolean",
      "default": false
    },
    "reason": {
      "description": "Human-readable string explaining why the job should be accepted or rejected, or why the wait is required",
      "type": "string"
    }
  },
  "required": [
    "shouldBid",
    "reason"
  ]
}
{
  "shouldBid": false,
  "reason": "The job did not pass this specific validation: ..."
}
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authz/policies/policy_ns_anon.rego -o ~/.bacalhau/no-anon.rego
bacalhau config set API.Auth.AccessPolicyPath ~/.bacalhau/no-anon.rego
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/challenge/challenge_ns_no_anon.rego -o ~/.bacalhau/challenge_ns_no_anon.rego
bacalhau config set API.Auth.Methods '{Method: ClientKey, Policy: {Type: challenge, PolicyPath: ~/.bacalhau/challenge_ns_no_anon.rego}}'
bacalhau agent node | jq -rc .ClientID
curl -sL https://raw.githubusercontent.com/bacalhau-project/bacalhau/main/pkg/authn/ask/ask_ns_password.rego -o ~/.bacalhau/ask_ns_password.rego
bacalhau config set API.Auth.Methods '{Method: Password, Policy: {Type: ask, PolicyPath: ~/.bacalhau/ask_ns_password.rego}}'
cd pkg/authn/ask/gen_password && go run .
package bacalhau.authn

token := "anything"
package bacalhau.authn

schema := {}
token := "anything"
package bacalhau.authz

allow := true
token_valid := true

Environment Variable        | Flag alternative               | Value                  | Effect
BACALHAU_COMPUTE_STORE_TYPE | --compute-execution-store-type | boltdb                 | Uses the bolt db execution store (default)
BACALHAU_COMPUTE_STORE_PATH | --compute-execution-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is ~/.bacalhau/{NODE-ID}-compute/executions.db if not set

Environment Variable    | Flag alternative           | Value                  | Effect
BACALHAU_JOB_STORE_TYPE | --requester-job-store-type | boltdb                 | Uses the bolt db job store (default)
BACALHAU_JOB_STORE_PATH | --requester-job-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is ~/.bacalhau/{NODE-ID}-requester/jobs.db if not set
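For example, a compute node could be pointed at a custom execution store location before the daemon is started; the path below is purely illustrative:

# illustrative location - any path writable by the Bacalhau user works
export BACALHAU_COMPUTE_STORE_TYPE=boltdb
export BACALHAU_COMPUTE_STORE_PATH=/var/lib/bacalhau/executions.db
bacalhau serve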

An example on how to build your own ETL pipeline with Bacalhau and MongoDB.

Test Network Locally

Before you join the main Bacalhau network, you can test locally.

To test, you can use the bacalhau devstack command, which offers a way to get a 3 node cluster running locally.

export PREDICTABLE_API_PORT=1
bacalhau devstack

By setting PREDICTABLE_API_PORT=1, the first node of our 3-node cluster will always listen on port 20000

In another window, export the following environment variables so that the Bacalhau client binary connects to our local development cluster:

export BACALHAU_API_HOST=127.0.0.1
export BACALHAU_API_PORT=20000

You can now interact with Bacalhau - all jobs are run by the local devstack cluster.

bacalhau docker run ubuntu echo hello
bacalhau job list
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
    Target: "/data"
export IPFS_CONNECT=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC
bacalhau config set InputSources.Types.IPFS.Endpoint=/ip4/10.1.10.10/tcp/80/p2p/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC
InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"
Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
Publisher:
  Type: ipfs
PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"
Publisher:
    Type: local
PublishedResult:
  Type: local
  Params:
    URL: "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"

Configuration key                       | Default value | Meaning
JobAdmissionControl.Locality            | Anywhere      | Only accept jobs that reference data we have locally ("local") or anywhere ("anywhere").
JobAdmissionControl.ProbeExec           | unused        | Use the result of an external program to decide if we should take on the job.
JobAdmissionControl.ProbeHTTP           | unused        | Use the result of a HTTP POST to decide if we should take on the job.
JobAdmissionControl.RejectStatelessJobs | False         | Reject jobs that do not reference any input data.
JobAdmissionControl.AcceptNetworkedJobs | False         | Accept jobs that require network access.
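These keys can be set with bacalhau config set, in the same way as the other configuration keys in this guide. For instance, a sketch of a compute node that only accepts jobs whose data is already available locally and rejects stateless jobs (the chosen values are illustrative):

bacalhau config set JobAdmissionControl.Locality=local
bacalhau config set JobAdmissionControl.RejectStatelessJobs=true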

Docker Workload Onboarding

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

Requirements

Here are a few things to note before getting started:

  1. Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.

  2. Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so arm64 containers will not run on them.

  3. Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.
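For illustration, here is roughly how each of these input forms looks on the command line (the CID, URL and bucket below are placeholders):

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ls /inputs
bacalhau docker run -i https://example.com/data.csv ubuntu ls /inputs
bacalhau docker run -i s3://my-bucket/logs-2023-04* ubuntu ls /inputs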

Note: Only about a third of the examples have their containers in the list of example public containers; the rest are hosted under various Docker Hub registries.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

  1. Limited Ingress/Egress Networking:

  2. Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:

bacalhau docker run \
  -i s3://mybucket/logs-2023-04*:/input \
  -o apples:/output_folder \
  ubuntu \
  bash -c 'ls /input > /output_folder/file.txt'

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.
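As a minimal sketch, a task can be as simple as a shell command that reads whatever lands in /inputs and writes its results under /outputs (the output file name is a placeholder):

# count the lines of every input file and save the summary as a job result
wc -l /inputs/* > /outputs/line_counts.txt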

Step 3 - Build and Push Your Image To a Registry

For example:

$ export IMAGE=myuser/myimage:latest
$ docker build -t ${IMAGE} .
$ docker image push ${IMAGE}

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

$ export LOCAL_INPUT_DIR=$PWD
$ export LOCAL_OUTPUT_DIR=$PWD
$ export CMD=(sh -c 'ls /inputs; echo do something useful > /outputs/stdout')
$ docker run --rm \
  -v ${LOCAL_INPUT_DIR}:/inputs  \
  -v ${LOCAL_OUTPUT_DIR}:/outputs \
  ${IMAGE} \
  ${CMD}

Let's see what each command will be used for:

$ export LOCAL_INPUT_DIR=$PWD
Exports the current working directory of the host system to the LOCAL_INPUT_DIR variable. This variable will be used for binding a volume and transferring data into the container.

$ export LOCAL_OUTPUT_DIR=$PWD
Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.

$ export CMD=(sh -c 'ls /inputs; echo do something useful > /outputs/stdout')
Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.

$ docker run ... ${IMAGE} ${CMD}
Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.

For example:

$ export LOCAL_INPUT_DIR=$PWD
$ export LOCAL_OUTPUT_DIR=$PWD
$ export CMD=(sh -c 'ls /inputs; echo "do something useful" > /outputs/stdout')
$ export IMAGE=ubuntu
$ docker run --rm \
  -v ${LOCAL_INPUT_DIR}:/inputs  \
  -v ${LOCAL_OUTPUT_DIR}:/outputs \
  ${IMAGE} \
  ${CMD}
$ cat stdout

The result of the commands' execution is shown below:

do something useful

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

  1. Copy data from a URL to public storage

  2. Pin Data to public storage

  3. Copy Data from S3 Bucket to public storage

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command:

$ bacalhau docker run --input ipfs://${CID} ${IMAGE} ${CMD}

To check the status of your job, run the following command:

$ bacalhau job list --id-filter JOB_ID

To get more information on your job, run:

$ bacalhau job describe JOB_ID

To download your job, run:

$ bacalhau job get JOB_ID

For example, running:

JOB_ID=$(bacalhau docker run ubuntu echo hello | grep 'Job ID:' | sed 's/.*Job ID: \([^ ]*\).*/\1/')
echo "The job ID is: $JOB_ID"
bacalhau job list --id-filter $JOB_ID
sleep 5

bacalhau job list --id-filter $JOB_ID
bacalhau job get $JOB_ID

ls shards

outputs:

CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 10:26:00  24440f0d  Docker ubuntu echo h...  Verifying
 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 10:26:00  24440f0d  Docker ubuntu echo h...  Published            /ipfs/bafybeiflj3kha...
11:26:09.107 | INF bacalhau/get.go:67 > Fetching results of job '24440f0d-3c06-46af-9adf-cb524aa43961'...
11:26:10.528 | INF ipfs/downloader.go:115 > Found 1 result shards, downloading to temporary folder.
11:26:13.144 | INF ipfs/downloader.go:195 > Combining shard from output volume 'outputs' to final location: '/Users/phil/source/filecoin-project/docs.bacalhau.org'
job-24440f0d-3c06-46af-9adf-cb524aa43961-shard-0-host-QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:

$ export URL=https://download.geofabrik.de/antarctica-latest.osm.pbf
$ bacalhau docker run --input ${URL} ${IMAGE} ${CMD}

$ bacalhau job list

$ bacalhau job get JOB_ID

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

Creating job for submission ... done ✅
Finding node(s) for the job ... done ✅
Node accepted the job ... done ✅
Error while executing the job.

This can often be resolved by re-tagging your docker image

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau WebUI

How to run the WebUI.

Note that in version v1.5.0 the WebUI was completely reworked.

Overview

The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up and utilizing the WebUI.

WebUI Setup

Prerequisites

  • Ensure you have Bacalhau v1.5.0 or later installed.

Configuration

To enable the WebUI, use the WebUI.Enabled configuration key:

bacalhau config set webui.enabled=true

By default, the WebUI uses host=0.0.0.0 and port=8438. This can be configured via the WebUI.Listen configuration key:

bacalhau config set webui.listen=<ip-address>:<port>
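For example, to expose the WebUI on port 9000 on all interfaces (the address and port here are arbitrary):

bacalhau config set webui.listen=0.0.0.0:9000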

Accessing the WebUI

Once started, the WebUI is accessible at the specified address, localhost:8438 by default.

WebUI Features

Jobs

The updated WebUI allows you to view a list of jobs, including job status, run time, type, and a message in case the job failed.

Clicking on the id of a job in the list opens the job details page, where you can see the history of events related to the job, the list of nodes on which the job was executed and the real-time logs of the job.

Nodes

On the Nodes page you can see a list of nodes connected to your network, including node type, membership and connection statuses, amount of resources - total and currently available, and a list of labels of the node.

Clicking on the node id opens the node details page, where you can see the status and settings of the node, as well as the number of running and scheduled jobs.

Configuring Transport Layer Security

How to configure TLS for the requester node APIs

By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure them to use Transport Layer Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.

Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API

Getting a certificate from Let's Encrypt with ACME
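The usual approach is to hand the requester node its public hostname when starting it, so that it can complete the ACME challenge and obtain a certificate automatically. The sketch below assumes your Bacalhau version exposes an --autocert flag for this - check bacalhau serve --help for the exact flag on your version, and substitute your own domain:

bacalhau serve --node-type=requester --autocert=bacalhau.example.com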

Alternatively, you may set this option via the environment variable BACALHAU_AUTO_TLS. If you are using a configuration file, you can set the value in Node.ServerAPI.TLS.AutoCert instead.

As a result of the Let's Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.

sudo setcap CAP_NET_BIND_SERVICE+ep $(which bacalhau)

A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache, and this will be used to manage renewals to avoid rate limits.

Getting a certificate from a Certificate Authority

Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:

  1. Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.

  2. Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.

  3. Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.

  4. Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.

  5. Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.

  6. Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.

Once you have obtained your certificates, you will need to put two files in a location that Bacalhau can read. You need the server certificate, often called something like server.cert or server.cert.pem, and the server key, which is often called something like server.key or server.key.pem.

Once you have these two files available, you must start bacalhau serve with two new flags, --tlscert and --tlskey, whose arguments should point to the relevant files. An example of how they are used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

Self-signed certificates

Once you have generated the necessary files, the steps are much like above: you must start bacalhau serve with two new flags, --tlscert and --tlskey, whose arguments should point to the relevant files. An example of how they are used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values in Node.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:

  1. Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities (see the example after this list).

  2. Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.

  3. Instruct the software library you are using not to verify HTTPS requests.
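As an illustration of the first option, a client can pass the CA certificate explicitly when calling the API; the file name, host and agent endpoint below are assumptions for the sketch:

# ca.crt is the certificate used to sign the self-signed server certificate
curl --cacert ca.crt https://bacalhau.example.com:1234/api/v1/agent/alive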

Private IPFS Network Setup

Set up private IPFS network

Note that Bacalhau v1.4.0 supports IPFS v0.27 and below.

Starting from v1.5.0, Bacalhau supports the latest IPFS versions.

Consider this when selecting versions of Bacalhau and IPFS when setting up your own private network.

Introduction

  1. Install and configure IPFS

  2. Create Private IPFS network

  3. Pin your data to private IPFS network

TL;DR

  1. Initialize Private IPFS network

  2. Connect all nodes to the same private network

  3. Connect Bacalhau network to use private IPFS network

Download and Install

  1. Download the Go archive (the version below is the one used at the time of writing):

wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz

  2. Remove any previous Go installation by deleting the /usr/local/go folder (if it exists), then extract the archive you downloaded into /usr/local, creating a fresh Go tree in /usr/local/go:

rm -rf /usr/local/go && tar -C /usr/local -xzf go1.23.0.linux-amd64.tar.gz
  3. Add /usr/local/go/bin to the PATH environment variable. You can do this by adding the following line to your $HOME/.profile or /etc/profile (for a system-wide installation):

export PATH=$PATH:/usr/local/go/bin

Changes made to a profile file may not apply until the next time you log into the system. To apply the changes immediately, just run the shell commands directly or execute them from the profile using a command such as source $HOME/.profile.

  4. Verify that Go is installed correctly by checking its version:

go version
Next, download and install IPFS (Kubo):

wget https://dist.ipfs.tech/kubo/v0.30.0/kubo_v0.30.0_linux-amd64.tar.gz
tar -xvzf kubo_v0.30.0_linux-amd64.tar.gz
sudo bash kubo/install.sh

Verify that IPFS is installed correctly by checking its version:

ipfs --version

Configure Bootstrap IPFS Node

A bootstrap node is used by client nodes to connect to the private IPFS network. The bootstrap node connects clients to the other nodes available on the network.

Execute the ipfs init command to initialize an IPFS node:

ipfs init
# example output

generating ED25519 keypair...done
peer identity: 12D3KooWQqr8BLHDUaZvYG59KnrfYJ1PbbzCq3pzfpQ6QrKP5yz7
initializing IPFS node at /home/username/.ipfs

The next step is to generate the swarm key - a cryptographic key that is used to control access to an IPFS network - and export the key into a swarm.key file located in the ~/.ipfs folder.

echo -e "/key/swarm/psk/1.0.0/\n/base16/\n$(tr -dc 'a-f0-9' < /dev/urandom | head -c64)" > ~/.ipfs/swarm.key
# example swarm.key content:

/key/swarm/psk/1.0.0/
/base16/
k51qzi5uqu5dli3yce3powa8pme8yc2mcwc3gpfwh7hzkzrvp5c6l0um99kiw2

Now the default entries of bootstrap nodes should be removed. Execute the command on all nodes:

ipfs bootstrap rm --all

Check that bootstrap config does not contain default values:

ipfs config show | grep Bootstrap
# expected output:

  "Bootstrap": null,

Configure IPFS to listen for incoming connections on specific network addresses and ports, making the IPFS Gateway and API services accessible. Consider changing addresses and ports depending on the specifics of your network.

ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080
ipfs config Addresses.API /ip4/0.0.0.0/tcp/5001

Start the IPFS daemon:

ipfs daemon

Configure Client Nodes

Copy the swarm.key file from the bootstrap node to client nodes into the ~/.ipfs/ folder and initialize IPFS:

ipfs init

Apply same config as on bootstrap node and start the daemon:

ipfs bootstrap rm --all

ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080

ipfs config Addresses.API /ip4/0.0.0.0/tcp/5001

ipfs daemon

Done! Now you can check that private IPFS network works properly:

  1. List peers on the bootstrap node. It should list all connected nodes:

ipfs swarm peers
# example output for single connected node

/ip4/10.0.2.15/tcp/4001/p2p/12D3KooWQqr8BLHDUaZvYG59KnrfYJ1PbbzCq3pzfpQ6QrKP5yz7
  2. Pin some files and check their availability across the network:

# Create a sample text file and pin it
echo "Hello from the private IPFS network!" > sample.txt
# Pin file:
ipfs add sample.txt
# example output:

added QmWQeYip3JuwhDFmkDkx9mXG3p83a3zMFfiMfhjS2Zvnms sample.txt
 25 B / 25 B [=========================================] 100.00%
# Retrieve and display the content of a pinned file
# Execute this on any node of your private network
ipfs cat QmWQeYip3JuwhDFmkDkx9mXG3p83a3zMFfiMfhjS2Zvnms
# expected output:

Hello from the private IPFS network!

Configure the IPFS Daemon as systemd Service

Finally, make the IPFS daemon run at system startup. To do this:

  1. Create new service unit file in the /etc/systemd/system/

sudo nano /etc/systemd/system/ipfs.service
  2. Add the following content to the file, replacing /path/to/your/ipfs/executable with the actual path:

[Unit]
Description=IPFS Daemon
After=network.target

[Service]
User=username
ExecStart=/path/to/your/ipfs/executable daemon
Restart=on-failure

[Install]
WantedBy=multi-user.target

Use which ipfs command to locate the executable.

Usually path to the executable is /usr/local/bin/ipfs

For security purposes, consider creating a separate user to run the service. In this case, specify its name in the User= line. Without specifying a user, the ipfs service will be launched as root, which means that you will need to copy the ipfs binary to the /root directory.

  3. Reload and enable the service:

sudo systemctl daemon-reload
sudo systemctl enable ipfs
  4. Done! Now reboot the machine to ensure that the daemon starts correctly. Use the systemctl status ipfs command to check that the service is running:

sudo systemctl status ipfs

#example output

● ipfs.service - IPFS Daemon
     Loaded: loaded (/etc/systemd/system/ipfs.service; enabled; preset: enabled)
     Active: active (running) since Wed 2024-09-10 13:24:09 CEST; 16min ago

Configure Bacalhau Nodes

Now to connect your private Bacalhau network to the private IPFS network, the IPFS API address should be specified using the --ipfs-connect flag. It can be found in the ~/.ipfs/api file:

# add any other flags your node needs
bacalhau serve \
--ipfs-connect /ip4/0.0.0.0/tcp/5001

Done! Now your private Bacalhau network is connected to the private IPFS network!

Test Configured Networks

To verify that everything works correctly:

  1. Pin the file to the private IPFS network

  2. Run the job, which takes the pinned file as input and publishes result to the private IPFS network

  3. View and download job results

Create and Pin Sample File

Create any file and pin it. Use the ipfs add command:

# create file
echo "Hello from private IPFS network!" > file.txt

# pin the file
ipfs add file.txt
# example output:

added QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ file.txt
 33 B / 33 B

Run a Bacalhau Job

Run a simple job, which fetches the pinned file via its CID, lists its content and publishes results back into the private IPFS network:

bacalhau docker run \
-i ipfs://QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ \
--publisher ipfs \
alpine cat inputs
# example output

Job successfully submitted. Job ID: j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT         
 15:54:35.767              Submission       Job submitted 
 15:54:35.780  e-a498daaf  Scheduling       Requested execution on n-0f29f45c 
 15:54:35.859  e-a498daaf  Execution        Running 
 15:54:36.707  e-a498daaf  Execution        Completed successfully 
                                             
To get more details about the run, execute:
	bacalhau job describe j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

To get more details about the run executions, execute:
	bacalhau job executions j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

To download the results, execute:
	bacalhau job get j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa

View and Download Job Results

bacalhau job describe j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
# example output (was truncated for brevity)

...
Standard Output
Hello from private IPFS network!
bacalhau job get j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa
# example output

Fetching results of job 'j-c6514250-2e97-4fb6-a1e6-6a5a8e8ba6aa'...
No supported downloader found for the published results. You will have to download the results differently.
[
    {
        "Type": "ipfs",
        "Params": {
            "CID": "QmSskRNnbbw8rNtkLdcJrUS2uC2mhiKofVJsahKRPgbGGj"
        }
    }
]

Use the ipfs ls command to view the results:

ipfs ls QmSskRNnbbw8rNtkLdcJrUS2uC2mhiKofVJsahKRPgbGGj
# example output

QmS6mcrMTFsZnT3wAptqEb8NpBPnv1H6WwZBMzEjT8SSDv 1  exitCode
QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH 0  stderr
QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ 33 stdout

Use the ipfs cat command to view the file content. In our case, the file of interest is the stdout:

ipfs cat QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ
# example output

Hello from private IPFS network!

Use the ipfs get command to download the file using its CID:

ipfs get --output stdout QmWQK2Rz4Ng1RPFPyiHECvQGrJb5ZbSwjpLeuWpDuCZAbQ
# example output
Saving file(s) to stdout
 33 B / 33 B [===============================================] 100.00% 0s

How To Work With Custom Containers in Bacalhau

Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.

Prerequisite

1. Running Containers

Docker Command

You're likely familiar with executing Docker commands to start a container:

docker run docker/whalesay cowsay sup old fashioned container run

This command runs a container from the docker/whalesay image. The container executes the cowsay sup old fashioned container run command:

_________________________________
< sup old fashioned container run >
 ---------------------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Bacalhau Command

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    docker/whalesay -- bash -c 'cowsay hello web3 uber-run')

This command also runs a container from the docker/whalesay image, using Bacalhau. We use the bacalhau docker run command to start a job in a Docker container. It contains additional flags such as --wait to wait for job completion and --id-only to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run' command is executed.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

j-7e41b9b9-a9e2-4866-9fce-17020d8ec9e0

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get \
--output-dir results \
${JOB_ID}

Viewing your job output

cat ./results/stdout

 _____________________
< hello web3 uber-run >
 ---------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Both commands execute cowsay in the docker/whalesay container, but Bacalhau provides additional features for working with jobs at scale.

Bacalhau Syntax

Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data is passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.

The --wait flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau job list command to check on their status.

Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run command after the -- flag.
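
For example, a job can be submitted without --wait and then checked on later by its ID. This is an illustrative sketch reusing the whalesay image from above:

# submit in the background, then poll the job status later
export JOB_ID=$(bacalhau docker run --id-only docker/whalesay -- bash -c 'cowsay checking in later')
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}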

2. Building Your Own Custom Container For Bacalhau

To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.

To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distributed it across the Bacalhau network.

# write to the cod.cow
$the_cow = <<"EOC";
   $thoughts
    $thoughts
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `\░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜
EOC

Next, the Dockerfile adds the script and sets the entry point.

# write the Dockerfile
FROM debian:stretch
RUN apt-get update && apt-get install -y cowsay
# "cowsay" installs to /usr/games
ENV PATH $PATH:/usr/games
RUN echo '#!/bin/bash\ncowsay "${@:1}"' > /usr/bin/codsay && \
    chmod +x /usr/bin/codsay
COPY cod.cow /usr/share/cowsay/cows/default.cow

Now let's build and test the container locally.

docker build -t ghcr.io/bacalhau-project/examples/codsay:latest . 2> /dev/null
docker run --rm ghcr.io/bacalhau-project/examples/codsay:latest codsay I like swimming in data

Once your container is working as expected, you should push it to a public container registry. In this example, I'm pushing to GitHub's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64 architecture.

docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/bacalhau-project/examples/codsay:latest .

3. Running Your Custom Container on Bacalhau

Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.

The bacalhau docker run command strips the default entry point, so don't forget to run your entry point in the command line arguments.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/codsay:v1.0.0 \
    -- bash -c 'codsay Look at all this data')

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Download your job results directly by using bacalhau job get command.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID}  --output-dir results

View your job output

cat ./results/stdout

_______________________
< Look at all this data >
 -----------------------
   \
    \
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜

Support


Bacalhau Docker Image

How to use Bacalhau Docker Image for task management

This documentation explains how to use the Bacalhau Docker image for task management with the Bacalhau client.

Prerequisites

1. Check the version of Bacalhau client

docker run -t ghcr.io/bacalhau-project/bacalhau:latest version

The output is similar to:

12:00:32.427 | INF pkg/repo/fs.go:93 > Initializing repo at '/root/.bacalhau' for environment 'production'
CLIENT  SERVER  UPDATE MESSAGE 
v1.3.0  v1.4.0                 
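
If you prefer to use the image as a drop-in replacement for a locally installed client, one option (purely illustrative, not required) is a shell alias. Note that any results the client downloads will stay inside the container unless you also mount a local directory:

# wrap the Docker image so it can be invoked like a local CLI
alias bacalhau='docker run -it --rm ghcr.io/bacalhau-project/bacalhau:latest'
bacalhau version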

2. Run a Bacalhau Job

For example, to run an Ubuntu-based job that prints the message 'Hello from Docker Bacalhau':

bacalhau docker run \
        --id-only \
        --wait \
        ubuntu:latest \
        -- sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Structure of the command

  1. --id-only: Output only the job id

  2. --wait: Wait for the job to finish

  3. ubuntu:latest: the Ubuntu container image

  4. --: Separate Bacalhau parameters from the command to be executed inside the container

  5. sh -c 'uname -a && echo "Hello from Docker Bacalhau!"': The command executed inside the container

The command execution in the terminal is similar to:

j-6ffd54b8-e992-498f-9ee9-766ab09d5daa

j-6ffd54b8-e992-498f-9ee9-766ab09d5daa is a job ID, which represents the result of executing a command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store that in an environment variable so that we can reuse it later on (env: JOB_ID=j-6ffd54b8-e992-498f-9ee9-766ab09d5daa)

To view the details of the job, execute the following command:

bacalhau job describe j-6ffd54b8-e992-498f-9ee9-766ab09d5daa

The output is similar to:

ID            = j-6ffd54b8-e992-498f-9ee9-766ab09d5daa
Name          = j-6ffd54b8-e992-498f-9ee9-766ab09d5daa
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-09-08 14:33:19
Modified Time = 2024-09-08 14:33:20
Version       = 0

Summary
Completed = 1

Job History
 TIME                 REV.  STATE      TOPIC       EVENT         
 2024-09-08 14:33:19  1     Pending    Submission  Job submitted 
 2024-09-08 14:33:19  2     Running                              
 2024-09-08 14:33:20  3     Completed                            

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED     MODIFIED    COMMENT      
 e-bd5746b8  n-e002001e  Completed  Stopped  6     27m21s ago  27m21s ago  Accepted job 

Execution e-bd5746b8 History
 TIME                 REV.  STATE              TOPIC            EVENT        
 2024-09-08 14:33:19  1     New                                              
 2024-09-08 14:33:19  2     AskForBid                                        
 2024-09-08 14:33:19  3     AskForBidAccepted  Requesting Node  Accepted job 
 2024-09-08 14:33:19  4     AskForBidAccepted                                
 2024-09-08 14:33:19  5     BidAccepted                                      
 2024-09-08 14:33:20  6     Completed                                        

Standard Output
Linux 7d5c3dcc7fc2 6.5.0-1024-gcp #26~22.04.1-Ubuntu SMP Fri Jun 14 18:48:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Hello from Docker Bacalhau!

3. Submit a Job With Output Files

You always need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine.

The first part of this example should look familiar, except for the Docker commands.

bacalhau docker run \
        --id-only \
        --wait \
        --gpu 1 \
        ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 -- \
            python main.py --o ./outputs --p "A Docker whale and a cod having a conversation about the state of the ocean"

When a job is submitted, Bacalhau prints the related job_id (j-da29a804-3960-4667-b6e5-73f05e120117):

j-da29a804-3960-4667-b6e5-73f05e120117

4. Check the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list

When it reads Completed, that means the job is done, and you can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe j-da29a804-3960-4667-b6e5-73f05e120117

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result directory.

bacalhau job get ${JOB_ID} --output-dir result

After the download is complete, you should see the job output in the result directory.

Support

WebAssembly (Wasm) Workloads

Prerequisites and Limitations

  1. Supported WebAssembly System Interface (WASI) Bacalhau can run compiled Wasm programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.

  2. Networking Restrictions All ingress/egress networking is disabled; you won't be able to pull data/code/weights etc. from an external source. Wasm jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.

  3. Single-Threading There is no multi-threading as WASI does not expose any interface for it.

Onboarding Your Workload

Step 1: Replace network operations with filesystem reads and writes

If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:

  1. Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.

  2. Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.

  3. Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of Wasm jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.

By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.

You can specify additional or different output mounts using the -o flag.

Step 2: Configure your compiler to output WASI-compliant WebAssembly

You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.
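
For example, if your program happens to be written in Rust (an assumption for illustration only; any WASI-capable toolchain works), the build might look like this:

# add the WASI target and build a release .wasm
# (the output lands under target/wasm32-wasi/release/)
rustup target add wasm32-wasi
cargo build --release --target wasm32-wasi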

Step 3: Upload the input data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 4: Run your program

You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.

bacalhau wasm run

Run Locally Compiled Program:

If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:

bacalhau wasm run main.wasm

The program you specify will be uploaded to a Bacalhau storage node and will be publicly available if you are using the public demo network.

Alternative Program Specification:

You can use a Content IDentifier (CID) for a specific WebAssembly program.

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ

Input Data Specification:

Make sure to specify any input data using the --input flag.

bacalhau wasm run --input http://example.com

This ensures the necessary data is available for the program's execution.

Program arguments

You can give the Wasm program arguments by specifying them after the program path or CID. If the Wasm program is already compiled and located in the current directory, you can run it by adding arguments after the file name:

bacalhau wasm run echo.wasm hello world

For a specific WebAssembly program, run:

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ hello world

Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.

For example, instead of hard-coding your program to read from /inputs/data.txt, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run:

bacalhau wasm run prog.wasm /inputs/data.txt

Your language of choice should contain a standard way of reading program arguments that will work with WASI.

Environment variables

You can also specify environment variables using the -e flag.

bacalhau wasm run prog.wasm -e HELLO=world

Examples

Support

Running a Python Script

This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.

Prerequisites​

1. Running Python Locally​

# hello-world.py
print("Hello, world!")

Running the script to print out the output:

python3 hello-world.py

After the script has run successfully locally, we can now run it on Bacalhau.

2. Running a Bacalhau Job​

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py)

Structure of the command​

  1. bacalhau docker run: call to Bacalhau

  2. --id-only: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output

  3. --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".

  4. python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).

  5. --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  6. python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Running Trivial Python
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.10-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 /inputs/hello-world.py
    InputSources:
      - Target: /inputs
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py
            Path: /inputs/hello-world.py

The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:

bacalhau job run helloworld.yaml

3. Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

4. Viewing your Job Output​

To view the file, run the following command:

cat results/stdout

Support​

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

1. Sample Recommendation Dataset

We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat and movies.dat. After the dataset is downloaded, extract the zip and place ratings.dat and movies.dat into a folder called input:

# Extracting the downloaded zip file
unzip ml-1m.zip
#moving  ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

Create a requirements.txt file listing the Python libraries we'll be using:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and in it paste the following script

# content of the similar-movies.py

# Imports
import numpy as np
import pandas as pd
import argparse
from distutils.dir_util import mkpath
import warnings
warnings.filterwarnings("ignore")
# Read the files with pandas
data = pd.io.parsers.read_csv('input/ratings.dat',
names=['user_id', 'movie_id', 'rating', 'time'],
engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.io.parsers.read_csv('input/movies.dat',
names=['movie_id', 'title', 'genre'],
engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users

ratings_mat = np.ndarray(
shape=((np.max(data.movie_id.values)), np.max(data.user_id.values)),
dtype=np.uint8)
ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.values

# Normalise matrix (subtract mean off)

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute SVD

normalised_mat = ratings_mat - np.matrix(np.mean(ratings_mat, 1)).T
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)

# Calculate cosine similarity, sort by most similar, and return the top N.

def top_cosine_similarity(data, movie_id, top_n=10):
    # Movie id starts from 1
    index = movie_id - 1
    movie_row = data[index, :]
    magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
    similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
    sort_indexes = np.argsort(-similarity)
    return sort_indexes[:top_n]

# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
    print('Recommendations for {0}: \n'.format(
        movie_data[movie_data.movie_id == movie_id].title.values[0]))
    for id in top_indexes + 1:
        print(str(id), ' : ', movie_data[movie_data.movie_id == id].title.values[0])


parser = argparse.ArgumentParser(description='Personal information')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies',default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie',default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations',default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id # Grab an id from movies.dat
top_n = args.n

# k = 50
# # Grab an id from movies.dat
# movie_id = 1
# top_n = 10

sliced = evecs[:, :k] # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

  1. Read the files with pandas. The code uses Pandas to read data from the files ratings.dat and movies.dat.

  2. Create the ratings matrix of shape (m×u) with rows as movies and columns as user

  3. Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.

  4. Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.

  5. Calculate cosine similarity, sort by most similar, and return the top N.

  6. Select k principal components to represent the movies, a movie_id to find recommendations, and print the top_n results.

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.
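
For example, the flags defined in the script can be combined to ask for recommendations for a different movie (the values below are chosen arbitrarily and match the custom-parameter Bacalhau run later in this tutorial):

# recommendations for movie id 10, using 50 principal components, top 10 results
python similar-movies.py --k 50 --id 10 --n 10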

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

We use the python:3.8 Docker image as the base, add our similar-movies.py script and the input dataset directory to the image, copy in the requirements file, and then run the command to install the dependencies in the image.

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

hub-user with your Docker Hub username

repo-name with the name of the container; you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done using the same Docker Hub username, repo name, and tag that were used to build the image.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. jsace/python-similar-movies: the name of the Docker image we are using

  3. -- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. jsace/python-similar-movies: the name of the docker image we are using

  3. -- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

Running Pandas on Bacalhau

Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

In this tutorial example, we will run Pandas script on Bacalhau.

Prerequisite

1. Running Pandas Locally

To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.

To get started, you need to install the Pandas library from pip:

pip install pandas

Importing data from CSV to DataFrame

Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv() to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.

# read_csv.py
import pandas as pd

print(pd.read_csv("transactions.csv"))

The overall purpose of the command above is to read data from a CSV file (transactions.csv) using Pandas and print the resulting DataFrame.

To download the transactions.csv file, run:

wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

To output the content of the transactions.csv file, run:

cat transactions.csv

Running the script

Now let's run the script to read in the CSV file. The output will be a DataFrame object.

python3 read_csv.py

2. Ingesting Data

To run Pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.
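
For example, with a local IPFS node running, the script and dataset could be pinned together so that they can later be mounted as a single directory. This is an illustrative sketch; any IPFS pinning service works just as well:

# pin the script and dataset together so they can be mounted as one directory
mkdir -p files && cp read_csv.py transactions.csv files/
ipfs add -r files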

3. Running a Bacalhau Job

Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run command to do this:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
    -w /files \
    amancevice/pandas \
    -- python read_csv.py)

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. amancevice/pandas : Docker image with pandas installed.

  3. -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files: Mounting the uploaded dataset into the container. The -i flag allows us to mount a file or directory from IPFS into the container. It takes two arguments: the first is the IPFS CID (QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz) and the second is the path where it is mounted inside the container (/files). The -i flag can be used multiple times to mount multiple directories.

  4. -w /files: Sets the working directory to /files, the folder where the script and data are mounted, so that the relative path in the script resolves correctly.

  5. python read_csv.py: runs the read_csv.py Pandas script inside the container

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID}  --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

Running a Prolog Script

Introduction

Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

This tutorial is a quick guide on how to run a hello world script on Bacalhau.

Prerequisites

1. Running Locally​

To get started, install swipl

Create a file called helloworld.pl. The following script prints ‘Hello World’ to the stdout:
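
The script itself is not included in this export. A minimal version consistent with the hello_world goal used later in this tutorial (the exact message text is an assumption) could be written like this:

# write a minimal helloworld.pl (assumed content)
cat > helloworld.pl <<'EOF'
hello_world :- write('Hello World'), nl.
EOF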

Running the script to print out the output:

After the script has run successfully locally, we can now run it on Bacalhau.

Before running it on Bacalhau we need to upload it to IPFS.

Using the IPFS cli:

Run the command below to check if our script has been uploaded.

This command outputs the CID. Copy the CID of the file, which in our case is QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii

2. Running a Bacalhau Job

We will mount the script to the container using the -i flag: -i: ipfs://< CID >:/< name-of-the-script >.

To submit a job, run the following Bacalhau command:

Structure of the Command

  1. -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl : Sets the input data for the container.

  2. QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii is our CID which points to the helloworld.pl file on the IPFS network. This file will be accessible within the container.

  3. -- swipl -q -s helloworld.pl -g hello_world: instructs SWI-Prolog to load the program from the helloworld.pl file and execute the hello_world function in quiet mode:

    1. -q: running in quiet mode

    2. -s: load file as a script. In this case we want to run the helloworld.pl script

    3. -g: is the name of the function you want to execute. In this case its hello_world
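
The full submission command is not included in this export. Reconstructed from the pieces above, it would look roughly like this; the swipl Docker image and the --wait/--id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl \
    swipl \
    -- swipl -q -s helloworld.pl -g hello_world)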

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output

To view the file, run the following command:

Support

Building and Running your Custom R Containers on Bacalhau

Introduction

Quick script to run custom R container on Bacalhau:

Prerequisites

1. Running Prophet in R Locally

Open R studio or R-supported IDE. If you want to run this on a notebook server, then make sure you use an R kernel. Prophet is a CRAN package, so you can use install.packages to install the prophet package:

After installation is finished, you can download the example data that is stored in IPFS:

The code below instantiates the library and fits a model to the data.

Create a new file called Saturating-Forecasts.R and in it paste the following script:

This script performs time series forecasting using the Prophet library in R, taking input data from a CSV file, applying the forecasting model, and generating plots for analysis.

Let's have a look at the command below:

This command uses Rscript to execute the script that was created and written to the Saturating-Forecasts.R file.

The input parameters provided in this case are the names of input and output files:

example_wp_log_R.csv - the example data that was previously downloaded.

outputs/output0.pdf - the name of the file to save the first forecast plot.

outputs/output1.pdf - the name of the file to save the second forecast plot.
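
Putting those parameters together, the local run would look roughly like this (reconstructed from the description above):

# create the output folder and run the forecasting script
mkdir -p outputs
Rscript Saturating-Forecasts.R example_wp_log_R.csv outputs/output0.pdf outputs/output1.pdf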

2. Running R Prophet on Bacalhau

3. Containerize Script with Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

These commands specify how the image will be built, and what extra requirements will be included. We use r-base as the base image and then install the prophet package. We then copy the Saturating-Forecasts.R script into the container and set the working directory to the R folder.

Build the container

We will run docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

In our case:

4. Running a Job on Bacalhau

The following command runs the containerized R script on Bacalhau and generates the results in the outputs directory. It takes approximately 2 minutes to run.

Structure of the command

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv: Mounting the uploaded dataset into the container. It takes two arguments: the first is the IPFS CID (QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt) and the second is the path where it is mounted inside the container (/example_wp_log_R.csv)

  3. ghcr.io/bacalhau-project/examples/r-prophet:0.0.2: the name and the tag of the docker image we are using

  4. /example_wp_log_R.csv : path to the input dataset

  5. /outputs/output0.pdf, /outputs/output1.pdf: paths to the output

  6. Rscript Saturating-Forecasts.R: execute the R script
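
Reconstructed from the structure above, the submission command would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R /example_wp_log_R.csv /outputs/output0.pdf /outputs/output1.pdf)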

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

6. Viewing your Job Output

To view the file, run the following command:

You can't natively display PDFs in notebooks, so here are some static images of the PDFs:

output0.pdf

output1.pdf

Support

Run CUDA programs on Bacalhau

What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.

In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations.

Advantage of GPU over CPU

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

Computations like matrix multiplication could be done much faster on GPU than on CPU

Prerequisite

1. Running CUDA locally

You'll need to have the following installed:

  1. NVIDIA GPU

  2. CUDA drivers installed

  3. nvcc installed

Checking if nvcc is installed:
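
The check itself is simply the compiler's version query (the exact output depends on your CUDA installation):

nvcc --version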

Downloading the programs:

Viewing the programs

  1. 00-hello-world.cu:

This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.

  2. 02-cuda-hello-world-faster.cu:

In this example we utilize Vector addition using CUDA and allocate the memory in advance and copy the memory to the GPU using cudaMemcpy so that it can utilize the HBM (High Bandwidth memory of the GPU). Compilation and execution occur faster (1.39 seconds) compared to the previous example (8.67 seconds).

2. Running a Bacalhau Job

To submit a job, run the following Bacalhau command:

Structure of the Commands

  1. bacalhau docker run: call to Bacalhau

  2. -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu: URL of the input file, which is downloaded from the URL source and mounted as an input volume.

  3. nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04: Docker container for executing CUDA programs (you need to choose the right CUDA docker container). The container should have the "devel" tag.

  4. nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu: compile with the nvcc compiler and save the binary to the outputs directory as hello.

  5. Note that there is a ; between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello'. The ";" symbol allows executing multiple commands sequentially in a single line.

  6. ./outputs/hello: execute the hello binary. You can combine compilation and execution commands.

Note that the CUDA version will need to be compatible with the graphics card on the host machine
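
Reconstructed from the structure above, the full command would look roughly like this; the --gpu 1, --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --wait \
    --id-only \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello')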

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output

To view the file, run the following command:

Support

Running Jupyter Notebooks on Bacalhau

Introduction

Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

  1. Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.

  2. Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

  1. /inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.

  2. -i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.

  3. https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.

  4. jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.

  5. --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  6. jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.

  7. --execute: This flag tells nbconvert to execute the notebook and store the results in the output file.

  8. --to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.

  9. --output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.
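
Reconstructed from the breakdown above, the submission would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb /inputs/hello.ipynb)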

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

After the download has finished you can see the contents in the results directory, running the command below:

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

  1. Install Docker on your local machine.

  2. Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

This Dockerfile creates a Docker image based on the official TensorFlow GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb) and a requirements.txt file. It then upgrades pip and installs Python packages from the requirements.txt file, along with scikit-learn. The resulting image provides an environment ready for running the mnist.ipynb notebook with TensorFlow and scikit-learn, as well as other specified dependencies.
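
As a sketch of what such a Dockerfile could contain (the exact base image tag is an assumption; pin whichever TensorFlow GPU image you prefer):

# write a Dockerfile matching the description above (illustrative only)
cat > Dockerfile <<'EOF'
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /
RUN apt-get update
COPY mnist.ipynb requirements.txt ./
RUN pip install --upgrade pip && \
    pip install -r requirements.txt scikit-learn
EOF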

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it with your DockerHub username and the name "your-dockerhub-username/jupyter-mnist-tensorflow".

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

Structure of the command

  1. --gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.

  2. -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.

  3. jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.

    --: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.

  4. jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.
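
Reconstructed from the breakdown above, the submission would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --wait \
    --id-only \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)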

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

After the download has finished you can see the contents in the results directory, running the command below:

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

Running a Simple R Script on Bacalhau

You can use official Docker containers for each language, like R or Python. In this example, we will use the official R container and run it on Bacalhau.

In this tutorial example, we will run a "hello world" R script on Bacalhau.

Prerequisites​

1. Running an R Script Locally​

Run the script:
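
The original code blocks are not included in this export. A minimal hello.R and the local run command (the exact script content is an assumption) might look like this:

# write a minimal hello.R and run it locally
cat > hello.R <<'EOF'
print("hello world")
EOF
Rscript hello.R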

2. Running a Job on Bacalhau​

Now it's time to run the script on Bacalhau:

Structure of the command​

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R: Mounting the uploaded script into the container at /hello.R. It takes two arguments, the first is the IPFS CID (QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk) and the second is the file path within the container (/hello.R)

  3. r-base: docker official image we are using

  4. Rscript hello.R: execute the R script
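
Putting the pieces above together, the submission command would look roughly like this; the --wait and --id-only flags are assumptions based on the surrounding examples:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)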

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

Declarative job description​

The job description should be saved in .yaml format, e.g. rhello.yaml, and then run with the command:

3. Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output​

To view the file, run the following command:

Futureproofing your R Scripts​

You can generate the job request using bacalhau job describe with the --spec flag. This will allow you to re-run that job in the future:

Support​

Scripting Bacalhau with Python

Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.

This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.

Prerequisite

Executing Bacalhau Jobs with Python Scripts

To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.

Now let's create a file called bacalhau.py. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.

This code has a few interesting features:

  1. Change the value in the main call (main("hashes.txt", 10)) to change the number of jobs to execute.

  2. Because all jobs are complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True loop is used to monitor the status of jobs and wait for them to complete.

  3. When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5) loop in the getResultsFromJob function involves retrying the bacalhau job get operation if it fails to complete successfully.

Let's run it!

Hopefully, the results directory contains all the combined results from the jobs we just executed. Here we're expecting to see CSV files:

Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.

Next Steps

You might also be interested in other examples, such as Analysing Data with Python Pandas.

Support


Reject jobs that don't specify any input data.

Accept jobs that require network connections.

You can check out this example tutorial on how to work with custom containers in Bacalhau to see how we used all these steps together.

You can also check out a list of example public containers used by the Bacalhau team.

All ingress/egress networking is limited as described in the networking documentation. You won't be able to pull data/code/weights/etc. from an external source.

You can specify which directory the data is written to with the CLI flag.

You can specify which directory the data is written to with the CLI flag.

At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build your image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the given task.

Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.

Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint flag to bacalhau docker run.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Check out the release notes to learn about all the changes to the WebUI and more.

For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.

Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Let's Encrypt Certificate Authority.

Using the --autocert [hostname] parameter to the CLI (in the serve and devstack commands), a certificate is obtained automatically from Let's Encrypt. The provided hostname should be a comma-separated list of hostnames, and they should all be publicly resolvable, as Let's Encrypt will attempt to connect to the server to verify ownership (using the ACME HTTP-01 challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards certificates are cached in the bacalhau repository.
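For example, a requester node that should serve its API over TLS could be started as in the sketch below; the hostname is a placeholder and must be replaced with a domain that resolves to your server:

# bacalhau.example.com is a placeholder hostname for your publicly resolvable domain
bacalhau serve --node-type=requester --autocert bacalhau.example.com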

If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here, although there is a helpful script in the Bacalhau GitHub repository which should provide a good starting point.

Support for the embedded IPFS node was removed in v1.4.0 to streamline communication and reduce overhead. Therefore, in order to use a private IPFS network, it is now necessary to create it yourself and then connect your Bacalhau nodes to it. This manual describes how to:

Configure your Bacalhau network to use the private IPFS network

Install on all nodes

Install

In this manual, Kubo (the earliest and most widely used implementation of IPFS) will be used, so first of all, Go should be installed.

See the Go Downloads page for the latest Go version.

The next step is to download and install Kubo. Select and download the appropriate version for your system. It is recommended to use the latest stable version.

Use the bacalhau job describe command to view job execution results:

Use the bacalhau job get command to download job results. In this particular case, the ipfs publisher was used, so the get command will print the CID of the job results:

Need Support?

For questions and feedback, please reach out in our Slack.

To get started, you need to install the Bacalhau client; see more information here.

This example requires Docker. If you don't have Docker installed, you can install it from here. Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau supports running programs that are compiled to WebAssembly (Wasm). With the Bacalhau client, you can upload Wasm programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

For example, Rust users can specify the wasm32-wasi target with rustup and cargo to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.
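As a minimal sketch of that toolchain setup (standard rustup and cargo invocations, independent of Bacalhau):

# add the WASI target once, then build the project for it
rustup target add wasm32-wasi
cargo build --target wasm32-wasi --release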


See the Rust example for a workload that leverages WebAssembly support.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Install the Bacalhau CLI in Docker.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Bacalhau supports running programs that are compiled to WebAssembly (Wasm). With the Bacalhau client, you can upload Wasm programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

For example, Rust users can specify the wasm32-wasi target with rustup and cargo to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.

Consider creating your own private network.

See the Rust example for a workload that leverages WebAssembly support.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

We'll be using a very simple Python script that displays the traditional first greeting. Create a file called hello-world.py:

To submit a workload to Bacalhau you can use the bacalhau docker run command. This command allows passing input data into the container using volumes, we will be using the --input URL:path for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument.
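For illustration, such a job could be submitted as in the sketch below; the URL is a placeholder for wherever you host hello-world.py, and python:3.10-slim is just one suitable public image:

# <your-host> is a placeholder; the file is mounted at /inputs/hello-world.py by default
bacalhau docker run \
    --input https://<your-host>/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py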

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

Download the MovieLens 1M dataset from this link: https://files.grouplens.org/datasets/movielens/ml-1m.zip

For further reading on how the script works, see Simple Movie Recommender Using SVD (Alyssa).

See more information on how to containerize your script/app here.

Replace hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you are interested in finding out more about how to ingest your data into IPFS, please see the data ingestion guide.

We've already uploaded the script and data to IPFS to the following CID: QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.tech or w3s.link.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

Since the data uploaded to IPFS isn't pinned, we will need to do that manually. Check this information on how to pin your data. We recommend using NFT.Storage.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

This example will walk you through building Time Series Forecasting using Prophet. Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

To get started, you need to install the Bacalhau client; see more information here.

To use Bacalhau, you need to package your code in an appropriate format. The developers have already pushed a container for you to use, but if you want to build your own, you can follow the steps below. You can view a dedicated container example in the documentation.

Replace hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

To install R, follow these instructions from Hands-On Programming with R (Appendix A: Installing R and RStudio). After R and RStudio are installed, create and run a script called hello.R:

Next, upload the script to your public storage (in our case, IPFS). We've already uploaded the script to IPFS and the CID is: QmVHSWhAL7fNkRiHfoEJGeMYjaYZUsKHvix7L54SptR8ie. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client; see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

sudo add-apt-repository ppa:swi-prolog/stable
sudo apt-get update
sudo apt-get install swi-prolog
# helloworld.pl
hello_world :- write('Hello World'), nl,
               halt.
swipl -q -s helloworld.pl -g hello_world
wget https://dist.ipfs.io/go-ipfs/v0.4.2/go-ipfs_v0.4.2_linux-amd64.tar.gz
tar xvfz go-ipfs_v0.4.2_linux-amd64.tar.gz
mv go-ipfs/ipfs /usr/local/bin/ipfs
ipfs init
ipfs cat /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme
ipfs config Addresses.Gateway /ip4/127.0.0.1/tcp/8082
ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
nohup ipfs daemon > startup.log &
ipfs add helloworld.pl
export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl \
    --wait \
    --id-only \
    swipl \
    -- swipl -q -s helloworld.pl -g hello_world)
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
bacalhau docker run \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf"
R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"
wget https://w3s.link/ipfs/QmZiwZz7fXAvQANKYnt7ya838VPpj4agJt5EDvRYp3Deeo/example_wp_log_R.csv
mkdir -p outputs
mkdir -p R
# content of the Saturating-Forecasts.R

# Library Inclusion
library('prophet')


# Command Line Arguments:
args = commandArgs(trailingOnly=TRUE)
args

input = args[1]
output = args[2]
output1 = args[3]


# File Path Processing:
I <- paste("", input, sep ="")

O <- paste("", output, sep ="")

O1 <- paste("", output1 ,sep ="")


# Read CSV Data:
df <- read.csv(I)


# Forecasting 1:
df$cap <- 8.5
m <- prophet(df, growth = 'logistic')

future <- make_future_dataframe(m, periods = 1826)
future$cap <- 8.5
fcst <- predict(m, future)
pdf(O)
plot(m, fcst)
dev.off()

# Forecasting 2:
df$y <- 10 - df$y
df$cap <- 6
df$floor <- 1.5
future$cap <- 6
future$floor <- 1.5
m <- prophet(df, growth = 'logistic')
fcst <- predict(m, future)
pdf(O1)
plot(m, fcst)
dev.off()
Rscript Saturating-Forecasts.R "example_wp_log_R.csv" "outputs/output0.pdf" "outputs/output1.pdf"
FROM r-base
RUN R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN mkdir /R
RUN mkdir /outputs
COPY Saturating-Forecasts.R R
WORKDIR /R
docker build -t <hub-user>/<repo-name>:<tag> .
docker buildx build --platform linux/amd64 -t ghcr.io/bacalhau-project/examples/r-prophet:0.0.1 .
docker push <hub-user>/<repo-name>:<tag>
docker push ghcr.io/bacalhau-project/examples/r-prophet:0.0.1
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf")
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results
ls results/outputs
nvcc --version
mkdir inputs outputs
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/00-hello-world.cu
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu
# View the contents of the standard C++ program
cat inputs/00-hello-world.cu

# Measure the time it takes to compile and run the program
nvcc -o ./outputs/hello ./inputs/00-hello-world.cu; ./outputs/hello
# View the contents of the CUDA program with vector addition
cat inputs/02-cuda-hello-world-faster.cu

# Remove any previous output
rm -rf outputs/hello

# Measure the time for compilation and execution
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello
export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    --id-only \
    --wait \
    nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr  -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello ')
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs

hello_output.nbconvert.ipynb
# Use the TensorFlow GPU image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

RUN apt-get update -y

COPY mnist.ipynb .
# Copy the requirements file
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install -U scikit-learn
docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .
docker push your-dockerhub-username/jupyter-mnist-tensorflow
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs
# hello.R
print("hello world")
Rscript hello.R
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)
name: Running a Simple R Script
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: r-base:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c        
          - Rscript /hello.R
    InputSources:
      - Target: "/"
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/main/scripts/hello.R
            Path: /hello.R
bacalhau job run rhello.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe  ${JOB_ID}
rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results
cat results/stdout
bacalhau job describe ${JOB_ID} --spec > job.yaml
cat job.yaml
# write the following to the file hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm
# write the following to the file bacalhau.py
import json, glob, os, multiprocessing, shutil, subprocess, tempfile, time

# checkStatusOfJob checks the status of a Bacalhau job
def checkStatusOfJob(job_id: str) -> str:
    assert len(job_id) > 0
    p = subprocess.run(
        ["bacalhau", "list", "--output", "json", "--id-filter", job_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    r = parseJobStatus(p.stdout)
    if r == "":
        print("job status is empty! %s" % job_id)
    elif r == "Completed":
        print("job completed: %s" % job_id)
    else:
        print("job not completed: %s - %s" % (job_id, r))

    return r


# submitJob submits a job to the Bacalhau network
def submitJob(cid: str) -> str:
    assert len(cid) > 0
    p = subprocess.run(
        [
            "bacalhau",
            "docker",
            "run",
            "--id-only",
            "--wait=false",
            "--input",
            "ipfs://" + cid + ":/inputs/data.tar.gz",
            "ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if p.returncode != 0:
        print("failed (%d) job: %s" % (p.returncode, p.stdout))
    job_id = p.stdout.strip()
    print("job submitted: %s" % job_id)

    return job_id


# getResultsFromJob gets the results from a Bacalhau job
def getResultsFromJob(job_id: str) -> str:
    assert len(job_id) > 0
    temp_dir = tempfile.mkdtemp()
    print("getting results for job: %s" % job_id)
    for i in range(0, 5): # try 5 times
        p = subprocess.run(
            [
                "bacalhau",
                "get",
                "--output-dir",
                temp_dir,
                job_id,
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if p.returncode == 0:
            break
        else:
            print("failed (exit %d) to get job: %s" % (p.returncode, p.stdout))

    return temp_dir


# parseJobStatus parses the status of a Bacalhau job
def parseJobStatus(result: str) -> str:
    if len(result) == 0:
        return ""
    r = json.loads(result)
    if len(r) > 0:
        return r[0]["State"]["State"]
    return ""


# parseHashes splits lines from a text file into a list
def parseHashes(filename: str) -> list:
    assert os.path.exists(filename)
    with open(filename, "r") as f:
        hashes = f.read().splitlines()
    return hashes


def main(file: str, num_files: int = -1):
    # Use multiprocessing to work in parallel
    count = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=count) as pool:
        hashes = parseHashes(file)[:num_files]
        print("submitting %d jobs" % len(hashes))
        job_ids = pool.map(submitJob, hashes)
        assert len(job_ids) == len(hashes)

        print("waiting for jobs to complete...")
        while True:
            job_statuses = pool.map(checkStatusOfJob, job_ids)
            total_finished = sum(map(lambda x: x == "Completed", job_statuses))
            if total_finished >= len(job_ids):
                break
            print("%d/%d jobs completed" % (total_finished, len(job_ids)))
            time.sleep(2)

        print("all jobs completed, saving results...")
        results = pool.map(getResultsFromJob, job_ids)
        print("finished saving results")

        # Do something with the results
        shutil.rmtree("results", ignore_errors=True)
        os.makedirs("results", exist_ok=True)
        for r in results:
            path = os.path.join(r, "outputs", "*.csv")
            csv_file = glob.glob(path)
            for f in csv_file:
                print("moving %s to results" % f)
                shutil.move(f, "results")

if __name__ == "__main__":
    main("hashes.txt", 10)
python3 bacalhau.py
ls results

transactions_00000000_00049999.csv  transactions_00150000_00199999.csv
transactions_00050000_00099999.csv  transactions_00200000_00249999.csv
transactions_00100000_00149999.csv  transactions_00250000_00299999.csv

GPU Workloads Setup

Bacalhau supports GPU workloads. In this tutorial, learn how to run a job using GPU workloads with the Bacalhau client.

Prerequisites

  • The Bacalhau network must have an executor node with a GPU exposed

  • Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node

Usage

To submit a job request, use the --gpu flag under the docker run command to select the number of GPUs your job requires. For example:

bacalhau docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Limitations

The following limitations currently exist within Bacalhau. Bacalhau supports:

  • NVIDIA, Intel or AMD GPUs only

  • GPUs for the Docker executor only

Google Cloud Marketplace

Introduction

Well done on deploying your Bacalhau cluster! Now that the deployment is finished, this document will help with the next steps. It provides important information on how to interact with and manage the cluster. You'll find details on the outputs from the deployment, including how to set up and connect a Bacalhau client, and how to authorize and connect a Bacalhau compute node to the cluster. This guide gives you everything needed to start using your Bacalhau setup.

Deployment Outputs

After completing the deployment, several outputs will be presented. Below is a description of each output and instructions on how to configure your Bacalhau node using them.

Requester Public IP

Description: The IP address of the Requester node for the deployment and the endpoint where the Bacalhau API is served.

Usage: Configure the Bacalhau Client to connect to this IP address in the following ways:

  1. Setting the --api-host CLI Flag:

    $ bacalhau --api-host $REQUESTER_IP [CMD]
  2. Setting the BACALHAU_API_HOST environment variable:

    $ export BACALHAU_API_HOST=$REQUESTER_IP
    $ bacalhau [CMD]
  3. Modifying the Bacalhau Configuration File:

    $ bacalhau config set node.clientapi.host $REQUESTER_IP
    $ bacalhau [CMD]

Requester API Token

Description: The token used to authorize a client when accessing the Bacalhau API.

Usage: The Bacalhau client prompts for this token when a command is first issued to the Bacalhau API. For example:

$ bacalhau agent version
token: $REQUESTER_API_TOKEN

Compute API Token

Description: The token used to authorize a Bacalhau Compute node to connect to the Requester Node.

Usage: A Bacalhau Compute node can be connected to the Requester Node using the following command:

bacalhau serve --node-type=compute --orchestrators=nats://$COMPUTE_API_TOKEN@$REQUESTER_IP

Write a config.yaml

How to write the config.yaml file to configure your nodes

On installation, Bacalhau creates a .bacalhau directory that includes a config.yaml file tailored for your specific settings. This configuration file is the central repository for custom settings for your Bacalhau nodes.

When initializing a Bacalhau node, the system determines its configuration by following a specific hierarchy. First, it checks the default settings, then the config.yaml file, followed by environment variables, and finally, any command line flags specified during execution. Configurations are set and overridden in that sequence. This layered approach allows the default Bacalhau settings to provide a baseline, while environment variables and command-line flags offer added flexibility. However, the config.yaml file offers a reliable way to predefine all necessary settings before node creation across environments, ensuring consistency and ease of management.
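As a quick illustration of that precedence (the IP addresses are placeholders, and the dotted key follows the configuration options listed below), the command-line flag wins over the environment variable, which in turn wins over the value stored in config.yaml:

bacalhau config set API.Host 10.0.0.5       # persisted in config.yaml
export BACALHAU_API_HOST=10.0.0.6           # overrides the config.yaml value
bacalhau --api-host 10.0.0.7 agent version  # the flag overrides both for this invocation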

Modifications to the config.yaml file are not dynamically applied to existing nodes. A restart of the Bacalhau node is required for any changes to take effect.

Your config.yaml file starts off empty. However, you can see all available settings using the following command:

bacalhau config list

This command showcases over a hundred configuration parameters related to users, security, metrics, updates, and node configuration, providing a comprehensive overview of the customization options available for your Bacalhau setup.

Let’s go through the different options and how your configuration file is structured.

Config.yaml Structure

The bacalhau config list command displays your configuration paths, segmented with periods to indicate each part you are configuring.

Consider these configuration settings: NameProvider and Labels. These settings set the name and the labels for your Bacalhau node.

In your config.yaml, these settings will be formatted like this:

labels:
    NodeType: WebServer
    OS: Linux
nameprovider: puuid
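You can achieve the same result without editing the file by hand. Assuming bacalhau config set accepts the dotted keys shown by bacalhau config list, and that map values are passed as comma-separated key=value pairs, a sketch would be:

bacalhau config set NameProvider puuid
bacalhau config set Labels "NodeType=WebServer,OS=Linux"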

Configuration Options

Here are your Bacalhau configuration options in alphabetical order:

Configuration Option
Description

API.Auth.AccessPolicyPath

String path to where your security policy is stored

API.Auth.Methods

Set authentication method for your Bacalhau network

API.Host

The host for the client and server to communicate on (via REST). Ignored if BACALHAU_API_HOST environment variable is set

API.Port

The port for the client and server to communicate on (via REST). Ignored if BACALHAU_API_PORT environment variable is set

API.TLS.AutoCert

Domain for automatic certificate generation

API.TLS.AutoCertCachePath

The directory where the autocert process will cache certificates to avoid rate limits

API.TLS.CAFile

The path to the Certificate Authority certificate file used by your node clients when self-signed certificates are in play

API.TLS.KeyFile

Specifies path to the TLS Private Key file

API.TLS.Insecure

Boolean binary indicating if the client TLS is insecure, when true instructs the client to use HTTPS (TLS), but not to attempt to verify the certificate

API.TLS.SelfSigned

Boolean indicating if a self-signed security certificate is being used

API.TLS.UseTLS

Boolean indicating if TLS should be used for client connections

Compute.Auth.Token

Token specifies the key for compute nodes to be able to access the orchestrator.

Compute.AllocatedCapacity.CPU

Total amount of CPU the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.Disk

Total amount of disk the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.GPU

Total amount of GPU the system can use at one time in aggregate for all jobs

Compute.AllocatedCapacity.Memory

Total amount of memory the system can use at one time in aggregate for all jobs

Compute.AllowListedLocalPaths

AllowListedLocalPaths specifies a list of local file system paths that the compute node is allowed to access

Compute.Heartbeat.Interval

How often the compute node will send a heartbeat to the requester node to let it know that the compute node is still alive. This should be less than the requester's configured heartbeat timeout to avoid flapping.

Compute.Heartbeat.InfoUpdateInterval

The frequency with which the compute node will send node info (including current labels) to the controlling requester node

Compute.Heartbeat.ResourceUpdateInterval

How often the compute node will send current resource availability to the requester node

Compute.Orchestrators

Comma-separated list of orchestrators to connect to. Applies to compute nodes

DataDir

DataDir specifies a location on disk where the bacalhau node will maintain state

DisableAnalytics

When set to true disables Bacalhau from sharing anonymous user data with Expanso

JobDefaults.Batch.Priority

Priority specifies the default priority allocated to a batch job. This value is used when the job hasn't explicitly set its priority requirement

JobDefaults.Batch.Task.Publisher.Params

Params specifies the publisher configuration data

JobDefaults.Batch.Task.Publisher.Type

Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.

JobDefaults.Batch.Task.Resources.CPU

Sets default CPU resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.CPU

Sets default CPU resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.CPU

Sets default CPU resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.CPU

Sets default CPU resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.Disk

Sets default disk resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.Disk

Sets default disk resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.Disk

Sets default disk resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.Disk

Sets default disk resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.GPU

Sets default GPU resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.GPU

Sets default GPU resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.GPU

Sets default GPU resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.GPU

Sets default GPU resource limits for service jobs on your Compute node

JobDefaults.Batch.Task.Resources.Memory

Sets default memory resource limits for batch jobs on your Compute node

JobDefaults.Daemon.Task.Resources.Memory

Sets default memory resource limits for daemon jobs on your Compute node

JobDefaults.Ops.Task.Resources.Memory

Sets default memory resource limits for ops jobs on your Compute node

JobDefaults.Service.Task.Resources.Memory

Sets default memory resource limits for service jobs on your Compute node

JobDefaults.Ops.Task.Publisher.Params

Params specifies the publisher configuration data

JobDefaults.Ops.Task.Publisher.Type

Type specifies the publisher type. e.g. "s3", "local", "ipfs", etc.

JobDefaults.Service.Priority

Priority specifies the default priority allocated to a service job

JobDefaults.Daemon.Priority

Priority specifies the default priority allocated to a daemon job

JobDefaults.Ops.Priority

Priority specifies the default priority allocated to an ops job

JobAdmissionControl.Locality

Sets job selection policy based on where the data for the job is located. ‘local’ or ‘anywhere’

JobAdmissionControl.ProbeExec

Use the result of an executed external program to decide if a job should be accepted. Overrides data locality settings

JobAdmissionControl.ProbeHTTP

Use the result of a HTTP POST to decide if a job should be accepted. Overrides data locality settings

JobAdmissionControl.RejectStatelessJobs

Boolean signifying if jobs that don’t specify any data should be rejected

JobDefaults.Batch.Task.Timeouts.ExecutionTimeout

Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined

JobDefaults.Ops.Task.Timeouts.ExecutionTimeout

Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined

JobDefaults.Batch.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on

JobDefaults.Ops.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on

Publishers.Types.Local.Address

The address for the local publisher's server to bind to

Publishers.Types.Local.Port

The port for the local publisher's server to bind to (default: 6001)

Logging.LogDebugInfoInterval

The duration interval your compute node should generate logs on the running job executions

Logging.Mode

Mode specifies the logging mode. One of: default, json.

Logging.Level

Level sets the logging level. One of: trace, debug, info, warn, error, fatal, panic.

Engines.Disabled

List of Engine types to disable

Engines.Types.Docker.ManifestCache.TTL

The default time-to-live for each record in the manifest cache

Engines.Types.Docker.ManifestCache.Refresh

Refresh specifies the refresh interval for cache entries.

Engines.Types.Docker.ManifestCache.Size

Specifies the number of items that can be held in the manifest cache

FeatureFlags.ExecTranslation

ExecTranslation enables the execution translation feature

Publishers.Disabled

List of Publisher types to disable

Publishers.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS

InputSources.Disabled

List of Input Source types to disable

InputSources.MaxRetryCount

MaxRetryCount specifies the maximum number of attempts for reading from a storage

InputSources.ReadTimeout

ReadTimeout specifies the maximum time allowed for reading from a storage

InputSources.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS - to be used as input source

ResultDownloaders.Timeout

Timeout specifies the maximum time allowed for a download operation.

ResultDownloaders.Disabled

Disabled is a list of downloaders that are disabled

ResultDownloaders.Types.IPFS.Endpoint

Endpoint specifies the multi-address to connect to for IPFS

Labels

List of labels to apply to the node that can be used for node selection and filtering

NameProvider

The name provider to use to generate the node name

Orchestrator.Auth.Token

Token specifies the key, which Orchestrator node expects from the Compute node to use to connect to it

Orchestrator.Advertise

Address to advertise to compute nodes to connect to

Orchestrator.Cluster.Advertise

Address to advertise to other orchestrators to connect to

Orchestrator.Cluster.Name

Name of the cluster to join

Orchestrator.Cluster.Peers

Comma-separated list of other orchestrators to connect to form a cluster

Orchestrator.Cluster.Port

Port to listen for connections from other orchestrators to form a cluster

Orchestrator.Port

Port to listen for connections from other nodes. Applies to orchestrator nodes

Orchestrator.NodeManager.DisconnectTimeout

This is the time period after which a compute node is considered to be disconnected. If the compute node does not deliver a heartbeat every DisconnectTimeout then it is considered disconnected

Orchestrator.EvaluationBroker.MaxRetryCount

Maximum retry count for the evaluation broker

Orchestrator.EvaluationBroker.VisibilityTimeout

Visibility timeout for the evaluation broker

Orchestrator.Scheduler.HousekeepingInterval

Duration between Bacalhau housekeeping runs

Orchestrator.Scheduler.HousekeepingTimeout

Specifies the maximum time allowed for a single housekeeping run

JobAdmissionControl.AcceptNetworkedJobs

Boolean signifying if jobs that specify networking should be accepted

Orchestrator.NodeManager.ManualApproval

Boolean signifying if new nodes should only be manually approved to your network. Default is false

Orchestrator.Scheduler.QueueBackoff

QueueBackoff specifies the time to wait before retrying a failed job.

Publishers.Types.S3.PreSignedURLDisabled

Boolean deciding if a secure S3 URL should be generated and used. Default false, Disabled if true.

Publishers.Types.S3.PreSignedURLExpiration

Defined expiration interval for your secure S3 urls

FeatureFlags.ExecTranslation

Whether jobs should be translated at the requester node or not. Default: false

Orchestrator.Scheduler.WorkerCount

Number of workers that should be generated under your requester node

Orchestrator.Host

Host specifies the hostname or IP address on which the Orchestrator server listens for compute node connections

Orchestrator.Port

Port specifies the port number on which the Orchestrator server listens for compute node connections.

Orchestrator.Enabled

Enabled indicates whether the orchestrator node is active and available for job submission.

Compute.Enabled

Enabled indicates whether the compute node is active and available for job execution.

StrictVersionMatch

StrictVersionMatch indicates whether to enforce strict version matching

UpdateConfig.Interval

The frequency with which your system checks for version updates. When set to 0 update checks are not performed.

WebUI.Backend

Backend specifies the address and port of the backend API server. If empty, the Web UI will use the same address and port as the API server

WebUI.Enabled

Enabled indicates whether the Web UI is enabled

WebUI.Listen

Listen specifies the address and port on which the Web UI listens


Pinning Data

How to pin data to public storage

If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

Basic steps

To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access your files. Then you need to upload the files: services usually provide a web interface, a CLI and code samples for integration into your application. Once you upload a file, you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9. Now you can access the pinned data from your jobs via this CID.
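Once the data is pinned, mounting it into a job works like any other CID. A minimal sketch, reusing the illustrative CID above and the public alpine image:

bacalhau docker run \
    --input ipfs://QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9:/inputs/data \
    alpine -- ls -l /inputs/data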

Copy Data from URL to Public Storage

To upload a file from a URL we will use the bacalhau docker run command.

bacalhau docker run \
    --id-only \
    --wait \
    --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md \
    ghcr.io/bacalhau-project/examples/upload:v1

The job has been submitted and Bacalhau has printed out the related job id.

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau using docker executor

  2. --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md: URL path of the input data volumes downloaded from a URL source.

  3. ghcr.io/bacalhau-project/examples/upload:v1: the name and tag of the docker image we are using

The bacalhau docker run command takes advantage of the --input parameter. This will download a file from a public URL and place it in the /inputs directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.

Job status: You can check the status of the job using bacalhau job list, processing the JSON output with jq:

bacalhau job list $JOB_ID --output=json | jq '.[0].Status.JobState.Nodes[] | .Shards."0" | select(.RunOutput)'

When the job status is Published or Completed, that means the job is done, and we can get the results using the job ID.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe  $JOB_ID 

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.

rm -rf results && mkdir ./results
bacalhau job get --output-dir ./results $JOB_ID 

Each job result contains an outputs subfolder and exitCode, stderr and stdout files with relevant content. To view the execution logs, execute the following:

head -n 15 ./results/stdout

And to view the job execution result (README.md file in the example case), which was saved as a job output, execute:

tail ./results/outputs/README.md

To get the output CID from a completed job, run the following command:

bacalhau job list $JOB_ID --output=json | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID'

The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.
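For example, you can wrap the command above in an export (this assumes jq is installed and the job has already published its results):

export CID=$(bacalhau job list $JOB_ID --output=json \
    | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID')
echo $CID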

Now that we have the CID, we can use it in a new job. This time we will use the --input parameter to tell Bacalhau to use the CID we just uploaded.

In this case, the only goal of our job is just to list the contents of the /inputs directory. You can see that the "input" data is located under /inputs/outputs/README.md.

bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://$CID \
    ubuntu -- \
    bash -c "set -x; ls -l /inputs; ls -l /inputs/outputs; cat /inputs/outputs/README.md"

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Running a Job over S3 data

Here is a quick tutorial on how to copy data from S3 to public storage. In this tutorial, we will scrape all the links from a public AWS S3 bucket and then copy the data to IPFS using Bacalhau.

bacalhau docker run \
    -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
    --id-only \
    --wait \
    alpine \
    -- sh -c "cp -r /inputs/* /outputs/"

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01 from the bucket noaa-goes16 in us-east-1 region, and mount the objects under /inputs path inside the docker job.

  3. -- sh -c "cp -r /inputs/* /outputs/": copies all files under /inputs to /outputs, which by default is the results output directory; all of its content will be published to the specified destination, which is IPFS by default.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get $JOB_ID --output-dir results # Download the results

When the download is completed, the results of the job will be present in the directory. To view them, run the following command:

ls -1 results/outputs

{
  "NextToken": "",
  "Results": [
    {
      "Type": "s3PreSigned",
      "Params": {
        "PreSignedURL": "https://bacalhau-test-datasets.s3.eu-west-1.amazonaws.com/integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUEMPQ7JFSLGEPHJG%2F20240129%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240129T060142Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=cea00578ae3b03a1b52dba2d65a1bab40f1901fb7cd4ee1a0a974dc05b595f2e",
        "SourceSpec": {
          "Bucket": "bacalhau-test-datasets",
          "ChecksumSHA256": "1tlbgo+q0TlQhJi8vkiWnwTwPu1zenfvTO4qW1D5yvI=",
          "Endpoint": "",
          "Filter": "",
          "Key": "integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz",
          "Region": "eu-west-1",
          "VersionID": "oS7n.lY5BYHPMNOfbBS1n5VLl4ppVS4h"
        }
      }
    }
  ]
}

First you need to install jq (if it is not already installed) to process JSON:

sudo apt update
sudo apt install jq

To extract the CIDs from the output JSON, execute the following:

bacalhau job describe ${JOB_ID} --json \
| jq -r '.State.Executions[].PublishedResults.CID | select (. != null)'

The extracted CID will look like this:

QmYFhG668yJZmtk84SMMdbrz5Uvuh78Q8nLxTgLDWShkhR

You can publish your results to Amazon S3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.

To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.

For S3-compatible destinations, the configuration is as follows:

type PublisherSpec struct {
    Type   Publisher              `json:"Type,omitempty"`
    Params map[string]interface{} `json:"Params,omitempty"`
}

For Amazon S3, you can specify the PublisherSpec configuration as shown below:

PublisherSpec:
  Type: S3
  Params:
    Bucket: <bucket>              # Specify the bucket where results will be stored
    Key: <object-key>             # Define the object key (supports dynamic naming using placeholders)
    Compress: <true/false>        # Specify whether to publish results as a single gzip file (default: false)
    Endpoint: <optional>          # Optionally specify the S3 endpoint
    Region: <optional>            # Optionally specify the S3 region

Let's explore some examples to illustrate how you can use this:

  1. Publishing results to S3 using default settings

bacalhau docker run -p s3://<bucket>/<object-key> ubuntu ...
  2. Publishing results to S3 with a custom endpoint and region:

bacalhau docker run \
-p s3://<bucket>/<object-key>,opt=endpoint=http://s3.example.com,opt=region=us-east-1 \
ubuntu ...
  3. Publishing results to S3 as a single compressed file

bacalhau docker run -p s3://<bucket>/<object-key>,opt=compress=true ubuntu ...
  4. Utilizing naming placeholders in the object key

bacalhau docker run -p s3://<bucket>/result-{date}-{jobID} ubuntu ...

Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.

Here's an example of a sample result:

{
    "NodeID": "QmYJ9QN9Pbi6gBKNrXVk5J36KSDGL5eUT6LMLF5t7zyaA7",
    "Data": {
        "StorageSource": "S3",
        "Name": "s3://<bucket>/run3.tar.gz",
        "S3": {
            "Bucket": "<bucket>",
            "Key": "run3.tar.gz",
            "Checksum": "e0uDqmflfT9b+rMfoCnO5G+cy+8WVTOPUtAqDMnXWbw=",
            "VersionID": "hZoNdqJsZxE_bFm3UGJuJ0RqkITe9dQ1"
        }
    }
}

To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:

  • Environment variables, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

  • Credentials file ~/.aws/credentials

  • IAM Roles for Amazon EC2 Instances
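For example, exporting credentials into the environment of the shell that starts the Bacalhau node (or runs the CLI) is enough for the default chain to pick them up; the values below are placeholders:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
# start the node or submit jobs from this shell so the credentials are inherited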

Running Rust programs as WebAssembly (WASM)

Prerequisites

1. Develop a Rust Program Locally

We can use cargo (which will have been installed by rustup) to start a new project (my-program) and compile it:

cargo init my-program

We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.

// ./my-program/src/main.rs
use image::{open, GrayImage, Luma, Pixel};
use imageproc::definitions::Clamp;
use imageproc::gradients::sobel_gradient_map;
use imageproc::map::map_colors;
use imageproc::seam_carving::*;
use std::path::Path;

fn main() {
    let input_path = "inputs/image0.JPG";
    let output_dir = "outputs/";

    let input_path = Path::new(&input_path);
    let output_dir = Path::new(&output_dir);

    // Load image and convert to grayscale
    let input_image = open(input_path)
        .expect(&format!("Could not load image at {:?}", input_path))
        .to_rgb8();

    // Save original image in output directory
    let original_path = output_dir.join("original.png");
    input_image.save(&original_path).unwrap();

    // We will reduce the image width by this amount, removing one seam at a time.
    let seams_to_remove: u32 = input_image.width() / 6;

    let mut shrunk = input_image.clone();
    let mut seams = Vec::new();

    // Record each removed seam so that we can draw them on the original image later.
    for i in 0..seams_to_remove {
        if i % 100 == 0 {
            println!("Removing seam {}", i);
        }
        let vertical_seam = find_vertical_seam(&shrunk);
        shrunk = remove_vertical_seam(&mut shrunk, &vertical_seam);
        seams.push(vertical_seam);
    }

    // Draw the seams on the original image.
    let gray_image = map_colors(&input_image, |p| p.to_luma());
    let annotated = draw_vertical_seams(&gray_image, &seams);
    let annotated_path = output_dir.join("annotated.png");
    annotated.save(&annotated_path).unwrap();

    // Draw the seams on the gradient magnitude image.
    let gradients = sobel_gradient_map(&input_image, |p| {
        let mean = (p[0] + p[1] + p[2]) / 3;
        Luma([mean as u32])
    });
    let clamped_gradients: GrayImage = map_colors(&gradients, |p| Luma([Clamp::clamp(p[0])]));
    let annotated_gradients = draw_vertical_seams(&clamped_gradients, &seams);
    let gradients_path = output_dir.join("gradients.png");
    clamped_gradients.save(&gradients_path).unwrap();
    let annotated_gradients_path = output_dir.join("annotated_gradients.png");
    annotated_gradients.save(&annotated_gradients_path).unwrap();

    // Save the shrunk image.
    let shrunk_path = output_dir.join("shrunk.png");
    shrunk.save(&shrunk_path).unwrap();
}

In the main function main() an image is loaded, the original is saved, and then a loop is performed to reduce the width of the image by removing "seams." The results of the process are saved, including the original image with drawn seams and a gradient image with highlighted seams.

We also need to install the imageproc and image libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false). After disabling the default features, you need to explicitly specify only the features that you need:

// ./my-program/Cargo.toml
[package]
name = "my-program"
version = "0.1.0"
edition = "2021"

[dependencies.image]
version = "0.24.4"
default-features = false
features = ["png", "jpeg", "bmp"]

[dependencies.imageproc]
version = "0.23.0"
default-features = false

We can now build the Rust program into a WASM blob using cargo:

cd my-program && cargo build --target wasm32-wasi --release

This command navigates to the my-program directory and builds the project using Cargo with the target set to wasm32-wasi in release mode.

This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm which can now be run on Bacalhau.

2. Running WASM on Bacalhau

Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.

The -i flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.

For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.

export JOB_ID=$(bacalhau wasm run \
    ./my-program/target/wasm32-wasi/release/my-program.wasm _start \
    --id-only \
    -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs)

Structure of the Commands

  1. bacalhau wasm run: call to Bacalhau

  2. ./my-program/target/wasm32-wasi/release/my-program.wasm: the path to the WASM file that will be executed

  3. _start: the entry point of the WASM program, where its execution begins

  4. --id-only: this flag indicates that only the identifier of the executed job should be returned

  5. -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs: input data volume that will be accessible within the job at the specified destination path

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results) and downloaded our job output to be stored in that directory.

We can now get the results.

rm -rf wasm_results && mkdir -p wasm_results
bacalhau job get ${JOB_ID} --output-dir wasm_results

Viewing Job Output

When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.

./wasm_results/outputs/original.png
./wasm_results/outputs/annotated_gradients.png
./wasm_results/outputs/shrunk.png

Support

Accessing the Internet from Jobs

By default, Bacalhau jobs do not have any access to the internet. This is to keep both compute providers and users safe from malicious activities.

However, by using data volumes, you can read and access your data from within jobs and write back results.

Using Data Volumes

To use these features, the data to be downloaded has to be known before the job starts. For some workloads, the required data is computed as part of the job, for example when the purpose of the job is to process web results. In these cases, networking may be required during job execution.

Specifying Jobs to Access the Internet

To run Docker jobs on Bacalhau that need to access the internet, you'll need to specify one of the following networking modes:

  1. full: unfiltered networking for any protocol --network=full

  2. http: HTTP(S)-only networking to a specified list of domains --network=http

  3. none: no networking at all, the default --network=none

Specifying none will still allow Bacalhau to download and upload data before and after the job.

Jobs using http must specify the domains they want to access when the job is submitted. When the job runs, only HTTP requests to those domains will be possible, and data transfer will be rate-limited to 10Mbit/sec in either direction to prevent DDOS.

The required networking can be specified using the --network flag. For http networking, the required domains can be specified using the --domain flag, multiple times for as many domains as required. Specifying a domain starting with a . means that all sub-domains will be included. For example, specifying .example.com will cover some.thing.example.com as well as example.com.
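For example, an HTTP-only job that is allowed to reach example.com and all of its sub-domains might be submitted like this (a sketch only; the image, domain and command are placeholders rather than part of the original guide):

bacalhau docker run \
    --network=http \
    --domain=.example.com \
    alpine \
    -- wget -qO- https://api.example.com/data.json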

Bacalhau jobs are explicitly prevented from starting other Bacalhau jobs, even if a Bacalhau requester node is specified on the HTTP allowlist.

Support for networked jobs on the public network

Bacalhau has support for describing jobs that can access the internet during job execution. The ability for compute nodes to run jobs that require internet access depends on what compute nodes are currently part of the network.

Compute nodes that join the Bacalhau network do not accept networked jobs by default (i.e. they only accept jobs that specify --network=none, which is also the default).

(Updated) Configuration Management

Introduction

There have been some changes made to how Bacalhau handles configuration:

  1. The bacalhau repo ~/.bacalhau no longer contains a config file.

  2. Bacalhau no longer looks in the repo ~/.bacalhau for a config file.

  3. Bacalhau never writes a config file to disk unless instructed by a user to do so.

  4. A config file is not required to operate Bacalhau.

  5. Bacalhau searches for a default config file. The location is OS-dependent:

    1. Linux: ~/.config/bacalhau/config.yaml

    2. OSX: ~/Library/Application\ Support/bacalhau/config.yaml

    3. Windows: %AppData%\bacalhau\config.yaml. Usually, this is something like C:\Users\username\AppData\Roaming\bacalhau\config.yaml

Summary

Bacalhau no longer relies on the ~/.bacalhau directory for configuration and only creates a config file when instructed. While not required, it will look for a default config file in OS-specific locations.

Inspecting the Current Configuration of Bacalhau

Making Changes to the Default Config File

As described above, bacalhau still has the concept of a default config file, which, for the sake of simplicity, we’ll say lives in ~/.config/bacalhau/config.yaml. There are two ways this file can be modified:

  1. A text editor: vim ~/.config/bacalhau/config.yaml.

  2. The bacalhau config set command, described in the following sections.

Using a Non-Default Config File.

Bacalhau Configuration Keys

In Bacalhau, configuration keys are structured identifiers used to configure and customize the behavior of the application. They represent specific settings that control various aspects of Bacalhau's functionality, such as network parameters, API endpoints, node operations, and user interface options. The configuration file is organized in a tree-like structure using nested mappings (dictionaries) in YAML format. Each level of indentation represents a deeper level in the hierarchy.

Example: part of the config file

API:
  Host: 0.0.0.0
  Port: 1234
  Auth:
    Methods:
      ClientKey:
        Type: challenge
NameProvider: puuid
DataDir: /home/frrist/.bacalhau
Orchestrator:
  Host: 0.0.0.0
  Port: 4222
  NodeManager:
    DisconnectTimeout: 1m0s

In this YAML configuration file:

  1. Top-Level Keys (Categories): API, Orchestrator

  2. Sub-Level Keys (Subcategories): Under API, we have Host and Port; Under Orchestrator we have Host, Port and NodeManager

  3. Leaf Nodes (Settings): Host, Port, NameProvider, DataDir, DisconnectTimeout — these contain the actual configuration values.

Config keys use dot notation to represent the path from the root of the configuration hierarchy down to a specific leaf node. Each segment in the key corresponds to a level in the hierarchy. Syntax is Category.Subcategory(s)...LeafNode
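For example, taking the config file above, the DisconnectTimeout setting nested under Orchestrator and NodeManager is addressed by the key Orchestrator.NodeManager.DisconnectTimeout, which can then be used with the commands described below (the timeout value here is purely illustrative):

# Dot-notation key: Orchestrator -> NodeManager -> DisconnectTimeout
bacalhau config set Orchestrator.NodeManager.DisconnectTimeout 1m30s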

Using Keys With config set, config list and --config

The bacalhau config list command returns all keys and their corresponding values. The bacalhau config set command accepts a key and a value to set it to. The --config flag accepts a key and a value that will be applied to Bacalhau when it runs.

Example Interaction With the Bacalhau Configuration System

How to Modify the API Host Using bacalhau config set in the Default Config File:

  1. Run bacalhau config list to find the appropriate key

bacalhau config list
 KEY VALUE DESCRIPTION
 ... ... ...
 api.host 0.0.0.0 Host specifies the hostname or IP address o
 ... ... ...
  1. Run the bacalhau config set command

bacalhau config set api.host 192.168.0.1
  1. Observe how bacalhau config list reflects the new setting

bacalhau config list
 KEY VALUE DESCRIPTION
 ... ... ...
 api.host 192.168.0.1 Host specifies the hostname or IP address
 ... ... ...
  1. Observe the change has been reflected in the default config file

cat ~/.config/bacalhau/config.yaml
api:
    host: 192.168.0.1

How to Modify the API Host Using bacalhau config set With a Custom Config File

  1. Run the config set command with the flag

bacalhau config set --config=custom.yaml api.host 10.0.0.1
  1. Observe the created config file

cat custom.yaml
api:
 host: 10.0.0.1

Observe that the default config file and the output of bacalhau config list do not reflect this change.

How to Start Bacalhau With a Custom Config File

bacalhau --config=custom.yaml serve

Usage of the --config Flag

The --config (or -c) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources.

Usage

bacalhau [command] --config <option> [--config <option> ...] 

or using the short form:

bacalhau [command] -c <option> [-c <option> ...]

Configuration Options

  1. YAML Config Files: Specify paths to YAML configuration files. Example:

--config path/to/config.yaml
  1. Key-Value Pairs: Set specific configuration values using dot notation. Example:

--config WebUI.Enabled=true
  1. Boolean Flags: Enable boolean options by specifying the key alone. Example:

--config WebUI.Enabled

Precedence

When multiple configuration options are provided, they are applied in the following order of precedence (highest to lowest):

  1. Command-line key-value pairs and boolean flags

  2. YAML configuration files

  3. Default values

Within each category, options specified later override earlier ones.

Examples

Using a single config file:

bacalhau serve --config my-config.yaml

Merging multiple config files:

bacalhau serve -c base-config.yaml -c override-config.yaml

Overriding specific values:

bacalhau serve \
-c config.yaml \
-c WebUI.Listen=0.0.0.0:9999 \
-c NameProvider=hostname

Combining file and multiple overrides:

bacalhau serve \
-c config.yaml \
-c WebUI.Enabled \
-c API.Host=192.168.1.5

In the last example, WebUI.Enabled will be set to true, API.Host will be 192.168.1.5, and other values will be loaded from config.yaml if present.

Remember, later options override earlier ones, allowing for flexible configuration management.

Usage of the bacalhau completion Command

The bacalhau completion command will generate shell completion for your shell. You can use the command like:

bacalhau completion <bash|fish|powershell|zsh> > /tmp/bacalhau_completion && source /tmp/bacalhau_completion 

After running the above command, commands like bacalhau config set and bacalhau --config will have auto-completion for all possible configuration values along with their descriptions.

Support

Utilizing NATS.io within Bacalhau

NATS.io networking fundamentals in Bacalhau

Support for libp2p was discontinued in version v1.5.0.

Our initial NATS integration focuses on simplifying communication between orchestrator and compute nodes. By embedding NATS within orchestrators, we streamline the network. Now, compute nodes need only connect to one or a few orchestrators and dynamically discover others at runtime, dramatically cutting down on configuration complexity.

How This Benefits Users

  1. Easier Setup: Compute nodes no longer need to be directly accessible by orchestrators, removing deployment barriers in diverse environments (on-premises, edge locations, etc.).

  2. Increased Reliability: Network changes are less disruptive, as compute nodes can easily switch between orchestrators if needed.

  3. Future-Proof: This sets the stage for more advanced NATS features like global clusters and multi-orchestrator setups.

How This Affects Current Users

The aim of integrating NATS into Bacalhau is to keep user experience with Bacalhau's HTTP APIs and CLIs for job submission and queries consistent. This ensures a smooth transition, allowing you to continue your work without any disruptions.

Getting Started with NATS

1. Generate an Authentication Token:

bacalhau config set Compute.Auth.Token=<your_secure_token>
bacalhau config set Orchestrator.Auth.Token=<your_secure_token>

Make sure to securely store this token and share it only with authorized parties in your network.

2. Running an Orchestrator Node with Authentication:

With the authentication token set, launch your orchestrator node as follows:

bacalhau serve --orchestrator

This command sets up an orchestrator node with an embedded NATS server, using the given auth token to secure communications. It defaults to port 4222, but you can customize this using the Orchestrator.Port configuration key if needed.
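For example, to move the embedded NATS server to a non-default port, you could set this key before starting the orchestrator (the port value is purely illustrative):

# 4222 is the default; any free port can be used
bacalhau config set Orchestrator.Port 4223
bacalhau serve --orchestrator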

3. Initiating Compute Nodes with Authentication:

Compute nodes can authenticate using one of the following methods, depending on your preferred configuration setup:

Option 1: Read the Auth.Token Value from the Config:

bacalhau serve --compute --config Compute.Orchestrators=<HOST>

This method assumes the Compute.Auth.Token is already configured on the compute node, allowing for a seamless authentication process.

Option 2: Pass Auth.Token Value Directly in the Orchestrator URI:

bacalhau serve \
--compute \
--config Compute.Orchestrators=<your_secure_token>@<HOST>

Here, the Auth.Token is directly included in the command line, providing an alternative for instances where it's preferable to specify the token explicitly rather than rely on the configuration file.

Both methods ensure that compute nodes, acting as NATS clients, securely authenticate with the orchestrator node(s), establishing a trusted communication channel within your Bacalhau network.

More Authentication Options

We're committed to providing a secure and flexible distributed computing environment. Future Bacalhau versions will expand authentication choices, including TLS certificates and JWT, catering to varied security needs and further strengthening network security.

Looking Ahead with NATS

Global Connectivity and Scalability: NATS opens avenues for Bacalhau to operate smoothly across all scales, from local deployments to international networks. Its self-healing capabilities and dynamic mesh networking form the foundation for a future of resilient and flexible distributed computing.

Unlocking New Possibilities: The integration heralds a new era of possibilities for Bacalhau, from global clusters to multiple orchestrator nodes, tackling the complexities of distributed computing with innovation and community collaboration.

Conclusion

The shift to NATS is a step toward making distributed computing more accessible, resilient, and scalable. As we start this new chapter, we're excited to explore the advanced features NATS brings to Bacalhau and welcome our community to join us on this transformative journey.

Support

Automatic Update Checking

Bacalhau has an update checking service to automatically detect whether a newer version of the software is available.

Users who are both running CLI commands and operating nodes will be regularly informed that a new release can be downloaded and installed.

For clients

Bacalhau will run an update check regularly when client commands are executed. If an update is available, explanatory text will be printed at the end of the command.

To force a manual update check, run the bacalhau version command, which will explicitly list the latest software release alongside the server and client versions.

bacalhau version

# expected output
# might show client version only if client is not connected to any orchestrator

CLIENT  SERVER  LATEST  UPDATE MESSAGE 
 v1.5.1  v1.5.1  1.5.1

For node operators

Bacalhau will run an update check regularly as part of the normal operation of the node.

If an update is available, an INFO level message will be printed to the log.

Configuring checks

Bacalhau has some configuration options for controlling how often checks are performed. By default, an update check will run no more than once every 24 hours. Users can opt out of automatic update checks using the configuration described below.

Config property: UpdateConfig.Interval

Environment variable: BACALHAU_UPDATE_CHECKFREQUENCY

Default value: 24h0m0s

Meaning: The minimum amount of time between automated update checks. Set as any duration of hours, minutes or seconds, e.g. 24h or 10m. When set to 0, update checks are not performed.
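For example, to opt out of automatic update checks entirely, the interval can be set to 0 with the config set command shown earlier (a minimal sketch):

# Disable automatic update checks
bacalhau config set UpdateConfig.Interval 0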

It's important to note that disabling the automatic update checks may lead to potential issues, arising from mismatched versions of different actors within Bacalhau.

To output update check config, run bacalhau config list:

bacalhau config list | grep UpdateConfig
 UpdateConfig.Interval   24h0m0s   Interval specifies the time between update checks, when set to 0 update checks are not performed.

Support

Generate Synthetic Data using Sparkov Data Generation technique

Introduction

A synthetic dataset is generated by algorithms or simulations and has characteristics similar to real-world data. Collecting real-world data, especially data that contains sensitive user information like credit card details, is often not possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit card fraud, they can use synthetically generated data instead of real data without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

Prerequisite

1. Running Sparkov Locally​

To run Sparkov locally, you'll need to clone the repo and install dependencies:

git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt

Go to the Sparkov_Data_Generation directory:

cd Sparkov_Data_Generation

Create a temporary directory (outputs) to store the outputs:

mkdir ../outputs

2. Running the script

python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"

The command above executes the Python script datagen.py, passing the following arguments to it:

  1. -n 1000: Number of customers to generate

  2. -o ../outputs: path to store the outputs

  3. "01-01-2022": Start date

  4. "10-01-2022": End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from 01-01-2022 to 10-01-2022 and saves the results in the ../outputs directory.

To see the full list of options, use:

python datagen.py -h

3. Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image:

FROM python:3.8

RUN apt update && apt install -y git

RUN git clone https://github.com/js-ts/Sparkov_Data_Generation/

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt

These commands specify how the image will be built and what extra requirements will be included. We use python:3.8 as the base image, install git, clone the Sparkov_Data_Generation repository from GitHub, set the working directory inside the container to /Sparkov_Data_Generation/, and install the Python dependencies listed in the requirements.txt file.

Build the container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

docker build -t jsacex/sparkov-data-generation .

Push the container

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/sparkov-data-generation

After the repo image has been pushed to Docker Hub, we can now use the container to run jobs on Bacalhau.

4. Running a Bacalhau Job

Now we're ready to run a Bacalhau job:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsacex/sparkov-data-generation \
    --  python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022")

Structure of the command:

  1. bacalhau docker run: call to Bacalhau

  2. jsacex/sparkov-data-generation: the name of the docker image we are using

  3. -- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022": the arguments passed into the container, specifying the execution of the Python script datagen.py with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results

6. Viewing your Job Output

To view the contents of the current directory, run the following command:

ls results/outputs

Support

Simple Image Processing

Introduction

In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.

Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.

Prerequisite​

Running a Bacalhau Job​

Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1: Specifies the input data, which is stored in the S3 storage.

  3. --entrypoint mogrify: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.

  4. dpokidov/imagemagick:7.1.0-47-ubuntu: The name and the tag of the docker image we are using

  5. -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg': These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. image.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Display the image​

To view the images, open the results/outputs/ folder:

Support​

Using Bacalhau with DuckDB

DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.

DuckDB is suited for the following use cases:

  1. Processing and storing tabular datasets, e.g. from CSV or Parquet files

  2. Interactive data analysis, e.g. Joining & aggregate multiple large tables

  3. Concurrent large changes to multiple large tables, e.g. appending rows, adding/removing/updating columns

  4. Large result set transfer to client

In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install it locally, and there is no need to download the datasets, since they are already available on IPFS or on the web.

Overview

  • How to run a relational database (like DuckDB) on Bacalhau

Prerequisites

Containerize Script using Docker

You can skip this entirely and directly go to running on Bacalhau.

If you want any additional dependencies to be installed along with DuckDB, you need to build your own container.

To build your own docker container, create a Dockerfile, which contains instructions to build your DuckDB docker container.

Build the container

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:

Structure of the command

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. davidgasquez/datadex:v0.2.0 : the name and the tag of the docker image we are using

  3. duckdb -s "select 1": execute DuckDB

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. duckdb1.yaml, and then run with the command:

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Viewing your Job Output

Each job creates 3 subfolders: the combined_results, per_shard files, and the raw directory. To view the file, run the following command:

Expected output:

Running Arbitrary SQL commands

Below is the bacalhau docker run command to run arbitrary SQL commands over the yellow taxi trips dataset.

Structure of the command

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. davidgasquez/duckdb:latest: the name and the tag of the docker image we are using

  4. /inputs: path to input dataset

  5. duckdb -s: execute DuckDB

Declarative job description​

The job description should be saved in .yaml format, e.g. duckdb2.yaml, and then run with the command:

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Job status: You can check the status of the job using bacalhau job list.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

Viewing your Job Output

Each job creates 3 subfolders: the combined_results, per_shard files, and the raw directory. To view the file, run the following command:

Need Support?

Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau

Introduction

Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even longer in my experience), and importing and exporting data from the Ethereum node is slow.

For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can now access the data without having to run another Ethereum node.

But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.

In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.

Prerequisite​

Analysing Ethereum Data Locally​

First let's download one of the IPFS files and inspect it locally:

You can see the full list of IPFS CIDs in the appendix at the bottom of the page.

If you don't already have the Pandas library, let's install it:

The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.

This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.

Analysing Ethereum Data With Bacalhau​

To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.

But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.

Next, let's make sure the file works as expected:

And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs and /outputs directories. The /inputs directory is where the data will be read from and the /outputs directory is where the results will be saved to.

We've already pushed the container, but for posterity, the following command pushes this container to GHCR.

Running a Bacalhau Job​

To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run command.

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.

Declarative job description​

The job description should be saved in .yaml format, e.g. blockchain.yaml, and then run with the command:


Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Display the image​

To view the images, we will use glob to return all file paths that match a specific pattern.

Massive Scale Ethereum Analysis​

Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.

See the appendix for the hashes.txt file.

Now take a look at the job id's. You can use these to check the status of the jobs and download the results:

You might want to double-check that the jobs ran ok by doing a bacalhau job list.

Wait until all of these jobs have been completed. Then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There have also been some issues in the past communicating with IPFS, so if you get an error, try again.

Display the image​

To view the images, we will use glob to return all file paths that match a specific pattern.

That's it! There are several years of Ethereum transaction volume data.

Appendix: List Ethereum Data CIDs​

The following is a list of IPFS CIDs for the Ethereum data that we used in this tutorial. You can use these CIDs to download the rest of the chain if you so desire. The CIDs are ordered by block number and increase 50,000 blocks at a time. Here's the ordered list of CIDs:

Support​

Oceanography - Data Conversion

Introduction

In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.

Prerequisites​

Running Locally​

Downloading the dataset​

Installing dependencies​

Next let's write the requirements.txt. This file will also be used by the Dockerfile to install the dependencies.

Reading and Viewing Data​

We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.

Data Conversion​

Writing the Script​

Let's create a new file called main.py and paste the following script in it:

This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.

Upload the Data to IPFS​

This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y.

Setting up Docker Container​

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

Build the container​

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

Push the container​

Now you can push this repository to the registry designated by its name or tag.

Running a Bacalhau Job​

Now that we have the data in IPFS and the Docker image pushed, the next step is to run a job using the bacalhau docker run command.
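The exact command is not reproduced in this export, but based on the components listed below it might look roughly like this (a sketch; the --id-only and --wait flags follow the pattern of the other examples in this guide):

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
    ghcr.io/bacalhau-project/examples/socat:0.0.11 \
    -- python main.py)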

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. ghcr.io/bacalhau-project/examples/socat:0.0.11: the name and the tag of the image we are using

  4. python main.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. ocean.yaml, and then run with the command:

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

To view the file, run the following command:

Write a SpecConfig

SpecConfig provides a unified structure to specify configurations for various components in Bacalhau, including engines, publishers, and input sources. Its flexible design allows seamless integration with multiple systems like Docker, WebAssembly (Wasm), AWS S3, and local directories, among others.

SpecConfig Parameters

  1. Type (string : <required>): Specifies the type of the configuration. Examples include docker and wasm for execution engines, S3 for input sources and publishers, etc.

  2. Params (map[string]any : <optional>): A set of key-value pairs that provide the specific configurations for the chosen type. The keys and values are flexible and depend on the Type. For instance, parameters for a Docker engine might include image name and version, while an S3 publisher would require configurations like the bucket name and AWS region. If not provided, it defaults to nil.

Usage Examples

Here are a few hypothetical examples to demonstrate how you might define SpecConfig for different components:

Docker Engine

S3 Publisher

Local Directory Input Source
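The original example snippets are not included in this export, so below is a minimal sketch of what such definitions might look like in the YAML job-spec style used elsewhere in this guide. The parameter names for the S3 publisher and the local directory source are assumptions and should be checked against the respective component documentation:

# Docker engine (Image/Entrypoint/Parameters follow the job specs shown in this guide)
Engine:
  Type: docker
  Params:
    Image: ubuntu:latest
    Entrypoint:
      - /bin/bash
    Parameters:
      - -c
      - echo hello

# S3 publisher (Bucket/Key/Region mirror the S3 source params used in this guide; exact keys may differ)
Publisher:
  Type: s3
  Params:
    Bucket: my-results-bucket
    Key: results/
    Region: us-east-1

# Local directory input source (type and parameter names are assumptions)
InputSources:
  - Target: /inputs
    Source:
      Type: localDirectory
      Params:
        SourcePath: /data/files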

Remember, the exact keys and values in the Params map will vary depending on the specific requirements of the component being configured. Always refer to the individual component's documentation to understand the available parameters.

Convert CSV To Parquet Or Avro

Introduction​

Converting from CSV to parquet or avro reduces the size of the file and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on ipfs or on the web without the need to download files and install dependencies locally.

In this example tutorial, we will convert a CSV file from a URL to Parquet format and save the converted Parquet file to IPFS.

Prerequisites​

Running CSV to Avro or Parquet Locally​​

Downloading the CSV file​

Let's download the transactions.csv file:

Writing the Script​

Write the converter.py Python script, that serves as a CSV converter to Avro or Parquet formats:

Installing Dependencies​

Converting CSV file to Parquet format​

In our case:
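The original command is not shown in this export; based on the argument order used in the Bacalhau job later in this example (input file, output file, output format), a local conversion might look like this sketch:

python3 converter.py transactions.csv transactions.parquet parquet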

Viewing the parquet file:​

Containerize Script with Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

Build the container​

We will run the docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is not required, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job​

With the command below, we are mounting the CSV file for transactions from IPFS
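The original command is not reproduced in this export; based on the components listed below, it might look roughly like this (a sketch; the --id-only and --wait flags follow the pattern of the other examples in this guide):

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W \
    jsacex/csv-to-arrow-or-parquet \
    -- python3 src/converter.py ../inputs/transactions.csv /outputs/transactions.parquet parquet)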

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. jsacex/csv-to-arrow-or-parquet: the name and the tag of the docker image we are using

  4. ../inputs/transactions.csv : path to input dataset

  5. /outputs/transactions.parquet parquet: the path to the output file, followed by the output format

  6. python3 src/converter.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. convertcsv.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Alternatively, you can do this:

Support​


Data source can be specified via --input flag, see the for more details

Structure of the command

You can find out more about the which is designed to simplify the data uploading process.

For more details, see the

Checking the State of Your Jobs

Viewing your Job Output

Get the CID From the Completed Job

Use the CID in a New Bacalhau Job

Need Support?

For questions and feedback, please reach out in our

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Running a Bacalhau Job

Structure of the Command

This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the for more details.

Checking the State of your Jobs

Viewing your Job Output

Extract Result CID

Publishing Results to S3-Compatible Destinations

Publisher Spec

Example Usage

Content Identification

Support for the S3-compatible storage provider

Need Support?

For questions, feedback, please reach out in our

Bacalhau supports running jobs as a WebAssembly (WASM) program. This example demonstrates how to compile a Rust project into WebAssembly and run the program on Bacalhau.

To get started, you need to install the Bacalhau client, see more information .

A working Rust installation with the wasm32-wasi target. For example, you can use rustup to install Rust and configure it to build WASM targets. For those using the notebook, these are installed in hidden cells below.

The program below will use the Rust imageproc crate to resize an image through seam carving, based on an example from their repository.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

When you submit a Bacalhau job, you must specify the internet locations from which to download data and write results. Both Docker and WebAssembly jobs support these features.

When submitting a Bacalhau job, you can specify the CID (Content IDentifier) or HTTP(S) URL from which to download data. The data will be retrieved before the job starts and made available to the job as a directory on the filesystem. When running Bacalhau jobs, you can specify as many CIDs or URLs as needed using the --input flag, which is accepted by both bacalhau docker run and bacalhau wasm run. See the command line flags for more information.

You can write back results from your Bacalhau jobs to your public storage location. By default, jobs will write results to the storage provider using the --publisher command line flag. See the command line flags on how to configure this.

Jobs will be provided with http_proxy and https_proxy environment variables, which contain a TCP address of an HTTP proxy to connect through. Most tools and libraries will use these environment variables by default. If not, they must be used by user code to configure HTTP proxy usage.

The public compute nodes provided by the Bacalhau network will accept jobs that require HTTP networking as long as the domains are from this allowlist.

If you need to access a domain that isn't on the allowlist, you can make a request to the Bacalhau Project team to include your required domains. You can also set up your own compute node that implements the allowlist you need.

To view the configuration that bacalhau will receive when a command is executed against it, users can run the bacalhau config list command. Users who wish to see Bacalhau’s config represented as YAML may run bacalhau config list --output=yaml.

The bacalhau config set command.

The --config (or -c) flag allows flexible configuration of bacalhau through various methods. You can use this flag multiple times to combine different configuration sources. To specify a config file to bacalhau, users may use the --config flag, passing a path to a config file for bacalhau to use. When this flag is provided, bacalhau will not search for a default config, and will instead use the configuration provided to it by the --config flag.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Starting from v1.3.0, Bacalhau uses NATS.io, a powerful open-source messaging system designed to streamline communication across complex network environments, to communicate with other nodes on the network.

Start by creating a secure token. This token will be used for authentication between the orchestrator and compute nodes during their communications:

If you’re interested in learning more about distributed computing and how it can benefit your work, there are several ways to connect with us. Visit our website, sign up to our bi-weekly office hour, join our Slack, or send us a message.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

To submit a workload to Bacalhau, we will use the bacalhau docker run command. This command allows one to pass an input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

The same job can be presented in the declarative format. In this case, the description will look like this:

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

The bacalhau docker run command allows passing an input data volume with the --input or -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the fugacity of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.

To get started, you need to install the Bacalhau client, see more information

For the purposes of this example we will use the dataset in the "Gridded" format from the and long-term global sea surface temperature data from - information about that dataset can be found .

To convert the data from fugacity of CO2 (fCO2) to partial pressure of CO2 (pCO2) we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the pyseaflux package.

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

For more information about working with custom containers, see the .

The same job can be presented in the declarative format. In this case, the description will look like this:

Checking the State of your Jobs

Viewing your Job Output

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Full Docker spec can be found .

Full S3 Publisher can be found .

Full local source can be found .

To get started, you need to install the Bacalhau client, see more information

You can use the CSV files from

You can find out more information about converter.py

You can skip this section entirely and directly go to

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow the instructions to create one, and use the username of the account you created

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1 \
    --publisher ipfs \
    --entrypoint mogrify \
    dpokidov/imagemagick:7.1.0-47-ubuntu \
    -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg')
name: Simple Image Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: dpokidov/imagemagick:7.1.0-47-ubuntu
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - magick mogrify -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/input_images"
      Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"
bacalhau job run image.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results
FROM mcr.microsoft.com/vscode/devcontainers/python:3.9

RUN apt-get update && apt-get install -y nodejs npm g++

# Install dbt
RUN pip3 --disable-pip-version-check --no-cache-dir install duckdb==0.4.0 dbt-duckdb==1.1.4 \
    && rm -rf /tmp/pip-tmp

# Install duckdb cli
RUN wget https://github.com/duckdb/duckdb/releases/download/v0.4.0/duckdb_cli-linux-amd64.zip \
    && unzip duckdb_cli-linux-amd64.zip -d /usr/local/bin \
    && rm duckdb_cli-linux-amd64.zip

# Configure Workspace
ENV DBT_PROFILES_DIR=/workspaces/datadex
WORKDIR /workspaces/datadex
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t davidgasquez/datadex:v0.2.0
docker push <hub-user>/<repo-name>:<tag>
docker push davidgasquez/datadex:v0.2.0
export JOB_ID=$(bacalhau docker run \
davidgasquez/datadex:v0.2.0 \
--  duckdb -s "select 1")
name: DuckDB Hello World
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: davidgasquez/datadex:v0.2.0
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select 1"
bacalhau job run duckdb1.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout  # displays the contents of the file
┌───┐
│ 1 │
├───┤
│ 1 │
└───┘
export JOB_ID=$(bacalhau docker run \
 -i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \
  --workdir /inputs \
  --id-only \
  --wait \
  davidgasquez/duckdb:latest \
  -- duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'")
name: DuckDB Parquet Query
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: davidgasquez/duckdb:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - duckdb -s "select count(*) from '0_yellow_taxi_trips.parquet'"
    InputSources:
    - Target: "/inputs"
      Source:
        Type: "s3"
        Params:
          Bucket: "bacalhau-duckdb"
          Key: "*"
          Region: "us-east-1"
bacalhau job run duckdb2.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/stdout
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     24648499 │
└──────────────┘
wget -q -O file.tar.gz https://w3s.link/ipfs/bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
tar -xvf file.tar.gz
pip install pandas
# Use pandas to read in transaction data and clean up the columns
import pandas as pd
import glob

file = glob.glob('output_*/transactions/start_block=*/end_block=*/transactions*.csv')[0]
print("Loading file %s" % file)
df = pd.read_csv(file)
df['value'] = df['value'].astype('float')
df['from_address'] = df['from_address'].astype('string')
df['to_address'] = df['to_address'].astype('string')
df['hash'] = df['hash'].astype('string')
df['block_hash'] = df['block_hash'].astype('string')
df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
df.info()

# Total volume per day
df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum().plot()
# main.py
import glob, os, sys, shutil, tempfile
import pandas as pd

def main(input_dir, output_dir):
    search_path = os.path.join(input_dir, "output*", "transactions", "start_block*", "end_block*", "transactions_*.csv")
    csv_files = glob.glob(search_path)
    if len(csv_files) == 0:
        print("No CSV files found in %s" % search_path)
        sys.exit(1)
    for transactions_file in csv_files:
        print("Loading %s" % transactions_file)
        df = pd.read_csv(transactions_file)
        df['value'] = df['value'].astype('float')
        df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
        
        print("Processing %d blocks" % (df.shape[0]))
        results = df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum()
        print("Finished processing %d days worth of records" % (results.shape[0]))

        save_path = os.path.join(output_dir, os.path.basename(transactions_file))
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        print("Saving to %s" % (save_path))
        results.to_csv(save_path)

def extractData(input_dir, output_dir):
    search_path = os.path.join(input_dir, "*.tar.gz")
    gz_files = glob.glob(search_path)
    if len(gz_files) == 0:
        print("No tar.gz files found in %s" % search_path)
        sys.exit(1)
    for f in gz_files:
        shutil.unpack_archive(filename=f, extract_dir=output_dir)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print('Must pass arguments. Format: [command] input_dir output_dir')
        sys.exit()
    with tempfile.TemporaryDirectory() as tmp_dir:
        extractData(sys.argv[1], tmp_dir)
        main(tmp_dir, sys.argv[2])
python main.py . outputs/
FROM python:3.11-slim-bullseye
WORKDIR /src
RUN pip install pandas==1.5.1
ADD main.py .
CMD ["python", "main.py", "/inputs", "/outputs"]
docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.1 .
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input ipfs://bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6)
name: Ethereum Blockchain Analysis with Ethereum-ETL
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs/data.tar.gz"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq"
bacalhau job run blockchain.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls -lah results/outputs
import glob
import pandas as pd

# Get CSV files list from a folder
csv_files = glob.glob("results/outputs/*.csv")
df = pd.read_csv(csv_files[0], index_col='block_datetime')
df.plot()
printf "" > job_ids.txt
for h in $(cat hashes.txt); do \
    bacalhau docker run \
    --id-only \
    --wait=false \
    --input=ipfs://$h:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6 >> job_ids.txt 
done
cat job_ids.txt

d840df7b-9318-4e5b-ab06-adb72dd95394
09d01f9c-9409-42b9-829d-92e22fcdd062
0072758f-3575-44d7-b193-da4a22f6bc86
2043dee4-fc82-4768-92cb-4d23dd2514b1
36ef8e9e-9eae-4218-81e6-15883d0a5b8d
932aa406-cd29-4933-b09f-c8cea4d77164
1f3e5273-bdd4-4ef0-b7ed-b83591fab64e
8bfabe96-54e3-4fee-b344-a0517c683268
1cd588a1-5c76-4f91-ba90-af7931bca596
b9c29531-e1b4-4520-b03d-7406a22bbdb3
8665b8be-24a9-4c78-9913-803d3e3c9a65
06115147-bc83-49e8-bb71-7b447c8ad1bc
84afed3e-831c-462b-a3e3-9a23bc7d6fb8
ed6e55e6-98d3-4bde-8ece-1f05838d489e
...
bacalhau job list -n 50
for id in $(cat job_ids.txt); do \
    rm -rf results_$id && mkdir results_$id
    bacalhau job get --output-dir results_$id $id &
done
wait
import os, glob
import pandas as pd

# Get CSV files list from a folder
path = os.path.join("results_*", "outputs", "*.csv")
csv_files = glob.glob(path)

# Read each CSV file into a list of DataFrames
df_list = (pd.read_csv(file, index_col='block_datetime') for file in csv_files)

# Concatenate all DataFrames
df_unsorted = pd.concat(df_list, ignore_index=False)

# Some files will cross days, so group by day and sum the values
df = df_unsorted.groupby(level=0).sum()

# Plot
df.plot(figsize=(16,9))
rm -rf results_* output_* outputs results temp # Remove temporary results
# hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm
bafybeicgo3iofo3sw73wenc3nkdhi263yytjnds5cxjwvypwekbz4sk7ra
bafybeihvep5xsvxm44lngmmeysihsopcuvcr34an4idz45ixl5slsqzy3y
bafybeigmt2zwzrbzwb4q2kt2ihlv34ntjjwujftvabrftyccwzwdypama4
bafybeiciwui7sw3zqkvp4d55p4woq4xgjlstrp3mzxl66ab5ih5vmeozci
bafybeicpmotdsj2ambf666b2jkzp2gvg6tadr6acxqw2tmdlmsruuggbbu
bafybeigefo3esovbveavllgv5wiheu5w6cnfo72jxe6vmfweco5eq5sfty
bafybeigvajsumnfwuv7lp7yhr2sr5vrk3bmmuhhnaz53waa2jqv3kgkvsu
bafybeih2xg2n7ytlunvqxwqlqo5l3daykuykyvhgehoa2arot6dmorstmq
bafybeihnmq2ltuolnlthb757teihwvvw7wophoag2ihnva43afbeqdtgi4
bafybeibb34hzu6z2xgo6nhrplt3xntpnucthqlawe3pmzgxccppbxrpudy
bafybeigny33b4g6gf2hrqzzkfbroprqrimjl5gmb3mnsqu655pbbny6tou
bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
bafybeibryqj62l45pxjhdyvgdc44p3suhvt4xdqc5jpx474gpykxwgnw2e
bafybeidme3fkigdjaifkjfbwn76jk3fcqdogpzebtotce6ygphlujaecla
bafybeig7myc3eg3h2g5mk2co7ybte4qsuremflrjneer6xk3pghjwmcwbi
bafybeic3x2r5rrd3fdpdqeqax4bszcciwepvbpjl7xdv6mkwubyqizw5te
bafybeihxutvxg3bw7fbwohq4gvncrk3hngkisrtkp52cu7qu7tfcuvktnq
bafybeicumr67jkyarg5lspqi2w4zqopvgii5dgdbe5vtbbq53mbyftduxy
bafybeiecn2cdvefvdlczhz6i4afbkabf5pe5yqrcsgdvlw5smme2tw7em4
bafybeiaxh7dhg4krgkil5wqrv5kdsc3oewwy6ym4n3545ipmzqmxaxrqf4
bafybeiclcqfzinrmo3adr4lg7sf255faioxjfsolcdko3i4x7opx7xrqii
bafybeicjmeul7c2dxhmaudawum4ziwfgfkvbgthgtliggfut5tsc77dx7q
bafybeialziupik7csmhfxnhuss5vrw37kmte7rmboqovp4cpq5hj4insda
bafybeid7ecwdrw7pb3fnkokq5adybum6s5ok3yi2lw4m3edjpuy65zm4ji
bafybeibuxwnl5ogs4pwa32xriqhch24zbrw44rp22hrly4t6roh6rz7j4m
bafybeicxvy47jpvv3fi5umjatem5pxabfrbkzxiho7efu6mpidjpatte54
bafybeifynb4mpqrbsjbeqtxpbuf6y4frrtjrc4tm7cnmmui7gbjkckszrq
bafybeidcgnbhguyfaahkoqbyy2z525d3qfzdtbjuk4e75wkdbnkcafvjei
bafybeiefc67s6hpydnsqdgypbunroqwkij5j26sfmc7are7yxvg45uuh7i
bafybeiefwjy3o42ovkssnm7iihbog46k5grk3gobvvkzrqvof7p6xbgowi
bafybeihpydd3ivtza2ql5clatm5fy7ocych7t4czu46sbc6c2ykrbwk5uu
bafybeiet7222lqfmzogur3zlxqavlnd3lt3qryw5yi5rhuiqeqg4w7c3qu
bafybeihwomd4ygoydvj5kh24wfwk5kszmst5vz44zkl6yibjargttv7sly
bafybeidbjt2ckr4oooio3jsfk76r3bsaza5trjvt7u36slhha5ksoc5gv4
bafybeifyjrmopgtfmswq7b4pfscni46doy3g3z6vi5rrgpozc6duebpmuy
bafybeidsrowz46yt62zs64q2mhirlc3rsmctmi3tluorsts53vppdqjj7e
bafybeiggntql57bw24bw6hkp2yqd3qlyp5oxowo6q26wsshxopfdnzsxhq
bafybeidguz36u6wakx4e5ewuhslsfsjmk5eff5q7un2vpkrcu7cg5aaqf4
bafybeiaypwu2b45iunbqnfk2g7bku3nfqveuqp4vlmmwj7o7liyys42uai
bafybeicaahv7xvia7xojgiecljo2ddrvryzh2af7rb3qqbg5a257da5p2y
bafybeibgeiijr74rcliwal3e7tujybigzqr6jmtchqrcjdo75trm2ptb4e
bafybeiba3nrd43ylnedipuq2uoowd4blghpw2z7r4agondfinladcsxlku
bafybeif3semzitjbxg5lzwmnjmlsrvc7y5htekwqtnhmfi4wxywtj5lgoe
bafybeiedmsig5uj7rgarsjans2ad5kcb4w4g5iurbryqn62jy5qap4qq2a
bafybeidyz34bcd3k6nxl7jbjjgceg5eu3szbrbgusnyn7vfl7facpecsce
bafybeigmq5gch72q3qpk4nipssh7g7msk6jpzns2d6xmpusahkt2lu5m4y
bafybeicjzoypdmmdt6k54wzotr5xhpzwbgd3c4oqg6mj4qukgvxvdrvzye
bafybeien55egngdpfvrsxr2jmkewdyha72ju7qaaeiydz2f5rny7drgzta
mkdir -p inputs
curl -L --output ./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip https://www.socat.info/socat_files/v2022/SOCATv2022_tracks_gridded_monthly.nc.zip
curl --output ./inputs/sst.mnmean.nc https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/sst.mnmean.nc
# requirements.txt
Bottleneck==1.3.5
dask==2022.2.0
fsspec==2022.5.0
netCDF4==1.6.0
numpy==1.21.6
pandas==1.3.5
pip==22.1.2
pyseaflux==2.2.1
scipy==1.7.3
xarray==0.20.2
zarr>=2.0.0
pip install -r requirements.txt > /dev/null
import fsspec # for reading remote files
import xarray as xr

# Open the zip archive using fsspec and load the data into xarray.Dataset
with fsspec.open("./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip", compression='zip') as fp:
    ds = xr.open_dataset(fp)

# Display information about the dataset    
ds.info()
time_slice = slice("2010", "2020") # select a decade
res = ds['sst_ave_unwtd'].sel(tmnth=time_slice).mean(dim='tmnth') # compute the mean for this period
res.plot() # plot the result
# main.py
import fsspec
import xarray as xr
import pandas as pd
import numpy as np
import pyseaflux


def lon_360_to_180(ds=None, lonVar=None):
    lonVar = "lon" if lonVar is None else lonVar
    return (ds.assign_coords({lonVar: (((ds[lonVar] + 180) % 360) - 180)})
            .sortby(lonVar)
            .astype(dtype='float32', order='C'))


def center_dates(ds):
    # start and end date
    start_date = str(ds.time[0].dt.strftime('%Y-%m').values)
    end_date = str(ds.time[-1].dt.strftime('%Y-%m').values)

    # monthly dates centered on 15th of each month
    dates = pd.date_range(start=f'{start_date}-01T00:00:00.000000000',
                          end=f'{end_date}-01T00:00:00.000000000',
                          freq='MS') + np.timedelta64(14, 'D')

    return ds.assign(time=dates)


def get_and_process_sst(url=None):
    # get noaa sst
    if url is None:
        url = ("/inputs/sst.mnmean.nc")

    with fsspec.open(url) as fp:
        ds = xr.open_dataset(fp)
        ds = lon_360_to_180(ds)
        ds = center_dates(ds)
        return ds


def get_and_process_socat(url=None):
    if url is None:
        url = ("/inputs/SOCATv2022_tracks_gridded_monthly.nc.zip")

    with fsspec.open(url, compression='zip') as fp:
        ds = xr.open_dataset(fp)
        ds = ds.rename({"xlon": "lon", "ylat": "lat", "tmnth": "time"})
        ds = center_dates(ds)
        return ds


def main():
    print("Load SST and SOCAT data")
    ds_sst = get_and_process_sst()
    ds_socat = get_and_process_socat()

    print("Merge datasets together")
    time_slice = slice("1981-12", "2022-05")
    ds_out = xr.merge([ds_sst['sst'].sel(time=time_slice),
                       ds_socat['fco2_ave_unwtd'].sel(time=time_slice)])

    print("Calculate pco2 from fco2")
    ds_out['pco2_ave_unwtd'] = xr.apply_ufunc(
        pyseaflux.fCO2_to_pCO2,
        ds_out['fco2_ave_unwtd'],
        ds_out['sst'])

    print("Add metadata")
    ds_out['pco2_ave_unwtd'].attrs['units'] = 'uatm'
    ds_out['pco2_ave_unwtd'].attrs['notes'] = ("calculated using "
                                               "NOAA OI SST V2 "
                                               "and pyseaflux package")

    print("Save data")
    ds_out.to_zarr("/processed.zarr")
    import shutil
    shutil.make_archive("/outputs/processed.zarr", 'zip', "/processed.zarr")
    print("Zarr file written to disk, job completed successfully")

if __name__ == "__main__":
    main()
FROM python:slim

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y --no-install-recommends \
    g++ \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /project

COPY ./requirements.txt /project

RUN pip3 install -r requirements.txt

COPY ./main.py /project

CMD ["python","main.py"]
docker build -t <hub-user>/<repo-name>:<tag> .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
    --memory 10Gb \
    ghcr.io/bacalhau-project/examples/socat:0.0.11 \
    -- python main.py)
name: Oceanography
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/socat:0.0.11
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python main.py
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y"
    Resources:
        Memory: 10gb
bacalhau job run ocean.yaml
bacalhau job list --id-filter ${JOB_ID}
bacalhau job describe ${JOB_ID}
rm -rf results
mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results
ls results/outputs

processed.zarr.zip
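If you want to inspect the processed data locally, the zipped Zarr store can be unpacked and opened with xarray. A minimal sketch, assuming unzip is available and your Python environment has xarray and zarr installed (the paths follow the download step above):

unzip -q results/outputs/processed.zarr.zip -d results/outputs/processed.zarr
python3 -c "import xarray as xr; print(xr.open_zarr('results/outputs/processed.zarr'))"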
{
  "Type": "docker",
  "Params": {
    "Image": "my_app_image",
    "Entrypoint": "my_app_entrypoint",
  }
}
{
  "Type": "s3",
  "Params": {
    "Bucket": "my_bucket",
    "Region": "us-west-1"
  }
}
{
  "Type": "localDirectory",
  "Params": {
    "SourcePath": "/path/to/local/directory",
    "ReadWrite": true,
  }
}
wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv
# converter.py
import os
import sys
from abc import ABCMeta, abstractmethod

import fastavro
import numpy as np
import pandas as pd
from pyarrow import Table, parquet


class BaseConverter(metaclass=ABCMeta):
    """
    Base class for converters.

    Validate received parameters for future use.
    """
    def __init__(
        self,
        csv_file_path: str,
        target_file_path: str,
    ) -> None:
        self.csv_file_path = csv_file_path
        self.target_file_path = target_file_path

    @property
    def csv_file_path(self):
        return self._csv_file_path

    @csv_file_path.setter
    def csv_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        _, extension = os.path.splitext(path)
        if not os.path.isfile(path) or extension != '.csv':
            raise FileNotFoundError(
                f'No such csv file: {path}'
            )
        self._csv_file_path = path

    @property
    def target_file_path(self):
        return self._target_file_path

    @target_file_path.setter
    def target_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        target_dir = os.path.dirname(path)
        if not os.path.isdir(target_dir):
            raise FileNotFoundError(
                f'No such directory: {target_dir}\n'
                'Choose an existing directory or create one for the result file.'
            )
        if os.path.isfile(path):
            raise FileExistsError(
                f'File {path} already exists. '
                'Usage of an existing file may result in data loss.'
            )
        self._target_file_path = path

    def get_csv_reader(self):
        """Return csv reader which read csv file as a stream"""
        return pd.read_csv(
            self.csv_file_path,
            iterator=True,
            chunksize=100000
        )

    @abstractmethod
    def convert(self):
        """Should be implemented in child class"""
        pass


class ParquetConverter(BaseConverter):
    """
    Convert received csv file to parquet file.

    Take path to csv file and path to result file.
    """
    def convert(self):
        """Read csv file as a stream and write data to parquet file."""
        csv_reader = self.get_csv_reader()
        writer = None
        for chunk in csv_reader:
            if not writer:
                table = Table.from_pandas(chunk)
                writer = parquet.ParquetWriter(
                    self.target_file_path, table.schema
                )
            table = Table.from_pandas(chunk)
            writer.write_table(table)
        writer.close()


class AvroConverter(BaseConverter):
    """
    Convert received csv file to avro file.

    Take path to csv file and path to result file.
    """
    NUMPY_TO_AVRO_TYPES = {
        np.dtype('?'): 'boolean',
        np.dtype('int8'): 'int',
        np.dtype('int16'): 'int',
        np.dtype('int32'): 'int',
        np.dtype('uint8'): 'int',
        np.dtype('uint16'): 'int',
        np.dtype('uint32'): 'int',
        np.dtype('int64'): 'long',
        np.dtype('uint64'): 'long',
        np.dtype('O'): ['null', 'string', 'float'],
        np.dtype('unicode_'): 'string',
        np.dtype('float32'): 'float',
        np.dtype('float64'): 'double',
        np.dtype('datetime64'): {
            'type': 'long',
            'logicalType': 'timestamp-micros'
        },
    }

    def get_avro_schema(self, pandas_df):
        """Generate avro schema."""
        column_dtypes = pandas_df.dtypes
        schema_name = os.path.basename(self.target_file_path)
        schema = {
            'type': 'record',
            'name': schema_name,
            'fields': [
                {
                    'name': name,
                    'type': AvroConverter.NUMPY_TO_AVRO_TYPES[dtype]
                } for (name, dtype) in column_dtypes.items()
            ]
        }
        return fastavro.parse_schema(schema)

    def convert(self):
        """Read csv file as a stream and write data to avro file."""
        csv_reader = self.get_csv_reader()
        schema = None
        with open(self.target_file_path, 'a+b') as f:
            for chunk in csv_reader:
                if not schema:
                    schema = self.get_avro_schema(chunk)
                fastavro.writer(
                    f,
                    schema=schema,
                    records=chunk.to_dict('records')
                )


if __name__ == '__main__':
    converters = {
        'parquet': ParquetConverter,
        'avro': AvroConverter
    }
    csv_file, result_path, result_type = sys.argv[1], sys.argv[2], sys.argv[3]
    if result_type.lower() not in converters:
        raise ValueError(
            'Invalid target type. Available types: avro, parquet.'
        )
    converter = converters[result_type.lower()](csv_file, result_path)
    converter.convert()
pip install fastavro numpy pandas pyarrow
python converter.py <path_to_csv> <path_to_result_file> <extension>
python3 converter.py transactions.csv transactions.parquet parquet
import pandas as pd
pd.read_parquet('transactions.parquet').head()
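The same converter can also produce Avro. As a quick sanity check, here is a minimal sketch that converts again and reads the first record back with fastavro (installed in the pip step above; the output file name is arbitrary):

python3 converter.py transactions.csv transactions.avro avro
python3 -c "import fastavro; print(next(iter(fastavro.reader(open('transactions.avro', 'rb')))))"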
FROM python:3.8

RUN apt update && apt install -y git

RUN git clone https://github.com/bacalhau-project/Sparkov_Data_Generation

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsacex/csv-to-arrow-or-parquet .
docker push <hub-user>/<repo-name>:<tag>
docker push jsacex/csv-to-arrow-or-parquet
export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W  \
    --wait \
    --id-only \
    --output outputs:/outputs \
    --publisher ipfs \
    jsacex/csv-to-arrow-or-parquet \
    -- python3 src/converter.py ../inputs/transactions.csv  /outputs/transactions.parquet parquet)
name: Convert CSV To Parquet Or Avro
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/csv-to-arrow-or-parquet
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 src/converter.py ../inputs/transactions.csv  ../outputs/transactions.parquet parquet
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W"
bacalhau job run convertcsv.yaml
bacalhau job list --id-filter ${JOB_ID} 
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
ls results/outputs

transactions.parquet
import pandas as pd
import os
pd.read_parquet('results/outputs/transactions.parquet')

Speech Recognition using Whisper

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Creators are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. In this example, we will transcribe an audio clip locally, containerize the script and then run the container on Bacalhau.

The advantage of using Bacalhau over managed Automatic Speech Recognition services is that you can run your own containers, which can scale to batch process petabytes of video or audio for automatic speech recognition.

bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

To get started, you need to install:

  1. Whisper.

  2. PyTorch.

  3. Python Pandas.

pip install git+https://github.com/openai/whisper.git
pip install torch==1.10.1
pip install --upgrade  pandas
sudo apt update && sudo apt install ffmpeg

Before we create and run the script we need a sample audio file to test the code. For that we download a sample audio clip:

wget https://github.com/js-ts/hello/raw/main/hello.mp3

We will create a script that accepts parameters (input file path, output file path, temperature, etc.) and sets sensible defaults. If the input file is in mp4 format, the script first converts it to wav. The transcript can be saved in various formats. Finally, the large model is loaded and the required parameters are passed to it.

This model is not limited to English or to transcription only; it supports many other languages and can also translate them into English.

Next, let's create an openai-whisper script:

#content of the openai-whisper.py file

import argparse
import os
import sys
import warnings
import whisper
from pathlib import Path
import subprocess
import torch
import shutil
import numpy as np
parser = argparse.ArgumentParser(description="OpenAI Whisper Automatic Speech Recognition")
parser.add_argument("-l",dest="audiolanguage", type=str,help="Language spoken in the audio, use Auto detection to let Whisper detect the language. Select from the following languages['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba'] ",default="English")
parser.add_argument("-p",dest="inputpath", type=str,help="Path of the input file",default="/hello.mp3")
parser.add_argument("-v",dest="typeverbose", type=str,help="Whether to print out the progress and debug messages. ['Live transcription', 'Progress bar', 'None']",default="Live transcription")
parser.add_argument("-g",dest="outputtype", type=str,help="Type of file to generate to record the transcription. ['All', '.txt', '.vtt', '.srt']",default="All")
parser.add_argument("-s",dest="speechtask", type=str,help="Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`). ['transcribe', 'translate']",default="transcribe")
parser.add_argument("-n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("-t",dest="decodingtemperature", type=int,help="Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.",default=0.15 )
parser.add_argument("-b",dest="beamsize", type=int,help="Number of Images",default=5)
parser.add_argument("-o",dest="output", type=str,help="Output Folder where to store the outputs",default="")

args=parser.parse_args()
device = torch.device('cuda:0')
print('Using device:', device, file=sys.stderr)

Model = 'large'
whisper_model =whisper.load_model(Model)
video_path_local = os.getcwd()+args.inputpath
file_name=os.path.basename(video_path_local)
output_file_path=args.output

if os.path.splitext(video_path_local)[1] == ".mp4":
    video_path_local_wav =os.path.splitext(file_name)[0]+".wav"
    result  = subprocess.run(["ffmpeg", "-i", str(video_path_local), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local_wav)])

# add language parameters
# Language spoken in the audio, use Auto detection to let Whisper detect the language.
#  ['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba']
language = args.audiolanguage
# Whether to print out the progress and debug messages.
# ['Live transcription', 'Progress bar', 'None']
verbose = args.typeverbose
#  Type of file to generate to record the transcription.
# ['All', '.txt', '.vtt', '.srt']
output_type = args.outputtype
# Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`).
# ['transcribe', 'translate']
task = args.speechtask
# Temperature to use for sampling.
temperature = args.decodingtemperature
#  Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.
temperature_increment_on_fallback = 0.2
#  Number of candidates when sampling with non-zero temperature.
best_of = 5
#  Number of beams in beam search, only applicable when temperature is zero.
beam_size = args.beamsize
# Optional patience value to use in beam decoding, as in [*Beam Decoding with Controlled Patience*](https://arxiv.org/abs/2204.05424), the default (1.0) is equivalent to conventional beam search.
patience = 1.0
# Optional token length penalty coefficient (alpha) as in [*Google's Neural Machine Translation System*](https://arxiv.org/abs/1609.08144); set to a negative value to use simple length normalization.
length_penalty = -0.05
# Comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations.
suppress_tokens = "-1"
# Optional text to provide as a prompt for the first window.
initial_prompt = ""
# if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.
condition_on_previous_text = True
#  whether to perform inference in fp16.
fp16 = True
#  If the gzip compression ratio is higher than this value, treat the decoding as failed.
compression_ratio_threshold = 2.4
# If the average log probability is lower than this value, treat the decoding as failed.
logprob_threshold = -1.0
# If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence.
no_speech_threshold = 0.6

verbose_lut = {
    'Live transcription': True,
    'Progress bar': False,
    'None': None
}

args = dict(
    language = (None if language == "Auto detection" else language),
    verbose = verbose_lut[verbose],
    task = task,
    temperature = temperature,
    temperature_increment_on_fallback = temperature_increment_on_fallback,
    best_of = best_of,
    beam_size = beam_size,
    patience=patience,
    length_penalty=(length_penalty if length_penalty>=0.0 else None),
    suppress_tokens=suppress_tokens,
    initial_prompt=(None if not initial_prompt else initial_prompt),
    condition_on_previous_text=condition_on_previous_text,
    fp16=fp16,
    compression_ratio_threshold=compression_ratio_threshold,
    logprob_threshold=logprob_threshold,
    no_speech_threshold=no_speech_threshold
)

temperature = args.pop("temperature")
temperature_increment_on_fallback = args.pop("temperature_increment_on_fallback")
if temperature_increment_on_fallback is not None:
    temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))
else:
    temperature = [temperature]

if Model.endswith(".en") and args["language"] not in {"en", "English"}:
    warnings.warn(f"{Model} is an English-only model but receipted '{args['language']}'; using English instead.")
    args["language"] = "en"

video_transcription = whisper.transcribe(
    whisper_model,
    str(video_path_local),
    temperature=temperature,
    **args,
)

# Save output
writing_lut = {
    '.txt': whisper.utils.write_txt,
    '.vtt': whisper.utils.write_vtt,
    '.srt': whisper.utils.write_srt,
}

if output_type == "All":
    for suffix, write_suffix in writing_lut.items():
        transcript_local_path =os.getcwd()+output_file_path+'/'+os.path.splitext(file_name)[0] +suffix
        with open(transcript_local_path, "w", encoding="utf-8") as f:
            write_suffix(video_transcription["segments"], file=f)
        try:
            transcript_drive_path =file_name
        except:
            print(f"**Transcript file created: {transcript_local_path}**")
else:
    transcript_local_path =output_file_path+'/'+os.path.splitext(file_name)[0] +output_type

    with open(transcript_local_path, "w", encoding="utf-8") as f:
        writing_lut[output_type](video_transcription["segments"], file=f)

Let's run the script with the default parameters:

python openai-whisper.py

To view the outputs, execute following:

cat hello.srt
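Because the default output type is All, the script also writes .txt and .vtt transcripts next to the .srt file, so you can inspect those in the same way:

cat hello.txt
cat hello.vtt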

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

FROM  pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

WORKDIR /

RUN apt-get -y update

RUN apt-get -y install git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow

RUN pip install git+https://github.com/openai/whisper.git

ADD hello.mp3 hello.mp3

ADD openai-whisper.py openai-whisper.py

RUN python openai-whisper.py

We choose pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime as our base image.

Then we install all the dependencies. After that, we add the test audio file and our openai-whisper script to the container, and run a test command to check that the script works inside the container and that the container builds successfully.

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

  1. repo-name with the name of the container, you can name it anything you want

  2. tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/whisper .

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/whisper

After the dataset has been uploaded, copy the CID:

bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu
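The upload step itself is not shown here. As a minimal sketch, assuming you have a local IPFS node (any pinning service works just as well), you could add the media file and note the CID it prints; -w wraps the file in a directory so it can be mounted at /inputs/<filename>:

ipfs add -w Apollo_11_moonwalk_montage_720p.mp4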

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu flag mounts the CID which contains our file to the container at the path /inputs

  4. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  5. jsacex/whisper: the name and the tag of the docker image we are using

  6. python openai-whisper.py: execute the script with the following parameters:

    1. -p inputs/Apollo_11_moonwalk_montage_720p.mp4 : the input path of our file

    2. -o outputs: the path where to store the outputs

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

name: Speech Recognition using Whisper
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/whisper:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c   
          - python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs
    Resources:
      GPU: "1"

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Now you can find the file in the results/outputs folder. To view it, run the following command:

cat results/outputs/Apollo_11_moonwalk_montage_720p.vtt

Video Processing

Introduction

Many data engineering workloads are embarrassingly parallel: you want to run the same simple operation over a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.

Prerequisite​

Upload the Data to IPFS​

This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j.
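The upload itself is not shown here. A minimal sketch, assuming a local IPFS node and a folder of .mp4 files called ./videos (both are assumptions); the final line printed is the directory CID to pass to -i ipfs://<CID>:

ipfs add -r ./videos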

Running a Bacalhau Job​

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
    linuxserver/ffmpeg \
    -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}' )

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j: CIDs to use on the job. Mounts them at '/inputs' in the execution.

  3. linuxserver/ffmpeg: the name of the docker image we are using to resize the videos

  4. -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}': the command that will be executed inside the container. It uses find to locate all files with the extension ".mp4" within /inputs and then uses ffmpeg to resize each found file to 72 pixels in height, saving the results in the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Video Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: linuxserver/ffmpeg
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/inputs"
      Source:
        Type: "s3"
        Params:
          Bucket: "bacalhau-video-processing"
          Key: "*"
          Region: "us-east-1"

The job description should be saved in .yaml format, e.g. video.yaml, and then run with the command:

bacalhau job run video.yaml

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results

Viewing your Job Output​

To view the results open the results/outputs/ folder.
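For example, you can list the resized videos that were written there:

ls -lh results/outputs/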

Support​

Running Inference on Dolly 2.0 Model with Hugging Face

Introduction​

Dolly 2.0 is a groundbreaking, open-source, instruction-following Large Language Model (LLM) that has been fine-tuned on a human-generated instruction dataset and is licensed for both research and commercial purposes. Developed using the EleutherAI Pythia model family, this 12-billion-parameter language model is built exclusively on a high-quality, human-generated instruction-following dataset contributed by Databricks employees.

The Dolly 2.0 package is open source, including the training code, dataset, and model weights, all available for commercial use. This empowers organizations to create, own, and customize robust LLMs capable of engaging in human-like interactions, without paying API access fees or sharing data with third parties.

Running locally​

Prerequisites​

  1. A NVIDIA GPU

  2. Python

Installing dependencies​

pip -q install git+https://github.com/huggingface/transformers # need to install from github
pip -q install --upgrade accelerate # ensure you are using version 0.12.0 or higher

Create an inference.py file with following code:

# content of the inference.py file
import argparse
import torch
from transformers import pipeline

def main(prompt_string, model_version):

    # use dolly-v2-12b if you're using Colab Pro+, using pythia-2.8b for Free Colab
    generate_text = pipeline(model=model_version, 
                            torch_dtype=torch.bfloat16, 
                            trust_remote_code=True,
                            device_map="auto")

    print(generate_text(prompt_string))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", type=str, required=True, help="The prompt to be used in the GPT model")
    parser.add_argument("--model_version", type=str, default="./databricks/dolly-v2-12b", help="The model version to be used")
    args = parser.parse_args()
    main(args.prompt, args.model_version)

Building the container (optional)​

FROM huggingface/transformers-pytorch-deepspeed-nightly-gpu
RUN apt-get update -y
RUN pip -q install git+https://github.com/huggingface/transformers
RUN pip -q install "accelerate>=0.12.0"
COPY ./inference.py .
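If you do build the container yourself, you can tag and push it following the same placeholder pattern used in the other examples (replace <hub-user>, <repo-name> and <tag> with your own values; this step is optional, since a prebuilt image is used below):

docker build -t <hub-user>/<repo-name>:<tag> .
docker push <hub-user>/<repo-name>:<tag>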

Running Inference on Bacalhau​

Prerequisite​

Structure of the command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. bacalhau docker run: Run a job using docker executor.

  3. --gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.

  4. -w /inputs: Flag to set the working directory inside the container to /inputs.

  5. -i gitlfs://huggingface.co/databricks/dolly-v2-3b.git: Flag to clone the Dolly V2-3B model from Hugging Face's repository using Git LFS. The files will be mounted to /inputs/databricks/dolly-v2-3b.

  6. -i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py: Flag to download the inference.py script from the provided URL. The file will be mounted to /inputs/inference.py.

  7. jsacex/dolly_inference:latest: The name and the tag of the Docker image.

  8. The command to run inference on the model: python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b". It consists of:

    1. inference.py: The Python script that runs the inference process using the Dolly V2-3B model.

    2. --prompt "Where is Earth located ?": Specifies the text prompt to be used for the inference.

    3. --model_version "./databricks/dolly-v2-3b": Specifies the path to the Dolly V2-3B model. In this case, the model files are mounted to /inputs/databricks/dolly-v2-3b.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --id-only \
    -w /inputs \
    -i gitlfs://huggingface.co/databricks/dolly-v2-3b.git \
    -i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py \
    jsacex/dolly_inference:latest \
    -- python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b")

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished, we can see the results in the results/outputs folder.
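Since inference.py prints the generated text to standard output rather than writing a file, the response will typically be in the captured stdout that bacalhau job get downloads alongside the outputs (the path follows the layout used in the other examples):

cat results/stdout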

EasyOCR (Optical Character Recognition) on Bacalhau

Introduction​

TL;DR​

bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True

Running Easy OCR Locally​​

Install the required dependencies

pip install --upgrade easyocr

Load the different example images

npx degit JaidedAI/EasyOCR/examples -f

List all the images. You'll see an output like this:

ls -l

total 3508
-rw-r--r-- 1 root root   59898 Jun 16 22:36 chinese.jpg
-rw-r--r-- 1 root root   97910 Jun 16 22:36 easyocr_framework.jpeg
-rw-r--r-- 1 root root 1740957 Jun 16 22:36 english.png
-rw-r--r-- 1 root root  487995 Jun 16 22:36 example2.png
-rw-r--r-- 1 root root  127454 Jun 16 22:36 example3.png
-rw-r--r-- 1 root root  488641 Jun 16 22:36 example.png
-rw-r--r-- 1 root root  168376 Jun 16 22:36 french.jpg
-rw-r--r-- 1 root root   42159 Jun 16 22:36 japanese.jpg
-rw-r--r-- 1 root root  225531 Jun 16 22:36 korean.png
drwxr-xr-x 1 root root    4096 Jun 15 13:37 sample_data
-rw-r--r-- 1 root root   82229 Jun 16 22:36 thai.jpg
-rw-r--r-- 1 root root   34706 Jun 16 22:36 width_ths.png

Next, we create a reader to perform OCR. For each piece of detected text it returns the coordinates of a bounding rectangle along with the recognized text itself:

import easyocr
reader = easyocr.Reader(['th','en'])
# Doing OCR. Get bounding boxes.
bounds = reader.readtext('thai.jpg')
bounds

Containerize your Script using Docker​

git clone https://github.com/JaidedAI/EasyOCR
cd EasyOCR

Build the Container​

The docker build command builds Docker images from a Dockerfile.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

  1. repo-name with the name of the container, you can name it anything you want

  2. tag this is not required but you can use the latest tag

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

docker push <hub-user>/<repo-name>:<tag>

Running a Bacalhau Job to Generate Easy OCR output​

Prerequisite​

Now that we have an image on Docker Hub (either your own or the example image from this guide), we can use the container to run a job on Bacalhau.

Structure of the imperative command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. -i ipfs://bafybeibvc...... Mounts the model from IPFS

  6. -i https://raw.githubusercontent.com... Mounts the Input Image from a URL

  7. jsacex/easyocr the name and the tag of the docker image we are using

  8. -- easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True: execute the script with the following parameters:

    1. -l ch_sim: the name of the model

    2. -f ./inputs/chinese.jpg: path to the input Image or directory

    3. --detail=1: level of detail

    4. --gpu=True: we set this flag to true since we are running inference on a GPU. If you run this on a CPU, set this flag to false

export JOB_ID=$(bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: EasyOCR
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/easyocr" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True
    InputSources:
    - Source:
        Type: "urlDownload"
        Params:
          URL: "https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg"
      Target: "/inputs/chinese.jpg"
    - Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"
      Target: "/root/.EasyOCR/model/zh_sim_g2.pth"
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. easyocr.yaml, and then run with the command:

bacalhau job run easyocr.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder. You can view results by running following commands:

ls results # list the contents of the results directory
cat results/stdout # display the job's standard output

Stable Diffusion on a CPU

Introduction

This example demonstrates how to run Stable Diffusion on a CPU and submit the job to the Bacalhau demo network. The first section describes the development of the code and the container. The second section demonstrates how to run the job using Bacalhau.

This model generated the images presented on this page.

TL;DR​

bacalhau docker run ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
  -- python demo.py \
  --prompt "cod in space" \
  --output ../outputs/cod.png

Development​

Heads up! This example takes about 10 minutes to generate an image on an average CPU. Whilst this demonstrates it is possible, it might not be practical.

Prerequisites​

In order to run this example you need:

  1. A Debian-flavoured Linux (although you might be able to get it working on the newest machines)

Converting Stable Diffusion to a CPU Model Using OpenVINO​

Install Dependencies​

Note that these dependencies are only known to work on Ubuntu-based x64 machines.

sudo apt-get update
sudo apt-get install -y libgl1 libglib2.0-0 git-lfs

Clone the Repository and Dependencies​

The following commands clone the example repository, and other required repositories, and install the Python dependencies.

git clone https://github.com/js-ts/stable_diffusion.openvino
cd stable_diffusion.openvino
git lfs install
git clone https://huggingface.co/openai/clip-vit-large-patch14
git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino
pip3 install -r requirements.txt

Generate an Image​

Now that we have all the dependencies installed, we can call the demo.py wrapper, which is a simple CLI, to generate an image from a prompt.

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "hello" \
  --output hello.png

When the generation is complete, you can open the generated hello.png and see something like this:

Let's try another prompt and see what we get:

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "cat driving a car" \
  --output cat.png

Running Stable Diffusion (CPU) on Bacalhau​

Now we have a working example, we can convert it into a format that allows us to perform inference in a distributed environment.

FROM python:3.9.9-bullseye

WORKDIR /src

RUN apt-get update && \
    apt-get install -y \
    libgl1 libglib2.0-0 git-lfs

RUN git lfs install

COPY requirements.txt /src/

RUN pip3 install -r requirements.txt

COPY stable_diffusion_engine.py demo.py demo_web.py /src/
COPY data/ /src/data/

RUN git clone https://huggingface.co/openai/clip-vit-large-patch14
RUN git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino

# download models
RUN python3 demo.py --num-inference-steps 1 --prompt "test" --output /tmp/test.jpg

This container is using the python:3.9.9-bullseye image and the working directory is set. Next, the Dockerfile installs the same dependencies from earlier in this notebook. Then we add our custom code and pull the dependent repositories.

We've already pushed this image to GHCR, but for posterity, you'd use a command like this to update it:

docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 .

Prerequisites​

Generating an Image Using Stable Diffusion on Bacalhau​

To submit a job, you can use the Bacalhau CLI. The following command passes a prompt to the model and generates an image in the outputs directory.

This will take about 10 minutes to complete. Go grab a coffee. Or a beer. Or both. If you want to block and wait for the job to complete, add the --wait flag.

Furthermore, the container itself is about 15GB, so it might take a while to download on the node if it isn't cached.

Structure of the command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. bacalhau docker run: Run a job using docker executor.

  3. --id-only: Flag to print out only the job id

  4. ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1: The name and the tag of the Docker image.

  5. The command to run inference on the model: python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png. It consists of:

    1. demo.py: The Python script that runs the inference process.

    2. --prompt "First Humans On Mars": Specifies the text prompt to be used for the inference.

    3. --output ../outputs/mars.png: Specifies the path to the output image.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
    -- python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png)

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished we can see the results in the results/outputs folder.

Stable Diffusion on a GPU

TL;DR​

bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow"

Prerequisite​

Quick Test​

Here is an example of an image generated by this model.

bacalhau docker run \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "cod swimming through data"

Development​

Installing dependencies​

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet
pip install tensorflow tensorflow_addons ftfy --upgrade --quiet
pip install tqdm --upgrade
apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2

Testing the Code​

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image

generator = Text2Image( 
    img_height=512,
    img_width=512,
    jit_compile=False,  # You can try True as well (different performance profile)
)
img = generator.generate(
    "DSLR photograph of an astronaut riding a horse",
    num_steps=50,
    unconditional_guidance_scale=7.5,
    temperature=1,
    batch_size=1,
)
pil_img = Image.fromarray(img[0])
display(pil_img)

When running this code, if you check the GPU RAM usage, you'll see that it's sucked up many GBs, and depending on what GPU you're running, it may OOM (Out of memory) if you run this again.

You can try and reduce RAM usage by playing with batch sizes (although it is only set to 1 above!) or more carefully controlling the TensorFlow session.

To clear the GPU memory we will use numba. This won't be required when running in a single-shot manner.

pip install numba --upgrade
# clearing the GPU memory 
from numba import cuda 
device = cuda.get_current_device()
device.reset()

Write the Script​

#content of the main.py file

import argparse
from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image
import os
parser = argparse.ArgumentParser(description="Stable Diffusion")
parser.add_argument("--h",dest="height", type=int,help="height of the image",default=512)
parser.add_argument("--w",dest="width", type=int,help="width of the image",default=512)
parser.add_argument("--p",dest="prompt", type=str,help="Description of the image you want to generate",default="cat")
parser.add_argument("--n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("--u",dest="unconditionalGuidanceScale", type=float,help="Number of Steps",default=7.5)
parser.add_argument("--t",dest="temperature", type=int,help="Number of Steps",default=1)
parser.add_argument("--b",dest="batchSize", type=int,help="Number of Images",default=1)
parser.add_argument("--o",dest="output", type=str,help="Output Folder where to store the Image",default="./")

args=parser.parse_args()
height=args.height
width=args.width
prompt=args.prompt
numSteps=args.numSteps
unconditionalGuidanceScale=args.unconditionalGuidanceScale
temperature=args.temperature
batchSize=args.batchSize
output=args.output

generator = Text2Image(
    img_height=height,
    img_width=width,
    jit_compile=False,  # You can try True as well (different performance profile)
)

img = generator.generate(
    prompt,
    num_steps=numSteps,
    unconditional_guidance_scale=unconditionalGuidanceScale,
    temperature=temperature,
    batch_size=batchSize,
)
for i in range(0,batchSize):
  pil_img = Image.fromarray(img[i])
  pil_img.save(f"{output}/image{i}.png")

Run the Script​

After writing the code the next step is to run the script.

python3 main.py

As a result, you will get something like this:

The following presents additional parameters you can try:

  1. python main.py --p "cat with three eyes" - to set the prompt

  2. python main.py --p "cat with three eyes" --n 100 - to set the number of iterations to 100

  3. python main.py --p "cat with three eyes" --b 2 - to set the batch size to 2 (number of images to generate)

Containerize Script using Docker​

FROM tensorflow/tensorflow:2.10.0-gpu

RUN apt-get -y update

RUN apt-get -y install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2 git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow tensorflow tensorflow_addons ftfy  --upgrade --quiet

RUN pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet

ADD main.py main.py

# Run once so it downloads and caches the pre-trained weights
RUN python main.py --n 1

Build the container​

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace the following:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case:

docker build -t ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1

Running a Bacalhau Job​

Structure of the command​

To submit a job run the Bacalhau command with following structure:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. The --id-only flag is set to print only job id

  4. ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1: the name and the tag of the docker image we are using

  5. -- python main.py --o ./outputs --p "meme about tensorflow": The command to run inference on the model. It consists of:

    1. main.py path to the script

    2. --o ./outputs specifies the output directory

    3. --p "meme about tensorflow" specifies the prompt

export JOB_ID=$(
    bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow")

This will take about 5 minutes to complete and is mainly due to the cold-start GPU setup time. This is faster than the CPU version, but you might still want to grab some fruit or plan your lunchtime run.

Furthermore, the container itself is about 10GB, so it might take a while to download on the node if it isn't cached.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder:

Object Detection with YOLOv5 on Bacalhau

Introduction​

Traditionally, models like YOLO required enormous amounts of training data to yield reasonable results. People might not have access to such high-quality labeled data. Thankfully, open-source communities and researchers have made it possible to utilize pre-trained models to perform inference. In other words, you can use models that have already been trained on large datasets to perform object detection on your own data.

TL;DR​

Load your dataset into S3/IPFS, specify it and pre-trained weights via the --input flags, choose a suitable container, specify the command and path to save the results - done!
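A minimal sketch of what that looks like in practice (this mirrors the test-run command explained later in this guide; the dataset CID is a placeholder you replace with your own):

bacalhau docker run \
  --gpu 1 \
  --input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
  --input ipfs://<CID-OF-YOUR-DATASET>:/datasets \
  ultralytics/yolov5:v6.2 \
  -- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source /datasets --project /outputs'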

Prerequisite​

Test Run with Sample Data​

To get started, let's run a test job with a small sample dataset that is included in the YOLOv5 Docker Image. This will give you a chance to familiarise yourself with the process of running a job on Bacalhau.

In addition to the usual Bacalhau flags, you will also see an example of using the --gpu 1 flag in order to specify the use of a GPU.

The model requires pre-trained weights to run and by default downloads them from within the container. Bacalhau jobs don't have network access so we will pass in the weights at submission time, saving them to /usr/src/app/yolov5s.pt. You may also provide your own weights here.

The container has its own options that we must specify:

  1. --project specifies the output volume that the model will save its results to. Bacalhau defaults to using /outputs as the output directory, so we save it there.

One final additional hack that we have to do is move the weights file to a location with the standard name. As of writing this, Bacalhau downloads the file to a UUID-named file, which the model is not expecting. This is because GitHub 302 redirects the request to a random file in its backend.

Structure of the command​

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. The --timeout flag is set to make sure that if the job is not completed in the specified time, it will be terminated

  4. The --wait flag is set to wait for the job to complete before returning

  5. The --wait-timeout-secs flag is set together with --wait to define how long the application should wait for the job to complete

  6. The --id-only flag is set to print only job id

  7. The --input flags are used to specify the sources of input data

  8. -- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs' tells the model where to find input data and where to write output

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs')

This should output a UUID (like 59c59bfb-4ef8-45ac-9f4b-f0e9afd26e70), which will be stored in the environment variable JOB_ID. This is the ID of the job that was created. You can check the status of the job using the commands below.

Declarative job description​

name: Object Detection with YOLOv5
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ultralytics/yolov5:v6.2
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "find /inputs -type f -exec cp {} /outputs/yolov5s.pt \\; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Source:
          Type: "urlDownload"
          Params:
            URL: "https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt"
        Target: "/inputs"

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing Output​

After the download has finished we can see the results in the results/outputs/exp folder.

Using Custom Images as an Input​

Let's run the same job again, but this time use the images above.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
--input ipfs://bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm:/datasets \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source /datasets --project /outputs')

Just as in the example above, this should output a UUID, which will be stored in the environment variable JOB_ID. You can check the status of the job using the commands below.

Support​

Stable Diffusion Checkpoint Inference

Introduction​

This example demonstrates how to run inference on a finetuned Stable Diffusion model. The first section describes the development of the code and the container - it is optional, as users don't need to build their own containers to use their own custom model. The second section demonstrates how to convert your model weights to ckpt. The third section demonstrates how to run the job using Bacalhau.

TL;DR​

  1. Convert your existing model weights to the ckpt format and upload to the IPFS storage.

  2. Create a job using bacalhau docker run, relevant docker image, model weights and any prompt.

  3. Download results using bacalhau job get and the job id.

Prerequisite​

To get started, you need to install:

  1. NVIDIA GPU

  2. CUDA drivers

  3. NVIDIA docker

Running Locally​

Containerize your Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions to containerize the code for inference.

FROM  pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime

WORKDIR /

RUN apt update &&  apt install -y git

RUN git clone https://github.com/runwayml/stable-diffusion.git

WORKDIR /stable-diffusion

RUN conda env create -f environment.yaml

SHELL ["conda", "run", "-n", "ldm", "/bin/bash", "-c"]

RUN pip install opencv-python

RUN apt update

RUN apt-get install ffmpeg libsm6 libxext6 libxrender-dev  -y

This container is using the pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime image and the working directory is set. Next the Dockerfile installs required dependencies. Then we add our custom code and pull the dependent repositories.

Build the container​

We will run docker build command to build the container.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

So in our case, the command will look like this:

docker build -t jsacex/stable-diffusion-ckpt .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

Thus, in this case, the command would look this way:

docker push jsacex/stable-diffusion-ckpt

After the repo image has been pushed to Docker Hub, you can now use the container for running on Bacalhau. But before that, you need to check whether your model is a ckpt file. If it is, you can skip to running on Bacalhau; if not, the next section describes how to convert your model into the ckpt format.

Converting model weights to CKPT​

To download the convert script:

wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

To convert the model weights into ckpt format, the --half flag cuts the size of the output model from 4GB to 2GB:

python3 convert_diffusers_to_original_stable_diffusion.py \
    --model_path <path-to-the-model-weights>  \
    --checkpoint_path <path-to-save-the-checkpoint>/model.ckpt \
    --half

Running a Bacalhau Job​

After the checkpoint file has been uploaded, copy its CID.
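If you haven't uploaded the checkpoint yet, one way to do it, assuming you have the IPFS CLI installed and a daemon running (a pinning service works just as well), is:

ipfs add model.ckpt
# prints a line like: added <CID> model.ckpt
# copy the CID and use it in the -i flag of the job below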

Structure of the command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. -i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt: Path to mount the checkpoint

  4. -- conda run --no-capture-output -n ldm: since we are using conda we need to specify the name of the environment which we are going to use, in this case it is ldm

  5. scripts/txt2img.py: running the python script

  6. --prompt "a photo of a person drinking coffee": the prompt you need to specify the session name in the prompt.

  7. --plms: the sampler you want to use. In this case we will use the plms sampler

  8. --ckpt ../model.ckpt: here we specify the path to our checkpoint

  9. --n_samples 1: the number of samples we want to produce

  10. --skip_grid: skip creating a grid of images

  11. --outdir ../outputs: path to store the outputs

  12. --seed $RANDOM: with a fixed seed, the output generated for the same prompt will always be the same; to get different outputs for the same prompt, set the seed parameter to a random value

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait-timeout-secs 3600 \
--wait \
--id-only \
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt \
jsacex/stable-diffusion-ckpt \
-- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of a person drinking coffee" --plms --ckpt ../model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs) 

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output​

After the download has finished we can see the results in the results/outputs folder. We received the following image for our prompt:

Generate Realistic Images using StyleGAN3 and Bacalhau

Introduction

TL;DR​

bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl

Prerequisite​

Running StyleGAN3 locally​

To run StyleGAN3 locally, you'll need to clone the repo, install dependencies and download the model weights.

git clone https://github.com/NVlabs/stylegan3
cd stylegan3
conda env create -f environment.yml
conda activate stylegan3
wget https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl

Now you can generate an image using a pre-trained AFHQv2 model. Here is an example of the image we generated:
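For reference, the generation command looks along these lines (the same flags are used in the Bacalhau job later in this guide; the local output directory name is arbitrary):

python gen_images.py --outdir=out --trunc=1 --seeds=2 \
    --network=stylegan3-r-afhqv2-512x512.pkl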

Containerize Script with Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

FROM nvcr.io/nvidia/pytorch:21.08-py3

COPY . /scratch

WORKDIR /scratch

ENV HOME /scratch

Build the container​

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case:

docker build -t jsacex/stylegan3 .

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/stylegan3 

Running a Bacalhau Job​

Structure of the command​

To submit a job run the Bacalhau command with following structure:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to Bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. jsacex/stylegan3: the name and the tag of the docker image we are using

  6. python gen_images.py: execute the script with following parameters:

    1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: --trunc sets the truncation psi, --seeds selects the random seed(s) used to generate the image(s), and --network points to the pre-trained network pickle downloaded earlier.

    2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl)

Declarative job description​

name: StyleGAN3
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/stylegan3" 
        Parameters:
          - python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. stylegan3.yaml, and then run with the command:

bacalhau job run stylegan3.yaml

Render a latent vector interpolation video​

You can also run variations of this command to generate videos and other things. In the following command below, we will render a latent vector interpolation video. This will render a 4x2 grid of interpolations for seeds 0 through 31.

Structure of the command​

Let's look closely at the command below:

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. The --id-only flag is set to print only job id

  5. jsacex/stylegan3 the name and the tag of the docker image we are using

  6. python gen_images.py: execute the script with following parameters:

    1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: The animation length is either determined based on the --seeds value or explicitly specified using the --num-keyframes option. When num keyframes is specified with --num-keyframes, the output video length will be num_keyframes * w_frames frames. If --num-keyframes is not specified, the number of seeds given with --seeds must be divisible by grid size W*H (--grid). In this case, the output video length will be # seeds/(w*h)*w_frames frames.

    2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    jsacex/stylegan3 \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -- python gen_video.py --output=../outputs/lerp.mp4 --trunc=1 \
        --seeds=0-31 \
        --grid=4x2 \
        --network=stylegan3-r-afhqv2-512x512.pkl)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the file in the results/outputs folder.

Support

Running Inference on a Model stored on S3

Introduction​

In this example, we will demonstrate how to run inference on a model stored on Amazon S3. We will use a PyTorch model trained on the MNIST dataset.

Running Locally​

Prerequisites​

Consider using the latest versions or use the docker method listed below in the article.

  1. Python

  2. PyTorch

Downloading the Datasets​

Use the following commands to download the model and test image:

wget https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz
wget https://raw.githubusercontent.com/js-ts/mnist-test/main/digit.png

Creating the Inference Script​

This script is designed to load a pretrained PyTorch model for MNIST digit classification from a tar.gz file, extract it, and use the model to perform inference on a given input image. Ensure you have all required dependencies installed:

pip install Pillow torch torchvision
# content of the inference.py file
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.autograd import Variable
import argparse
import tarfile

class CustomModel(torch.nn.Module):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, 5)
        self.conv2 = torch.nn.Conv2d(10, 20, 5)
        self.fc1 = torch.nn.Linear(320, 50)
        self.fc2 = torch.nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        output = torch.log_softmax(x, dim=1)
        return output

def extract_tar_gz(file_path, output_dir):
    with tarfile.open(file_path, 'r:gz') as tar:
        tar.extractall(path=output_dir)

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--tar_gz_file_path', type=str, required=True, help='Path to the tar.gz file')
parser.add_argument('--output_directory', type=str, required=True, help='Output directory to extract the tar.gz file')
parser.add_argument('--image_path', type=str, required=True, help='Path to the input image file')
args = parser.parse_args()

# Extract the tar.gz file
tar_gz_file_path = args.tar_gz_file_path
output_directory = args.output_directory
extract_tar_gz(tar_gz_file_path, output_directory)

# Load the model
model_path = f"{output_directory}/model.pth"
model = CustomModel()
model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))
model.eval()

# Transformations for the MNIST dataset
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Function to run inference on an image
def run_inference(image, model):
    image_tensor = transform(image).unsqueeze(0)  # Apply transformations and add batch dimension
    input = Variable(image_tensor)

    # Perform inference
    output = model(input)
    _, predicted = torch.max(output.data, 1)
    return predicted.item()

# Example usage
image_path = args.image_path
image = Image.open(image_path)
predicted_class = run_inference(image, model)
print(f"Predicted class: {predicted_class}")

Running the Inference Script​

To use this script, you need to provide the paths to the tar.gz file containing the pre-trained model, the output directory where the model will be extracted, and the input image file for which you want to perform inference. The script will output the predicted digit (class) for the given input image.

python3 inference.py --tar_gz_file_path ./model.tar.gz --output_directory ./model --image_path ./digit.png

Running Inference on Bacalhau​

Prerequisite​

Structure of the Command​

  1. export JOB_ID=$( ... ): Export results of a command execution as environment variable

  2. -w /inputs Set the current working directory at /inputs in the container

  3. -i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1: Mount the s3 bucket at the destination path provided - /model/ and specifying the region where the bucket is located opt=region=us-east-1

  4. -i git://github.com/js-ts/mnist-test.git: Flag to mount the source code repo from GitHub. It would mount the repo at /inputs/js-ts/mnist-test in this case it also contains the test image

  5. pytorch/pytorch: The name of the Docker image

  6. -- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png: The command to run inference on the model. It consists of:

    1. /model/model.tar.gz is the path to the model file

    2. /model-pth is the output directory for the model

    3. /inputs/js-ts/mnist-test/image.png is the path to the input image

export JOB_ID=$(bacalhau docker run \
--wait \
--id-only \
--timeout 3600 \
--wait-timeout-secs 3600 \
-w /inputs \
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1 \
-i git://github.com/js-ts/mnist-test.git \
pytorch/pytorch \
 -- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png)

When the job is submitted Bacalhau prints out the related job id. We store that in an environment variable JOB_ID so that we can reuse it later on.

Viewing the Output​

Use the bacalhau job logs command to view the job output, since the script prints the result of execution to the stdout:

bacalhau job logs ${JOB_ID}

Predicted class: 0

You can also use bacalhau job get to download job results:

bacalhau job get ${JOB_ID}

Support

Training Pytorch Model with Bacalhau

Introduction

In this example tutorial, we will show you how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.

TL;DR​

bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model

Prerequisite​

Training the Model Locally​

git clone https://github.com/pytorch/examples

Install the following:

pip install --upgrade torch torchvision

Next, we run the command below to begin the training of the mnist_rnn model. We added the --save-model flag to save the model

python ./examples/mnist_rnn/main.py --save-model

The downloaded MNIST dataset will be saved in the data folder.

Uploading Dataset to IPFS​
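The upload commands are not shown here; one common approach, assuming you have the IPFS CLI installed (any pinning service works just as well), is to add the data folder recursively and note the CID printed for the top-level directory:

ipfs add -r ./data
# the last line printed looks like: added <CID> data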

Running a Bacalhau Job​

Now that the dataset is uploaded to IPFS, we can run the training job on Bacalhau. To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model)

Structure of the command​

  1. export JOB_ID=$( ... ) exports the job ID as environment variable

  2. bacalhau docker run: call to bacalhau

  3. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  4. pytorch/pytorch: Using the official pytorch Docker image

  5. The -i ipfs://QmdeQjz1HQQd.....: flag is used to mount the uploaded dataset

  6. -w /outputs: Our working directory is /outputs. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs

  7. python ../inputs/main.py --save-model: the training script, which the URL input mounts into the /inputs folder of the container, is run from the /outputs working directory with the --save-model flag

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

name: Training Pytorch Model
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "pytorch/pytorch" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python ../inputs/main.py --save-model
    InputSources:
      - Source:
          Type: "ipfs"
          Params:
            CID: "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
        Target: /data
      - Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py
        Target: /inputs  
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. torch.yaml, and then run with the command:

bacalhau job run torch.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find results in the results/outputs folder. To view them, run the following command:

ls results/ # list the contents of the current directory 
cat results/stdout # displays the contents of the file given to it as a parameter.
ls results/outputs/ # list the successfully trained model

Training Tensorflow Model

Introduction

Training TensorFlow models Locally​

TensorFlow 2 quickstart for beginners​

  1. Load a prebuilt dataset.

  2. Build a neural network machine learning model that classifies images.

  3. Train this neural network.

  4. Evaluate the accuracy of the model.

Set up TensorFlow​

Import TensorFlow into your program to check whether it is installed

import tensorflow as tf
import os
print("TensorFlow version:", tf.__version__)

# Download the dataset to /inputs first (shell commands, run outside Python):
#   mkdir /inputs
#   wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz -O /inputs/mnist.npz
mnist = tf.keras.datasets.mnist

CWD = '' if os.getcwd() == '/' else os.getcwd()
(x_train, y_train), (x_test, y_test) = mnist.load_data('/inputs/mnist.npz')
x_train, x_test = x_train / 255.0, x_test / 255.0

Build a machine-learning model​

Build a tf.keras.Sequential model by stacking layers.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
predictions

The tf.nn.softmax function converts these logits to probabilities for each class:

tf.nn.softmax(predictions).numpy()

Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.

loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

Train and evaluate your model​

Use the Model.fit method to adjust your model parameters and minimize the loss:

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test,  y_test, verbose=2)

If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:

probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])
probability_model(x_test[:5])
os.makedirs('/outputs', exist_ok=True)  # create the output directory

The following method can be used to save the model as a checkpoint

model.save_weights('/outputs/checkpoints/my_checkpoint')
print(os.listdir('/outputs'))  # verify the checkpoint files were written

Running on Bacalhau​

The dataset and the script are mounted into the TensorFlow container using a URL; we then run the script inside the container.
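An imperative version of this job might look like the sketch below; it is assembled from the declarative description that follows (the dataset and train.py URLs, the tensorflow/tensorflow image and the /inputs working directory are all taken from that YAML):

bacalhau docker run \
    --gpu 1 \
    -w /inputs \
    -i https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz \
    -i https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py \
    tensorflow/tensorflow \
    -- python train.py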

Declarative job description​

name: Training ML model using tensorflow
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: "tensorflow/tensorflow" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python train.py
    InputSources:
      - Source:
          Type: urlDownload
          Params:
            URL: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
        Target: /inputs
      - Source:
          Type: urlDownload
          Params:
            URL: https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py
        Target: /inputs
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. tensorflow.yaml, and then run with the command:

bacalhau job run tensorflow.yaml

Checking the State of your Jobs​

Job status​

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information​

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download​

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output​

Now you can find the files in the results/outputs folder. To list them, run the following command:

ls results/outputs/

Support​

Stable Diffusion Dreambooth (Finetuning)

Stable Diffusion has revolutionized text-to-image models by producing high-quality images from a prompt. Dreambooth is an approach for personalizing text-to-image diffusion models: given a handful of images of a subject, we can fine-tune a pretrained text-to-image model on that subject.

Dreambooth makes Stable Diffusion even more powerful, with the ability to generate realistic-looking pictures of humans, animals or any other subject after training on just 20-30 images.

In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.

The following command generates the following:

  • Subject: SBF

  • Prompt: a photo of SBF without hair

bacalhau docker run \
 --gpu 1 \
 --timeout 3600 \
 --wait-timeout-secs 3600 \
  -i ipfs://QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy  \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of sbf man" "a photo of man" 3000 "/man" "/model"
bacalhau docker run \
 --gpu 1 \
  -i ipfs://QmUEJPr5pfV6tRzWQuNSSb3wdcN6tRQS5tdk3dYSCJ55Xs:/SBF.ckpt \
   jsacex/stable-diffusion-ckpt \
   -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of sbf without hair" --plms --ckpt ../SBF.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs 

Output:

Building this container requires you to have a supported GPU with 16GB+ of memory, since the process can be resource intensive.

We will create a Dockerfile and add the desired configuration to the file. The following commands specify how the image will be built and what extra requirements will be included:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel

WORKDIR /

# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers

RUN apt update && apt install wget git unzip -y

RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py

RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py

RUN pip install -qq git+https://github.com/TheLastBen/diffusers

RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort

# Install xformers

RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip

RUN unzip -j woman.zip -d woman

RUN unzip -j man.zip -d man

RUN unzip -j mix.zip -d mix

This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and the working directory is set. Next, we add our custom code and pull the dependent repositories.

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:-/model} \
  --tokenizer_name=${7:-/model}/tokenizer/ \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

The shell script is there to make things much simpler, since the command to train the model needs many parameters and the model weights must later be converted to a checkpoint. You can edit this script and add in your own parameters.

To download the models and run a test job in the Docker file, copy the following:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
WORKDIR /
# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers
RUN apt update && apt install wget git unzip -y
RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py
RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py
RUN pip install -qq git+https://github.com/TheLastBen/diffusers
RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort
# Install xformers
RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl
# You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work.
# https://huggingface.co/settings/tokens
RUN mkdir -p ~/.huggingface
RUN echo -n "<your-hugging-face-token>" > ~/.huggingface/token
# copy the test dataset from a local file
# COPY jfk /jfk

# Download and extract the test dataset
RUN wget https://github.com/js-ts/test-images/raw/main/jfk.zip
RUN unzip -j jfk.zip -d jfk
RUN mkdir model
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip
RUN unzip -j woman.zip -d woman
RUN unzip -j man.zip -d man
RUN unzip -j mix.zip -d mix

RUN  accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_starting_step=5\
  --stop_text_encoder_training=31 \
  --class_data_dir=/man \
  --save_n_steps=5 \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/jfk" \
  --output_dir="/model" \
  --instance_prompt="a photo of jfk man" \
  --class_prompt="a photo of man" \
  --instance_prompt="" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=30

COPY finetune.sh /finetune.sh

RUN wget -q https://gist.githubusercontent.com/js-ts/624fecc3fff807d4948688cb28993a94/raw/fd69ac084debe26a815485c1f363b8a45566f1ba/clear_mem.py
# Removing your token
RUN rm -rf  ~/.huggingface/token

Then execute finetune.sh with the following commands:

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:-/model} \
  --tokenizer_name=${7:-/model}/tokenizer/ \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want.

  2. tag: this is not required, but you can use the latest tag

Now you can push this repository to the registry designated by its name or tag.

docker push <hub-user>/<repo-name>:<tag>

The optimal dataset size is between 20-30 images. You can choose the images of the subject in different positions, full body images, half body, pictures of the face etc.

Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:

Subject Name.jpg, Subject Name (2).jpg ... Subject Name (n).jpg
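If your source photos are not already 512x512, one way to center-crop and resize them, assuming ImageMagick is installed, is:

# resize every JPEG in the current directory to a centered 512x512 crop
mogrify -resize 512x512^ -gravity center -extent 512x512 *.jpg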

After the Subject dataset is created we upload it to IPFS.

To upload your dataset using NFTup, just drag and drop your directory and it will be uploaded to IPFS:

After the dataset has been uploaded, copy its CID, which will look like this:

bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a

Since there are a lot of combinations that you can try, processing a finetuned model can take 1hr+ to complete. Here are a few approaches that you can try based on your requirements:

  1. bacalhau docker run: call to bacalhau

  2. The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job

  3. -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a Mounts the data from IPFS via its CID

  4. jsacex/dreambooth:latest Name and tag of the docker image we are using

  5. -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man": execute the script with the following parameters:

    1. /inputs Path to the subject Images

    2. /outputs Path to save the generated outputs

    3. "a photo of < name of the subject > < class >" -> "a photo of David Aronchick man" Subject name along with class

    4. "a photo of < class >" -> "a photo of man" Name of the class

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> man" "a photo of man" 3000 "/man" "/model"

The number of iterations is 3000. This number should be the number of subject images x 100, so if there are 30 images, it would be 3000. It takes around 32 minutes on a V100 for 3000 iterations, but you can increase or decrease the number based on your requirements.
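For example, a quick way to derive the step count from your dataset size:

NUM_IMAGES=30
MAX_TRAIN_STEPS=$((NUM_IMAGES * 100))  # 3000 for a 30-image dataset
echo $MAX_TRAIN_STEPS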

Here is our command with our parameters replaced:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a \
  --wait \
  --id-only \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man" "/model"

If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.

Use the /woman class images

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> woman" "a photo of woman" 3000 "/woman"  "/model"

Here you can provide your own regularization images or use the mix class.

Use the /mix class images if the class of the subject is mix

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> mix" "a photo of mix" 3000 "/mix"  "/model"

You can upload the model to IPFS and then create a gist, then mount the model and the script into the lightweight container:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a:/aronchick \
  -i ipfs://<CID-OF-THE-MODEL>:/model \
  -i https://gist.githubusercontent.com/js-ts/54b270a36aa3c35fdc270640680b3bd4/raw/7d8e8fa47bc3811ef63772f7fc7f4360aa9d51a8/finetune.sh \
  --wait \
  --id-only \
  jsacex/dreambooth:lite \
  -- bash /inputs/finetune.sh /aronchick /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"

When a job is submitted, Bacalhau prints out the related job_id. Use the export JOB_ID=$(bacalhau docker run ...) wrapper to store that in an environment variable so that we can reuse it later on.

name: Stable Diffusion Dreambooth Finetuning
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/dreambooth:full" 
        Parameters:
          - bash finetune.sh /inputs /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"
    InputSources:
      - Target: "/inputs/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy"
    Resources:
      GPU: "1"

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Now you can find the file in the results/outputs folder. You can view the results by running the following commands:

ls results # list the contents of the current directory 

In the next steps, we will be doing inference on the finetuned model

Bacalhau currently doesn't support mounting subpaths of a CID, so instead of mounting just the model.ckpt file we would have to mount the whole output CID, which is 6.4GB and might result in errors like FAILED TO COPY /inputs. So you have to manually copy the CID of the model.ckpt file, which is about 2GB.

To get the CID of the model.ckpt file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/. For example:

https://gateway.ipfs.io/ipfs/QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV/outputs/
ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Or you can use the IPFS CLI:

ipfs ls QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Copy the link of model.ckpt highlighted in the box:

https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt

Then extract the CID portion of the link and copy it.
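One way to pull the CID out of that gateway link, assuming a POSIX shell with sed available:

LINK="https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt"
echo "$LINK" | sed -E 's|.*/ipfs/([^/?]+).*|\1|'
# QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg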

To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run command.

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

If you are facing difficulties using the above method you can mount the whole output CID

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/outputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

We got an image like this as a result:

Running BIDS Apps on Bacalhau

Introduction

Prerequisite​

Downloading datasets​

Let's take a look at the structure of the data directory:

Uploading the datasets to IPFS​

When you pin your data, you'll get a CID which is in a format like this QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz. Copy the CID as it will be used to access your data

Running a Bacalhau Job​
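The full command is not reproduced in this extract; a minimal sketch reconstructed from the breakdown below (the dataset CID, image name and mriqc arguments are all taken from that list, and the mriqc entrypoint is an assumption) would be:

bacalhau docker run \
    -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data \
    nipreps/mriqc:latest \
    -- mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03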

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data: mount the CID of the dataset that is uploaded to IPFS and mount it to a folder called data on the container

  3. nipreps/mriqc:latest: the name and the tag of the docker image we are using

  4. ../data/ds005: path to input dataset

  5. ../outputs: path to the output

  6. participant --participant_label 01 02 03: run the mriqc on subjects with participant labels 01, 02, and 03

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​


The job description should be saved in .yaml format, e.g. bids.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Molecular Simulation with OpenMM and Bacalhau

Introduction

In this example tutorial, our focus will be on running OpenMM molecular simulation with Bacalhau.

Prerequisite​

Running Locally​

Downloading Datasets​

Writing the Script​

Running the Script​

This is only done to check whether your Python script is running. If there are no errors occurring, proceed further.

Uploading the Data to IPFS​

When you pin your data, you'll get a CID. Copy the CID as it will be used to access your data

Containerize Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

Build the container​

We will run docker build command to build the container:

Before running the command, replace:

  1. repo-name with the name of the container; you can name it anything you want

  2. tag: this is not required, but you can use the latest tag

In our case, this will be:

Push the container​

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

Run a Bacalhau Job​

Now that we have the data in IPFS and the docker image pushed, we can run a job on the Bacalhau network.
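A minimal sketch of such a job, based on the breakdown below (the /inputs mount target is an assumption; the CID, image and script come from that list):

bacalhau docker run \
    -i ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4:/inputs \
    ghcr.io/bacalhau-project/examples/openmm:0.3 \
    -- python run_openmm_simulation.py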

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. --input ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4: here we mount the CID of the dataset we uploaded to IPFS to use in the job

  3. ghcr.io/bacalhau-project/examples/openmm:0.3: the name and the tag of the image we are using

  4. python run_openmm_simulation.py: the script that will be executed inside the container

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Gromacs for Analysis

Introduction​

GROMACS is a package for high-performance molecular dynamics and output analysis. Molecular dynamics is a computer simulation method for analyzing the physical movements of atoms and molecules.

In this example tutorial, our focus will be on running the GROMACS package with Bacalhau.

Prerequisites​

Downloading datasets​

Uploading the datasets to IPFS​

Copy the CID at the end of the output, which is QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9

Running Bacalhau Job​

Let's run a Bacalhau job that converts coordinate files to topology and FF-compliant coordinate files:

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. -i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input: here we mount the CID of the dataset we uploaded to IPFS to use on the job

  3. -f input/1AKI.pdb: input file

  4. -o outputs/1AKI_processed.gro: output file

  5. -water spc: the water model to use. In this case, we use SPC

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. gromacs.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Coresets On Bacalhau

Introduction

We construct a small coreset for arbitrary shapes of numerical data at a reasonable time cost. The implementation is mainly based on the coreset construction algorithm proposed by Braverman et al. (SODA 2021).

In this tutorial example, we will run the coreset construction on a compressed dataset with Bacalhau.

Prerequisite​

Running Locally​

Clone the repo which contains the code

Downloading the dataset​

To download the dataset, open OpenStreetMap, a public repository that aims to generate and distribute accessible geographic data for the whole world. It supplies detailed position information, including the longitude and latitude of places around the world.

Installing Dependencies​

The following command installs the Linux dependencies:

Ensure that the requirements.txt file contains the following dependencies:

The following command installs the Python dependencies:

Running the Script​

To run coreset locally, you need to convert from compressed pbf format to geojson format:

The following command runs the Python script to generate the coreset:

Containerize Script using Docker​

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

We will use the python:3.8 image and run the same dependency installation commands that we used locally.

Build the container​

We will run docker build command to build the container:

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is optional, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

In our case:

Running a Bacalhau Job​

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to bacalhau

  2. --input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson: mount the monaco-latest.geojson file inside the container so it can be used by the script

  3. jsace/coreset: the name of the docker image we are using

  4. python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs: the script initializes cluster centers, creates a coreset using these centers, and saves the results to the specified folder.

Additional parameters:

-k: number of initialized centers (default=5)

-n: size of coreset (default=50)

-o: the output folder

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. coreset.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

To view the output as a CSV file, run:

Support​

Genomics Data Generation

Introduction​

In this tutorial example, we will run a genomics model on Bacalhau.

Prerequisite​

Running Locally​​

In our case this will be the following command:

Containerize Script using Docker​

To run Genomics on Bacalhau we need to set up a Docker container. To do this, you'll need to create a Dockerfile and add your desired configuration. The Dockerfile is a text document that contains the commands that specify how the image will be built.

We will use the kipoi/kipoi-veff2:py37 image and perform variant-centered effect prediction using the kipoi_veff2_predict tool.

Build the container​

The docker build command builds Docker images from a Dockerfile.

Before running the command, replace:

repo-name with the name of the container; you can name it anything you want

tag: this is optional, but you can use the latest tag

In our case:

Push the container​

Next, upload the image to the registry. This can be done using your Docker Hub username, the repo name, and the tag.

Running a Bacalhau job​

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job for generating genomics data, run the following Bacalhau command:

Structure of the command​

Let's look closely at the command above:

  1. bacalhau docker run: call to Bacalhau

  2. jsacex/kipoi-veff2:py37: the name of the image we are using

  3. kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit": the command that will be executed inside the container. It performs variant-centered effect prediction using the kipoi_veff2_predict tool

  4. ./examples/input/test.vcf: the path to a Variant Call Format (VCF) file containing information about genetic variants

  5. ./examples/input/test.fa: the path to a FASTA file containing DNA sequences. FASTA files contain nucleotide sequences used for variant effect prediction

  6. ../outputs/output.tsv: the path to the output file where the prediction results will be stored. The output file format is Tab-Separated Values (TSV), and it will contain information about the predicted variant effects

  7. -m "DeepSEA/predict": specifies the model to be used for prediction

  8. -s "diff" -s "logit": indicates using two scoring functions for comparing prediction results. In this case, the "diff" and "logit" scoring functions are used. These scoring functions can be employed to analyze differences between predictions for the reference and alternative alleles.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description​

The job description should be saved in .yaml format, e.g. genomics.yaml, and then run with the command:

Checking the State of your Jobs​

Job status: You can check the status of the job using bacalhau job list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

Viewing your Job Output​

To view the file, run the following command:

Support​

Job Specification

A Job represents a discrete unit of work that can be scheduled and executed. It carries all the necessary information to define the nature of the work, how it should be executed, and the resources it requires.

job Parameters

  • Name (string : <optional>): A logical name to refer to the job. Defaults to job ID.

  • Namespace (string: "default"): The namespace in which the job is running. ClientID is used as a namespace in the public demo network.

  • Priority (int: 0): Determines the scheduling priority.

  • Count (int: <required>): Number of replicas to be scheduled. This is only applicable for jobs of type batch and service.

Server-Generated Parameters

The following parameters are generated by the server and should not be set directly.

  • ID (string): A unique identifier assigned to this job. It's auto-generated by the server and should not be set directly. Used for distinguishing between jobs with similar names.

  • Version (int): A monotonically increasing version number incremented on job specification update.

  • Revision (int): A monotonically increasing revision number incremented on each update to the job's state or specification.

  • CreateTime (int): Timestamp of job creation.

  • ModifyTime (int): Timestamp of last job modification.

Task Specification

A Task signifies a distinct unit of work within the broader context of a Job. It defines the specifics of how the task should be executed, where the results should be published, what environment variables are needed, among other configurations.

Task Parameters

  1. Name (string : <required>): A unique identifier representing the name of the task.

  2. Env (map[string]string : optional): A set of environment variables for the driver.

Job Types

The different job types available in Bacalhau

Bacalhau introduced different job types in v1.1, providing more control and flexibility over how jobs are orchestrated and scheduled, depending on their type.

Despite the differences in job types, all jobs benefit from core functionalities provided by Bacalhau, including:

  1. Node selection - the appropriate nodes are selected based on several criteria, including resource availability, priority and feedback from the nodes.

  2. Job monitoring - jobs are monitored to ensure they complete, and that they stay in a healthy state.

  3. Retries - within limits, Bacalhau will retry certain jobs a set number of times should they fail to complete successfully when requested.

Batch Jobs

Ideal for intermittent yet intensive data dives, for instance performing computation over large datasets before publishing the response. This approach eliminates the continuous processing overhead, focusing on specific, in-depth investigations and computation.

Ops Jobs

Ops jobs are similar to batch jobs, but have a broader reach: they are executed on all nodes that align with the job specification, but otherwise behave like batch jobs.

Ops jobs are perfect for urgent investigations, granting direct access to logs on host machines, where previously you may have had to wait for the logs to arrive at a central location before being able to query them. They can also be used for delivering configuration files for other systems should you wish to deploy an update to many machines at once.

Daemon Jobs

Daemon jobs run continuously on all nodes that meet the criteria given in the job specification. Should any new compute nodes join the cluster after the job was started, and should they meet the criteria, the job will be scheduled to run on that node too.

A good application of daemon jobs is to handle continuously generated data on every compute node. This might be from edge devices like sensors or cameras, or from logs where they are generated. The data can then be aggregated and compressed before being sent onwards. For logs, the aggregated data can be relayed at regular intervals to platforms like Kafka or Kinesis, or directly to other logging services, with edge devices potentially delivering results via MQTT.

Daemon Job Example

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed continuously on all suitable nodes

  3. Will be executed only on nodes with label = WebService

  4. Uses the docker engine

  5. Executes a query with manually specified parameters

  6. Has access to 2 local directories with logs

  7. Publishes the results to the IPFS, if any

  8. Has network access type Full in order to send data to the S3 storage

Service Jobs

Service jobs run continuously on a specified number of nodes that meet the criteria given in the job specification. Bacalhau's orchestrator selects the optimal nodes to run the job and continuously monitors their health and performance. If required, it will reschedule the job on other nodes.

This job type is good for long-running consumers such as streaming or queuing services, or real-time event listeners.

Service Job Example

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed continuously on the specified number of suitable nodes

  3. Will be executed only on nodes with architecture = arm64 and located in the us-west-2 region

  4. Uses the docker engine

  5. Executes a query with multiple parameters

  6. Has access to 2 local directories with logs

  7. Publishes the results to the IPFS, if any

  8. Has network access type Full in order to send data to the S3 storage

Limits and Timeouts

Note that in version v1.5.0 the configuration management approach was completely changed and certain limits were deprecated.

Resource Limits

These are the configuration keys that control the capacity of the Bacalhau node, and the limits for jobs that might be run.

Windows Support

Timeouts

Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled, and no results will be published.

By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.

Configuring Execution Time Limits

Job submitters can pass the --timeout flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.
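For example, to cap a job at one hour of execution time (the image and command here are arbitrary placeholders):

bacalhau docker run \
  --timeout 3600 \
  ubuntu echo "hello"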

The timeout can also be added to an existing job spec by adding the Timeout property to the Spec.

Node operators can use configuration keys to specify default and maximum job execution time limits. The supplied values should be a numeric value followed by a time unit (one of s for seconds, m for minutes or h for hours).

Here is a list of the relevant properties:

Note that timeouts cannot be configured for Daemon and Service jobs.

Introduction

TL;DR

Prerequisite

Bacalhau client, see more information .

Running whisper locally

Create the script

Containerize Script using Docker

See more information on how to containerize your script/app

Build the container

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

Push the container

Running a Bacalhau Job

We will transcribe the moon landing video, which can be found here: https://www.nasa.gov/multimedia/hd/apollo11_hdpage.html

Since the downloaded video is in MOV format, we convert it to MP4 format and then upload it to our public storage, in this case IPFS. To do this, a public IPFS network can be used, or you can create your own private IPFS network and use it to pin files.
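The conversion itself can be done with a standard ffmpeg invocation; the file names below are placeholders:

ffmpeg -i moon_landing.mov moon_landing.mp4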

Structure of the command

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

Checking the State of your Jobs

Job status

Job information

Job download

Viewing your Job Output

To get started, you need to install the Bacalhau client, see more information

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.Storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.

To submit a workload to Bacalhau, we will use the bacalhau docker run command. The command allows one to pass input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument. In this line you will list all of the mp4 files in the /inputs directory and execute ffmpeg against each instance.
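A sketch of what such a command can look like; the CID, image, and ffmpeg options below are placeholders rather than the exact ones used in this example:

bacalhau docker run \
  -i ipfs://<CID>:/inputs \
  linuxserver/ffmpeg \
  -- bash -c 'for f in /inputs/*.mp4; do ffmpeg -i "$f" -vf scale=-2:720 "/outputs/$(basename "$f")"; done'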

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

You may want to create your own container for this kind of task. In that case, use the instructions for creating and publishing your own image in the Docker Hub. Use huggingface/transformers-pytorch-deepspeed-nightly-gpu as the base image, install the dependencies listed above and copy the inference.py into it. So your Dockerfile will look like this:

To get started, you need to install the Bacalhau client, see more information

In this example tutorial, we use Bacalhau and EasyOCR to digitize paper records, recognize characters, and extract text data from images stored on IPFS, S3 or on the web. EasyOCR is a ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc. With EasyOCR, you can use the pre-trained models or your own fine-tuned model.

You can skip this step and go straight to running a Bacalhau job

We will use the Dockerfile that is already created in the Easy OCR repo. Use the command below to clone the repo

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

To get started, you need to install the Bacalhau client, see more information .

Since the model and the image aren't present in the container, we will mount the image from a URL and the model from IPFS. You can find models to download here. You can choose the model you want to use; in this case, we will be using the zh_sim_g2 model

The same job can be presented in the declarative format. In this case, the description will look like this:

Stable Diffusion is an open-source text-to-image model, which generates images from text. It's a cutting-edge alternative to DALL·E 2 and uses the Diffusion Probabilistic Model for image generation. At its core, the model generates graphics from text using a Transformer.

The text-to-image stable diffusion model was trained on a fleet of GPU machines, at great cost. To use this trained model for inference, you also need to run it on a GPU.

However, this isn't always desired or possible. One alternative is to use a project called OpenVINO from Intel that allows you to convert and optimize models from a variety of frameworks (and ONNX if your framework isn't directly supported) to run on an Intel CPU. This is what we will do in this example.

First we convert the trained stable diffusion models so that they work efficiently on a CPU with OpenVINO. Choose the fine-tuned version of Stable Diffusion you want to use. The example is quite complex, so we have created a separate repository to host the code. This is a fork of the original Github repository.

In summary, the code downloads a pre-optimized OpenVINO version of the pre-trained stable diffusion model. This model leverages OpenAI's CLIP transformer and is wrapped inside an OpenVINO runtime, which executes the model.

The core code representing these tasks can be found in the stable_diffusion_engine.py file. This is a mashup that creates a pipeline necessary to tokenize the text and run the stable diffusion model. This boilerplate could be simplified by leveraging the more recent version of the diffusers library. But let's continue.

First we will create a Dockerfile to containerize the inference code. The Dockerfile can be found in the repository, but is presented here to aid understanding.

To run this example you will need Docker installed and running

Bacalhau is a distributed computing platform that allows you to run jobs on a network of computers. It is designed to be easy to use and to run on a variety of hardware. In this example, we will use it to run the stable diffusion model on a CPU.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

This example tutorial demonstrates how to use Stable Diffusion on a GPU and run it on the demo network. Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

To get started, you need to install the Bacalhau client, see more information .

This stable diffusion example is based on the Keras/Tensorflow implementation. You might also be interested in the Pytorch-oriented diffusers library.

Based on the requirements, we will install the following:

We have sample code from the Stable Diffusion in TensorFlow/Keras repo which we will use to check if the code is working as expected. Our output for this code will be a DSLR photograph of an astronaut riding a horse.

You need a script to execute when we submit jobs. The code below is a slightly modified version of the code we ran above; however, it includes more things such as argument parsing to be able to customize the generator.

For a full list of arguments that you can pass to the script, see more information

Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver. To containerize the inference code, we will create a Dockerfile. The Dockerfile is a text document that contains the commands that specify how the image will be built.

The Dockerfile leverages the latest official TensorFlow GPU image and then installs other dependencies like git, CUDA packages, and other image-related necessities. See the original repository for the expected requirements.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The Bacalhau command passes a prompt to the model and generates an image in the outputs directory. The main difference in the example below compared to all the other examples is the addition of the --gpu X flag, which tells Bacalhau to only schedule the job on nodes that have X GPUs free. You can read more about GPU support in the documentation.
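A sketch of such a submission; the image name, script name and prompt below are placeholders for the container built in this guide:

bacalhau docker run \
  --gpu 1 \
  <hub-user>/<repo-name>:<tag> \
  -- python stable_diffusion.py --prompt "an astronaut riding a horse"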

The identification and localization of objects in images and videos is a computer vision task called object detection. Several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmond et al.

Bacalhau is a highly scalable decentralized computing platform and is well suited to running massive object detection jobs. In this example, you can take advantage of the GPUs available on the Bacalhau network and perform an end-to-end object detection inference, using the YOLOv5 Docker image developed by Ultralytics.

To get started, you need to install the Bacalhau client, see more information

Remember that by default Bacalhau does not provide any network connectivity when running a job. So you need to either provide all assets at job submission time, or use the --network=full or --network=http flags to access the data at task time. See the Internet Access page for more details

--input to select which pre-trained weights you desire, with details on the yolov5 release page

For more container flags refer to the yolov5/detect.py file in the YOLO repository.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The same job can be presented in the declarative format. In this case, the description will look like this:

Now let's use some custom images. First, you will need to ingest your images onto IPFS or S3 storage. For more information about how to do that see the data ingestion section.

This example will use the Cyclist Dataset for Object Detection | Kaggle dataset.

We have already uploaded this dataset to the IPFS storage under the CID: bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm. You can browse to this dataset via a HTTP IPFS proxy.

To check the state of the job and view the job output, refer to the guide above.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

The following guide is using the fine-tuned model, which was finetuned on Bacalhau. To learn how to finetune your own stable diffusion model refer to this guide.

Bacalhau client, see more information

This part of the guide is optional - you can skip it and proceed to running the Bacalhau job if you are not going to use your own custom image.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

To do inference on your own checkpoint on Bacalhau you need to first upload it to your public storage, which can be mounted anywhere on your machine. In this case, we will be using NFT.Storage (Recommended Option). To upload your dataset using NFTup, drag and drop your directory and it will upload it to IPFS.

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

In this example tutorial, we will show you how to generate realistic images with StyleGAN3 and Bacalhau. StyleGAN is based on Generative Adversarial Networks (GANs), which include a generator and discriminator network that has been trained to differentiate images generated by the generator from real images. However, during the training, the generator tries to fool the discriminator, which results in the generation of realistic-looking images. With StyleGAN3 we can generate realistic-looking images or videos. It can generate not only human faces but also animals, cars, and landscapes.

To get started, you need to install the Bacalhau client, see more information

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one (https://docs.docker.com/docker-id/), and use the username of the account you created

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

To get started, you need to install the Bacalhau client, see more information

To train our model locally, we will start by cloning the Pytorch examples repo:

Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like Pinata or NFT.Storage. Once registered you can use their UI or API or SDKs to upload files.

Once you have uploaded your data, finish by copying the CID. Here is the dataset we have uploaded.

The -i https://raw.githubusercontent.com/py..........: flag is used to mount our training script. We will use the URL to this Pytorch example

The same job can be presented in the declarative format. In this case, the description will look like this:

TensorFlow is an open-source machine learning software library used to train neural networks. Expressed in the form of stateful dataflow graphs, each node in the graph represents the operations performed by neural networks on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training an MNIST model.

This section is from the TensorFlow 2 quickstart for beginners

This short introduction uses Keras to:

For each example, the model returns a vector of logits or log-odds scores, one for each class.

Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.
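A minimal sketch of that call, following the quickstart this section is based on (model and loss_fn are defined earlier in that tutorial):

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])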

The Model.evaluate method checks the model's performance, usually on a "Validation-set" or "Test-set".

The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Introduction

Although the dreambooth paper used Imagen to finetune the pre-trained model, since both the Imagen model and the Dreambooth code are closed source, several open-source projects have emerged using stable diffusion.

TL;DR

Inference

Prerequisites

To get started, you need to install the Bacalhau client, see more information

Setting up Docker Container

You can skip this section entirely and go directly to running a Bacalhau job

Downloading the models

Build the Docker container

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create a Docker account, and use the username of the account you create.

Create the Subject Dataset

You can view the Subject Image dataset of David Aronchick for reference.

Uploading the Subject Images to IPFS

In this case, we will be using NFT.Storage (Recommended Option) to upload files and directories with NFTUp.

Approaches to run a Bacalhau Job on a Finetuned Model

Case 1: If the subject is of class male

Structure of the command

Case 2 : If the subject is of class female

Case 3: If the subject is of class mix

Case 4: If you want a different tokenizer, model, and a different shell script with custom parameters

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this. Change the command in the Parameters section and the CID to suit your goals.

Checking the State of your Jobs

Job status

Job information

Job download

Viewing your Job Output

Inference on the Fine-Tuned Model

Refer to our guide on the CKPT model for more details on how to build an SD inference container

If you use the Brave browser, you can use the following:

Run the Bacalhau Job on the Fine-Tuned Model

To check the status of your job and download results refer back to the instructions above.

In this example tutorial, we will look at how to run BIDS App on Bacalhau. BIDS (Brain Imaging Data Structure) is an emerging standard for organizing and describing neuroimaging datasets. A BIDS App is a container image capturing a neuroimaging pipeline that takes a BIDS-formatted dataset as input. Each BIDS App has the same core set of command line arguments, making them easy to run and integrate into automated platforms. BIDS Apps are constructed in a way that does not depend on any software outside of the image other than the container engine.

To get started, you need to install the Bacalhau client, see more information

For this tutorial, download the file ds005.tar from this BIDS dataset and untar it in a directory:

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like NFT.Storage or Pinata. Once registered, you can use their UI or API or SDKs to upload files.

Alternatively, you can upload your dataset to IPFS using the IPFS CLI, but the recommended approach is to use a pinning service as we have mentioned above.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

In this tutorial example, we will showcase how to containerize an OpenMM workload so that it can be executed on the Bacalhau network and take advantage of the distributed storage & compute resources. OpenMM is a toolkit for molecular simulation. It is a physics-based library that is useful for refining the structure and exploring functional interactions with other molecules. It provides a combination of extreme flexibility (through custom forces and integrators), openness, and high performance (especially on recent GPUs) that make it truly unique among simulation codes.

To get started, you need to install the Bacalhau client, see more information

We use a processed 2DRI dataset that represents the ribose binding protein in bacterial transport and chemotaxis. The source organism is the Escherichia coli bacteria.

Protein data can be stored in a .pdb file; this is a human-readable format. It provides for the description and annotation of protein and nucleic acid structures, including atomic coordinates, secondary structure assignments, as well as atomic connectivity. See more information about the PDB format here. For the original, unprocessed 2DRI dataset, you can download it from the RCSB Protein Data Bank.

The relevant code of the processed 2DRI dataset can be found here. Let's print the first 10 lines of the 2dri-processed.pdb file. The output contains a number of ATOM records. These describe the coordinates of the atoms that are part of the protein.

To run the script above, all we need is a Python environment with the OpenMM library installed. This script makes sure that there are no empty cells and filters out potential error sources from the file.

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like NFT.Storage or Pinata. Once registered, you can use their UI or API or SDKs to upload files.

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

In this example, we will make use of the gmx pdb2gmx program to add hydrogens to the molecules and generate coordinates in Gromacs (Gromos) format and topology in Gromacs format.

To get started, you need to install the Bacalhau client, see more information

Datasets can be found here. In this example, we use the 1AKI dataset. After downloading, place it in a folder called “input”

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.Storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.

Alternatively, you can upload your dataset to IPFS using the IPFS CLI:

gromacs/gromacs: we use the official Gromacs Docker image

gmx pdb2gmx: command in GROMACS that performs the conversion of molecular structural data from the Protein Data Bank (PDB) format to the GROMACS format, which is used for conducting Molecular Dynamics (MD) simulations and analyzing the results. Additional parameters could be found here

For a similar tutorial that you can try yourself, check out

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Coreset is a data subsetting method. Since compressed datasets can get very large when uncompressed, it becomes much harder to train on them, as training time increases with the dataset size. To reduce training time and cut costs, we employ the coreset method; the coreset method can also be applied to other datasets. In this case, we use the coreset method, which can lead to fast speeds in solving the k-means problem on big data while maintaining high accuracy.

For a deeper understanding of the core concepts, it's recommended to explore the coreset construction literature, for example Braverman et al. (SODA 2021).

To get started, you need to install the Bacalhau client, see more information

The dataset is an osm.pbf file (a compressed format for .osm files); the file can be downloaded from Geofabrik.

coreset.py contains the following script

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. We've already converted the monaco-latest.osm.pbf file from the compressed pbf format to the geojson format. To submit a job, run the following Bacalhau command:

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Kipoi (pronounce: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for genomics. It currently contains 2201 different models, covering canonical predictive tasks in transcriptional and post-transcriptional gene regulation. Kipoi's API is implemented as a Python package, and it is also accessible from the command line.

To get started, you need to install the Bacalhau client, see more information

To run locally you need to install kipoi-veff2. You can find information about installation and usage here

See more information on how to containerize your script/app

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

In this example, a model from github.com is downloaded during the job execution. In order to do this, use the --network=full flag when describing the job, and --job-selection-accept-networked when starting the compute node on which the job will be executed.

Note that in the demo network, nodes do not accept jobs that require full network access. Consider creating your own private network.

The same job can be presented in the declarative format. In this case, the description will look like this:

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Type (string: <required>): The type of the job, such as batch, ops, daemon or service. You can learn more about the supported job types in the Job Types guide.

Meta ( : nil): Arbitrary metadata associated with the job.

Labels ([] : nil): Arbitrary labels associated with the job for filtering purposes.

Constraints ([] : nil): These are selectors which must be true for a compute node to run this job.

Tasks ([] : <required>): Tasks associated with the job, which define a unit of work within the job. Today we only support a single task per job, with future plans to extend this.

State (): Represents the current state of the job.

Engine ( : required): Configures the execution engine for the task, such as Docker or WASM.

Publisher ( : optional): Specifies where the results of the task should be published, such as the S3 and IPFS publishers. Only applicable for tasks of type batch and ops.

Meta ( : optional): Allows association of arbitrary metadata with this task.

InputSources ([] : optional): Lists remote artifacts that should be downloaded before task execution and mounted within the task, such as from S3 or IPFS.

ResultPaths ([] : optional): Indicates volumes within the task that should be included in the published result. Only applicable for tasks of type batch and ops.

Resources ( : optional): Details the resources that this task requires.

Network ( : optional): Configurations related to the networking aspects of the task.

Timeouts ( : optional): Configurations concerning any timeouts associated with the task.

Batch jobs are executed on demand, running on a specified number of Bacalhau nodes. These jobs either run until completion or until they reach a timeout. They are designed to carry out a single, discrete task before finishing. This is the only job type.

Batch Job Example

This example shows a sample Batch job description with all available parameters.

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed on 2 nodes

  3. Will be executed only on nodes with Linux OS

  4. Uses the docker engine

  5. Executes a python script with multiple arguments

  6. Preloads and mounts IPFS data as a local directory

  7. Publishes the results to the IPFS

  8. Has network access type HTTP and 2 allowed domains
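Since the full example is not reproduced here, the sketch below illustrates what such a description can look like. The Priority, Constraints and Network field names follow the job and task parameters described in this guide, while the image, script, CID, label key and domains are placeholders:

name: Batch Job Example
type: batch
count: 2
priority: 100
constraints:
  - key: "Operating-System"
    operator: "="
    values: ["linux"]
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.11-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python analyze.py --input /data --mode full
    InputSources:
      - Target: "/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "<dataset CID>"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    Network:
      Type: HTTP
      Domains:
        - example.com
        - data.example.com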

Ops Job Example

This example shows a sample Ops job description with all available parameters.

The example demonstrates a job that:

  1. Has a priority of 100

  2. Will be executed on all suitable nodes

  3. Will be executed only on nodes with label = WebService

  4. Uses the docker engine

  5. Executes a query with manually specified parameters

  6. Has access to a local directory

  7. Publishes the results to the IPFS, if any

  8. Has network access type HTTP and 2 allowed domains

This example shows a sample Daemon job description with all available parameters.

This example shows a sample Service job description with all available parameters.

Check out the guide to learn about all the changes in configuration management: CLI command syntax and configuration file management.

Configuration key
Description

It is also possible to specify the default amount of resources to be allocated to each job, if the required amount is not specified in the job itself. The JobDefaults.<Job Type>.Task.Resources.<Resource Type> configuration keys are used for this purpose. E.g. to provide each job with 2Gb of RAM, the following key is used: JobDefaults.Ops.Task.Resources.Memory:
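In a node's configuration file, that key path corresponds to the following structure (a sketch derived from the key above):

JobDefaults:
  Ops:
    Task:
      Resources:
        Memory: "2Gb"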

See the complete list of configuration keys for more details.

Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.

Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like resource limits) are not present in Windows-based nodes.

Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and configured to support Linux containers, e.g. using the WSL-based backend.

Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that result in their jobs waiting indefinitely.

mkdir data
tar -xf ds005.tar -C data 
data
└── ds005
    ├── CHANGES
    ├── dataset_description.json
    ├── participants.tsv
    ├── README
    ├── sub-01
    │   ├── anat
    │   │   ├── sub-01_inplaneT2.nii.gz
    │   │   └── sub-01_T1w.nii.gz
    │   └── func
    │       ├── sub-01_task-mixedgamblestask_run-01_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-01_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-02_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-02_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-03_bold.nii.gz
    │       └── sub-01_task-mixedgamblestask_run-03_events.tsv
    ├── sub-02
    │   ├── anat
    │   │   ├── sub-02_inplaneT2.nii.gz
    │   │   └── sub-02_T1w.nii.gz
    ...
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data \
    nipreps/mriqc:latest \
    -- mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03)
name: Running BIDS
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: nipreps/mriqc:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz"
bacalhau job run bids.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
ls results/ # list the contents of the results directory
cat results/stdout # display the job's standard output
head ./dataset/2dri-processed.pdb

Expected Output
    REMARK   1 CREATED WITH OPENMM 7.6, 2022-07-12
    CRYST1   81.309   81.309   81.309  90.00  90.00  90.00 P 1           1 
    ATOM      1  N   LYS A   1      64.731   9.461  59.430  1.00  0.00           N  
    ATOM      2  CA  LYS A   1      63.588  10.286  58.927  1.00  0.00           C  
    ATOM      3  HA  LYS A   1      62.707   9.486  59.038  1.00  0.00           H  
    ATOM      4  C   LYS A   1      63.790  10.671  57.468  1.00  0.00           C  
    ATOM      5  O   LYS A   1      64.887  11.089  57.078  1.00  0.00           O  
    ATOM      6  CB  LYS A   1      63.458  11.567  59.749  1.00  0.00           C  
    ATOM      7  HB2 LYS A   1      63.333  12.366  58.879  1.00  0.00           H  
    ATOM      8  HB3 LYS A   1      64.435  11.867  60.372  1.00  0.00           H  
# Import the packages
import os
from openmm import *
from openmm.app import *
from openmm.unit import *

# Specify the input files
input_path = 'inputs/2dri-processed.pdb'
if not os.path.exists(input_path):
    raise FileNotFoundError(f"Input file not found: {input_path}")

# Function to check and filter PDB file lines
def filter_valid_pdb_lines(input_path, output_path):
    with open(input_path, 'r') as infile, open(output_path, 'w') as outfile:
        lines = infile.readlines()
        for i, line in enumerate(lines):
            if line.startswith("ATOM") or line.startswith("HETATM"):
                if len(line) >= 54:
                    try:
                        float(line[30:38].strip())
                        float(line[38:46].strip())
                        float(line[46:54].strip())
                        outfile.write(line)
                    except ValueError:
                        print(f"Skipping line {i + 1} because it has invalid coordinates: {line.strip()}")
                else:
                    print(f"Skipping line {i + 1} because it is too short: {line.strip()}")
            else:
                outfile.write(line)

# Filter PDB file
filtered_pdb_path = 'inputs/filtered_2dri-processed.pdb'
filter_valid_pdb_lines(input_path, filtered_pdb_path)

# Load the filtered PDB file
try:
    pdb = PDBFile(filtered_pdb_path)
except ValueError as e:
    print(f"ValueError while reading filtered PDB file: {e}")
    raise

forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Output
output_path = 'outputs/final_state.pdbx'
if not os.path.exists(os.path.dirname(output_path)):
    os.makedirs(os.path.dirname(output_path))

# System Configuration
nonbondedMethod = PME
nonbondedCutoff = 1.0 * nanometers
ewaldErrorTolerance = 0.0005
constraints = HBonds
rigidWater = True
constraintTolerance = 0.000001
hydrogenMass = 1.5 * amu

# Integration Options
dt = 0.002 * picoseconds
temperature = 310 * kelvin
friction = 1.0 / picosecond
pressure = 1.0 * atmospheres
barostatInterval = 25

# Simulation Options
steps = 10
equilibrationSteps = 0
# platform = Platform.getPlatformByName('CUDA')
platform = Platform.getPlatformByName('CPU')
# platformProperties = {'Precision': 'single'}
platformProperties = {}
dcdReporter = DCDReporter('trajectory.dcd', 1000)
dataReporter = StateDataReporter('log.txt', 1000, totalSteps=steps,
                                 step=True, time=True, speed=True, progress=True, elapsedTime=True, remainingTime=True,
                                 potentialEnergy=True, kineticEnergy=True, totalEnergy=True, temperature=True,
                                 volume=True, density=True, separator='\t')
checkpointReporter = CheckpointReporter('checkpoint.chk', 1000)

# Prepare the Simulation
print('Building system...')
topology = pdb.topology
positions = pdb.positions
system = forcefield.createSystem(topology, nonbondedMethod=nonbondedMethod, nonbondedCutoff=nonbondedCutoff,
                                 constraints=constraints, rigidWater=rigidWater, ewaldErrorTolerance=ewaldErrorTolerance,
                                 hydrogenMass=hydrogenMass)
system.addForce(MonteCarloBarostat(pressure, temperature, barostatInterval))
integrator = LangevinMiddleIntegrator(temperature, friction, dt)
integrator.setConstraintTolerance(constraintTolerance)
simulation = Simulation(topology, system, integrator, platform, platformProperties)
simulation.context.setPositions(positions)

# Minimize and Equilibrate
print('Performing energy minimization...')
simulation.minimizeEnergy()
print('Equilibrating...')
simulation.context.setVelocitiesToTemperature(temperature)
simulation.step(equilibrationSteps)

# Simulate
print('Simulating...')
simulation.reporters.append(dcdReporter)
simulation.reporters.append(dataReporter)
simulation.reporters.append(checkpointReporter)
simulation.currentStep = 0
simulation.step(steps)

# Write a file with the final simulation state
state = simulation.context.getState(getPositions=True, enforcePeriodicBox=system.usesPeriodicBoundaryConditions())
with open(output_path, mode="w+") as file:
    PDBxFile.writeFile(simulation.topology, state.getPositions(), file)
print('Simulation complete, file written to disk at: {}'.format(output_path))
python run_openmm_simulation.py
FROM conda/miniconda3

RUN conda install -y -c conda-forge openmm

WORKDIR /project

COPY ./run_openmm_simulation.py /project

LABEL org.opencontainers.image.source https://github.com/bacalhau-project/examples

CMD ["python","run_openmm_simulation.py"]
docker build -t <hub-user>/<repo-name>:<tag> .
docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/openmm:0.3 .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --input ipfs://bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4 \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/openmm:0.3 \
    -- python run_openmm_simulation.py)
bacalhau job list --id-filter=${JOB_ID} --no-style
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get ${JOB_ID} --output-dir results # Download the results
cat results/outputs/final_state.pdbx
input
└── 1AKI.pdb
ipfs add -r input/

added QmTCCqPzX3qSJHuMeSma9uCqUnriZ5eJX7MnxebxydL89f input/1AKI.pdb
added QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9 input
 113.59 KiB / 113.59 KiB [=================================] 100.00%
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input \
    gromacs/gromacs \
    -- /bin/bash -c 'echo 15 | gmx pdb2gmx -f input/1AKI.pdb -o outputs/1AKI_processed.gro -water spc')
name: Gromacs
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: gromacs/gromacs
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - echo 15 | gmx pdb2gmx -f input/1AKI.pdb -o outputs/1AKI_processed.gro -water spc
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs      
    InputSources:
      - Target: "/input"
        Source:
          Type: "s3"
          Params:
            Bucket: "bacalhau-gromacs"
            Key: "*"
            Region: "us-east-1"
bacalhau job run gromacs.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/outputs/1AKI_processed.gro  
git clone https://github.com/js-ts/Coreset
wget https://download.geofabrik.de/europe/monaco-latest.osm.pbf
sudo apt-get -y update
sudo apt-get -y install osmium-tool
sudo apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev
# requirements.txt
certifi==2020.12.5
chardet==4.0.0
cycler==0.10.0
idna==2.10
joblib
kiwisolver==1.3.1
lxml==4.6.2
matplotlib==3.3.3
numpy==1.19.4
overpy==0.4
pandas==1.1.4
Pillow==8.0.1
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.4
requests==2.25.1
scikit-learn
scipy
six==1.15.0
threadpoolctl
tqdm==4.56.0
urllib3==1.26.2
geopandas
pip3 install -r Coreset/requirements.txt
osmium export monaco-latest.osm.pbf -o monaco-latest.geojson
python Coreset/python/coreset.py -f monaco-latest.geojson
FROM python:3.8

RUN apt-get -y update && apt-get -y install osmium-tool && apt update && apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev

ADD Coreset Coreset

ADD monaco-latest.geojson .

RUN cd Coreset && pip3 install -r requirements.txt
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsace/coreset .
docker push <hub-user>/<repo-name>:<tag>
docker push jsace/coreset
bacalhau docker run \
    --input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson \
    jsace/coreset \
    -- /bin/bash -c 'python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs'
name: Coresets On Bacalhau
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsace/coreset" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "osmium export input/liechtenstein-latest.osm.pbf -o /liechtenstein-latest.geojson;python Coreset/python/coreset.py -f /liechtenstein-latest.geojson -o /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs      
    InputSources:
      - Source:
          Type: "s3"
          Params:
            Bucket: "coreset"
            Key: "*"
            Region: "us-east-1"
        Target: "/input"    
bacalhau job run coreset.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
ls results/outputs

centers.csv                       coreset-weights-monaco-latest.csv
coreset-values-monaco-latest.csv  ids.csv
cat results/outputs/centers.csv | head -n 10

lat,long
7.423843975787508,43.730621154072196
7.4252607,43.7399135
7.411026970571964,43.72937671121925
7.459404485446199,43.62065587026715
7.429551373022234,43.74042043301333
cat results/outputs/coreset-values-monaco-latest.csv | head -n 10

7.418849799999999384e+00,4.372759140000000144e+01
7.416779063194204547e+00,4.373053835217195484e+01
7.422073648233502574e+00,4.374059957604499971e+01
7.434173206469590234e+00,4.374591689556921636e+01
7.417540100000000081e+00,4.372501400000000160e+01
7.427359010538406636e+00,4.374324133692341121e+01
7.427839200000001085e+00,4.374025220000000758e+01
7.418834173612560257e+00,4.372760402368248833e+01
7.416381731248183229e+00,4.373708812663696932e+01
7.412050699999999992e+00,4.372842109999999849e+01
cat results/outputs/coreset-weights-monaco-latest.csv | head -n 10

7.704359156916230233e+01
2.090893934427382987e+02
1.560611140982714744e+02
2.516557569411126281e+02
7.714605094768158722e+01
2.640808776415075840e+02
2.326085291610944523e+02
7.704841021255269595e+01
2.089705263763523249e+02
1.728105655128551632e+02
kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
FROM kipoi/kipoi-veff2:py37

RUN kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
docker build -t <hub-user>/<repo-name>:<tag> .
docker build -t jsacex/kipoi-veff2:py37 .
docker push <hub-user>/<repo-name>:<tag>
export JOB_ID=$(bacalhau docker run \
    --id-only \
    --memory 20Gb \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --publisher ipfs \
    --network full \
    jsacex/kipoi-veff2:py37 \
    -- kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit")
name: Genomics
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/kipoi-veff2:py37
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
    Publisher:
      Type: ipfs
    Network:
      Type: full
    ResultPaths:
      - Name: outputs
        Path: /outputs
    Resources:
      Memory: 20gb
bacalhau job run genomics.yaml
bacalhau job list --id-filter ${JOB_ID} --wide
bacalhau job describe ${JOB_ID}
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
cat results/outputs/output.tsv | head -n 10  
Type: batch
Count: 1
Priority: 50
Meta:
  version: "1.2.5"
Labels:
  project: "my-project"
Constraints:
  - Key: Architecture
    Operator: '='
    Values:
      - arm64
  - Key: region
    Operator: '='
    Values:
      - us-west-2
Tasks:
  #...
# This example shows a sample daemon job file. 
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/tutorial-save-25-m-yearly-by-managing is used

# Name of the job. Optional. Default value - job ID
Name: Logstash


# Type of the job
Type: daemon


# The namespace in which the job is running. Default value - “default”
Namespace: logging


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the daemon job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: daemon job
  Daemon job feature: To be executed continuously on all suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: service
    Operator: ==
    Values:
      - WebService


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: main


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: expanso/nginx-access-log-agent:1.0.0


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - --query
          - {{.query}}
          - --start-time
          - {{or (index . "start-time") ""}}
          - --end-time
          - {{or (index . "end-time") ""}}


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
          - OPENSEARCH_ENDPOINT={{.OpenSearchEndpoint}}
          - S3_BUCKET={{.AccessLogBucket}}
          - AWS_REGION={{.AWSRegion}}
          - AGGREGATE_DURATION=10
          - S3_TIME_FILE=60


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
      - Target: /app/state
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/state
            ReadWrite: true



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Type: Full


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301
# This example shows a sample service job file.
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/introducing-new-job-types-new-horizons is used

# Name of the job. Optional. Default value - job ID
Name: Kinesis Consumer


# Type of the job
Type: service


# The namespace in which the job is running. Default value - “default”
Namespace: service


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the service job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: service job
  Service job feature: To be executed continuously on a certain number of suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: Architecture
    Operator: '='
    Values:
      - arm64
  - Key: region
    Operator: '='
    Values:
      - us-west-2


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: main


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: my-kinesis-consumer:latest


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - -stream-arn
          - arn:aws:kinesis:us-west-2:123456789012:stream/my-kinesis-stream
          - -shard-iterator
          - TRIM_HORIZON


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
          - DEFAULT_USER_NAME=root
          - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
      - Target: /app/state
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/state
            ReadWrite: true



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 4Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Type: Full


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301

Compute.AllocatedCapacity.CPU

Specifies the amount of CPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 500m).

Compute.AllocatedCapacity.Disk

Specifies the amount of Disk space a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 10Gi)

Compute.AllocatedCapacity.GPU

Specifies the amount of GPU a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1).

Note: When using percentages, the result is always rounded up to the nearest whole GPU

Compute.AllocatedCapacity.Memory

Specifies the amount of Memory a compute node allocates for running jobs. It can be expressed as a percentage (e.g., 85%) or a Kubernetes resource string (e.g., 1Gi)
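
For example, these allocation limits can be adjusted with bacalhau config set. The values below are purely illustrative and should be tuned to the capacity of your node:

bacalhau config set Compute.AllocatedCapacity.CPU=80%
bacalhau config set Compute.AllocatedCapacity.Memory=85%
bacalhau config set Compute.AllocatedCapacity.Disk=10Gi
bacalhau config set Compute.AllocatedCapacity.GPU=100%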

bacalhau config set JobDefaults.Ops.Task.Resources.Memory=2Gi

JobDefaults.Batch.Task.Timeouts.ExecutionTimeout

Default value for batch job execution timeouts on your current compute node. It will be assigned to batch jobs with no timeout requirement defined

JobDefaults.Ops.Task.Timeouts.ExecutionTimeout

Default value for ops job execution timeouts on your current compute node. It will be assigned to ops jobs with no timeout requirement defined

JobDefaults.Batch.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for batch jobs. Jobs with higher timeout requirements will not be bid on

JobDefaults.Ops.Task.Timeouts.TotalTimeout

Default value for the maximum execution timeout this compute node supports for ops jobs. Jobs with higher timeout requirements will not be bid on
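
As a sketch, the same bacalhau config set mechanism applies to these job defaults. The duration values below are illustrative assumptions, not recommendations:

bacalhau config set JobDefaults.Batch.Task.Timeouts.ExecutionTimeout=30m
bacalhau config set JobDefaults.Ops.Task.Timeouts.ExecutionTimeout=30m
bacalhau config set JobDefaults.Batch.Task.Timeouts.TotalTimeout=2h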

Local Publisher Specification

Bacalhau's Local Publisher provides a useful option for storing task results on the compute node, allowing for easy access and retrieval when testing or trying out Bacalhau.

The Local Publisher should not be used in production as it is not a reliable storage option. For production use, we recommend a more reliable option such as an S3-compatible storage service.

Local Publisher Parameters

The local publisher requires no specific parameters to be defined in the publisher specification. The user only needs to indicate the publisher type as "local", and Bacalhau handles the rest. Here is an example of how to set up a Local Publisher in a job specification.

Publisher:
  Type: local

Published Result Specification

Once the job is executed, the results are published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. This will download and extract the contents for the user from the remote compute node.
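
For example, results stored by the Local publisher are retrieved the same way as any other published output:

bacalhau job get <job_id> --output-dir results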

Result Parameters

URL (string): This is the HTTP URL to the results of the computation, which is hosted on the compute node where it ran. Here's a sample of how the published result might appear:

PublishedResult:
  Type: local
  Params:
    URL: "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"

In this example, the task results will be stored on the compute node, and can be referenced and retrieved using the specified URL.
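
Since the published result is simply a tarball served over HTTP, it can also be fetched and unpacked manually. This is a sketch using the sample URL above; in practice bacalhau job get is usually the more convenient route:

curl -o result.tgz "http://192.168.0.11:6001/e-c4b80d04-ff2b-49d6-9b99-d3a8e669a6bf.tgz"
tar -xzf result.tgz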

Caveats

  1. By default the compute node will attempt to use a public address for the HTTP server delivering task output, but there is no guarantee that the compute node is accessible on that address. If the compute node is behind a NAT or firewall, the user may need to manually specify the address to use for the HTTP server in the config.yaml file.

  2. There is no lifecycle management for the content stored on the compute node. The user is responsible for managing the content and ensuring that it is removed when no longer needed before the compute node runs out of disk space.

  3. If the address/port of the compute node changes, then previously stored content will no longer be accessible. The user will need to manually update the address in the config.yaml file and re-publish the content to make it accessible again.
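
If you need to pin the address and port the local publisher serves results on, a hedged sketch is shown below. The key names are assumptions - verify them against the configuration keys list for your Bacalhau version:

bacalhau config set Publishers.Types.Local.Address=192.168.0.11
bacalhau config set Publishers.Types.Local.Port=6001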

Local Source Specification

The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.

Source Specification Parameters

Here are the parameters that you can define for a Local input source:

  • SourcePath (string: <required>): The absolute path on the compute node where the local directory or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.

  • ReadWrite (bool: false): A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified local directory or file. If set to false, Bacalhau will have read-only access.

Allow-listing Local Paths

For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the Compute.AllowListedLocalPaths configuration key followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:

  • :rw - Read-Write access.

  • :ro - Read-Only access (default if no suffix is provided).

Check the default settings on your server, as this may be set to :ro, which can lead to an error when write access is required.

For instance:

bacalhau config set Compute.AllowListedLocalPaths=/etc/config:rw,/etc/*.conf:ro

Example

Below is an example of how to define a Local input source in YAML format.

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

In this example, Bacalhau is configured to access the local directory "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Adjusting the ReadWrite flag to false would enable read-only access, preventing modifications to the local data from within the Bacalhau task.

Example (Imperative/CLI)

When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:

  1. Mount readonly file to /config:

    bacalhau docker run -i file:///etc/config:/config ubuntu ...
  2. Mount a writable directory to /myCheckpoints:

    bacalhau docker run -i file:///var/checkpoints:/myCheckpoints,opt=rw=true ubuntu ...

Docker Engine Specification

Docker Engine is one of the execution engines supported in Bacalhau. It allows users to run tasks inside Docker containers, offering an isolated and consistent environment for execution. Below are the parameters to configure the Docker Engine.

Docker Engine Parameters

  • Image (string: <required>): Specifies the Docker image to use for task execution. It should be an image that can be pulled by Docker.

  • Entrypoint (string[]: <optional>): Allows overriding the default entrypoint set in the Docker image. Each string in the array represents a segment of the entrypoint command.

  • Parameters (string[]: <optional>): Additional command-line arguments to be included in the container’s startup command, appended after the entrypoint.

  • EnvironmentVariables (string[]: <optional>): Sets environment variables within the Docker container during task execution. Each string should be formatted as KEY=value.

  • WorkingDirectory (string: <optional>): Sets the path inside the container where the task executes. If not specified, it defaults to the working directory defined in the Docker image.

Example

Here’s an example of configuring the Docker Engine within a job or task using YAML:

Engine:
  Type: "Docker"
  Params:
    Image: "ubuntu:20.04"
    Entrypoint:
      - "/bin/bash"
      - "-c"
    Parameters:
      - "echo Hello, World!"
    EnvironmentVariables:
      - "MY_ENV_VAR=myvalue"
    WorkingDirectory: "/app"

In this example, the task will be executed inside an Ubuntu 20.04 Docker container. The entrypoint is overridden to execute a bash shell that runs an echo command. An environment variable MY_ENV_VAR is set with the value myvalue, and the working directory inside the container is set to /app.

# This example shows a sample job file. 
# Parameters marked as Optional can be skipped - the default values will be used


# Name of the job. Optional. Default value - job ID
Name: Batch Job Example


# Type of the job
Type: batch


# The namespace in which the job is running. Default value - “default”
Namespace: default


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Count - number of replicas to be scheduled. 
# This is only applicable for jobs of type batch and service.
Count: 2


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the batch job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Some option: Some text
  Some other option: Some other text


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
- Key: "Operating-System"
  Operator: "="
  Values: ["linux"]


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: Important Calculations


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: alek5eyk/batchjobexample:1.1


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - python
          - supercalc.py
          - "5"
          - /outputs/result.txt


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
        - DEFAULT_USER_NAME=root
        - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task
    # Optional
    InputSources:
      - Target: /data
        Source:
          Type: ipfs
          Params:
            CID: "QmSYE8dVx6RTdDFFhBu51JjFG1fwwPdUJoXZ4ZNXvfoK2V"



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Domains:
      - example.com
      - ghcr.io
      Type: HTTP


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301
# This example shows a sample ops job file. 
# Parameters marked as Optional can be skipped - the default values will be used
# Example from the https://blog.bacalhau.org/p/real-time-log-analysis-with-bacalhau is used


# Name of the job. Optional. Default value - job ID
Name: Live logs processing


# Type of the job
Type: ops


# The namespace in which the job is running. Default value - “default”
Namespace: logging


# Priority - determines the scheduling priority. By default is 0
Priority: 100


# Meta - arbitrary metadata associated with the job. 
# Optional
Meta:
  Job purpose : Provide detailed example of the ops job
  Meta purpose: Describe the job


# Labels - Arbitrary labels associated with the job for filtering purposes. 
# Optional
Labels:
  Job type: ops job
  Ops job feature: To be executed on all suitable nodes


# Constraint - a condition that must be met for a compute node to be eligible to run a given job. 
# Should be specified in a following format: key - operator - value
# Optional.
Constraints:
  - Key: service
    Operator: ==
    Values:
      - WebService


# Task associated with the job, which defines a unit of work within the job. 
# Currently, only one task per job is supported.
Tasks:
  # Name - unique identifier for a task. Default value - “main”
  - Name: LiveLogProcessing


    # Engine - the execution engine for the task. 
    # Defines engine type (docker or wasm) and relevant parameters. 
    # In this example, docker engine will be used.  
    Engine:
      Type: docker


    # Params: A set of key-value pairs that provide the specific configurations for the chosen type
      Params:

        # Image: docker image to be used in the task.
        Image: expanso/nginx-access-log-processor:1.0.0


        # Entrypoint defines a command that will be executed when container starts. 
        # For this example we don't need any so default value 'null' can be used
        Entrypoint: null


        # Parameters define CLI commands, executed after entrypoint        
        Parameters:
          - --query
          - {{.query}}
          - --start-time
          - {{or (index . "start-time") ""}}
          - --end-time
          - {{or (index . "end-time") ""}}


        # WorkingDirectory sets the working directory for the entrypoint and parameter commands.
        # Default value - empty string ""
        WorkingDirectory: ""


        # EnvironmentVariables sets environment variables for the engine
        EnvironmentVariables:
        - DEFAULT_USER_NAME=root
        - API_KEY=none


        # Meta - arbitrary metadata associated with the task. 
        # Optional
        Meta:
          Task goal : show how to create declarative descriptions

    # Publisher specifies where the results of the task should be published - S3, IPFS, Local or none
    # Optional
    # To use IPFS publisher you need to specify only type
    # To use S3 publisher you need to specify bucket, key, region and endpoint
    # See S3 Publisher specification for more details
    Publisher:
      Type: ipfs


    # InputSources lists remote artifacts that should be downloaded before task execution 
    # and mounted within the task.
    # Ensure that localDirectory source is enabled on the nodes
    # Optional
    InputSources:
      - Target: /logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs



    # ResultPaths indicate volumes within the task that should be included in the published result
    # Only applicable for batch and ops jobs.
    # Optional
    ResultPaths:
      - Name: outputs
        Path: /outputs


    # Resources is a structured way to detail the required computational resources for the task. 
    # Optional
    Resources:
      # CPU can be specified in cores (e.g. 1) or in milliCPU units (e.g. 250m or 0.25)
      CPU: 250m
      
      # Memory highlights amount of RAM for a job. Can be specified in Kb, Mb, Gb, Tb
      Memory: 1Gb
      
      # Disk states disk storage space, needed for the task.
      Disk: 100mb

      # Denotes the number of GPU units required.
      GPU: "0"


    # Network specifies networking requirements.  
    # Optional
    # Job may have full access to the network,
    # may have no access at all,
    # or may have limited HTTP(S) access to a specific list of domains
    Network:
      Domains:
      - example.com
      - ghcr.io
      Type: HTTP


    # Timeouts define configurations concerning any timeouts associated with the task. 
    # Optional
    Timeouts:
      # QueueTimeout defines how long the job will wait for suitable nodes in the network
      # if none are currently available.
      QueueTimeout: 101

      # TotalTimeout defines job execution timeout. When it is reached the job will be terminated
      TotalTimeout: 301

WebAssembly (WASM) Engine Specification

The WASM Engine in Bacalhau allows tasks to be executed in a WebAssembly environment, offering compatibility and speed. This engine supports WASM and WASI (WebAssembly System Interface) jobs, making it highly adaptable for various use cases. Below are the parameters for configuring the WASM Engine.

WASM Engine Parameters

  • Entrypoint (string: <optional>): The name of the function within the EntryModule to execute. For WASI jobs, this should typically be _start. The entrypoint function should have zero parameters and zero results.

  • Parameters (string[]: <optional>): An array of strings containing arguments that will be supplied to the program as ARGV. This allows parameterized execution of the WASM task.

  • EnvironmentVariables (map[string]string: <optional>): A mapping of environment variable keys to their values, made available within the executing WASM environment.

Example

Here’s a sample configuration of the WASM Engine within a task, expressed in YAML:

Engine:
  Type: "WASM"
  Params:
    EntryModule:
      Source:
        Type: "s3"
        Params:
          Bucket: "my-bucket"
          Key: "entry.wasm"
    Entrypoint: "_start"
    Parameters:
      - "--option"
      - "value"
    EnvironmentVariables:
      VAR1: "value1"
      VAR2: "value2"
    ImportModules:
      - Source:
          Type: "localDirectory"
          Params:
            Path: "/local/path/to/module.wasm"

In this example, the task is configured to run in a WASM environment. The EntryModule is fetched from an S3 bucket, the entrypoint is _start, and parameters and environment variables are passed into the WASM environment. Additionally, an ImportModule is loaded from a local directory, making its exports available to the EntryModule.

S3 Publisher Specification

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.

Publisher Parameters

  1. Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.

  2. Key (string: <required>): The object key within the specified bucket where the task results will be stored.

  3. Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).

  4. Region (string: <optional>): The region where the S3 bucket is located.

Published Result Spec

  1. Bucket: Confirms the name of the bucket containing the stored results.

  2. Key: Identifies the unique object key within the specified bucket.

  3. Region: Notes the AWS region of the bucket.

  4. Endpoint: Records the endpoint URL for S3-compatible storage services.

  5. VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.

  6. ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.

Dynamic Naming

With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.

Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:

  1. {executionID}: Replaced with the specific execution ID.

  2. {jobID}: Replaced with the ID of the job.

  3. {nodeID}: Replaced with the ID of the node where the execution took place

  4. {date}: Replaced with the current date in the format YYYYMMDD.

  5. {time}: Replaced with the current time in the format HHMMSS.

Additionally, if you are publishing an archive and the object key does not end with .tar.gz, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /, a trailing slash will be added.

Example

Imagine you've specified the following object key pattern for publishing:

results/{jobID}/{date}/{time}/

Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:

results/abc123/20230926/140530/

This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.

Examples

Declarative Examples

Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint parameter with the appropriate URL.

The results will be compressed into a single object, and the published result specification will look like:

PublishedResult:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Region: "us-west-2"
    ChecksumSHA256: "0x9a3a..."
    VersionID: "3/L4kqtJlcpXroDTDmJ+rmDbwQaHWyOb..."

Imperative Examples

The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:

  1. Basic Docker job writing to S3 with default configurations:

    bacalhau docker run -p s3://bucket/key ubuntu ...

    This command writes to the S3 bucket using default endpoint and region settings.

  2. Docker job writing to S3 with a specific endpoint and region:

    bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

    This command specifies a unique endpoint and region for the S3 bucket.

  3. Using naming placeholders:

    bacalhau docker run -p s3://bucket/result-{date}-{jobID} ubuntu ...

    Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.

Remember to replace the placeholders like bucket, key, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.

Credential Requirements

To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:

  1. Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

  2. Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.

  3. IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
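
For example, when relying on environment variables (the first option above), the credentials only need to be present in the environment of the Bacalhau process before it starts; the values below are placeholders:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>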

Required IAM Policies

Compute Nodes

Compute nodes must run with the following policies to publish to S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
  • PutObject Permissions: The s3:PutObject permission is necessary to publish objects to the specified S3 bucket.

  • Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow publishing with any prefix within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow publishing to multiple buckets, or * to allow publishing to all buckets in the account.
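
For instance, to restrict publishing to a single prefix within the bucket, the Resource ARN in the policy above can be narrowed; the results/ prefix here is illustrative:

"Resource": "arn:aws:s3:::BUCKET_NAME/results/*"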

Requester Node

To enable downloading published results using bacalhau job get <job_id> command, the requester node must run with the following policies:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
  • GetObject Permissions: The s3:GetObject permission is necessary for the requester node to provide a pre-signed URL to download the published results by the client.

IPFS Publisher Specification

IPFS Publisher Parameters

For the IPFS publisher, no specific parameters need to be defined in the publisher specification. The user only needs to indicate the publisher type as IPFS, and Bacalhau handles the rest. Here is an example of how to set up an IPFS Publisher in a job specification.

Publisher:
  Type: ipfs

Published Result Specification

Once the job is executed, the results are published to IPFS, and a unique CID (Content Identifier) is generated for each file or piece of data. This CID acts as an address to the file in the IPFS network and can be used to access the file globally.

Result Parameters

  • CID (string): This is the unique content identifier generated by IPFS, which can be used to access the published content from anywhere in the world. Every data piece stored on IPFS has its unique CID. Here's a sample of how the published result might appear:

PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"

In this example, the task results will be stored in IPFS, and can be referenced and retrieved using the specified CID. This is indicative of Bacalhau's commitment to offering flexible, reliable, and decentralized options for result storage, catering to a diverse set of user needs and preferences.

IPFS Source Specification

Source Specification Parameters

Here are the parameters that you can define for an IPFS input source:

  • CID (string: <required>): The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.

Example

Below is an example of how to define an IPFS input source in YAML format.

InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"

In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.

Example (Imperative/CLI)

Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:

  1. Mount an IPFS CID to the default /inputs directory:

    bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ...
  2. Mount an IPFS CID to a custom /data directory:

    bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72:/data ubuntu ...

These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.

Reading Data from Multiple S3 Buckets using Bacalhau

Introduction

Bacalhau, a powerful and versatile data processing platform, has integrated Amazon Web Services (AWS) S3, allowing users to seamlessly access and process data stored in S3 buckets within their Bacalhau jobs. This integration not only simplifies data input, output, and processing operations but also streamlines the overall workflow by enabling users to store and manage their data effectively in S3 buckets. With Bacalhau, you can process several large S3 buckets in parallel. In this example, we will walk you through the process of reading data from multiple S3 buckets and converting TIFF images to JPEG format.

Advantages of Converting TIFF to JPEG

There are several advantages to converting images from TIFF to JPEG format:

  1. Reduced File Size: JPEG images use lossy compression, which significantly reduces file size compared to lossless formats like TIFF. Smaller file sizes lead to faster upload and download times, as well as reduced storage requirements.

  2. Efficient Processing: With smaller file sizes, image processing tasks tend to be more efficient and require less computational resources when working with JPEG images compared to TIFF images.

  3. Training Machine Learning Models: Smaller file sizes and reduced computational requirements make JPEG images more suitable for training machine learning models, particularly when dealing with large datasets, as they can help speed up the training process and reduce the need for extensive computational resources.

Running the job on Bacalhau

We will use the S3 mount feature to mount objects from S3 buckets. Let's have a look at the example below:

-i src=s3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif,dst=/sentinel-s1-rtc-indigo/,opt=region=us-west-2

It defines an S3 object as input to the job:

  1. sentinel-s1-rtc-indigo: the bucket's name

  2. tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif: the key of the object in that bucket. The object to be processed is called Gamma0_VH.tif and is located in the subdirectory with the specified path.

  3. If you want to select all objects under that path, simply add * to the end of the path (tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/*)

  4. dst=/sentinel-s1-rtc-indigo: the destination path at which to mount the S3 object

  5. opt=region=us-west-2: specifies the region in which the bucket is located

Prerequisite

1. Running the job on multiple buckets with multiple objects

In the example below, we will mount several bucket objects from public s3 buckets located in a specific region:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --publisher=ipfs \
    --memory=10Gb \
    --wait-timeout-secs 3600 \
    -i src=s3://bdc-sentinel-2/s2-16d/v1/075/086/2018/02/18/*,dst=/bdc-sentinel-2/,opt=region=us-west-2  \
    -i src=s3://sentinel-cogs/sentinel-s2-l2a-cogs/28/M/CV/2022/6/S2B_28MCV_20220620_0_L2A/*,dst=/sentinel-cogs/,opt=region=us-west-2 \
    jsacex/gdal-s3)

The job has been submitted and Bacalhau has printed out the related job_id. We store that in an environment variable so that we can reuse it later on.

2. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results

3. Viewing your Job Output

Display the image

To view the images, download the job results and open the folder:

results/outputs/S2-16D_V1_075086_20180218_B04_TCI.jpg
results/outputs/B04_TCI.jpg

Support

EntryModule (InputSource: required): Specifies the WASM module that contains the start function or the main execution code of the task. The InputSource should point to the location of the WASM binary.

ImportModules ([]InputSource: optional): An array of InputSources pointing to additional WASM modules. The exports from these modules will be available as imports to the EntryModule, enabling modular and reusable WASM code.

Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using the S3 Input Source. The published result specification includes the following parameters:

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

The IPFS Publisher in Bacalhau amplifies the versatility of task result storage by integrating with the InterPlanetary File System (IPFS). IPFS is a protocol and network designed to create a peer-to-peer method of storing and sharing hypermedia in a distributed file system. Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure.

The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.

To get started, you need to install the Bacalhau client, see more information here.

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
