1 of 100

v1.3.1

Welcome to Bacalhau Docs

This is an older version of Bacalhau. For the latest version, go to this link.

What is Bacalhau?

Bacalhau is a platform for fast, cost efficient, and secure computation by running jobs where the data is generated and stored. With Bacalhau, you can streamline your existing workflows without the need of extensive rewriting by running arbitrary Docker containers and WebAssembly (wasm) images as tasks. This architecture is also referred to as Compute Over Data (or CoD). Bacalhau was coined from the Portuguese word for salted Cod fish.

Bacalhau seeks to transform data processing for large-scale datasets to improve cost and efficiency, and to open up data processing to larger audiences. Our goals is to create an open, collaborative compute ecosystem that enables unparalleled collaboration. We (Expanso.io) offer a demo network so you can try out jobs without even installing. Give it a shot!

Why Bacalhau?

⚡️ Bacalhau simplifies the process of managing compute jobs by providing a unified platform for managing jobs across different regions, clouds, and edge devices.

🤝 Bacalhau provides reliable and network-partition resistant orchestration, ensuring that your jobs will complete even if there are network disruptions.

🚨 Bacalhau provides a complete and permanent audit log of exactly what happened, so you can be confident that your jobs are being executed securely.

🔐 You can run private workloads to reduce the chance of leaking private information or inadvertently sharing your data outside of your organization.

💸 Bacalhau reduces ingress/egress costs since jobs are processed closer to the source.

🤓 You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data.

💥 You can integrate with services running on nodes to run a jobs, such as on DuckDB.

📚 Bacalhau operates at scale over parallel jobs. You can batch process petabytes (quadrillion bytes) of data.

How it works

Bacalhau concists of a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes:

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

For a more detailed tutorial, check out our Getting Started Tutorial.

The goal of the Bacalhau project is to make it easy to perform distributed computation next to where the data resides. In order to do this, first you need to ingest some data.

Data ingestion

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. Here are some options that can help you mount your data:

Copy data from a URL to public storage
Pin Data to public storage
Copy Data from S3 Bucket to public storage

The options are not limited to the above mentioned. You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Security in Bacalhau

All workloads run under restricted Docker or WASM permissions on the node. Additionally, you can use existing (locked down) binaries that are pre-installed through Pluggable Executors.

Best practices in 12-factor apps is to use environment variables to store sensitive data such as access tokens, API keys, or passwords. These variables can be accessed by Bacalhau at runtime and are not visible to anyone who has access to the code or the server.

Alternatively, you can pre-provision credentials to the nodes and access those on a node by node basis.

Finally, endpoints (such as vaults) can also be used to provide secure access to Bacalhau. This way, the client can authenticate with Bacalhau using the token without exposing their credentials.

Workloads Bacalhau is best suited for

Bacalhau can be used for a variety of data processing workloads, including machine learning, data analytics, and scientific computing. It is well-suited for workloads that require processing large amounts of data in a distributed and parallelized manner.

Use Cases

Once you have more than 10 devices generating or storing around 100GB of data, you're likely to face challenges with processing that data efficiently. Traditional computing approaches may struggle to handle such large volumes, and that's where distributed computing solutions like Bacalhau can be extremely useful. Bacalhau can be used in various industries, including security, web serving, financial services, IoT, Edge, Fog, and multi-cloud. Bacalhau shines when it comes to data-intensive applications like data engineering, model training, model inference, molecular dynamics, etc.

Here are some example tutorials on how you can process your data with Bacalhau:

Stable Diffusion AI
Generate Realistic Images using StyleGAN3 and Bacalhau
Object Detection with YOLOv5 on Bacalhau
Running Genomics on Bacalhau
Training Pytorch Model with Bacalhau

For more tutorials, visit our example page

Community

Bacalhau has a very friendly community and we are always happy to help you get started:

GitHub Discussions – ask anything about the project, give feedback or answer questions that will help other users.
Join the Slack Community and go to #bacalhau channel – it is the easiest way engage with other members in the community and get help.
Contributing – learn how to contribute to the Bacalhau project.

Next Steps

👉 Continue with Bacalhau Getting Started guide to learn how to install and run a job with the Bacalhau client.

👉 Or jump directly to try out the different Examples that showcases Bacalhau abilities.

Getting Started

How Bacalhau Works

In this tutorial we will go over the components and the architecture of Bacalhau. You will learn how it is built, what components are used, how you could interact and how you could use Bacalhau.

Chapter 1 - Architecture

Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes, which can communicate with each other.

The requester and compute nodes together form a p2p network and use gossiping to discover each other, share information about node capabilities, available resources and health status. Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers.

Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.

Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.

To interact with the Bacalhau network, users can use the Bacalhau CLI (command-line interface) to send requests to a requester node in the network. These requests are sent using the JSON format over HTTP, a widely-used protocol for transmitting data over the internet. Bacalhau's architecture involves two main sections which are the core components and interfaces.

Components overview

Core Components

The core components are responsible for handling requests and connecting different nodes. The network includes two different components:

Requester node

In the Bacalhau network, the requester node is responsible for handling requests from clients using JSON over HTTP. This node serves as the main custodian of jobs that are submitted to it. When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and coordinates the job execution.

Compute node

In the Bacalhau network, it is the compute node that is responsible for determining whether it can execute a job or not. This model allows for a more decentralized approach to job orchestration as the network will function properly even if the requester nodes have stale view of the network, or if concurrent requesters are allocating jobs to the same compute nodes. Once the compute node has run the job and produced results, it will publish the results to a remote destination as specified in the job specification (e.g. S3), and notify the requester of the job completion. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Interfaces

The interfaces handle the distribution, execution, storage and publishing of jobs. In the following all the different components are described and their respective protocols are shown.

Transport

The transport interface is responsible for sending messages about jobs that are created, accepted, and executed to other compute nodes. It also manages the identity of individual Bacalhau nodes to ensure that messages are only delivered to authorized nodes, which improves network security. To achieve this, the transport interface uses a protocol, which is a point-to-point scheduling protocol that runs securely and is used to distribute job messages efficiently to other nodes on the network. This is our upgrade to previous handlers as it ensures that messages are delivered to the right nodes without causing network congestion, thereby making communication between nodes more scalable and efficient.

Executor

The executor is a critical component of the Bacalhau network that handles the execution of jobs and ensures that the storage used by the job is local. One of its main responsibilities is to present the input and output storage volumes into the job when it is run. The executor performs two primary functions: presenting the storage volumes in a format that is suitable for the executor and running the job. When the job is completed, the executor will merge the stdout, stderr and named output volumes into a results folder that is then published to a remote location. Overall, the executor plays a crucial role in the Bacalhau network by ensuring that jobs are executed properly, and their results are published accurately.

Storage Provider

In a peer-to-peer network like Bacalhau, storage providers play a crucial role in presenting an upstream storage source. There can be different storage providers available in the network, each with its own way of manifesting the CID (Content IDentifier) to the executor. For instance, there can be a POSIX storage provider that presents the CID as a POSIX filesystem, or a library storage provider that streams the contents of the CID via a library call. Therefore, the storage providers and Executor implementations are loosely coupled, allowing the POSIX and library storage providers to be used across multiple executors, wherever it is deemed appropriate.

Publisher

The publisher is responsible for uploading the final results of a job to a remote location where clients can access them, such as S3 or IPFS.

Chapter 2 - Job cycle

Job preparation

You can create jobs in the Bacalhau network using various job types introduced in version 1.2. Each job may need specific variables, resource requirements and data details that are described in the Job Specification.

Advanced job preparation

Prepare data with Bacalhau by copying from URLs, pinning to public storage or copying from an S3 bucket. Mount data anywhere for Bacalhau to run against. Refer to IPFS, Local, S3 and URL Source Specifications for data source usage.

Optimize workflows without completely redesigning them. Run arbitrary tasks using Docker containers and WebAssembly images. Follow the Onboarding guides for Docker and WebAssembly workloads.

Explore GPU workload support with Bacalhau. Learn how to run GPU workloads using the Bacalhau client in the GPU Workloads section. Integrate Python applications with Bacalhau using the Bacalhau Python SDK.

For node operation, refer to the Running a Node section for configuring and running a Bacalhau node. If you prefer an isolated environment, explore the Private Cluster for performing tasks without connecting to the main Bacalhau network.

Job Submission

You should use the Bacalhau client to send a task to the network. The client transmits the job information to the Bacalhau network via established protocols and interfaces. Jobs submitted via the Bacalhau CLI are forwarded to a Bacalhau network node at http://bootstrap.production.bacalhau.org/ via port 1234 by default. This Bacalhau node will act as the requester node for the duration of the job lifecycle.

Bacalhau provides an interface to interact with the server via a REST API. Bacalhau uses 127.0.0.1 as the localhost and 1234 as the port by default.

bacalhau create [flags]

You can use the command with appropriate flags to create a job in Bacalhau using JSON and YAML formats.

Endpoint: `PUT /api/v1/orchestrator/jobs`

You can use Create Job API Documentation to submit a new job for execution.

You can use the bacalhau docker run command to start a job in a Docker container. Below, you can see an excerpt of the commands:

Bacalhau Docker CLI commands

bacalhau docker run [flags] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]

Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --do-not-track                     When true the job will not be tracked(?) TODO BETTER DEFINITION
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entrypoint strings               Override the default ENTRYPOINT of the image
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                    Examples:
                                    # Mount IPFS CID to /inputs directory
                                    -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                    # Mount S3 object to a specific path
                                    -i s3://bucket/key,dst=/my/input/path
                                    # Mount S3 object with specific endpoint and region
                                    -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)
-w, --workdir string                   Working directory inside the container. Overrides the working directory shipped with the image (e.g. via WORKDIR in Dockerfile).

You can also use the bacalhau wasm run command to run a job compiled into the (WASM) format. Below, you can find an excerpt of the commands in the Bacalhau CLI:

Bacalhau WASM CLI commands

bacalhau wasm run {cid-of-wasm | } [--entry-point ] [wasm-args ...] [flags]

Flags:
    --concurrency int                  How many nodes should run the job (default 1)
    --cpu string                       Job CPU cores (e.g. 500m, 2, 8).
    --disk string                      Job Disk requirement (e.g. 500Gb, 2Tb, 8Tb).
    --do-not-track                     When true the job will not be tracked(?) TODO BETTER DEFINITION
    --domain stringArray               Domain(s) that the job needs to access (for HTTP networking)
    --download                         Should we download the results once the job is complete?
    --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
    --dry-run                          Do not submit the job, but instead print out what will be submitted
    --entry-point string               The name of the WASM function in the entry module to call. This should be a zero-parameter zero-result function that
                                will execute the job. (default "_start")
-e, --env strings                      The environment variables to supply to the job (e.g. --env FOO=bar --env BAR=baz)
-f, --follow                           When specified will follow the output from the job as it runs
-g, --gettimeout int                   Timeout for getting the results of a job in --wait (default 10)
    --gpu string                       Job GPU requirement (e.g. 1, 2, 8).
-h, --help                             help for run
    --id-only                          Print out only the Job ID on successful submission.
-U, --import-module-urls url           URL of the WASM modules to import from a URL source. URL accept any valid URL supported by the 'wget' command, and supports both HTTP and HTTPS.
-I, --import-module-volumes cid:path   CID:path of the WASM modules to import from IPFS, if you need to set the path of the mounted data.
-i, --input storage                    Mount URIs as inputs to the job. Can be specified multiple times. Format: src=URI,dst=PATH[,opt=key=value]
                                        Examples:
                                        # Mount IPFS CID to /inputs directory
                                        -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72
                                        # Mount S3 object to a specific path
                                        -i s3://bucket/key,dst=/my/input/path
                                        # Mount S3 object with specific endpoint and region
                                        -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=https://s3.example.com,opt=region=us-east-1
    --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
    --ipfs-serve-path string           path local Ipfs node will persist data to
    --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
    --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
-l, --labels strings                   List of labels for the job. Enter multiple in the format '-l a -l 2'. All characters not matching /a-zA-Z0-9_:|-/ and all emojis will be stripped.
    --memory string                    Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
    --network network-type             Networking capability required by the job. None, HTTP, or Full (default None)
    --node-details                     Print out details of all nodes (overridden by --id-only).
-o, --output strings                   name:path of the output data volumes. 'outputs:/outputs' is always added unless '/outputs' is mapped to a different name. (default [outputs:/outputs])
    --output-dir string                Directory to write the output to.
    --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
-p, --publisher publisher              Where to publish the result of the job (default ipfs)
    --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result
-s, --selector string                  Selector (label query) to filter nodes on which this job can be executed, supports '=', '==', and '!='.(e.g. -s key1=value1,key2=value2). Matching objects must satisfy all of the specified label constraints.
    --target all|any                   Whether to target the minimum number of matching nodes ("any") (default) or all matching nodes ("all") (default any)
    --timeout int                      Job execution timeout in seconds (e.g. 300 for 5 minutes)
    --wait                             Wait for the job to finish. Use --wait=false to return as soon as the job is submitted. (default true)
    --wait-timeout-secs int            When using --wait, how many seconds to wait for the job to complete before giving up. (default 600)

Job Acceptance

When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and communicate with them directly. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.

Job execution

The selected compute node receives the job and starts its execution inside a container. The container can use different executors to work with the data and perform the necessary actions. A job can use the docker executor, WASM executor or a library storage volumes. Use Docker Engine Specification to view the parameters to configure the Docker Engine. If you want tasks to be executed in a WebAssembly environment, pay attention to WebAssembly Engine Specification.

Results publishing

When the Compute node completes the job, it publishes the results to S3's remote storage, IPFS.

Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure. View IPFS Publisher Specification to get the detailed information.

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. View S3 Publisher Specification to get the detailed information.

Chapter 3 - Returning Information

The Bacalhau client receives updates on the task execution status and results. A user can access the results and manage tasks through the command line interface.

Get Job Results

To Get the results of a job you can run the following command.

bacalhau get [id] [flags]

One can choose from a wide range of flags, from which a few are shown below.

Usage:
  bacalhau get [id] [flags]

Flags:
      --download-timeout-secs duration   Timeout duration for IPFS downloads. (default 5m0s)
  -h, --help                             help for get
      --ipfs-connect string              The ipfs host multiaddress to connect to, otherwise an in-process IPFS node will be created if not set.
      --ipfs-serve-path string           path local Ipfs node will persist data to
      --ipfs-swarm-addrs strings         IPFS multiaddress to connect the in-process IPFS node to - cannot be used with --ipfs-connect. (default [/ip4/35.245.161.250/tcp/4001/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/35.245.161.250/udp/4001/quic/p2p/12D3KooWAQpZzf3qiNxpwizXeArGjft98ZBoMNgVNNpoWtKAvtYH,/ip4/34.86.254.26/tcp/4001/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/34.86.254.26/udp/4001/quic/p2p/12D3KooWLfFBjDo8dFe1Q4kSm8inKjPeHzmLBkQ1QAjTHocAUazK,/ip4/35.245.215.155/tcp/4001/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/35.245.215.155/udp/4001/quic/p2p/12D3KooWH3rxmhLUrpzg81KAwUuXXuqeGt4qyWRniunb5ipjemFF,/ip4/34.145.201.224/tcp/4001/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/34.145.201.224/udp/4001/quic/p2p/12D3KooWBCBZnXnNbjxqqxu2oygPdLGseEbfMbFhrkDTRjUNnZYf,/ip4/35.245.41.51/tcp/4001/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa,/ip4/35.245.41.51/udp/4001/quic/p2p/12D3KooWJM8j97yoDTb7B9xV1WpBXakT4Zof3aMgFuSQQH56rCXa])
      --ipfs-swarm-key string            Optional IPFS swarm key required to connect to a private IPFS swarm
      --output-dir string                Directory to write the output to.
      --private-internal-ipfs            Whether the in-process IPFS node should auto-discover other nodes, including the public IPFS network - cannot be used with --ipfs-connect. Use "--private-internal-ipfs=false" to disable. To persist a local Ipfs node, set BACALHAU_SERVE_IPFS_PATH to a valid path. (default true)
      --raw                              Download raw result CIDs instead of merging multiple CIDs into a single result

Describe a Job

To describe a specific job, inserting the ID to the CLI or API gives back an overview of the job.

bacalhau describe [id] [flags]

You can use the command with appropriate flags to get a full description of a job in yaml format.

Endpoint: `GET /api/v1/orchestrator/jobs/:jobID`

You can use describe Job API Documentation to retrieve the specification and current status of a particular job.

List of Jobs

If you run more then one job or you want to find a specific job ID

bacalhau list [flags]

You can use the command with appropriate flags to list jobs on the network in yaml format.

Endpoint: `GET /api/v1/orchestrator/jobs`

You can use List Jobs API Documentation to retrieve a list of jobs.

Job Executions

To list executions follow the following commands.

bacalhau job executions [id] [flags]

You can use the command with appropriate flags to list all executions associated with a job, identified by its ID, in yaml format.

Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/executions`

You can use Job Executions API Documentation to retrieve all executions for a particular job.

Chapter 4 - Monitoring and Management

The Bacalhau client provides the user with tools to monitor and manage the execution of jobs. You can get information about status, progress and decide on next steps. View the Bacalhau Agent APIs if you want to know the node's health, capabilities, and deployed Bacalhau version. To get information about the status and characteristics of the nodes in the cluster use Nodes API Documentation.

Stop a Job

bacalhau cancel [id] [flags]

You can use the command with appropriate flags to cancel a job that was previously submitted and stop it running if it has not yet completed.

Endpoint: `DELETE /api/v1/orchestrator/jobs/:jobID`

You can use Stop Job API Documentation to terminate a specific job asynchronously.

Job History

bacalhau job history [id] [flags]

You can use the command with appropriate flags to enumerate the historical events related to a job, identified by its ID.

Endpoint: `GET /api/v1/orchestrator/jobs/:jobID/history`

You can use Job History API Documentation to retrieve historical events for a specific job.

Job Logs

bacalhau logs [flags] [id]

You can use this command to retrieve the log output (stdout, and stderr) from a job. If the job is still running it is possible to follow the logs after the previously generated logs are retrieved.

To familiarize yourself with all the commands used in Bacalhau, please view CLI Commands

Installation

Install the Bacalhau CLI

In this tutorial, you'll learn how to install and run a job with the Bacalhau client using the Bacalhau CLI or Docker.

Step 1 - Install the Bacalhau Client

The Bacalhau client is a command-line interface (CLI) that allows you to submit jobs to the Bacalhau. The client is available for Linux, macOS, and Windows. You can also run the Bacalhau client in a Docker container.

By default, you will submit to the Bacalhau public network, but the same CLI can be configured to submit to a private Bacalhau network. For more information, please read Running Bacalhau on a Private Network.

Step 1.1 - Install the Bacalhau CLI

You can install or update the Bacalhau CLI by running the commands in a terminal. You may need sudo mode or root password to install the local Bacalhau binary to /usr/local/bin:

curl -sL https://get.bacalhau.org/install.sh | bash

Windows users can download the latest release tarball from Github and extract bacalhau.exe to any location available in the PATH environment variable.

docker image rm -f ghcr.io/bacalhau-project/bacalhau:latest # Remove old image if it exists
docker pull ghcr.io/bacalhau-project/bacalhau:latest

To run a specific version of Bacalhau using Docker, use the command docker run -it ghcr.io/bacalhau-project/bacalhau:v1.0.3, where v1.0.3 is the version you want to run; note that the latest tag will not re-download the image if you have an older version. For more information on running the Docker image, check out the Bacalhau docker image example.

Step 1.2 - Verify the Installation

To verify installation and check the version of the client and server, use the version command. To run a Bacalhau client command with Docker, prefix it with docker run ghcr.io/bacalhau-project/bacalhau:latest.

bacalhau version

docker run -it ghcr.io/bacalhau-project/bacalhau:latest version

If you're wondering which server is being used, the Bacalhau Project has a demo network that's shared with the community. This network allows you to familiarize with Bacalhau's capabilities and launch jobs from your computer without maintaining a compute cluster on your own.

Step 2 - Submit a Hello World job

To submit a job in Bacalhau, we will use the bacalhau docker run command. The command runs a job using the Docker executor on the node. Let's take a quick look at its syntax:

bacalhau docker run [FLAGS] IMAGE[:TAG] [COMMAND]

bacalhau docker run ubuntu echo Hello World

We will use the command to submit a Hello World job that runs an echo program within an Ubuntu container.

Let's take a look at the results of the command execution in the terminal:

Job successfully submitted. Job ID: f8e7789d-8e76-4e6c-8e71-436e2d76c72e
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

    Communicating with the network  ................  done ✅  0.2s
       Creating job for submission  ................  done ✅  0.7s
                   Job in progress  ................  done ✅  2.1s

To download the results, execute:
    bacalhau get f8e7789d-8e76-4e6c-8e71-436e2d76c72e

To get more details about the run, execute:
    bacalhau describe f8e7789d-8e76-4e6c-8e71-436e2d76c72e

After the above command is run, the job is submitted to the public network, which processes the job and Bacalhau prints out the related job id:

Job successfully submitted. Job ID: 9d20bbad-c3fc-48f8-907b-1da61c927fbd
Checking job status...

The job_id above is shown in its full form. For convenience, you can use the shortened version, in this case: 9d20bbad.

While this command is designed to resemble Docker's run command which you may be familiar with, Bacalhau introduces a whole new set of flags to support its computing model.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
                docker run \
                --id-only \
                --wait \
                ubuntu:latest -- \
                sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Let's take a look at the results of the command execution in the terminal:

14:02:25.992 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
19b105c9-4cb5-43bd-a12f-d715d738addd

Step 3 - Checking the State of your Jobs

After having deployed the job, we now can use the CLI for the interaction with the network. The jobs were sent to the public demo network, where it was processed and we can call the following functions. The job_id will differ for every submission.

Step 3.1 - Job status:

You can check the status of the job using bacalhau list command adding the --id-filter flag and specifying your job id.

bacalhau list --id-filter 9d20bbad

Let's take a look at the results of the command execution in the terminal:

CREATED   ID          JOB                                       STATE      PUBLISHED
15:24:31  0ed7617d    Type:"docker",Params:"map[Entrypoint:<ni  Completed
                       l> EnvironmentVariables:[] Image:ubuntu:
                       latest Parameters:[sh -c uname -a && ech
                       o "Hello from Docker Bacalhau!"] Working
                       Directory:]"

When it says Completed, that means the job is done, and we can get the results.

For a comprehensive list of flags you can pass to the list command check out the related CLI Reference page

Step 3.2 - Job information:

You can find out more information about your job by using bacalhau describe.

bacalhau describe 9d20bbad

Let's take a look at the results of the command execution in the terminal:

Job:
    APIVersion: V1beta2
    Metadata:
        ClientID: 0ff57b2521334a92e9ddab4b2f8202c887b1eaa35d2aa945ab0e247d3bc0aa88
        CreatedAt: "2023-12-21T15:24:31.750306239Z"
        ID: 0ed7617d-d5ff-40f7-8411-89830b3f3058
        Requester:
        RequesterNodeID: QmbxGSsM6saCTyKkiWSxhJCt6Fgj7M9cns1vzYtfDbB5Ws
        RequesterPublicKey: CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDEHTUAD1JzO0130W9vsaDGhU0PVgpcNjG3fYlE0sJ1BiBWENFuP4jx3Q9alcjNGhdRFdju0Mb/fidTOtJcPhxTdb+H6JxFP6HsADGes9jU4ylBU2SL2vfdb0KXzKdXjNHGGf4BuCGTcH07Oqxp209diK/cT7takL2fLjcgs1tM+6PzlfGzFqCPxvh9Sa0ek34mdmHjcp1XH8yjF1OKOuHvD+pYphqvOBL/2LEN+EBC4fz/QUnhUajCmKYO83MJcNUXSGxb4AN6K3DpVV+cJph7fj9ADdP7i996o2S4Gkz8W4Wpt/jICaPpkUjmyU3Jgcw7MHkZaYEzWxnnO2J936+pAgMBAAE=
    Spec:
    ...

This outputs all information about the job, including stdout, stderr, where the job was scheduled, and so on.

Step 3.3 - Job download:

You can download your job results directly by using bacalhau get.

bacalhau get 9d20bbad

This results in

Fetching results of job '0ed7617d'...
Results for job '0ed7617d' have been written to...
/Users/test/job-0ed7617d

In the command below, we created a directory called myfolder and download our job output to be stored in that directory.

Fetching results of job '0ed7617d'...
Results for job '0ed7617d' have been written to...
/myfolder

While executing this command, you may encounter warnings regarding receive and send buffer sizes: failed to sufficiently increase receive buffer size. These warnings can arise due to limitations in the UDP buffer used by Bacalhau to process tasks. Additional information can be found in https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes.

After the download has finished you should see the following contents in the results directory.

job-0ed7617d
├── exitCode
├── outputs
├── stderr
└── stdout

Step 4 - Viewing your Job Output

cat job-9d20bbad/stdout

That should print out the string Hello World.

Hello world

Step 5 - Where to go next?

Here are few resources that provide a deeper dive into running jobs with Bacalhau:

How Bacalhau works, Setting up Bacalhau, Examples & Use Cases

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Docker Workload Onboarding

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

You can check out this example tutorial on to see how we used all these steps together.

Requirements

Here are few things to note before getting started:

Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.
Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.

You can check to see a used by the Bacalhau team

Note: Only about a third of examples have their containers here. If you can't find one, feel free to contact the team.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

Limited Ingress/Egress Networking:

All ingress/egress networking is limited as described in the documentation. You won't be able to pull data/code/weights/ etc. from an external source.

Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to the Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.

Step 3 - Build and Push Your Image To a Registry

Most Bacalhau nodes are of an x86_64 architecture, therefore containers should be built for x86_64 systems.

For example:

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

Let's see what each command will be used for:

Exports the current working directory of the host system to the LOCAL_INPUT_DIR variable. This variable will be used for binding a volume and transferring data into the container.

Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.

Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.

Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.

For example:

The result of the commands' execution is shown below:

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can choose to

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command.

To check the status of your job, run the following command.

To get more information on your job, you can run the following command.

To download your job, run.

To put this all together into one would look like the following.

This outputs the following.

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

This can often be resolved by re-tagging your docker image

Support

WebAssembly (WASM) Workloads

Bacalhau supports running programs that are compiled to WebAssembly (WASM). With the Bacalhau client, you can upload WASM programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

Prerequisites and Limitations

Supported WebAssembly System Interface (WASI) Bacalhau can run compiled WASM programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions All ingress/egress networking is disabled – you won't be able to pull data/code/weights etc. from an external source. WASM jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.
Single-Threading There is no multi-threading as WASI does not expose any interface for it.

Onboarding Your Workload

Step 1: Replace network operations with filesystem reads and writes

If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:

Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.
Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of WASM jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.

By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.

You can specify additional or different output mounts using the -o flag.

Step 2: Configure your compiler to output WASI-compliant WebAssembly

You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.

For example, Rust users can specify the wasm32-wasi target to rustup and cargo to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.

Step 3: Upload the input data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

Copy data from a URL to public storage
Pin Data to public storage
Copy Data from S3 Bucket to public storage.

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 4: Run your program

You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.

bacalhau wasm run

Run Locally Compiled Program:

If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:

bacalhau wasm run main.wasm

The program you specify will be uploaded to a Bacalhau storage node and will be publicly available.

Alternative Program Specification:

You can use a Content IDentifier (CID) for a specific WebAssembly program.

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ

Input Data Specification:

Make sure to specify any input data using --input flag.

bacalhau wasm run --input http://example.com

This ensures the necessary data is available for the program's execution.

Program arguments

You can give the WASM program arguments by specifying them after the program path or CID. If the WASM program is already compiled and located in the current directory, you can run it by adding arguments after the file name:

bacalhau wasm run echo.wasm hello world

For a specific WebAssembly program, run:

bacalhau wasm run Qmajb9T3jBdMSp7xh2JruNrqg3hniCnM6EUVsBocARPJRQ hello world

Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.

For example, instead of hard-coding your program to read from /inputs/data.txt, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run:

bacalhau wasm run prog.wasm /inputs/data.txt

Your language of choice should contain a standard way of reading program arguments that will work with WASI.

Environment variables

You can also specify environment variables using the -e flag.

$ bacalhau wasm run prog.wasm -e HELLO=world

Examples

See the Rust example for a workload that leverages WebAssembly support.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Hardware Setup

Different jobs may require different amounts of resources to execute. Some jobs may have specific hardware requirements, such as GPU. This page describes how to specify hardware requirements for your job.

Please bear in mind that each executor is implemented independently and these docs might be slightly out of date. Double check the man page for the executor you are using with bacalhau [executor] --help.

Docker Executor

The following table describes how to specify hardware requirements for the Docker executor.

Flag

Default

Description

How it Works

When you specify hardware requirements, the job will be offered out to the network to see if there are any nodes that can satisfy the requirements. If there are, the job will be scheduled on the node and the executor will be started.

GPU Setup

Bacalhau supports GPU workloads. Learn how to run a job using GPU workloads with the Bacalhau client.

Prerequisites

The Bacalhau network must have an executor node with a GPU exposed
Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node

Usage

Use following command to see available resources amount:

bacalhau node list --show=capacity

To submit a request for a job that requires more than the standard set of resources, add the --cpu and --memory flags. For example, for a job that requires 2 CPU cores and 4Gb of RAM, use --cpu=2 --memory=4Gb, e.g.:

bacalhau docker run ubuntu echo Hello World --cpu=2 --memory=4Gb

To submit a GPU job request, use the --gpu flag under the docker run command to select the number of GPUs your job requires. For example:

bacalhau docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Limitations

The following limitations currently exist within Bacalhau.

Maximum CPU and memory limits depend on the participants in the network
For GPU:
1. NVIDIA, Intel or AMD GPUs only
2. Only the Docker Executor supports GPUs

Setting Up

Running a Node

Job selection policy

When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.

Job selection probes

If you want more control over making the decision to take on jobs, you can use the --job-selection-probe-exec and --job-selection-probe-http flags.

These are external programs that are passed the following data structure so that they can make a decision about whether or not to take on a job:

The exec probe is a script to run that will be given the job data on stdin, and must exit with status code 0 if the job should be run.

The http probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns a non-positive status code (e.g. >= 400).

For example, the following response will reject the job:

If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.

Authentication & Authorization

How to configure authentication and authorization on your Bacalhau node.

Authentication and authorization

Bacalhau includes a flexible auth system that supports multiple methods of auth that are appropriate for different deployment environments.

By default

With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.

In anonymous mode, Bacalhau will allow:

Users identified by a self-generated private key to submit any job and cancel their own jobs.
Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.

Restricting anonymous access

Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.

Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:

Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.

Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:

Then, modify the allowed_clients variable in challange_ns_no_anon.rego to include acceptable client IDs, found by running bacalhau id.

Once the node is restarted, only keys in the allowed list will be able to access any API.

Username and password access

Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.

Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt. To generate a salted password, the helper script in pkg/authn/ask/gen_password can be used:

This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego.

Writing custom policies

In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.

Custom authentication policies

Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input variable. The exact information depends on the type of authentication used.

`challenge` authentication

challenge authentication uses identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.

Policies used for challenge authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.

They should expect as fields on the input variable:

clientId: an ID derived from the user's private key that identifies them uniquely
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be this policy that returns the same opaque token for all users:

`ask` authentication

ask authentication uses credentials supplied manually by the user as identification. For example, an ask policy could require a username and password as input and check these against a known list. ask policies do all the verification of the supplied credentials.

Policies for this type will need to implement these rules:

bacalhau.authn.token: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.
bacalhau.authn.schema: a static JSON schema that should be used to collect information about the user. The type of declared fields may be used to pick the input method, and if a field is marked as writeOnly then it will be collected in a secure way (e.g. not shown on screen). The schema rule does not receive any input data.

They should expect as fields on the input variable:

ask: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.
nodeId: the ID of the requester node that this user is authenticating with
signingKey: the private key (as a JWK) that should be used to sign any access tokens to be returned

The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:

Custom authorization policies

Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.

Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthorized users.

Policies will need to implement these rules:

bacalhau.authz.token_valid: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.
bacalhau.authz.allow: true if the user should be permitted to carry out the input request, false otherwise.

They should expect as fields on the input variable for both rules:

http: details of the user's HTTP request:
- host: the hostname used in the HTTP request
- method: the HTTP method (e.g. GET, POST)
- path: the path requested, as an array of path components without slashes
- query: a map of URL query parameters to their values
- headers: a map of HTTP header names to arrays representing their values
- body: a blob of any content submitted as the body
constraints: details about the receiving node that should be used to validate any supplied tokens:
- cert: keys that the input token should have been signed with
- iss: the name of a node that this node will recognize as the issuer of any signed tokens
- aud: the name of this node that is receiving the request

Notably, the constraints data is appropriate to be passed directly to the Rego io.jwt.decode_verify method which will validate the access token as a JWT against the given constraints.

The simplest possible authz policy might be this one that allows all users to access all endpoints:

Configuration Overview

How to configure your Bacalhau node.

Bacalhau employs the and libraries for configuration management. Users can configure their Bacalhau node through a combination of command-line flags, environment variables, and the dedicated configuration file.

The Bacalhau Repo

Bacalhau manages its configuration, metadata, and internal state within a specialized repository named .bacalhau. Serving as the heart of the Bacalhau node, this repository holds the data and settings that determine node behavior. It's located on the filesystem, and by default, Bacalhau initializes this repository at $HOME/.bacalhau, where $HOME is the home directory of the user running the bacalhau process.

To customize this location, users can:

Set the BACALHAU_DIR environment variable to specify their desired path.
Utilize the --repo command line flag to specify their desired path.

Upon executing a Bacalhau command for the first time, the system will initialize the .bacalhau repository. If such a repository already exists, Bacalhau will seamlessly access its contents.

Structure of a Newly Initialized .bacalhau Repository

Below is the structure of a freshly initialized `.bacalhau` repository:

This repository comprises four directories and seven files:

Files

user_id.pem:
- This file houses the Bacalhau node user's cryptographic private key, used for signing requests sent to a Requester Node.
- Format: PEM.
repo.version:
- Indicates the version of the Bacalhau node's repository.
- Format: JSON, e.g., {"Version":1}.
libp2p_private_key:
- Format: Base64 encoded RSA private key.
config.yaml:
- Contains configuration settings for the Bacalhau node.
- Format: YAML.
update.json:
- A file containing the date/time when the last version check was made.
- Format: JSON, e.g., {"LastCheck":"2024-01-24T11:06:14.631816Z"}
tokens.json:
- A file containing the tokens obtained through authenticating with bacalhau clusters.

Directories

QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv-compute:
- Note: The segment QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv is a unique NodeID for each Bacalhau node, derived from the libp2p_private_key.
QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv-requester:
- Note: NodeID derivation is similar to the Compute directory.
executor_storages:
- Storage for data handled by Bacalhau storage drivers.
plugins:
- Houses binaries that allow the Compute node to execute specific tasks.
- Note: This feature is currently experimental and isn't active during standard node operations.

Configuring a Bacalhau Node

Within a .bacalhau repository, a config.yaml file may be present. This file serves as the configuration source for the bacalhau node and adheres to the YAML format.

Although the config.yaml file is optional, its presence allows Bacalhau to load custom configurations; otherwise, Bacalhau is configured with built-in default values, environment variables and command line flags.

Modifications to the config.yaml file will not be dynamically loaded by the Bacalhau node. A restart of the node is required for any changes to take effect. Bacalhau determines its configuration based on the following precedence order, with each item superseding the subsequent:

Command-line Flag
Environment Variable
Config File
Defaults

Relationship Between `config.yaml` and Bacalhau Environment Variables

Bacalhau establishes a direct relationship between the value-bearing keys within the config.yaml file and corresponding environment variables. For these keys that have no further sub-keys, the environment variable name is constructed by capitalizing each segment of the key, and then joining them with underscores, prefixed with BACALHAU_.

For example, a YAML key with the path Node.IPFS.Connect translates to the environment variable BACALHAU_NODE_IPFS_CONNECT and is represented in a file like:

There is no corresponding environment variable for either Node or Node.IPFS. Config values may also have other environment variables that set them for simplicity or to maintain backwards compatibility.

Environments

Bacalhau leverages the BACALHAU_ENVIRONMENT environment variable to determine the specific environment configuration when initializing a repository. Notably, if a .bacalhau repository has already been initialized, the BACALHAU_ENVIRONMENT setting will be ignored.
By default, if the BACALHAU_ENVIRONMENT variable is not explicitly set by the user, Bacalhau will adopt the production environment settings.
Below is a breakdown of the configurations associated with each environment:
1. Production (public network)
- Environment Variable: BACALHAU_ENVIRONMENT=production
- Configurations:
  - Node.ClientAPI.Host: "bootstrap.production.bacalhau.org"
  - Node.Client.API.Host: 1234
  - ...other configurations specific to this environment...
2. Staging (staging network)
- Environment Variable: BACALHAU_ENVIRONMENT=staging
- Configurations:
  - Node.ClientAPI.Host: "bootstrap.staging.bacalhau.org"
  - Node.Client.API.Host: 1234
  - ...other configurations specific to this environment...
3. Development (development network)
- Environment Variable: BACALHAU_ENVIRONMENT=development
- Configurations:
  - Node.ClientAPI.Host: "bootstrap.development.bacalhau.org"
  - Node.Client.API.Host: 1234
  - ...other configurations specific to this environment...
4. Local (private or local networks)
- Environment Variable: BACALHAU_ENVIRONMENT=local
- Configurations:
  - Node.ClientAPI.Host: "0.0.0.0"
  - Node.Client.API.Host: 1234
  - ...other configurations specific to this environment...

Usage Examples

How to initialize a Bacalhau Server for a local private network

How to initialize a Bacalhau Server with a custom repo path

How to start a Bacalhau Server with DEBUG logs

Configuring Transport Level Security

How to configure TLS for the requester node APIs

By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure it to use Transport Level Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.

Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API

Getting a certificate from Let's Encrypt with ACME

Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Let's Encrypt Certificate Authority.

Using the --autocert [hostname] parameter to the CLI (in the serve and devstack commands), a certificate is obtained automatically from Lets Encrypt. The provided hostname should be a comma-separated list of hostnames, but they should all be publicly resolvable as Lets Encrypt will attempt to connect to the server to verify ownership (using the ACME HTTP-01 challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards they are then cached in the bacalhau repository.

Alternatively, you may set these options via the environment variable, BACALHAU_AUTO_TLS. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.AutoCert instead.

As a result of the Lets Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.

sudo setcap CAP_NET_BIND_SERVICE+ep $(which bacalhau)

A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache, and this will be used to manage renewals to avoid rate limits.

Getting a certificate from a Certificate Authority

Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:

Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.
Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.
Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.
Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.
Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.
Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.

Once you have obtained your certificates, you will need to put two files in a location that bacalhau can read them. You need the server certificate, often called something like server.cert or server.cert.pem, and the server key which is often called something like server.key or server.key.pem.

Once you have these two files available, you must start bacalhau serve which two new flags. These are tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

Self-signed certificates

If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here although there is a helpful script in the Bacalhau github repository which should provide a good starting point.

Once you have generated the necessary files, the steps are much like above, you must start bacalhau serve which two new flags. These are tlscert and tlskey flags, whose arguments should point to the relevant file. An example of how it is used is:

bacalhau serve --node-type=requester --tlscert=server.cert --tlskey=server.key

Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT and BACALHAU_TLS_KEY. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate and Node.ServerAPI.TLS.ServerKey instead.

If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:

Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities.
Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.
Instruct the software library you are using not to verify HTTPS requests.

GPU Installation

How to enable GPU support on your Bacalhau node

Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.

Prerequisites

Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.

In general for GPUs from any vendor, the Bacalhau client requires:

Nvidia

NVIDIA GPU Drivers
NVIDIA Container Toolkit (nvidia-docker2)
Verify installation by Running a Sample Workload
nvidia-smi installed and functional

AMD

AMD GPU drivers
rocm-smi tool installed and functional

See the Running ROCm Docker containers for guidance on how to run Docker workloads on AMD GPU.

Intel

Intel GPU drivers
xpu-smi tool installed and functional

See the Running on GPU under docker for guidance on how to run Docker workloads on Intel GPU.

GPU Node Configuration

Access to GPUs can be controlled using resource limits. To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.

Observability

Bacalhau supports the three main 'pillars' of observability - logging, metrics, and tracing. Bacalhau uses the OpenTelemetry Go SDK for metrics and tracing, which can be configured using the standard environment variables. Exporting metrics and traces can be as simple as setting the OTEL_EXPORTER_OTLP_PROTOCOL and OTEL_EXPORTER_OTLP_ENDPOINT environment variables. Custom code is used for logging as the OpenTelemetry Go SDK currently doesn't support logging.

Logging

Logging in Bacalhau outputs in human-friendly format to stderr at INFO level by default, but this can be changed by two environment variables:

LOG_LEVEL - Can be one of trace, debug, error, warn or fatal to output more or fewer logging messages as required
LOG_TYPE - Can be one of the following values:
- default - output logs to stderr in a human-friendly format
- json - log messages outputted to stdout in JSON format
- combined - log JSON formatted messages to stdout and human-friendly format to stderr

Log statements should include the relevant trace, span and job ID so it can be tracked back to the work being performed.

Metrics

Bacalhau produces a number of different metrics including those around the libp2p resource manager (rcmgr), performance of the requester HTTP API and the number of jobs accepted/completed/received.

Tracing

Traces are produced for all major pieces of work when processing a job, although the naming of some spans is still being worked on. You can find relevant traces covering working on a job by searching for the jobid attribute.

Viewing

The metrics and traces can easily be forwarded to a variety of different services as we use OpenTelemetry, such as Honeycomb or Datadog.

To view the data locally, or simply to not use a SaaS offering, you can start up Jaeger and Prometheus placing these three files into a directory then running docker compose start while running Bacalhau with the OTEL_EXPORTER_OTLP_PROTOCOL=grpc and OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 environment variables.

version: "2"
services:

  jaeger-all-in-one:
    image: "jaegertracing/all-in-one:1.42"
    restart: "always"
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250" # Jaeger gRPC endpoint

  otel-collector:
    image: "otel/opentelemetry-collector:0.70.0"
    restart: "always"
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - "./otel-collector-config.yaml:/etc/otel-collector-config.yaml"
    ports:
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "4317:4317"   # OTLP gRPC receiver
    depends_on:
      - "jaeger-all-in-one"
      - "prometheus"

  prometheus:
    container_name: "prometheus"
    image: "prom/prometheus:v2.42.0"
    restart: "always"
    volumes:
      - "./prometheus.yaml:/etc/prometheus/prometheus.yml"
    ports:
      - "9090:9090" # Prometheus UI

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  jaeger:
    endpoint: jaeger-all-in-one:14250
    tls:
      insecure: true

processors:
  batch:

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 10s
    static_configs:
      - targets: ['otel-collector:8889']
      - targets: ['otel-collector:8888']

Configuring node persistence

How to configure compute/requester persistence

Both compute nodes, and requester nodes, maintain state. How that state is maintained is configurable, although the defaults are likely adequate for most use-cases. This page describes how to configure the persistence of compute and requester nodes should the defaults not be suitable.

Compute node persistence

The computes nodes maintain information about the work that has been allocated to them, including:

The current state of the execution, and
The original job that resulted in this allocation

This information is used by the compute and requester nodes to ensure allocated jobs are completed successfully. By default, compute nodes store their state in a bolt-db database and this is located in the bacalhau repository along with configuration data. For a compute node whose ID is "abc", the database can be found in ~/.bacalhau/abc-compute/executions.db.

In some cases, it may be preferable to maintain the state in memory, with the caveat that should the node restart, all state will be lost. This can be configured using the environment variables in the table below.

Environment Variable

Flag alternative

Value

Effect

Requester node persistence

When running a requester node, it maintains state about the jobs it has been requested to orchestrate and schedule, the evaluation of those jobs, and the executions that have been allocated. By default, this state is stored in a bolt db database that, with a node ID of "xyz" can be found in ~/.bacalhau/xyz-requester/jobs.db.

Environment Variable

Flag alternative

Value

Effect

Resource Limits

These are the flags that control the capacity of the Bacalhau node, and the limits for jobs that might be run.

  --limit-job-cpu string                 Job CPU core limit for single job (e.g. 500m, 2, 8).
  --limit-job-gpu string                 Job GPU limit for single job (e.g. 1, 2, or 8).
  --limit-job-memory string              Job Memory limit for single job  (e.g. 500Mb, 2Gb, 8Gb).
  --limit-total-cpu string               Total CPU core limit to run all jobs (e.g. 500m, 2, 8).
  --limit-total-gpu string               Total GPU limit to run all jobs (e.g. 1, 2, or 8).
  --limit-total-memory string            Total Memory limit to run all jobs  (e.g. 500Mb, 2Gb, 8Gb).

The --limit-total-* flags control the total system resources you want to give to the network. If left blank, the system will attempt to detect these values automatically.

The --limit-job-* flags control the maximum amount of resources a single job can consume for it to be selected for execution.

Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.

Connect Storage

Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources storage resources consumed as inputs to jobs. And Publishers storage resources created with the results of jobs.

Sources

S3

Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the for more details.

To use the S3 source, you will have to to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChechsumSHA256.

Below is an example of how to define an S3 input source in YAML format:

IPFS

The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.

You can then configure your Bacalhau node to use this IPFS server by passing the --ipfs-connect argument to the serve command:

Below is an example of how to define an IPFS input source in YAML format:

To use a local data source, you will have to to:

Enable the use of local data when configuring the node itself by using the --allow-listed-local-paths flag for bacalhau serve, specifying the file path and access mode. For example

In the job description specify parameters SourcePath - the absolute path on the compute node where your data is located and ReadWrite - the access mode.

Below is an example of how to define a Local input source in YAML format:

To use a URL data source, you will have to to specify only URL parameter, as in the part of the declarative job description below:

Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:

To speed up the download or to retrieve results from a private IPFS node, pass the swarm multiaddress to bacalhau get to download results.

Pass the swarm key to bacalhau get if the IPFS swarm is a private swarm.

And part of the declarative job description with an IPFS publisher will look like this:

The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.

Here is an example of part of the declarative job description with a local publisher:

Test Network Locally

Before you join the main Bacalhau network, you can test locally.

To test, you can use the bacalhau devstack command, which offers a way to get a 3 node cluster running locally.

By settings PREDICTABLE_API_PORT=1 , the first node of our 3 node cluster will always listen on port 20000

In another window, export the following environment variables so that the Bacalhau client binary connects to our local development cluster:

You can now interact with Bacalhau - all jobs are running by the local devstack cluster.

Job execution timeouts

Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled and no results will be published.

By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.

Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that results in their jobs waiting indefinitely.

Configuring execution time limits for a job

Job submitters can pass the --timeout flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.

The timeout can also be added to an existing job spec by adding the Timeout property to the Spec.

Configuring execution time limits for a node

Node operators can pass the --max-job-execution-timeout flag to bacalhau serve to configure the maximum job time limit. The supplied value should be a numeric value followed by a time unit (one of s for seconds, m for minutes or h for hours).

Node operators can also use configuration properties to configure execution limits.

Compute nodes will use the properties:

Config property

Meaning

Requester nodes will use the properties:

Config property

Meaning

Bacalhau WebUI

How to run the WebUI.

Overview

The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up, deploying, and utilizing the WebUI.

For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.

Spinning Up the WebUI Locally

Prerequisites

Ensure you have a Bacalhau v1.1.7 or later installed.

Running the WebUI

To launch the WebUI locally, execute the following command:

bacalhau serve --node-type=requester,compute --web-ui

This command initializes a requester and compute node, configured to listen on HOST=0.0.0.0 and PORT=1234.

Accessing the Local WebUI

Once started, the WebUI is accessible at http://127.0.0.1/. This local instance allows you to interact with your local Bacalhau network setup.

Accessing the WebUI from the Browser

For observational purposes, a development version of the WebUI is available at bootstrap.development.bacalhau.org. This instance displays jobs from the development server.

N.b. The development version of the WebUI is for observation only and may not reflect the latest changes or features available in the local setup.

Windows support

Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like ) are not present in Windows-based nodes.

Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and to support Linux containers, e.g. using the WSL-based backend.

Workload Onboarding

This directory contains examples relating to performing common tasks with Bacalhau.

Container

Docker Workload Onboarding

How to use docker containers with Bacalhau

Docker Workloads

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

You can check out this example tutorial on to see how we used all these steps together.

Requirements

Here are few things to note before getting started:

Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.
Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.

You can check to see a used by the Bacalhau team

Note: Only about a third of examples have their containers here. The rest are under random docker hub registries.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

Limited Ingress/Egress Networking:

All ingress/egress networking is limited as described in the documentation. You won't be able to pull data/code/weights/ etc. from an external source.

Data Passing with Docker Volumes:

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to the Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.

Step 3 - Build and Push Your Image To a Registry

For example:

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

Let's see what each command will be used for:

For example:

The result of the commands' execution is shown below:

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command:

To check the status of your job, run the following command:

To get more information on your job,run:

To download your job, run:

For example, running:

outputs:

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

This can often be resolved by re-tagging your docker image

Support

WebAssembly (WASM) Workloads

Bacalhau supports running programs that are compiled to . With the Bacalhau client, you can upload WASM programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

Prerequisites and Limitations

Supported WebAssembly System Interface (WASI) Bacalhau can run compiled WASM programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions All ingress/egress networking is disabled – you won't be able to pull data/code/weights etc. from an external source. WASM jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.
Single-Threading There is no multi-threading as WASI does not expose any interface for it.

Onboarding Your Workload

Step 1: Replace network operations with filesystem reads and writes

If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:

Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.
Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of WASM jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.

By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.

You can specify additional or different output mounts using the -o flag.

Step 2: Configure your compiler to output WASI-compliant WebAssembly

You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.

Step 3: Upload the input data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 4: Run your program

You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.

Run Locally Compiled Program:

If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:

The program you specify will be uploaded to a Bacalhau storage node and will be publicly available.

Alternative Program Specification:

You can use a Content IDentifier (CID) for a specific WebAssembly program.

Input Data Specification:

Make sure to specify any input data using --input flag.

This ensures the necessary data is available for the program's execution.

Program arguments

For a specific WebAssembly program, run:

Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.

Your language of choice should contain a standard way of reading program arguments that will work with WASI.

Environment variables

You can also specify environment variables using the -e flag.

Examples

Support

Bacalhau Docker Image

How to use the Bacalhau Docker image

This documentation explains how to use the Bacalhau Docker image to run tasks and manage them using the Bacalhau client.

Prerequisites

To get started, you need to install the Bacalhau client (see more information here) and Docker.

1. Pull the Bacalhau Docker image

The first step is to pull the Bacalhau Docker image from the Github container registry.

docker pull ghcr.io/bacalhau-project/bacalhau:latest

Expected output:

latest: Pulling from bacalhau-project/bacalhau
d14ccdd25413: Pull complete
621f190d05c8: Pull complete
Digest: sha256:3cda5619984de9b56c738c50f94188684170f54f7e417f8dcbe74ff8ec8eb434
Status: Downloaded newer image for ghcr.io/bacalhau-project/bacalhau:latest
ghcr.io/bacalhau-project/bacalhau:latest

You can also pull a specific version of the image, e.g.:

docker pull ghcr.io/bacalhau-project/bacalhau:v0.3.16

Remember that the "latest" tag is just a string. It doesn't refer to the latest version of the Bacalhau client, it refers to an image that has the "latest" tag. Therefore, if your machine has already downloaded the "latest" image, it won't download it again. To force a download, you can use the --no-cache flag.

2. Check version

To check the version of the Bacalhau client, run:

docker run -t ghcr.io/bacalhau-project/bacalhau:latest version

Expected Output:

13:38:54.518 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
CLIENT  SERVER  UPDATE MESSAGE
v1.2.0  v1.2.0

3. Running a Bacalhau Job

In the example below, an Ubuntu-based job runs to print the message 'Hello from Docker Bacalhau':

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
    docker run \
        --id-only \
        --wait \
        ubuntu:latest \
        -- sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'

Structure of the command

ghcr.io/bacalhau-project/bacalhau:latest : Name of the Bacalhau Docker image
--id-only: Output only the job id
--wait: Wait for the job to finish
ubuntu:latest. Ubuntu container
--: Separate Bacalhau parameters from the command to be executed inside the container
sh -c 'uname -a && echo "Hello from Docker Bacalhau!"': The command executed inside the container

Let's have a look at the command execution in the terminal:

13:53:46.478 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
ab95a5cc-e6b7-40f1-957d-596b02251a66

The output you're seeing is in two parts: The first line: 13:53:46.478 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production' is an informational message indicating the initialization of a repository at the specified directory ('/root/.bacalhau') for the production environment. The second line: ab95a5cc-e6b7-40f1-957d-596b02251a66 is a job ID, which represents the result of executing a command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store that in an environment variable so that we can reuse it later on (env: JOB_ID=ab95a5cc-e6b7-40f1-957d-596b02251a66)

To print out the content of the Job ID, run the following command:

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
    describe ab95a5cc-e6b7-40f1-957d-596b02251a66 \
        | grep -A 2 "stdout: |"

Expected Output:

stdout: |
        Linux fff680719453 6.2.0-1019-gcp #21~22.04.1-Ubuntu SMP Thu Nov 16 18:18:34 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
        Hello from Docker Bacalhau!

4. Submit a Job With Output Files

One inconvenience that you'll see is that you'll need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine. Let's take a look at the example below:

The first part of the example should look familiar, except for the Docker commands.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
    docker run \
        --id-only \
        --wait \
        --gpu 1 \
        ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 -- \
            python main.py --o ./outputs --p "A Docker whale and a cod having a conversation about the state of the ocean"

When a job is submitted, Bacalhau prints out the related job_id (a46a9aa9-63ef-486a-a2f8-6457d7bafd2e):

09:05:58.434 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
a46a9aa9-63ef-486a-a2f8-6457d7bafd2e

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
    list $JOB_ID \

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
    describe $JOB_ID \

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result directory.

bacalhau get ${JOB_ID} --output-dir result

After the download has finished, you should see the following contents in the results directory.

Support

If have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

How To Work With Custom Containers in Bacalhau

Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here
This example requires Docker. If you don't have Docker installed, you can install it from here. Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.

1. Running Containers

Docker Command

You're likely familiar with executing Docker commands to start a container:

docker run docker/whalesay cowsay sup old fashioned container run

This command runs a container from the docker/whalesay image. The container executes the cowsay sup old fashioned container run command:

_________________________________
< sup old fashioned container run >
 ---------------------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Bacalhau Command

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \ 
    docker/whalesay -- bash -c 'cowsay hello web3 uber-run')

This command also runs a container from the docker/whalesay image, using Bacalhau. We use the bacalhau docker run command to start a job in a Docker container. It contains additional flags such as --wait to wait for job completion and --id-only to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run' command is executed.

When a job is submitted, Bacalhau prints out the related job_id (7e41b9b9-a9e2-4866-9fce-17020d8ec9e0):

7e41b9b9-a9e2-4866-9fce-17020d8ec9e0

We store that in an environment variable so that we can reuse it later on.

You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

Viewing your job output

cat ./results/stdout

 _____________________
< hello web3 uber-run >
 ---------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Both commands execute cowsay in the docker/whalesay container, but Bacalhau provides additional features for working with jobs at scale.

Bacalhau Syntax

Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data is passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.

The --wait flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau list command to check on their status.

Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run command after the -- flag.

2. Building Your Own Custom Container For Bacalhau

To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.

To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distribute it across the Bacalhau network.

# write to the cod.cow
$the_cow = <<"EOC";
   $thoughts
    $thoughts
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `\░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜
EOC

Next, the Dockerfile adds the script and sets the entry point.

# write the Dockerfile
FROM debian:stretch
RUN apt-get update && apt-get install -y cowsay
# "cowsay" installs to /usr/games
ENV PATH $PATH:/usr/games
RUN echo '#!/bin/bash\ncowsay "${@:1}"' > /usr/bin/codsay && \
    chmod +x /usr/bin/codsay
COPY cod.cow /usr/share/cowsay/cows/default.cow

Now let's build and test the container locally.

docker build -t ghcr.io/bacalhau-project/examples/codsay:latest . 2> /dev/null

%%bashdocker run --rm ghcr.io/bacalhau-project/examples/codsay:latest codsay I like swimming in data

Once your container is working as expected then you should push it to a public container registry. In this example, I'm pushing to Github's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64 architecture.

docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/bacalhau-project/examples/codsay:latest .

3. Running Your Custom Container on Bacalhau

Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.

The bacalhau docker run command strips the default entry point, so don't forget to run your entry point in the command line arguments.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/codsay:v1.0.0 \
    -- bash -c 'codsay Look at all this data')

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Download your job results directly by using bacalhau get command.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

View your job output

cat ./results/stdout

_______________________
< Look at all this data >
 -----------------------
   \
    \
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Python

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Sample Recommendation Dataset

We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

Download Movielens1M dataset from this link https://files.grouplens.org/datasets/movielens/ml-1m.zip

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat and movies.dat. After the dataset is downloaded, extract the zip and place ratings.dat and movies.dat into a folder called input:

# Extracting the downloaded zip file
unzip ml-1m.zip

#moving  ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

To create a requirements.txt for the Python libraries we’ll be using, create:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and in it paste the following script

# content of the similar-movies.py

# Imports
import numpy as np
import pandas as pd
import argparse
from distutils.dir_util import mkpath
import warnings
warnings.filterwarnings("ignore")
# Read the files with pandas
data = pd.io.parsers.read_csv('input/ratings.dat',
names=['user_id', 'movie_id', 'rating', 'time'],
engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.io.parsers.read_csv('input/movies.dat',
names=['movie_id', 'title', 'genre'],
engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users

ratings_mat = np.ndarray(
shape=((np.max(data.movie_id.values)), np.max(data.user_id.values)),
dtype=np.uint8)
ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.values

# Normalise matrix (subtract mean off)

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute SVD

normalised_mat = ratings_mat - np.matrix(np.mean(ratings_mat, 1)).T
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)

# Calculate cosine similarity, sort by most similar, and return the top N.

def top_cosine_similarity(data, movie_id, top_n=10):

index = movie_id - 1
# Movie id starts from 1

movie_row = data[index, :]
magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
sort_indexes = np.argsort(-similarity)
return sort_indexes[:top_n]

# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
print('Recommendations for {0}: \n'.format(
movie_data[movie_data.movie_id == movie_id].title.values[0]))
for id in top_indexes + 1:
print(str(id),' : ',movie_data[movie_data.movie_id == id].title.values[0])


parser = argparse.ArgumentParser(description='Personal information')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies',default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie',default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations',default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id # Grab an id from movies.dat
top_n = args.n

# k = 50
# # Grab an id from movies.dat
# movie_id = 1
# top_n = 10

sliced = evecs[:, :k] # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

Read the files with pandas. The code uses Pandas to read data from the files ratings.dat and movies.dat.
Create the ratings matrix of shape (m×u) with rows as movies and columns as user
Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.
Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.
Calculate cosine similarity, sort by most similar, and return the top N.
Select k principal components to represent the movies, a movie_id to find recommendations, and print the top_n results.

For further reading on how the script works, go to Simple Movie Recommender Using SVD | Alyssa

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

We will use the python:3.8 docker image and add our script similar-movies.py to copy the script to the docker image, similarly, we also add the dataset directory and also the requirements, after that run the command to install the dependencies in the image

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name and of the docker image we are using
-- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name of the docker image we are using
-- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Pandas on Bacalhau

Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

In this tutorial example, we will run Pandas script on Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running Pandas Locally

To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.

To get started, you need to install the Pandas library from pip:

pip install pandas

Importing data from CSV to DataFrame

Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv() to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.

# read_csv.py
import pandas as pd

print(pd.read_csv("transactions.csv"))

The overall purpose of the command above is to read data from a CSV file (transactions.csv) using Pandas and print the resulting DataFrame.

To download the transactions.csv file, run:

wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

To output a content of the transactions.csv file, run:

cat transactions.csv

Running the script

Now let's run the script to read in the CSV file. The output will be a DataFrame object.

python3 read_csv.py

2. Ingesting Data

To run Pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.

If you are interested in finding out more about how to ingest your data into IPFS, please see the data ingestion guide.

We've already uploaded the script and data to IPFS to the following CID: QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

3. Running a Bacalhau Job

Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run command to do this:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
    -w /files \
    amancevice/pandas \
    -- python read_csv.py)

Structure of the command

bacalhau docker run: call to Bacalhau
amancevice/pandas : Docker image with pandas installed.
-i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files: Mounting the uploaded dataset to path. The -i flag allows us to mount a file or directory from IPFS into the container. It takes two arguments, the first is the IPFS CID
QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz) and the second is the file path within IPFS (/files). The -i flag can be used multiple times to mount multiple directories.
-w /files Our working directory is /files. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python read_csv.py: python script to read pandas script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running a Python Script

This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running Python Locally

We'll be using a very simple Python script that displays the traditional first greeting. Create a file called hello-world.py:

# hello-world.py
print("Hello, world!")

Running the script to print out the output:

python3 hello-world.py

After the script has run successfully locally we can now run it on Bacalhau.

2. Running a Bacalhau Job

To submit a workload to Bacalhau you can use the bacalhau docker run command. This command allows passing input data into the container using content identifier (CID) volumes, we will be using the --input URL:path argument for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py)

Structure of the command

bacalhau docker run: call to Bacalhau
--id-only: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output
--input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".
python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Running Trivial Python
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.10-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 /inputs/hello-world.py
    InputSources:
      - Target: /inputs
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py
            Path: /inputs/hello-world.py

The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:

bacalhau job run helloworld.yaml

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau get ${JOB_ID} --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Jupyter Notebooks on Bacalhau

Introduction

Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.
Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)

/inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.
-i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.
https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.
jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.
--execute: This flag tells nbconvert to execute the notebook and store the results in the output file.
--to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.
--output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list:

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe:

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

hello_output.nbconvert.ipynb

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

Install Docker on your local machine.
Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

# Use the official Python image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

RUN apt-get update -y

COPY mnist.ipynb .
# Install the Python packages
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install -U scikit-learn

This Dockerfile creates a Docker image based on the official TensorFlow GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb) and a requirements.txt file. It then upgrades pip and installs Python packages from the requirements.txt file, along with scikit-learn. The resulting image provides an environment ready for running the mnist.ipynb notebook with TensorFlow and scikit-learn, as well as other specified dependencies.

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .

Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it with your DockerHub username and the name "your-dockerhub-username/jupyter-mnist-tensorflow".

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

docker push your-dockerhub-username/jupyter-mnist-tensorflow

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)

Structure of the command

--gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.
jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Scripting Bacalhau with Python

Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.

This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Executing Bacalhau Jobs with Python Scripts

To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.

#write following to the file hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm

Now let's create a file called bacalhau.py. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.

# write following to the file bacalhau.py
import json, glob, os, multiprocessing, shutil, subprocess, tempfile, time

# checkStatusOfJob checks the status of a Bacalhau job
def checkStatusOfJob(job_id: str) -> str:
    assert len(job_id) > 0
    p = subprocess.run(
        ["bacalhau", "list", "--output", "json", "--id-filter", job_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    r = parseJobStatus(p.stdout)
    if r == "":
        print("job status is empty! %s" % job_id)
    elif r == "Completed":
        print("job completed: %s" % job_id)
    else:
        print("job not completed: %s - %s" % (job_id, r))

    return r


# submitJob submits a job to the Bacalhau network
def submitJob(cid: str) -> str:
    assert len(cid) > 0
    p = subprocess.run(
        [
            "bacalhau",
            "docker",
            "run",
            "--id-only",
            "--wait=false",
            "--input",
            "ipfs://" + cid + ":/inputs/data.tar.gz",
            "ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if p.returncode != 0:
        print("failed (%d) job: %s" % (p.returncode, p.stdout))
    job_id = p.stdout.strip()
    print("job submitted: %s" % job_id)

    return job_id


# getResultsFromJob gets the results from a Bacalhau job
def getResultsFromJob(job_id: str) -> str:
    assert len(job_id) > 0
    temp_dir = tempfile.mkdtemp()
    print("getting results for job: %s" % job_id)
    for i in range(0, 5): # try 5 times
        p = subprocess.run(
            [
                "bacalhau",
                "get",
                "--output-dir",
                temp_dir,
                job_id,
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if p.returncode == 0:
            break
        else:
            print("failed (exit %d) to get job: %s" % (p.returncode, p.stdout))

    return temp_dir


# parseJobStatus parses the status of a Bacalhau job
def parseJobStatus(result: str) -> str:
    if len(result) == 0:
        return ""
    r = json.loads(result)
    if len(r) > 0:
        return r[0]["State"]["State"]
    return ""


# parseHashes splits lines from a text file into a list
def parseHashes(filename: str) -> list:
    assert os.path.exists(filename)
    with open(filename, "r") as f:
        hashes = f.read().splitlines()
    return hashes


def main(file: str, num_files: int = -1):
    # Use multiprocessing to work in parallel
    count = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=count) as pool:
        hashes = parseHashes(file)[:num_files]
        print("submitting %d jobs" % len(hashes))
        job_ids = pool.map(submitJob, hashes)
        assert len(job_ids) == len(hashes)

        print("waiting for jobs to complete...")
        while True:
            job_statuses = pool.map(checkStatusOfJob, job_ids)
            total_finished = sum(map(lambda x: x == "Completed", job_statuses))
            if total_finished >= len(job_ids):
                break
            print("%d/%d jobs completed" % (total_finished, len(job_ids)))
            time.sleep(2)

        print("all jobs completed, saving results...")
        results = pool.map(getResultsFromJob, job_ids)
        print("finished saving results")

        # Do something with the results
        shutil.rmtree("results", ignore_errors=True)
        os.makedirs("results", exist_ok=True)
        for r in results:
            path = os.path.join(r, "outputs", "*.csv")
            csv_file = glob.glob(path)
            for f in csv_file:
                print("moving %s to results" % f)
                shutil.move(f, "results")

if __name__ == "__main__":
    main("hashes.txt", 10)

This code has a few interesting features:

Change the value in the main call (main("hashes.txt", 10)) to change the number of jobs to execute.
Because all jobs are complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True loop is used to monitor the status of jobs and wait for them to complete.
When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5) loop in the getResultsFromJob function involves retrying the bacalhau get operation if it fails to complete successfully.

Let's run it!

python3 bacalhau.py

Hopefully, the results directory contains all the combined results from the jobs we just executed. Here's we're expecting to see CSV files:

ls results

transactions_00000000_00049999.csv  transactions_00150000_00199999.csv
transactions_00050000_00099999.csv  transactions_00200000_00249999.csv
transactions_00100000_00149999.csv  transactions_00250000_00299999.csv

Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.

Next Steps

You might also be interested in the following examples:

Analysing Data with Python Pandas

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

R (language)

Building and Running your Custom R Containers on Bacalhau

Introduction

This example will walk you through building Time Series Forecasting using Prophet. Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

Quick script to run custom R container on Bacalhau:

bacalhau docker run \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf"

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running Prophet in R Locally

Open R studio or R-supported IDE. If you want to run this on a notebook server, then make sure you use an R kernel. Prophet is a CRAN package, so you can use install.packages to install the prophet package:

R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"

After installation is finished, you can download the example data that is stored in IPFS:

wget https://w3s.link/ipfs/QmZiwZz7fXAvQANKYnt7ya838VPpj4agJt5EDvRYp3Deeo/example_wp_log_R.csv

The code below instantiates the library and fits a model to the data.

mkdir -p outputs
mkdir -p R

Create a new file called Saturating-Forecasts.R and in it paste the following script:

# content of the Saturating-Forecasts.R

# Library Inclusion
library('prophet')


# Command Line Arguments:
args = commandArgs(trailingOnly=TRUE)
args

input = args[1]
output = args[2]
output1 = args[3]


# File Path Processing:
I <- paste("", input, sep ="")

O <- paste("", output, sep ="")

O1 <- paste("", output1 ,sep ="")


# Read CSV Data:
df <- read.csv(I)


# Forecasting 1:
df$cap <- 8.5
m <- prophet(df, growth = 'logistic')

future <- make_future_dataframe(m, periods = 1826)
future$cap <- 8.5
fcst <- predict(m, future)
pdf(O)
plot(m, fcst)
dev.off()

# Forecasting 2:
df$y <- 10 - df$y
df$cap <- 6
df$floor <- 1.5
future$cap <- 6
future$floor <- 1.5
m <- prophet(df, growth = 'logistic')
fcst <- predict(m, future)
pdf(O1)
plot(m, fcst)
dev.off()

This script performs time series forecasting using the Prophet library in R, taking input data from a CSV file, applying the forecasting model, and generating plots for analysis.

Let's have a look at the command below:

Rscript Saturating-Forecasts.R "example_wp_log_R.csv" "outputs/output0.pdf" "outputs/output1.pdf"

This command uses Rscript to execute the script that was created and written to the Saturating-Forecasts.R file.

The input parameters provided in this case are the names of input and output files:

example_wp_log_R.csv - the example data that was previously downloaded.

outputs/output0.pdf - the name of the file to save the first forecast plot.

outputs/output1.pdf - the name of the file to save the second forecast plot.

2. Running R Prophet on Bacalhau

To use Bacalhau, you need to package your code in an appropriate format. The developers have already pushed a container for you to use, but if you want to build your own, you can follow the steps below. You can view a dedicated container example in the documentation.

3. Containerize Script with Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

FROM r-base
RUN R -e "install.packages('prophet',dependencies=TRUE, repos='http://cran.rstudio.com/')"
RUN mkdir /R
RUN mkdir /outputs
COPY Saturating-Forecasts.R R
WORKDIR /R

These commands specify how the image will be built, and what extra requirements will be included. We use r-base as the base image and then install the prophet package. We then copy the Saturating-Forecasts.R script into the container and set the working directory to the R folder.

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username. If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

In our case:

docker buildx build --platform linux/amd64 -t ghcr.io/bacalhau-project/examples/r-prophet:0.0.1 .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push ghcr.io/bacalhau-project/examples/r-prophet:0.0.1

4. Running a Job on Bacalhau

The following command passes a prompt to the model and generates the results in the outputs directory. It takes approximately 2 minutes to run.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv \
    ghcr.io/bacalhau-project/examples/r-prophet:0.0.2 \
    -- Rscript Saturating-Forecasts.R "/example_wp_log_R.csv" "/outputs/output0.pdf" "/outputs/output1.pdf")

Structure of the command

bacalhau docker run: call to Bacalhau
-i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv: Mounting the uploaded dataset at /inputs in the execution. It takes two arguments, the first is the IPFS CID (QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFtz) and the second is file path within IPFS (/example_wp_log_R.csv)
ghcr.io/bacalhau-project/examples/r-prophet:0.0.2: the name and the tag of the docker image we are using
/example_wp_log_R.csv : path to the input dataset
/outputs/output0.pdf, /outputs/output1.pdf: paths to the output
Rscript Saturating-Forecasts.R: execute the R script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results

6. Viewing your Job Output

To view the file, run the following command:

ls results/outputs

You can't natively display PDFs in notebooks, so here are some static images of the PDFs:

output0.pdf

output1.pdf

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running a Simple R Script on Bacalhau

You can use official Docker containers for each language, like R or Python. In this example, we will use the official R container and run it on Bacalhau.

In this tutorial example, we will run a "hello world" R script on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running an R Script Locally

To install R follow these instructions A Installing R and RStudio | Hands-On Programming with R. After R and RStudio are installed, create and run a script called hello.R:

# hello.R
print("hello world")

Run the script:

Rscript hello.R

Next, upload the script to your public storage (in our case, IPFS). We've already uploaded the script to IPFS and the CID is: QmVHSWhAL7fNkRiHfoEJGeMYjaYZUsKHvix7L54SptR8ie. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

2. Running a Job on Bacalhau

Now it's time to run the script on Bacalhau:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)

Structure of the command

bacalhau docker run: call to Bacalhau
i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R: Mounting the uploaded dataset at /inputs in the execution. It takes two arguments, the first is the IPFS CID (QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk) and the second is file path within IPFS (/hello.R)
r-base: docker official image we are using
Rscript hello.R: execute the R script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Running a Simple R Script
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: r-base:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c        
          - Rscript /hello.R
    InputSources:
      - Target: "/"
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/main/scripts/hello.R
            Path: /hello.R

The job description should be saved in .yaml format, e.g. rhello.yaml, and then run with the command:

bacalhau job run rhello.yaml

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe  ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau get ${JOB_ID} --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Futureproofing your R Scripts

You can generate the job request using bacalhau describe with the --spec flag. This will allow you to re-run that job in the future:

bacalhau describe ${JOB_ID} --spec > job.yaml

cat job.yaml

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Run CUDA programs on Bacalhau

What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.

In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations

Advantage of GPU over CPU

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

Computations like matrix multiplication could be done much faster on GPU than on CPU

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running CUDA locally

You'll need to have the following installed:

NVIDIA GPU
CUDA drivers installed
nvcc installed

Checking if nvcc is installed:

nvcc --version

Downloading the programs:

mkdir inputs outputs
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/00-hello-world.cu
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu

Viewing the programs

00-hello-world.cu:

# View the contents of the standard C++ program
cat inputs/00-hello-world.cu

# Measure the time it takes to compile and run the program
nvcc -o ./outputs/hello ./inputs/00-hello-world.cu; ./outputs/hello

This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.

02-cuda-hello-world-faster.cu:

# View the contents of the CUDA program with vector addition
!cat inputs/02-cuda-hello-world-faster.cu

# Remove any previous output
rm -rf outputs/hello

# Measure the time for compilation and execution
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello

In this example we utilize Vector addition using CUDA and allocate the memory in advance and copy the memory to the GPU using cudaMemcpy so that it can utilize the HBM (High Bandwidth memory of the GPU). Compilation and execution occur faster (1.39 seconds) compared to the previous example (8.67 seconds).

2. Running a Bacalhau Job

To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    --id-only \
    --wait \
    nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr  -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello ')

Structure of the Commands

bacalhau docker run: call to Bacalhau
-i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu: URL path of the input data volumes downloaded from a URL source.
nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04: Docker container for executing CUDA programs (you need to choose the right CUDA docker container). The container should have the tag of "devel" in them.
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu: Compilation using the nvcc compiler and save it to the outputs directory as hello
Note that there is ; between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello The ";" symbol allows executing multiple commands sequentially in a single line.
./outputs/hello: Execution hello binary: You can combine compilation and execution commands.

Note that the CUDA version will need to be compatible with the graphics card on the host machine

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running a Prolog Script

Introduction

Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

This tutorial is a quick guide on how to run a hello world script on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information

1. Running Locally

To get started, install swipl

Create a file called helloworld.pl. The following script prints ‘Hello World’ to the stdout:

Running the script to print out the output:

After the script has run successfully locally, we can now run it on Bacalhau.

Before running it on Bacalhau we need to upload it to IPFS.

Using the IPFS cli:

Run the command below to check if our script has been uploaded.

This command outputs the CID. Copy the CID of the file, which in our case is QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii

2. Running a Bacalhau Job

We will mount the script to the container using the -i flag: -i: ipfs://< CID >:/< name-of-the-script >.

To submit a job, run the following Bacalhau command:

Structure of the Command

-i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl : Sets the input data for the container.
mYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii is our CID which points to the helloworld.pl file on the IPFS network. This file will be accessible within the container.
-- swipl -q -s helloworld.pl -g hello_world: instructs SWI-Prolog to load the program from the helloworld.pl file and execute the hello_world function in quiet mode:
1. -q: running in quiet mode
2. -s: load file as a script. In this case we want to run the helloworld.pl script
3. -g: is the name of the function you want to execute. In this case its hello_world

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

4. Viewing your Job Output

To view the file, run the following command:

Support

Reading Data from Multiple S3 Buckets using Bacalhau

Introduction

Bacalhau, a powerful and versatile data processing platform, has recently integrated Amazon Web Services (AWS) S3, allowing users to seamlessly access and process data stored in S3 buckets within their Bacalhau jobs. This integration not only simplifies data input, output, and processing operations but also streamlines the overall workflow by enabling users to store and manage their data effectively in S3 buckets. With Bacalhau, you can process several Large s3 buckets in parallel. In this example, we will walk you through the process of reading data from multiple S3 buckets, converting TIFF images to JPEG format.

Advantages of Converting TIFF to JPEG

There are several advantages to converting images from TIFF to JPEG format:

Reduced File Size: JPEG images use lossy compression, which significantly reduces file size compared to lossless formats like TIFF. Smaller file sizes lead to faster upload and download times, as well as reduced storage requirements.
Efficient Processing: With smaller file sizes, image processing tasks tend to be more efficient and require less computational resources when working with JPEG images compared to TIFF images.
Training Machine Learning Models: Smaller file sizes and reduced computational requirements make JPEG images more suitable for training machine learning models, particularly when dealing with large datasets, as they can help speed up the training process and reduce the need for extensive computational resources.

Running the job on Bacalhau

We will use the S3 mount feature to mount bucket objects from s3 buckets. Let’s have a look at the example below:

-i src=s3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif,dst=/sentinel-s1-rtc-indigo/,opt=region=us-west-2 It defines S3 object as input to the job:

sentinel-s1-rtc-indigo: bucket’s name

tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif: represents the key of the object in that bucket. The object to be processed is called Gamma0_VH.tif and is located in the subdirectory with the specified path.

But if you want to specify the entire objects located in the path, you can simply add * to the end of the path (tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/*)

dst=/sentinel-s1-rtc-indigo: the destination to which to mount the s3 bucket object

opt=region=us-west-2 : specifying the region in which the bucket is located

Prerequisite

TBD 1. Running the job on multiple buckets with multiple objects

In the example below, we will mount several bucket objects from public s3 buckets located in a specific region:

The job has been submitted and Bacalhau has printed out the related job_id. We store that in an environment variable so that we can reuse it later on:

2. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

3. Viewing your Job Output

Display the image

To view the images, we will use glob to return all file paths that match a specific pattern.

The code processes and displays all images in the specified directory by applying cropping and resizing with a specified reduction factor.

Support

Running Rust programs as WebAssembly (WASM)

Bacalhau supports running jobs as a program. This example demonstrates how to compile a project into WebAssembly and run the program on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information .
A working Rust installation with the wasm32-wasi target. For example, you can use to install Rust and configure it to build WASM targets. For those using the notebook, these are installed in hidden cells below.

1. Develop a Rust Program Locally

We can use cargo (which will have been installed by rustup) to start a new project (my-program) and compile it:

We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.

The program below will use the Rust imageproc crate to resize an image through seam carving, based on .

In the main function main() an image is loaded, the original is saved, and then a loop is performed to reduce the width of the image by removing "seams." The results of the process are saved, including the original image with drawn seams and a gradient image with highlighted seams.

We also need to install the imageproc and image libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false). After disabling the default features, you need to explicitly specify only the features that you need:

We can now build the Rust program into a WASM blob using cargo:

This command navigates to the my-program directory and builds the project using Cargo with the target set to wasm32-wasi in release mode.

This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm which can now be run on Bacalhau.

2. Running WASM on Bacalhau

Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.

The -i flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.

For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.

Structure of the Commands

bacalhau wasm run: call to Bacalhau
./my-program/target/wasm32-wasi/release/my-program.wasm: the path to the WASM file that will be executed
_start: the entry point of the WASM program, where its execution begins
--id-only: this flag indicates that only the identifier of the executed job should be returned
-i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs: input data volume that will be accessible within the job at the specified destination path

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results) and downloaded our job output to be stored in that directory.

We can now get the results.

Viewing Job Output

When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.

Support

Generate Synthetic Data using Sparkov Data Generation technique

Introduction

A synthetic dataset is generated by algorithms or simulations which has similar characteristics to real-world data. Collecting real-world data, especially data that contains sensitive user data like credit card information, is not possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit fraud, they can use synthetically generated data instead of using real data without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

1. Running Sparkov Locally

To run Sparkov locally, you'll need to clone the repo and install dependencies:

Go to the Sparkov_Data_Generation directory:

Create a temporary directory (outputs) to store the outputs:

2. Running the script

The command above executes the Python script datagen.py, passing the following arguments to it:

-n 1000: Number of customers to generate
-o ../outputs: path to store the outputs
"01-01-2022": Start date
"10-01-2022": End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from 01-01-2022 to 10-01-2022 and saves the results in the ../outputs directory.

To see the full list of options, use:

3. Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image:

These commands specify how the image will be built, and what extra requirements will be included. We use python:3.8 as the base image, install git, clone the Sparkov_Data_Generation repository from GitHub, set the working directory inside the container to /Sparkov_Data_Generation/, and install Python dependencies listed in the requirements.txt file."

Build the container

We will run docker build command to build the container:

Before running the command replace:

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

In our case:

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau

4. Running a Bacalhau Job

Now we're ready to run a Bacalhau job:

Structure of the command:

bacalhau docker run: call to Bacalhau
jsacex/sparkov-data-generation: the name of the docker image we are using
-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022": the arguments passed into the container, specifying the execution of the Python script datagen.py with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

6. Viewing your Job Output

To view the contents of the current directory, run the following command:

Support

Data Ingestion

Copy Data from URL to Public Storage

To upload a file from a URL we will use the bacalhau docker run command.

The job has been submitted and Bacalhau has printed out the related job id.

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau using docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md: URL path of the input data volumes downloaded from a URL source.
ghcr.io/bacalhau-project/examples/upload:v1: the name and tag of the docker image we are using

The bacalhau docker run command takes advantage of the --input parameter. This will download a file from a public URL and place it in the /inputs directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.

You can find out more about the which is designed to simplify the data uploading process.

For more details, see the

Checking the State of Your Jobs

Job status: You can check the status of the job using bacalhau list, processing the json ouput with the jq:

When the job status is Published or Completed, that means the job is done, and we can get the results using the job ID.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.

Each job result contains an outputs subfolder and exitCode, stderr and stdout files with relevant content. To view the execution logs execute following:

And to view the job execution result (README.md file in the example case), which was saved as a job output, execute:

To get the output CID from a completed job, run the following command:

The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.

Now that we have the CID, we can use it in a new job. This time we will use the --input parameter to tell Bacalhau to use the CID we just uploaded.

In this case, the only goal of our job is just to list the contents of the /inputs directory. You can see that the "input" data is located under /inputs/outputs/README.md.

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Pinning Data

How to pin data to public storage

If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

Basic steps

To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files - usually services provide a web interface, CLI and code samples for integration into your application. Once you upload the files you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9. Now you can access pinned data from the jobs via this CID.

Data source can be specified via --input flag, see the for more details

Jobs

Networking Instructions

Examples

Case Studies

Data Engineering

This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.

Model Inference

Stable Diffusion Checkpoint Inference