Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This directory contains examples relating to performing common tasks with Bacalhau.
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Bacalhau supports running programs that are compiled to WebAssembly (WASM). With the Bacalhau client, you can upload WASM programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.
Supported WebAssembly System Interface (WASI) Bacalhau can run compiled WASM programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions All ingress/egress networking is disabled – you won't be able to pull data/code/weights
etc. from an external source. WASM jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.
Single-Threading There is no multi-threading as WASI does not expose any interface for it.
If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:
Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input
flag when running the job. For instance, if your program used to fetch data from example.com
, read from the /inputs
folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com
.
Output Handling: Adjust your program to output results to standard output (stdout
) or standard error (stderr
) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of WASM jobs, a default folder at /outputs
is available, ensuring that data written there will persist after the job concludes.
By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.
You can specify additional or different output mounts using the -o
flag.
You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.
For example, Rust users can specify the wasm32-wasi
target to rustup
and cargo
to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.
Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:
You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data
You can run a WebAssembly program on Bacalhau using the bacalhau wasm run
command.
Run Locally Compiled Program:
If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm
program:
The program you specify will be uploaded to a Bacalhau storage node and will be publicly available.
Alternative Program Specification:
You can use a Content IDentifier (CID) for a specific WebAssembly program.
Input Data Specification:
Make sure to specify any input data using --input
flag.
This ensures the necessary data is available for the program's execution.
You can give the WASM program arguments by specifying them after the program path or CID. If the WASM program is already compiled and located in the current directory, you can run it by adding arguments after the file name:
For a specific WebAssembly program, run:
Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.
For example, instead of hard-coding your program to read from /inputs/data.txt
, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run
:
Your language of choice should contain a standard way of reading program arguments that will work with WASI.
You can also specify environment variables using the -e
flag.
See the Rust example for a workload that leverages WebAssembly support.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
How to use docker containers with Bacalhau
Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.
This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.
You can check out this example tutorial on how to work with custom containers in Bacalhau to see how we used all these steps together.
Here are few things to note before getting started:
Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64
, so containers in arm64
format are not able to run.
Input Flags: The --input ipfs://...
flag supports only directories and does not support CID subpaths. The --input https://...
flag supports only single files and does not support URL directories. The --input s3://...
flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04*
includes all logs for April 2023.
You can check to see a list of example public containers used by the Bacalhau team
Note: Only about a third of examples have their containers here. If you can't find one, feel free to contact the team.
To help provide a safe, secure network for all users, we add the following runtime restrictions:
Limited Ingress/Egress Networking:
All ingress/egress networking is limited as described in the networking documentation. You won't be able to pull data/code/weights/
etc. from an external source.
Data Passing with Docker Volumes:
A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input
paths and also write results to an output
volume. This can be seen in the following example:
The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*
, which mounts all S3 objects in bucket mybucket
with logs-2023-04
prefix within the docker container at location /input
(root).
Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder
will be made available within the apples
folder in the job results CID.
Once the job has run on the executor, the contents of stdout
and stderr
will be added to any named output volumes the job has used (in this case apples
), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.
If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.
We make the assumption that you are reading from a directory called /inputs
, which is set as the default.
You can specify which directory the data is written to with the --input
CLI flag.
If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.
We make the assumption that you are writing to a directory called /outputs
, which is set as the default.
You can specify which directory the data is written to with the --output-volumes
CLI flag.
At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build your image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the task.
Most Bacalhau nodes are of an x86_64
architecture, therefore containers should be built for x86_64
systems.
For example:
To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:
Let's see what each command will be used for:
Exports the current working directory of the host system to the LOCAL_INPUT_DIR
variable. This variable will be used for binding a volume and transferring data into the container.
Exports the current working directory of the host system to the LOCAL_OUTPUT_DIR variable. Similarly, this variable will be used for binding a volume and transferring data from the container.
Creates an array of commands CMD that will be executed inside the container. In this case, it is a simple command executing 'ls' in the /inputs directory and writing text to the /outputs/stdout file.
Launches a Docker container using the specified variables and commands. It binds volumes to facilitate data exchange between the host and the container.
Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint
flag to bacalhau docker run
.
For example:
The result of the commands' execution is shown below:
Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:
You can choose to
You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data
To launch your workload in a Docker container, using the specified image and working with input
data specified via IPFS CID, run the following command.
To check the status of your job, run the following command.
To get more information on your job, you can run the following command.
To download your job, run.
To put this all together into one would look like the following.
This outputs the following.
The --input
flag does not support CID subpaths for ipfs://
content.
Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:
The --input
flag does not support URL directories.
If you run into this compute error while running your docker image
This can often be resolved by re-tagging your docker image
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this tutorial you are setting up your own network
Bacalhau allows you to create your own private network so you can securely run private workloads without the risks inherent in working on public nodes or inadvertently distributing data outside your organization.
This tutorial describes the process of creating your own private network from multiple nodes, configuring the nodes, and running demo jobs.
Install Bacalhau curl -sL https://get.bacalhau.org/install.sh | bash
on every host
Start the Requester node: bacalhau serve --node-type requester
Copy and paste the command it outputs under the "To connect a compute node to this orchestrator, run the following command in your shell" line to other hosts
Copy and paste the environment variables it outputs under the "To connect to this node from the client, run the following commands in your shell" line to a client machine
Done! Run sample hello-world command on the client machine bacalhau docker run apline echo hello
Prepare the hosts on which the nodes are going to be set up. They could be:
Physical Hosts
Local Hypervisor VMs
Install Bacalhau on each host
Ensure that all nodes are connected to the same network and that the necessary ports are open for communication between them.
Ensure your nodes have an internet connection in case you have to download or upload any data (docker images, input data, results)
Ensure that Docker Engine is installed in case you are going to run Docker Workloads
Bacalhau is designed to be versatile in its deployment, capable of running on various environments: physical hosts, virtual machines or cloud instances. Its resource requirements are modest, ensuring compatibility with a wide range of hardware configurations. However, for certain workloads, such as machine learning, it's advisable to consider hardware configurations optimized for computational tasks, including GPUs.
The Bacalhau network consists of nodes of two types: compute and requester. Compute Node is responsible for executing jobs and producing results. Requester Node is responsible for handling user requests, forwarding jobs to compute nodes and monitoring the job lifecycle.
The first step is to start up the initial Requester node. This node will connect to nothing but will listen for connections.
Start by creating a secure token. This token will be used for authentication between the orchestrator and compute nodes during their communications. Any string can be used as a token, preferably not easy to guess or bruteforce. In addition, new authentication methods will be introduced in future releases.
Let's use the uuidgen
tool to create our token, then add it to the Bacalhau configuration and run the requester node:
This will produce output similar to this, indicating that the node is up and running:
Note that for security reasons, the output of the command contains the localhost 127.0.0.1
address instead of your real IP. To connect to this node, you should replace it with your real public IP address yourself. The method for obtaining your public IP address may vary depending on the type of instance you're using. Windows and Linux instances can be queried for their public IP using the following command:
If you are using a cloud deployment, you can find your public IP through their console, e.g. AWS and Google Cloud
Now let's move to another host from the preconditions, start a compute node on it and connect to the requester node. Here you will also need to add the same token to the configuration as on the requester.
Then execute the serve
command to connect to the requester node:
This will produce output similar to this, indicating that the node is up and running:
To ensure that the nodes are connected to the network, run the following command, specifying the public IP of the requester node:
This will produce output similar to this, indicating that the nodes belong to the same network:
To connect to the requester node find the following lines in the requester node logs:
The exact commands list will be different for each node and is outputted by the bacalhau serve
command.
Note that by default such command contains 127.0.0.1
or 0.0.0.0
instead of actual public IP. Make sure to replace it before executing the command.
Now you can submit your jobs using the bacalhau docker run
, bacalhau wasm run
and bacalhau job run
commands. For example submit a hello-world job bacalhau docker run alpine echo hello
:
You will be able to see the job execution logs on the compute node:
By default, IPFS & Local publishers and URL & IPFS sources are available on the compute node. The following describes how to configure the appropriate sources and publishers:
To set up S3 publisher you need to specify environment variables such as AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
, populating a credentials file to be located on your compute node, i.e. ~/.aws/credentials
, or creating an IAM role for your compute nodes if you are utilizing cloud instances.
Your chosen publisher can be set for your Bacalhau compute nodes declaratively or imperatively using either configuration yaml file:
Or within your imperative job execution commands:
S3 compatible publishers can also be used as input sources for your jobs, with a similar configuration.
By default, bacalhau creates its own in-process IPFS node that will attempt to discover other IPFS nodes, including public nodes, on its own. If you specify the --private-internal-ipfs
flag when starting the node, the node will not attempt to discover other nodes. Note, that such an IPFS node exists only with the compute node and will be shut down along with it. Alternatively, you can create your own private IPFS network and connect to it using the appropriate flags.
IPFS publisher can be set for your Bacalhau compute nodes declaratively or imperatively using either configuration yaml file:
Or within your imperative job execution commands:
Data pinned to the IPFS network can be used as input source. To do this, you will need to specify the CID in declarative:
Or imperative format:
Bacalhau allows to publish job results directly to the compute node. Please note that this method is not a reliable storage option and is recommended to be used mainly for introductory purposes.
Local publisher can be set for your Bacalhau compute nodes declaratively or imperatively using configuration yaml file:
Or within your imperative job execution commands:
The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. To allow jobs to access local files when starting a node, the --allow-listed-local-paths
flag should be used, specifying the path to the data and access mode :rw
for Read-Write access or :ro
for Read-Only (used by default). For example:
Further, the path to local data in declarative or imperative form must be specified in the job. Declarative example of the local input source:
Imperative example of the local input source:
Your private cluster can be quickly set up for testing packaged jobs and tweaking data processing pipelines. However, when using a private cluster in production, here are a few considerations to note.
Ensure you are running the Bacalhau process from a dedicated system user with limited permissions. This enhances security and reduces the risk of unauthorized access to critical system resources. If you are using an orchestrator such as Terraform, utilize a service file to manage the Bacalhau process, ensuring the correct user is specified and consistently used. Here’s a sample service file
Create an authentication file for your clients. A dedicated authentication file or policy can ease the process of maintaining secure data transmission within your network. With this, clients can authenticate themselves, and you can limit the Bacalhau API endpoints unauthorized users have access to.
Consistency is a key consideration when deploying decentralized tools such as Bacalhau. You can use an installation script to affix a specific version of Bacalhau or specify deployment actions, ensuring that each host instance has all the necessary resources for efficient operations.
Ensure separation of concerns in your cloud deployments by mounting the Bacalhau repository on a separate non-boot disk. This prevents instability on shutdown or restarts and improves performance within your host instances.
That's all folks! 🎉 Please contact us on Slack #bacalhau
channel for questions and feedback!
How to configure TLS for the requester node APIs
By default, the requester node APIs used by the Bacalhau CLI are accessible over HTTP, but it is possible to configure it to use Transport Level Security (TLS) so that they are accessible over HTTPS instead. There are several ways to obtain the necessary certificates and keys, and Bacalhau supports obtaining them via ACME and Certificate Authorities or even self-signing them.
Once configured, you must ensure that instead of using http://IP:PORT you use https://IP:PORT to access the Bacalhau API
Automatic Certificate Management Environment (ACME) is a protocol that allows for automating the deployment of Public Key Infrastructure, and is the protocol used to obtain a free certificate from the Let's Encrypt Certificate Authority.
Using the --autocert [hostname]
parameter to the CLI (in the serve
and devstack
commands), a certificate is obtained automatically from Lets Encrypt. The provided hostname should be a comma-separated list of hostnames, but they should all be publicly resolvable as Lets Encrypt will attempt to connect to the server to verify ownership (using the ACME HTTP-01 challenge). On the very first request this can take a short time whilst the first certificate is issued, but afterwards they are then cached in the bacalhau repository.
Alternatively, you may set these options via the environment variable, BACALHAU_AUTO_TLS
. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.AutoCert
instead.
As a result of the Lets Encrypt verification step, it is necessary for the server to be able to handle requests on port 443. This typically requires elevated privileges, and rather than obtain these through a privileged account (such as root), you should instead use setcap to grant the executable the right to bind to ports <1024.
A cache of ACME data is held in the config repository, by default ~/.bacalhau/autocert-cache
, and this will be used to manage renewals to avoid rate limits.
Obtaining a TLS certificate from a Certificate Authority (CA) without using the Automated Certificate Management Environment (ACME) protocol involves a manual process that typically requires the following steps:
Choose a Certificate Authority: First, you need to select a trusted Certificate Authority that issues TLS certificates. Popular CAs include DigiCert, GlobalSign, Comodo (now Sectigo), and others. You may also consider whether you want a free or paid certificate, as CAs offer different pricing models.
Generate a Certificate Signing Request (CSR): A CSR is a text file containing information about your organization and the domain for which you need the certificate. You can generate a CSR using various tools or directly on your web server. Typically, this involves providing details such as your organization's name, common name (your domain name), location, and other relevant information.
Submit the CSR: Access your chosen CA's website and locate their certificate issuance or order page. You'll typically find an option to "Submit CSR" or a similar option. Paste the contents of your CSR into the provided text box.
Verify Domain Ownership: The CA will usually require you to verify that you own the domain for which you're requesting the certificate. They may send an email to one of the standard domain-related email addresses (e.g., admin@yourdomain.com, webmaster@yourdomain.com). Follow the instructions in the email to confirm domain ownership.
Complete Additional Verification: Depending on the CA's policies and the type of certificate you're requesting (e.g., Extended Validation or EV certificates), you may need to provide additional documentation to verify your organization's identity. This can include legal documents or phone calls from the CA to confirm your request.
Payment and Processing: If you're obtaining a paid certificate, you'll need to make the payment at this stage. Once the CA has received your payment and completed the verification process, they will issue the TLS certificate.
Once you have obtained your certificates, you will need to put two files in a location that bacalhau can read them. You need the server certificate, often called something like server.cert
or server.cert.pem
, and the server key which is often called something like server.key
or server.key.pem
.
Once you have these two files available, you must start bacalhau serve
which two new flags. These are tlscert
and tlskey
flags, whose arguments should point to the relevant file. An example of how it is used is:
Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT
and BACALHAU_TLS_KEY
. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate
and Node.ServerAPI.TLS.ServerKey
instead.
If you wish, it is possible to use Bacalhau with a self-signed certificate which does not rely on an external Certificate Authority. This is an involved process and so is not described in detail here although there is a helpful script in the Bacalhau github repository which should provide a good starting point.
Once you have generated the necessary files, the steps are much like above, you must start bacalhau serve
which two new flags. These are tlscert
and tlskey
flags, whose arguments should point to the relevant file. An example of how it is used is:
Alternatively, you may set these options via the environment variables, BACALHAU_TLS_CERT
and BACALHAU_TLS_KEY
. If you are using a configuration file, you can set the values inNode.ServerAPI.TLS.ServerCertificate
and Node.ServerAPI.TLS.ServerKey
instead.
If you use self-signed certificates, it is unlikely that any clients will be able to verify the certificate when connecting to the Bacalhau APIs. There are three options available to work around this problem:
Provide a CA certificate file of trusted certificate authorities, which many software libraries support in addition to system authorities.
Install the CA certificate file in the system keychain of each machine that needs access to the Bacalhau APIs.
Instruct the software library you are using not to verify HTTPS requests.
In this tutorial, you'll learn how to install and run a job with the Bacalhau client using the Bacalhau CLI or Docker.
The Bacalhau client is a command-line interface (CLI) that allows you to submit jobs to the Bacalhau. The client is available for Linux, macOS, and Windows. You can also run the Bacalhau client in a Docker container.
By default, you will submit to the Bacalhau public network, but the same CLI can be configured to submit to a private Bacalhau network. For more information, please read Running Bacalhau on a Private Network.
You can install or update the Bacalhau CLI by running the commands in a terminal. You may need sudo mode or root password to install the local Bacalhau binary to /usr/local/bin
:
Windows users can download the latest release tarball from Github and extract bacalhau.exe
to any location available in the PATH environment variable.
To run a specific version of Bacalhau using Docker, use the command docker run -it ghcr.io/bacalhau-project/bacalhau:v1.0.3
, where v1.0.3
is the version you want to run; note that the latest
tag will not re-download the image if you have an older version. For more information on running the Docker image, check out the Bacalhau docker image example.
To verify installation and check the version of the client and server, use the version
command. To run a Bacalhau client command with Docker, prefix it with docker run ghcr.io/bacalhau-project/bacalhau:latest
.
If you're wondering which server is being used, the Bacalhau Project has a demo network that's shared with the community. This network allows you to familiarize with Bacalhau's capabilities and launch jobs from your computer without maintaining a compute cluster on your own.
To submit a job in Bacalhau, we will use the bacalhau docker run
command. The command runs a job using the Docker executor on the node. Let's take a quick look at its syntax:
We will use the command to submit a Hello World job that runs an echo program within an Ubuntu container.
Let's take a look at the results of the command execution in the terminal:
After the above command is run, the job is submitted to the public network, which processes the job and Bacalhau prints out the related job id:
The job_id
above is shown in its full form. For convenience, you can use the shortened version, in this case: 9d20bbad
.
While this command is designed to resemble Docker's run command which you may be familiar with, Bacalhau introduces a whole new set of flags to support its computing model.
Let's take a look at the results of the command execution in the terminal:
After having deployed the job, we now can use the CLI for the interaction with the network. The jobs were sent to the public demo network, where it was processed and we can call the following functions. The job_id
will differ for every submission.
You can check the status of the job using bacalhau list
command adding the --id-filter
flag and specifying your job id.
Let's take a look at the results of the command execution in the terminal:
When it says Completed
, that means the job is done, and we can get the results.
For a comprehensive list of flags you can pass to the list command check out the related CLI Reference page
You can find out more information about your job by using bacalhau describe
.
Let's take a look at the results of the command execution in the terminal:
This outputs all information about the job, including stdout, stderr, where the job was scheduled, and so on.
You can download your job results directly by using bacalhau get
.
This results in
In the command below, we created a directory called myfolder
and download our job output to be stored in that directory.
While executing this command, you may encounter warnings regarding receive and send buffer sizes: failed to sufficiently increase receive buffer size
. These warnings can arise due to limitations in the UDP buffer used by Bacalhau to process tasks. Additional information can be found in https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes.
After the download has finished you should see the following contents in the results directory.
That should print out the string Hello World
.
Here are few resources that provide a deeper dive into running jobs with Bacalhau:
How Bacalhau works, Setting up Bacalhau, Examples & Use Cases
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Different jobs may require different amounts of resources to execute. Some jobs may have specific hardware requirements, such as GPU. This page describes how to specify hardware requirements for your job.
Please bear in mind that each executor is implemented independently and these docs might be slightly out of date. Double check the man page for the executor you are using with bacalhau [executor] --help
.
The following table describes how to specify hardware requirements for the Docker executor.
Flag | Default | Description |
---|---|---|
When you specify hardware requirements, the job will be offered out to the network to see if there are any nodes that can satisfy the requirements. If there are, the job will be scheduled on the node and the executor will be started.
Bacalhau supports GPU workloads. Learn how to run a job using GPU workloads with the Bacalhau client.
The Bacalhau network must have an executor node with a GPU exposed
Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node
Use following command to see available resources amount:
To submit a request for a job that requires more than the standard set of resources, add the --cpu
and --memory
flags. For example, for a job that requires 2 CPU cores and 4Gb of RAM, use --cpu=2 --memory=4Gb
, e.g.:
To submit a GPU job request, use the --gpu
flag under the docker run
command to select the number of GPUs your job requires. For example:
The following limitations currently exist within Bacalhau.
Maximum CPU and memory limits depend on the participants in the network
For GPU:
NVIDIA, Intel or AMD GPUs only
Only the Docker Executor supports GPUs
This is an older version of Bacalhau. For the latest version, go to .
Bacalhau is a platform for fast, cost efficient, and secure computation by running jobs where the data is generated and stored. With Bacalhau, you can streamline your existing workflows without the need of extensive rewriting by running arbitrary Docker containers and WebAssembly (wasm) images as tasks. This architecture is also referred to as Compute Over Data (or CoD). was coined from the Portuguese word for salted Cod fish.
Bacalhau seeks to transform data processing for large-scale datasets to improve cost and efficiency, and to open up data processing to larger audiences. Our goals is to create an open, collaborative compute ecosystem that enables unparalleled collaboration. We () offer a demo network so you can try out jobs without even installing. Give it a shot!
⚡️ Bacalhau simplifies the process of managing compute jobs by providing a unified platform for managing jobs across different regions, clouds, and edge devices.
🤝 Bacalhau provides reliable and network-partition resistant orchestration, ensuring that your jobs will complete even if there are network disruptions.
🚨 Bacalhau provides a complete and permanent audit log of exactly what happened, so you can be confident that your jobs are being executed securely.
🔐 You can run private workloads to reduce the chance of leaking private information or inadvertently sharing your data outside of your organization.
💸 Bacalhau reduces ingress/egress costs since jobs are processed closer to the source.
🤓 You can on your machine, and Bacalhau will be able to run against that data.
💥 You can integrate with services running on nodes to run a jobs, such as on .
📚 Bacalhau operates at scale over parallel jobs. You can batch process petabytes (quadrillion bytes) of data.
Bacalhau concists of a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes:
Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.
Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.
The goal of the Bacalhau project is to make it easy to perform distributed computation next to where the data resides. In order to do this, first you need to ingest some data.
Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. Here are some options that can help you mount your data:
The options are not limited to the above mentioned. You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data
All workloads run under restricted Docker or WASM permissions on the node. Additionally, you can use existing (locked down) binaries that are pre-installed through Pluggable Executors.
Alternatively, you can pre-provision credentials to the nodes and access those on a node by node basis.
Finally, endpoints (such as vaults) can also be used to provide secure access to Bacalhau. This way, the client can authenticate with Bacalhau using the token without exposing their credentials.
Bacalhau can be used for a variety of data processing workloads, including machine learning, data analytics, and scientific computing. It is well-suited for workloads that require processing large amounts of data in a distributed and parallelized manner.
Here are some example tutorials on how you can process your data with Bacalhau:
Bacalhau has a very friendly community and we are always happy to help you get started:
Bacalhau supports the three main 'pillars' of observability - logging, metrics, and tracing. Bacalhau uses the for metrics and tracing, which can be configured using the . Exporting metrics and traces can be as simple as setting the OTEL_EXPORTER_OTLP_PROTOCOL
and OTEL_EXPORTER_OTLP_ENDPOINT
environment variables. Custom code is used for logging as the .
Logging in Bacalhau outputs in human-friendly format to stderr at INFO
level by default, but this can be changed by two environment variables:
LOG_LEVEL
- Can be one of trace
, debug
, error
, warn
or fatal
to output more or fewer logging messages as required
LOG_TYPE
- Can be one of the following values:
default
- output logs to stderr in a human-friendly format
json
- log messages outputted to stdout in JSON format
combined
- log JSON formatted messages to stdout and human-friendly format to stderr
Log statements should include the relevant trace, span and job ID so it can be tracked back to the work being performed.
Bacalhau produces a number of different metrics including those around the libp2p resource manager (rcmgr
), performance of the requester HTTP API and the number of jobs accepted/completed/received.
Traces are produced for all major pieces of work when processing a job, although the naming of some spans is still being worked on. You can find relevant traces covering working on a job by searching for the jobid
attribute.
The metrics and traces can easily be forwarded to a variety of different services as we use OpenTelemetry, such as Honeycomb or Datadog.
To view the data locally, or simply to not use a SaaS offering, you can start up Jaeger and Prometheus placing these three files into a directory then running docker compose start
while running Bacalhau with the OTEL_EXPORTER_OTLP_PROTOCOL=grpc
and OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
environment variables.
How to configure compute/requester persistence
Both compute nodes, and requester nodes, maintain state. How that state is maintained is configurable, although the defaults are likely adequate for most use-cases. This page describes how to configure the persistence of compute and requester nodes should the defaults not be suitable.
The computes nodes maintain information about the work that has been allocated to them, including:
The current state of the execution, and
The original job that resulted in this allocation
This information is used by the compute and requester nodes to ensure allocated jobs are completed successfully. By default, compute nodes store their state in a bolt-db database and this is located in the bacalhau repository along with configuration data. For a compute node whose ID is "abc", the database can be found in ~/.bacalhau/abc-compute/executions.db
.
In some cases, it may be preferable to maintain the state in memory, with the caveat that should the node restart, all state will be lost. This can be configured using the environment variables in the table below.
Environment Variable | Flag alternative | Value | Effect |
---|
When running a requester node, it maintains state about the jobs it has been requested to orchestrate and schedule, the evaluation of those jobs, and the executions that have been allocated. By default, this state is stored in a bolt db database that, with a node ID of "xyz" can be found in ~/.bacalhau/xyz-requester/jobs.db
.
Environment Variable | Flag alternative | Value | Effect |
---|
In this tutorial we will go over the components and the architecture of Bacalhau. You will learn how it is built, what components are used, how you could interact and how you could use Bacalhau.
Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers. The network consists of two types of nodes, which can communicate with each other.
The requester and compute nodes together form a p2p network and use gossiping to discover each other, share information about node capabilities, available resources and health status. Bacalhau is a peer-to-peer network of nodes that enables decentralized communication between computers.
Requester Node: responsible for handling user requests, discovering and ranking compute nodes, forwarding jobs to compute nodes, and monitoring the job lifecycle.
Compute Node: responsible for executing jobs and producing results. Different compute nodes can be used for different types of jobs, depending on their capabilities and resources.
To interact with the Bacalhau network, users can use the Bacalhau CLI (command-line interface) to send requests to a requester node in the network. These requests are sent using the JSON format over HTTP, a widely-used protocol for transmitting data over the internet. Bacalhau's architecture involves two main sections which are the core components and interfaces.
The core components are responsible for handling requests and connecting different nodes. The network includes two different components:
The interfaces handle the distribution, execution, storage and publishing of jobs. In the following all the different components are described and their respective protocols are shown.
Bacalhau provides an interface to interact with the server via a REST API. Bacalhau uses 127.0.0.1 as the localhost and 1234 as the port by default.
When a job is submitted to a requester node, it selects compute nodes that are capable and suitable to execute the job, and communicate with them directly. The compute node has a collection of named executors, storage sources, and publishers, and it will choose the most appropriate ones based on the job specifications.
When the Compute node completes the job, it publishes the results to S3's remote storage, IPFS.
The Bacalhau client receives updates on the task execution status and results. A user can access the results and manage tasks through the command line interface.
To Get the results of a job you can run the following command.
One can choose from a wide range of flags, from which a few are shown below.
To describe a specific job, inserting the ID to the CLI or API gives back an overview of the job.
If you run more then one job or you want to find a specific job ID
To list executions follow the following commands.
How to configure authentication and authorization on your Bacalhau node.
Bacalhau includes a flexible auth system that supports multiple methods of auth that are appropriate for different deployment environments.
With no specific authentication configuration supplied, Bacalhau runs in "anonymous mode" – which allows unidentified users limited control over the system. "Anonymous mode" is only appropriate for testing or evaluation setups.
In anonymous mode, Bacalhau will allow:
Users identified by a self-generated private key to submit any job and cancel their own jobs.
Users not identified by any key to access other read-only endpoints, such as to read job lists, describe jobs, and query node or agent information.
Bacalhau auth is controlled by policies. Configuring the auth system is done by supplying a different policy file.
Restricting API access to only users that have authenticated requires specifying a new authorization policy. You can download a policy that restricts anonymous access and install it by using:
Once the node is restarted, accessing the node APIs will require the user to be authenticated, but by default will still allow users with a self-generated key to authenticate themselves.
Restricting the list of keys that can authenticate to only a known set requires specifying a new authentication policy. You can download a policy that restricts key-based access and install it by using:
Then, modify the allowed_clients
variable in challange_ns_no_anon.rego
to include acceptable client IDs, found by running bacalhau id
.
Once the node is restarted, only keys in the allowed list will be able to access any API.
Users can authenticate using a username and password instead of specifying a private key for access. Again, this requires installation of an appropriate policy on the server.
Passwords are not stored in plaintext and are salted. The downloaded policy expects password hashes and salts generated by scrypt
. To generate a salted password, the helper script in pkg/authn/ask/gen_password
can be used:
This will ask for a password and generate a salt and hash to authenticate with it. Add the encoded username, salt and hash into the ask_ns_password.rego
.
In principle, Bacalhau can implement any auth scheme that can be described in a structured way by a policy file.
Bacalhau will pass information pertinent to the current request into every authentication policy query as a field on the input
variable. The exact information depends on the type of authentication used.
challenge
authenticationchallenge
authentication uses identifies the user by the presence of a private key. The user is asked to sign an input phrase to prove they have the key they are identifying with.
Policies used for challenge
authentication do not need to actually implement the challenge verification logic as this is handled by the core code. Instead, they will only be invoked if this verification passes.
Policies for this type will need to implement these rules:
bacalhau.authn.token
: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.
They should expect as fields on the input
variable:
clientId
: an ID derived from the user's private key that identifies them uniquely
nodeId
: the ID of the requester node that this user is authenticating with
signingKey
: the private key (as a JWK) that should be used to sign any access tokens to be returned
The simplest possible policy might therefore be this policy that returns the same opaque token for all users:
ask
authenticationask
authentication uses credentials supplied manually by the user as identification. For example, an ask
policy could require a username and password as input and check these against a known list. ask
policies do all the verification of the supplied credentials.
Policies for this type will need to implement these rules:
bacalhau.authn.token
: if the user should be authenticated, an access token they should use in subsequent requests. If the user should not be authenticated, should be undefined.
bacalhau.authn.schema
: a static JSON schema that should be used to collect information about the user. The type
of declared fields may be used to pick the input method, and if a field is marked as writeOnly
then it will be collected in a secure way (e.g. not shown on screen). The schema
rule does not receive any input
data.
They should expect as fields on the input
variable:
ask
: a map of field names from the JSON schema to strings supplied by the user. The policy should validate these credentials.
nodeId
: the ID of the requester node that this user is authenticating with
signingKey
: the private key (as a JWK) that should be used to sign any access tokens to be returned
The simplest possible policy might therefore be one that asks for no data and returns the same opaque token for every user:
Authorization policies do not vary depending on the type of authentication used – Bacalhau uses one authz policy for all API requests.
Authz policies are invoked for every API request. Authz policies should check the validity of any supplied access tokens and issue an authz decision for the requested API endpoint. It is not required that authz policies enforce that an access token is present – they may choose to grant access to unauthorized users.
Policies will need to implement these rules:
bacalhau.authz.token_valid
: true if the access token in the request is "valid" (but does not necessarily grant access for this request), or false if it is invalid for every request (e.g. because it has expired) and should be discarded.
bacalhau.authz.allow
: true if the user should be permitted to carry out the input request, false otherwise.
They should expect as fields on the input
variable for both rules:
http
: details of the user's HTTP request:
host
: the hostname used in the HTTP request
method
: the HTTP method (e.g. GET
, POST
)
path
: the path requested, as an array of path components without slashes
query
: a map of URL query parameters to their values
headers
: a map of HTTP header names to arrays representing their values
body
: a blob of any content submitted as the body
constraints
: details about the receiving node that should be used to validate any supplied tokens:
cert
: keys that the input token should have been signed with
iss
: the name of a node that this node will recognize as the issuer of any signed tokens
aud
: the name of this node that is receiving the request
Notably, the constraints
data is appropriate to be passed directly to the Rego io.jwt.decode_verify
method which will validate the access token as a JWT against the given constraints.
The simplest possible authz policy might be this one that allows all users to access all endpoints:
How to configure your Bacalhau node.
Bacalhau employs the and libraries for configuration management. Users can configure their Bacalhau node through a combination of command-line flags, environment variables, and the dedicated configuration file.
Bacalhau manages its configuration, metadata, and internal state within a specialized repository named .bacalhau
. Serving as the heart of the Bacalhau node, this repository holds the data and settings that determine node behavior. It's located on the filesystem, and by default, Bacalhau initializes this repository at $HOME/.bacalhau
, where $HOME
is the home directory of the user running the bacalhau process.
To customize this location, users can:
Set the BACALHAU_DIR
environment variable to specify their desired path.
Utilize the --repo
command line flag to specify their desired path.
Upon executing a Bacalhau command for the first time, the system will initialize the .bacalhau
repository. If such a repository already exists, Bacalhau will seamlessly access its contents.
Structure of a Newly Initialized .bacalhau
Repository
.bacalhau
repository:This repository comprises four directories and seven files:
user_id.pem
:
This file houses the Bacalhau node user's cryptographic private key, used for signing requests sent to a Requester Node.
Format: PEM.
repo.version
:
Indicates the version of the Bacalhau node's repository.
Format: JSON, e.g., {"Version":1}
.
libp2p_private_key
:
Format: Base64 encoded RSA private key.
config.yaml
:
Contains configuration settings for the Bacalhau node.
Format: YAML.
update.json
:
A file containing the date/time when the last version check was made.
Format: JSON, e.g., {"LastCheck":"2024-01-24T11:06:14.631816Z"}
tokens.json
:
A file containing the tokens obtained through authenticating with bacalhau clusters.
QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv-compute
:
Note: The segment QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv
is a unique NodeID for each Bacalhau node, derived from the libp2p_private_key
.
QmdGUjsMHEgtAfdtw7U62yPEcAZFtA33tKMsczLToegZtv-requester
:
Note: NodeID derivation is similar to the Compute directory.
executor_storages
:
Storage for data handled by Bacalhau storage drivers.
plugins
:
Houses binaries that allow the Compute node to execute specific tasks.
Note: This feature is currently experimental and isn't active during standard node operations.
Within a .bacalhau
repository, a config.yaml
file may be present. This file serves as the configuration source for the bacalhau node and adheres to the YAML format.
Although the config.yaml
file is optional, its presence allows Bacalhau to load custom configurations; otherwise, Bacalhau is configured with built-in default values, environment variables and command line flags.
Modifications to the config.yaml
file will not be dynamically loaded by the Bacalhau node. A restart of the node is required for any changes to take effect. Bacalhau determines its configuration based on the following precedence order, with each item superseding the subsequent:
Command-line Flag
Environment Variable
Config File
Defaults
config.yaml
and Bacalhau Environment VariablesBacalhau establishes a direct relationship between the value-bearing keys within the config.yaml
file and corresponding environment variables. For these keys that have no further sub-keys, the environment variable name is constructed by capitalizing each segment of the key, and then joining them with underscores, prefixed with BACALHAU_
.
For example, a YAML key with the path Node.IPFS.Connect
translates to the environment variable BACALHAU_NODE_IPFS_CONNECT
and is represented in a file like:
There is no corresponding environment variable for either Node
or Node.IPFS
. Config values may also have other environment variables that set them for simplicity or to maintain backwards compatibility.
Bacalhau leverages the BACALHAU_ENVIRONMENT
environment variable to determine the specific environment configuration when initializing a repository. Notably, if a .bacalhau
repository has already been initialized, the BACALHAU_ENVIRONMENT
setting will be ignored.
By default, if the BACALHAU_ENVIRONMENT
variable is not explicitly set by the user, Bacalhau will adopt the production
environment settings.
Below is a breakdown of the configurations associated with each environment:
1. Production (public network)
Environment Variable: BACALHAU_ENVIRONMENT=production
Configurations:
Node.ClientAPI.Host
: "bootstrap.production.bacalhau.org"
Node.Client.API.Host
: 1234
...other configurations specific to this environment...
2. Staging (staging network)
Environment Variable: BACALHAU_ENVIRONMENT=staging
Configurations:
Node.ClientAPI.Host
: "bootstrap.staging.bacalhau.org"
Node.Client.API.Host
: 1234
...other configurations specific to this environment...
3. Development (development network)
Environment Variable: BACALHAU_ENVIRONMENT=development
Configurations:
Node.ClientAPI.Host
: "bootstrap.development.bacalhau.org"
Node.Client.API.Host
: 1234
...other configurations specific to this environment...
4. Local (private or local networks)
Environment Variable: BACALHAU_ENVIRONMENT=local
Configurations:
Node.ClientAPI.Host
: "0.0.0.0"
Node.Client.API.Host
: 1234
...other configurations specific to this environment...
Or
How to enable GPU support on your Bacalhau node
Bacalhau supports GPUs out of the box and defaults to allowing execution on all GPUs installed on the node.
Bacalhau makes the assumption that you have installed all the necessary drivers and tools on your node host and have appropriately configured them for use by Docker.
In general for GPUs from any vendor, the Bacalhau client requires:
Verify installation by
nvidia-smi
installed and functional
rocm-smi
tool installed and functional
See the for guidance on how to run Docker workloads on AMD GPU.
xpu-smi
tool installed and functional
With that, you have just successfully run a job on Bacalhau!
For a more detailed tutorial, check out our .
Best practices in is to use environment variables to store sensitive data such as access tokens, API keys, or passwords. These variables can be accessed by Bacalhau at runtime and are not visible to anyone who has access to the code or the server.
Once you have more than 10 devices generating or storing around 100GB of data, you're likely to face challenges with processing that data efficiently. Traditional computing approaches may struggle to handle such large volumes, and that's where distributed computing solutions like Bacalhau can be extremely useful. Bacalhau can be used in various industries, including security, web serving, financial services, IoT, Edge, Fog, and multi-cloud. Bacalhau shines when it comes to data-intensive applications like , , , , etc.
For more tutorials, visit our
– ask anything about the project, give feedback or answer questions that will help other users.
and go to #bacalhau channel – it is the easiest way engage with other members in the community and get help.
– learn how to contribute to the Bacalhau project.
👉 Continue with Bacalhau to learn how to install and run a job with the Bacalhau client.
👉 Or jump directly to try out the different that showcases Bacalhau abilities.
You can create jobs in the Bacalhau network using various introduced in version 1.2. Each job may need specific variables, resource requirements and data details that are described in the .
Prepare data with Bacalhau by , or . Mount data anywhere for Bacalhau to run against. Refer to , , and Source Specifications for data source usage.
Optimize workflows without completely redesigning them. Run arbitrary tasks using Docker containers and WebAssembly images. Follow the Onboarding guides for and workloads.
Explore GPU workload support with Bacalhau. Learn how to run using the Bacalhau client in the GPU Workloads section. Integrate Python applications with Bacalhau using the .
For node operation, refer to the section for configuring and running a Bacalhau node. If you prefer an isolated environment, explore the for performing tasks without connecting to the main Bacalhau network.
You should use the Bacalhau client to send a task to the network. The client transmits the job information to the Bacalhau network via established protocols and interfaces. Jobs submitted via the Bacalhau CLI are forwarded to a Bacalhau network node at via port 1234
by default. This Bacalhau node will act as the requester node for the duration of the job lifecycle.
You can use the command with to create a job in Bacalhau using JSON and YAML formats.
You can use to submit a new job for execution.
You can use the bacalhau docker run
to start a job in a Docker container. Below, you can see an excerpt of the commands:
You can also use the bacalhau wasm run
to run a job compiled into the (WASM) format. Below, you can find an excerpt of the commands in the Bacalhau CLI:
The selected compute node receives the job and starts its execution inside a container. The container can use different executors to work with the data and perform the necessary actions. A job can use the docker executor, WASM executor or a library storage volumes. Use to view the parameters to configure the Docker Engine. If you want tasks to be executed in a WebAssembly environment, pay attention to .
Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure. View to get the detailed information.
Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. View to get the detailed information.
You can use the command with to get a full description of a job in yaml format.
You can use to retrieve the specification and current status of a particular job.
You can use the command with to list jobs on the network in yaml format.
You can use to retrieve a list of jobs.
You can use the command with to list all executions associated with a job, identified by its ID, in yaml format.
You can use to retrieve all executions for a particular job.
The Bacalhau client provides the user with tools to monitor and manage the execution of jobs. You can get information about status, progress and decide on next steps. View the if you want to know the node's health, capabilities, and deployed Bacalhau version. To get information about the status and characteristics of the nodes in the cluster use .
You can use the command with to cancel a job that was previously submitted and stop it running if it has not yet completed.
You can use to terminate a specific job asynchronously.
You can use the command with to enumerate the historical events related to a job, identified by its ID.
You can use to retrieve historical events for a specific job.
You can use this to retrieve the log output (stdout, and stderr) from a job. If the job is still running it is possible to follow the logs after the previously generated logs are retrieved.
To familiarize yourself with all the commands used in Bacalhau, please view
Policies are written in a language called , also used by Kubernetes. Users who want to write their own policies should get familiar with the Rego language.
A more realistic example that returns a signed JWT is in .
A more realistic example that returns a signed JWT is in .
A more realistic example (which is the Bacalhau "anonymous mode" default) is in .
Stores the Bacalhau node's private key, essential for its network identity. The NodeID of a Bacalhau node is derived from this key.
Contains the executions.db
database, which aids the Compute node in state persistence. Additionally, the jobStats.json
file records the Compute Node's completed jobs tally.
Contains the jobs.db
database for the Requester node's state persistence.
Note: The above configurations provided for each environment are not exhaustive. Consult the specific environment documentation for a .
See the for guidance on how to run Docker workloads on Intel GPU.
Access to GPUs can be controlled using . To limit the number of GPUs that can be used per job, set a job resource limit. To limit access to GPUs from all jobs, set a total resource limit.
--cpu
500m
Job CPU cores (e.g. 500m, 2, 8)
--memory
1Gb
Job Memory requirement (e.g. 500Mb, 2Gb, 8Gb).
--gpu
0
Job GPU requirement (e.g. 1).
BACALHAU_COMPUTE_STORE_TYPE | --compute-execution-store-type | boltdb | Uses the bolt db execution store (default) |
BACALHAU_COMPUTE_STORE_PATH | --compute-execution-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is |
BACALHAU_JOB_STORE_TYPE | --requester-job-store-type | boltdb | Uses the bolt db job store (default) |
BACALHAU_JOB_STORE_PATH | --requester-job-store-path | A path (inc. filename) | Specifies where the boltdb database should be stored. Default is |
Before you join the main Bacalhau network, you can test locally.
To test, you can use the bacalhau devstack
command, which offers a way to get a 3 node cluster running locally.
By settings PREDICTABLE_API_PORT=1
, the first node of our 3 node cluster will always listen on port 20000
In another window, export the following environment variables so that the Bacalhau client binary connects to our local development cluster:
You can now interact with Bacalhau - all jobs are running by the local devstack cluster.
This tutorial describes how to add new nodes to an existing private network. Two basic scenarios will be covered:
Adding a physical host / virtual machine as a new node
Adding a cloud instance as a new node
You should have an established private network consisting of at least one requester node. See the Create Private Network guide to set one up.
You should have a new host (physical/virtual machine, cloud instance or docker container) with Bacalhau installed
Let's assume that you already have a private network with at least one requester node. In this case, the process of adding new nodes follows the Create And Connect Compute Node section. You will need to:
Set the token in the node.network.authsecret
parameter
Execute bacalhau serve
specifying the node type
and orchestrator
address via flags. You can find an example of such a command in the logs of the requester node, here is how it might look like:
Remember that in such example you need to replace all 127.0.0.1
and 0.0.0.0.0
addresses with the actual public IP address of your node
Let's assume you already have all the necessary cloud infrastructure set up with a private network with at least one requester node. In this case, you can add new nodes manually (AWS, Azure, GCP) or use a tool like Terraform to automatically create and add any number of nodes to your network. The process of adding new nodes manually follows the Create And Connect Compute Node section.
To automate the process using Terraform follow these steps:
Configure terraform for your cloud provider
Determine the IP address of your requester node
Write a terraform script, which does the following:
Adds a new instance
Installs bacalhau
on it
Launches a compute node
Execute the script
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Bacalhau has two ways to make use of external storage providers: Sources and Publishers. Sources storage resources consumed as inputs to jobs. And Publishers storage resources created with the results of jobs.
Bacalhau allows you to use S3 or any S3-compatible storage service as an input source. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the job execution environment. This capability ensures that your jobs have immediate access to the necessary data. See the S3 source specification for more details.
To use the S3 source, you will have to to specify the mandatory name of the S3 bucket and the optional parameters Key, Filter, Region, Endpoint, VersionID and ChechsumSHA256.
Below is an example of how to define an S3 input source in YAML format:
To start, you'll need to connect the Bacalhau node to an IPFS server so that you can run jobs that consume CIDs as inputs. You can either install IPFS and run it locally, or you can connect to a remote IPFS server.
In both cases, you should have an IPFS multiaddress for the IPFS server that should look something like this:
The multiaddress above is just an example - you'll need to get the multiaddress of the IPFS server you want to connect to.
You can then configure your Bacalhau node to use this IPFS server by passing the --ipfs-connect
argument to the serve
command:
Or, set the Node.IPFS.Connect
property in the Bacalhau configuration file. See the IPFS input source specification for more details.
Below is an example of how to define an IPFS input source in YAML format:
The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution. See the Local source specification for more details.
To use a local data source, you will have to to:
Enable the use of local data when configuring the node itself by using the --allow-listed-local-paths
flag for bacalhau serve, specifying the file path and access mode. For example
In the job description specify parameters SourcePath - the absolute path on the compute node where your data is located and ReadWrite - the access mode.
Below is an example of how to define a Local input source in YAML format:
The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the job's execution environment, enabling direct and efficient data utilization. See the URL source specification for more details.
To use a URL data source, you will have to to specify only URL parameter, as in the part of the declarative job description below:
Bacalhau's S3 Publisher provides users with a secure and efficient method to publish job results to any S3-compatible storage service. To use an S3 publisher you will have to specify required parameters Bucket and Key and optional parameters Region, Endpoint, VersionID, ChecksumSHA256. See the S3 publisher specification for more details.
Here’s an example of the part of the declarative job description that outlines the process of using the S3 Publisher with Bacalhau:
The IPFS publisher works using the same setup as above - you'll need to have an IPFS server running and a multiaddress for it. Then you'll pass that multiaddress using the --ipfs-connect
argument to the serve
command. If you are publishing to a public IPFS node, you can use bacalhau get
with no further arguments to download the results. However, you may experience a delay in results becoming available as indexing of new data by public nodes takes time.
To use the IPFS publisher you will have to specify CID which can be used to access the published content. See the IPFS publisher specification for more details.
To speed up the download or to retrieve results from a private IPFS node, pass the swarm multiaddress to bacalhau get
to download results.
Pass the swarm key to bacalhau get
if the IPFS swarm is a private swarm.
And part of the declarative job description with an IPFS publisher will look like this:
The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.
Another possibility to store the results of a job execution is on a compute node. In such case the results will be published to the local compute node, and stored as compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. To use the Local publisher you will have to specify the only URL parameter with a HTTP URL to the location where you would like to save the result. See the Local publisher specification for more details.
Here is an example of part of the declarative job description with a local publisher:
How to use docker containers with Bacalhau
Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.
This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.
You can check out this example tutorial on how to work with custom containers in Bacalhau to see how we used all these steps together.
Here are few things to note before getting started:
Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64
, so containers in arm64
format are not able to run.
Input Flags: The --input ipfs://...
flag supports only directories and does not support CID subpaths. The --input https://...
flag supports only single files and does not support URL directories. The --input s3://...
flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04*
includes all logs for April 2023.
You can check to see a list of example public containers used by the Bacalhau team
Note: Only about a third of examples have their containers here. The rest are under random docker hub registries.
To help provide a safe, secure network for all users, we add the following runtime restrictions:
Limited Ingress/Egress Networking:
All ingress/egress networking is limited as described in the networking documentation. You won't be able to pull data/code/weights/
etc. from an external source.
Data Passing with Docker Volumes:
A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input
paths and also write results to an output
volume. This can be seen in the following example:
The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*
, which mounts all S3 objects in bucket mybucket
with logs-2023-04
prefix within the docker container at location /input
(root).
Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder
will be made available within the apples
folder in the job results CID.
Once the job has run on the executor, the contents of stdout
and stderr
will be added to any named output volumes the job has used (in this case apples
), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.
If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.
We make the assumption that you are reading from a directory called /inputs
, which is set as the default.
You can specify which directory the data is written to with the --input
CLI flag.
If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.
We make the assumption that you are writing to a directory called /outputs
, which is set as the default.
You can specify which directory the data is written to with the --output-volumes
CLI flag.
At this step, you create (or update) a Docker image that Bacalhau will use to perform your task. You build your image from your code and dependencies, then push it to a public registry so that Bacalhau can access it. This is necessary for other Bacalhau nodes to run your container and execute the given task.
Most Bacalhau nodes are of an x86_64
architecture, therefore containers should be built for x86_64
systems.
For example:
To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:
Let's see what each command will be used for:
Bacalhau will use the default ENTRYPOINT if your image contains one. If you need to specify another entrypoint, use the --entrypoint
flag to bacalhau docker run
.
For example:
The result of the commands' execution is shown below:
Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:
Copy data from a URL to public storage
Copy Data from S3 Bucket to public storage
You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data
To launch your workload in a Docker container, using the specified image and working with input
data specified via IPFS CID, run the following command:
To check the status of your job, run the following command:
To get more information on your job,run:
To download your job, run:
For example, running:
outputs:
The --input
flag does not support CID subpaths for ipfs://
content.
Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:
The --input
flag does not support URL directories.
If you run into this compute error while running your docker image
This can often be resolved by re-tagging your docker image
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel)
Bacalhau supports running programs that are compiled to WebAssembly (WASM). With the Bacalhau client, you can upload WASM programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.
Supported WebAssembly System Interface (WASI) Bacalhau can run compiled WASM programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions All ingress/egress networking is disabled – you won't be able to pull data/code/weights
etc. from an external source. WASM jobs can say what data they need using URLs or CIDs (Content IDentifier) and can then access the data by reading from the filesystem.
Single-Threading There is no multi-threading as WASI does not expose any interface for it.
If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:
Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input
flag when running the job. For instance, if your program used to fetch data from example.com
, read from the /inputs
folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com
.
Output Handling: Adjust your program to output results to standard output (stdout
) or standard error (stderr
) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of WASM jobs, a default folder at /outputs
is available, ensuring that data written there will persist after the job concludes.
By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.
You can specify additional or different output mounts using the -o
flag.
You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.
For example, Rust users can specify the wasm32-wasi
target to rustup
and cargo
to get programs compiled for WASI WebAssembly. See the Rust example for more information on this.
Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:
Copy data from a URL to public storage
Copy Data from S3 Bucket to public storage
You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data
You can run a WebAssembly program on Bacalhau using the bacalhau wasm run
command.
Run Locally Compiled Program:
If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm
program:
The program you specify will be uploaded to a Bacalhau storage node and will be publicly available.
Alternative Program Specification:
You can use a Content IDentifier (CID) for a specific WebAssembly program.
Input Data Specification:
Make sure to specify any input data using --input
flag.
This ensures the necessary data is available for the program's execution.
You can give the WASM program arguments by specifying them after the program path or CID. If the WASM program is already compiled and located in the current directory, you can run it by adding arguments after the file name:
For a specific WebAssembly program, run:
Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.
For example, instead of hard-coding your program to read from /inputs/data.txt
, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run
:
Your language of choice should contain a standard way of reading program arguments that will work with WASI.
You can also specify environment variables using the -e
flag.
See the Rust example for a workload that leverages WebAssembly support.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel)
How to run the WebUI.
The Bacalhau WebUI offers an intuitive interface for interacting with the Bacalhau network. This guide provides comprehensive instructions for setting up, deploying, and utilizing the WebUI.
For contributing to the WebUI's development, please refer to the Bacalhau WebUI GitHub Repository.
Ensure you have a Bacalhau v1.1.7 or later installed.
To launch the WebUI locally, execute the following command:
This command initializes a requester and compute node, configured to listen on HOST=0.0.0.0
and PORT=1234
.
Once started, the WebUI is accessible at http://127.0.0.1/. This local instance allows you to interact with your local Bacalhau network setup.
For observational purposes, a development version of the WebUI is available at bootstrap.development.bacalhau.org. This instance displays jobs from the development server.
N.b. The development version of the WebUI is for observation only and may not reflect the latest changes or features available in the local setup.
These are the flags that control the capacity of the Bacalhau node, and the limits for jobs that might be run.
The --limit-total-*
flags control the total system resources you want to give to the network. If left blank, the system will attempt to detect these values automatically.
The --limit-job-*
flags control the maximum amount of resources a single job can consume for it to be selected for execution.
Resource limits are not supported for Docker jobs running on Windows. Resource limits will be applied at the job bid stage based on reported job requirements but will be silently unenforced. Jobs will be able to access as many resources as requested at runtime.
Running a Windows-based node is not officially supported, so your mileage may vary. Some features (like resource limits) are not present in Windows-based nodes.
Bacalhau currently makes the assumption that all containers are Linux-based. Users of the Docker executor will need to manually ensure that their Docker engine is running and configured appropriately to support Linux containers, e.g. using the WSL-based backend.
Bacalhau can limit the total time a job spends executing. A job that spends too long executing will be cancelled, and no results will be published.
By default, a Bacalhau node does not enforce any limit on job execution time. Both node operators and job submitters can supply a maximum execution time limit. If a job submitter asks for a longer execution time than permitted by a node operator, their job will be rejected.
Applying job timeouts allows node operators to more fairly distribute the work submitted to their nodes. It also protects users from transient errors that result in their jobs waiting indefinitely.
Job submitters can pass the --timeout
flag to any Bacalhau job submission CLI to set a maximum job execution time. The supplied value should be a whole number of seconds with no unit.
The timeout can also be added to an existing job spec by adding the Timeout
property to the Spec
.
Node operators can pass the --max-job-execution-timeout
flag to bacalhau serve
to configure the maximum job time limit. The supplied value should be a numeric value followed by a time unit (one of s
for seconds, m
for minutes or h
for hours).
Node operators can also use configuration properties to configure execution limits.
Compute nodes will use the properties:
Requester nodes will use the properties:
How to use the Bacalhau Docker image
This documentation explains how to use the Bacalhau Docker image to run tasks and manage them using the Bacalhau client.
To get started, you need to install the Bacalhau client (see more information here) and Docker.
The first step is to pull the Bacalhau Docker image from the Github container registry.
Expected output:
You can also pull a specific version of the image, e.g.:
Remember that the "latest" tag is just a string. It doesn't refer to the latest version of the Bacalhau client, it refers to an image that has the "latest" tag. Therefore, if your machine has already downloaded the "latest" image, it won't download it again. To force a download, you can use the --no-cache
flag.
To check the version of the Bacalhau client, run:
Expected Output:
In the example below, an Ubuntu-based job runs to print the message 'Hello from Docker Bacalhau':
ghcr.io/bacalhau-project/bacalhau:latest
: Name of the Bacalhau Docker image
--id-only
: Output only the job id
--wait
: Wait for the job to finish
ubuntu:latest.
Ubuntu container
--
: Separate Bacalhau parameters from the command to be executed inside the container
sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'
: The command executed inside the container
Let's have a look at the command execution in the terminal:
The output you're seeing is in two parts: The first line: 13:53:46.478 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production'
is an informational message indicating the initialization of a repository at the specified directory ('/root/.bacalhau')
for the production
environment. The second line: ab95a5cc-e6b7-40f1-957d-596b02251a66
is a job ID
, which represents the result of executing a command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store that in an environment variable so that we can reuse it later on (env: JOB_ID=ab95a5cc-e6b7-40f1-957d-596b02251a66
)
To print out the content of the Job ID, run the following command:
Expected Output:
One inconvenience that you'll see is that you'll need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine. Let's take a look at the example below:
The first part of the example should look familiar, except for the Docker commands.
When a job is submitted, Bacalhau prints out the related job_id
(a46a9aa9-63ef-486a-a2f8-6457d7bafd2e
):
Job status: You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result
directory.
After the download has finished, you should see the following contents in the results directory.
If have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.
In this tutorial example, we will run Pandas script on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.
To get started, you need to install the Pandas library from pip:
Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv()
to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv
containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.
The overall purpose of the command above is to read data from a CSV file (transactions.csv
) using Pandas and print the resulting DataFrame.
To download the transactions.csv
file, run:
To output a content of the transactions.csv
file, run:
Now let's run the script to read in the CSV file. The output will be a DataFrame object.
To run Pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.
Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run
command to do this:
bacalhau docker run
: call to Bacalhau
amancevice/pandas
: Docker image with pandas installed.
-i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files
: Mounting the uploaded dataset to path. The -i
flag allows us to mount a file or directory from IPFS into the container. It takes two arguments, the first is the IPFS CID
QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz
) and the second is the file path within IPFS (/files
). The -i
flag can be used multiple times to mount multiple directories.
-w /files
Our working directory is /files. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python read_csv.py
: python script to read pandas script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:
In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat
and movies.dat
. After the dataset is downloaded, extract the zip and place ratings.dat
and movies.dat
into a folder called input
:
The structure of the input directory should be
To create a requirements.txt
for the Python libraries we’ll be using, create:
To install the dependencies, run:
Create a new file called similar-movies.py
and in it paste the following script
What the similar-movies.py
script does
Read the files with pandas. The code uses Pandas to read data from the files ratings.dat
and movies.dat
.
Create the ratings matrix of shape (m×u) with rows as movies and columns as user
Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.
Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.
Calculate cosine similarity, sort by most similar, and return the top N.
Select k principal components to represent the movies, a movie_id
to find recommendations, and print the top_n
results.
Running the script similar-movies.py
using the default values:
You can also use other flags to set your own values.
We will create a Dockerfile
and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.
We will use the python:3.8
docker image and add our script similar-movies.py
to copy the script to the docker image, similarly, we also add the dataset
directory and also the requirements
, after that run the command to install the dependencies in the image
The final folder structure will look like this:
We will run docker build
command to build the container:
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required, but you can use the latest
tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username
, repo name
or tag
.
In our case:
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.
To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:
bacalhau docker run
: call to Bacalhau
jsace/python-similar-movies
: the name and of the docker image we are using
-- python similar-movies.py
: execute the Python script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:
bacalhau docker run
: call to Bacalhau
jsace/python-similar-movies
: the name of the docker image we are using
-- python similar-movies.py --k 50 --id 10 --n 10
: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
We'll be using a very simple Python script that displays the . Create a file called hello-world.py
:
Running the script to print out the output:
After the script has run successfully locally we can now run it on Bacalhau.
To submit a workload to Bacalhau you can use the bacalhau docker run
command. This command allows passing input data into the container using volumes, we will be using the --input URL:path
for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
, so we must run the full command after the --
argument.
bacalhau docker run
: call to Bacalhau
--id-only
: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output
--input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".
python:3.10-slim
: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).
--
: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
python3 /inputs/hello-world.py
: running the hello-world.py
Python script stored in /inputs
.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. helloworld.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.
In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:
Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.
Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.
To get started, you need to install the Bacalhau client, see more information
There are no external dependencies that we need to install. All dependencies are already there in the container.
/inputs/hello.ipynb
: This is the path of the input Jupyter Notebook inside the Docker container.
-i
: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.
https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb
: This is the URL of the input Jupyter Notebook.
jsacex/jupyter
: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.
--
: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert
: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.
--execute
: This flag tells nbconvert
to execute the notebook and store the results in the output file.
--to notebook
: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.
--output /outputs/hello_output.ipynb
: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.
Job status: You can check the status of the job using bacalhau list
:
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
After the download has finished you can see the contents in the results
directory, running the command below:
Install Docker on your local machine.
Sign up for a DockerHub account if you don't already have one.
Step 1: Create a Dockerfile
Create a new file named Dockerfile in your project directory with the following content:
This Dockerfile creates a Docker image based on the official TensorFlow
GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb
) and a requirements.txt
file. It then upgrades pip
and installs Python packages from the requirements.txt
file, along with scikit-learn
. The resulting image provides an environment ready for running the mnist.ipynb
notebook with TensorFlow
and scikit-learn
, as well as other specified dependencies.
Step 2: Build the Docker Image
In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:
Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it with your DockerHub username and the name "your-dockerhub-username/jupyter-mnist-tensorflow".
Step 3: Push the Docker Image to DockerHub
Once the build process is complete, push the Docker image to DockerHub using the following command:
Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.
--gpu 1
: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git
: The -i
flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.
jsacex/jupyter-tensorflow-mnist:v02
: The name and the tag of the Docker image.
--
: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb
: The command to be executed inside the container. In this case, it runs the jupyter nbconvert
command to execute the mnist.ipynb
notebook and save the output as mnist_output.ipynb
in the /outputs
directory.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
After the download has finished you can see the contents in the results
directory, running the command below:
The outputs include our trained model and the Jupyter notebook with the output cells.
Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.
This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.
To get started, you need to install the Bacalhau client, see more information
To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.
Now let's create a file called bacalhau.py
. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.
This code has a few interesting features:
Change the value in the main
call (main("hashes.txt", 10)
) to change the number of jobs to execute.
Because all jobs are complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True
loop is used to monitor the status of jobs and wait for them to complete.
When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5)
loop in the getResultsFromJob
function involves retrying the bacalhau get
operation if it fails to complete successfully.
Let's run it!
Hopefully, the results
directory contains all the combined results from the jobs we just executed. Here's we're expecting to see CSV files:
Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.
You might also be interested in the following examples:
In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.
In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations
Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.
Computations like matrix multiplication could be done much faster on GPU than on CPU
To get started, you need to install the Bacalhau client, see more information
You'll need to have the following installed:
NVIDIA GPU
CUDA drivers installed
nvcc
installed
Checking if nvcc
is installed:
Downloading the programs:
00-hello-world.cu
:
This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.
02-cuda-hello-world-faster.cu
:
In this example we utilize Vector addition using CUDA and allocate the memory in advance and copy the memory to the GPU using cudaMemcpy so that it can utilize the HBM (High Bandwidth memory of the GPU). Compilation and execution occur faster (1.39 seconds) compared to the previous example (8.67 seconds).
To submit a job, run the following Bacalhau command:
bacalhau docker run
: call to Bacalhau
-i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu
: URL path of the input data volumes downloaded from a URL source.
nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04
: Docker container for executing CUDA programs (you need to choose the right CUDA docker container). The container should have the tag of "devel" in them.
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu
: Compilation using the nvcc compiler and save it to the outputs directory as hello
Note that there is ;
between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello
The ";" symbol allows executing multiple commands sequentially in a single line.
./outputs/hello
: Execution hello binary: You can combine compilation and execution commands.
Note that the CUDA version will need to be compatible with the graphics card on the host machine
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
This example will walk you through building Time Series Forecasting using . Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.
Quick script to run custom R container on Bacalhau:
To get started, you need to install the Bacalhau client, see more information
Open R studio or R-supported IDE. If you want to run this on a notebook server, then make sure you use an R kernel. Prophet is a CRAN package, so you can use install.packages
to install the prophet
package:
After installation is finished, you can download the example data that is stored in IPFS:
The code below instantiates the library and fits a model to the data.
Create a new file called Saturating-Forecasts.R
and in it paste the following script:
This script performs time series forecasting using the Prophet library in R, taking input data from a CSV file, applying the forecasting model, and generating plots for analysis.
Let's have a look at the command below:
This command uses Rscript to execute the script that was created and written to the Saturating-Forecasts.R
file.
The input parameters provided in this case are the names of input and output files:
example_wp_log_R.csv
- the example data that was previously downloaded.
outputs/output0.pdf
- the name of the file to save the first forecast plot.
outputs/output1.pdf
- the name of the file to save the second forecast plot.
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
These commands specify how the image will be built, and what extra requirements will be included. We use r-base
as the base image and then install the prophet
package. We then copy the Saturating-Forecasts.R
script into the container and set the working directory to the R
folder.
We will run docker build
command to build the container:
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest
tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.
In our case:
The following command passes a prompt to the model and generates the results in the outputs directory. It takes approximately 2 minutes to run.
bacalhau docker run
: call to Bacalhau
-i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv
: Mounting the uploaded dataset at /inputs
in the execution. It takes two arguments, the first is the IPFS CID (QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFtz
) and the second is file path within IPFS (/example_wp_log_R.csv
)
ghcr.io/bacalhau-project/examples/r-prophet:0.0.2
: the name and the tag of the docker image we are using
/example_wp_log_R.csv
: path to the input dataset
/outputs/output0.pdf
, /outputs/output1.pdf
: paths to the output
Rscript Saturating-Forecasts.R
: execute the R script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
You can't natively display PDFs in notebooks, so here are some static images of the PDFs:
output0.pdf
output1.pdf
Bacalhau, a powerful and versatile data processing platform, has recently integrated Amazon Web Services (AWS) S3, allowing users to seamlessly access and process data stored in S3 buckets within their Bacalhau jobs. This integration not only simplifies data input, output, and processing operations but also streamlines the overall workflow by enabling users to store and manage their data effectively in S3 buckets. With Bacalhau, you can process several Large s3 buckets in parallel. In this example, we will walk you through the process of reading data from multiple S3 buckets, converting TIFF images to JPEG format.
There are several advantages to converting images from TIFF to JPEG format:
Reduced File Size: JPEG images use lossy compression, which significantly reduces file size compared to lossless formats like TIFF. Smaller file sizes lead to faster upload and download times, as well as reduced storage requirements.
Efficient Processing: With smaller file sizes, image processing tasks tend to be more efficient and require less computational resources when working with JPEG images compared to TIFF images.
Training Machine Learning Models: Smaller file sizes and reduced computational requirements make JPEG images more suitable for training machine learning models, particularly when dealing with large datasets, as they can help speed up the training process and reduce the need for extensive computational resources.
We will use the S3 mount feature to mount bucket objects from s3 buckets. Let’s have a look at the example below:
It defines S3 object as input to the job:
sentinel-s1-rtc-indigo
: bucket’s name
tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif
: represents the key of the object in that bucket. The object to be processed is called Gamma0_VH.tif
and is located in the subdirectory with the specified path.
But if you want to specify the entire objects located in the path, you can simply add *
to the end of the path (tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/*
)
dst=/sentinel-s1-rtc-indigo
: the destination to which to mount the s3 bucket object
opt=region=us-west-2
: specifying the region in which the bucket is located
In the example below, we will mount several bucket objects from public s3 buckets located in a specific region:
The job has been submitted and Bacalhau has printed out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the images, download the job results and open the folder:
You can use official Docker containers for each language, like R or Python. In this example, we will use the official R container and run it on Bacalhau.
In this tutorial example, we will run a "hello world" R script on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
To install R follow these instructions . After R and RStudio are installed, create and run a script called hello.R
:
Run the script:
Next, upload the script to your public storage (in our case, IPFS). We've already uploaded the script to IPFS and the CID is: QmVHSWhAL7fNkRiHfoEJGeMYjaYZUsKHvix7L54SptR8ie
. You can look at this by browsing to one of the HTTP IPFS proxies like or .
Now it's time to run the script on Bacalhau:
bacalhau docker run
: call to Bacalhau
i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R
: Mounting the uploaded dataset at /inputs
in the execution. It takes two arguments, the first is the IPFS CID (QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk
) and the second is file path within IPFS (/hello.R
)
r-base
: docker official image we are using
Rscript hello.R
: execute the R script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on:
The job description should be saved in .yaml
format, e.g. rhello.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
You can generate the job request using bacalhau describe
with the --spec
flag. This will allow you to re-run that job in the future:
Bacalhau supports running jobs as a program. This example demonstrates how to compile a project into WebAssembly and run the program on Bacalhau.
To get started, you need to install the Bacalhau client, see more information .
A working Rust installation with the wasm32-wasi
target. For example, you can use to install Rust and configure it to build WASM targets. For those using the notebook, these are installed in hidden cells below.
We can use cargo
(which will have been installed by rustup
) to start a new project (my-program
) and compile it:
We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.
The program below will use the Rust imageproc
crate to resize an image through seam carving, based on .
In the main function main()
an image is loaded, the original is saved, and then a loop is performed to reduce the width of the image by removing "seams." The results of the process are saved, including the original image with drawn seams and a gradient image with highlighted seams.
We also need to install the imageproc
and image
libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false
). After disabling the default features, you need to explicitly specify only the features that you need:
We can now build the Rust program into a WASM blob using cargo
:
This command navigates to the my-program
directory and builds the project using Cargo with the target set to wasm32-wasi
in release mode.
This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm
which can now be run on Bacalhau.
Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.
The -i
flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.
For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.
bacalhau wasm run
: call to Bacalhau
./my-program/target/wasm32-wasi/release/my-program.wasm
: the path to the WASM file that will be executed
_start
: the entry point of the WASM program, where its execution begins
--id-only
: this flag indicates that only the identifier of the executed job should be returned
-i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs
: input data volume that will be accessible within the job at the specified destination path
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results
) and downloaded our job output to be stored in that directory.
We can now get the results.
When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.
Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.
This tutorial is a quick guide on how to run a hello world script on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
To get started, install swipl
Create a file called helloworld.pl
. The following script prints ‘Hello World’ to the stdout:
Running the script to print out the output:
After the script has run successfully locally, we can now run it on Bacalhau.
Before running it on Bacalhau we need to upload it to IPFS.
Using the IPFS cli
:
Run the command below to check if our script has been uploaded.
This command outputs the CID. Copy the CID of the file, which in our case is QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii
We will mount the script to the container using the -i
flag: -i: ipfs://< CID >:/< name-of-the-script >
.
To submit a job, run the following Bacalhau command:
-i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl
: Sets the input data for the container.
mYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii
is our CID which points to the helloworld.pl
file on the IPFS network. This file will be accessible within the container.
-- swipl -q -s helloworld.pl -g hello_world
: instructs SWI-Prolog to load the program from the helloworld.pl
file and execute the hello_world
function in quiet mode:
-q
: running in quiet mode
-s
: load file as a script. In this case we want to run the helloworld.pl
script
-g
: is the name of the function you want to execute. In this case its hello_world
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
How to pin data to public storage
If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.
To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files - usually services provide a web interface, CLI and code samples for integration into your application. Once you upload the files you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9
. Now you can access pinned data from the jobs via this CID.
Data source can be specified via --input
flag, see the for more details
Templating Support in Bacalhau Job Run
This documentation introduces templating support for , providing users with the ability to dynamically inject variables into their job specifications. This feature is particularly useful when running multiple jobs with varying parameters such as DuckDB query, S3 buckets, prefixes, and time ranges without the need to edit each job specification file manually.
The motivation behind this feature arises from the need to streamline the process of preparing and running multiple jobs with different configurations. Rather than manually editing job specs for each run, users can leverage placeholders and pass actual values at runtime.
The templating functionality in Bacalhau is built upon the Go text/template
package. This powerful library offers a wide range of features for manipulating and formatting text based on template definitions and input variables.
For more detailed information about the Go text/template
library and its syntax, please refer to the official documentation: .
You can also use environment variables for templating:
To preview the final templated job spec without actually submitting the job, you can use the --dry-run
flag:
This will output the processed job specification, showing you how the placeholders have been replaced with the provided values.
This is an ops
job that runs on all nodes that match the job selection criteria. It accepts duckdb query
variable, and two optional start-time
and end-time
variables to define the time range for the query.
To run this job, you can use the following command:
This is a batch
job that runs on a single node. It accepts duckdb query
variable, and four other variables to define the S3 bucket, prefix, pattern for the logs and the AWS region.
To run this job, you can use the following command:
Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.
To get started, you need to install the Bacalhau client, see more information
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1
: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01
from the bucket noaa-goes16
in us-east-1
region, and mount the objects under /inputs
path inside the docker job.
-- sh -c "cp -r /inputs/* /outputs/"
: copies all files under /inputs
to /outputs
, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the for more details.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.
When the download is completed, the results of the job will be present in the directory. To view them, run the following command:
First you need to install jq
(if it is not already installed) to process JSON:
To extract the CIDs from output JSON, execute following:
The extracted CID will look like this:
You can publish your results to Amazon s3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.
To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.
For S3-compatible destinations, the configuration is as follows:
For Amazon S3, you can specify the PublisherSpec
configuration as shown below:
Let's explore some examples to illustrate how you can use this:
Publishing results to S3 using default settings
Publishing results to S3 with a custom endpoint and region:
Publishing results to S3 as a single compressed file
Utilizing naming placeholders in the object key
Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.
Here's an example of a sample result:
To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:
Environment variables, such as AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
Credentials file ~/.aws/credentials
IAM Roles for Amazon EC2 Instances
Config property | Meaning |
---|---|
Config property | Meaning |
---|---|
If you are interested in finding out more about how to ingest your data into IPFS, please see the .
We've already uploaded the script and data to IPFS to the following CID: QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz
. You can look at this by browsing to one of the HTTP IPFS proxies like or .
If you have questions or need support or guidance, please reach out to the (#general channel).
Download Movielens1M dataset from this link
For further reading on how the script works, go to
See more information on how to containerize your script/app
hub-user
with your docker hub username, If you don’t have a docker hub account , and use the username of the account you created
If you have questions or need support or guidance, please reach out to the (#general channel).
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
To get started, you need to install the Bacalhau client, see more information
If you have questions or need support or guidance, please reach out to the (#general channel).
If you have questions or need support or guidance, please reach out to the (#general channel).
If you have questions or need support or guidance, please reach out to the (#general channel).
To use Bacalhau, you need to package your code in an appropriate format. The developers have already pushed a container for you to use, but if you want to build your own, you can follow the steps below. You can view a in the documentation.
hub-user
with your docker hub username. If you don’t have a docker hub account , and use the username of the account you created
If you have questions or need support or guidance, please reach out to the (#general channel).
To get started, you need to install the Bacalhau client, see more information
If you have questions or need support or guidance, please reach out to the (#general channel).
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
If you have questions or need support or guidance, please reach out to the (#general channel).
Since the data uploaded to IPFS isn’t pinned, we will need to do that manually. Check this information on how to pin your We recommend using .
If you have questions or need support or guidance, please reach out to the (#general channel).
For questions, feedback, please reach out in our
Node.Compute.JobTimeouts.MinJobExecutionTimeout
The minimum acceptable value for a job timeout. A job will only be accepted if it is submitted with a timeout of longer than this value.
Node.Compute.JobTimeouts.MaxJobExecutionTimeout
The maximum acceptable value for a job timeout. A job will only be accepted if it is submitted with a timeout of shorter than this value.
Node.Compute.JobTimeouts.DefaultJobExecutionTimeout
The job timeout that will be applied to jobs that are submitted without a timeout value.
Node.Requester.Timeouts.MinJobExecutionTimeout
If a job is submitted with a timeout less than this value, the default job execution timeout will be used instead.
Node.Requester.Timeouts.DefaultJobExecutionTimeout
The timeout to use in the job if a timeout is missing or too small.
An InputSource
defines where and how to retrieve specific artifacts needed for a Task
, such as files or data, and where to mount them within the task's context. This ensures the necessary data is present before the task's execution begins.
Bacalhau's InputSource
natively supports fetching data from remote sources like S3 and IPFS and can also mount local directories. It is intended to be flexible for future expansion.
InputSource
Parameters:Source (
SpecConfig
: <required>)
: Specifies the origin of the artifact, which could be a URL, an S3 bucket, or other locations.
Alias (string: <optional>)
: An optional identifier for this input source. It's particularly useful for dynamic operations within a task, such as dynamically importing data in WebAssembly using an alias.
Target (string: <required>)
: Defines the path inside the task's environment where the retrieved artifact should be mounted or stored. This ensures that the task can access the data during its execution.
In this example, the first input source fetches data from an S3 bucket and mounts it at /my_s3_data
within the task. The second input source mounts a local directory at /my_local_data
and allows the task to read and write data to it.
A synthetic dataset is generated by algorithms or simulations which has similar characteristics to real-world data. Collecting real-world data, especially data that contains sensitive user data like credit card information, is not possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit fraud, they can use synthetically generated data instead of using real data without compromising the privacy of users.
The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.
In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.
To get started, you need to install the Bacalhau client, see more information here
To run Sparkov locally, you'll need to clone the repo and install dependencies:
Go to the Sparkov_Data_Generation
directory:
Create a temporary directory (outputs
) to store the outputs:
The command above executes the Python script datagen.py
, passing the following arguments to it:
-n 1000
: Number of customers to generate
-o ../outputs
: path to store the outputs
"01-01-2022"
: Start date
"10-01-2022"
: End date
Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from 01-01-2022
to 10-01-2022
and saves the results in the ../outputs
directory.
To see the full list of options, use:
To build your own docker container, create a Dockerfile
, which contains instructions to build your image:
These commands specify how the image will be built, and what extra requirements will be included. We use python:3.8
as the base image, install git
, clone the Sparkov_Data_Generation
repository from GitHub, set the working directory inside the container to /Sparkov_Data_Generation/
, and install Python dependencies listed in the requirements.txt
file."
See more information on how to containerize your script/app here
We will run docker build
command to build the container:
Before running the command replace:
hub-user
with your docker hub username. If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest
tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau
Now we're ready to run a Bacalhau job:
bacalhau docker run
: call to Bacalhau
jsacex/sparkov-data-generation
: the name of the docker image we are using
-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"
: the arguments passed into the container, specifying the execution of the Python script datagen.py
with specific parameters, such as the amount of data, output path, and time range.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the contents of the current directory, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
The Labels
block within a Job
specification plays a crucial role in Bacalhau, serving as a mechanism for filtering jobs. By attaching specific labels to jobs, users can quickly and effectively filter and manage jobs via both the Command Line Interface (CLI) and Application Programming Interface (API) based on various criteria.
Labels
ParametersLabels are essentially key-value pairs attached to jobs, allowing for detailed categorizations and filtrations. Each label consists of a Key
and a Value
. These labels can be filtered using operators to pinpoint specific jobs fitting certain criteria.
Jobs can be filtered using the following operators:
in
: Checks if the key's value matches any within a specified list of values.
notin
: Validates that the key's value isn’t within a provided list of values.
exists
: Checks for the presence of a specified key, regardless of its value.
!
: Validates the absence of a specified key. (i.e., DoesNotExist)
gt
: Checks if the key's value is greater than a specified value.
lt
: Checks if the key's value is less than a specified value.
= & ==
: Used for exact match comparisons between the key’s value and a specified value.
!=
: Validates that the key’s value doesn't match a specified value.
Filter jobs with a label whose key is "environment" and value is "development":
Filter jobs with a label whose key is "version" and value is greater than "2.0":
Filter jobs with a label "project" existing:
Filter jobs without a "project" label:
Job Management: Enables efficient management of jobs by categorizing them based on distinct attributes or criteria.
Automation: Facilitates the automation of job deployment and management processes by allowing scripts and tools to target specific categories of jobs.
Monitoring & Analytics: Enhances monitoring and analytics by grouping jobs into meaningful categories, allowing for detailed insights and analysis.
The Labels
block is instrumental in the enhanced management, filtering, and operation of jobs within Bacalhau. By understanding and utilizing the available operators and label parameters effectively, users can optimize their workflow, automate processes, and achieve detailed insights into their jobs.
The Network
object offers a method to specify the networking requirements of a Task
. It defines the scope and constraints of the network connectivity based on the demands of the task.
Network
Parameters:Type (string: "None")
: Indicates the network configuration's nature. There are several network modes available:
None
: This mode implies that the task does not necessitate any networking capabilities.
Full
: Specifies that the task mandates unrestricted, raw IP networking without any imposed filters.
HTTP
: This mode constrains the task to only require HTTP networking with specific domains. In this model:
The job specifier puts forward a job, stipulating the domain(s) it intends to communicate with.
The compute provider assesses the inherent risk of the job based on these domains and bids accordingly.
At runtime, the network traffic remains strictly confined to the designated domain(s).
A typical command for this might resemble: bacalhau docker run —network=http —domain=crates.io —domain=github.com -i ipfs://Qmy1234myd4t4,dst=/code rust/compile
The primary risks for the compute provider center around possible violations of its terms, its hosting provider's terms, or even prevailing laws in its jurisdiction. This encompasses issues such as unauthorized access or distribution of illicit content and potential cyber-attacks.
Conversely, the job specifier's primary risk involves operating in a paid environment. External entities might seek to exploit this environment, for instance, through a compromised package download that initiates a crypto mining operation, depleting the allocated, prepaid job time. By limiting traffic strictly to the pre-specified domains, the potential for such cyber threats diminishes considerably.
While a compute provider might impose its limits through other means, having domains declared upfront allows it to selectively bid on jobs that it can execute without issues, improving the user experience for job specifiers.
Domains (string[]: <optional>)
: A list of domain strings, relevant primarily when the Type
is set to HTTP. It dictates the specific domains the task can communicate with over HTTP.
Understanding and utilizing these configurations aptly can ensure that tasks are executed in an environment that aligns with their networking requirements, bolstering efficiency and security.
To upload a file from a URL we will use the bacalhau docker run
command.
The job has been submitted and Bacalhau has printed out the related job id.
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau using docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md
: URL path of the input data volumes downloaded from a URL source.
ghcr.io/bacalhau-project/examples/upload:v1
: the name and tag of the docker image we are using
The bacalhau docker run
command takes advantage of the --input
parameter. This will download a file from a public URL and place it in the /inputs
directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.
You can find out more about the helper container in the examples repository which is designed to simplify the data uploading process.
For more details, see the CLI commands guide
Job status: You can check the status of the job using bacalhau list
, processing the json ouput with the jq
:
When the job status is Published
or Completed
, that means the job is done, and we can get the results using the job ID.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.
Each job result contains an outputs
subfolder and exitCode
, stderr
and stdout
files with relevant content. To view the execution logs execute following:
And to view the job execution result (README.md
file in the example case), which was saved as a job output, execute:
To get the output CID from a completed job, run the following command:
The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.
Now that we have the CID, we can use it in a new job. This time we will use the --input
parameter to tell Bacalhau to use the CID we just uploaded.
In this case, the only goal of our job is just to list the contents of the /inputs
directory. You can see that the "input" data is located under /inputs/outputs/README.md
.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
For questions and feedback, please reach out in our Slack
A Constraint
represents a condition that must be met for a compute node to be eligible to run a given job. Operators have the flexibility to manually define node labels when initiating a node using the bacalhau serve command. Additionally, Bacalhau boasts features like automatic resource detection and dynamic labeling, further enhancing its capability.
By defining constraints, you can ensure that jobs are scheduled on nodes that have the necessary requirements or conditions.
Constraint
Parameters:Key: The name of the attribute or property to check on the compute node. This could be anything from a specific hardware feature, operating system version, or any other node property.
Operator: Determines the kind of comparison to be made against the Key
's value, which can be:
in
: Checks if the Key's value exists within the provided list of values.
notin
: Ensures the Key's value doesn't match any in the provided list of values.
exists
: Verifies that a value for the specified Key is present, regardless of its actual value.
!
: Confirms the absence of the specified Key. i.e DoesNotExist
gt
: Assesses if the Key's value is greater than the provided value.
lt
: Assesses if the Key's value is less than the provided value.
=
& ==
: Both are used to compare the Key's value for an exact match with the provided value.
!=
: Ensures the Key's value is not the same as the provided value.
Values (optional): A list of values that the node attribute, specified by the Key
, is compared against using the Operator
. This is not needed for operators like exists
or !
.
Consider a scenario where a job should only run on nodes with a GPU and an operating system version greater than 2.0
. The constraints for such a requirement might look like:
In this example, the first constraint checks if the node has a GPU, the second constraint ensures the OS is linux, and deployed in eu-west-1 or eu-west-2`.
Constraints are evaluated as a logical AND, meaning all constraints must be satisfied for a node to be eligible.
Using too many specific constraints can lead to a job not being scheduled if no nodes satisfy all the conditions.
It's essential to balance the specificity of constraints with the broader needs and resources available in the cluster.
The different job types available in Bacalhau
Bacalhau has recently introduced different job types in v1.1, providing more control and flexibility over the orchestration and scheduling of those jobs - depending on their type.
Despite the differences in job types, all jobs benefit from core functionalities provided by Bacalhau, including:
Node selection - the appropriate nodes are selected based on several criteria, including resource availability, priority and feedback from the nodes.
Job monitoring - jobs are monitored to ensure they complete, and that they stay in a healthy state.
Retries - within limits, Bacalhau will retry certain jobs a set number of times should it fail to complete successfully when requested.
Batch jobs are executed on demand, running on a specified number of Bacalhau nodes. These jobs either run until completion or until they reach a timeout. They are designed to carry out a single, discrete task before finishing. This is the only queueable job type.
Ideal for intermittent yet intensive data dives, for instance performing computation over large datasets before publishing the response. This approach eliminates the continuous processing overhead, focusing on specific, in-depth investigations and computation.
Similar to batch jobs, ops jobs have a broader reach. They are executed on all nodes that align with the job specification, but otherwise behave like batch jobs.
Ops jobs are perfect for urgent investigations, granting direct access to logs on host machines, where previously you may have had to wait for the logs to arrive at a central location before being able to query them. They can also be used for delivering configuration files for other systems should you wish to deploy an update to many machines at once.
Daemon jobs run continuously on all nodes that meet the criteria given in the job specification. Should any new compute nodes join the cluster after the job was started, and should they meet the criteria, the job will be scheduled to run on that node too.
A good application of daemon jobs is to handle continuously generated data on every compute node. This might be from edge devices like sensors, or cameras, or from logs where they are generated. The data can then be aggregated and compressed them before sending it onwards. For logs, the aggregated data can be relayed at regular intervals to platforms like Kafka or Kinesis, or directly to other logging services with edge devices potentially delivering results via MQTT.
Service jobs run continuously on a specified number of nodes that meet the criteria given in the job specification. Bacalhau's orchestrator selects the optimal nodes to run the job, and continuously monitors its health, performance. If required, it will reschedule on other nodes.
This job type is good for long-running consumers such as streaming or queuing services, or real-time event listeners.
In both the Job
and Task
specifications within Bacalhau, the Meta
block is a versatile element used to attach arbitrary metadata. This metadata isn't utilized for filtering or categorizing jobs; there's a separate Labels
block specifically designated for that purpose. Instead, the Meta
block is instrumental for embedding additional information for operators or external systems, enhancing clarity and context.
Meta
Parameters in Job and Task SpecsThe Meta
block is comprised of key-value pairs, with both keys and values being strings. These pairs aren't constrained by a predefined structure, offering flexibility for users to annotate jobs and tasks with diverse metadata.
Users can incorporate any arbitrary key-value pairs to convey descriptive information or context about the job or task.
project: Identifies the associated project.
version: Specifies the version of the application or service.
owner: Names the responsible team or individual.
environment: Indicates the stage in the development lifecycle.
Beyond user-defined metadata, Bacalhau automatically injects specific metadata keys for identification and security purposes.
bacalhau.org/requester.id: A unique identifier for the orchestrator that handled the job.
bacalhau.org/requester.publicKey: The public key of the requester, aiding in security and validation.
bacalhau.org/client.id: The ID for the client submitting the job, enhancing traceability.
Identification: The metadata aids in uniquely identifying jobs and tasks, connecting them to their originators and executors.
Context Enhancement: Metadata can supplement jobs and tasks with additional data, offering insights and context that aren't captured by standard parameters.
Security Enhancement: Auto-generated keys like the requester's public key contribute to the secure handling and execution of jobs and tasks.
While the Meta
block is distinct from the Labels
block used for filtering, its contribution to providing context, security, and traceability is integral in managing and understanding the diverse jobs and tasks within the Bacalhau ecosystem effectively.
A Job
represents a discrete unit of work that can be scheduled and executed. It carries all the necessary information to define the nature of the work, how it should be executed, and the resources it requires.
job
ParametersName (string : <optional>)
: A logical name to refer to the job. Defaults to job ID.
Namespace (string: "default")
: The namespace in which the job is running. ClientID
is used as a namespace in the public demo network.
Type (string: <required>)
: The type of the job, such as batch
, ops
, daemon
or service
. You can learn more about the supported jobs types in the Job Types guide.
Priority (int: 0
): Determines the scheduling priority.
Count (int: <required)
: Number of replicas to be scheduled. This is only applicable for jobs of type batch
and service
.
Meta (
Meta
: nil)
: Arbitrary metadata associated with the job.
Labels (
Label
[] : nil)
: Arbitrary labels associated with the job for filtering purposes.
Constraints (
Constraint
[] : nil)
: These are selectors which must be true for a compute node to run this job.
Tasks (
Task
[] : <required>)
:: Task associated with the job, which defines a unit of work within the job. Today we are only supporting single task per job, but with future plans to extend this.
The following parameters are generated by the server and should not be set directly.
ID (string)
: A unique identifier assigned to this job. It's auto-generated by the server and should not be set directly. Used for distinguishing between jobs with similar names.
State (
State
)
: Represents the current state of the job.
Version (int)
: A monotonically increasing version number incremented on job specification update.
Revision (int)
: A monotonically increasing revision number incremented on each update to the job's state or specification.
CreateTime (int)
: Timestamp of job creation.
ModifyTime (int)
: Timestamp of last job modification.
State
Structure SpecificationWithin Bacalhau, the State
structure is designed to represent the status or state of an object (like a Job
), coupled with a human-readable message for added context. Below is a breakdown of the structure:
State
ParametersStateType (T : <required>)
: Represents the current state of the object. This is a generic parameter that will take on a specific value from a set of defined state types for the object in question. For jobs, this will be one of the JobStateType
values.
Message (string : <optional>)
: A human-readable message giving more context about the current state. Particularly useful for states like Failed
to provide insight into the nature of any error.
When State
is used for a job, the StateType
can be one of the following:
Pending
: This indicates that the job is submitted but is not yet scheduled for execution.
Running
: The job is scheduled and is currently undergoing execution.
Completed
: This state signifies that a job has successfully executed its task. Only applicable for batch jobs.
Failed
: A state indicating that the job encountered errors and couldn't successfully complete.
JobStateTypeStopped
: The job has been intentionally halted by the user before its natural completion.
The inclusion of the Message
field can offer detailed insights, especially in states like Failed
, aiding in error comprehension and debugging.
SpecConfig
provides a unified structure to specify configurations for various components in Bacalhau, including engines, publishers, and input sources. Its flexible design allows seamless integration with multiple systems like Docker, WebAssembly (Wasm), AWS S3, and local directories, among others.
SpecConfig
ParametersType (string : <required>)
: Specifies the type of the configuration. Examples include docker
and wasm
for execution engines, S3
for input sources and publishers, etc.
Params (map[string]any : <optional>)
: A set of key-value pairs that provide the specific configurations for the chosen type. The keys and values are flexible and depend on the Type
. For instance, parameters for a Docker engine might include image name and version, while an S3 publisher would require configurations like the bucket name and AWS region. If not provided, it defaults to nil
.
Here are a few hypothetical examples to demonstrate how you might define SpecConfig
for different components:
Full Docker spec can be found here.
Full S3 Publisher can be found here.
Full local source can be found here.
Remember, the exact keys and values in the Params
map will vary depending on the specific requirements of the component being configured. Always refer to the individual component's documentation to understand the available parameters.
A Task
signifies a distinct unit of work within the broader context of a Job
. It defines the specifics of how the task should be executed, where the results should be published, what environment variables are needed, among other configurations
Task
ParametersName (string : <required>)
: A unique identifier representing the name of the task.
Engine (
SpecConfig
: required)
: Configures the execution engine for the task, such as Docker or WebAssembly.
Publisher (
SpecConfig
: optional)
: Specifies where the results of the task should be published, such as S3 and IPFS publishers. Only applicable for tasks of type batch
and ops
.
Env (map[string]string : optional)
: A set of environment variables for the driver.
Meta (
Meta
: optional)
: Allows association of arbitrary metadata with this task.
InputSources (
InputSource
[] : optional)
: Lists remote artifacts that should be downloaded before task execution and mounted within the task, such as from S3 or HTTP/HTTPs.
ResultPaths (
ResultPath
[] : optional)
: Indicates volumes within the task that should be included in the published result. Only applicable for tasks of type batch
and ops
.
Resources (
Resources
: optional)
: Details the resources that this task requires.
Network (
Network
: optional)
: Configurations related to the networking aspects of the task.
Timeouts (
Timeouts
: optional)
: Configurations concerning any timeouts associated with the task.
A ResultPath
denotes a specific location within a Task
that contains meaningful output or results. By specifying a ResultPath
, you can pinpoint which files or directories are essential and should be retained or published after the task's execution.
ResultPath
Parameters:Name: A descriptive label or identifier for the result, allowing for easier referencing and understanding of the output's nature or significance.
Path: Specifies the exact location, either a file or a directory, within the task's environment where the result or output is stored. This ensures that after the task completes, the critical data at this path can be accessed, retained, or published as necessary.
The Resources
provides a structured way to detail the computational resources a Task
requires. By specifying these requirements, you ensure that the task is scheduled on a node with adequate resources, optimizing performance and avoiding potential issues linked to resource constraints.
Resources
Parameters:CPU (string: <optional>)
: Defines the CPU resources required for the task. Units can be specified in cores (e.g., 2
for 2 CPU cores) or in milliCPU units (e.g., 250m
or 0.25
for 250 milliCPU units). For instance, if you have half a CPU core, you can represent it as 500m
or 0.5
.
Memory (string: <optional>)
: Highlights the amount of RAM needed for the task. You can specify the memory in various units such as:
Kb
for Kilobytes
Mb
for Megabytes
Gb
for Gigabytes
Tb
for Terabytes
Disk (string: <optional>)
: States the disk storage space needed for the task. Similarly, the disk space can be expressed in units like Gb
for Gigabytes, Mb
for Megabytes, and so on. As an example, 10Gb
indicates 10 Gigabytes of storage space.
GPU (string: <optional>)
: Denotes the number of GPU units required. For example, 2
signifies the requirement of 2 GPU units. This is crucial for tasks involving heavy computational processes, machine learning models, or tasks that leverage GPU acceleration.
The Timeouts
object provides a mechanism to impose timing constraints on specific task operations, particularly execution. By setting these timeouts, users can ensure tasks don't run indefinitely and align them with intended durations.
Timeouts
Parameters:ExecutionTimeout (int: <optional>)
: Defines the maximum duration (in seconds) that a task is permitted to run. A value of zero indicates that there's no set timeout. This could be particularly useful for tasks that function as daemons and are designed to run indefinitely.
Utilizing the Timeouts
judiciously helps in managing resource utilization and ensures tasks adhere to expected timelines, thereby enhancing the efficiency and predictability of job executions.
Bacalhau has an update checking service to automatically detect whether a newer version of the software is available.
Users who are both running CLI commands and operating nodes will be regularly informed that a new release can be downloaded and installed.
Bacalhau will run an update check regularly when client commands are executed. If an update is available, explanatory text will be printed at the end of the command.
To force a manual update check, run the bacalhau version
command, which will explicitly list the latest software release alongside the server and client versions.
Bacalhau will run an update check regularly as part of the normal operation of the node.
If an update is available, an INFO level message will be printed to the log.
Bacalhau has some configuration options for controlling how often checks are performed. By default, an update check will run no more than once every 24 hours. Users can opt out of automatic update checks using the configuration described below.
It's important to note that disabling the automatic update checks may lead to potential issues, arising from mismatched versions of different actors within Bacalhau.
To output update check config, run bacalhau config list
:
When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.
setting-up/networking-instructions/networking.md
If you want more control over making the decision to take on jobs, you can use the --job-selection-probe-exec
and --job-selection-probe-http
flags.
These are external programs that are passed the following data structure so that they can make a decision about whether or not to take on a job:
The exec
probe is a script to run that will be given the job data on stdin
, and must exit with status code 0 if the job should be run.
The http
probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns a non-positive status code (e.g. >= 400).
If the HTTP response is a JSON blob, it should match the following schema and will be used to respond to the bid directly:
For example, the following response will reject the job:
If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.
DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.
DuckDB is suited for the following use cases:
Processing and storing tabular datasets, e.g. from CSV or Parquet files
Interactive data analysis, e.g. Joining & aggregate multiple large tables
Concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
Large result set transfer to client
In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install, and there is no need to download the datasets since the datasets are already there on IPFS or on the web.
How to run a relational database (like DUCKDB) on Bacalhau
To get started, you need to install the Bacalhau client, see more information
You can skip this entirely and directly go to running on Bacalhau.
If you want any additional dependencies to be installed along with DuckDB, you need to build your own container.
To build your own docker container, create a Dockerfile
, which contains instructions to build your DuckDB docker container.
We will run docker build
command to build the container:
Before running the command replace:
repo-name with the name of the container, you can name it anything you want
tag this is not required, but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
davidgasquez/datadex:v0.2.0
: the name and the tag of the docker image we are using
/inputs/
: path to input dataset
'duckdb -s "select 1"'
: execute DuckDB
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. duckdb1.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
Each job creates 3 subfolders: the combined_results,per_shard files, and the raw directory. To view the file, run the following command:
Expected output:
Below is the bacalhau docker run
command to to run arbitrary SQL commands over the yellow taxi trips dataset
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
davidgasquez/duckdb:latest
: the name and the tag of the docker image we are using
/inputs
: path to input dataset
duckdb -s
: execute DuckDB
The job description should be saved in .yaml
format, e.g. duckdb2.yaml
, and then run with the command:
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
Each job creates 3 subfolders: the combined_results, per_shard files, and the raw directory. To view the file, run the following command:
Converting from CSV to parquet or avro reduces the size of the file and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on ipfs or on the web without the need to download files and install dependencies locally.
In this example tutorial we will convert a CSV file from a URL to parquet format and save the converted parquet file to IPFS
To get started, you need to install the Bacalhau client, see more information
Let's download the transactions.csv
file:
You can use the CSV files from
Write the converter.py
Python script, that serves as a CSV converter to Avro or Parquet formats:
In our case:
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
We will run the docker build
command to build the container:
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
With the command below, we are mounting the CSV file for transactions from IPFS
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
jsacex/csv-to-arrow-or-parque
: the name and the tag of the docker image we are using
../inputs/transactions.csv
: path to input dataset
../outputs/transactions.parquet parquet
: path to the output
python3 src/converter.py
: execute the script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. convertcsv.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
Alternatively, you can do this:
The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.
In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.
To get started, you need to install the Bacalhau client, see more information
For the purposes of this example we will use the dataset in the "Gridded" format from the and long-term global sea surface temperature data from - information about that dataset can be found .
Next let's write the requirements.txt
. This file will also be used by the Dockerfile to install the dependencies.
We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.
Let's create a new file called main.py
and paste the following script in it:
This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.
This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y
.
We will create a Dockerfile
and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.
We will run docker build
command to build the container:
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
Now you can push this repository to the registry designated by its name or tag.
Now that we have the data in IPFS and the Docker image pushed, next is to run a job using the bacalhau docker run
command
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
--input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
ghcr.io/bacalhau-project/examples/socat:0.0.11
: the name and the tag of the image we are using
python main.py
: execute the script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
Dolly 2.0, the groundbreaking, open-source, instruction-following Large Language Model (LLM) that has been fine-tuned on a human-generated instruction dataset, licensed for both research and commercial purposes. Developed using the EleutherAI Pythia model family, this 12-billion-parameter language model is built exclusively on a high-quality, human-generated instruction following dataset, contributed by Databricks employees.
Dolly 2.0 package is open source, including the training code, dataset, and model weights, all available for commercial use. This unprecedented move empowers organizations to create, own, and customize robust LLMs capable of engaging in human-like interactions, without the need for API access fees or sharing data with third parties.
A NVIDIA GPU
Python
Create an inference.py
file with following code:
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
bacalhau docker run
: Run a job using docker executor.
--gpu 1
: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-w /inputs
: Flag to set the working directory inside the container to /inputs
.
-i gitlfs://huggingface.co/databricks/dolly-v2-3b.git
: Flag to clone the Dolly V2-3B model from Hugging Face's repository using Git LFS. The files will be mounted to /inputs/databricks/dolly-v2-3b
.
-i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py
: Flag to download the inference.py
script from the provided URL. The file will be mounted to /inputs/inference.py
.
jsacex/dolly_inference:latest
: The name and the tag of the Docker image.
The command to run inference on the model: python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b"
. It consists of:
inference.py
: The Python script that runs the inference process using the Dolly V2-3B model.
--prompt "Where is Earth located ?"
: Specifies the text prompt to be used for the inference.
--model_version "./databricks/dolly-v2-3b"
: Specifies the path to the Dolly V2-3B model. In this case, the model files are mounted to /inputs/databricks/dolly-v2-3b
.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished, we can see the results in the results/outputs
folder.
Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau
Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even more in my experience) and importing and exporting data from the Ethereum node is slow.
For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can both now access the data without having to run another Ethereum node.
But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.
In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.
To get started, you need to install the Bacalhau client, see more information
First let's download one of the IPFS files and inspect it locally:
You can see the full list of IPFS CIDs in the appendix at the bottom of the page.
If you don't already have the Pandas library, let's install it:
The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.
This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.
To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.
But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.
Next, let's make sure the file works as expected:
And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs
and /outputs
directories. The /inputs
directory is where the data will be read from and the /outputs
directory is where the results will be saved to.
We've already pushed the container, but for posterity, the following command pushes this container to GHCR.
To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run
command.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run
command creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
To view the images, we will use glob to return all file paths that match a specific pattern.
Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.
See the appendix for the hashes.txt
file.
Now take a look at the job id's. You can use these to check the status of the jobs and download the results:
You might want to double-check that the jobs ran ok by doing a bacalhau list
.
Wait until all of these jobs have been completed. And then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There's also been some issues in the past communicating with IPFS, so if you get an error, try again.
To view the images, we will use glob to return all file paths that match a specific pattern.
That's it! There are several years of Ethereum transaction volume data.
The following list is a list of IPFS CID's for the Ethereum data that we used in this tutorial. You can use these CID's to download the rest of the chain if you so desire. The CIDs are ordered by block number and they increase 50,000 blocks at a time. Here's a list of ordered CIDs:
Parallel Video Resizing via File Sharding
Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.
To get started, you need to install the Bacalhau client, see more information
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
.
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. The command allows one to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a . This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
linuxserver/ffmpeg
: the name of the docker image we are using to resize the videos
-- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'
: the command that will be executed inside the container. It uses find
to locate all files with the extension ".mp4" within /inputs
and then uses ffmpeg
to resize each found file to 72 pixels in height, saving the results in the /outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. video.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the results open the results/outputs/
folder.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Creators are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. In this example, we will transcribe an audio clip locally, containerize the script and then run the container on Bacalhau.
The advantage of using Bacalhau over managed Automatic Speech Recognition services is that you can run your own containers which can scale to do batch process petabytes of videos or audio for automatic speech recognition
To get started, you need to install:
Bacalhau client, see more information
Whisper
PyTorch
pandas
Before we create and run the script we need a sample audio file to test the code. For that we download a sample audio clip:
We will create a script that accepts parameters (input file path, output file path, temperature, etc.) and set the default parameters. Also if the input file is in mp4
format, then the script converts it to wav
format. The transcript can be saved in various formats. Then the large model is loaded and we pass it the required parameters.
This model is not only limited to English and transcription, it supports many other languages.
Next, let's create an openai-whisper script:
Let's run the script with the default parameters:
To view the outputs, execute following:
To build your own docker container, create a Dockerfile
, which contains instructions on how the image will be built, and what extra requirements will be included.
We choose pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
as our base image.
And then install all the dependencies, after that we will add the test audio file and our openai-whisper script to the container, we will also run a test command to check whether our script works inside the container and if the container builds successfully
We will run docker build
command to build the container;
Before running the command replace:
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
After the dataset has been uploaded, copy the CID:
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The-i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5r
: flag to mount the CID which contains our file to the container at the path /inputs
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
jsacex/whisper
: the name and the tag of the docker image we are using
python openai-whisper.py
: execute the script with following parameters:
-p inputs/Apollo_11_moonwalk_montage_720p.mp4
: the input path of our file
-o outputs
: the path where to store the outputs
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. To view it, run the following command:
How to process images stored in IPFS with Bacalhau
In this example tutorial, we will show you how to use Bacalhau to process images on a .
Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.
To get started, you need to install the Bacalhau client, see more information
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. This command allows to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a . This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run
command creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job.
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72:/input_images
: Specifies the input data, which is stored in IPFS at the given CID.
--entrypoint mogrify
: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.
dpokidov/imagemagick:7.1.0-47-ubuntu
: The name and the tag of the docker image we are using
-- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
: These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. image.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
To view the images, open the results/outputs/
folder:
By default, Bacalhau jobs do not have any access to the internet. This is to keep both compute providers and users safe from malicious activities.
However, by using data volumes you can read and access your data from within jobs and write back results.
When you submit a Bacalhau job, you'll need to specify the internet locations to download data from and write results to. Both and jobs support these features.
When submitting a Bacalhau job, you can specify the CID (Content IDentifier) or HTTP(S) URL to download data from. The data will be retrieved before the job starts and made available to the job as a directory on the filesystem. When running Bacalhau jobs, you can specify as many CIDs or URLs as needed using --input
which is accepted by both bacalhau docker run
and bacalhau wasm run
. See for more information.
You can write back results from your Bacalhau jobs to your public storage location. By default, jobs will write results to the storage provider using the --publisher
command line flag. See on how to configure this.
To use these features, the data to be downloaded has to be known before the job starts. For some workloads, the required data is computed as part of the job if the purpose of the job is to process web results. In these cases, networking may be possible during job execution.
To run Docker jobs on Bacalhau to access the internet, you'll need to specify one of the following:
full: unfiltered networking for any protocol --network=full
http: HTTP(S)-only networking to a specified list of domains --network=http
none: no networking at all, the default --network=none
Specifying none
will still allow Bacalhau to download and upload data before and after the job.
Jobs using http
must specify the domains they want to access when the job is submitted. When the job runs, only HTTP requests to those domains will be possible and data transfer will be rate limited to 10Mbit/sec in either direction to prevent ddos.
Jobs will be provided with which contain a TCP address of an HTTP proxy to connect through. Most tools and libraries will use these environment variables by default. If not, they must be used by user code to configure HTTP proxy usage.
The required networking can be specified using the --network
flag. For http
networking, the required domains can be specified using the --domain
flag, multiple times for as many domains as required. Specifying a domain starting with a .
means that all sub-domains will be included. For example, specifying .example.com
will cover some.thing.example.com
as well as example.com
.
Bacalhau jobs are explicitly prevented from starting other Bacalhau jobs, even if a Bacalhau requester node is specified on the HTTP allowlist.
Bacalhau has support for describing jobs that can access the internet during job execution. The ability for compute nodes to run jobs that require internet access depends on what compute nodes are currently part of the network.
Compute nodes that join the Bacalhau network do not accept networked jobs by default (i.e. they only accept jobs that specify --network=none
, which is also the default).
Bacalhau supports GPU workloads. In this tutorial, learn how to run a job using GPU workloads with the Bacalhau client.
The Bacalhau network must have an executor node with a GPU exposed
Your container must include the CUDA runtime (cudart) and must be compatible with the CUDA version running on the node
To submit a job request, use the --gpu
flag under the docker run
command to select the number of GPUs your job requires. For example:
The following limitations currently exist within Bacalhau. Bacalhau supports:
NVIDIA, Intel or AMD GPUs only
GPUs for the Docker executor only
In this example tutorial, we use Bacalhau and Easy OCR to digitize paper records or for recognizing characters or extract text data from images stored on IPFS, S3 or on the web. is a ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic etc. With easy OCR, you use the pre-trained models or use your own fine-tuned model.
Install the required dependencies
Load the different example images
List all the images. You'll see an output like this:
Next, we create a reader to do OCR to get coordinates which represent a rectangle containing text and the text itself:
The docker build
command builds Docker images from a Dockerfile.
Before running the command replace:
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.
Now that we have an image in the docker hub (your own or an example image from the manual), we can use the container for running on Bacalhau.
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
-i ipfs://bafybeibvc......
Mounts the model from IPFS
-i https://raw.githubusercontent.com...
Mounts the Input Image from a URL
jsacex/easyocr
the name and the tag of the docker image we are using
-- easyocr -l ch_sim en -f ./inputs/chinese.jpg --detail=1 --gpu=True
execute script with following paramters:
-l ch_sim
: the name of the model
-f ./inputs/chinese.jpg
: path to the input Image or directory
--detail=1
: level of detail
--gpu=True
: we set this flag to true since we are running inference on a GPU. If you run this on a CPU - set this flag to false
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. easyocr.yaml
, and then run with the command:
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. You can view results by running following commands:
In this example tutorial, we will show you how to generate realistic images with and Bacalhau. StyleGAN is based on Generative Adversarial Networks (GANs), which include a generator and discriminator network that has been trained to differentiate images generated by the generator from real images. However, during the training, the generator tries to fool the discriminator, which results in the generation of realistic-looking images. With StyleGAN3 we can generate realistic-looking images or videos. It can generate not only human faces but also animals, cars, and landscapes.
To get started, you need to install the Bacalhau client, see more information
To run StyleGAN3 locally, you'll need to clone the repo, install dependencies and download the model weights.
Now you can generate an image using a pre-trained AFHQv2
model. Here is an example of the image we generated:
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
We will run docker build
command to build the container:
Before running the command replace:
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
To submit a job run the Bacalhau command with following structure:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to Bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
jsacex/stylegan3
: the name and the tag of the docker image we are using
python gen_images.py
: execute the script with following parameters:
--trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
: The animation length is either determined based on the --seeds
value or explicitly specified using the --num-keyframes
option. When num keyframes are specified with --num-keyframes
, the output video length will be num_keyframes * w_frames
frames.
../outputs
: path to the output
The job description should be saved in .yaml
format, e.g. stylegan3.yaml
, and then run with the command:
You can also run variations of this command to generate videos and other things. In the following command below, we will render a latent vector interpolation video. This will render a 4x2 grid of interpolations for seeds 0 through 31.
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
jsacex/stylegan3
the name and the tag of the docker image we are using
python gen_images.py
: execute the script with following parameters:
--trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
: The animation length is either determined based on the --seeds
value or explicitly specified using the --num-keyframes
option. When num keyframes is specified with --num-keyframes
, the output video length will be num_keyframes * w_frames frames
. If --num-keyframes
is not specified, the number of seeds given with --seeds
must be divisible by grid size W*H (--grid
). In this case, the output video length will be # seeds/(w*h)*w_frames
frames.
../outputs
: path to the output
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder.
Config property | Environment variable | Default value | Meaning |
---|---|---|---|
See more information on how to containerize your script/app
hub-user with your docker hub username, If you don’t have a docker hub account , and use the username of the account you created
The same job can be presented in the format. In this case, the description will look like this:
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
You can find out more information about converter.py
You can skip this section entirely and directly go to
See more information on how to containerize your script/app
hub-user
with your docker hub username. If you don’t have a docker hub account , and use the username of the account you created
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
To convert the data from fugacity of CO2 (fCO2) to partial pressure of CO2 (pCO2) we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the package.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.
hub-user
with your docker hub username, If you don’t have a docker hub account , and use the username of the account you created
For more information about working with custom containers, see the .
If you have questions or need support or guidance, please reach out to the (#general channel).
You may want to create your own container for this kind of task. In that case, use the instructions for and your own image in the docker hub. Use huggingface/transformers-pytorch-deepspeed-nightly-gpu
as base image, install dependencies listed above and copy the inference.py
into it. So your Dockerfile will look like this:
To get started, you need to install the Bacalhau client, see more information
The bacalhau docker run
command allows passing input data volume with --input
or -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a . This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
If you have questions or need support or guidance, please reach out to the (#general channel).
so we must run the full command after the --
argument. In this line you will list all of the mp4 files in the /inputs
directory and execute ffmpeg
against each instance.
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
See more information on how to containerize your script/app
hub-user with your docker hub username, If you don’t have a docker hub account , and use the username of the account you created
We will transcribe the moon landing video, which can be found here:
Since the downloaded video is in mov format we convert the video to mp4 format and then upload it to our public storage in this case IPFS. We will be using (Recommended Option). To upload your dataset using just drag and drop your directory it will upload it to IPFS.
The same job can be presented in the format. In this case, the description will look like this:
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
The public compute nodes provided by the Bacalhau network will accept jobs that require HTTP networking as long as the domains are from .
If you need to access a domain that isn't on the allowlist, you can make a request to the to include your required domains. You can also set up your own compute node that implements the allowlist you need.
You can skip this step and go straight to running a
We will use the Dockerfile
that is already created in the . Use the command below to clone the repo
hub-user with your docker hub username, If you don’t have a docker hub account follow to create docker account, and use the username of the account you created
To get started, you need to install the Bacalhau client, see more information
Since the model and the image aren't present in the container we will mount the image from an URL and the model from IPFS. You can find models to download from . You can choose the model you want to use in this case we will be using the zh_sim_g2
model
The same job can be presented in the format. In this case, the description will look like this:
See more information on how to containerize your script/app
hub-user with your docker hub username, If you don’t have a docker hub account follow to create docker account (), and use the username of the account you created
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider or running less resource-intensive jobs on the demo network
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
Config property
serve
flag
Default value
Meaning
Node.Compute.JobSelection.Locality
--job-selection-data-locality
Anywhere
Only accept jobs that reference data we have locally ("local") or anywhere ("anywhere").
Node.Compute.JobSelection.ProbeExec
--job-selection-probe-exec
unused
Use the result of an external program to decide if we should take on the job.
Node.Compute.JobSelection.ProbeHttp
--job-selection-probe-http
unused
Use the result of a HTTP POST to decide if we should take on the job.
Node.Compute.JobSelection.RejectStatelessJobs
--job-selection-reject-stateless
False
Reject jobs that don't specify any input data.
Node.Compute.JobSelection.AcceptNetworkedJobs
--job-selection-accept-networked
False
Accept jobs that require network connections.
Update.SkipChecks
BACALHAU_UPDATE_SKIPCHECKS
False
If true, no update checks will be performed. As an environment variable, set to "1"
, "t"
or "true"
.
Update.CheckFrequency
BACALHAU_UPDATE_CHECKFREQUENCY
24 hours
The minimum amount of time between automated update checks. Set as any duration of hours, minutes or seconds, e.g. 24h
or 10m
.
Update.CheckStatePath
BACALHAU_UPDATE_CHECKSTATEPATH
$BACALHAU_DIR/update.json
An absolute path where Bacalhau should store the date and time of the last check.
The identification and localization of objects in images and videos is a computer vision task called object detection. Several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmond et al.
Traditionally, models like YOLO required enormous amounts of training data to yield reasonable results. People might not have access to such high-quality labeled data. Thankfully, open-source communities and researchers have made it possible to utilize pre-trained models to perform inference. In other words, you can use models that have already been trained on large datasets to perform object detection on your own data.
Bacalhau is a highly scalable decentralized computing platform and is well suited to running massive object detection jobs. In this example, you can take advantage of the GPUs available on the Bacalhau Network and perform an end-to-end object detection inference, using the YOLOv5 Docker Image developed by Ultralytics.
Load your dataset into S3/IPFS, specify it and pre-trained weights via the --input
flags, choose a suitable container, specify the command and path to save the results - done!
To get started, you need to install the Bacalhau client, see more information here
To get started, let's run a test job with a small sample dataset that is included in the YOLOv5 Docker Image. This will give you a chance to familiarise yourself with the process of running a job on Bacalhau.
In addition to the usual Bacalhau flags, you will also see example of using the --gpu 1
flag in order to specify the use of a GPU.
Remember that by default Bacalhau does not provide any network connectivity when running a job. So you need to either provide all assets at job submission time, or use the --network=full
or --network=http
flags to access the data at task time. See the Internet Access page for more details
The model requires pre-trained weights to run and by default downloads them from within the container. Bacalhau jobs don't have network access so we will pass in the weights at submission time, saving them to /usr/src/app/yolov5s.pt
. You may also provide your own weights here.
The container has its own options that we must specify:
--input
to select which pre-trained weights you desire with details on the yolov5 release page
--project
specifies the output volume that the model will save its results to. Bacalhau defaults to using /outputs
as the output directory, so we save it there.
For more container flags refer to the yolov5/detect.py
file in the YOLO repository.
One final additional hack that we have to do is move the weights file to a location with the standard name. As of writing this, Bacalhau downloads the file to a UUID-named file, which the model is not expecting. This is because GitHub 302 redirects the request to a random file in its backend.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
export JOB_ID=$( ... )
exports the job ID as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --timeout
flag is set to make sure that if the job is not completed in the specified time, it will be terminated
The --wait
flag is set to wait for the job to complete before return
The --wait-timeout-secs
flag is set together with --wait
to define how long should app wait for the job to complete
The --id-only
flag is set to print only job id
The --input
flags are used to specify the sources of input data
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs'
tells the model where to find input data and where to write output
This should output a UUID (like 59c59bfb-4ef8-45ac-9f4b-f0e9afd26e70
), which will be stored in the environment variable JOB_ID
. This is the ID of the job that was created. You can check the status of the job using the commands below.
The same job can be presented in the declarative format. In this case, the description will look like this:
Job status: You can check the status of the job using bacalhau list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs/exp
folder.
Now let's use some custom images. First, you will need to ingest your images onto IPFS or S3 storage. For more information about how to do that see the data ingestion section.
This example will use the Cyclist Dataset for Object Detection | Kaggle dataset.
We have already uploaded this dataset to the IPFS storage under the CID: bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm
. You can browse to this dataset via a HTTP IPFS proxy.
Let's run a the same job again, but this time use the images above.
Just as in the example above, this should output a UUID, which will be stored in the environment variable JOB_ID
. You can check the status of the job using the commands below.
To check the state of the job and view job output refer to the instructions above.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this example tutorial, we will show you how to train a Pytorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.
To get started, you need to install the Bacalhau client, see more information here
To train our model locally, we will start by cloning the Pytorch examples repo:
Install the following:
Next, we run the command below to begin the training of the mnist_rnn
model. We added the --save-model
flag to save the model
Next, the downloaded MNIST dataset is saved in the data
folder.
Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like Pinata or NFT.Storage. Once registered you can use their UI or API or SDKs to upload files.
Once you have uploaded your data, you'll be finished copying the CID. Here is the dataset we have uploaded.
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
pytorch/pytorch
: Using the official pytorch Docker image
The -i ipfs://QmdeQjz1HQQd.....
: flag is used to mount the uploaded dataset
The -i https://raw.githubusercontent.com/py..........
: flag is used to mount our training script. We will use the URL to this Pytorch example
-w /outputs:
Our working directory is /outputs. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python ../inputs/main.py --save-model
: URL script gets mounted to the /inputs
folder in the container
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. torch.yaml
, and then run with the command:
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find results in the results/outputs
folder. To view them, run the following command:
Kipoi (pronounce: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for genomics. It currently contains 2201 different models, covering canonical predictive tasks in transcriptional and post-transcriptional gene regulation. Kipoi's API is implemented as a python package, and it is also accessible from the command line.
In this tutorial example, we will run a genomics model on Bacalhau.
To get started, you need to install the Bacalhau client, see more information here
To run locally you need to install kipoi-veff2. You can find out the information about installing and usage here
In our case this will be the following command:
To run Genomics on Bacalhau we need to set up a Docker container. To do this, you'll need to create a Dockerfile
and add your desired configuration. The Dockerfile is a text document that contains the commands that specify how the image will be built.
We will use the kipoi/kipoi-veff2:py37
image and perform variant-centered effect prediction using the kipoi_veff2_predict
tool.
See more information on how to containerize your script/app here
The docker build
command builds Docker images from a Dockerfile.
Before running the command replace:
hub-user
with your docker hub username. If you don’t have a docker hub account follow these instructions to create a Docker Account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job for generating genomics data, run the following Bacalhau command:
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
jsacex/kipoi-veff2:py37
: the name of the image we are using
kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
: the command that will be executed inside the container. It performs variant-centered effect prediction using the kipoi_veff2_predict tool
./examples/input/test.vcf
: the path to a Variant Call Format (VCF) file containing information about genetic variants
./examples/input/test.fa
: the path to a FASTA file containing DNA sequences. FASTA files contain nucleotide sequences used for variant effect prediction
../outputs/output.tsv
: the path to the output file where the prediction results will be stored. The output file format is Tab-Separated Values (TSV), and it will contain information about the predicted variant effects
-m "DeepSEA/predict"
: specifies the model to be used for prediction
-s "diff" -s "logit"
: indicates using two scoring functions for comparing prediction results. In this case, the "diff" and "logit" scoring functions are used. These scoring functions can be employed to analyze differences between predictions for the reference and alternative alleles.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this example tutorial, we will look at how to run BIDS App on Bacalhau. BIDS (Brain Imaging Data Structure) is an emerging standard for organizing and describing neuroimaging datasets. BIDS App is a container image capturing a neuroimaging pipeline that takes a BIDS formatted dataset as input. Each BIDS App has the same core set of command line arguments, making them easy to run and integrate into automated platforms. BIDS Apps are constructed in a way that does not depend on any software outside of the image other than the container engine.
To get started, you need to install the Bacalhau client, see more information here
For this tutorial, download file ds005.tar
from this Bids dataset folder and untar it in a directory:
Let's take a look at the structure of the data
directory:
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like Pinata or nft.storage. Once registered, you can use their UI or API or SDKs to upload files.
When you pin your data, you'll get a CID which is in a format like this QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz
. Copy the CID as it will be used to access your data
Alternatively, you can upload your dataset to IPFS using IPFS CLI, but the recommended approach is to use a pinning service as we have mentioned above.
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data
: mount the CID of the dataset that is uploaded to IPFS and mount it to a folder called data on the container
nipreps/mriqc:latest
: the name and the tag of the docker image we are using
../data/ds005
: path to input dataset
../outputs
: path to the output
participant --participant_label 01 02 03
: run the mriqc on subjects with participant labels 01, 02, and 03
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Coreset is a data subsetting method. Since the uncompressed datasets can get very large when compressed, it becomes much harder to train them as training time increases with the dataset size. To reduce training time and cut costs, we employ the coreset method; the coreset method can also be applied to other datasets. In this case, we use the coreset method which can lead to a fast speed in solving the k-means problem among the big data with high accuracy in the meantime.
We construct a small coreset for arbitrary shapes of numerical data with a decent time cost. The implementation was mainly based on the coreset construction algorithm that was proposed by Braverman et al. (SODA 2021).
For a deeper understanding of the core concepts, it's recommended to explore: 1. Coresets for Ordered Weighted Clustering 2. Efficient Implementation of Coreset-based K-Means Methods
In this tutorial example, we will run compressed dataset with Bacalhau
To get started, you need to install the Bacalhau client, see more information here
Clone the repo which contains the code
To download the dataset you should open Street Map, which is a public repository that aims to generate and distribute accessible geographic data for the whole world. Basically, it supplies detailed position information, including the longitude and latitude of the places around the world.
The dataset is a osm.pbf
(compressed format for .osm
file), the file can be downloaded from Geofabrik Download Server
The following command is installing Linux dependencies:
Ensure that the requirements.txt
file contains the following dependencies:
The following command is installing Python dependencies:
To run coreset locally, you need to convert from compressed pbf
format to geojson
format:
The following command is to run the Python script to generate the coreset:
coreset.py
contains the following script here
To build your own docker container, create a Dockerfile
, which contains instructions on how the image will be built, and what extra requirements will be included.
We will use the python:3.8
image, we run the same commands for installing dependencies that we used locally.
See more information on how to containerize your script/app here
We will run docker build
command to build the container:
Before running the command replace:
hub-user
with your docker hub username, If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. We've already converted the monaco-latest.osm.pbf
file from compressed pbf
format to geojson
format here. To submit a job, run the following Bacalhau command:
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
--input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson
: mount the monaco-latest.geojson
file inside the container so it can be used by the script
jsace/coreset
: the name of the docker image we are using
python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs
: the script initializes cluster centers, creates a coreset using these centers, and saves the results to the specified folder.
-k
: amount of initialized centers (default=5)
-n
: size of coreset (default=50)
-o
: the output folder
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. coreset.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
To view the output as a CSV file, run:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
GROMACS is a package for high-performance molecular dynamics and output analysis. Molecular dynamics is a computer simulation method for analyzing the physical movements of atoms and molecules
In this example, we will make use of gmx pdb2gmx program to add hydrogens to the molecules and generates coordinates in Gromacs (Gromos) format and topology in Gromacs format.
In this example tutorial, our focus will be on running Gromacs package with Bacalhau
To get started, you need to install the Bacalhau client, see more information here
Datasets can be found here https://www.rcsb.org, In this example we use RCSB PDB - 1AKI dataset. After downloading, place it in a folder called “input”
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
Alternatively, you can upload your dataset to IPFS using IPFS CLI:
Copy the CID in the end which is QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9
Let's run a Bacalhau job that converts coordinate files to topology and FF-compliant coordinate files:
Lets look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input
: here we mount the CID of the dataset we uploaded to IPFS to use on the job
gromacs/gromacs
: we use the official gromacs - Docker Image
gmx pdb2gmx
: command in GROMACS that performs the conversion of molecular structural data from the Protein Data Bank (PDB) format to the GROMACS format, which is used for conducting Molecular Dynamics (MD) simulations and analyzing the results. Additional parameters could be found here gmx pdb2gmx — GROMACS 2022.2 documentation
-f input/1AKI.pdb
: input file
-o outputs/1AKI_processed.gro
: output file
-water
Water model to use. In this case we use spc
For a similar tutorial that you can try yourself, check out KALP-15 in DPPC - GROMACS Tutorial
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. gromacs.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
TensorFlow is an open-source machine learning software library, TensorFlow is used to train neural networks. Expressed in the form of stateful dataflow graphs, each node in the graph represents the operations performed by neural networks on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training a MNIST model.
This section is from TensorFlow 2 quickstart for beginners
This short introduction uses Keras to:
Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.
Import TensorFlow into your program to check whether it is installed
Build a tf.keras.Sequential
model by stacking layers.
For each example, the model returns a vector of logits or log-odds scores, one for each class.
The tf.nn.softmax
function converts these logits to probabilities for each class:
Note: It is possible to bake the tf.nn.softmax
function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.
Define a loss function for training using losses.SparseCategoricalCrossentropy
, which takes a vector of logits and a True
index and returns a scalar loss for each example.
This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.
This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3
.
Before you start training, configure and compile the model using Keras Model.compile
. Set the optimizer
class to adam
, set the loss
to the loss_fn
function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics
parameter to accuracy
.
Use the Model.fit
method to adjust your model parameters and minimize the loss:
The Model.evaluate
method checks the models performance, usually on a "Validation-set" or "Test-set".
The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.
If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:
The following method can be used to save the model as a checkpoint
The dataset and the script are mounted to the TensorFlow container using an URL, we then run the script inside the container
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The -i https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py
flag is used to mount the training script
The -i https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
flag is used to mount the dataset
tensorflow/tensorflow
: the name and the tag of the docker image we are using
python train.py
: command to execute the script
By default whatever URL you mount using the -i
flag gets mounted at the path /inputs
so we choose that as our input directory -w /inputs
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. tensorflow.yaml
, and then run with the command:
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. To view it, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
You can run a stand-alone Bacalhau and IPFS network on your computer with the following guide.
The devstack
command of bacalhau
will start a 3 node cluster alongside isolated ipfs servers.
This is useful to kick the tires and/or developing on the codebase. It's also the tool used by some tests.
x86_64 or ARM64 architecture
Ubuntu 20.0+ has most often been used for development and testing
Go >= 1.21
(Optional) Docker Engine
(Optional) A build of the latest Bacalhau release
This will start a 3 node Bacalhau cluster.
Each node has its own IPFS server isolated using the IPFS_PATH
environment variable and its own API RPC server isolated using a random port. These IPFS nodes are not connected to the public IPFS network. If you wish to connect the devstack to the public IPFS network, you can include --public-ipfs
flag.
You can also use your own IPFS node and connect it to the devstack by running (after starting the devstack):
If you would like to make it a bit more predictable and/or ignore errors (such as during CI), you can add the following before your execution:
Once everything has started up - you will see output like the following:
The message above contains the environment variables you need for a new window. You can paste these into a new terminal so that bacalhau will use your local devstack.
Alternatively, to remove the need to copy and paste, you can set DEVSTACK_ENV_FILE
environment variable to the name of a .env file that devstack will write to, and bacalhau commands will read from, before launching the devstack e.g.:
When the devstack is shut down, the local env file (if configured) will be removed. It is suggested you use .devstack.env
to avoid clashing with longer lived .env
files.
Open an additional terminal window to be used for submitting jobs.
Copy and paste environment variables from previous message into this window. EG:
You are now ready to submit a job to your local devstack.
This will submit a simple job to a single node:
This should output something like the following:
After a short while - the job should be in complete
state.
Download the results to the current directory:
You should now have the following files and directories:
stdout
stderr
volumes/output
If you cat stdout
it should read "hello devstack test". If you write any files in your job, they will appear in volumes/output.
Bacalhau clusters are composed of requester nodes, and compute nodes. The requester nodes are responsible for managing the compute nodes that make up the cluster. This functionality is only currently available when using NATS for the network transport.
The two main areas of functionality for the requester nodes are, managing the membership of compute nodes that require approval to take part in the cluster, and monitoring the health of the compute nodes. They are also responsible for collecting information provided by the compute nodes on a regular schedule.
As compute nodes start, they register their existence with the requester nodes. Once registered, they will maintain a sentinel file to note that they are already registered, this avoids unnecessary registration attempts.
Once registered, the requester node will need to approve the compute node before it can take part in the cluster. This is to ensure that the requester node is aware of all the compute nodes that are part of the cluster. In future, we may provide mechanisms for auto-approval of nodes joining the cluster, but currently all compute nodes registering default to the PENDING state.
Listing the current nodes in the system will show requester nodes automatically APPROVED, and compute nodes in the PENDING state.
Nodes can be rejected using their node id, and optionally specifying a reason with the -m flag.
Nodes can be approved using their node id.
There is currently no support for auto-eviction of nodes, but they can be manually removed from the cluster using the node delete
command. Note, if they are manually removed, they are able to manually re-register, so this is most useful when you know the node will not be coming back.
After all of these actions, the node list looks like
Compute nodes will provide information about themselves to the requester nodes on a regular schedule. This information is used to help the requester nodes make decisions about where to schedule workloads.
These updates are broken down into:
Node Information: This is the information about the node itself, such as the hostname, CPU architecture, and any labels associated with the node. This information is persisted to the Node Info Store.
Resource Information: This is the information about the resources available on the node, such as the amount of memory, storage and CPU available. This information is held in memory and used to make scheduling decisions. It is not persisted to disk as it is considered transient.
Health Information: This heartbeat is used to determine if the node is still healthy, and if it is not, the requester node will mark the node as unhealthy. Eventually, the node will be marked as Unknown if it does not recover. This information is held in memory and used to make scheduling decisions. Like the resource information, it is not persisted to disk as it is considered transient.
Various configuration options are available to control the frequency of these updates, and the timeout for the health check. These can be set in the configuration file.
For the compute node, these settings are:
Node Information: InfoUpdateFrequency
- The interval between updates of the node information.
Resource Information: ResourceUpdateFrequency
- The interval between updates of the resource information.
Heartbeat: HeartbeatFrequency
- The interval between heartbeats sent by the compute node.
Heartbeat: HeartbeatTopic
- The name of the pubsub topic that heartbeat messages are sent via.
For the requester node, these settings are:
Heartbeat HeartbeatFrequency
- How often the heartbeat server will check the priority queue of node heartbeats.
Heartbeat HeartbeatTopic
- The name of the pubsub topic that heartbeat messages are sent via. Should be the same as the compute node value.
Node health NodeDisconnectedAfter
- The interval after which the node will be considered disconnected if a heartbeat has not been received.
As compute nodes are added and removed from the cluster, the requester nodes will emit events to the NATS PubSub system. These events can be consumed by other systems to react to changes in the cluster membership.
After a discussion about this with Honeycomb - https://honeycombpollinators.slack.com/archives/CNQ943Q75/p1661657055345219
In outline form:
Start a new trace for each significant process
Trace should span across CLI to Server and back
Trace should contain baggage about the trace (e.g. job id, user id, etc)
New span for every short lived action (e.g. < 10 min)
New trace for jobs longer than 1 hour
Try very hard to break up traces into smaller pieces
In tail sampling: the decision to keep or drop information is done at the traceid level.
There's usually a pretty short timeout for decision-time so if something happens 30 minutes into the trace that means you'd want to keep it (errors), the decision to drop it may have been made 28 minutes ago.
For this reason I recommend keeping traceid to each job executed in an async queue.
You can also add your own custom identifiers to things for aggregation sake later.
This is a good use for a lot of activities that are enqueued at the same time or otherwise relate to each other.
You can ALSO have layers of tracing.
One for the scheduler and one for the execution.
The scheduler would have spans for "adding to queue" and "pulling from queue" so you can measure lag time there.
Then each job execution gets its own tracer to attach spans to.
You'd basically not sample the scheduler traces and just manage events via more or less detail.
Give it another identifier for the week-long activity and then add it to spans and traces for every event related to that super-lengthy activity.
Start a tracer for the server process that makes a new trace each time a request comes from the CLI. If the daemon does other stuff on a schedule or timer then I'd start a new trace for each of those activities.
The long running job should be a single trace unless it goes over an hour or.
Then the options are to break it after some duration and start a new traceid or figure out the next decomposition of the long running job. Is it computing the whole time? Is it waiting a lot?
And really, if you start with a huge trace and see that it makes the waterfall look bad since there's a time limit or span limit, then split it up.
If you start with tons of tiny traces and the aggregates don't make sense due to sampling, then maybe combining them back together would help.
The things we see are that sampling is tougher and waterfalls are wonky if traces are too big and aggregates look weird if traces are broken into too small of a unit.
The "make a new tracer" and have parallel output isn't something I'd start with. Only escalate to if the scheduler and jobs are misbehaving and it's unclear why
The imported "trace" library has all the mechanisms for interacting with the tracer...
This example uses a named tracer. https://github.com/open-telemetry/opentelemetry-go/blob/main/example/namedtracer/main.go
The two things you need in any given place are the otel trace import and the context.
You can get the current span from context and either add a child span or add attributes depending on what you're doing.
Interacting with the tracer is usually just done at initialization for each service.
The http/grpc service autoinstrumentation should receive the spancontext from the client and continue to add child spans to that.
You still need to configure the exporter on the server since that's not propagated.
The trace.start function or withspan block are good ways to make a new span in the current trace.
Some more docs - https://www.honeycomb.io/blog/ask-miss-o11y-opentelemetry-baggage/
https://github.com/honeycombio/example-greeting-service/tree/main/golang
For any top level function (e.g. that could be executed by the CLI), include the following code:
where NAME_OF_FUNCTION
is of the form folder/file/command
-> cmd/bacalhau/describe
.
This initiates the cleanup manager, pulls in the cmd Context (which is created in root.go
).
Then it creates a root span, which is a function that automatically adds helpful things like the environment something is running in, and can be extended in the future.
We then assign the defer to end the span, and register the cleanup manager for shutting down the trace provider.
When you start a new function, simply add:
Here, NAME_OF_SPAN
should be of the form toplevelfolder/packagename.functionname
E.g pkg/computenode.subscriptionEventBidRejected
The ctx
variable should come from the function declaration, and if it does not have it, we should see if it makes sense to thread it through from the original call. Reminder, ctx
should be the first parameter for all functions that require it, according to the Go docs. Please avoid using context.Background
or otherwise creating a new context. This will mess up tying spans together.
If you do feel the need to create a new one, use the anchor tag (in comments) ANCHOR
, so that we can search for it.
Additionally, if you would like to add baggage to the span, which must be done for each span created, you can pull it from the context (if it exists). You can do so with the following commands:
We do check to make sure the baggage already exists and if it doesn't we do not add it. If you attempt to add a baggage that does not exist, we print out the stack trace (but only in debug mode).
You MUST manually add the baggage to the span in the function where you create the new span you create. Attributes do NOT carry through from parent to children spans (though, interestingly, baggage DOES carry through).
If you are adding baggage TO a span, because you're creating a node ID or job ID for example, you can use the following:
This context now carries the baggage forward to any function that references it.
Generally, add context and tracing where possible. However, for things that are short and do not perform significant compute, I/O, networking, etc, you can skip context and tracing for cleanliness. For example, if you have a function which provisions a struct, or does other things that we do not expect to be traced, you can skip adding context or tracing to it.
If you trace an entire function, and the thing you are debating to add a trace to is a sub function, you may not need to create a subspan. Generally, if you can imagine any situation in which you would debug a problem in a function, you probably want to add a trace.
Further, you may want to create spans inside functions to trace particular blocks of code. This is not recommended, because it makes using defer
a challenge, and defer
gives you a number of nice clean up features that you will want for tracing. A good rule of thumb is if you have something that is long enough to be a span, it should be a function.
Some good reading:
https://github.com/honeycombio/honeycomb-opentelemetry-go
https://github.com/honeycombio/example-greeting-service/blob/main/golang/year-service/main.go
Bacalhau authenticates and authorizes users in a multi-step flow.
We know our potential users have many possible requirements around auth and exist across the entire spectrum from "no auth needed because its a simple local deployment" to "enterprise-grade security for publicly accessible nodes". Hence, the auth system needs to be unopinionated about how authentication and authorization gets achieved.
The auth system has therefore been designed with a few goals in mind:
Flexible authentication: it should be easy for users to add their own authentication method, including simple methods like using shared secrets and more complex methods up to OAuth and OIDC.
Flexible authorization: it should be possible for users to be authorized based on a number of different modes, including group-based auth, RBAC and ABAC. The exact permissions of each should be customizable. The system should not require, for example, a particular model of "namespaces" or "workspaces" because these don't necessarily fit all use cases.
Future proofing: the auth system should not require core-level upgrades to support advancements in cryptography. The hash functions and key sizes that are considered "secure" change over time, so the Bacalhau core should not be forced to have an opinion on this by the auth system and should not have to play "whack-a-mole" with supporting different configurations for different customers. Instead, it should be possible for customers to apply a policy that makes sense for them and upgrade security at their own pace.
Performance: any calls to remote servers or complex algorithms to decide logic should happen once in the authentication process, and then subsequent calls to the API should introduce little overhead from authorization.
Auth server is a set of API endpoints that are trusted to make auth decisions. This is something built into the requester node and doesn't need to be a separate service, but could also be implemented as an external service if desired.
User agent is a tool that acts on behalf of the user, running in a trusted way locally to them. The user agent submits API calls to the requester node on their behalf – so the CLI, Web UI and SDK are all user agents. We use the term "user agent" to differentiate from a "client", which in the OAuth sense means a third-party service that the user does not have complete trust in.
Bacalhau implements flexible authentication and authorization using policies which are written using a machine-executable policy format called Rego.
Each authentication policy receives authentication credentials as input and outputs access tokens that will supplied to future API calls.
Each authorization policy receives access tokens as input and outputs decisions about allowable access to APIs and job submission.
These two policies work together to define the entire authentication and authorization scheme.
The basic list of steps is:
Get the list of acceptable authn methods
Pick one and execute it, collecting any credentials from the user
Submit the credentials to the authn API
Receive an access token and use it in all future requests
User agents make a request to their configured auth server to retrieve a list of authentication methods, keyed by name.
Each authentication method object describes:
a type of authentication, identified by a specific key
parameters to be used in running the authentication method, specific to that type
Each "type" can be used to implement a number of different authentication methods. The types broadly correlate with behavior that the user agent needs to take to run the authentication flow, such that there can be a single piece of user agent code that is capable of running each type, with different input parameters.
The supported types are:
challenge
authenticationThis method is used to identify users via a private key that they hold. The authentication response contains a InputPhrase
that the user should sign and return to the endpoint.
ask
authenticationThis method requires the user to manually input some information. This method can be used to implement username and password authentication, shared secret authentication, and even 2FA or security question auth.
The required information is represented by a JSON Schema in the object itself. The implementation should parse the JSON Schema and ask the user questions to populate an object that is valid by it.
The user agent decides which authentication method to use (e.g. by asking the user, or by knowing it has an appropriate key) and operates the flow.
Once all the data for the method has been successfully collected, the user agent POSTs the data to the auth endpoint for the method. The endpoint is the base auth endpoint plus the name of the method, e.g. /api/v1/auth/<method>
. So to submit data for a "userpass" method, the user agent would POST to /api/v1/auth/userpass
.
The auth server processes the request by inputting the auth credentials into a auth policy. If the auth policy finds the passed data acceptable, it returns an access token that the user can use in subsequent calls.
(Aside: there is actually no specification on the structure of the access token. The user agent should treat it as an opaque blob that it receives from the auth server and submits to the API server. Currently, all of the core Bacalhau code also does not have any opinion of the auth token – it is not assumed to be any specific type of object, and all parsing and handling is handled by the Rego policies. However, all of the currently implemented Rego policies output and expect JWTs, and it is recommended that users continue to use this convention. The rest of this document will assume access tokens are JWTs.)
The signed JWT is returned to the user agent. The user agent takes appropriate steps to keep the access token secret.
In principle, the auth policy can return any JWT it wishes, which will be interpreted later in the API auth policy – it is up to the authn policy and the authz policy to work together to apply auth. The policy to run is identified by the Node.Auth.Methods
variable, which is a map of method names to policy paths.
However, the default authn and authz policies make decisions using namespaces. Here, the authn policy returns a set of namespaces with associated access permissions, and the authz policy controls access based on them.
In this default case, the JWT includes the fields:
iss
(issuer)The node ID of the auth server.
sub
(subject)A network-unique user ID, derived from the auth credentials. The sub
does not need to identify the same user across different authentication methods, but should ideally be the same if the user logs in via the same auth method again.
ist
(issued at)The timestamp when the token was issued.
exp
(expires at)The timestamp after which the token is no longer valid.
ns
(namespaces)A map of namespaces to permission bits.
The key in the map is a namespace name that the user has some level of access of. Namespace names are ephemeral – i.e. there does not need to be a persistent or coordinated store of namespaces shared across the whole cluster. Instead, the format of namespace names is an interface for the network operator to decide.
For example, the default policy will just give the user access to a namespace identified by the sub
field (e.g. their username). But in principle, more complex setups involving groups could be used.
Namespace names can be a *
, which by convention will match any set of characters, like a filesystem glob. But it is up to the various auth policies to actually implement this. So a JWT claim containing "*"
would give default permissions for all namespaces.
The value in the map is an unsigned integer encoding permission bits. If the following bits are set:
0b00000001
: user can describe jobs in the namespace
0b00000010
: user can create jobs in the namespace
0b00000100
: user can download results from the namespace
0b00001000
: user can cancel jobs in the namespace
The user agent includes an Authorization
header with the access token it wishes to use passed as a bearer token:
Note that the Authorization
header is strictly optional – access for unauthorized users is controlled using the policy, and may be allowed. The API call is allowed to proceed if the authorization policy returns a positive decision.
The requester node executes the API authorization policy and passes details of the API call. The default policy is one where the namespaces of the token are checked if present, and non-namespaced APIs require a valid signed token.
As above, custom policies are allowed. The policy to execute is defined by the Node.Auth.AccessPolicyPath
config variable. For non-namespaced APIs, such as node APIs, the policy may make a blanket decision simply using whether the user has an authorization token or not, or may choose to make a decision depending on the type of authorization. For namespaced APIs, such as job APIs, the policy should examine the namespaces in the JWT token and respond accordingly.
The authz server will return a 403 Forbidden
error if the user is not allowed to carry out the requested action. It will also return a 401 Unauthorized
error if the token the user passed is not valid for any future request. In the latter case, the user agent should discard the token and execute the above flow again to get a new one.
There are a number of roadmap items that will enhance the auth system:
The Web UI currently does not have any authn/z capability, and so can only work with the default Bacalhau configuration which does not limit unauthenticated users from querying read-only API endpoints.
To upgrade the Web UI to work in authenticated cases, it will be necessary to implement the algorithms noted above. In short:
The Web UI will need to query the auth API endpoint for available authn methods.
It should then pick an appropriate authn method, either by asking the user, choosing based on known available data (e.g. existing presence of a private key), or by picking the only available option.
It should then run the authn flow for that type:
For challenge
types, it will need a private key. It should probably generate and store one persistently rather than asking the user to upload theirs.
For ask
types, it will need to parse the input JSON Schema and present a web form to collect the necessary authn credentials.
Once it has successfully authenticated, it should persistently store the access token and add it to all subsequent API requests.
external
authentication typeThis type will power future OAuth2/OIDC authentication. The principle is that:
The type will specify a remote endpoint to redirect the user to. The CLI will open a browser to this endpoint (or otherwise advise the user to do this) and the Web UI will just issue a redirect to this endpoint.
The user completes authentication at the remote service and is then redirected back to a supplied endpoint with valid credentials.
The CLI may need to run a temporary web server to receive the redirect (this is how CLI tools like gcloud
currently handle the OIDC flow). The Web UI will need to specify a redirect that it can subsequently decode credentials for.
Also specified in the authentication method data will be any query parameters that the CLI/WebUI needs to populate with the redirect path. E.g. the specific OIDC scheme might specify the return location as a ?redirect
url query parameter, and the authentication type should specify the name of this parameter.
There doesn't need to be an optional step where the user exchanges the identity token they received from the remote auth server for a Bacalhau auth token. Instead, the system could just use the returned credential directly.
However, this may be a beneficial step for mapping OIDC credentials into e.g. a JWT that specifies available namespaces. So there should probably be a step where the token received from the OIDC flow is passed to the authn method endpoint, and a policy has the chance to return a different token. In the basic case, it can check the validity of the token and return it unchanged.
The returned credential will be a JWT or similar access token. The user agent should use this credential to query the API as above. The authz policy should be configured to recognize these access tokens and apply authz control based on their content, as for the other methods.
Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.
This example demonstrates how to use stable diffusion using a finetuned model and run inference on it. The first section describes the development of the code and the container - it is optional as users don't need to build their own containers to use their own custom model. The second section demonstrates how to convert your model weights to ckpt. The third section demonstrates how to run the job using Bacalhau.
The following guide is using the fine-tuned model, which was finetuned on Bacalhau. To learn how to finetune your own stable diffusion model refer to this guide.
Convert your existing model weights to the ckpt
format and upload to the IPFS storage.
Create a job using bacalhau docker run
, relevant docker image, model weights and any prompt.
Download results using bacalhau get
and the job id.
To get started, you need to install:
Bacalhau client, see more information here
NVIDIA GPU
CUDA drivers
NVIDIA docker
This part of the guide is optional - you can skip it and proceed to the Running a Bacalhau job if you are not going to use your own custom image.
To build your own docker container, create a Dockerfile
, which contains instructions to containerize the code for inference.
This container is using the pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
image and the working directory is set. Next the Dockerfile installs required dependencies. Then we add our custom code and pull the dependent repositories.
See more information on how to containerize your script/app here
We will run docker build
command to build the container.
Before running the command replace:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create the Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest
tag
So in our case, the command will look like this:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
Thus, in this case, the command would look this way:
After the repo image has been pushed to Docker Hub, you can now use the container for running on Bacalhau. But before that you need to check whether your model is a ckpt
file or not. If your model is a ckpt
file you can skip to the running on Bacalhau, and if not - the next section describes how to convert your model into the ckpt
format.
To download the convert script:
To convert the model weights into ckpt
format, the --half
flag cuts the size of the output model from 4GB to 2GB:
To do inference on your own checkpoint on Bacalhau you need to first upload it to your public storage, which can be mounted anywhere on your machine. In this case, we will be using NFT.Storage (Recommended Option). To upload your dataset using NFTup drag and drop your directory and it will upload it to IPFS.
After the checkpoint file has been uploaded copy its CID.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
Let's look closely at the command above:
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt
: Path to mount the checkpoint
-- conda run --no-capture-output -n ldm
: since we are using conda we need to specify the name of the environment which we are going to use, in this case it is ldm
scripts/txt2img.py
: running the python script
--prompt "a photo of a person drinking coffee"
: the prompt you need to specify the session name in the prompt.
--plms
: the sampler you want to use. In this case we will use the plms
sampler
--ckpt ../model.ckpt
: here we specify the path to our checkpoint
--n_samples 1
: no of samples we want to produce
--skip_grid
: skip creating a grid of images
--outdir ../outputs
: path to store the outputs
--seed $RANDOM
: The output generated on the same prompt will always be the same for different outputs on the same prompt set the seed parameter to random
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs
folder. We received following image for our prompt:
Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.
This example demonstrates how to use stable diffusion on a CPU and run it on the Bacalhau demo network. The first section describes the development of the code and the container. The second section demonstrates how to run the job using Bacalhau.
The images presented on this page were generated by this model.
The original text-to-image stable diffusion model was trained on a fleet of GPU machines, at great cost. To use this trained model for inference, you also need to run it on a GPU.
However, this isn't always desired or possible. One alternative is to use a project called OpenVINO from Intel that allows you to convert and optimize models from a variety of frameworks (and ONNX if your framework isn't directly supported) to run on a supported Intel CPU. This is what we will do in this example.
Heads up! This example takes about 10 minutes to generate an image on an average CPU. Whilst this demonstrates it is possible, it might not be practical.
In order to run this example you need:
A Debian-flavoured Linux (although you might be able to get it working on M1 macs)
The first step is to convert the trained stable diffusion models so that they work efficiently on a CPU using OpenVINO. The example is quite complex, so we have created a separate repository (which is a fork from Github user Sergei Belousov) to host the code.
In summary, the code downloads a pre-optimized OpenVINO version of the original pre-trained stable diffusion model, which also leverages OpenAI's CLIP transformer and is then wrapped inside an OpenVINO runtime, which reads in and executes the model.
The core code representing these tasks can be found in the stable_diffusion_engine.py
file. This is a mashup that creates a pipeline necessary to tokenize the text and run the stable diffusion model. This boilerplate could be simplified by leveraging the more recent version of the diffusers library. But let's crack on.
Note that these dependencies are only known to work on Ubuntu-based x64 machines.
The following commands clone the example repository, and other required repositories, and install the Python dependencies.
Now that we have all the dependencies installed, we can call the demo.py
wrapper, which is a simple CLI, to generate an image from a prompt.
When the generation is complete, you can open the generated hello.png
and see something like this:
Lets try another prompt and see what we get:
Now we have a working example, we can convert it into a format that allows us to perform inference in a distributed environment.
First we will create a Dockerfile
to containerize the inference code. The Dockerfile can be found in the repository, but is presented here to aid understanding.
This container is using the python:3.9.9-bullseye
image and the working directory is set. Next, the Dockerfile installs the same dependencies from earlier in this notebook. Then we add our custom code and pull the dependent repositories.
We've already pushed this image to GHCR, but for posterity, you'd use a command like this to update it:
To run this example you will need Bacalhau installed and running
Bacalhau is a distributed computing platform that allows you to run jobs on a network of computers. It is designed to be easy to use and to run on a variety of hardware. In this example, we will use it to run the stable diffusion model on a CPU.
To submit a job, you can use the Bacalhau CLI. The following command passes a prompt to the model and generates an image in the outputs directory.
This will take about 10 minutes to complete. Go grab a coffee. Or a beer. Or both. If you want to block and wait for the job to complete, add the --wait
flag.
Furthermore, the container itself is about 15GB, so it might take a while to download on the node if it isn't cached.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
bacalhau docker run
: Run a job using docker executor.
--id-only
: Flag to print out only the job id
ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1
: The name and the tag of the Docker image.
The command to run inference on the model: python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png
. It consists of:
demo.py
: The Python script that runs the inference process.
--prompt "First Humans On Mars"
: Specifies the text prompt to be used for the inference.
--output ../outputs/mars.png
: Specifies the path to the output image.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
:
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs
folder.
Stable diffusion has revolutionalized text2image models by producing high quality images based on a prompt. Dreambooth is a approach for personalization of text-to-image diffusion models. With images as input subject, we can fine-tune a pretrained text-to-image model
Although the dreambooth paper used Imagen to finetune the pre-trained model since both the Imagen model and Dreambooth code are closed source, several opensource projects have emerged using stable diffusion.
Dreambooth makes stable-diffusion even more powered with the ability to generate realistic looking pictures of humans, animals or any other object by just training them on 20-30 images.
In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.
The following command generates the following:
Subject: SBF
Prompt: a photo of SBF without hair
Output:
To get started, you need to install the Bacalhau client, see more information here
You can skip this section entirely and directly go to running a job on Bacalhau
Building this container requires you to have a supported GPU which needs to have 16gb+ of memory, since it can be resource intensive.
We will create a Dockerfile
and add the desired configuration to the file. Following commands specify how the image will be built, and what extra requirements will be included:
This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
image and the working directory is set. Next, we add our custom code and pull the dependent repositories.
The shell script is there to make things much simpler since the command to train the model needs many parameters to pass and later convert the model weights to the checkpoint, you can edit this script and add in your own parameters
To download the models and run a test job in the Docker file, copy the following:
Then execute finetune.sh
with following commands:
We will run docker build
command to build the container:
Before running the command replace:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you create.
repo-name with the name of the container, you can name it anything you want.
tag this is not required but you can use the latest tag
Now you can push this repository to the registry designated by its name or tag.
The optimal dataset size is between 20-30 images. You can choose the images of the subject in different positions, full body images, half body, pictures of the face etc.
Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:
You can view the Subject Image dataset of David Aronchick for reference.
After the Subject dataset is created we upload it to IPFS.
In this case, we will be using NFT.Storage (Recommended Option) to upload files and directories with NFTUp.
To upload your dataset using NFTup just drag and drop your directory it will upload it to IPFS:
After the checkpoint file has been uploaded, copy its CID which will look like this:
Since there are a lot of combinations that you can try, processing of finetuned model can take almost 1hr+ to complete. Here are a few approaches that you can try based on your requirements:
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a
Mounts the data from IPFS via its CID
jsacex/dreambooth:latest
Name and tag of the docker image we are using
-- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man"
execute script with following paramters:
/inputs
Path to the subject Images
/outputs
Path to save the generated outputs
"a photo of < name of the subject > < class >"
-> "a photo of David Aronchick man"
Subject name along with class
"a photo of < class >
" -> "a photo of man"
Name of the class
The number of iterations is 3000. This number should be no of subject images x 100. So if there are 30 images, it would be 3000. It takes around 32 minutes on a v100
for 3000 iterations, but you can increase/decrease the number based on your requirements.
Here is our command with our parameters replaced:
If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.
Use the /woman
class images
Here you can provide your own regularization images or use the mix class.
Use the /mix
class images if the class of the subject is mix
You can upload the model to IPFS and then create a gist, mount the model and script to the lightweight container
When a job is submitted, Bacalhau prints out the related job_id
. Use the export JOB_ID=$(bacalhau docker run ...)
wrapper to store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this. Change the command in the Parameters section and CID to suit your goals.
You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau describe
.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. You can view results by running following commands:
In the next steps, we will be doing inference on the finetuned model
Refer to our guide on CKPT model for more details of how to build a SD inference container
Bacalhau currently doesn't support mounting subpaths of the CID, so instead of just mounting the model.ckpt
file we need to mount the whole output CID which is 6.4GB, which might result in errors like FAILED TO COPY /inputs
. So you have to manually copy the CID of the model.ckpt
which is of 2GB.
To get the CID of the model.ckpt
file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/
. For example:
If you use the Brave browser, you can use following:
Or you can use the IPFS CLI:
Copy the link of model.ckpt
highlighted in the box:
Then extract the CID portion of the link and copy it.
To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run
command.
If you are facing difficulties using the above method you can mount the whole output CID
When a job is sumbitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
To check the status of your job and download results refer back to the guide above.
We got an image like this as a result:
This example tutorial demonstrates how to use Stable Diffusion on a GPU and run it on the Bacalhau demo network. Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.
To get started, you need to install the Bacalhau client, see more information here
Here is an example of an image generated by this model.
This stable diffusion example is based on the Keras/Tensorflow implementation. You might also be interested in the Pytorch oriented diffusers library.
When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.
Based on the requirements here, we will install the following:
We have a sample code from this the Stable Diffusion in TensorFlow/Keras repo which we will use to check if the code is working as expected. Our output for this code will be a DSLR photograph of an astronaut riding a horse.
When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.
When running this code, if you check the GPU RAM usage, you'll see that it's sucked up many GBs, and depending on what GPU you're running, it may OOM (Out of memory) if you run this again.
You can try and reduce RAM usage by playing with batch sizes (although it is only set to 1 above!) or more carefully controlling the TensorFlow session.
To clear the GPU memory we will use numba. This won't be required when running in a single-shot manner.
You need a script to execute when we submit jobs. The code below is a slightly modified version of the code we ran above which we got from here, however, this includes more things such as argument parsing argument parsing to be able to customize the generator.
For a full list of arguments that you can pass to the script, see more information here
After writing the code the next step is to run the script.
As a result, you will get something like this:
The following presents additional parameters you can try:
python main.py --p "cat with three eyes
- to set prompt
python main.py --p "cat with three eyes" --n 100
- to set the number of iterations to 100
python stable-diffusion.py --p "cat with three eyes" --b 2
to set batch size to 2 (№ of images to generate)
Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver. To containerize the inference code, we will create a Dockerfile
. The Dockerfile is a text document that contains the commands that specify how the image will be built.
The Dockerfile leverages the latest official TensorFlow GPU image and then installs other dependencies like git
, CUDA
packages, and other image-related necessities. See the original repository for the expected requirements.
See more information on how to containerize your script/app here
We will run docker build
command to build the container;
Before running the command replace following:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
To submit a job run the Bacalhau command with following structure:
export JOB_ID=$( ... )
exports the job ID as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1
: the name and the tag of the docker image we are using
-- python main.py --o ./outputs --p "meme about tensorflow"
: The command to run inference on the model. It consists of:
main.py
path to the script
--o ./outputs
specifies the output directory
--p "meme about tensorflow"
specifies the prompt
The Bacalhau command passes a prompt to the model and generates an image in the outputs directory. The main difference in the example below compared to all the other examples is the addition of the --gpu X
flag, which tells Bacalhau to only schedule the job on nodes that have X
GPUs free. You can read more about GPU support in the documentation.
This will take about 5 minutes to complete and is mainly due to the cold-start GPU setup time. This is faster than the CPU version, but you might still want to grab some fruit or plan your lunchtime run.
Furthermore, the container itself is about 10GB, so it might take a while to download on the node if it isn't cached.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder:
In this tutorial example, we will showcase how to containerize an OpenMM workload so that it can be executed on the Bacalhau network and take advantage of the distributed storage & compute resources. OpenMM is a toolkit for molecular simulation. It is a physic-based library that is useful for refining the structure and exploring functional interactions with other molecules. It provides a combination of extreme flexibility (through custom forces and integrators), openness, and high performance (especially on recent GPUs) that make it truly unique among simulation codes.
In this example tutorial, our focus will be on running OpenMM molecular simulation with Bacalhau.
To get started, you need to install the Bacalhau client, see more information here
We use a processed 2DRI dataset that represents the ribose binding protein in bacterial transport and chemotaxis. The source organism is the Escherichia coli bacteria.
Protein data can be stored in a .pdb
file, this is a human-readable format. It provides for the description and annotation of protein and nucleic acid structures including atomic coordinates, secondary structure assignments, as well as atomic connectivity. See more information about PDB format here. For the original, unprocessed 2DRI dataset, you can download it from the RCSB Protein Data Bank here.
The relevant code of the processed 2DRI dataset can be found here. Let's print the first 10 lines of the 2dri-processed.pdb
file. The output contains a number of ATOM records. These describe the coordinates of the atoms that are part of the protein.
To run the script above all we need is a Python environment with the OpenMM library installed.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like Pinata or nft.storage. Once registered, you can use their UI or API or SDKs to upload files.
When you pin your data, you'll get a CID. Copy the CID as it will be used to access your data
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
See more information on how to containerize your script/app here
We will run docker build
command to build the container:
Before running the command, replace:
hub-user
with your docker hub username, If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case, this will be:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.
Now that we have the data in IPFS and the docker image pushed, we can run a job on the Bacalhau network.
Lets look closely at the command above:
bacalhau docker run
: call to Bacalhau
bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4
: here we mount the CID of the dataset we uploaded to IPFS to use on the job
ghcr.io/bacalhau-project/examples/openmm:0.3
: the name and the tag of the image we are using
python run_openmm_simulation.py
: the script that will be executed inside the container
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Requester nodes store job state and history in a boltdb-backed store (pkg/jobstore/boltdb).
The location of the database file can be specified using the BACALHAU_JOB_STORE_PATH
environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-requester.db
.
By default, compute nodes store their execution information in an bolddb-backed store (pkg/compute/store/boltdb).
The location of the database file (for a single node) can be specified using the BACALHAU_COMPUTE_STORE_PATH
environment variable, which will specify which file to use to store the database. When not specified, the file will be {$BACALHAU_DIR}/{NODE_ID}-compute.db
.
As compute nodes restart, they will find they have existing state in the boltdb database. At startup the database currently iterates the executions to calculate the counters for each state. This will be a good opportunity to do some compaction of the records in the database, and cleanup items no longer in use.
Currently only batch jobs are possible, and so for each of the listed states below, no action is taken at restart. In future it would make sense to remove records older than a certain age, or moved them to failed, depending on their current state. For other job types (to be implemented) this may require restarting jobs, resetting jobs,
State | Batch jobs |
---|
The databases can be inspected using the bbolt tool. The bbolt tool can be installed to $GOBIN with:
Once installed, and assuming the database file is stored in $FILE you can use bbolt to:
ExecutionStateCreated | No action |
ExecutionStateBidAccepted | No action |
ExecutionStateRunning | No action |
ExecutionStateWaitingVerification | No action |
ExecutionStateResultAccepted | No action |
ExecutionStatePublishing | No action |
ExecutionStateCompleted | No action |
ExecutionStateFailed | No action |
ExecutionStateCancelled | No action |
When running a node, you can choose which jobs you want to run by using configuration options, environment variables or flags to specify a job selection policy.
If you want more control over making the decision to take on jobs, you can use the --job-selection-probe-exec
and --job-selection-probe-http
flags.
These are external programs that are passed the following data structure so that they can make a decision about whether or not to take on a job:
The exec
probe is a script to run that will be given the job data on stdin
, and must exit with status code 0 if the job should be run.
The http
probe is a URL to POST the job data to. The job will be rejected if the HTTP request returns a non-positive status code (e.g. >= 400).
If the HTTP response is a JSON blob, it should match the following schema and will be used to respond to the bid directly:
For example, the following response will reject the job:
If the HTTP response is not a JSON blob, the content is ignored and any non-error status code will accept the job.
Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.
To get started, you need to install the Bacalhau client, see more information here
This example requires Docker. If you don't have Docker installed, you can install it from here. Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.
You're likely familiar with executing Docker commands to start a container:
This command runs a container from the docker/whalesay
image. The container executes the cowsay sup old fashioned container run
command:
This command also runs a container from the docker/whalesay
image, using Bacalhau. We use the bacalhau docker run
command to start a job in a Docker container. It contains additional flags such as --wait
to wait for job completion and --id-only
to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run'
command is executed.
When a job is submitted, Bacalhau prints out the related job_id
(7e41b9b9-a9e2-4866-9fce-17020d8ec9e0
):
We store that in an environment variable so that we can reuse it later on.
You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
Viewing your job output
Both commands execute cowsay in the docker/whalesay
container, but Bacalhau provides additional features for working with jobs at scale.
Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data is passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.
The --wait
flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau list
command to check on their status.
Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run
command after the --
flag.
To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.
To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distribute it across the Bacalhau network.
Next, the Dockerfile adds the script and sets the entry point.
Now let's build and test the container locally.
Once your container is working as expected then you should push it to a public container registry. In this example, I'm pushing to Github's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64
architecture.
Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.
The bacalhau docker run
command strips the default entry point, so don't forget to run your entry point in the command line arguments.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Download your job results directly by using bacalhau get
command.
View your job output
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Config property
serve
flag
Default value
Meaning
Node.Compute.JobSelection.Locality
--job-selection-data-locality
Anywhere
Only accept jobs that reference data we have locally ("local") or anywhere ("anywhere").
Node.Compute.JobSelection.ProbeExec
--job-selection-probe-exec
unused
Use the result of an external program to decide if we should take on the job.
Node.Compute.JobSelection.ProbeHttp
--job-selection-probe-http
unused
Use the result of a HTTP POST to decide if we should take on the job.
Node.Compute.JobSelection.RejectStatelessJobs
--job-selection-reject-stateless
False
Reject jobs that don't specify any input data.
Node.Compute.JobSelection.AcceptNetworkedJobs
--job-selection-accept-networked
False
Accept jobs that require network connections.
In this example, we will demonstrate how to run inference on a model stored on Amazon S3. We will use a PyTorch model trained on the MNIST dataset.
Consider using the latest versions or use the docker method listed below in the article.
Python
PyTorch
Use the following commands to download the model and test image:
This script is designed to load a pretrained PyTorch model for MNIST digit classification from a tar.gz
file, extract it, and use the model to perform inference on a given input image. Ensure you have all required dependencies installed:
To use this script, you need to provide the paths to the tar.gz
file containing the pre-trained model, the output directory where the model will be extracted, and the input image file for which you want to perform inference. The script will output the predicted digit (class) for the given input image.
To get started, you need to install the Bacalhau client, see more information here
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
-w /inputs
Set the current working directory at /inputs
in the container
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1
: Mount the s3 bucket at the destination path provided - /model/
and specifying the region where the bucket is located opt=region=us-east-1
-i git://github.com/js-ts/mnist-test.git
: Flag to mount the source code repo from GitHub. It would mount the repo at /inputs/js-ts/mnist-test
in this case it also contains the test image
pytorch/pytorch
: The name of the Docker image
-- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png
: The command to run inference on the model. It consists of:
/model/model.tar.gz
is the path to the model file
/model-pth
is the output directory for the model
/inputs/js-ts/mnist-test/image.png
is the path to the input image
When the job is submitted Bacalhau prints out the related job id. We store that in an environment variable JOB_ID
so that we can reuse it later on.
Use the bacalhau logs
command to view the job output, since the script prints the result of execution to the stdout:
You can also use bacalhau get
to download job results:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).