
Workload Onboarding

This directory contains examples relating to performing common tasks with Bacalhau.

Container

Docker Workload Onboarding

How to use docker containers with Bacalhau

Docker Workloads

Bacalhau executes jobs by running them within containers. Bacalhau employs a syntax closely resembling Docker, allowing you to utilize the same containers. The key distinction lies in how input and output data are transmitted to the container via IPFS, enabling scalability on a global level.

This section describes how to migrate a workload based on a Docker container into a format that will work with the Bacalhau client.

You can check out this example tutorial to see how we used all these steps together.

Requirements

Here are a few things to note before getting started:

Container Registry: Ensure that the container is published to a public container registry that is accessible from the Bacalhau network.
Architecture Compatibility: Bacalhau supports only images that match the host node's architecture. Typically, most nodes run on linux/amd64, so containers in arm64 format are not able to run.
Input Flags: The --input ipfs://... flag supports only directories and does not support CID subpaths. The --input https://... flag supports only single files and does not support URL directories. The --input s3://... flag supports S3 keys and prefixes. For example, s3://bucket/logs-2023-04* includes all logs for April 2023.

You can also check the list of example containers used by the Bacalhau team.

Note: Only about a third of examples have their containers here. The rest are under random docker hub registries.

Runtime Restrictions

To help provide a safe, secure network for all users, we add the following runtime restrictions:

Limited Ingress/Egress Networking:

All ingress/egress networking is limited as described in the documentation. You won't be able to pull data, code, weights, etc. from an external source.

Data Passing with Docker Volumes:

A job includes the concept of input and output volumes, and the Docker executor implements support for these. This means you can specify your CIDs, URLs, and/or S3 objects as input paths and also write results to an output volume. This can be seen in the following example:

The above example demonstrates an input volume flag -i s3://mybucket/logs-2023-04*, which mounts all S3 objects in bucket mybucket with logs-2023-04 prefix within the docker container at location /input (root).

Output volumes are mounted to the Docker container at the location specified. In the example above, any content written to /output_folder will be made available within the apples folder in the job results CID.

Once the job has run on the executor, the contents of stdout and stderr will be added to any named output volumes the job has used (in this case apples), and all those entities will be packaged into the results folder which is then published to a remote location by the publisher.

Onboarding Your Workload

Step 1 - Read Data From Your Directory

If you need to pass data into your container you will do this through a Docker volume. You'll need to modify your code to read from a local directory.

We make the assumption that you are reading from a directory called /inputs, which is set as the default.

Step 2 - Write Data to Your Directory

If you need to return data from your container you will do this through a Docker volume. You'll need to modify your code to write to a local directory.

We make the assumption that you are writing to a directory called /outputs, which is set as the default.
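
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not part of Bacalhau itself: the function name process_directory, the summary.tsv file name, and the INPUT_DIR and OUTPUT_DIR environment variables are hypothetical, introduced only so the same code can be tried locally before it runs against /inputs and /outputs.

```python
import os
from pathlib import Path


def process_directory(input_dir: Path, output_dir: Path) -> int:
    """Count the lines of every file under input_dir and write a summary.

    Returns the number of files processed.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for path in sorted(p for p in input_dir.rglob("*") if p.is_file()):
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        rows.append(f"{path.relative_to(input_dir).as_posix()}\t{line_count}")
    # Everything written below output_dir is what the job returns.
    (output_dir / "summary.tsv").write_text("\n".join(rows) + "\n", encoding="utf-8")
    return len(rows)


if __name__ == "__main__":
    # INPUT_DIR / OUTPUT_DIR are hypothetical overrides for local testing;
    # the defaults are the directories this guide assumes.
    input_dir = Path(os.environ.get("INPUT_DIR", "/inputs"))
    output_dir = Path(os.environ.get("OUTPUT_DIR", "/outputs"))
    if input_dir.is_dir():
        print(process_directory(input_dir, output_dir))
    else:
        print(f"input directory {input_dir} not found")
```

Keeping the directory names out of the processing function means the container can be tested against any local folder and still use the default mount points when run as a job.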

Step 3 - Build and Push Your Image To a Registry

For example:

Step 4 - Test Your Container

To test your docker image locally, you'll need to execute the following command, changing the environment variables as necessary:

Let's see what each command will be used for:

For example:

The result of the commands' execution is shown below:

Step 5 - Upload the Input Data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 6 - Run the Workload on Bacalhau

To launch your workload in a Docker container, using the specified image and working with input data specified via IPFS CID, run the following command:

To check the status of your job, run the following command:

To get more information on your job, run:

To download your job results, run:

For example, running:

outputs:

The --input flag does not support CID subpaths for ipfs:// content.

Alternatively, you can run your workload with a publicly accessible http(s) URL, which will download the data temporarily into your public storage:

The --input flag does not support URL directories.

Troubleshooting

If you run into this compute error while running your docker image

This can often be resolved by re-tagging your docker image

Support

WebAssembly (WASM) Workloads

Bacalhau supports running programs that are compiled to WebAssembly (WASM). With the Bacalhau client, you can upload WASM programs, retrieve data from public storage, read and write data, receive program arguments, and access environment variables.

Prerequisites and Limitations

Supported WebAssembly System Interface (WASI): Bacalhau can run compiled WASM programs that expect the WebAssembly System Interface (WASI) Snapshot 1. Through this interface, WebAssembly programs can access data, environment variables, and program arguments.
Networking Restrictions: All ingress/egress networking is disabled, so you won't be able to pull data, code, weights, etc. from an external source. WASM jobs can declare the data they need using URLs or CIDs (Content IDentifiers) and can then access the data by reading from the filesystem.
Single-Threading: There is no multi-threading, as WASI does not expose any interface for it.

Onboarding Your Workload

Step 1: Replace network operations with filesystem reads and writes

If your program typically involves reading from and writing to network endpoints, follow these steps to adapt it for Bacalhau:

Replace Network Operations: Instead of making HTTP requests to external servers (e.g., example.com), modify your program to read data from the local filesystem.
Input Data Handling: Specify the input data location in Bacalhau using the --input flag when running the job. For instance, if your program used to fetch data from example.com, read from the /inputs folder locally, and provide the URL as input when executing the Bacalhau job. For example, --input http://example.com.
Output Handling: Adjust your program to output results to standard output (stdout) or standard error (stderr) pipes. Alternatively, you can write results to the filesystem, typically into an output mount. In the case of WASM jobs, a default folder at /outputs is available, ensuring that data written there will persist after the job concludes.

By making these adjustments, you can effectively transition your program to operate within the Bacalhau environment, utilizing filesystem operations instead of traditional network interactions.

You can specify additional or different output mounts using the -o flag.

Step 2: Configure your compiler to output WASI-compliant WebAssembly

You will need to compile your program to WebAssembly that expects WASI. Check the instructions for your compiler to see how to do this.

Step 3: Upload the input data

Data is identified by its content identifier (CID) and can be accessed by anyone who knows the CID. You can use either of these methods to upload your data:

You can mount your data anywhere on your machine, and Bacalhau will be able to run against that data

Step 4: Run your program

You can run a WebAssembly program on Bacalhau using the bacalhau wasm run command.

Run Locally Compiled Program:

If your program is locally compiled, specify it as an argument. For instance, running the following command will upload and execute the main.wasm program:

The program you specify will be uploaded to a Bacalhau storage node and will be publicly available.

Alternative Program Specification:

You can use a Content IDentifier (CID) for a specific WebAssembly program.

Input Data Specification:

Make sure to specify any input data using the --input flag.

This ensures the necessary data is available for the program's execution.

Program arguments

You can give the WASM program arguments by specifying them after the program path or CID. If the WASM program is already compiled and located in the current directory, you can run it by adding arguments after the file name:

For a specific WebAssembly program, run:

Write your program to use program arguments to specify input and output paths. This makes your program more flexible in handling different configurations of input and output volumes.

For example, instead of hard-coding your program to read from /inputs/data.txt, accept a program argument that should contain the path and then specify the path as an argument to bacalhau wasm run:

Your language of choice should contain a standard way of reading program arguments that will work with WASI.
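
The pattern itself is language-independent. The sketch below shows it in Python purely for illustration; a WASM job would use the equivalent argument-parsing facility of whichever language you compile to WASI. The function names build_parser and run, the upper-casing step, and the default output file name result.txt are all hypothetical.

```python
import argparse
import sys
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Describe the two path arguments the program accepts."""
    parser = argparse.ArgumentParser(description="Upper-case a text file.")
    parser.add_argument("input_path", type=Path,
                        help="file to read, for example /inputs/data.txt")
    parser.add_argument("output_path", type=Path, nargs="?",
                        default=Path("/outputs/result.txt"),
                        help="file to write (defaults to /outputs/result.txt)")
    return parser


def run(argv):
    """Read the input file, upper-case its text, and write it to the output path."""
    args = build_parser().parse_args(argv)
    text = args.input_path.read_text(encoding="utf-8")
    args.output_path.parent.mkdir(parents=True, exist_ok=True)
    args.output_path.write_text(text.upper(), encoding="utf-8")
    return args.output_path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Paths come from the command line instead of being hard-coded.
    print(run(sys.argv[1:]))
```

Because no path is hard-coded, the same program works whether the input volume is mounted at /inputs or anywhere else.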

Environment variables

You can also specify environment variables using the -e flag.

Examples

Support

Bacalhau Docker Image

How to use the Bacalhau Docker image

This documentation explains how to use the Bacalhau Docker image to run tasks and manage them using the Bacalhau client.

Prerequisites

To get started, you need to install the Bacalhau client (see more information) and Docker.

1. Pull the Bacalhau Docker image

The first step is to pull the Bacalhau Docker image from the GitHub container registry (ghcr.io).

Expected output:

You can also pull a specific version of the image, e.g.:

Remember that the "latest" tag is just a string. It doesn't refer to the latest version of the Bacalhau client; it refers to an image that has the "latest" tag. Therefore, if your machine already has an image tagged "latest", docker run reuses that local copy instead of downloading a newer one. To refresh it, run docker pull again, or pass --pull always to docker run.

2. Check version

To check the version of the Bacalhau client, run:

Expected Output:

3. Running a Bacalhau Job

In the example below, an Ubuntu-based job runs to print the message 'Hello from Docker Bacalhau':

Structure of the command

ghcr.io/bacalhau-project/bacalhau:latest : Name of the Bacalhau Docker image
--id-only: Output only the job id
--wait: Wait for the job to finish
ubuntu:latest: Ubuntu container
--: Separate Bacalhau parameters from the command to be executed inside the container
sh -c 'uname -a && echo "Hello from Docker Bacalhau!"': The command executed inside the container

Let's have a look at the command execution in the terminal:

The output you're seeing is in two parts:

The first line, 13:53:46.478 | INF pkg/repo/fs.go:81 > Initializing repo at '/root/.bacalhau' for environment 'production', is an informational message indicating the initialization of a repository at the specified directory ('/root/.bacalhau') for the production environment.
The second line, ab95a5cc-e6b7-40f1-957d-596b02251a66, is the job ID, which identifies the job that ran the command inside a Docker container. It can be used to obtain additional information about the executed job or to access the job's results. We store it in an environment variable so that we can reuse it later on (env: JOB_ID=ab95a5cc-e6b7-40f1-957d-596b02251a66).
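
When the command output is captured by a script, the log line and the job ID arrive together. One way to separate them is sketched below, assuming the job ID is a UUID-formatted string on a line of its own, as in the output above. The function name extract_job_id is hypothetical and not part of the Bacalhau client.

```python
import re
from typing import Optional

# Matches a lower-case UUID that occupies an entire line.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def extract_job_id(output: str) -> Optional[str]:
    """Return the last line of output that looks like a job ID, if any."""
    for line in reversed(output.splitlines()):
        candidate = line.strip()
        if UUID_PATTERN.match(candidate):
            return candidate
    return None


captured = (
    "13:53:46.478 | INF pkg/repo/fs.go:81 > Initializing repo at "
    "'/root/.bacalhau' for environment 'production'\n"
    "ab95a5cc-e6b7-40f1-957d-596b02251a66\n"
)
print(extract_job_id(captured))
```

Scanning from the last line backwards means informational messages printed before the ID are ignored.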

To print out the content of the Job ID, run the following command:

Expected Output:

4. Submit a Job With Output Files

One inconvenience that you'll see is that you'll need to mount directories into the container to access files. This is because the container is running in a separate environment from your host machine. Let's take a look at the example below:

The first part of the example should look familiar, except for the Docker commands.

When a job is submitted, Bacalhau prints out the related job_id (a46a9aa9-63ef-486a-a2f8-6457d7bafd2e):

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in the result directory.

After the download has finished, you should see the following contents in the results directory.

Support

How To Work With Custom Containers in Bacalhau

Bacalhau operates by executing jobs within containers. This example shows you how to build and use a custom docker container.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here
This example requires Docker. If you don't have Docker installed, you can install it from here. Docker commands will not work on hosted notebooks like Google Colab, but the Bacalhau commands will.

1. Running Containers

Docker Command

You're likely familiar with executing Docker commands to start a container:

docker run docker/whalesay cowsay sup old fashioned container run

This command runs a container from the docker/whalesay image. The container executes the cowsay sup old fashioned container run command:

_________________________________
< sup old fashioned container run >
 ---------------------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Bacalhau Command

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    docker/whalesay -- bash -c 'cowsay hello web3 uber-run')

This command also runs a container from the docker/whalesay image, using Bacalhau. We use the bacalhau docker run command to start a job in a Docker container. It contains additional flags such as --wait to wait for job completion and --id-only to return only the job identifier. Inside the container, the bash -c 'cowsay hello web3 uber-run' command is executed.

When a job is submitted, Bacalhau prints out the related job_id (7e41b9b9-a9e2-4866-9fce-17020d8ec9e0):

7e41b9b9-a9e2-4866-9fce-17020d8ec9e0

We store that in an environment variable so that we can reuse it later on.

You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

Viewing your job output

cat ./results/stdout

 _____________________
< hello web3 uber-run >
 ---------------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Both commands execute cowsay in the docker/whalesay container, but Bacalhau provides additional features for working with jobs at scale.

Bacalhau Syntax

Bacalhau uses a syntax that is similar to Docker, and you can use the same containers. The main difference is that input and output data are passed to the container via IPFS, to enable planetary scale. In the example above, it doesn't make too much difference except that we need to download the stdout.

The --wait flag tells Bacalhau to wait for the job to finish before returning. This is useful in interactive sessions like this, but you would normally allow jobs to complete in the background and use the bacalhau list command to check on their status.

Another difference is that by default Bacalhau overwrites the default entry point for the container, so you have to pass all shell commands as arguments to the run command after the -- flag.
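
If you submit jobs from a script, the position of the -- separator is easy to get wrong. The sketch below assembles the argument list in the order described above, using only the --wait and --id-only flags shown in this example. The helper name build_docker_run_args is hypothetical, and the sketch only builds and prints the command; it does not run it.

```python
import shlex
from typing import List, Sequence


def build_docker_run_args(image: str, command: Sequence[str],
                          wait: bool = True, id_only: bool = True) -> List[str]:
    """Assemble a 'bacalhau docker run' argument list.

    Flags come first, then the image, then '--', then the command that
    is executed inside the container.
    """
    args = ["bacalhau", "docker", "run"]
    if wait:
        args.append("--wait")
    if id_only:
        args.append("--id-only")
    args.append(image)
    args.append("--")
    args.extend(command)
    return args


args = build_docker_run_args(
    "docker/whalesay", ["bash", "-c", "cowsay hello web3 uber-run"]
)
# shlex.join quotes the arguments the way a POSIX shell expects them.
print(shlex.join(args))
```

Passing the container command as a list keeps its quoting intact, so the whole shell snippet after bash -c stays a single argument.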

2. Building Your Own Custom Container For Bacalhau

To use your own custom container, you must publish the container to a container registry that is accessible from the Bacalhau network. At this time, only public container registries are supported.

To demonstrate this, you will develop and build a simple custom container that comes from an old Docker example. I remember seeing cowsay at a Docker conference about a decade ago. I think it's about time we brought it back to life and distributed it across the Bacalhau network.

# write to the cod.cow
$the_cow = <<"EOC";
   $thoughts
    $thoughts
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `\░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜
EOC

Next, the Dockerfile adds the script and sets the entry point.

# write the Dockerfile
FROM debian:stretch
RUN apt-get update && apt-get install -y cowsay
# "cowsay" installs to /usr/games
ENV PATH $PATH:/usr/games
RUN echo '#!/bin/bash\ncowsay "${@:1}"' > /usr/bin/codsay && \
    chmod +x /usr/bin/codsay
COPY cod.cow /usr/share/cowsay/cows/default.cow

Now let's build and test the container locally.

docker build -t ghcr.io/bacalhau-project/examples/codsay:latest . 2> /dev/null

docker run --rm ghcr.io/bacalhau-project/examples/codsay:latest codsay I like swimming in data

Once your container is working as expected, you should push it to a public container registry. In this example, I'm pushing to GitHub's container registry, but we'll skip the step below because you probably don't have permission. Remember that the Bacalhau nodes expect your container to have a linux/amd64 architecture.

docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/bacalhau-project/examples/codsay:latest .

3. Running Your Custom Container on Bacalhau

Now we're ready to submit a Bacalhau job using your custom container. This code runs a job, downloads the results, and prints the stdout.

The bacalhau docker run command strips the default entry point, so don't forget to run your entry point in the command line arguments.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    ghcr.io/bacalhau-project/examples/codsay:v1.0.0 \
    -- bash -c 'codsay Look at all this data')

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Download your job results directly by using the bacalhau get command.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

View your job output

cat ./results/stdout

_______________________
< Look at all this data >
 -----------------------
   \
    \
                               ,,,,_
                            ┌Φ▓╬▓╬▓▓▓W      @▓▓▒,
                           ╠▓╬▓╬╣╬╬▓╬▓▓   ╔╣╬╬▓╬╣▓,
                    __,┌╓═╠╬╠╬╬╬Ñ╬╬╬Ñ╬╬¼,╣╬╬▓╬╬▓╬▓▓▓┐        ╔W_             ,φ▓▓
               ,«@▒╠╠╠╠╩╚╙╙╩Ü╚╚╚╚╩╙╙╚╠╩╚╚╟▓▒╠╠╫╣╬╬╫╬╣▓,   _φ╬▓╬╬▓,        ,φ╣▓▓╬╬
          _,φÆ╩╬╩╙╚╩░╙╙░░╩`=░╙╚»»╦░=╓╙Ü1R░│░╚Ü░╙╙╚╠╠╠╣╣╬≡Φ╬▀╬╣╬╬▓▓▓_   ╓▄▓▓▓▓▓▓╬▌
      _,φ╬Ñ╩▌▐█[▒░░░░R░░▀░`,_`!R`````╙`-'╚Ü░░Ü░░░░░░░│││░╚╚╙╚╩╩╩╣Ñ╩╠▒▒╩╩▀▓▓╣▓▓╬╠▌
     '╚╩Ü╙│░░╙Ö▒Ü░░░H░░R ▒¥╣╣@@@▓▓▓  := '`   `░``````````````````````````]▓▓▓╬╬╠H
       '¬═▄ `░╙Ü░╠DjK` Å»»╙╣▓▓▓▓╬Ñ     -»`       -`      `  ,;╓▄╔╗∞  ~▓▓▓▀▓▓╬╬╬▌
             '^^^`   _╒Γ   `╙▀▓▓╨                     _, ⁿD╣▓╬╣▓╬▓╜      ╙╬▓▓╬╬▓▓
                 ```└                           _╓▄@▓▓▓╜   `╝╬▓▓╙           ²╣╬▓▓
                        %φ▄╓_             ~#▓╠▓▒╬▓╬▓▓^        `                ╙╙
                         `╣▓▓▓              ╠╬▓╬▓╬▀`
                           ╚▓▌               '╨▀╜

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Python

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Sample Recommendation Dataset

We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

Download the MovieLens 1M dataset from https://files.grouplens.org/datasets/movielens/ml-1m.zip:

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat and movies.dat. After the dataset is downloaded, extract the zip and place ratings.dat and movies.dat into a folder called input:

# Extracting the downloaded zip file
unzip ml-1m.zip

# Moving ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

Create a requirements.txt file listing the Python libraries we’ll be using:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and paste the following script into it:

# content of the similar-movies.py

# Imports
import argparse
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

# Read the files with pandas
data = pd.read_csv('input/ratings.dat',
                   names=['user_id', 'movie_id', 'rating', 'time'],
                   engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.read_csv('input/movies.dat',
                         names=['movie_id', 'title', 'genre'],
                         engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users.
# np.zeros is used so that unrated entries start at 0 instead of uninitialised memory.
ratings_mat = np.zeros(
    shape=(np.max(data.movie_id.values), np.max(data.user_id.values)),
    dtype=np.uint8)
ratings_mat[data.movie_id.values - 1, data.user_id.values - 1] = data.rating.values

# Normalise matrix (subtract each movie's mean rating)
normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute the eigendecomposition of the covariance matrix
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)


# Calculate cosine similarity, sort by most similar, and return the top N.
def top_cosine_similarity(data, movie_id, top_n=10):
    index = movie_id - 1  # Movie id starts from 1
    movie_row = data[index, :]
    magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
    similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
    sort_indexes = np.argsort(-similarity)
    return sort_indexes[:top_n]


# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
    print('Recommendations for {0}: \n'.format(
        movie_data[movie_data.movie_id == movie_id].title.values[0]))
    for id in top_indexes + 1:
        print(str(id), ' : ', movie_data[movie_data.movie_id == id].title.values[0])


# Command-line arguments, each with a default value
parser = argparse.ArgumentParser(description='Recommend similar movies')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies', default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie', default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations', default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id  # Grab an id from movies.dat
top_n = args.n

sliced = evecs[:, :k]  # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

Read the files with pandas: the code uses Pandas to read data from the files ratings.dat and movies.dat.
Create the ratings matrix of shape (m×u), with rows as movies and columns as users.
Normalise the matrix: the mean rating of each movie is subtracted from its row.
Compute the decomposition: the script computes the eigenvectors of the covariance matrix of the normalized ratings, which takes the place of the singular value decomposition (SVD) step.
Calculate cosine similarity, sort by most similar, and return the top N.
Select k principal components to represent the movies and a movie_id to find recommendations for, then print the top_n results.
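
The cosine-similarity step can be tried on a tiny example. The sketch below uses plain Python lists rather than the NumPy calls in the script, so it is an illustration of the idea and not the script's implementation. The function name most_similar_rows is hypothetical, and it takes a 0-based row index instead of a movie ID.

```python
import math
from typing import List, Sequence


def most_similar_rows(rows: Sequence[Sequence[float]], index: int,
                      top_n: int = 10) -> List[int]:
    """Return the indexes of the top_n rows most similar to rows[index].

    Similarity is the cosine of the angle between two rows: their dot
    product divided by the product of their lengths.
    """
    def length(vector: Sequence[float]) -> float:
        return math.sqrt(sum(x * x for x in vector))

    target = rows[index]
    target_length = length(target)
    scored = []
    for i, row in enumerate(rows):
        denominator = target_length * length(row)
        dot = sum(a * b for a, b in zip(target, row))
        # A zero-length row has no direction; treat it as unrelated.
        score = dot / denominator if denominator else 0.0
        scored.append((score, i))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_n]]


example_rows = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(most_similar_rows(example_rows, 0, top_n=3))
```

As in the sample recommendations shown earlier, the row being queried is always its own closest match, so it appears first in the result.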

For further reading on how the script works, go to Simple Movie Recommender Using SVD | Alyssa

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

We use the python:3.8 Docker image as the base. The Dockerfile adds the script similar-movies.py, the dataset directory input, and requirements.txt to the image, then runs pip to install the dependencies.

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

See more information on how to containerize your script/app here

Build the container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.

repo-name with the name of the container. You can name it anything you want.

tag with the image tag. This is not required, but you can use the latest tag.

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name of the Docker image we are using
-- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name of the docker image we are using
-- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Pandas on Bacalhau

Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

In this tutorial example, we will run a Pandas script on Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

1. Running Pandas Locally

To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.

To get started, you need to install the Pandas library from pip:

Importing data from CSV to DataFrame

Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv() to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.

The overall purpose of the command above is to read data from a CSV file (transactions.csv) using Pandas and print the resulting DataFrame.

To download the transactions.csv file, run:

To print the contents of the transactions.csv file, run:

Running the script

Now let's run the script to read in the CSV file. The output will be a DataFrame object.

2. Ingesting Data

To run Pandas on Bacalhau, you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can just as easily upload your script to IPFS.

3. Running a Bacalhau Job

Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run command to do this:

Structure of the command

bacalhau docker run: call to Bacalhau
amancevice/pandas: Docker image with pandas installed.
-i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files: mounts the uploaded dataset at the given path. The -i flag allows us to mount a file or directory from IPFS into the container. It takes two arguments: the first is the IPFS CID (QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz) and the second is the path where the data is mounted inside the container (/files). The -i flag can be used multiple times to mount multiple directories.
-w /files: sets the working directory to /files, the folder where the script and data are mounted.
python read_csv.py: runs the Python script that reads the CSV file with Pandas.
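The two-part shape of the -i argument can be illustrated with a few lines of Python. This is only a sketch and not Bacalhau's own parser: split_input_spec is a hypothetical helper, and it assumes that the mount path is whatever follows the last ":/" that is not part of the "://" scheme separator, falling back to /inputs when no path is given.

```python
from typing import Tuple

DEFAULT_TARGET = "/inputs"  # assumed fallback when the spec carries no mount path


def split_input_spec(spec: str) -> Tuple[str, str]:
    """Split an input spec such as 'ipfs://<CID>:/files' into (source, target)."""
    idx = spec.rfind(":/")
    # If the last ':/' belongs to the '://' scheme separator, no target was given.
    if idx == -1 or spec[idx:idx + 3] == "://":
        return spec, DEFAULT_TARGET
    return spec[:idx], spec[idx + 1:]


if __name__ == "__main__":
    source, target = split_input_spec(
        "ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files"
    )
    print(source)  # what gets mounted
    print(target)  # where it appears inside the container
```

Keeping the source and the mount path separate like this makes it easier to check that the path passed to -w matches the path the data is mounted at.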

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

5. Viewing your Job Output

To view the file, run the following command:

Support

Running a Python Script

This tutorial serves as an introduction to Bacalhau. In this example, you'll use Bacalhau to run a simple "Hello, World!" Python script that is hosted on a website.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running Python Locally

We'll be using a very simple Python script that displays the traditional first greeting. Create a file called hello-world.py:

# hello-world.py
print("Hello, world!")

Run the script to print the output:

python3 hello-world.py

After the script has run successfully locally, we can run it on Bacalhau.

2. Running a Bacalhau Job

To submit a workload to Bacalhau, you can use the bacalhau docker run command. This command allows passing input data into the container using content identifier (CID) volumes; for simplicity, we will use the --input URL:path argument. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau overrides the default entrypoint, so we must run the full command after the -- argument.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py)

Structure of the command

bacalhau docker run: call to Bacalhau
--id-only: specifies that only the job identifier (job_id) is printed after the job is submitted, not the entire output
--input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which points to the Python script hello-world.py.
python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
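If you prefer to drive the same submission from Python rather than from a shell, the command can be assembled as an argument list and the job ID read from standard output. The following is a minimal sketch under stated assumptions: build_docker_run_args and submit are hypothetical helper names, only the flags shown in the command above are used, and submit requires the bacalhau CLI to be installed.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_docker_run_args(image: str, command: Sequence[str],
                          inputs: Sequence[str] = ()) -> List[str]:
    """Assemble the argv for `bacalhau docker run --id-only ...`."""
    args = ["bacalhau", "docker", "run", "--id-only"]
    for spec in inputs:
        args += ["--input", spec]
    # Everything after `--` is the command executed inside the container.
    return args + [image, "--", *command]


def submit(args: Sequence[str]) -> str:
    """Run the command and return the job ID printed on stdout (needs the bacalhau CLI)."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    args = build_docker_run_args(
        "python:3.10-slim",
        ["python3", "/inputs/hello-world.py"],
        inputs=[
            "https://raw.githubusercontent.com/bacalhau-project/examples/"
            "151eebe895151edd83468e3d8b546612bf96cd05/"
            "workload-onboarding/trivial-python/hello-world.py"
        ],
    )
    # Print the command for inspection; call submit(args) to actually run it.
    print(shlex.join(args))
```

Passing the arguments as a list avoids shell quoting problems, and the string returned by submit plays the same role as the JOB_ID environment variable in the shell version.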

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Running Trivial Python
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.10-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 /inputs/hello-world.py
    InputSources:
      - Target: /inputs
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py
            Path: /inputs/hello-world.py

The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:

bacalhau job run helloworld.yaml

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau get ${JOB_ID} --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Jupyter Notebooks on Bacalhau

Introduction

Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.
Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)

/inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.
-i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.
https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.
jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.
--execute: This flag tells nbconvert to execute the notebook and store the results in the output file.
--to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.
--output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list:

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe:

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished, you can see the contents of the results directory by running the command below:

ls results/outputs

hello_output.nbconvert.ipynb

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

Install Docker on your local machine.
Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

# Use the TensorFlow GPU image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

# Update the package list
RUN apt-get update -y

# Copy the notebook and the requirements file into the image
COPY mnist.ipynb .
COPY requirements.txt .

# Upgrade pip
RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Install (or upgrade) scikit-learn
RUN pip install -U scikit-learn

This Dockerfile creates a Docker image based on the official TensorFlow GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb) and a requirements.txt file. It then upgrades pip and installs Python packages from the requirements.txt file, along with scikit-learn. The resulting image provides an environment ready for running the mnist.ipynb notebook with TensorFlow and scikit-learn, as well as other specified dependencies.

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .

Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it as your-dockerhub-username/jupyter-mnist-tensorflow:latest.

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

docker push your-dockerhub-username/jupyter-mnist-tensorflow

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)

Structure of the command

--gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.
jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished, you can see the contents of the results directory by running the command below:

ls results/outputs

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Scripting Bacalhau with Python

Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.

This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.

Prerequisite

To get started, you need to install the Bacalhau client.

Executing Bacalhau Jobs with Python Scripts

To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.

Now let's create a file called bacalhau.py. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.

This code has a few interesting features:

Change the value in the main call (main("hashes.txt", 10)) to change the number of jobs to execute.
Because jobs complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True loop monitors the status of the jobs and waits for them to complete.
When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5) loop in the getResultsFromJob function retries the bacalhau get operation if it fails to complete successfully.
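The two loops described above can be sketched independently of the Bacalhau CLI. This is not the script from this example, just a minimal illustration of the pattern: wait_for_jobs and get_with_retries are hypothetical names, and get_state and download are assumed to be callables that you supply yourself (for instance, thin wrappers around bacalhau list and bacalhau get).

```python
import time
from typing import Callable, Iterable, Set

DONE_STATES = {"Published", "Completed"}  # states this guide treats as finished


def wait_for_jobs(job_ids: Iterable[str], get_state: Callable[[str], str],
                  poll_seconds: float = 5.0) -> None:
    """Block until every job reports one of DONE_STATES."""
    pending: Set[str] = set(job_ids)
    while True:
        # Keep only the jobs that have not finished yet.
        pending = {job for job in pending if get_state(job) not in DONE_STATES}
        if not pending:
            return
        time.sleep(poll_seconds)


def get_with_retries(job_id: str, download: Callable[[str], bool],
                     attempts: int = 5) -> bool:
    """Call download(job_id) up to `attempts` times; return True on the first success."""
    for _ in range(attempts):
        if download(job_id):
            return True
    return False
```

Note that this sketch waits forever if a job ends in a failed state, so a real script would probably also want to treat error states as terminal or add an overall deadline.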

Let's run it!

Hopefully, the results directory contains all the combined results from the jobs we just executed. Here we're expecting to see CSV files:

Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.

Support

R (language)

Building and Running your Custom R Containers on Bacalhau

Introduction

This example will walk you through building a time series forecasting model using Prophet. Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

Quick script to run custom R container on Bacalhau:

Prerequisites

To get started, you need to install the Bacalhau client.

1. Running Prophet in R Locally

Open RStudio or another IDE that supports R. If you want to run this on a notebook server, make sure you use an R kernel. Prophet is a CRAN package, so you can use install.packages to install the prophet package:

After installation is finished, you can download the example data that is stored in IPFS:

The code below instantiates the library and fits a model to the data.

Create a new file called Saturating-Forecasts.R and in it paste the following script:

This script performs time series forecasting using the Prophet library in R, taking input data from a CSV file, applying the forecasting model, and generating plots for analysis.

Let's have a look at the command below:

This command uses Rscript to execute the script that was created and written to the Saturating-Forecasts.R file.

The input parameters provided in this case are the names of input and output files:

example_wp_log_R.csv - the example data that was previously downloaded.

outputs/output0.pdf - the name of the file to save the first forecast plot.

outputs/output1.pdf - the name of the file to save the second forecast plot.

2. Running R Prophet on Bacalhau

3. Containerize Script with Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

These commands specify how the image will be built, and what extra requirements will be included. We use r-base as the base image and then install the prophet package. We then copy the Saturating-Forecasts.R script into the container and set the working directory to the R folder.

Build the container

We will run the docker build command to build the container:

Before running the command, replace:

repo-name: the name of the container; you can name it anything you want

tag: not required, but you can use the latest tag

In our case:

Push the container

Next, upload the image to the registry. This can be done using the Docker Hub username, repo name, and tag.

In our case:

4. Running a Job on Bacalhau

The following command runs the R script on the input dataset and writes the results to the outputs directory. It takes approximately 2 minutes to run.

Structure of the command

bacalhau docker run: call to Bacalhau
-i ipfs://QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt:/example_wp_log_R.csv: mounts the uploaded dataset at /example_wp_log_R.csv inside the container. It takes two arguments: the first is the IPFS CID (QmY8BAftd48wWRYDf5XnZGkhwqgjpzjyUG3hN1se6SYaFt) and the second is the path where the data is mounted inside the container (/example_wp_log_R.csv)
ghcr.io/bacalhau-project/examples/r-prophet:0.0.2: the name and the tag of the Docker image we are using
/example_wp_log_R.csv: path to the input dataset
/outputs/output0.pdf, /outputs/output1.pdf: paths to the output files
Rscript Saturating-Forecasts.R: executes the R script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

6. Viewing your Job Output

To view the file, run the following command:

You can't natively display PDFs in notebooks, so here are some static images of the PDFs:

output0.pdf

output1.pdf

Support

Running a Simple R Script on Bacalhau

You can use official Docker containers for each language, like R or Python. In this example, we will use the official R container and run it on Bacalhau.

In this tutorial example, we will run a "hello world" R script on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running an R Script Locally

To install R, follow the instructions in Appendix A, "Installing R and RStudio", of Hands-On Programming with R. After R and RStudio are installed, create and run a script called hello.R:

# hello.R
print("hello world")

Run the script:

Rscript hello.R

Next, upload the script to your public storage (in our case, IPFS). We've already uploaded the script to IPFS and the CID is: QmVHSWhAL7fNkRiHfoEJGeMYjaYZUsKHvix7L54SptR8ie. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

2. Running a Job on Bacalhau

Now it's time to run the script on Bacalhau:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R \
    r-base \
    -- Rscript hello.R)

Structure of the command

bacalhau docker run: call to Bacalhau
-i ipfs://QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk:/hello.R: mounts the uploaded script at /hello.R inside the container. It takes two arguments: the first is the IPFS CID (QmQRVx3gXVLaRXywgwo8GCTQ63fHqWV88FiwEqCidmUGhk) and the second is the path where the file is mounted inside the container (/hello.R)
r-base: docker official image we are using
Rscript hello.R: execute the R script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Running a Simple R Script
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: r-base:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c        
          - Rscript /hello.R
    InputSources:
      - Target: "/"
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/main/scripts/hello.R
            Path: /hello.R

The job description should be saved in .yaml format, e.g. rhello.yaml, and then run with the command:

bacalhau job run rhello.yaml

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau get ${JOB_ID} --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Futureproofing your R Scripts

You can generate the job request using bacalhau describe with the --spec flag. This will allow you to re-run that job in the future:

bacalhau describe ${JOB_ID} --spec > job.yaml

cat job.yaml

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Run CUDA programs on Bacalhau

What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.

In addition to accelerating high-performance computing (HPC) and research applications, CUDA has also been widely adopted across consumer and industrial ecosystems. CUDA also makes it easy for developers to take advantage of all the latest GPU architecture innovations.

Advantage of GPU over CPU

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

Computations like matrix multiplication can be done much faster on a GPU than on a CPU.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running CUDA locally

You'll need to have the following installed:

NVIDIA GPU
CUDA drivers installed
nvcc installed

Checking if nvcc is installed:

nvcc --version

Downloading the programs:

mkdir inputs outputs
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/00-hello-world.cu
wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu

Viewing the programs

00-hello-world.cu:

# View the contents of the standard C++ program
cat inputs/00-hello-world.cu

# Compile and run the program
nvcc -o ./outputs/hello ./inputs/00-hello-world.cu; ./outputs/hello

This example represents a standard C++ program that inefficiently utilizes GPU resources due to the use of non-parallel loops.

02-cuda-hello-world-faster.cu:

# View the contents of the CUDA program with vector addition
cat inputs/02-cuda-hello-world-faster.cu

# Remove any previous output
rm -rf outputs/hello

# Compile and run the program
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello

This example performs vector addition using CUDA. It allocates the memory in advance and copies it to the GPU using cudaMemcpy, so that the computation can use the GPU's high-bandwidth memory (HBM). Compilation and execution are faster (1.39 seconds) than in the previous example (8.67 seconds).

2. Running a Bacalhau Job

To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu \
    --id-only \
    --wait \
    nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04 \
    -- /bin/bash -c 'nvcc --expt-relaxed-constexpr  -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello ')

Structure of the Commands

bacalhau docker run: call to Bacalhau
-i https://raw.githubusercontent.com/tristanpenman/cuda-examples/master/02-cuda-hello-world-faster.cu: URL of the input file, which is downloaded and mounted into the container.
nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04: Docker container for executing CUDA programs (you need to choose the right CUDA Docker container). The container tag should include "devel".
nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu: compiles the program with the nvcc compiler and saves the binary to the outputs directory as hello
Note that there is a ; between the commands: -- /bin/bash -c 'nvcc --expt-relaxed-constexpr -o ./outputs/hello ./inputs/02-cuda-hello-world-faster.cu; ./outputs/hello'. The ";" symbol allows executing multiple commands sequentially in a single line.
./outputs/hello: executes the hello binary. You can combine the compilation and execution commands in this way.
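One detail of the ";" separator is worth knowing: the shell runs the second command even when the first one fails, whereas "&&" runs it only after a success. The sketch below demonstrates the difference from Python, using the standard false command as a stand-in for a failed compilation. run_shell is a hypothetical helper, and the example assumes a POSIX shell at /bin/sh.

```python
import subprocess


def run_shell(snippet: str) -> str:
    """Run a snippet with /bin/sh -c and return what it printed to stdout."""
    result = subprocess.run(["/bin/sh", "-c", snippet],
                            capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # `false` always exits with a non-zero status, like a failed compile step.
    print(repr(run_shell("false; echo ran")))    # the second command still runs
    print(repr(run_shell("false && echo ran")))  # the second command is skipped
```

With ";", a failed nvcc step would still be followed by an attempt to run ./outputs/hello. Replacing ";" with "&&" is an option if you would rather skip execution when compilation fails.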

Note that the CUDA version needs to be compatible with the graphics card on the host machine.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on:

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running a Prolog Script

Introduction

Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

This tutorial is a quick guide on how to run a hello world script on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Running Locally

To get started, install swipl:

sudo add-apt-repository ppa:swi-prolog/stable
sudo apt-get update
sudo apt-get install swi-prolog

Create a file called helloworld.pl. The following script prints ‘Hello World’ to the stdout:

% helloworld.pl
hello_world :- write('Hello World'), nl,
               halt.

Running the script to print out the output:

swipl -q -s helloworld.pl -g hello_world

After the script has run successfully locally, we can now run it on Bacalhau.

Before running it on Bacalhau we need to upload it to IPFS.

Using the IPFS cli:

wget https://dist.ipfs.io/go-ipfs/v0.4.2/go-ipfs_v0.4.2_linux-amd64.tar.gz
tar xvfz go-ipfs_v0.4.2_linux-amd64.tar.gz
mv go-ipfs/ipfs /usr/local/bin/ipfs
ipfs init
ipfs cat /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme
ipfs config Addresses.Gateway /ip4/127.0.0.1/tcp/8082
ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
nohup ipfs daemon > startup.log &

Run the command below to add our script to IPFS:

ipfs add helloworld.pl

This command outputs the CID. Copy the CID of the file, which in our case is QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii.

Since the data uploaded to IPFS isn't pinned, we will need to do that manually; see the documentation on how to pin your data. We recommend using NFT.Storage.

2. Running a Bacalhau Job

We will mount the script to the container using the -i flag: -i ipfs://<CID>:/<name-of-the-script>.

To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl \
    --wait \
    --id-only \
    swipl \
    -- swipl -q -s helloworld.pl -g hello_world)

Structure of the Command

-i ipfs://QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii:/helloworld.pl: sets the input data for the container. QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii is our CID, which points to the helloworld.pl file on the IPFS network. This file will be accessible within the container.
-- swipl -q -s helloworld.pl -g hello_world: instructs SWI-Prolog to load the program from the helloworld.pl file and run the hello_world goal in quiet mode:
1. -q: run in quiet mode
2. -s: load the file as a script. In this case, we want to run the helloworld.pl script
3. -g: the goal to execute. In this case, it's hello_world
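
The same command can also be assembled programmatically, for example before handing it to subprocess. The sketch below is only illustrative: the helper names ipfs_input and docker_run_args are hypothetical, and the code builds the argument list shown above without running it.

```python
from typing import List


def ipfs_input(cid: str, mount_path: str) -> str:
    """Format an IPFS input as ipfs://<CID>:<path inside the container>."""
    if not mount_path.startswith("/"):
        raise ValueError("mount path must be absolute")
    return f"ipfs://{cid}:{mount_path}"


def docker_run_args(image: str, inputs: List[str], command: List[str]) -> List[str]:
    """Build the argv for `bacalhau docker run`, mirroring the command above."""
    args = ["bacalhau", "docker", "run"]
    for spec in inputs:
        args += ["-i", spec]  # one -i flag per mounted input
    args += ["--wait", "--id-only", image, "--"]
    return args + command  # everything after -- runs inside the container


if __name__ == "__main__":
    cid = "QmYq9ipYf3vsj7iLv5C67BXZcpLHxZbvFAJbtj7aKN5qii"
    argv = docker_run_args(
        "swipl",
        [ipfs_input(cid, "/helloworld.pl")],
        ["swipl", "-q", "-s", "helloworld.pl", "-g", "hello_world"],
    )
    print(" ".join(argv))
```

Keeping the arguments as a list avoids shell-quoting issues if the list is later passed to subprocess.run.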

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

3. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

4. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Reading Data from Multiple S3 Buckets using Bacalhau

Introduction

Bacalhau has recently integrated Amazon Web Services (AWS) S3, allowing users to access and process data stored in S3 buckets within their Bacalhau jobs. This integration simplifies data input, output, and processing operations, and lets users store and manage their data in S3 buckets. With Bacalhau, you can process several large S3 buckets in parallel. In this example, we will walk you through reading data from multiple S3 buckets and converting TIFF images to JPEG format.

Advantages of Converting TIFF to JPEG

There are several advantages to converting images from TIFF to JPEG format:

Reduced File Size: JPEG images use lossy compression, which significantly reduces file size compared to lossless formats like TIFF. Smaller file sizes lead to faster upload and download times, as well as reduced storage requirements.
Efficient Processing: With smaller file sizes, image processing tasks tend to be more efficient and require fewer computational resources when working with JPEG images compared to TIFF images.
Training Machine Learning Models: Smaller file sizes and reduced computational requirements make JPEG images more suitable for training machine learning models, particularly when dealing with large datasets, as they can help speed up the training process and reduce the need for extensive computational resources.

Running the job on Bacalhau

We will use the S3 mount feature to mount objects from S3 buckets. Let’s have a look at the example below:

-i src=s3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif,dst=/sentinel-s1-rtc-indigo/,opt=region=us-west-2

It defines an S3 object as input to the job:

sentinel-s1-rtc-indigo: the bucket’s name
tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif: the key of the object in that bucket. The object to be processed is called Gamma0_VH.tif and is located in the subdirectory with the specified path. To select all objects located under that path instead, add * to the end of the path (tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/*)
dst=/sentinel-s1-rtc-indigo/: the destination at which to mount the S3 object
opt=region=us-west-2: the region in which the bucket is located
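
As a rough illustration of how the pieces of this input specification fit together, the sketch below formats the string from its parts. The function names s3_input and s3_prefix_input are hypothetical helpers, not part of Bacalhau; they only produce the text passed to -i.

```python
def s3_input(bucket: str, key: str, dst: str, region: str) -> str:
    """Format an S3 input as src=s3://<bucket>/<key>,dst=<path>,opt=region=<region>."""
    return f"src=s3://{bucket}/{key.lstrip('/')},dst={dst},opt=region={region}"


def s3_prefix_input(bucket: str, prefix: str, dst: str, region: str) -> str:
    """Same as s3_input, but selects every object under a prefix via a trailing *."""
    return s3_input(bucket, prefix.rstrip("/*") + "/*", dst, region)


if __name__ == "__main__":
    print(
        s3_input(
            "sentinel-s1-rtc-indigo",
            "tiles/RTC/1/IW/10/S/DH/2017/S1A_20170125_10SDH_ASC/Gamma0_VH.tif",
            "/sentinel-s1-rtc-indigo/",
            "us-west-2",
        )
    )
```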

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running the job on multiple buckets with multiple objects

In the example below, we will mount several objects from public S3 buckets located in a specific region:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --publisher=ipfs \
    --memory=10Gb \
    --wait-timeout-secs 3600 \
    -i src=s3://bdc-sentinel-2/s2-16d/v1/075/086/2018/02/18/*,dst=/bdc-sentinel-2/,opt=region=us-west-2  \
    -i src=s3://sentinel-cogs/sentinel-s2-l2a-cogs/28/M/CV/2022/6/S2B_28MCV_20220620_0_L2A/*,dst=/sentinel-cogs/,opt=region=us-west-2 \
    jsacex/gdal-s3)

The job has been submitted and Bacalhau has printed out the related job_id. We store that in an environment variable so that we can reuse it later on.

2. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

3. Viewing your Job Output

Display the image

To view the images, download the job results and open the folder:

results/outputs/S2-16D_V1_075086_20180218_B04_TCI.jpg

results/outputs/B04_TCI.jpg

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Rust programs as WebAssembly (WASM)

Bacalhau supports running jobs as a WebAssembly (WASM) program. This example demonstrates how to compile a Rust project into WebAssembly and run the program on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here.
A working Rust installation with the wasm32-wasi target. For example, you can use rustup to install Rust and configure it to build WASM targets.

1. Develop a Rust Program Locally

We can use cargo (which will have been installed by rustup) to start a new project (my-program) and compile it:

cargo init my-program

We can then write a Rust program. Rust programs that run on Bacalhau can read and write files, access a simple clock, and make use of pseudo-random numbers. They cannot memory-map files or run code on multiple threads.

The program below will use the Rust imageproc crate to resize an image through seam carving, based on an example from their repository.

// ./my-program/src/main.rs
use image::{open, GrayImage, Luma, Pixel};
use imageproc::definitions::Clamp;
use imageproc::gradients::sobel_gradient_map;
use imageproc::map::map_colors;
use imageproc::seam_carving::*;
use std::path::Path;

fn main() {
    let input_path = "inputs/image0.JPG";
    let output_dir = "outputs/";

    let input_path = Path::new(&input_path);
    let output_dir = Path::new(&output_dir);

    // Load image and convert to RGB
    let input_image = open(input_path)
        .expect(&format!("Could not load image at {:?}", input_path))
        .to_rgb8();

    // Save original image in output directory
    let original_path = output_dir.join("original.png");
    input_image.save(&original_path).unwrap();

    // We will reduce the image width by this amount, removing one seam at a time.
    let seams_to_remove: u32 = input_image.width() / 6;

    let mut shrunk = input_image.clone();
    let mut seams = Vec::new();

    // Record each removed seam so that we can draw them on the original image later.
    for i in 0..seams_to_remove {
        if i % 100 == 0 {
            println!("Removing seam {}", i);
        }
        let vertical_seam = find_vertical_seam(&shrunk);
        shrunk = remove_vertical_seam(&mut shrunk, &vertical_seam);
        seams.push(vertical_seam);
    }

    // Draw the seams on the original image.
    let gray_image = map_colors(&input_image, |p| p.to_luma());
    let annotated = draw_vertical_seams(&gray_image, &seams);
    let annotated_path = output_dir.join("annotated.png");
    annotated.save(&annotated_path).unwrap();

    // Draw the seams on the gradient magnitude image.
    let gradients = sobel_gradient_map(&input_image, |p| {
        let mean = (p[0] + p[1] + p[2]) / 3;
        Luma([mean as u32])
    });
    let clamped_gradients: GrayImage = map_colors(&gradients, |p| Luma([Clamp::clamp(p[0])]));
    let annotated_gradients = draw_vertical_seams(&clamped_gradients, &seams);
    let gradients_path = output_dir.join("gradients.png");
    clamped_gradients.save(&gradients_path).unwrap();
    let annotated_gradients_path = output_dir.join("annotated_gradients.png");
    annotated_gradients.save(&annotated_gradients_path).unwrap();

    // Save the shrunk image.
    let shrunk_path = output_dir.join("shrunk.png");
    shrunk.save(&shrunk_path).unwrap();
}

The main() function loads an image, saves the original, and then loops to reduce the width of the image by removing "seams". It saves the results of the process, including the original image with the seams drawn on it and a gradient image with the seams highlighted.
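
To make the idea of a seam concrete, here is a minimal pure-Python sketch of the usual dynamic-programming approach: pick the connected top-to-bottom path with the lowest total energy, then drop it. This is not the imageproc implementation; the function names min_energy_seam and drop_seam and the use of a plain list-of-lists energy grid are assumptions made for illustration.

```python
from typing import List


def min_energy_seam(energy: List[List[float]]) -> List[int]:
    """Return one column index per row for the connected top-to-bottom
    path with the lowest total energy."""
    rows, cols = len(energy), len(energy[0])
    cost = [row[:] for row in energy]  # cumulative cost, input left untouched
    for y in range(1, rows):
        for x in range(cols):
            lo, hi = max(x - 1, 0), min(x + 1, cols - 1)
            cost[y][x] += min(cost[y - 1][lo : hi + 1])
    # Backtrack from the cheapest cell in the last row.
    x = min(range(cols), key=lambda c: cost[rows - 1][c])
    seam = [x]
    for y in range(rows - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, cols - 1)
        x = min(range(lo, hi + 1), key=lambda c: cost[y - 1][c])
        seam.append(x)
    seam.reverse()
    return seam


def drop_seam(image: List[List[float]], seam: List[int]) -> List[List[float]]:
    """Remove the seam's pixel from every row, making the image one column narrower."""
    return [row[:x] + row[x + 1 :] for row, x in zip(image, seam)]


if __name__ == "__main__":
    grid = [[1, 9, 9], [9, 1, 9], [9, 9, 1]]
    seam = min_energy_seam(grid)
    print(seam, drop_seam(grid, seam))
```

Repeating these two steps once per seam is what shrinks the image width in the loop of the Rust program.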

We also need to install the imageproc and image libraries and switch off the default features to make sure that multi-threading is disabled (default-features = false). After disabling the default features, you need to explicitly specify only the features that you need:

# ./my-program/Cargo.toml
[package]
name = "my-program"
version = "0.1.0"
edition = "2021"

[dependencies.image]
version = "0.24.4"
default-features = false
features = ["png", "jpeg", "bmp"]

[dependencies.imageproc]
version = "0.23.0"
default-features = false

We can now build the Rust program into a WASM blob using cargo:

cd my-program && cargo build --target wasm32-wasi --release

This command navigates to the my-program directory and builds the project using Cargo with the target set to wasm32-wasi in release mode.

This will generate a WASM file at ./my-program/target/wasm32-wasi/release/my-program.wasm which can now be run on Bacalhau.
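
Before submitting the file, it can be useful to confirm that the build really produced a WebAssembly binary. WebAssembly binary modules start with the four bytes 0x00 'a' 's' 'm'; the sketch below checks only that prefix. The helper name looks_like_wasm is hypothetical, and this is a quick sanity check rather than full validation of the module.

```python
WASM_MAGIC = b"\x00asm"  # first four bytes of a WebAssembly binary module


def looks_like_wasm(path: str) -> bool:
    """Return True if the file starts with the WebAssembly magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == WASM_MAGIC


if __name__ == "__main__":
    import os

    wasm_path = "./my-program/target/wasm32-wasi/release/my-program.wasm"
    if os.path.exists(wasm_path):
        print(looks_like_wasm(wasm_path))
```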

2. Running WASM on Bacalhau

Now that we have a WASM binary, we can upload it to IPFS and use it as input to a Bacalhau job.

The -i flag allows specifying a URI to be mounted as a named volume in the job, which can be an IPFS CID, HTTP URL, or S3 object.

For this example, we are using an image of the Statue of Liberty that has been pinned to a storage facility.

export JOB_ID=$(bacalhau wasm run \
    ./my-program/target/wasm32-wasi/release/my-program.wasm _start \
    --id-only \
    -i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs)

Structure of the Commands

bacalhau wasm run: call to Bacalhau
./my-program/target/wasm32-wasi/release/my-program.wasm: the path to the WASM file that will be executed
_start: the entry point of the WASM program, where its execution begins
--id-only: this flag indicates that only the identifier of the executed job should be returned
-i ipfs://bafybeifdpl6dw7atz6uealwjdklolvxrocavceorhb3eoq6y53cbtitbeu:/inputs: input data volume that will be accessible within the job at the specified destination path

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

We can now get the results. You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (wasm_results) and downloaded our job output to be stored in that directory.

rm -rf wasm_results && mkdir -p wasm_results
bacalhau get ${JOB_ID} --output-dir wasm_results

Viewing Job Output

When we view the files, we can see the original image, the resulting shrunk image, and the seams that were removed.

./wasm_results/outputs/original.png

./wasm_results/outputs/annotated_gradients.png

./wasm_results/outputs/shrunk.png

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Generate Synthetic Data using Sparkov Data Generation technique

Introduction

A synthetic dataset is generated by algorithms or simulations and has characteristics similar to real-world data. Collecting real-world data, especially data that contains sensitive user information such as credit card details, is often not possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit fraud, they can use synthetically generated data instead of real data, without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running Sparkov Locally

To run Sparkov locally, you'll need to clone the repo and install dependencies:

git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt

Go to the Sparkov_Data_Generation directory:

cd Sparkov_Data_Generation

Create a temporary directory (outputs) to store the outputs:

mkdir ../outputs

2. Running the script

python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"

The command above executes the Python script datagen.py, passing the following arguments to it:

-n 1000: Number of customers to generate
-o ../outputs: path to store the outputs
"01-01-2022": Start date
"10-01-2022": End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from 01-01-2022 to 10-01-2022 and saves the results in the ../outputs directory.

To see the full list of options, use:

python3 datagen.py -h

3. Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image:

FROM python:3.8

RUN apt-get update && apt-get install -y git

RUN git clone https://github.com/js-ts/Sparkov_Data_Generation/

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt

These commands specify how the image will be built, and what extra requirements will be included. We use python:3.8 as the base image, install git, clone the Sparkov_Data_Generation repository from GitHub, set the working directory inside the container to /Sparkov_Data_Generation/, and install Python dependencies listed in the requirements.txt file.

See more information on how to containerize your script/app here

Build the container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your Docker Hub username. If you don’t have a Docker Hub account, follow these instructions to create one, and use the username of the account you created

repo-name with the name of the container; you can name it anything you want

tag with a tag for the image. This is not required, but you can use the latest tag

In our case:

docker build -t jsacex/sparkov-data-generation .

Push the container

Next, upload the image to the registry, using the same Docker Hub username, repo name, and tag:

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/sparkov-data-generation

After the image has been pushed to Docker Hub, we can use the container to run jobs on Bacalhau.

4. Running a Bacalhau Job

Now we're ready to run a Bacalhau job:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsacex/sparkov-data-generation \
    --  python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022")

Structure of the command:

bacalhau docker run: call to Bacalhau
jsacex/sparkov-data-generation: the name of the docker image we are using
-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022": the arguments passed into the container, specifying the execution of the Python script datagen.py with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

5. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results

6. Viewing your Job Output

To view the contents of the results directory, run the following command:

ls results/outputs

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

# write following to the file bacalhau.py
import json, glob, os, multiprocessing, shutil, subprocess, tempfile, time


# checkStatusOfJob checks the status of a Bacalhau job
def checkStatusOfJob(job_id: str) -> str:
    assert len(job_id) > 0
    p = subprocess.run(
        ["bacalhau", "list", "--output", "json", "--id-filter", job_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    r = parseJobStatus(p.stdout)
    if r == "":
        print("job status is empty! %s" % job_id)
    elif r == "Completed":
        print("job completed: %s" % job_id)
    else:
        print("job not completed: %s - %s" % (job_id, r))
    return r


# submitJob submits a job to the Bacalhau network
def submitJob(cid: str) -> str:
    assert len(cid) > 0
    p = subprocess.run(
        [
            "bacalhau",
            "docker",
            "run",
            "--id-only",
            "--wait=false",
            "--input",
            "ipfs://" + cid + ":/inputs/data.tar.gz",
            "ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if p.returncode != 0:
        print("failed (%d) job: %s" % (p.returncode, p.stdout))
    job_id = p.stdout.strip()
    print("job submitted: %s" % job_id)
    return job_id


# getResultsFromJob gets the results from a Bacalhau job
def getResultsFromJob(job_id: str) -> str:
    assert len(job_id) > 0
    temp_dir = tempfile.mkdtemp()
    print("getting results for job: %s" % job_id)
    for i in range(0, 5):  # try 5 times
        p = subprocess.run(
            [
                "bacalhau",
                "get",
                "--output-dir",
                temp_dir,
                job_id,
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if p.returncode == 0:
            break
        else:
            print("failed (exit %d) to get job: %s" % (p.returncode, p.stdout))
    return temp_dir


# parseJobStatus parses the status of a Bacalhau job
def parseJobStatus(result: str) -> str:
    if len(result) == 0:
        return ""
    r = json.loads(result)
    if len(r) > 0:
        return r[0]["State"]["State"]
    return ""


# parseHashes splits lines from a text file into a list
def parseHashes(filename: str) -> list:
    assert os.path.exists(filename)
    with open(filename, "r") as f:
        hashes = f.read().splitlines()
    return hashes


def main(file: str, num_files: int = -1):
    # Use multiprocessing to work in parallel
    count = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=count) as pool:
        hashes = parseHashes(file)[:num_files]
        print("submitting %d jobs" % len(hashes))
        job_ids = pool.map(submitJob, hashes)
        assert len(job_ids) == len(hashes)

        print("waiting for jobs to complete...")
        while True:
            job_statuses = pool.map(checkStatusOfJob, job_ids)
            total_finished = sum(map(lambda x: x == "Completed", job_statuses))
            if total_finished >= len(job_ids):
                break
            print("%d/%d jobs completed" % (total_finished, len(job_ids)))
            time.sleep(2)

        print("all jobs completed, saving results...")
        results = pool.map(getResultsFromJob, job_ids)
        print("finished saving results")

        # Do something with the results
        shutil.rmtree("results", ignore_errors=True)
        os.makedirs("results", exist_ok=True)
        for r in results:
            path = os.path.join(r, "outputs", "*.csv")
            csv_file = glob.glob(path)
            for f in csv_file:
                print("moving %s to results" % f)
                shutil.move(f, "results")


if __name__ == "__main__":
    main("hashes.txt", 10)
