1 of 6

Python

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Sample Recommendation Dataset

We will be using a simple recommendation script that, when given a movie ID, recommends other movies based on user ratings. Assuming you want recommendations for the movie 'Toy Story' (1995), it will suggest movies from similar categories:

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

Download Movielens1M dataset from this link https://files.grouplens.org/datasets/movielens/ml-1m.zip

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

In this example, we’ll be using 2 files from the MovieLens 1M dataset: ratings.dat and movies.dat. After the dataset is downloaded, extract the zip and place ratings.dat and movies.dat into a folder called input:

# Extracting the downloaded zip file
unzip ml-1m.zip

#moving  ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

To create a requirements.txt for the Python libraries we’ll be using, create:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and in it paste the following script

# content of the similar-movies.py

# Imports
import numpy as np
import pandas as pd
import argparse
from distutils.dir_util import mkpath
import warnings
warnings.filterwarnings("ignore")
# Read the files with pandas
data = pd.io.parsers.read_csv('input/ratings.dat',
names=['user_id', 'movie_id', 'rating', 'time'],
engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.io.parsers.read_csv('input/movies.dat',
names=['movie_id', 'title', 'genre'],
engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users

ratings_mat = np.ndarray(
shape=((np.max(data.movie_id.values)), np.max(data.user_id.values)),
dtype=np.uint8)
ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.values

# Normalise matrix (subtract mean off)

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute SVD

normalised_mat = ratings_mat - np.matrix(np.mean(ratings_mat, 1)).T
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)

# Calculate cosine similarity, sort by most similar, and return the top N.

def top_cosine_similarity(data, movie_id, top_n=10):

index = movie_id - 1
# Movie id starts from 1

movie_row = data[index, :]
magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
sort_indexes = np.argsort(-similarity)
return sort_indexes[:top_n]

# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
print('Recommendations for {0}: \n'.format(
movie_data[movie_data.movie_id == movie_id].title.values[0]))
for id in top_indexes + 1:
print(str(id),' : ',movie_data[movie_data.movie_id == id].title.values[0])


parser = argparse.ArgumentParser(description='Personal information')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies',default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie',default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations',default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id # Grab an id from movies.dat
top_n = args.n

# k = 50
# # Grab an id from movies.dat
# movie_id = 1
# top_n = 10

sliced = evecs[:, :k] # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

Read the files with pandas. The code uses Pandas to read data from the files ratings.dat and movies.dat.
Create the ratings matrix of shape (m×u) with rows as movies and columns as user
Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.
Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.
Calculate cosine similarity, sort by most similar, and return the top N.
Select k principal components to represent the movies, a movie_id to find recommendations, and print the top_n results.

For further reading on how the script works, go to Simple Movie Recommender Using SVD | Alyssa

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

We will use the python:3.8 docker image and add our script similar-movies.py to copy the script to the docker image, similarly, we also add the dataset directory and also the requirements, after that run the command to install the dependencies in the image

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. You can submit a Bacalhau job by running your container on Bacalhau with default or custom parameters.

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name and of the docker image we are using
-- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name of the docker image we are using
-- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Pandas on Bacalhau

Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

In this tutorial example, we will run Pandas script on Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Running Pandas Locally

To run the Pandas script on Bacalhau for analysis, first, we will place the Pandas script in a container and then run it at scale on Bacalhau.

To get started, you need to install the Pandas library from pip:

pip install pandas

Importing data from CSV to DataFrame

Pandas is built around the idea of a DataFrame, a container for representing data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line. The values within the record are separated using the “comma” character. Pandas provides a useful method, named read_csv() to read the contents of the CSV file into a DataFrame. For example, we can create a file named transactions.csv containing details of Transactions. The CSV file is stored in the same directory that contains the Python script.

# read_csv.py
import pandas as pd

print(pd.read_csv("transactions.csv"))

The overall purpose of the command above is to read data from a CSV file (transactions.csv) using Pandas and print the resulting DataFrame.

To download the transactions.csv file, run:

wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

To output a content of the transactions.csv file, run:

cat transactions.csv

Running the script

Now let's run the script to read in the CSV file. The output will be a DataFrame object.

python3 read_csv.py

2. Ingesting Data

To run Pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.

If you are interested in finding out more about how to ingest your data into IPFS, please see the data ingestion guide.

We've already uploaded the script and data to IPFS to the following CID: QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz. You can look at this by browsing to one of the HTTP IPFS proxies like ipfs.io or w3s.link.

3. Running a Bacalhau Job

Now we're ready to run a Bacalhau job, whilst mounting the Pandas script and data from IPFS. We'll use the bacalhau docker run command to do this:

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
    -w /files \
    amancevice/pandas \
    -- python read_csv.py)

Structure of the command

bacalhau docker run: call to Bacalhau
amancevice/pandas : Docker image with pandas installed.
-i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files: Mounting the uploaded dataset to path. The -i flag allows us to mount a file or directory from IPFS into the container. It takes two arguments, the first is the IPFS CID
QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz) and the second is the file path within IPFS (/files). The -i flag can be used multiple times to mount multiple directories.
-w /files Our working directory is /files. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python read_csv.py: python script to read pandas script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running a Python Script

This tutorial serves as an introduction to Bacalhau. In this example, you'll be executing a simple "Hello, World!" Python script hosted on a website on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information

1. Running Python Locally

We'll be using a very simple Python script that displays the . Create a file called hello-world.py:

# hello-world.py
print("Hello, world!")

Running the script to print out the output:

python3 hello-world.py

After the script has run successfully locally we can now run it on Bacalhau.

2. Running a Bacalhau Job

To submit a workload to Bacalhau you can use the bacalhau docker run command. This command allows passing input data into the container using volumes, we will be using the --input URL:path for simplicity. This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

, so we must run the full command after the -- argument.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \
    python:3.10-slim \
    -- python3 /inputs/hello-world.py)

bacalhau docker run: call to Bacalhau
--id-only: specifies that only the job identifier (job_id) will be returned after executing the container, not the entire output
--input https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py \: indicates where to get the input data for the container. In this case, the input data is downloaded from the specified URL, which represents the Python script "hello-world.py".
python:3.10-slim: the Docker image that will be used to run the container. In this case, it uses the Python 3.10 image with a minimal set of components (slim).
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
python3 /inputs/hello-world.py: running the hello-world.py Python script stored in /inputs.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

name: Running Trivial Python
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: python:3.10-slim
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 /inputs/hello-world.py
    InputSources:
      - Target: /inputs
        Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/bacalhau-project/examples/151eebe895151edd83468e3d8b546612bf96cd05/workload-onboarding/trivial-python/hello-world.py
            Path: /inputs/hello-world.py

The job description should be saved in .yaml format, e.g. helloworld.yaml, and then run with the command:

bacalhau job run helloworld.yaml

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

rm -rf results && mkdir results
bacalhau get ${JOB_ID} --output-dir results

To view the file, run the following command:

cat results/stdout

Running Jupyter Notebooks on Bacalhau

Introduction

Jupyter Notebooks have become an essential tool for data scientists, researchers, and developers for interactive computing and the development of data-driven projects. They provide an efficient way to share code, equations, visualizations, and narrative text with support for multiple programming languages. In this tutorial, we will introduce you to running Jupyter Notebooks on Bacalhau, a powerful and flexible container orchestration platform. By leveraging Bacalhau, you can execute Jupyter Notebooks in a scalable and efficient manner using Docker containers, without the need for manual setup or configuration.

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.
Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)

/inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.
-i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.
https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.
jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.
--execute: This flag tells nbconvert to execute the notebook and store the results in the output file.
--to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.
--output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list:

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe:

bacalhau describe ${JOB_ID}

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

hello_output.nbconvert.ipynb

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

Install Docker on your local machine.
Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

# Use the official Python image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

RUN apt-get update -y

COPY mnist.ipynb .
# Install the Python packages
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install -U scikit-learn

This Dockerfile creates a Docker image based on the official TensorFlow GPU-enabled image, sets the working directory to the root, updates the package list, and copies an IPython notebook (mnist.ipynb) and a requirements.txt file. It then upgrades pip and installs Python packages from the requirements.txt file, along with scikit-learn. The resulting image provides an environment ready for running the mnist.ipynb notebook with TensorFlow and scikit-learn, as well as other specified dependencies.

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .

Replace "your-dockerhub-username" with your actual DockerHub username. This command will build the Docker image and tag it with your DockerHub username and the name "your-dockerhub-username/jupyter-mnist-tensorflow".

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

docker push your-dockerhub-username/jupyter-mnist-tensorflow

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)

Structure of the command

--gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.
jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Scripting Bacalhau with Python

Bacalhau allows you to easily execute batch jobs via the CLI. But sometimes you need to do more than that. You might need to execute a script that requires user input, or you might need to execute a script that requires a lot of parameters. In any case, you probably want to execute your jobs in a repeatable manner.

This example demonstrates a simple Python script that is able to orchestrate the execution of lots of jobs in a repeatable manner.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Executing Bacalhau Jobs with Python Scripts

To demonstrate this example, I will use the data generated from an Ethereum example. This produced a list of hashes that I will iterate over and execute a job for each one.

#write following to the file hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm

Now let's create a file called bacalhau.py. The script below automates the submission, monitoring, and retrieval of results for multiple Bacalhau jobs in parallel. It is designed to be used in a scenario where there are multiple hash files, each representing a job, and the script manages the execution of these jobs using Bacalhau commands.

# write following to the file bacalhau.py
import json, glob, os, multiprocessing, shutil, subprocess, tempfile, time

# checkStatusOfJob checks the status of a Bacalhau job
def checkStatusOfJob(job_id: str) -> str:
    assert len(job_id) > 0
    p = subprocess.run(
        ["bacalhau", "list", "--output", "json", "--id-filter", job_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    r = parseJobStatus(p.stdout)
    if r == "":
        print("job status is empty! %s" % job_id)
    elif r == "Completed":
        print("job completed: %s" % job_id)
    else:
        print("job not completed: %s - %s" % (job_id, r))

    return r


# submitJob submits a job to the Bacalhau network
def submitJob(cid: str) -> str:
    assert len(cid) > 0
    p = subprocess.run(
        [
            "bacalhau",
            "docker",
            "run",
            "--id-only",
            "--wait=false",
            "--input",
            "ipfs://" + cid + ":/inputs/data.tar.gz",
            "ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6",
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if p.returncode != 0:
        print("failed (%d) job: %s" % (p.returncode, p.stdout))
    job_id = p.stdout.strip()
    print("job submitted: %s" % job_id)

    return job_id


# getResultsFromJob gets the results from a Bacalhau job
def getResultsFromJob(job_id: str) -> str:
    assert len(job_id) > 0
    temp_dir = tempfile.mkdtemp()
    print("getting results for job: %s" % job_id)
    for i in range(0, 5): # try 5 times
        p = subprocess.run(
            [
                "bacalhau",
                "get",
                "--output-dir",
                temp_dir,
                job_id,
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if p.returncode == 0:
            break
        else:
            print("failed (exit %d) to get job: %s" % (p.returncode, p.stdout))

    return temp_dir


# parseJobStatus parses the status of a Bacalhau job
def parseJobStatus(result: str) -> str:
    if len(result) == 0:
        return ""
    r = json.loads(result)
    if len(r) > 0:
        return r[0]["State"]["State"]
    return ""


# parseHashes splits lines from a text file into a list
def parseHashes(filename: str) -> list:
    assert os.path.exists(filename)
    with open(filename, "r") as f:
        hashes = f.read().splitlines()
    return hashes


def main(file: str, num_files: int = -1):
    # Use multiprocessing to work in parallel
    count = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=count) as pool:
        hashes = parseHashes(file)[:num_files]
        print("submitting %d jobs" % len(hashes))
        job_ids = pool.map(submitJob, hashes)
        assert len(job_ids) == len(hashes)

        print("waiting for jobs to complete...")
        while True:
            job_statuses = pool.map(checkStatusOfJob, job_ids)
            total_finished = sum(map(lambda x: x == "Completed", job_statuses))
            if total_finished >= len(job_ids):
                break
            print("%d/%d jobs completed" % (total_finished, len(job_ids)))
            time.sleep(2)

        print("all jobs completed, saving results...")
        results = pool.map(getResultsFromJob, job_ids)
        print("finished saving results")

        # Do something with the results
        shutil.rmtree("results", ignore_errors=True)
        os.makedirs("results", exist_ok=True)
        for r in results:
            path = os.path.join(r, "outputs", "*.csv")
            csv_file = glob.glob(path)
            for f in csv_file:
                print("moving %s to results" % f)
                shutil.move(f, "results")

if __name__ == "__main__":
    main("hashes.txt", 10)

This code has a few interesting features:

Change the value in the main call (main("hashes.txt", 10)) to change the number of jobs to execute.
Because all jobs are complete at different times, there's a loop to check that all jobs have been completed before downloading the results. If you don't do this, you'll likely see an error when trying to download the results. The while True loop is used to monitor the status of jobs and wait for them to complete.
When downloading the results, the IPFS get often times out, so I wrapped that in a loop. The for i in range(0, 5) loop in the getResultsFromJob function involves retrying the bacalhau get operation if it fails to complete successfully.

Let's run it!

python3 bacalhau.py

Hopefully, the results directory contains all the combined results from the jobs we just executed. Here's we're expecting to see CSV files:

ls results

transactions_00000000_00049999.csv  transactions_00150000_00199999.csv
transactions_00050000_00099999.csv  transactions_00200000_00249999.csv
transactions_00100000_00149999.csv  transactions_00250000_00299999.csv

Success! We've now executed a bunch of jobs in parallel using Python. This is a great way to execute lots of jobs in a repeatable manner. You can alter the file above for your purposes.

Next Steps

You might also be interested in the following examples:

Analysing Data with Python Pandas

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Building and Running Custom Python Container

Introduction

In this tutorial example, we will walk you through building your own Python container and running the container on Bacalhau.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

1. Sample Recommendation Dataset

Recommendations for Toy Story (1995):
1  :  Toy Story (1995)
58  :  Postino, Il (The Postman) (1994)
3159  :  Fantasia 2000 (1999)
359  :  I Like It Like That (1994)
756  :  Carmen Miranda: Bananas Is My Business (1994)
618  :  Two Much (1996)
48  :  Pocahontas (1995)
2695  :  Boys, The (1997)
2923  :  Citizen's Band (a.k.a. Handle with Care) (1977)
688  :  Operation Dumbo Drop (1995)

Downloading the dataset

Download Movielens1M dataset from this link https://files.grouplens.org/datasets/movielens/ml-1m.zip

wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

# Extracting the downloaded zip file
unzip ml-1m.zip

#moving  ratings.dat and movies.dat into a folder called 'input'
mkdir input; mv ml-1m/movies.dat ml-1m/ratings.dat input/

The structure of the input directory should be

input
├── movies.dat
└── ratings.dat

Installing Dependencies

To create a requirements.txt for the Python libraries we’ll be using, create:

# content of the requirements.txt
numpy
pandas

To install the dependencies, run:

pip install -r requirements.txt

Writing the Script

Create a new file called similar-movies.py and in it paste the following script

# content of the similar-movies.py

# Imports
import numpy as np
import pandas as pd
import argparse
from distutils.dir_util import mkpath
import warnings
warnings.filterwarnings("ignore")
# Read the files with pandas
data = pd.io.parsers.read_csv('input/ratings.dat',
names=['user_id', 'movie_id', 'rating', 'time'],
engine='python', delimiter='::', encoding='latin-1')
movie_data = pd.io.parsers.read_csv('input/movies.dat',
names=['movie_id', 'title', 'genre'],
engine='python', delimiter='::', encoding='latin-1')

# Create the ratings matrix of shape (m×u) with rows as movies and columns as users

ratings_mat = np.ndarray(
shape=((np.max(data.movie_id.values)), np.max(data.user_id.values)),
dtype=np.uint8)
ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.values

# Normalise matrix (subtract mean off)

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

# Compute SVD

normalised_mat = ratings_mat - np.matrix(np.mean(ratings_mat, 1)).T
cov_mat = np.cov(normalised_mat)
evals, evecs = np.linalg.eig(cov_mat)

# Calculate cosine similarity, sort by most similar, and return the top N.

def top_cosine_similarity(data, movie_id, top_n=10):

index = movie_id - 1
# Movie id starts from 1

movie_row = data[index, :]
magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
sort_indexes = np.argsort(-similarity)
return sort_indexes[:top_n]

# Helper function to print top N similar movies
def print_similar_movies(movie_data, movie_id, top_indexes):
print('Recommendations for {0}: \n'.format(
movie_data[movie_data.movie_id == movie_id].title.values[0]))
for id in top_indexes + 1:
print(str(id),' : ',movie_data[movie_data.movie_id == id].title.values[0])


parser = argparse.ArgumentParser(description='Personal information')
parser.add_argument('--k', dest='k', type=int, help='principal components to represent the movies',default=50)
parser.add_argument('--id', dest='id', type=int, help='Id of the movie',default=1)
parser.add_argument('--n', dest='n', type=int, help='No of recommendations',default=10)

args = parser.parse_args()
k = args.k
movie_id = args.id # Grab an id from movies.dat
top_n = args.n

# k = 50
# # Grab an id from movies.dat
# movie_id = 1
# top_n = 10

sliced = evecs[:, :k] # representative data
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, top_indexes)

What the similar-movies.py script does

Read the files with pandas. The code uses Pandas to read data from the files ratings.dat and movies.dat.
Create the ratings matrix of shape (m×u) with rows as movies and columns as user
Normalise matrix (subtract mean off). The ratings matrix is normalized by subtracting the mean off.
Compute SVD: a singular value decomposition (SVD) of the normalized ratings matrix is performed.
Calculate cosine similarity, sort by most similar, and return the top N.
Select k principal components to represent the movies, a movie_id to find recommendations, and print the top_n results.

For further reading on how the script works, go to Simple Movie Recommender Using SVD | Alyssa

Running the Script

Running the script similar-movies.py using the default values:

python similar-movies.py

You can also use other flags to set your own values.

2. Setting Up Docker

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:3.8
ADD similar-movies.py .
ADD /input input
COPY ./requirements.txt /requirements.txt
RUN pip install -r requirements.txt

The final folder structure will look like this:

├── Dockerfile
├── input
│   ├── movies.dat
│   └── ratings.dat
├── requirements.txt
└── similar-movies.py

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required, but you can use the latest tag

In our case:

docker build -t jsace/python-similar-movies .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/python-similar-movies

3. Running a Bacalhau Job

Running the Container with Default Parameters

To submit a Bacalhau job by running your container on Bacalhau with default parameters, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    jsace/python-similar-movies \
    -- python similar-movies.py)

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name and of the docker image we are using
-- python similar-movies.py: execute the Python script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Running the Container with Custom Parameters

To submit a Bacalhau job by running your container on Bacalhau with custom parameters, run the following Bacalhau command:

bacalhau docker run \
    jsace/python-similar-movies \
    -- python similar-movies.py --k 50 --id 10 --n 10

Structure of the command

bacalhau docker run: call to Bacalhau
jsace/python-similar-movies: the name of the docker image we are using
-- python similar-movies.py --k 50 --id 10 --n 10: execute the python script. The script will use Singular Value Decomposition (SVD) and cosine similarity to find 10 movies most similar to the one with identifier 10, using 50 principal components.

4. Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

5. Viewing your Job Output

To view the file, run the following command:

cat results/stdout # displays the contents of the file

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Running Jupyter Notebooks on Bacalhau

Introduction

In the following sections, we will explore two examples of executing Jupyter Notebooks on Bacalhau:

Executing a Simple Hello World Notebook: We will begin with a basic example to familiarize you with the process of running a Jupyter Notebook on Bacalhau. We will execute a simple "Hello, World!" notebook to demonstrate the steps required for running a notebook in a containerized environment.
Notebook to Train an MNIST Model: In this section, we will dive into a more advanced example. We will execute a Jupyter Notebook that trains a machine-learning model on the popular MNIST dataset. This will showcase the potential of Bacalhau to handle more complex tasks while providing you with insights into utilizing containerized environments for your data science projects.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

1. Executing a Simple Hello World Notebook

There are no external dependencies that we need to install. All dependencies are already there in the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -w /inputs \
    -i https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb \
    jsacex/jupyter \
    -- jupyter nbconvert --execute --to notebook --output /outputs/hello_output.ipynb hello.ipynb)

/inputs/hello.ipynb: This is the path of the input Jupyter Notebook inside the Docker container.
-i: This flag stands for "input" and is used to provide the URL of the input Jupyter Notebook you want to execute.
https://raw.githubusercontent.com/js-ts/hello-notebook/main/hello.ipynb: This is the URL of the input Jupyter Notebook.
jsacex/jupyter: This is the name of the Docker image used for running the Jupyter Notebook. It is a minimal Jupyter Notebook stack based on the official Jupyter Docker Stacks.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert: This is the primary command used to convert and execute Jupyter Notebooks. It allows for the conversion of notebooks to various formats, including execution.
--execute: This flag tells nbconvert to execute the notebook and store the results in the output file.
--to notebook: This option specifies the output format. In this case, we want to keep the output as a Jupyter Notebook.
--output /outputs/hello_output.ipynb: This option specifies the path and filename for the output Jupyter Notebook, which will contain the results of the executed input notebook.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list:

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe:

bacalhau describe ${JOB_ID}

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

hello_output.nbconvert.ipynb

2. Running Notebook to Train an MNIST Model

Building the container (optional)

Prerequisite

Install Docker on your local machine.
Sign up for a DockerHub account if you don't already have one.

Step 1: Create a Dockerfile

Create a new file named Dockerfile in your project directory with the following content:

# Use the official Python image as the base image
FROM tensorflow/tensorflow:nightly-gpu

# Set the working directory in the container
WORKDIR /

RUN apt-get update -y

COPY mnist.ipynb .
# Install the Python packages
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip

# Install the Python packages
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install -U scikit-learn

Step 2: Build the Docker Image

In your terminal, navigate to the directory containing the Dockerfile and run the following command to build the Docker image:

docker build -t your-dockerhub-username/jupyter-mnist-tensorflow:latest .

Step 3: Push the Docker Image to DockerHub

Once the build process is complete, push the Docker image to DockerHub using the following command:

docker push your-dockerhub-username/jupyter-mnist-tensorflow

Again, replace "your-dockerhub-username" with your actual DockerHub username. This command will push the Docker image to your DockerHub repository.

Running the job on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1 \
    -i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git \
    jsacex/jupyter-tensorflow-mnist:v02 \
    -- jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb)

Structure of the command

--gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-i gitlfs://huggingface.co/datasets/VedantPadwal/mnist.git: The -i flag is used to clone the MNIST dataset from Hugging Face's repository using Git LFS. The files will be mounted inside the container.
jsacex/jupyter-tensorflow-mnist:v02: The name and the tag of the Docker image.
--: This double dash is used to separate the Bacalhau command options from the command that will be executed inside the Docker container.
jupyter nbconvert --execute --to notebook --output /outputs/mnist_output.ipynb mnist.ipynb: The command to be executed inside the container. In this case, it runs the jupyter nbconvert command to execute the mnist.ipynb notebook and save the output as mnist_output.ipynb in the /outputs directory.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau list.

bacalhau list --id-filter=${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau describe.

bacalhau describe ${JOB_ID}

rm -rf results && mkdir results # Temporary directory to store the results
bacalhau get ${JOB_ID} --output-dir results # Download the results

After the download has finished you can see the contents in the results directory, running the command below:

ls results/outputs

The outputs include our trained model and the Jupyter notebook with the output cells.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).