1 of 41

Use Cases

Examples

Table of Contents for Bacalhau Examples

Below you will find a table of contents for a wide array of examples on how to use Bacalhau.

Data Engineering

This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.

Using Bacalhau with DuckDB

Introduction

DuckDB is a relational table-oriented database management system and supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.

DuckDB is suited for the following use cases:

Processing and storing tabular datasets, e.g. from CSV or Parquet files
Interactive data analysis, e.g. Joining & aggregate multiple large tables
Concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
Large result set transfer to client

In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install, there is no need to download the datasets since the datasets are are already available on the server you are looking to run the query on.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

We will be using Bacalhau setup with the standard setting up Bacalhau network.

We will also need to have a server with some data to run the query on. In this example, we will use a server with the Yellow Taxi Trips dataset.

If you do not already have this data on your server, you can download it using the scripts in the prep_data directory. The command to download the data is ./prep_data/run_download_jobs.sh - and you must have the /bacalhau_data directory on your server.

Running the Query

To submit a job, run the following Bacalhau command:

bacalhau docker run -e QUERY="select 1" docker.io/bacalhauproject/duckdb:latest

This is a simple query that will return a single row with a single column - but the query will be executed in DuckDB, on a remote server.

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau
-e QUERY="select 1": the query to execute
docker.io/bacalhauproject/duckdb:latest: the name and the tag of the docker image we are using

When a job is submitted, Bacalhau runs the query in DuckDB, and returns the results to the client.

After we run it, when we describe the job, we can see the following in standard output:

Standard Output
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘

Running with a YAML file

What if you didn't want to run everything on the command line? You can use a YAML file to define the job. In simple_query.sql, we have a simple query that will return the number of rows in the dataset.

-- simple_query.sql
SELECT COUNT(*) AS row_count FROM yellow_taxi_trips;

To run this query, we can use the following YAML file:

# duckdb_query_job.yaml
Tasks:
  - Engine:
      Params:
        Image: docker.io/bacalhauproject/duckdb:latest
        WorkingDirectory: ""
        EnvironmentVariables:
          - QUERY=WITH yellow_taxi_trips AS (SELECT * FROM read_parquet('{{ .filename }}')) {{ .query }}
      Type: docker
    Name: duckdb-query-job
    InputSources:
      - Source:
          Type: "localDirectory"
          Params:
            SourcePath: "/bacalhau_data"
            ReadWrite: true
        Target: "/bacalhau_data"
    Publisher:
      Type: "local"
      Params:
        TargetPath: "/bacalhau_data"
    Network:
      Type: Full
    Resources:
      CPU: 2000m
      Memory: 2048Mi
    Timeouts: {}
Type: batch
Count: 1

Though this looks like a lot of code, it is actually quite simple. The Tasks section defines the task to run, and the InputSources section defines the input dataset. The Publisher section defines where the results will be published, and the Resources section defines the resources required for the job.

All the work is done in the environment variables, which are passed to the Docker image, and handed to DuckDB to execute the query.

To run this query, we can use the following command:

bacalhau job run duckdb_query_job.yaml --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet" --template-vars="QUERY=$(cat simple_query.sql)"

This breaks down into the following steps:

bacalhau job run: call to bacalhau
duckdb_query_job.yaml: the YAML file we are using
--template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet": the file to read
--template-vars="QUERY=$(cat simple_query.sql)": the query to execute

When we run this, we get back the following simple output:

Standard Output
┌───────────┐
│ row_count │
│   int64   │
├───────────┤
│   6299367 │
└───────────┘

More complex queries

Let's say we want to run a more complex query. In window_query_simple.sql, we have a query that will return the average number of rides per 5 minute interval.

-- window_query_simple.sql
SELECT
    DATE_TRUNC('hour', tpep_pickup_datetime) + INTERVAL (FLOOR(EXTRACT(MINUTE FROM tpep_pickup_datetime) / 5) * 5) MINUTE AS interval_start,
    COUNT(*) AS ride_count
FROM
    yellow_taxi_trips
GROUP BY
    interval_start
ORDER BY
    interval_start;

When we run this, we get back the following output:

┌─────────────────────┬────────────┐
│   interval_start    │ ride_count │
│      timestamp      │   int64    │
├─────────────────────┼────────────┤
│ 2008-12-31 22:20:00 │          1 │
│ 2008-12-31 23:00:00 │          1 │
│ 2008-12-31 23:05:00 │          1 │
│ 2008-12-31 23:10:00 │          1 │
│ 2008-12-31 23:15:00 │          1 │
│ 2008-12-31 23:30:00 │          1 │
│ 2009-01-01 00:00:00 │          3 │
│ 2009-01-01 00:05:00 │          3 │
│ 2009-01-01 00:15:00 │          1 │
│ 2009-01-01 00:40:00 │          1 │
│ 2009-01-01 01:15:00 │          1 │
│ 2009-01-01 01:20:00 │          1 │
│ 2009-01-01 01:35:00 │          1 │
│ 2009-01-01 01:40:00 │          1 │
│ 2009-01-01 02:00:00 │          2 │
│ 2009-01-01 02:15:00 │          1 │
│ 2009-01-01 04:05:00 │          1 │
│ 2009-01-01 04:15:00 │          2 │
│ 2009-01-01 04:45:00 │          1 │
│ 2009-01-01 06:30:00 │          1 │
│          ·          │          · │
│          ·          │          · │
│          ·          │          · │
│ 2020-03-05 12:15:00 │          1 │

The sql file needs to be run in a single line, otherwise the line breaks will cause some issues with the templating. We're working on improving this!

With this structure, you can now run virtually any query you want on remote servers, without ever having to download the data. Welcome to compute over data by Bacalhau!

Need Help?

If you get stuck or have questions:

Check out the official Bacalhau Documentation
Open an issue in our GitHub repository
Join our Slack

Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau

Introduction

Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even more in my experience) and importing and exporting data from the Ethereum node is slow.

For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can both now access the data without having to run another Ethereum node.

But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.

In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Analysing Ethereum Data Locally

First let's download one of the IPFS files and inspect it locally:

wget -q -O file.tar.gz https://w3s.link/ipfs/bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
tar -xvf file.tar.gz

You can see the full list of IPFS CIDs in the appendix at the bottom of the page.

If you don't already have the Pandas library, let's install it:

pip install pandas

# Use pandas to read in transaction data and clean up the columns
import pandas as pd
import glob

file = glob.glob('output_*/transactions/start_block=*/end_block=*/transactions*.csv')[0]
print("Loading file %s" % file)
df = pd.read_csv(file)
df['value'] = df['value'].astype('float')
df['from_address'] = df['from_address'].astype('string')
df['to_address'] = df['to_address'].astype('string')
df['hash'] = df['hash'].astype('string')
df['block_hash'] = df['block_hash'].astype('string')
df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
df.info()

# Total volume per day
df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum().plot()

The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.

This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.

Analysing Ethereum Data With Bacalhau

To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.

But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.

# main.py
import glob, os, sys, shutil, tempfile
import pandas as pd

def main(input_dir, output_dir):
    search_path = os.path.join(input_dir, "output*", "transactions", "start_block*", "end_block*", "transactions_*.csv")
    csv_files = glob.glob(search_path)
    if len(csv_files) == 0:
        print("No CSV files found in %s" % search_path)
        sys.exit(1)
    for transactions_file in csv_files:
        print("Loading %s" % transactions_file)
        df = pd.read_csv(transactions_file)
        df['value'] = df['value'].astype('float')
        df['block_datetime'] = pd.to_datetime(df['block_timestamp'], unit='s')
        
        print("Processing %d blocks" % (df.shape[0]))
        results = df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1D')).sum()
        print("Finished processing %d days worth of records" % (results.shape[0]))

        save_path = os.path.join(output_dir, os.path.basename(transactions_file))
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        print("Saving to %s" % (save_path))
        results.to_csv(save_path)

def extractData(input_dir, output_dir):
    search_path = os.path.join(input_dir, "*.tar.gz")
    gz_files = glob.glob(search_path)
    if len(gz_files) == 0:
        print("No tar.gz files found in %s" % search_path)
        sys.exit(1)
    for f in gz_files:
        shutil.unpack_archive(filename=f, extract_dir=output_dir)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print('Must pass arguments. Format: [command] input_dir output_dir')
        sys.exit()
    with tempfile.TemporaryDirectory() as tmp_dir:
        extractData(sys.argv[1], tmp_dir)
        main(tmp_dir, sys.argv[2])

Next, let's make sure the file works as expected:

python main.py . outputs/

And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs and /outputs directories. The /inputs directory is where the data will be read from and the /outputs directory is where the results will be saved to.

FROM python:3.11-slim-bullseye
WORKDIR /src
RUN pip install pandas==1.5.1
ADD main.py .
CMD ["python", "main.py", "/inputs", "/outputs"]

We've already pushed the container, but for posterity, the following command pushes this container to GHCR.

docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.1 .

Running a Bacalhau Job

To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run command.

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --input ipfs://bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6)

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

The bacalhau docker run command allows passing input data volume with --input or -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Ethereum Blockchain Analysis with Ethereum-ETL
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs/data.tar.gz"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq"

The job description should be saved in .yaml format, e.g. blockchain.yaml, and then run with the command:

Copy

bacalhau job run blockchain.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results

Viewing your Job Output

To view the file, run the following command:

ls -lah results/outputs

Display the image

To view the images, we will use glob to return all file paths that match a specific pattern.

import glob
import pandas as pd

# Get CSV files list from a folder
csv_files = glob.glob("results/outputs/*.csv")
df = pd.read_csv(csv_files[0], index_col='block_datetime')
df.plot()

Massive Scale Ethereum Analysis

Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.

See the appendix for the hashes.txt file.

printf "" > job_ids.txt
for h in $(cat hashes.txt); do \
    bacalhau docker run \
    --id-only \
    --wait=false \
    --input=ipfs://$h:/inputs/data.tar.gz \
    ghcr.io/bacalhau-project/examples/blockchain-etl:0.0.6 >> job_ids.txt 
done

Now take a look at the job id's. You can use these to check the status of the jobs and download the results:

cat job_ids.txt

d840df7b-9318-4e5b-ab06-adb72dd95394
09d01f9c-9409-42b9-829d-92e22fcdd062
0072758f-3575-44d7-b193-da4a22f6bc86
2043dee4-fc82-4768-92cb-4d23dd2514b1
36ef8e9e-9eae-4218-81e6-15883d0a5b8d
932aa406-cd29-4933-b09f-c8cea4d77164
1f3e5273-bdd4-4ef0-b7ed-b83591fab64e
8bfabe96-54e3-4fee-b344-a0517c683268
1cd588a1-5c76-4f91-ba90-af7931bca596
b9c29531-e1b4-4520-b03d-7406a22bbdb3
8665b8be-24a9-4c78-9913-803d3e3c9a65
06115147-bc83-49e8-bb71-7b447c8ad1bc
84afed3e-831c-462b-a3e3-9a23bc7d6fb8
ed6e55e6-98d3-4bde-8ece-1f05838d489e
...

You might want to double-check that the jobs ran ok by doing a bacalhau job list.

bacalhau job list -n 50

Wait until all of these jobs have been completed. And then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There's also been some issues in the past communicating with IPFS, so if you get an error, try again.

for id in $(cat job_ids.txt); do \
    rm -rf results_$id && mkdir results_$id
    bacalhau job get --output-dir results_$id $id &
done
wait

Display the image

To view the images, we will use glob to return all file paths that match a specific pattern.

import os, glob
import pandas as pd

# Get CSV files list from a folder
path = os.path.join("results_*", "outputs", "*.csv")
csv_files = glob.glob(path)

# Read each CSV file into a list of DataFrames
df_list = (pd.read_csv(file, index_col='block_datetime') for file in csv_files)

# Concatenate all DataFrames
df_unsorted = pd.concat(df_list, ignore_index=False)

# Some files will cross days, so group by day and sum the values
df = df_unsorted.groupby(level=0).sum()

# Plot
df.plot(figsize=(16,9))

That's it! There are several years of Ethereum transaction volume data.

rm -rf results_* output_* outputs results temp # Remove temporary results

Appendix: List Ethereum Data CIDs

The following list is a list of IPFS CID's for the Ethereum data that we used in this tutorial. You can use these CID's to download the rest of the chain if you so desire. The CIDs are ordered by block number and they increase 50,000 blocks at a time. Here's a list of ordered CIDs:

# hashes.txt
bafybeihvtzberlxrsz4lvzrzvpbanujmab3hr5okhxtbgv2zvonqos2l3i
bafybeifb25fgxrzu45lsc47gldttomycqcsao22xa2gtk2ijbsa5muzegq
bafybeig4wwwhs63ly6wbehwd7tydjjtnw425yvi2tlzt3aii3pfcj6hvoq
bafybeievpb5q372q3w5fsezflij3wlpx6thdliz5xowimunoqushn3cwka
bafybeih6te26iwf5kzzby2wqp67m7a5pmwilwzaciii3zipvhy64utikre
bafybeicjd4545xph6rcyoc74wvzxyaz2vftapap64iqsp5ky6nz3f5yndm
bafybeicgo3iofo3sw73wenc3nkdhi263yytjnds5cxjwvypwekbz4sk7ra
bafybeihvep5xsvxm44lngmmeysihsopcuvcr34an4idz45ixl5slsqzy3y
bafybeigmt2zwzrbzwb4q2kt2ihlv34ntjjwujftvabrftyccwzwdypama4
bafybeiciwui7sw3zqkvp4d55p4woq4xgjlstrp3mzxl66ab5ih5vmeozci
bafybeicpmotdsj2ambf666b2jkzp2gvg6tadr6acxqw2tmdlmsruuggbbu
bafybeigefo3esovbveavllgv5wiheu5w6cnfo72jxe6vmfweco5eq5sfty
bafybeigvajsumnfwuv7lp7yhr2sr5vrk3bmmuhhnaz53waa2jqv3kgkvsu
bafybeih2xg2n7ytlunvqxwqlqo5l3daykuykyvhgehoa2arot6dmorstmq
bafybeihnmq2ltuolnlthb757teihwvvw7wophoag2ihnva43afbeqdtgi4
bafybeibb34hzu6z2xgo6nhrplt3xntpnucthqlawe3pmzgxccppbxrpudy
bafybeigny33b4g6gf2hrqzzkfbroprqrimjl5gmb3mnsqu655pbbny6tou
bafybeifgqjvmzbtz427bne7af5tbndmvniabaex77us6l637gqtb2iwlwq
bafybeibryqj62l45pxjhdyvgdc44p3suhvt4xdqc5jpx474gpykxwgnw2e
bafybeidme3fkigdjaifkjfbwn76jk3fcqdogpzebtotce6ygphlujaecla
bafybeig7myc3eg3h2g5mk2co7ybte4qsuremflrjneer6xk3pghjwmcwbi
bafybeic3x2r5rrd3fdpdqeqax4bszcciwepvbpjl7xdv6mkwubyqizw5te
bafybeihxutvxg3bw7fbwohq4gvncrk3hngkisrtkp52cu7qu7tfcuvktnq
bafybeicumr67jkyarg5lspqi2w4zqopvgii5dgdbe5vtbbq53mbyftduxy
bafybeiecn2cdvefvdlczhz6i4afbkabf5pe5yqrcsgdvlw5smme2tw7em4
bafybeiaxh7dhg4krgkil5wqrv5kdsc3oewwy6ym4n3545ipmzqmxaxrqf4
bafybeiclcqfzinrmo3adr4lg7sf255faioxjfsolcdko3i4x7opx7xrqii
bafybeicjmeul7c2dxhmaudawum4ziwfgfkvbgthgtliggfut5tsc77dx7q
bafybeialziupik7csmhfxnhuss5vrw37kmte7rmboqovp4cpq5hj4insda
bafybeid7ecwdrw7pb3fnkokq5adybum6s5ok3yi2lw4m3edjpuy65zm4ji
bafybeibuxwnl5ogs4pwa32xriqhch24zbrw44rp22hrly4t6roh6rz7j4m
bafybeicxvy47jpvv3fi5umjatem5pxabfrbkzxiho7efu6mpidjpatte54
bafybeifynb4mpqrbsjbeqtxpbuf6y4frrtjrc4tm7cnmmui7gbjkckszrq
bafybeidcgnbhguyfaahkoqbyy2z525d3qfzdtbjuk4e75wkdbnkcafvjei
bafybeiefc67s6hpydnsqdgypbunroqwkij5j26sfmc7are7yxvg45uuh7i
bafybeiefwjy3o42ovkssnm7iihbog46k5grk3gobvvkzrqvof7p6xbgowi
bafybeihpydd3ivtza2ql5clatm5fy7ocych7t4czu46sbc6c2ykrbwk5uu
bafybeiet7222lqfmzogur3zlxqavlnd3lt3qryw5yi5rhuiqeqg4w7c3qu
bafybeihwomd4ygoydvj5kh24wfwk5kszmst5vz44zkl6yibjargttv7sly
bafybeidbjt2ckr4oooio3jsfk76r3bsaza5trjvt7u36slhha5ksoc5gv4
bafybeifyjrmopgtfmswq7b4pfscni46doy3g3z6vi5rrgpozc6duebpmuy
bafybeidsrowz46yt62zs64q2mhirlc3rsmctmi3tluorsts53vppdqjj7e
bafybeiggntql57bw24bw6hkp2yqd3qlyp5oxowo6q26wsshxopfdnzsxhq
bafybeidguz36u6wakx4e5ewuhslsfsjmk5eff5q7un2vpkrcu7cg5aaqf4
bafybeiaypwu2b45iunbqnfk2g7bku3nfqveuqp4vlmmwj7o7liyys42uai
bafybeicaahv7xvia7xojgiecljo2ddrvryzh2af7rb3qqbg5a257da5p2y
bafybeibgeiijr74rcliwal3e7tujybigzqr6jmtchqrcjdo75trm2ptb4e
bafybeiba3nrd43ylnedipuq2uoowd4blghpw2z7r4agondfinladcsxlku
bafybeif3semzitjbxg5lzwmnjmlsrvc7y5htekwqtnhmfi4wxywtj5lgoe
bafybeiedmsig5uj7rgarsjans2ad5kcb4w4g5iurbryqn62jy5qap4qq2a
bafybeidyz34bcd3k6nxl7jbjjgceg5eu3szbrbgusnyn7vfl7facpecsce
bafybeigmq5gch72q3qpk4nipssh7g7msk6jpzns2d6xmpusahkt2lu5m4y
bafybeicjzoypdmmdt6k54wzotr5xhpzwbgd3c4oqg6mj4qukgvxvdrvzye
bafybeien55egngdpfvrsxr2jmkewdyha72ju7qaaeiydz2f5rny7drgzta

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Convert CSV To Parquet Or Avro

Introduction

Converting from CSV to parquet or avro reduces the size of the file and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on ipfs or on the web without the need to download files and install dependencies locally.

In this example tutorial we will convert a CSV file from a URL to parquet format and save the converted parquet file to IPFS

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

Running CSV to Avro or Parquet Locally

Downloading the CSV file

Let's download the transactions.csv file:

wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

You can use the CSV files from here

Writing the Script

Write the converter.py Python script, that serves as a CSV converter to Avro or Parquet formats:

# converter.py
import os
import sys
from abc import ABCMeta, abstractmethod

import fastavro
import numpy as np
import pandas as pd
from pyarrow import Table, parquet


class BaseConverter(metaclass=ABCMeta):
    """
    Base class for converters.

    Validate received parameters for future use.
    """
    def __init__(
        self,
        csv_file_path: str,
        target_file_path: str,
    ) -> None:
        self.csv_file_path = csv_file_path
        self.target_file_path = target_file_path

    @property
    def csv_file_path(self):
        return self._csv_file_path

    @csv_file_path.setter
    def csv_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        _, extension = os.path.splitext(path)
        if not os.path.isfile(path) or extension != '.csv':
            raise FileNotFoundError(
                f'No such csv file: {path}'
            )
        self._csv_file_path = path

    @property
    def target_file_path(self):
        return self._target_file_path

    @target_file_path.setter
    def target_file_path(self, path):
        if not os.path.isabs(path):
            path = os.path.join(os.getcwd(), path)
        target_dir = os.path.dirname(path)
        if not os.path.isdir(target_dir):
            raise FileNotFoundError(
                f'No such directory: {target_dir}\n'
                'Choose existing or create directory for result file.'
            )
        if os.path.isfile(path):
            raise FileExistsError(
                f'File {path} has already exists.'
                'Usage of existing file may result in data loss.'
            )
        self._target_file_path = path

    def get_csv_reader(self):
        """Return csv reader which read csv file as a stream"""
        return pd.read_csv(
            self.csv_file_path,
            iterator=True,
            chunksize=100000
        )

    @abstractmethod
    def convert(self):
        """Should be implemented in child class"""
        pass


class ParquetConverter(BaseConverter):
    """
    Convert received csv file to parquet file.

    Take path to csv file and path to result file.
    """
    def convert(self):
        """Read csv file as a stream and write data to parquet file."""
        csv_reader = self.get_csv_reader()
        writer = None
        for chunk in csv_reader:
            if not writer:
                table = Table.from_pandas(chunk)
                writer = parquet.ParquetWriter(
                    self.target_file_path, table.schema
                )
            table = Table.from_pandas(chunk)
            writer.write_table(table)
        writer.close()


class AvroConverter(BaseConverter):
    """
    Convert received csv file to avro file.

    Take path to csv file and path to result file.
    """
    NUMPY_TO_AVRO_TYPES = {
        np.dtype('?'): 'boolean',
        np.dtype('int8'): 'int',
        np.dtype('int16'): 'int',
        np.dtype('int32'): 'int',
        np.dtype('uint8'): 'int',
        np.dtype('uint16'): 'int',
        np.dtype('uint32'): 'int',
        np.dtype('int64'): 'long',
        np.dtype('uint64'): 'long',
        np.dtype('O'): ['null', 'string', 'float'],
        np.dtype('unicode_'): 'string',
        np.dtype('float32'): 'float',
        np.dtype('float64'): 'double',
        np.dtype('datetime64'): {
            'type': 'long',
            'logicalType': 'timestamp-micros'
        },
    }

    def get_avro_schema(self, pandas_df):
        """Generate avro schema."""
        column_dtypes = pandas_df.dtypes
        schema_name = os.path.basename(self.target_file_path)
        schema = {
            'type': 'record',
            'name': schema_name,
            'fields': [
                {
                    'name': name,
                    'type': AvroConverter.NUMPY_TO_AVRO_TYPES[dtype]
                } for (name, dtype) in column_dtypes.items()
            ]
        }
        return fastavro.parse_schema(schema)

    def convert(self):
        """Read csv file as a stream and write data to avro file."""
        csv_reader = self.get_csv_reader()
        schema = None
        with open(self.target_file_path, 'a+b') as f:
            for chunk in csv_reader:
                if not schema:
                    schema = self.get_avro_schema(chunk)
                fastavro.writer(
                    f,
                    schema=schema,
                    records=chunk.to_dict('records')
                )


if __name__ == '__main__':
    converters = {
        'parquet': ParquetConverter,
        'avro': AvroConverter
    }
    csv_file, result_path, result_type = sys.argv[1], sys.argv[2], sys.argv[3]
    if result_type.lower() not in converters:
        raise ValueError(
            'Invalid target type. Avalible types: avro, parquet.'
        )
    converter = converters[result_type.lower()](csv_file, result_path)
    converter.convert()

You can find out more information about converter.py here

Installing Dependencies

pip install fastavro numpy pandas pyarrow

Converting CSV file to Parquet format

python converter.py <path_to_csv> <path_to_result_file> <extension>

In our case:

python3 converter.py transactions.csv transactions.parquet parquet

Viewing the parquet file:

import pandas as pd
pd.read_parquet('transactions.parquet').head()

Containerize Script with Docker

You can skip this section entirely and directly go to Running on Bacalhau

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

FROM python:3.8

RUN apt update && apt install git

RUN git clone https://github.com/bacalhau-project/Sparkov_Data_Generation

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt

See more information on how to containerize your script/app here

Build the container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username. If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/csv-to-arrow-or-parquet .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/csv-to-arrow-or-parquet

Running a Bacalhau Job

With the command below, we are mounting the CSV file for transactions from IPFS

export JOB_ID=$(bacalhau docker run \
    -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W  \
    --wait \
    --id-only \
    --output outputs:\outputs \
    --publisher ipfs \
    jsacex/csv-to-arrow-or-parquet \
    -- python3 src/converter.py ../inputs/transactions.csv  /outputs/transactions.parquet parquet)

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau
-i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W: CIDs to use on the job. Mounts them at '/inputs' in the execution.
jsacex/csv-to-arrow-or-parque: the name and the tag of the docker image we are using
../inputs/transactions.csv : path to input dataset
/outputs/transactions.parquet parquet: path to the output
python3 src/converter.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Convert CSV To Parquet Or Avro
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/csv-to-arrow-or-parquet
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python3 src/converter.py ../inputs/transactions.csv  ../outputs/transactions.parquet parquet
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W"

The job description should be saved in .yaml format, e.g. convertcsv.yaml, and then run with the command:

bacalhau job run convertcsv.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir results # Download the results

Viewing your Job Output

To view the file, run the following command:

ls results/outputs

transactions.parquet

Alternatively, you can do this:

import pandas as pd
import os
pd.read_parquet('results/outputs/transactions.parquet')

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Simple Image Processing

Introduction

In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.

Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Running a Bacalhau Job

To submit a workload to Bacalhau, we will use the bacalhau docker run command. This command allows to pass input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1 \
    --publisher ipfs \
    --entrypoint mogrify \
    dpokidov/imagemagick:7.1.0-47-ubuntu \
    -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg')

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau
-i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1: Specifies the input data, which is stored in the S3 storage.
--entrypoint mogrify: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.
dpokidov/imagemagick:7.1.0-47-ubuntu: The name and the tag of the docker image we are using
-- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg': These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Simple Image Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: dpokidov/imagemagick:7.1.0-47-ubuntu
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - magick mogrify -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/input_images"
      Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"

The job description should be saved in .yaml format, e.g. image.yaml, and then run with the command:

bacalhau job run image.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Display the image

To view the images, open the results/outputs/ folder:

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Oceanography - Data Conversion

Introduction

The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the fugacity of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.

In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.

Prerequisites

To get started, you need to install the Bacalhau client, see more information here

Running Locally

Downloading the dataset

For the purposes of this example we will use the SOCATv2022 dataset in the "Gridded" format from the SOCAT website and long-term global sea surface temperature data from NOAA - information about that dataset can be found here.

mkdir -p inputs
curl -L --output ./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip https://www.socat.info/socat_files/v2022/SOCATv2022_tracks_gridded_monthly.nc.zip
curl --output ./inputs/sst.mnmean.nc https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/sst.mnmean.nc

Installing dependencies

Next let's write the requirements.txt. This file will also be used by the Dockerfile to install the dependencies.

# requirements.txt
Bottleneck==1.3.5
dask==2022.2.0
fsspec==2022.5.0
netCDF4==1.6.0
numpy==1.21.6
pandas==1.3.5
pip==22.1.2
pyseaflux==2.2.1
scipy==1.7.3
xarray==0.20.2
zarr>=2.0.0

pip install -r requirements.txt > /dev/null

Reading and Viewing Data

import fsspec # for reading remote files
import xarray as xr

# Open the zip archive using fsspec and load the data into xarray.Dataset
with fsspec.open("./inputs/SOCATv2022_tracks_gridded_monthly.nc.zip", compression='zip') as fp:
    ds = xr.open_dataset(fp)

# Display information about the dataset    
ds.info()

time_slice = slice("2010", "2020") # select a decade
res = ds['sst_ave_unwtd'].sel(tmnth=time_slice).mean(dim='tmnth') # compute the mean for this period
res.plot() # plot the result

We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.

Data Conversion

To convert the data from fugacity of CO2 (fCO2) to partial pressure of CO2 (pCO2) we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the pyseaflux package.

Writing the Script

Let's create a new file called main.py and paste the following script in it:

# main.py
import fsspec
import xarray as xr
import pandas as pd
import numpy as np
import pyseaflux


def lon_360_to_180(ds=None, lonVar=None):
    lonVar = "lon" if lonVar is None else lonVar
    return (ds.assign_coords({lonVar: (((ds[lonVar] + 180) % 360) - 180)})
            .sortby(lonVar)
            .astype(dtype='float32', order='C'))


def center_dates(ds):
    # start and end date
    start_date = str(ds.time[0].dt.strftime('%Y-%m').values)
    end_date = str(ds.time[-1].dt.strftime('%Y-%m').values)

    # monthly dates centered on 15th of each month
    dates = pd.date_range(start=f'{start_date}-01T00:00:00.000000000',
                          end=f'{end_date}-01T00:00:00.000000000',
                          freq='MS') + np.timedelta64(14, 'D')

    return ds.assign(time=dates)


def get_and_process_sst(url=None):
    # get noaa sst
    if url is None:
        url = ("/inputs/sst.mnmean.nc")

    with fsspec.open(url) as fp:
        ds = xr.open_dataset(fp)
        ds = lon_360_to_180(ds)
        ds = center_dates(ds)
        return ds


def get_and_process_socat(url=None):
    if url is None:
        url = ("/inputs/SOCATv2022_tracks_gridded_monthly.nc.zip")

    with fsspec.open(url, compression='zip') as fp:
        ds = xr.open_dataset(fp)
        ds = ds.rename({"xlon": "lon", "ylat": "lat", "tmnth": "time"})
        ds = center_dates(ds)
        return ds


def main():
    print("Load SST and SOCAT data")
    ds_sst = get_and_process_sst()
    ds_socat = get_and_process_socat()

    print("Merge datasets together")
    time_slice = slice("1981-12", "2022-05")
    ds_out = xr.merge([ds_sst['sst'].sel(time=time_slice),
                       ds_socat['fco2_ave_unwtd'].sel(time=time_slice)])

    print("Calculate pco2 from fco2")
    ds_out['pco2_ave_unwtd'] = xr.apply_ufunc(
        pyseaflux.fCO2_to_pCO2,
        ds_out['fco2_ave_unwtd'],
        ds_out['sst'])

    print("Add metadata")
    ds_out['pco2_ave_unwtd'].attrs['units'] = 'uatm'
    ds_out['pco2_ave_unwtd'].attrs['notes'] = ("calculated using" +
                                               "NOAA OI SST V2" +
                                               "and pyseaflux package")

    print("Save data")
    ds_out.to_zarr("/processed.zarr")
    import shutil
    shutil.make_archive("/outputs/processed.zarr", 'zip', "/processed.zarr")
    print("Zarr file written to disk, job completed successfully")

if __name__ == "__main__":
    main()

This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.

Upload the Data to IPFS

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.

This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y.

Setting up Docker Container

We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.

FROM python:slim

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y --no-install-recommends \
    g++ \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /project

COPY ./requirements.txt /project

RUN pip3 install -r requirements.txt

COPY ./main.py /project

CMD ["python","main.py"]

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

Push the container

Now you can push this repository to the registry designated by its name or tag.

docker push <hub-user>/<repo-name>:<tag>

For more information about working with custom containers, see the custom containers example.

Running a Bacalhau Job

Now that we have the data in IPFS and the Docker image pushed, next is to run a job using the bacalhau docker run command

export JOB_ID=$(bacalhau docker run \
    --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
    --memory 10Gb \
    ghcr.io/bacalhau-project/examples/socat:0.0.11 \
    -- python main.py)

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau
--input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y: CIDs to use on the job. Mounts them at '/inputs' in the execution.
ghcr.io/bacalhau-project/examples/socat:0.0.11: the name and the tag of the image we are using
python main.py: execute the script

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Oceanography
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ghcr.io/bacalhau-project/examples/socat:0.0.11
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python main.py
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/inputs"
        Source:
          Type: "ipfs"
          Params:
            CID: "bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y"
    Resources:
        Memory: 10gb

The job description should be saved in .yaml format, e.g. oceanyaml, and then run with the command:

bacalhau job run ocean.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results
mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results

Viewing your Job Output

To view the file, run the following command:

ls results/outputs

processed.zarr.zip

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Video Processing

Introduction

Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Upload the Data to IPFS

The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.

This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j.

Running a Bacalhau Job

To submit a workload to Bacalhau, we will use the bacalhau docker run command. The command allows one to pass input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a . This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

export JOB_ID=$(bacalhau docker run \
    --wait \
    --wait-timeout-secs 100 \
    --id-only \
    -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
    linuxserver/ffmpeg \
    -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}' )

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau
-i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j: CIDs to use on the job. Mounts them at '/inputs' in the execution.
linuxserver/ffmpeg: the name of the docker image we are using to resize the videos
-- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}': the command that will be executed inside the container. It uses find to locate all files with the extension ".mp4" within /inputs and then uses ffmpeg to resize each found file to 72 pixels in height, saving the results in the /outputs folder.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

name: Video Processing
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: linuxserver/ffmpeg
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
    - Target: "/inputs"
      Source:
        Type: "s3"
        Params:
          Bucket: "bacalhau-video-processing"
          Key: "*"
          Region: "us-east-1"

The job description should be saved in .yaml format, e.g. video.yaml, and then run with the command:

bacalhau job run video.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --no-style

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

mkdir -p ./results # Temporary directory to store the results
bacalhau job get ${JOB_ID} --output-dir ./results # Download the results

Viewing your Job Output

To view the results open the results/outputs/ folder.

Support

Bacalhau and BigQuery

Use Bacalhau to process logs remotely across clouds and land the results on BigQuery.

Introduction

This example demonstrates how to build a sophisticated distributed log processing pipeline using , Google , and . You'll learn how to process and analyze logs across distributed nodes, with progressively more advanced techniques for handling, sanitizing, and aggregating log data.

The combination of Bacalhau and BigQuery offers several key advantages:

Process logs directly where they are generated, eliminating the need for centralized collection
Scale processing across multiple nodes in different cloud providers
Leverage BigQuery's powerful analytics capabilities for processed data
Implement privacy-conscious data handling and efficient aggregation strategies

Through this example, you'll evolve from basic log collection to implementing a production-ready system with privacy protection and smart aggregation. Whether you're handling application logs, system metrics, or security events, this pipeline provides a robust foundation for distributed log analytics.

Prerequisites

installed
A Google Cloud Project with BigQuery enabled
Service account credentials with BigQuery access
A running Bacalhau cluster with nodes across different cloud providers
- Follow the
- Ensure nodes are properly configured across your cloud providers (AWS, GCP, Azure). You can see more about setting up your nodes

Your nodes will have to have at least the following configurations installed to use the below jobs unchanged (in addition to anything else you may require).

Compute:
  Enabled: true
  AllowListedLocalPaths:
    - /bacalhau_data:rw
JobAdmissionControl:
  AcceptNetworkedJobs: true

Components

DuckDB Processing: Process and analyze data using DuckDB's SQL capabilities
BigQuery Integration: Store processed results in Google BigQuery for further analysis

Before You Start

Make sure you have configured your config file to have the correct BigQuery project and dataset. You can do this by copying the config.yaml.example file to config.yaml and editing the values to match your BigQuery project and dataset.

# BigQuery Configuration
project:
  id: "bacalhau-and-bigquery"  # Required: Your Google Cloud project ID
  region: "US"           # Optional: Default region for resources
  create_if_missing: true # Whether to create the project if it doesn't exist

credentials:
  path: "credentials.json"  # Path to service account credentials

bigquery:
  dataset_name: "log_analytics"     # Name of the BigQuery dataset
  table_name: "log_results"         # Name of the results table
  location: "US"                    # Dataset location

Ensure your Google Cloud service account has these roles:
- BigQuery Data Editor
- BigQuery Job User
Have your service account key file (JSON format) ready
Configure your BigQuery settings:
- Create a dataset for log analytics
- Note your project ID and dataset name

We have provided some utility scripts to help you set up your BigQuery project and tables. You can run the following commands to set up your project and tables if you haven't already:

Interactive setup to set up your BigQuery project and tables. Will go through and create the necessary bigquery projects.

./utility_scripts/setup.py -i

Creates sample tables in BigQuery for testing.

./utility_scripts/sample_tables.sh

Checks the permissions of the service account specified in log_uploader_credentials.json to ensure it has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.

./utility_scripts/check_permissions.sh

Confirms your BigQuery project and dataset, and creates the tables if they don't exist with the correct schema. This will also zero out the tables if they already exist, so be careful! (Useful for debugging)

./utility_scripts/confirm_tables.sh

Distributes the credentials to /bacalhau_data on all nodes in a Bacalhau network.

./utility_scripts/distribute_credentials.sh

Ensures the service account specified in log_uploader_credentials.json has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.

./utility_scripts/setup_log_uploader.sh

One more thing to set up is the log faker on the nodes. This will generate logs for you to work with. You can run the following command to start the log faker:

bacalhau job run start-logging-container.yaml

Give it a couple of minutes to start up and then you can start processing the logs.

Demo Walkthrough

Let's walk through each stage of the demo, seeing how we can progressively improve our data processing pipeline!

Stage 1: Raw Power - Basic Log Upload 🚀

Let's start by looking at the raw logs.

# Let's look at the raw logs
bacalhau docker run alpine cat /var/log/app/access.log

That will print out the logs to stdout, which we can then read from the job.

After running the job, you will see a job id, something like this:

To get more details about the run, execute:
        bacalhau job describe j-01480df3-476e-4fdd-a297-0fc41cb10710

To get more details about the run executions, execute:
        bacalhau job executions j-01480df3-476e-4fdd-a297-0fc41cb10710

When you run the describe command, you will see the details of the job, including the output of the log information.

Now let's upload the raw logs to BigQuery. This is the simplest approach - just get the data there:

bacalhau job run bigquery_export_job.yaml --template-vars=python_file_b64=$(cat bigquery-exporter/log_process_0.py | base64)

This will upload the python script to all the nodes which, in turn, will upload the raw logs from all nodes to BigQuery. When you check BigQuery, you'll see:

Millions of rows uploaded (depends on how many nodes you have and how long you let it run)
Each log line as raw text
No structure or parsing

To query the logs, you can use the following SQL:

PROJECT_ID=$(yq '.bigquery.project_id' config.yaml)
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT_ID.log_analytics.raw_logs\` LIMIT 5"

Stage 2: Adding Structure - Making Sense of Chaos 📊

Now let's do something more advanced, by parsing those logs into structured data before upload:

bacalhau job run bigquery_export_job.yaml --template-vars=python_file_b64=$(cat bigquery-exporter/log_process_1.py | base64)

Your logs are now parsed into fields like:

IP Address
Timestamp
HTTP Method
Endpoint
Status Code
Response Size

To query the logs, you can use the following SQL:

bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT_ID.log_analytics.log_results\` LIMIT 5"

Stage 3: Privacy First - Responsible Data Handling 🔒

Now let's handle the data responsibly by sanitizing PII (like IP addresses):

bacalhau job run bigquery_export_job.yaml --template-vars=python_file_b64=$(cat bigquery-exporter/log_process_2.py | base64)

This:

Zeros out the last octet of IPv4 addresses
Zeros out the last 64 bits of IPv6 addresses
Maintains data utility while ensuring compliance

Again, to query the logs, you can use the following SQL:

bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT_ID.log_analytics.log_results\` LIMIT 5"

Notice that the IP addresses are now sanitized.

Stage 4: Smart Aggregation - Efficiency at Scale 📈

Finally, let's be smart about what we upload:

bacalhau job run bigquery_export_job.yaml --template-vars=python_file_b64=$(cat bigquery-exporter/log_process_3.py | base64)

This creates two streams:

Aggregated normal logs:
- Grouped in 5-minute windows
- Counts by status code
- Average response sizes
- Total requests per endpoint
Real-time emergency events:
- Critical errors
- Security alerts
- System failures

To query the logs, you can use the following SQL:

-- See your aggregated logs
SELECT * FROM \`$PROJECT_ID.log_analytics.log_aggregates\`
ORDER BY time_window DESC LIMIT 5;

-- Check emergency events
SELECT * FROM \`$PROJECT_ID.log_analytics.emergency_logs\`
ORDER BY timestamp DESC LIMIT 5;

What's Next?

Now that you've seen the power of distributed processing with Bacalhau:

Try processing your own log files
Experiment with different aggregation windows
Add your own privacy-preserving transformations
Scale to even more nodes!

Remember: The real power comes from processing data where it lives, rather than centralizing everything first. Happy distributed processing! 🚀

Table Schemas

log_results (Main Table):
- project_id: STRING - Project identifier
- region: STRING - Deployment region
- nodeName: STRING - Node name
- timestamp: TIMESTAMP - Event time
- version: STRING - Log version
- message: STRING - Log content
- sync_time: TIMESTAMP - Upload time
- remote_log_id: STRING - Original log ID
- hostname: STRING - Source host
- public_ip: STRING - Sanitized public IP
- private_ip: STRING - Internal IP
- alert_level: STRING - Event severity
- provider: STRING - Cloud provider
log_aggregates (5-minute windows):
- project_id: STRING - Project identifier
- region: STRING - Deployment region
- nodeName: STRING - Node name
- provider: STRING - Cloud provider
- hostname: STRING - Source host
- time_window: TIMESTAMP - Aggregation window
- log_count: INT64 - Events in window
- messages: ARRAY - Event details
emergency_logs (Critical Events):
- project_id: STRING - Project identifier
- region: STRING - Deployment region
- nodeName: STRING - Node name
- provider: STRING - Cloud provider
- hostname: STRING - Source host
- timestamp: TIMESTAMP - Event time
- version: STRING - Log version
- message: STRING - Alert details
- remote_log_id: STRING - Original log ID
- alert_level: STRING - Always "EMERGENCY"
- public_ip: STRING - Sanitized public IP
- private_ip: STRING - Internal IP

Data Ingestion

Copy Data from URL to Public Storage

To upload a file from a URL we will use the bacalhau docker run command.

bacalhau docker run \
    --id-only \
    --wait \
    --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md \
    ghcr.io/bacalhau-project/examples/upload:v1

The job has been submitted and Bacalhau has printed out the related job id.

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau using docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md: URL path of the input data volumes downloaded from a URL source.
ghcr.io/bacalhau-project/examples/upload:v1: the name and tag of the docker image we are using

The bacalhau docker run command takes advantage of the --input parameter. This will download a file from a public URL and place it in the /inputs directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.

You can find out more about the which is designed to simplify the data uploading process.

For more details, see the

Checking the State of Your Jobs

Job status: You can check the status of the job using bacalhau job list, processing the json ouput with the jq:

bacalhau job list $JOB_ID --output=json | jq '.[0].Status.JobState.Nodes[] | .Shards."0" | select(.RunOutput)'

When the job status is Published or Completed, that means the job is done, and we can get the results using the job ID.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe  $JOB_ID

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.

rm -rf results && mkdir ./results
bacalhau job get --output-dir ./results $JOB_ID

Each job result contains an outputs subfolder and exitCode, stderr and stdout files with relevant content. To view the execution logs execute following:

head -n 15 ./results/stdout

And to view the job execution result (README.md file in the example case), which was saved as a job output, execute:

tail ./results/outputs/README.md

To get the output CID from a completed job, run the following command:

bacalhau job list $JOB_ID --output=json | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID'

The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.

Now that we have the CID, we can use it in a new job. This time we will use the --input parameter to tell Bacalhau to use the CID we just uploaded.

In this case, the only goal of our job is just to list the contents of the /inputs directory. You can see that the "input" data is located under /inputs/outputs/README.md.

bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://$CID \
    ubuntu -- \
    bash -c "set -x; ls -l /inputs; ls -l /inputs/outputs; cat /inputs/outputs/README.md"

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Pinning Data

How to pin data to public storage

If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

Basic steps

To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files - usually services provide a web interface, CLI and code samples for integration into your application. Once you upload the files you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9. Now you can access pinned data from the jobs via this CID.

Data source can be specified via --input flag, see the for more details

Running a Job over S3 data

Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Running a Bacalhau Job

bacalhau docker run \
    -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
    --id-only \
    --wait \
    alpine \
    -- sh -c "cp -r /inputs/* /outputs/"

Structure of the Command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01 from the bucket noaa-goes16 in us-east-1 region, and mount the objects under /inputs path inside the docker job.
-- sh -c "cp -r /inputs/* /outputs/": copies all files under /inputs to /outputs, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the for more details.

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get $JOB_ID --output-dir results # Download the results

When the download is completed, the results of the job will be present in the directory. To view them, run the following command:

ls -1 results/outputs

{
  "NextToken": "",
  "Results": [
    {
      "Type": "s3PreSigned",
      "Params": {
        "PreSignedURL": "https://bacalhau-test-datasets.s3.eu-west-1.amazonaws.com/integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUEMPQ7JFSLGEPHJG%2F20240129%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240129T060142Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=cea00578ae3b03a1b52dba2d65a1bab40f1901fb7cd4ee1a0a974dc05b595f2e",
        "SourceSpec": {
          "Bucket": "bacalhau-test-datasets",
          "ChecksumSHA256": "1tlbgo+q0TlQhJi8vkiWnwTwPu1zenfvTO4qW1D5yvI=",
          "Endpoint": "",
          "Filter": "",
          "Key": "integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz",
          "Region": "eu-west-1",
          "VersionID": "oS7n.lY5BYHPMNOfbBS1n5VLl4ppVS4h"
        }
      }
    }
  ]
}

First you need to install jq (if it is not already installed) to process JSON:

sudo apt update
sudo apt install jq

To extract the CIDs from output JSON, execute following:

bacalhau job describe ${JOB_ID} --json \
| jq -r '.State.Executions[].PublishedResults.CID | select (. != null)'

The extracted CID will look like this:

QmYFhG668yJZmtk84SMMdbrz5Uvuh78Q8nLxTgLDWShkhR

You can publish your results to Amazon s3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.

To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.

For S3-compatible destinations, the configuration is as follows:

type PublisherSpec struct {
    Type   Publisher              `json:"Type,omitempty"`
    Params map[string]interface{} `json:"Params,omitempty"`
}

For Amazon S3, you can specify the PublisherSpec configuration as shown below:

PublisherSpec:
  Type: S3
  Params:
    Bucket: <bucket>              # Specify the bucket where results will be stored
    Key: <object-key>             # Define the object key (supports dynamic naming using placeholders)
    Compress: <true/false>        # Specify whether to publish results as a single gzip file (default: false)
    Endpoint: <optional>          # Optionally specify the S3 endpoint
    Region: <optional>            # Optionally specify the S3 region

Let's explore some examples to illustrate how you can use this:

Publishing results to S3 using default settings

bacalhau docker run -p s3://<bucket>/<object-key> ubuntu ...

Publishing results to S3 with a custom endpoint and region:

bacalhau docker run \
-p s3://<bucket>/<object-key>,opt=endpoint=http://s3.example.com,opt=region=us-east-1 \
ubuntu ...

Publishing results to S3 as a single compressed file

bacalhau docker run -p s3://<bucket>/<object-key>,opt=compress=true ubuntu ...

Utilizing naming placeholders in the object key

bacalhau docker run -p s3://<bucket>/result-{date}-{jobID} ubuntu ...

Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.

Here's an example of a sample result:

{
    "NodeID": "QmYJ9QN9Pbi6gBKNrXVk5J36KSDGL5eUT6LMLF5t7zyaA7",
    "Data": {
        "StorageSource": "S3",
        "Name": "s3://<bucket>/run3.tar.gz",
        "S3": {
            "Bucket": "<bucket>",
            "Key": "run3.tar.gz",
            "Checksum": "e0uDqmflfT9b+rMfoCnO5G+cy+8WVTOPUtAqDMnXWbw=",
            "VersionID": "hZoNdqJsZxE_bFm3UGJuJ0RqkITe9dQ1"
        }
    }
}

To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:

Environment variables, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
Credentials file ~/.aws/credentials
IAM Roles for Amazon EC2 Instances

Model Inference

EasyOCR (Optical Character Recognition) on Bacalhau

Introduction

In this example tutorial, we use Bacalhau and Easy OCR to digitize paper records or for recognizing characters or extract text data from images stored on IPFS, S3 or on the web. is a ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic etc. With easy OCR, you use the pre-trained models or use your own fine-tuned model.

TL;DR

bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True

Running Easy OCR Locally

Install the required dependencies

pip install --upgrade easyocr

Load the different example images

npx degit JaidedAI/EasyOCR/examples -f

List all the images. You'll see an output like this:

ls -l

total 3508
-rw-r--r-- 1 root root   59898 Jun 16 22:36 chinese.jpg
-rw-r--r-- 1 root root   97910 Jun 16 22:36 easyocr_framework.jpeg
-rw-r--r-- 1 root root 1740957 Jun 16 22:36 english.png
-rw-r--r-- 1 root root  487995 Jun 16 22:36 example2.png
-rw-r--r-- 1 root root  127454 Jun 16 22:36 example3.png
-rw-r--r-- 1 root root  488641 Jun 16 22:36 example.png
-rw-r--r-- 1 root root  168376 Jun 16 22:36 french.jpg
-rw-r--r-- 1 root root   42159 Jun 16 22:36 japanese.jpg
-rw-r--r-- 1 root root  225531 Jun 16 22:36 korean.png
drwxr-xr-x 1 root root    4096 Jun 15 13:37 sample_data
-rw-r--r-- 1 root root   82229 Jun 16 22:36 thai.jpg
-rw-r--r-- 1 root root   34706 Jun 16 22:36 width_ths.png

Next, we create a reader to do OCR to get coordinates which represent a rectangle containing text and the text itself:

import easyocr
reader = easyocr.Reader(['th','en'])
# Doing OCR. Get bounding boxes.
bounds = reader.readtext('thai.jpg')
bounds

Containerize your Script using Docker

git clone https://github.com/JaidedAI/EasyOCR
cd EasyOCR

Build the Container

The docker build command builds Docker images from a Dockerfile.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.

docker push <hub-user>/<repo-name>:<tag>

Running a Bacalhau Job to Generate Easy OCR output

Prerequisite

Now that we have an image in the docker hub (your own or an example image from the manual), we can use the container for running on Bacalhau.

Structure of the imperative command

Let's look closely at the command below:

export JOB_ID=$( ... ) exports the job ID as environment variable
bacalhau docker run: call to bacalhau
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only flag is set to print only job id
-i ipfs://bafybeibvc...... Mounts the model from IPFS
-i https://raw.githubusercontent.com... Mounts the Input Image from a URL
jsacex/easyocr the name and the tag of the docker image we are using
-- easyocr -l ch_sim en -f ./inputs/chinese.jpg --detail=1 --gpu=True execute script with following paramters:
1. -l ch_sim: the name of the model
2. -f ./inputs/chinese.jpg: path to the input Image or directory
3. --detail=1: level of detail
4. --gpu=True: we set this flag to true since we are running inference on a GPU. If you run this on a CPU - set this flag to false

export JOB_ID=$(bacalhau docker run \
    -i ipfs://bafybeibvcllzpfviggluobcfassm3vy4x2a4yanfxtmn4ir7olyzfrgq64:/root/.EasyOCR/model/zh_sim_g2.pth  \
    -i https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --gpu 1  \
    --memory 10Gb \
    --cpu 3 \
    --id-only \
    --wait \
    jsacex/easyocr \
    --  easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

name: EasyOCR
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/easyocr" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - easyocr -l ch_sim  en -f ./inputs/chinese.jpg --detail=1 --gpu=True
    InputSources:
    - Source:
        Type: "urlDownload"
        Params:
          URL: "https://raw.githubusercontent.com/JaidedAI/EasyOCR/ae773d693c3f355aac2e58f0d8142c600172f016/examples/chinese.jpg"
      Target: "/inputs/chinese.jpg"
    - Source:
        Type: "s3"
        Params:
          Bucket: "landsat-image-processing"
          Key: "*"
          Region: "us-east-1"
      Target: "/root/.EasyOCR/model/zh_sim_g2.pth"
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. easyocr.yaml, and then run with the command:

bacalhau job run easyocr.yaml

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputs folder. You can view results by running following commands:

ls results # list the contents of the current directory

cat results/stdout # displays the contents of the current directory

Running Inference on Dolly 2.0 Model with Hugging Face

Introduction

Dolly 2.0, the groundbreaking, open-source, instruction-following Large Language Model (LLM) that has been fine-tuned on a human-generated instruction dataset, licensed for both research and commercial purposes. Developed using the EleutherAI Pythia model family, this 12-billion-parameter language model is built exclusively on a high-quality, human-generated instruction following dataset, contributed by Databricks employees.

Dolly 2.0 package is open source, including the training code, dataset, and model weights, all available for commercial use. This unprecedented move empowers organizations to create, own, and customize robust LLMs capable of engaging in human-like interactions, without the need for API access fees or sharing data with third parties.

Running locally

Prerequisites

A NVIDIA GPU
Python

Installing dependencies

pip -q install git+https://github.com/huggingface/transformers # need to install from github
pip -q --upgrade install accelerate # ensure you are using version higher than 0.12.0

Create an inference.py file with following code:

# content of the inference.py file
import argparse
import torch
from transformers import pipeline

def main(prompt_string, model_version):

    # use dolly-v2-12b if you're using Colab Pro+, using pythia-2.8b for Free Colab
    generate_text = pipeline(model=model_version, 
                            torch_dtype=torch.bfloat16, 
                            trust_remote_code=True,
                            device_map="auto")

    print(generate_text(prompt_string))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", type=str, required=True, help="The prompt to be used in the GPT model")
    parser.add_argument("--model_version", type=str, default="./databricks/dolly-v2-12b", help="The model version to be used")
    args = parser.parse_args()
    main(args.prompt, args.model_version)

Building the container (optional)

You may want to create your own container for this kind of task. In that case, use the instructions for creating and publishing your own image in the docker hub. Use huggingface/transformers-pytorch-deepspeed-nightly-gpu as base image, install dependencies listed above and copy the inference.py into it. So your Dockerfile will look like this:

FROM huggingface/transformers-pytorch-deepspeed-nightly-gpu
RUN apt-get update -y
RUN pip -q install git+https://github.com/huggingface/transformers
RUN pip -q install accelerate>=0.12.0 
COPY ./inference.py .

Running Inference on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Structure of the command

export JOB_ID=$( ... ): Export results of a command execution as environment variable
bacalhau docker run: Run a job using docker executor.
--gpu 1: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-w /inputs: Flag to set the working directory inside the container to /inputs.
-i gitlfs://huggingface.co/databricks/dolly-v2-3b.git: Flag to clone the Dolly V2-3B model from Hugging Face's repository using Git LFS. The files will be mounted to /inputs/databricks/dolly-v2-3b.
-i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py: Flag to download the inference.py script from the provided URL. The file will be mounted to /inputs/inference.py.
jsacex/dolly_inference:latest: The name and the tag of the Docker image.
The command to run inference on the model: python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b". It consists of:
1. inference.py: The Python script that runs the inference process using the Dolly V2-3B model.
2. --prompt "Where is Earth located ?": Specifies the text prompt to be used for the inference.
3. --model_version "./databricks/dolly-v2-3b": Specifies the path to the Dolly V2-3B model. In this case, the model files are mounted to /inputs/databricks/dolly-v2-3b.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --id-only \
    -w /inputs \
    -i gitlfs://huggingface.co/databricks/dolly-v2-3b.git \
    -i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py \
    jsacex/dolly_inference:latest \
    -- python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b")

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output

After the download has finished, we can see the results in the results/outputs folder.

Speech Recognition using Whisper

Introduction

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Creators are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. In this example, we will transcribe an audio clip locally, containerize the script and then run the container on Bacalhau.

The advantage of using Bacalhau over managed Automatic Speech Recognition services is that you can run your own containers which can scale to do batch process petabytes of videos or audio for automatic speech recognition

TL;DR

bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

Prerequisite

To get started, you need to install:

Bacalhau client, see more information here.
Whisper.
PyTorch.
Python Pandas.

Running whisper locally

pip install git+https://github.com/openai/whisper.git
pip install torch==1.10.1
pip install --upgrade  pandas
sudo apt update && sudo apt install ffmpeg

Before we create and run the script we need a sample audio file to test the code. For that we download a sample audio clip:

wget https://github.com/js-ts/hello/raw/main/hello.mp3

Create the script

We will create a script that accepts parameters (input file path, output file path, temperature, etc.) and set the default parameters. Also if the input file is in mp4 format, then the script converts it to wav format. The transcript can be saved in various formats. Then the large model is loaded and we pass it the required parameters.

This model is not only limited to English and transcription, it supports many other languages.

Next, let's create an openai-whisper script:

#content of the openai-whisper.py file

import argparse
import os
import sys
import warnings
import whisper
from pathlib import Path
import subprocess
import torch
import shutil
import numpy as np
parser = argparse.ArgumentParser(description="OpenAI Whisper Automatic Speech Recognition")
parser.add_argument("-l",dest="audiolanguage", type=str,help="Language spoken in the audio, use Auto detection to let Whisper detect the language. Select from the following languages['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba'] ",default="English")
parser.add_argument("-p",dest="inputpath", type=str,help="Path of the input file",default="/hello.mp3")
parser.add_argument("-v",dest="typeverbose", type=str,help="Whether to print out the progress and debug messages. ['Live transcription', 'Progress bar', 'None']",default="Live transcription")
parser.add_argument("-g",dest="outputtype", type=str,help="Type of file to generate to record the transcription. ['All', '.txt', '.vtt', '.srt']",default="All")
parser.add_argument("-s",dest="speechtask", type=str,help="Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`). ['transcribe', 'translate']",default="transcribe")
parser.add_argument("-n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("-t",dest="decodingtemperature", type=int,help="Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.",default=0.15 )
parser.add_argument("-b",dest="beamsize", type=int,help="Number of Images",default=5)
parser.add_argument("-o",dest="output", type=str,help="Output Folder where to store the outputs",default="")

args=parser.parse_args()
device = torch.device('cuda:0')
print('Using device:', device, file=sys.stderr)

Model = 'large'
whisper_model =whisper.load_model(Model)
video_path_local = os.getcwd()+args.inputpath
file_name=os.path.basename(video_path_local)
output_file_path=args.output

if os.path.splitext(video_path_local)[1] == ".mp4":
    video_path_local_wav =os.path.splitext(file_name)[0]+".wav"
    result  = subprocess.run(["ffmpeg", "-i", str(video_path_local), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local_wav)])

# add language parameters
# Language spoken in the audio, use Auto detection to let Whisper detect the language.
#  ['Auto detection', 'Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Castilian', 'Catalan', 'Chinese', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Estonian', 'Faroese', 'Finnish', 'Flemish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Haitian', 'Haitian Creole', 'Hausa', 'Hawaiian', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Indonesian', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Lao', 'Latin', 'Latvian', 'Letzeburgesch', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Maori', 'Marathi', 'Moldavian', 'Moldovan', 'Mongolian', 'Myanmar', 'Nepali', 'Norwegian', 'Nynorsk', 'Occitan', 'Panjabi', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Pushto', 'Romanian', 'Russian', 'Sanskrit', 'Serbian', 'Shona', 'Sindhi', 'Sinhala', 'Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tagalog', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Turkish', 'Turkmen', 'Ukrainian', 'Urdu', 'Uzbek', 'Valencian', 'Vietnamese', 'Welsh', 'Yiddish', 'Yoruba']
language = args.audiolanguage
# Whether to print out the progress and debug messages.
# ['Live transcription', 'Progress bar', 'None']
verbose = args.typeverbose
#  Type of file to generate to record the transcription.
# ['All', '.txt', '.vtt', '.srt']
output_type = args.outputtype
# Whether to perform X->X speech recognition (`transcribe`) or X->English translation (`translate`).
# ['transcribe', 'translate']
task = args.speechtask
# Temperature to use for sampling.
temperature = args.decodingtemperature
#  Temperature to increase when falling back when the decoding fails to meet either of the thresholds below.
temperature_increment_on_fallback = 0.2
#  Number of candidates when sampling with non-zero temperature.
best_of = 5
#  Number of beams in beam search, only applicable when temperature is zero.
beam_size = args.beamsize
# Optional patience value to use in beam decoding, as in [*Beam Decoding with Controlled Patience*](https://arxiv.org/abs/2204.05424), the default (1.0) is equivalent to conventional beam search.
patience = 1.0
# Optional token length penalty coefficient (alpha) as in [*Google's Neural Machine Translation System*](https://arxiv.org/abs/1609.08144), set to negative value to uses simple length normalization.
length_penalty = -0.05
# Comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations.
suppress_tokens = "-1"
# Optional text to provide as a prompt for the first window.
initial_prompt = ""
# if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.
condition_on_previous_text = True
#  whether to perform inference in fp16.
fp16 = True
#  If the gzip compression ratio is higher than this value, treat the decoding as failed.
compression_ratio_threshold = 2.4
# If the average log probability is lower than this value, treat the decoding as failed.
logprob_threshold = -1.0
# If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence.
no_speech_threshold = 0.6

verbose_lut = {
    'Live transcription': True,
    'Progress bar': False,
    'None': None
}

args = dict(
    language = (None if language == "Auto detection" else language),
    verbose = verbose_lut[verbose],
    task = task,
    temperature = temperature,
    temperature_increment_on_fallback = temperature_increment_on_fallback,
    best_of = best_of,
    beam_size = beam_size,
    patience=patience,
    length_penalty=(length_penalty if length_penalty>=0.0 else None),
    suppress_tokens=suppress_tokens,
    initial_prompt=(None if not initial_prompt else initial_prompt),
    condition_on_previous_text=condition_on_previous_text,
    fp16=fp16,
    compression_ratio_threshold=compression_ratio_threshold,
    logprob_threshold=logprob_threshold,
    no_speech_threshold=no_speech_threshold
)

temperature = args.pop("temperature")
temperature_increment_on_fallback = args.pop("temperature_increment_on_fallback")
if temperature_increment_on_fallback is not None:
    temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))
else:
    temperature = [temperature]

if Model.endswith(".en") and args["language"] not in {"en", "English"}:
    warnings.warn(f"{Model} is an English-only model but receipted '{args['language']}'; using English instead.")
    args["language"] = "en"

video_transcription = whisper.transcribe(
    whisper_model,
    str(video_path_local),
    temperature=temperature,
    **args,
)

# Save output
writing_lut = {
    '.txt': whisper.utils.write_txt,
    '.vtt': whisper.utils.write_vtt,
    '.srt': whisper.utils.write_txt,
}

if output_type == "All":
    for suffix, write_suffix in writing_lut.items():
        transcript_local_path =os.getcwd()+output_file_path+'/'+os.path.splitext(file_name)[0] +suffix
        with open(transcript_local_path, "w", encoding="utf-8") as f:
            write_suffix(video_transcription["segments"], file=f)
        try:
            transcript_drive_path =file_name
        except:
            print(f"**Transcript file created: {transcript_local_path}**")
else:
    transcript_local_path =output_file_path+'/'+os.path.splitext(file_name)[0] +output_type

    with open(transcript_local_path, "w", encoding="utf-8") as f:
        writing_lut[output_type](video_transcription["segments"], file=f)

Let's run the script with the default parameters:

python openai-whisper.py

To view the outputs, execute following:

cat hello.srt

Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

FROM  pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

WORKDIR /

RUN apt-get -y update

RUN apt-get -y install git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow

RUN pip install git+https://github.com/openai/whisper.git

ADD hello.mp3 hello.mp3

ADD openai-whisper.py openai-whisper.py

RUN python openai-whisper.py

We choose pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime as our base image.

And then install all the dependencies, after that we will add the test audio file and our openai-whisper script to the container, we will also run a test command to check whether our script works inside the container and if the container builds successfully

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container;

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/whisper

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/whisper

Running a Bacalhau Job

We will transcribe the moon landing video, which can be found here: https://www.nasa.gov/multimedia/hd/apollo11_hdpage.html

Since the downloaded video is in mov format we convert the video to mp4 format and then upload it to our public storage in this case IPFS. To do this, a public IPFS network can be used, or you can create your own private IPFS network and use it to pin files.

After the dataset has been uploaded, copy the CID:

bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu

Structure of the command

Let's look closely at the command below:

export JOB_ID=$( ... ) exports the job ID as environment variable
bacalhau docker run: call to bacalhau
The-i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5r: flag to mount the CID which contains our file to the container at the path /inputs
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
jsacex/whisper: the name and the tag of the docker image we are using
python openai-whisper.py: execute the script with following parameters:
1. -p inputs/Apollo_11_moonwalk_montage_720p.mp4 : the input path of our file
2. -o outputs: the path where to store the outputs

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/whisper \
    -i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5rhbjlvpvhp7erkrc4nu \
    -- python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Speech Recognition using Whisper
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/whisper:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c   
          - python openai-whisper.py -p inputs/Apollo_11_moonwalk_montage_720p.mp4 -o outputs
    Resources:
      GPU: "1"

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputs folder. To view it, run the following command:

cat results/outputs/Apollo_11_moonwalk_montage_720p.vtt

Stable Diffusion on a GPU

This example tutorial demonstrates how to use Stable Diffusion on a GPU and run it on the Bacalhau demo network. Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

TL;DR

bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow"

Prerequisite

To get started, you need to install the Bacalhau client, see more information here.

Quick Test

Here is an example of an image generated by this model.

bacalhau docker run \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "cod swimming through data"

Development

This stable diffusion example is based on the Keras/Tensorflow implementation. You might also be interested in the Pytorch oriented diffusers library.

Installing dependencies

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

Based on the requirements here, we will install the following:

pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet
pip install tensorflow tensorflow_addons ftfy --upgrade --quiet
pip install tqdm --upgrade
apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2

Testing the Code

We have a sample code from this the Stable Diffusion in TensorFlow/Keras repo which we will use to check if the code is working as expected. Our output for this code will be a DSLR photograph of an astronaut riding a horse.

When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.

from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image

generator = Text2Image( 
    img_height=512,
    img_width=512,
    jit_compile=False,  # You can try True as well (different performance profile)
)
img = generator.generate(
    "DSLR photograph of an astronaut riding a horse",
    num_steps=50,
    unconditional_guidance_scale=7.5,
    temperature=1,
    batch_size=1,
)
pil_img = Image.fromarray(img[0])
display(pil_img)

When running this code, if you check the GPU RAM usage, you'll see that it's sucked up many GBs, and depending on what GPU you're running, it may OOM (Out of memory) if you run this again.

You can try and reduce RAM usage by playing with batch sizes (although it is only set to 1 above!) or more carefully controlling the TensorFlow session.

To clear the GPU memory we will use numba. This won't be required when running in a single-shot manner.

pip install numba --upgrade

# clearing the GPU memory 
from numba import cuda 
device = cuda.get_current_device()
device.reset()

Write the Script

You need a script to execute when we submit jobs. The code below is a slightly modified version of the code we ran above which we got from here, however, this includes more things such as argument parsing argument parsing to be able to customize the generator.

#content of the main.py file

import argparse
from stable_diffusion_tf.stable_diffusion import Text2Image
from PIL import Image
import os
parser = argparse.ArgumentParser(description="Stable Diffusion")
parser.add_argument("--h",dest="height", type=int,help="height of the image",default=512)
parser.add_argument("--w",dest="width", type=int,help="width of the image",default=512)
parser.add_argument("--p",dest="prompt", type=str,help="Description of the image you want to generate",default="cat")
parser.add_argument("--n",dest="numSteps", type=int,help="Number of Steps",default=50)
parser.add_argument("--u",dest="unconditionalGuidanceScale", type=float,help="Number of Steps",default=7.5)
parser.add_argument("--t",dest="temperature", type=int,help="Number of Steps",default=1)
parser.add_argument("--b",dest="batchSize", type=int,help="Number of Images",default=1)
parser.add_argument("--o",dest="output", type=str,help="Output Folder where to store the Image",default="./")

args=parser.parse_args()
height=args.height
width=args.width
prompt=args.prompt
numSteps=args.numSteps
unconditionalGuidanceScale=args.unconditionalGuidanceScale
temperature=args.temperature
batchSize=args.batchSize
output=args.output

generator = Text2Image(
    img_height=height,
    img_width=width,
    jit_compile=False,  # You can try True as well (different performance profile)
)

img = generator.generate(
    prompt,
    num_steps=numSteps,
    unconditional_guidance_scale=unconditionalGuidanceScale,
    temperature=temperature,
    batch_size=batchSize,
)
for i in range(0,batchSize):
  pil_img = Image.fromarray(img[i])
  image = pil_img.save(f"{output}/image{i}.png")

For a full list of arguments that you can pass to the script, see more information here

Run the Script

After writing the code the next step is to run the script.

python3 main.py

As a result, you will get something like this:

The following presents additional parameters you can try:

python main.py --p "cat with three eyes - to set prompt
python main.py --p "cat with three eyes" --n 100 - to set the number of iterations to 100
python stable-diffusion.py --p "cat with three eyes" --b 2 to set batch size to 2 (№ of images to generate)

Containerize Script using Docker

Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver. To containerize the inference code, we will create a Dockerfile. The Dockerfile is a text document that contains the commands that specify how the image will be built.

FROM tensorflow/tensorflow:2.10.0-gpu

RUN apt-get -y update

RUN apt-get -y install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2 git

RUN python3 -m pip install --upgrade pip

RUN python -m pip install regex tqdm Pillow tensorflow tensorflow_addons ftfy  --upgrade --quiet

RUN pip install git+https://github.com/fchollet/stable-diffusion-tensorflow --upgrade --quiet

ADD main.py main.py

# Run once so it downloads and caches the pre-trained weights
RUN python main.py --n 1

The Dockerfile leverages the latest official TensorFlow GPU image and then installs other dependencies like git, CUDA packages, and other image-related necessities. See the original repository for the expected requirements.

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container;

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace following:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag

In our case:

docker build -t ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 .

Running a Bacalhau Job

Structure of the command

Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network

To submit a job run the Bacalhau command with following structure:

export JOB_ID=$( ... ) exports the job ID as environment variable
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only flag is set to print only job id
ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1: the name and the tag of the docker image we are using
-- python main.py --o ./outputs --p "meme about tensorflow": The command to run inference on the model. It consists of:
1. main.py path to the script
2. --o ./outputs specifies the output directory
3. --p "meme about tensorflow" specifies the prompt

export JOB_ID=$(
    bacalhau docker run \
    --id-only \
    --gpu 1 \
    ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 \
    -- python main.py --o ./outputs --p "meme about tensorflow")

The Bacalhau command passes a prompt to the model and generates an image in the outputs directory. The main difference in the example below compared to all the other examples is the addition of the --gpu X flag, which tells Bacalhau to only schedule the job on nodes that have X GPUs free. You can read more about GPU support in the documentation.

This will take about 5 minutes to complete and is mainly due to the cold-start GPU setup time. This is faster than the CPU version, but you might still want to grab some fruit or plan your lunchtime run.

Furthermore, the container itself is about 10GB, so it might take a while to download on the node if it isn't cached.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputsfolder:

Stable Diffusion on a CPU

Introduction

Stable Diffusion is an open-source text-to-image model, which generates images from text. It's a cutting-edge alternative to DALL·E 2 and uses the Diffusion Probabilistic Model for image generation. At the core the model generates graphics from text using a Transformer.

This example demonstrates how to use stable diffusion online on a CPU and run it on the Bacalhau demo network. The first section describes the development of the code and the container. The second section demonstrates how to run the job using Bacalhau.

This model generated the images presented on this page.

TL;DR

bacalhau docker run ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
  -- python demo.py \
  --prompt "cod in space" \
  --output ../outputs/cod.png

Development

The original text-to-image stable diffusion model was trained on a fleet of GPU machines, at great cost. To use this trained model for inference, you also need to run it on a GPU.

However, this isn't always desired or possible. One alternative is to use a project called OpenVINO from Intel that allows you to convert and optimize models from a variety of frameworks (and ONNX if your framework isn't directly supported) to run on a supported Intel CPU. This is what we will do in this example.

Heads up! This example takes about 10 minutes to generate an image on an average CPU. Whilst this demonstrates it is possible, it might not be practical.

Prerequisites

In order to run this example you need:

A Debian-flavoured Linux (although you might be able to get it working on the newest machines)
Docker

Converting Stable Diffusion to a CPU Model Using OpenVINO

First we convert the trained stable diffusion models so that they work efficiently on a CPU with OpenVINO. Choose the fine tuned version of Stable Diffusion you want to use. The example is quite complex, so we have created a separate repository to host the code. This is a fork from this Github repository.

In summary, the code downloads a pre-optimized OpenVINO version of the original pre-trained stable diffusion model. This model leverages OpenAI's CLIP transformer and is wrapped inside an OpenVINO runtime, which executes the model.

The core code representing these tasks can be found in the stable_diffusion_engine.py file. This is a mashup that creates a pipeline necessary to tokenize the text and run the stable diffusion model. This boilerplate could be simplified by leveraging the more recent version of the diffusers library. But let's continue.

Install Dependencies

Note that these dependencies are only known to work on Ubuntu-based x64 machines.

sudo apt-get update
sudo apt-get install -y libgl1 libglib2.0-0 git-lfs

Clone the Repository and Dependencies

The following commands clone the example repository, and other required repositories, and install the Python dependencies.

git clone https://github.com/js-ts/stable_diffusion.openvino
cd stable_diffusion.openvino
git lfs install
git clone https://huggingface.co/openai/clip-vit-large-patch14
git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino
pip3 install -r requirements.txt

Generate an Image

Now that we have all the dependencies installed, we can call the demo.py wrapper, which is a simple CLI, to generate an image from a prompt.

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "hello" \
  --output hello.png

When the generation is complete, you can open the generated hello.png and see something like this:

Lets try another prompt and see what we get:

cd stable_diffusion.openvino && \
  python3 demo.py \
  --prompt "cat driving a car" \
  --output cat.png

Running Stable Diffusion (CPU) on Bacalhau

Now we have a working example, we can convert it into a format that allows us to perform inference in a distributed environment.

First we will create a Dockerfile to containerize the inference code. The Dockerfile can be found in the repository, but is presented here to aid understanding.

FROM python:3.9.9-bullseye

WORKDIR /src

RUN apt-get update && \
    apt-get install -y \
    libgl1 libglib2.0-0 git-lfs

RUN git lfs install

COPY requirements.txt /src/

RUN pip3 install -r requirements.txt

COPY stable_diffusion_engine.py demo.py demo_web.py /src/
COPY data/ /src/data/

RUN git clone https://huggingface.co/openai/clip-vit-large-patch14
RUN git clone https://huggingface.co/bes-dev/stable-diffusion-v1-4-openvino

# download models
RUN python3 demo.py --num-inference-steps 1 --prompt "test" --output /tmp/test.jpg

This container is using the python:3.9.9-bullseye image and the working directory is set. Next, the Dockerfile installs the same dependencies from earlier in this notebook. Then we add our custom code and pull the dependent repositories.

We've already pushed this image to GHCR, but for posterity, you'd use a command like this to update it:

docker buildx build --platform linux/amd64 --push -t ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 .

Prerequisites

To run this example you will need Bacalhau installed and running

Generating an Image Using Stable Diffusion on Bacalhau

Bacalhau is a distributed computing platform that allows you to run jobs on a network of computers. It is designed to be easy to use and to run on a variety of hardware. In this example, we will use it to run the stable diffusion model on a CPU.

To submit a job, you can use the Bacalhau CLI. The following command passes a prompt to the model and generates an image in the outputs directory.

This will take about 10 minutes to complete. Go grab a coffee. Or a beer. Or both. If you want to block and wait for the job to complete, add the --wait flag.

Furthermore, the container itself is about 15GB, so it might take a while to download on the node if it isn't cached.

Structure of the command

export JOB_ID=$( ... ): Export results of a command execution as environment variable
bacalhau docker run: Run a job using docker executor.
--id-only: Flag to print out only the job id
ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1: The name and the tag of the Docker image.
The command to run inference on the model: python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png. It consists of:
1. demo.py: The Python script that runs the inference process.
2. --prompt "First Humans On Mars": Specifies the text prompt to be used for the inference.
3. --output ../outputs/mars.png: Specifies the path to the output image.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \  
ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1 \
--id-only \
-- python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png)

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output

After the download has finished we can see the results in the results/outputs folder.

Object Detection with YOLOv5 on Bacalhau

Introduction

The identification and localization of objects in images and videos is a computer vision task called object detection. Several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmond et al.

Traditionally, models like YOLO required enormous amounts of training data to yield reasonable results. People might not have access to such high-quality labeled data. Thankfully, open-source communities and researchers have made it possible to utilize pre-trained models to perform inference. In other words, you can use models that have already been trained on large datasets to perform object detection on your own data.

Bacalhau is a highly scalable decentralized computing platform and is well suited to running massive object detection jobs. In this example, you can take advantage of the GPUs available on the Bacalhau Network and perform an end-to-end object detection inference, using the YOLOv5 Docker Image developed by Ultralytics.

TL;DR

Load your dataset into S3/IPFS, specify it and pre-trained weights via the --input flags, choose a suitable container, specify the command and path to save the results - done!

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Test Run with Sample Data

To get started, let's run a test job with a small sample dataset that is included in the YOLOv5 Docker Image. This will give you a chance to familiarise yourself with the process of running a job on Bacalhau.

In addition to the usual Bacalhau flags, you will also see example of using the --gpu 1 flag in order to specify the use of a GPU.

Remember that by default Bacalhau does not provide any network connectivity when running a job. So you need to either provide all assets at job submission time, or use the --network=full or --network=http flags to access the data at task time. See the Internet Access page for more details

The model requires pre-trained weights to run and by default downloads them from within the container. Bacalhau jobs don't have network access so we will pass in the weights at submission time, saving them to /usr/src/app/yolov5s.pt. You may also provide your own weights here.

The container has its own options that we must specify:

--input to select which pre-trained weights you desire with details on the yolov5 release page
--project specifies the output volume that the model will save its results to. Bacalhau defaults to using /outputs as the output directory, so we save it there.

For more container flags refer to the yolov5/detect.py file in the YOLO repository.

One final additional hack that we have to do is move the weights file to a location with the standard name. As of writing this, Bacalhau downloads the file to a UUID-named file, which the model is not expecting. This is because GitHub 302 redirects the request to a random file in its backend.

Structure of the command

export JOB_ID=$( ... ) exports the job ID as environment variable
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
The --timeout flag is set to make sure that if the job is not completed in the specified time, it will be terminated
The --wait flag is set to wait for the job to complete before return
The --wait-timeout-secs flag is set together with --wait to define how long should app wait for the job to complete
The --id-only flag is set to print only job id
The --input flags are used to specify the sources of input data
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs' tells the model where to find input data and where to write output

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs')

This should output a UUID (like 59c59bfb-4ef8-45ac-9f4b-f0e9afd26e70), which will be stored in the environment variable JOB_ID. This is the ID of the job that was created. You can check the status of the job using the commands below.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Object Detection with YOLOv5
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: ultralytics/yolov5:v6.2
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "find /inputs -type f -exec cp {} /outputs/yolov5s.pt \\; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Source:
          Type: "urlDownload"
          Params:
            URL: "https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt"
        Target: "/inputs"

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing Output

After the download has finished we can see the results in the results/outputs/exp folder.

Using Custom Images as an Input

Now let's use some custom images. First, you will need to ingest your images onto IPFS or S3 storage. For more information about how to do that see the data ingestion section.

This example will use the Cyclist Dataset for Object Detection | Kaggle dataset.

We have already uploaded this dataset to the IPFS storage under the CID: bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm. You can browse to this dataset via a HTTP IPFS proxy.

Let's run a the same job again, but this time use the images above.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait \
--wait-timeout-secs 3600 \
--id-only \
--input https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt \
--input ipfs://bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm:/datasets \
ultralytics/yolov5:v6.2 \
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source /datasets --project /outputs')

Just as in the example above, this should output a UUID, which will be stored in the environment variable JOB_ID. You can check the status of the job using the commands below.

To check the state of the job and view job output refer to the instructions above.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Generate Realistic Images using StyleGAN3 and Bacalhau

Introduction

In this example tutorial, we will show you how to generate realistic images with StyleGAN3 and Bacalhau. StyleGAN is based on Generative Adversarial Networks (GANs), which include a generator and discriminator network that has been trained to differentiate images generated by the generator from real images. However, during the training, the generator tries to fool the discriminator, which results in the generation of realistic-looking images. With StyleGAN3 we can generate realistic-looking images or videos. It can generate not only human faces but also animals, cars, and landscapes.

TL;DR

bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Running StyleGAN3 locally

To run StyleGAN3 locally, you'll need to clone the repo, install dependencies and download the model weights.

git clone https://github.com/NVlabs/stylegan3
cd stylegan3
conda env create -f environment.yml
conda activate stylegan3
wget https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl

Now you can generate an image using a pre-trained AFHQv2 model. Here is an example of the image we generated:

Containerize Script with Docker

To build your own docker container, create a Dockerfile, which contains instructions to build your image.

FROM nvcr.io/nvidia/pytorch:21.08-py3

COPY . /scratch

WORKDIR /scratch

ENV HOME /scratch

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create docker account (https://docs.docker.com/docker-id/), and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/stylegan3

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsacex/stylegan3

Running a Bacalhau Job

Structure of the command

To submit a job run the Bacalhau command with following structure:

export JOB_ID=$( ... ) exports the job ID as environment variable
bacalhau docker run: call to Bacalhau
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only flag is set to print only job id
jsacex/stylegan3: the name and the tag of the docker image we are using
python gen_images.py: execute the script with following parameters:
1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: The animation length is either determined based on the --seeds value or explicitly specified using the --num-keyframes option. When num keyframes are specified with --num-keyframes, the output video length will be num_keyframes * w_frames frames.
2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    jsacex/stylegan3 \
    -- python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl)

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: StyleGAN3
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/stylegan3" 
        Parameters:
          - python gen_images.py --outdir=../outputs --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. stylegan3.yaml, and then run with the command:

bacalhau job run stylegan3.yaml

Render a latent vector interpolation video

You can also run variations of this command to generate videos and other things. In the following command below, we will render a latent vector interpolation video. This will render a 4x2 grid of interpolations for seeds 0 through 31.

Structure of the command

Let's look closely at the command below:

export JOB_ID=$( ... ) exports the job ID as environment variable
bacalhau docker run: call to bacalhau
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only flag is set to print only job id
jsacex/stylegan3 the name and the tag of the docker image we are using
python gen_images.py: execute the script with following parameters:
1. --trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl: The animation length is either determined based on the --seeds value or explicitly specified using the --num-keyframes option. When num keyframes is specified with --num-keyframes, the output video length will be num_keyframes * w_frames frames. If --num-keyframes is not specified, the number of seeds given with --seeds must be divisible by grid size W*H (--grid). In this case, the output video length will be # seeds/(w*h)*w_frames frames.
2. ../outputs: path to the output

export JOB_ID=$(bacalhau docker run \
    jsacex/stylegan3 \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -- python gen_video.py --output=../outputs/lerp.mp4 --trunc=1 \
        --seeds=0-31 \
        --grid=4x2 \
        --network=stylegan3-r-afhqv2-512x512.pkl)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputs folder.

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Stable Diffusion Checkpoint Inference

Introduction

Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.

This example demonstrates how to use stable diffusion using a finetuned model and run inference on it. The first section describes the development of the code and the container - it is optional as users don't need to build their own containers to use their own custom model. The second section demonstrates how to convert your model weights to ckpt. The third section demonstrates how to run the job using Bacalhau.

The following guide is using the fine-tuned model, which was finetuned on Bacalhau. To learn how to finetune your own stable diffusion model refer to this guide.

TL;DR

Convert your existing model weights to the ckpt format and upload to the IPFS storage.
Create a job using bacalhau docker run, relevant docker image, model weights and any prompt.
Download results using bacalhau job get and the job id.

Prerequisite

To get started, you need to install:

Bacalhau client, see more information here
NVIDIA GPU
CUDA drivers
NVIDIA docker

Running Locally

Containerize your Script using Docker

This part of the guide is optional - you can skip it and proceed to the Running a Bacalhau job if you are not going to use your own custom image.

To build your own docker container, create a Dockerfile, which contains instructions to containerize the code for inference.

FROM  pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime

WORKDIR /

RUN apt update &&  apt install -y git

RUN git clone https://github.com/runwayml/stable-diffusion.git

WORKDIR /stable-diffusion

RUN conda env create -f environment.yaml

SHELL ["conda", "run", "-n", "ldm", "/bin/bash", "-c"]

RUN pip install opencv-python

RUN apt update

RUN apt-get install ffmpeg libsm6 libxext6 libxrender-dev  -y

This container is using the pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime image and the working directory is set. Next the Dockerfile installs required dependencies. Then we add our custom code and pull the dependent repositories.

See more information on how to containerize your script/app here

Build the container

We will run docker build command to build the container.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create the Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag

So in our case, the command will look like this:

docker build -t jsacex/stable-diffusion-ckpt

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

Thus, in this case, the command would look this way:

docker push jsacex/stable-diffusion-ckpt

After the repo image has been pushed to Docker Hub, you can now use the container for running on Bacalhau. But before that you need to check whether your model is a ckpt file or not. If your model is a ckpt file you can skip to the running on Bacalhau, and if not - the next section describes how to convert your model into the ckpt format.

Converting model weights to CKPT

To download the convert script:

wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

To convert the model weights into ckpt format, the --half flag cuts the size of the output model from 4GB to 2GB:

python3 convert_diffusers_to_original_stable_diffusion.py \
    --model_path <path-to-the-model-weights>  \
    --checkpoint_path <path-to-save-the-checkpoint>/model.ckpt \
    --half

Running a Bacalhau Job

To do inference on your own checkpoint on Bacalhau you need to first upload it to your public storage, which can be mounted anywhere on your machine. In this case, we will be using NFT.Storage (Recommended Option). To upload your dataset using NFTup drag and drop your directory and it will upload it to IPFS.

After the checkpoint file has been uploaded copy its CID.

Structure of the command

Let's look closely at the command above:

export JOB_ID=$( ... ): Export results of a command execution as environment variable
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt: Path to mount the checkpoint
-- conda run --no-capture-output -n ldm: since we are using conda we need to specify the name of the environment which we are going to use, in this case it is ldm
scripts/txt2img.py: running the python script
--prompt "a photo of a person drinking coffee": the prompt you need to specify the session name in the prompt.
--plms: the sampler you want to use. In this case we will use the plms sampler
--ckpt ../model.ckpt: here we specify the path to our checkpoint
--n_samples 1: no of samples we want to produce
--skip_grid: skip creating a grid of images
--outdir ../outputs: path to store the outputs
--seed $RANDOM: The output generated on the same prompt will always be the same for different outputs on the same prompt set the seed parameter to random

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait-timeout-secs 3600 \
--wait \
--id-only \
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt \
jsacex/stable-diffusion-ckpt \
-- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of a person drinking coffee" --plms --ckpt ../model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list:

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe:

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir results
bacalhau job get ${JOB_ID} --output-dir results

Viewing your Job Output

After the download has finished we can see the results in the results/outputs folder. We received following image for our prompt:

Running Inference on a Model stored on S3

Introduction

In this example, we will demonstrate how to run inference on a model stored on Amazon S3. We will use a PyTorch model trained on the MNIST dataset.

Running Locally

Prerequisites

Consider using the latest versions or use the docker method listed below in the article.

Python
PyTorch

Downloading the Datasets

Use the following commands to download the model and test image:

wget https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz
wget https://raw.githubusercontent.com/js-ts/mnist-test/main/digit.png

Creating the Inference Script

This script is designed to load a pretrained PyTorch model for MNIST digit classification from a tar.gz file, extract it, and use the model to perform inference on a given input image. Ensure you have all required dependencies installed:

pip install Pillow torch torchvision

# content of the inference.py file
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.autograd import Variable
import argparse
import tarfile

class CustomModel(torch.nn.Module):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, 5)
        self.conv2 = torch.nn.Conv2d(10, 20, 5)
        self.fc1 = torch.nn.Linear(320, 50)
        self.fc2 = torch.nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        output = torch.log_softmax(x, dim=1)
        return output

def extract_tar_gz(file_path, output_dir):
    with tarfile.open(file_path, 'r:gz') as tar:
        tar.extractall(path=output_dir)

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--tar_gz_file_path', type=str, required=True, help='Path to the tar.gz file')
parser.add_argument('--output_directory', type=str, required=True, help='Output directory to extract the tar.gz file')
parser.add_argument('--image_path', type=str, required=True, help='Path to the input image file')
args = parser.parse_args()

# Extract the tar.gz file
tar_gz_file_path = args.tar_gz_file_path
output_directory = args.output_directory
extract_tar_gz(tar_gz_file_path, output_directory)

# Load the model
model_path = f"{output_directory}/model.pth"
model = CustomModel()
model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))
model.eval()

# Transformations for the MNIST dataset
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Function to run inference on an image
def run_inference(image, model):
    image_tensor = transform(image).unsqueeze(0)  # Apply transformations and add batch dimension
    input = Variable(image_tensor)

    # Perform inference
    output = model(input)
    _, predicted = torch.max(output.data, 1)
    return predicted.item()

# Example usage
image_path = args.image_path
image = Image.open(image_path)
predicted_class = run_inference(image, model)
print(f"Predicted class: {predicted_class}")

Running the Inference Script

To use this script, you need to provide the paths to the tar.gz file containing the pre-trained model, the output directory where the model will be extracted, and the input image file for which you want to perform inference. The script will output the predicted digit (class) for the given input image.

python3 inference.py --tar_gz_file_path ./model.tar.gz --output_directory ./model --image_path ./digit.png

Running Inference on Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information here

Structure of the Command

export JOB_ID=$( ... ): Export results of a command execution as environment variable
-w /inputs Set the current working directory at /inputs in the container
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1: Mount the s3 bucket at the destination path provided - /model/ and specifying the region where the bucket is located opt=region=us-east-1
-i git://github.com/js-ts/mnist-test.git: Flag to mount the source code repo from GitHub. It would mount the repo at /inputs/js-ts/mnist-test in this case it also contains the test image
pytorch/pytorch: The name of the Docker image
-- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png: The command to run inference on the model. It consists of:
1. /model/model.tar.gz is the path to the model file
2. /model-pth is the output directory for the model
3. /inputs/js-ts/mnist-test/image.png is the path to the input image

export JOB_ID=$(bacalhau docker run \
--wait \
--id-only \
--timeout 3600 \
--wait-timeout-secs 3600 \
-w /inputs \
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1 \
-i git://github.com/js-ts/mnist-test.git \
pytorch/pytorch \
 -- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png)

When the job is submitted Bacalhau prints out the related job id. We store that in an environment variable JOB_ID so that we can reuse it later on.

Viewing the Output

Use the bacalhau job logs command to view the job output, since the script prints the result of execution to the stdout:

bacalhau job logs ${JOB_ID}

Predicted class: 0

You can also use bacalhau job get to download job results:

bacalhau job get ${JOB_ID}

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).

Model Training

Training Pytorch Model with Bacalhau

Introduction

In this example tutorial, we will show you how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.

TL;DR

bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Training the Model Locally

git clone https://github.com/pytorch/examples

Install the following:

pip install --upgrade torch torchvision

Next, we run the command below to begin the training of the mnist_rnn model. We added the --save-model flag to save the model

python ./examples/mnist_rnn/main.py --save-model

Next, the downloaded MNIST dataset is saved in the data folder.

Uploading Dataset to IPFS

Running a Bacalhau Job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model)

Structure of the command

export JOB_ID=$( ... ) exports the job ID as environment variable
bacalhau docker run: call to bacalhau
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
pytorch/pytorch: Using the official pytorch Docker image
The -i ipfs://QmdeQjz1HQQd.....: flag is used to mount the uploaded dataset
-w /outputs: Our working directory is /outputs. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python ../inputs/main.py --save-model: URL script gets mounted to the /inputs folder in the container

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

name: Stable Diffusion Dreambooth Finetuning
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "pytorch/pytorch" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python ../inputs/main.py --save-model
    InputSources:
      - Source:
          Type: "ipfs"
          Params:
            CID: "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
        Target: /data
      - Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py
        Target: /inputs  
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. torch.yaml, and then run with the command:

bacalhau job run torch.yaml

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find results in the results/outputs folder. To view them, run the following command:

ls results/ # list the contents of the current directory

cat results/stdout # displays the contents of the file given to it as a parameter.

ls results/outputs/ # list the successfully trained model

Training Tensorflow Model

Introduction

is an open-source machine learning software library, TensorFlow is used to train neural networks. Expressed in the form of stateful dataflow graphs, each node in the graph represents the operations performed by neural networks on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training a MNIST model.

Training TensorFlow models Locally

This section is from

TensorFlow 2 quickstart for beginners

This short introduction uses to:

Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.

Set up TensorFlow

Import TensorFlow into your program to check whether it is installed

import tensorflow as tf
import os
print("TensorFlow version:", tf.__version__)

mkdir /inputs
wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz -O /inputs/mnist.npz

mnist = tf.keras.datasets.mnist

CWD = '' if os.getcwd() == '/' else os.getcwd()
(x_train, y_train), (x_test, y_test) = mnist.load_data('/inputs/mnist.npz')
x_train, x_test = x_train / 255.0, x_test / 255.0

Build a machine-learning model

Build a tf.keras.Sequential model by stacking layers.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
predictions

The tf.nn.softmax function converts these logits to probabilities for each class:

tf.nn.softmax(predictions).numpy()

Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.

loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

Train and evaluate your model

Use the Model.fit method to adjust your model parameters and minimize the loss:

model.fit(x_train, y_train, epochs=5)

model.evaluate(x_test,  y_test, verbose=2)

If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:

probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

probability_model(x_test[:5])

mkdir /outputs

The following method can be used to save the model as a checkpoint

model.save_weights('/outputs/checkpoints/my_checkpoint')

ls /outputs/

Running on Bacalhau

The dataset and the script are mounted to the TensorFlow container using an URL, we then run the script inside the container

Declarative job description

name: Training ML model using tensorflow
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: "tensorflow/tensorflow" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python train.py
    InputSources:
      - Source:
          Type: urlDownload
          Params:
            URL: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
        Target: /inputs
      - Source:
          Type: urlDownload
          Params:
            URL: https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py
        Target: /inputs
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. tensorflow.yaml, and then run with the command:

bacalhau job run tensorflow.yaml

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputs folder. To view it, run the following command:

cat results/outputs/

Support

Stable Diffusion Dreambooth (Finetuning)

Introduction

Stable diffusion has revolutionalized text2image models by producing high quality images based on a prompt. Dreambooth is a approach for personalization of text-to-image diffusion models. With images as input subject, we can fine-tune a pretrained text-to-image model

Although the used to finetune the pre-trained model since both the Imagen model and Dreambooth code are closed source, several opensource projects have emerged using stable diffusion.

Dreambooth makes stable-diffusion even more powered with the ability to generate realistic looking pictures of humans, animals or any other object by just training them on 20-30 images.

In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.

TL;DR

The following command generates the following:

Subject: SBF
Prompt: a photo of SBF without hair

bacalhau docker run \
 --gpu 1 \
 --timeout 3600 \
 --wait-timeout-secs 3600 \
  -i ipfs://QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy  \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of sbf man" "a photo of man" 3000 "/man" "/model"

bacalhau docker run \
 --gpu 1 \
  -i ipfs://QmUEJPr5pfV6tRzWQuNSSb3wdcN6tRQS5tdk3dYSCJ55Xs:/SBF.ckpt \
   jsacex/stable-diffusion-ckpt \
   -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of sbf without hair" --plms --ckpt ../SBF.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs

Output:

Building this container requires you to have a supported GPU which needs to have 16gb+ of memory, since it can be resource intensive.

We will create a Dockerfile and add the desired configuration to the file. Following commands specify how the image will be built, and what extra requirements will be included:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel

WORKDIR /

# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers

RUN apt update && apt install wget git unzip -y

RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py

RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py

RUN pip install -qq git+https://github.com/TheLastBen/diffusers

RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort

# Install xformers

RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip

RUN unzip -j woman.zip -d woman

RUN unzip -j man.zip -d man

RUN unzip -j mix.zip -d mix

This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and the working directory is set. Next, we add our custom code and pull the dependent repositories.

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:=/model} \
  --tokenizer_name=${7:=/model}/tokenizer/ \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

The shell script is there to make things much simpler since the command to train the model needs many parameters to pass and later convert the model weights to the checkpoint, you can edit this script and add in your own parameters

To download the models and run a test job in the Docker file, copy the following:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
WORKDIR /
# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers
RUN apt update && apt install wget git unzip -y
RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py
RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py
RUN pip install -qq git+https://github.com/TheLastBen/diffusers
RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort
# Install xformers
RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl
# You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work.
# https://huggingface.co/settings/tokens
RUN mkdir -p ~/.huggingface
RUN echo -n "<your-hugging-face-token>" > ~/.huggingface/token
# copy the test dataset from a local file
# COPY jfk /jfk

# Download and extract the test dataset
RUN wget https://github.com/js-ts/test-images/raw/main/jfk.zip
RUN unzip -j jfk.zip -d jfk
RUN mkdir model
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip
RUN unzip -j woman.zip -d woman
RUN unzip -j man.zip -d man
RUN unzip -j mix.zip -d mix

RUN  accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_starting_step=5\
  --stop_text_encoder_training=31 \
  --class_data_dir=/man \
  --save_n_steps=5 \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/jfk" \
  --output_dir="/model" \
  --instance_prompt="a photo of jfk man" \
  --class_prompt="a photo of man" \
  --instance_prompt="" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=30

COPY finetune.sh /finetune.sh

RUN wget -q https://gist.githubusercontent.com/js-ts/624fecc3fff807d4948688cb28993a94/raw/fd69ac084debe26a815485c1f363b8a45566f1ba/clear_mem.py
# Removing your token
RUN rm -rf  ~/.huggingface/token

Then execute finetune.sh with following commands:

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
    --image_captions_filename \
   --train_text_encoder \
    --save_n_steps=$(expr $5 / 6) \
    --stop_text_encoder_training=$(expr $5 + 100) \
       --class_data_dir="$6" \
  --pretrained_model_name_or_path=${7:=/model} \
--tokenizer_name=${7:=/model}/tokenizer/ \
    --instance_data_dir="$1" \
    --output_dir="$2" \
    --instance_prompt="$3" \
   --class_prompt="$4" \
    --seed=96576 \
    --resolution=512 \
    --mixed_precision="fp16" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --use_8bit_adam \
    --learning_rate=2e-6 \
    --lr_scheduler="polynomial" \
    --center_crop \
    --lr_warmup_steps=0 \
    --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

repo-name with the name of the container, you can name it anything you want.
tag this is not required but you can use the latest tag

Now you can push this repository to the registry designated by its name or tag.

docker push <hub-user>/<repo-name>:<tag>

The optimal dataset size is between 20-30 images. You can choose the images of the subject in different positions, full body images, half body, pictures of the face etc.

Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:

Subject Name.jpg, Subject Name (2).jpg ... Subject Name (n).jpg

After the Subject dataset is created we upload it to IPFS.

To upload your dataset using NFTup just drag and drop your directory it will upload it to IPFS:

After the checkpoint file has been uploaded, copy its CID which will look like this:

bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a

Since there are a lot of combinations that you can try, processing of finetuned model can take almost 1hr+ to complete. Here are a few approaches that you can try based on your requirements:

bacalhau docker run: call to bacalhau
The --gpu 1 flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a Mounts the data from IPFS via its CID
jsacex/dreambooth:latest Name and tag of the docker image we are using
-- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man" execute script with following paramters:
1. /inputs Path to the subject Images
2. /outputs Path to save the generated outputs
3. "a photo of < name of the subject > < class >" -> "a photo of David Aronchick man" Subject name along with class
4. "a photo of < class >" -> "a photo of man" Name of the class

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> man" "a photo of man" 3000 "/man" "/model"

The number of iterations is 3000. This number should be no of subject images x 100. So if there are 30 images, it would be 3000. It takes around 32 minutes on a v100 for 3000 iterations, but you can increase/decrease the number based on your requirements.

Here is our command with our parameters replaced:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a \
  --wait \
  --id-only \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man" "/model"

If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.

Use the /woman class images

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> woman" "a photo of woman" 3000 "/woman"  "/model"

Here you can provide your own regularization images or use the mix class.

Use the /mix class images if the class of the subject is mix

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i <CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> mix" "a photo of mix" 3000 "/mix"  "/model"

You can upload the model to IPFS and then create a gist, mount the model and script to the lightweight container

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a:/aronchick \
  -i ipfs://<CID-OF-THE-MODEL>:/model 
  -i https://gist.githubusercontent.com/js-ts/54b270a36aa3c35fdc270640680b3bd4/raw/7d8e8fa47bc3811ef63772f7fc7f4360aa9d51a8/finetune.sh
  --wait \
  --id-only \
  jsacex/dreambooth:lite \
  -- bash /inputs/finetune.sh /aronchick /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"

When a job is submitted, Bacalhau prints out the related job_id. Use the export JOB_ID=$(bacalhau docker run ...) wrapper to store that in an environment variable so that we can reuse it later on.

name: Stable Diffusion Dreambooth Finetuning
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/dreambooth:full" 
        Parameters:
          - bash finetune.sh /inputs /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"
    InputSources:
      - Target: "/inputs/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy"
    Resources:
      GPU: "1"

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished you should see the following contents in results directory

Now you can find the file in the results/outputs folder. You can view results by running following commands:

ls results # list the contents of the current directory

In the next steps, we will be doing inference on the finetuned model

Bacalhau currently doesn't support mounting subpaths of the CID, so instead of just mounting the model.ckpt file we need to mount the whole output CID which is 6.4GB, which might result in errors like FAILED TO COPY /inputs. So you have to manually copy the CID of the model.ckpt which is of 2GB.

To get the CID of the model.ckpt file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/. For example:

https://gateway.ipfs.io/ipfs/QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV/outputs/

ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Or you can use the IPFS CLI:

ipfs ls QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Copy the link of model.ckpt highlighted in the box:

https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt

Then extract the CID portion of the link and copy it.

To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run command.

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

If you are facing difficulties using the above method you can mount the whole output CID

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/outputs/model.ckpt --skip_grid --n_samples 1 --skip_grid --outdir ../outputs)

When a job is sumbitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

We got an image like this as a result:

Molecular Dynamics

Running BIDS Apps on Bacalhau

Introduction

In this example tutorial, we will look at how to run BIDS App on Bacalhau. BIDS (Brain Imaging Data Structure) is an emerging standard for organizing and describing neuroimaging datasets. is a container image capturing a neuroimaging pipeline that takes a BIDS formatted dataset as input. Each BIDS App has the same core set of command line arguments, making them easy to run and integrate into automated platforms. BIDS Apps are constructed in a way that does not depend on any software outside of the image other than the container engine.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Downloading datasets

For this tutorial, download file ds005.tar from this Bids dataset and untar it in a directory:

mkdir data
tar -xf ds005.tar -C data

Let's take a look at the structure of the data directory:

data
└── ds005
    ├── CHANGES
    ├── dataset_description.json
    ├── participants.tsv
    ├── README
    ├── sub-01
    │   ├── anat
    │   │   ├── sub-01_inplaneT2.nii.gz
    │   │   └── sub-01_T1w.nii.gz
    │   └── func
    │       ├── sub-01_task-mixedgamblestask_run-01_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-01_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-02_bold.nii.gz
    │       ├── sub-01_task-mixedgamblestask_run-02_events.tsv
    │       ├── sub-01_task-mixedgamblestask_run-03_bold.nii.gz
    │       └── sub-01_task-mixedgamblestask_run-03_events.tsv
    ├── sub-02
    │   ├── anat
    │   │   ├── sub-02_inplaneT2.nii.gz
    │   │   └── sub-02_T1w.nii.gz
    ...

Uploading the datasets to IPFS

When you pin your data, you'll get a CID which is in a format like this QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz. Copy the CID as it will be used to access your data

Running a Bacalhau Job

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    -i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data \
    nipreps/mriqc:latest \
    -- mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03)

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau
-i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data: mount the CID of the dataset that is uploaded to IPFS and mount it to a folder called data on the container
nipreps/mriqc:latest: the name and the tag of the docker image we are using
../data/ds005: path to input dataset
../outputs: path to the output
participant --participant_label 01 02 03: run the mriqc on subjects with participant labels 01, 02, and 03

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

Copy

name: Running BIDS
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: nipreps/mriqc:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - mriqc ../data/ds005 ../outputs participant --participant_label 01 02 03
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs
    InputSources:
      - Target: "/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz"

The job description should be saved in .yaml format, e.g. bids.yaml, and then run with the command:

bacalhau job run bids.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

Viewing your Job Output

To view the file, run the following command:

ls results/ # list the contents of the current directory 
cat results/stdout # displays the contents of the current directory

Support

Coresets On Bacalhau

Introduction

is a data subsetting method. Since the uncompressed datasets can get very large when compressed, it becomes much harder to train them as training time increases with the dataset size. To reduce training time and cut costs, we employ the coreset method; the coreset method can also be applied to other datasets. In this case, we use the coreset method which can lead to a fast speed in solving the k-means problem among the big data with high accuracy in the meantime.

We construct a small coreset for arbitrary shapes of numerical data with a decent time cost. The implementation was mainly based on the coreset construction algorithm that was proposed by Braverman et al. (SODA 2021).

For a deeper understanding of the core concepts, it's recommended to explore: 1. 2.

In this tutorial example, we will run compressed dataset with Bacalhau

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Running Locally

Clone the repo which contains the code

git clone https://github.com/js-ts/Coreset

Downloading the dataset

To download the dataset you should open Street Map, which is a public repository that aims to generate and distribute accessible geographic data for the whole world. Basically, it supplies detailed position information, including the longitude and latitude of the places around the world.

The dataset is a osm.pbf (compressed format for .osm file), the file can be downloaded from

wget https://download.geofabrik.de/europe/monaco-latest.osm.pbf

Installing Dependencies

The following command is installing Linux dependencies:

sudo apt-get -y update \
sudo apt-get -y install osmium-tool \
sudo apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev

Ensure that the requirements.txt file contains the following dependencies:

# requirements.txt
certifi==2020.12.5
chardet==4.0.0
cycler==0.10.0
idna==2.10
joblib
kiwisolver==1.3.1
lxml==4.6.2
matplotlib==3.3.3
numpy==1.19.4
overpy==0.4
pandas==1.1.4
Pillow==8.0.1
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.4
requests==2.25.1
scikit-learn
scipy
six==1.15.0
threadpoolctl
tqdm==4.56.0
urllib3==1.26.2
geopandas

The following command is installing Python dependencies:

pip3 install -r Coreset/requirements.txt

Running the Script

To run coreset locally, you need to convert from compressed pbf format to geojson format:

osmium export monaco-latest.osm.pbf -o monaco-latest.geojson

The following command is to run the Python script to generate the coreset:

python Coreset/python/coreset.py -f monaco-latest.geojson

Containerize Script using Docker

To build your own docker container, create a Dockerfile, which contains instructions on how the image will be built, and what extra requirements will be included.

FROM python:3.8

RUN apt-get -y update && apt-get -y install osmium-tool && apt update && apt-get -y install libpq-dev gdal-bin libgdal-dev libxml2-dev libxslt-dev

ADD Coreset Coreset

ADD monaco-latest.geojson .

RUN cd Coreset && pip3 install -r requirements.txt

We will use the python:3.8 image, we run the same commands for installing dependencies that we used locally.

Build the container

We will run docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

In our case:

docker build -t jsace/coreset

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

In our case:

docker push jsace/coreset

Running a Bacalhau Job

bacalhau docker run \
    --input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson \
    jsace/coreset \
    -- /bin/bash -c 'python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs'

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to bacalhau
--input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson: mount the monaco-latest.geojson file inside the container so it can be used by the script
jsace/coreset: the name of the docker image we are using
python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs: the script initializes cluster centers, creates a coreset using these centers, and saves the results to the specified folder.

Additional parameters:

-k: amount of initialized centers (default=5)

-n: size of coreset (default=50)

-o: the output folder

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

name: Coresets On Bacalhau
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsace/coreset" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - "osmium export input/liechtenstein-latest.osm.pbf -o /liechtenstein-latest.geojson;python Coreset/python/coreset.py -f /liechtenstein-latest.geojson -o /outputs"
    Publisher:
      Type: ipfs
    ResultPaths:
      - Name: outputs
        Path: /outputs      
    InputSources:
      - Source:
          Type: "s3"
          Params:
            Bucket: "coreset"
            Key: "*"
            Region: "us-east-1"
        Target: "/input"

The job description should be saved in .yaml format, e.g. coreset.yaml, and then run with the command:

bacalhau job run coreset.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

Viewing your Job Output

To view the file, run the following command:

ls results/outputs

centers.csv                       coreset-weights-monaco-latest.csv
coreset-values-monaco-latest.csv  ids.csv

To view the output as a CSV file, run:

cat results/outputs/centers.csv | head -n 10

lat,long
7.423843975787508,43.730621154072196
7.4252607,43.7399135
7.411026970571964,43.72937671121925
7.459404485446199,43.62065587026715
7.429551373022234,43.74042043301333

cat results/outputs/coreset-values-monaco-latest.csv | head -n 10

7.418849799999999384e+00,4.372759140000000144e+01
7.416779063194204547e+00,4.373053835217195484e+01
7.422073648233502574e+00,4.374059957604499971e+01
7.434173206469590234e+00,4.374591689556921636e+01
7.417540100000000081e+00,4.372501400000000160e+01
7.427359010538406636e+00,4.374324133692341121e+01
7.427839200000001085e+00,4.374025220000000758e+01
7.418834173612560257e+00,4.372760402368248833e+01
7.416381731248183229e+00,4.373708812663696932e+01
7.412050699999999992e+00,4.372842109999999849e+01

cat results/outputs/coreset-weights-monaco-latest.csv | head -n 10

7.704359156916230233e+01
2.090893934427382987e+02
1.560611140982714744e+02
2.516557569411126281e+02
7.714605094768158722e+01
2.640808776415075840e+02
2.326085291610944523e+02
7.704841021255269595e+01
2.089705263763523249e+02
1.728105655128551632e+02

Support

Genomics Data Generation

Introduction

Kipoi (pronounce: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for genomics. It currently contains 2201 different models, covering canonical predictive tasks in transcriptional and post-transcriptional gene regulation. Kipoi's API is implemented as a , and it is also accessible from the command line.

In this tutorial example, we will run a genomics model on Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client, see more information

Running Locally

To run locally you need to install kipoi-veff2. You can find out the information about installing and usage

In our case this will be the following command:

kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"

Containerize Script using Docker

To run Genomics on Bacalhau we need to set up a Docker container. To do this, you'll need to create a Dockerfile and add your desired configuration. The Dockerfile is a text document that contains the commands that specify how the image will be built.

FROM kipoi/kipoi-veff2:py37

RUN kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ./output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"

We will use the kipoi/kipoi-veff2:py37 image and perform variant-centered effect prediction using the kipoi_veff2_predict tool.

Build the container

The docker build command builds Docker images from a Dockerfile.

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

repo-name with the name of the container, you can name it anything you want

tag this is not required but you can use the latest tag

In our case:

docker build -t jsacex/kipoi-veff2:py37 .

Push the container

Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.

docker push <hub-user>/<repo-name>:<tag>

Running a Bacalhau job

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job for generating genomics data, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --id-only \
    --memory 20Gb \
    --wait \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --publisher ipfs \
    --network full \
    jsacex/kipoi-veff2:py37 \
    -- kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit")

Structure of the command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau
jsacex/kipoi-veff2:py37: the name of the image we are using
kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit": the command that will be executed inside the container. It performs variant-centered effect prediction using the kipoi_veff2_predict tool
./examples/input/test.vcf: the path to a Variant Call Format (VCF) file containing information about genetic variants
./examples/input/test.fa: the path to a FASTA file containing DNA sequences. FASTA files contain nucleotide sequences used for variant effect prediction
../outputs/output.tsv: the path to the output file where the prediction results will be stored. The output file format is Tab-Separated Values (TSV), and it will contain information about the predicted variant effects
-m "DeepSEA/predict": specifies the model to be used for prediction
-s "diff" -s "logit": indicates using two scoring functions for comparing prediction results. In this case, the "diff" and "logit" scoring functions are used. These scoring functions can be employed to analyze differences between predictions for the reference and alternative alleles.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

name: Genomics
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: jsacex/kipoi-veff2:py37
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
    Publisher:
      Type: ipfs
    Network:
      Type: full
    ResultPaths:
      - Name: outputs
        Path: /outputs
    Resources:
      Memory: 20gb

The job description should be saved in .yaml format, e.g. genomics.yaml, and then run with the command:

bacalhau job run genomics.yaml

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

Viewing your Job Output

To view the file, run the following command:

cat results/outputs/output.tsv | head -n 10

Use Cases

Examples

Table of Contents for Bacalhau Examples

Data Engineering

Using Bacalhau with DuckDB

Introduction

Prerequisites

Running the Query

Structure of the command

Running with a YAML file

More complex queries

Need Help?

Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau

Introduction

Prerequisite​

Analysing Ethereum Data Locally​

Analysing Ethereum Data With Bacalhau​

Running a Bacalhau Job​

Declarative job description​

Checking the State of your Jobs​

Viewing your Job Output​

Display the image​

Massive Scale Ethereum Analysis​

Display the image​

Appendix: List Ethereum Data CIDs​

Support​

Convert CSV To Parquet Or Avro

Introduction​

Prerequisites​

Running CSV to Avro or Parquet Locally​​

Downloading the CSV file​

Writing the Script​

Installing Dependencies​

Converting CSV file to Parquet format​

Viewing the parquet file:​

Containerize Script with Docker​

Build the container​

Push the container​

Running a Bacalhau Job​

Structure of the command​

Declarative job description​

Checking the State of your Jobs​

Viewing your Job Output​

Support​

Simple Image Processing

Introduction

Prerequisite​

Running a Bacalhau Job​

Structure of the command​

Declarative job description​

Checking the State of your Jobs​

Display the image​

Support​

Oceanography - Data Conversion

Introduction

Prerequisites​

Running Locally​

Downloading the dataset​

Installing dependencies​

Reading and Viewing Data​

Data Conversion​

Writing the Script​

Upload the Data to IPFS​

Setting up Docker Container​

Build the container​

Push the container​

Running a Bacalhau Job​

Structure of the command​

Declarative job description​

Checking the State of your Jobs​

Viewing your Job Output​

Support​

Video Processing

Introduction

Prerequisite​

Upload the Data to IPFS​

Running a Bacalhau Job​

Structure of the command​

Declarative job description​

Checking the State of your Jobs​

Prerequisite

Analysing Ethereum Data Locally

Analysing Ethereum Data With Bacalhau

Running a Bacalhau Job

Declarative job description

Checking the State of your Jobs

Viewing your Job Output

Display the image

Massive Scale Ethereum Analysis

Display the image

Appendix: List Ethereum Data CIDs

Support

Introduction

Prerequisites

Running CSV to Avro or Parquet Locally

Downloading the CSV file

Writing the Script

Installing Dependencies

Converting CSV file to Parquet format

Viewing the parquet file:

Containerize Script with Docker

Build the container

Push the container

Running a Bacalhau Job

Structure of the command

Declarative job description

Checking the State of your Jobs

Viewing your Job Output

Support

Prerequisite

Running a Bacalhau Job

Structure of the command

Declarative job description

Checking the State of your Jobs

Display the image

Support

Prerequisites

Running Locally

Downloading the dataset

Installing dependencies

Reading and Viewing Data

Data Conversion

Writing the Script

Upload the Data to IPFS

Setting up Docker Container

Build the container

Push the container

Running a Bacalhau Job

Structure of the command

Declarative job description

Checking the State of your Jobs

Viewing your Job Output

Support

Prerequisite

Upload the Data to IPFS

Running a Bacalhau Job

Structure of the command

Declarative job description

Checking the State of your Jobs

Viewing your Job Output

Support

Introduction

TL;DR

Running Easy OCR Locally

Containerize your Script using Docker

Build the Container

Push the container

Running a Bacalhau Job to Generate Easy OCR output

Prerequisite

Structure of the imperative command

Declarative job description

Checking the State of your Jobs

Job status

Job information

Job download

Viewing your Job Output

Introduction

Running locally

Prerequisites

Installing dependencies