Model Training

Training PyTorch Model with Bacalhau

Introduction

In this example tutorial, we will show you how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.

TL;DR

bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model

Prerequisite

To get started, you need to install the Bacalhau client; see the Bacalhau installation documentation for more information.

Training the Model Locally

To train our model locally, we will start by cloning the PyTorch examples repo:

git clone https://github.com/pytorch/examples

Install the following:

pip install --upgrade torch torchvision

Next, we run the command below to begin training the mnist_rnn model. We added the --save-model flag to save the model.

python ./examples/mnist_rnn/main.py --save-model

The script downloads the MNIST dataset and saves it in the data folder.

Uploading Dataset to IPFS

Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like Pinata or NFT.Storage. Once registered you can use their UI or API or SDKs to upload files.

Once you have uploaded your data, copy its CID. Here is the dataset we have uploaded.

Running a Bacalhau Job

We can now run the training on Bacalhau using the official pytorch/pytorch container image. To submit a job, run the following Bacalhau command:

export JOB_ID=$(bacalhau docker run \
    --gpu 1 \
    --timeout 3600 \
    --wait-timeout-secs 3600 \
    --wait \
    --id-only \
    pytorch/pytorch \
    -w /outputs \
    -i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
    -i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model)

Structure of the command

export JOB_ID=$( ... ): exports the job ID as an environment variable
bacalhau docker run: call to Bacalhau
--gpu 1: specifies the hardware requirements; a GPU is needed to run such a job
pytorch/pytorch: the official PyTorch Docker image
-i ipfs://QmdeQjz1HQQd.....: mounts the uploaded dataset
-i https://raw.githubusercontent.com/py..........: mounts our training script, using the URL of this PyTorch example
-w /outputs: sets the working directory to /outputs. This is the folder where we save the model, as it is automatically uploaded to IPFS as outputs
python ../inputs/main.py --save-model: runs the training script, which is mounted from the URL into the /inputs folder in the container
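
The pieces described above can also be assembled programmatically. The following is a minimal Python sketch, not part of the original tutorial: the helper name build_training_command is hypothetical, and it only builds and prints the argument list from the flags shown above without submitting anything.

```python
import shlex

DATASET_CID = "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
SCRIPT_URL = "https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py"


def build_training_command(dataset_cid, script_url, gpus=1):
    """Return the argv list for the `bacalhau docker run` call shown above.

    Hypothetical helper for illustration; every flag is copied from the tutorial.
    """
    return [
        "bacalhau", "docker", "run",
        "--gpu", str(gpus),               # hardware requirement
        "--timeout", "3600",
        "--wait-timeout-secs", "3600",
        "--wait",
        "--id-only",
        "pytorch/pytorch",                # container image
        "-w", "/outputs",                 # working directory
        "-i", f"ipfs://{dataset_cid}:/data",  # dataset mounted at /data
        "-i", script_url,                 # training script mounted under /inputs
        "--", "python", "../inputs/main.py", "--save-model",
    ]


command = build_training_command(DATASET_CID, SCRIPT_URL)
# Print a copy-pasteable shell command instead of executing it.
print(shlex.join(command))
```

If you prefer to submit from Python, the same list could be handed to subprocess.run; that step is left out here so the sketch runs without a Bacalhau installation.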

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: PyTorch MNIST RNN Training
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "pytorch/pytorch" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python ../inputs/main.py --save-model
    InputSources:
      - Source:
          Type: "ipfs"
          Params:
            CID: "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
        Target: /data
      - Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py
        Target: /inputs  
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. torch.yaml, and then run with the command:

bacalhau job run torch.yaml

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the following contents in the results directory.

Viewing your Job Output

Now you can find the results in the results/outputs folder. To view them, run the following commands:

ls results/ # list the contents of the results directory

cat results/stdout # display the job's standard output

ls results/outputs/ # list the saved model files

Training TensorFlow Model

Introduction

TensorFlow is an open-source machine learning software library used to train neural networks. Computations are expressed as stateful dataflow graphs, in which each node represents an operation performed by the neural network on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training an MNIST model.

Training TensorFlow models Locally

This section is from the TensorFlow 2 quickstart for beginners.

This short introduction uses Keras to:

Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.

Set up TensorFlow

Import TensorFlow into your program to check whether it is installed

Build a machine-learning model

Build a tf.keras.Sequential model by stacking layers.

The tf.nn.softmax function converts these logits to probabilities for each class:
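
To make the conversion concrete, here is a plain-Python sketch of the softmax computation. It illustrates the math only and is not TensorFlow's implementation; the softmax helper and the example logits are made up for this sketch.

```python
import math


def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Larger logits receive larger probabilities, and ten equal logits would each map to 1/10.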

Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.

This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.
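
The arithmetic behind that figure can be checked without TensorFlow. The sketch below is an illustration under simplifying assumptions: the hypothetical helper negative_log_likelihood takes probabilities rather than logits, and it handles a single example.

```python
import math


def negative_log_likelihood(probabilities, true_index):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_index])


num_classes = 10
uniform = [1.0 / num_classes] * num_classes  # untrained model: every class equally likely
confident = [0.0] * num_classes
confident[3] = 1.0                           # model that is sure the answer is class 3

initial_loss = negative_log_likelihood(uniform, true_index=3)
perfect_loss = negative_log_likelihood(confident, true_index=3)

print(round(initial_loss, 4))  # 2.3026
print(perfect_loss == 0)       # True
```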

Train and evaluate your model

Use the Model.fit method to adjust your model parameters and minimize the loss:

If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:

The following method can be used to save the model as a checkpoint

Running on Bacalhau

The dataset and the script are mounted to the TensorFlow container using a URL; we then run the script inside the container.

Declarative job description

The job description should be saved in .yaml format, e.g. tensorflow.yaml, and then run with the command:

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

Job download

After the download has finished you should see the following contents in results directory

Viewing your Job Output

Now you can find the file in the results/outputs folder. To view it, run the following command:

Stable Diffusion Dreambooth (Finetuning)

Introduction

Stable Diffusion has revolutionized text-to-image models by producing high-quality images based on a prompt. Dreambooth is an approach for personalizing text-to-image diffusion models. With images of a subject as input, we can fine-tune a pretrained text-to-image model.

Although the Dreambooth paper used Imagen to fine-tune the pre-trained model, both the Imagen model and the Dreambooth code are closed source, so several open-source projects have emerged using Stable Diffusion.

Dreambooth makes Stable Diffusion even more powerful, with the ability to generate realistic-looking pictures of humans, animals, or any other object by training on just 20-30 images.

In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.

TL;DR

The following command generates the following:

Subject: SBF
Prompt: a photo of SBF without hair

Output:

Building this container requires a supported GPU with 16 GB or more of memory, since the build can be resource intensive.

We will create a Dockerfile and add the desired configuration to the file. The following commands specify how the image will be built and what extra requirements will be included:

This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and the working directory is set. Next, we add our custom code and pull the dependent repositories.

The shell script makes things much simpler, since the command to train the model takes many parameters and the model weights are later converted to a checkpoint. You can edit this script and add your own parameters.

To download the models and run a test job in the Dockerfile, copy the following:

Then execute finetune.sh with the following commands:

We will run the docker build command to build the container:

Before running the command replace:

repo-name with the name of the container; you can name it anything you want.
tag is not required, but you can use the latest tag.

Now you can push this repository to the registry designated by its name or tag.

The optimal dataset size is between 20 and 30 images. You can choose images of the subject in different positions: full-body images, half-body images, pictures of the face, etc.

Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:

After the Subject dataset is created we upload it to IPFS.

To upload your dataset using NFTUp, just drag and drop your directory and it will be uploaded to IPFS:

After the dataset has been uploaded, copy its CID, which will look like this:

Since there are a lot of combinations that you can try, processing of the fine-tuned model can take an hour or more to complete. Here are a few approaches that you can try based on your requirements:

bacalhau docker run: call to Bacalhau
--gpu 1: specifies the hardware requirements; a GPU is needed to run such a job
-i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a: mounts the data from IPFS via its CID
jsacex/dreambooth:latest: name and tag of the Docker image we are using
-- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man": executes the script with the following parameters:
1. /inputs: path to the subject images
2. /outputs: path to save the generated outputs
3. "a photo of < name of the subject > < class >" -> "a photo of David Aronchick man": subject name along with class
4. "a photo of < class >" -> "a photo of man": name of the class
5. 3000: number of training iterations
6. "/man": path to the class images

The number of iterations is 3000. This number should be the number of subject images x 100, so for 30 images it would be 3000. It takes around 32 minutes on a V100 for 3000 iterations, but you can increase or decrease the number based on your requirements.

Here is our command with our parameters replaced:

If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.

Use the /woman class images

Here you can provide your own regularization images or use the mix class.

Use the /mix class images if the class of the subject is mix

You can upload the model to IPFS, create a gist for the script, and then mount both the model and the script to the lightweight container.

When a job is submitted, Bacalhau prints out the related job_id. Use the export JOB_ID=$(bacalhau docker run ...) wrapper to store that in an environment variable so that we can reuse it later on.

You can check the status of the job using bacalhau job list.

When it says Completed, that means the job is done, and we can get the results.

You can find out more information about your job by using bacalhau job describe.

After the download has finished you should see the following contents in results directory

Now you can find the file in the results/outputs folder. You can view the results by running the following commands:

In the next steps, we will be doing inference on the fine-tuned model.

Bacalhau currently doesn't support mounting subpaths of a CID, so instead of mounting just the model.ckpt file we would need to mount the whole output CID, which is 6.4 GB and might result in errors like FAILED TO COPY /inputs. Instead, manually copy the CID of the model.ckpt file, which is about 2 GB.

To get the CID of the model.ckpt file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/. For example:

Or you can use the IPFS CLI:

Copy the link of model.ckpt highlighted in the box:

Then extract the CID portion of the link and copy it.

To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run command.

If you are facing difficulties using the above method, you can mount the whole output CID.

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

We got an image like this as a result:

Stable Diffusion Dreambooth (Finetuning)

Introduction

Although the Dreambooth paper used Imagen to fine-tune the pre-trained model, both the Imagen model and the Dreambooth code are closed source, so several open-source projects have emerged using Stable Diffusion.

Dreambooth makes Stable Diffusion even more powerful, with the ability to generate realistic-looking pictures of humans, animals, or any other object by training on just 20-30 images.

In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.

TL;DR

The following command generates the following:

Subject: SBF
Prompt: a photo of SBF without hair

Inference

bacalhau docker run \
  --gpu 1 \
  -i ipfs://QmUEJPr5pfV6tRzWQuNSSb3wdcN6tRQS5tdk3dYSCJ55Xs:/SBF.ckpt \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of sbf without hair" --plms --ckpt ../SBF.ckpt --skip_grid --n_samples 1 --outdir ../outputs

Output:

Prerequisites

To get started, you need to install the Bacalhau client; see the Bacalhau installation documentation for more information.

Setting up Docker Container

You can skip this section entirely and go directly to running a job on Bacalhau.

Building this container requires a supported GPU with 16 GB or more of memory, since the build can be resource intensive.

We will create a Dockerfile and add the desired configuration to the file. The following commands specify how the image will be built and what extra requirements will be included:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel

WORKDIR /

# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers

RUN apt update && apt install wget git unzip -y

RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py

RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py

RUN pip install -qq git+https://github.com/TheLastBen/diffusers

RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort

# Install xformers

RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip

RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip

RUN unzip -j woman.zip -d woman

RUN unzip -j man.zip -d man

RUN unzip -j mix.zip -d mix

This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and the working directory is set. Next, we add our custom code and pull the dependent repositories.

# finetune.sh
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path="${7:-/model}" \
  --tokenizer_name="${7:-/model}/tokenizer/" \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo  Convert weights to ckpt
python convert_diffusers_to_original_stable_diffusion.py --model_path $2  --checkpoint_path $2/model.ckpt --half
echo model saved at $2/model.ckpt
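
The two expr expressions in finetune.sh derive a checkpoint interval and a text-encoder cutoff from the step count passed as the fifth argument. The Python sketch below reproduces that arithmetic so the values can be previewed before a job is submitted. The function names are hypothetical, and the 100-steps-per-image rule is this tutorial's guideline rather than something the script enforces.

```python
def steps_for(num_images, steps_per_image=100):
    """Tutorial guideline: number of subject images x 100."""
    return num_images * steps_per_image


def derived_settings(max_train_steps):
    """Mirror the shell arithmetic in finetune.sh, where $5 is the step count."""
    return {
        "max_train_steps": max_train_steps,
        # expr $5 / 6 performs integer division
        "save_n_steps": max_train_steps // 6,
        # expr $5 + 100
        "stop_text_encoder_training": max_train_steps + 100,
    }


settings = derived_settings(steps_for(30))
print(settings)
# {'max_train_steps': 3000, 'save_n_steps': 500, 'stop_text_encoder_training': 3100}
```

With 30 images this gives 3000 steps, a save interval of 500 steps, and a text-encoder cutoff of 3100, which lies beyond the final step.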

Downloading the models

To download the models and run a test job in the Dockerfile, copy the following:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
WORKDIR /
# Install requirements
# RUN git clone https://github.com/TheLastBen/diffusers
RUN apt update && apt install wget git unzip -y
RUN wget -q https://gist.githubusercontent.com/js-ts/28684a7e6217214ec944a9224584e9af/raw/d7492bc8f36700b75d51e3346259d1a466b99a40/train_dreambooth.py
RUN wget -q https://github.com/TheLastBen/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
# RUN cp /content/convert_diffusers_to_original_stable_diffusion.py /content/diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py
RUN pip install -qq git+https://github.com/TheLastBen/diffusers
RUN pip install -q accelerate==0.12.0 transformers ftfy bitsandbytes gradio natsort
# Install xformers
RUN pip install -q https://github.com/metrolobo/xformers_wheels/releases/download/1d31a3ac_various_6/xformers-0.0.14.dev0-cp37-cp37m-linux_x86_64.whl
# You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work.
# https://huggingface.co/settings/tokens
RUN mkdir -p ~/.huggingface
RUN echo -n "<your-hugging-face-token>" > ~/.huggingface/token
# copy the test dataset from a local file
# COPY jfk /jfk

# Download and extract the test dataset
RUN wget https://github.com/js-ts/test-images/raw/main/jfk.zip
RUN unzip -j jfk.zip -d jfk
RUN mkdir model
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Women' -O woman.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Men' -O man.zip
RUN wget 'https://github.com/TheLastBen/fast-stable-diffusion/raw/main/Dreambooth/Regularization/Mix' -O mix.zip
RUN unzip -j woman.zip -d woman
RUN unzip -j man.zip -d man
RUN unzip -j mix.zip -d mix

RUN  accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_starting_step=5\
  --stop_text_encoder_training=31 \
  --class_data_dir=/man \
  --save_n_steps=5 \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="/jfk" \
  --output_dir="/model" \
  --instance_prompt="a photo of jfk man" \
  --class_prompt="a photo of man" \
  --instance_prompt="" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=30

COPY finetune.sh /finetune.sh

RUN wget -q https://gist.githubusercontent.com/js-ts/624fecc3fff807d4948688cb28993a94/raw/fd69ac084debe26a815485c1f363b8a45566f1ba/clear_mem.py
# Removing your token
RUN rm -rf  ~/.huggingface/token

Then execute finetune.sh with the following commands:

# finetune.sh
# Positional arguments:
#   $1 subject images  $2 output directory  $3 instance prompt  $4 class prompt
#   $5 training steps  $6 class images      $7 model path (defaults to /model)
python clear_mem.py

accelerate launch train_dreambooth.py \
  --image_captions_filename \
  --train_text_encoder \
  --save_n_steps=$(expr $5 / 6) \
  --stop_text_encoder_training=$(expr $5 + 100) \
  --class_data_dir="$6" \
  --pretrained_model_name_or_path="${7:-/model}" \
  --tokenizer_name="${7:-/model}/tokenizer/" \
  --instance_data_dir="$1" \
  --output_dir="$2" \
  --instance_prompt="$3" \
  --class_prompt="$4" \
  --seed=96576 \
  --resolution=512 \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --center_crop \
  --lr_warmup_steps=0 \
  --max_train_steps=$5

echo "Convert weights to ckpt"
python convert_diffusers_to_original_stable_diffusion.py --model_path "$2" --checkpoint_path "$2/model.ckpt" --half
echo "model saved at $2/model.ckpt"

Build the Docker container

We will run the docker build command to build the container:

docker build -t <hub-user>/<repo-name>:<tag> .

Before running the command replace:

hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow the Docker Hub sign-up instructions to create one, and use the username of the account you create.
repo-name with the name of the container; you can name it anything you want.
tag is not required, but you can use the latest tag.

Now you can push this repository to the registry designated by its name or tag.

docker push <hub-user>/<repo-name>:<tag>

Create the Subject Dataset

The optimal dataset size is between 20 and 30 images. You can choose images of the subject in different positions: full-body images, half-body images, pictures of the face, etc.

Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:

Subject Name.jpg, Subject Name (2).jpg ... Subject Name (n).jpg
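
If you have many images, generating the target names with a short script can help avoid typos. This is a minimal sketch of the naming pattern above; the helper subject_filenames is hypothetical, and it only produces the names, leaving the actual renaming to you.

```python
def subject_filenames(subject_name, count, extension=".jpg"):
    """Names following 'Subject Name.jpg, Subject Name (2).jpg ... Subject Name (n).jpg'."""
    names = []
    for index in range(1, count + 1):
        # The first file carries no number; later files get " (2)", " (3)", ...
        suffix = "" if index == 1 else f" ({index})"
        names.append(f"{subject_name}{suffix}{extension}")
    return names


print(subject_filenames("David Aronchick", 3))
# ['David Aronchick.jpg', 'David Aronchick (2).jpg', 'David Aronchick (3).jpg']
```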

You can view the sample subject image dataset for reference.

After the Subject dataset is created we upload it to IPFS.

Uploading the Subject Images to IPFS

In this case, we will be using NFT.Storage (Recommended Option) to upload files and directories with NFTUp.

To upload your dataset using NFTUp, just drag and drop your directory and it will be uploaded to IPFS:

After the dataset has been uploaded, copy its CID, which will look like this:

bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a

Approaches to run a Bacalhau Job on a Finetuned Model

Since there are a lot of combinations that you can try, processing of the fine-tuned model can take an hour or more to complete. Here are a few approaches that you can try based on your requirements:

Case 1: If the subject is of class male

Structure of the command

bacalhau docker run: call to Bacalhau
--gpu 1: specifies the hardware requirements; a GPU is needed to run such a job
-i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a: mounts the data from IPFS via its CID
jsacex/dreambooth:latest: name and tag of the Docker image we are using
-- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man": executes the script with the following parameters:
1. /inputs: path to the subject images
2. /outputs: path to save the generated outputs
3. "a photo of < name of the subject > < class >" -> "a photo of David Aronchick man": subject name along with class
4. "a photo of < class >" -> "a photo of man": name of the class
5. 3000: number of training iterations
6. "/man": path to the class images

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://<CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> man" "a photo of man" 3000 "/man" "/model"

Here is our command with our parameters replaced:

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a \
  --wait \
  --id-only \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man" "/model"

If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.

Case 2: If the subject is of class female

Use the /woman class images

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://<CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> woman" "a photo of woman" 3000 "/woman" "/model"

Case 3: If the subject is of class mix

Here you can provide your own regularization images or use the mix class.

Use the /mix class images if the class of the subject is mix

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://<CID-OF-THE-SUBJECT> \
  jsacex/dreambooth:full \
  -- bash finetune.sh /inputs /outputs "a photo of <name-of-the-subject> mix" "a photo of mix" 3000 "/mix" "/model"
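
The three cases above differ only in the class word and the class image directory. As a rough illustration, the Python sketch below builds the finetune.sh arguments for any of them. The helper finetune_arguments and the CLASS_DIRS table are hypothetical names introduced here, while the values mirror the commands above.

```python
import shlex

# Class word -> directory of class images inside the container (from the cases above).
CLASS_DIRS = {"man": "/man", "woman": "/woman", "mix": "/mix"}


def finetune_arguments(subject, subject_class, num_images, model_path="/model"):
    """Return the command that follows `--` in the bacalhau docker run calls above."""
    if subject_class not in CLASS_DIRS:
        raise ValueError(f"unknown class: {subject_class!r}")
    return [
        "bash", "finetune.sh",
        "/inputs",                                # subject images
        "/outputs",                               # where results are written
        f"a photo of {subject} {subject_class}",  # instance prompt
        f"a photo of {subject_class}",            # class prompt
        str(num_images * 100),                    # iterations: images x 100
        CLASS_DIRS[subject_class],                # class images
        model_path,                               # pretrained model location
    ]


args = finetune_arguments("David Aronchick", "man", num_images=30)
print(shlex.join(args))
```

Calling the helper with "woman" or "mix" yields the arguments for Case 2 and Case 3.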

Case 4: If you want a different tokenizer, model, and a different shell script with custom parameters

You can upload the model to IPFS, create a gist for the script, and then mount both the model and the script to the lightweight container.

bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a:/aronchick \
  -i ipfs://<CID-OF-THE-MODEL>:/model \
  -i https://gist.githubusercontent.com/js-ts/54b270a36aa3c35fdc270640680b3bd4/raw/7d8e8fa47bc3811ef63772f7fc7f4360aa9d51a8/finetune.sh \
  --wait \
  --id-only \
  jsacex/dreambooth:lite \
  -- bash /inputs/finetune.sh /aronchick /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this. Change the command in the Parameters section and the CID to suit your goals.

name: Stable Diffusion Dreambooth Finetuning
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "jsacex/dreambooth:full" 
        Parameters:
          - bash finetune.sh /inputs /outputs "a photo of aronchick man" "a photo of man" 3000 "/man" "/model"
    InputSources:
      - Target: "/inputs/data"
        Source:
          Type: "ipfs"
          Params:
            CID: "QmRKnvqvpFzLjEoeeNNGHtc7H8fCn9TvNWHFnbBHkK8Mhy"
    Resources:
      GPU: "1"

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the job output in the results directory.

Viewing your Job Output

Now you can find the files in the results/outputs folder. You can view the results by running the following command:

ls results # list the contents of the results directory

In the next steps, we will run inference on the fine-tuned model.

Inference on the Fine-Tuned Model

Refer to our guide for more details on how to build an SD inference container.

To get the CID of the model.ckpt file, go to https://gateway.ipfs.io/ipfs/<YOUR-OUTPUT-CID>/outputs/. For example:

https://gateway.ipfs.io/ipfs/QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV/outputs/

If you use the browser, you can use the following:

ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Or you can use the IPFS CLI:

ipfs ls QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg/outputs

Copy the link of model.ckpt highlighted in the box:

https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt

Then extract the CID portion of the link and copy it.
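As a minimal sketch of that extraction step, the CID is the path segment that follows `/ipfs/` in the gateway link. The helper name `extract_cid` is hypothetical, and the sketch assumes path-style gateway URLs like the ones shown above.

```python
from urllib.parse import urlparse


def extract_cid(gateway_url):
    """Return the CID from a gateway URL of the form .../ipfs/<CID>[/path][?query]."""
    parts = [p for p in urlparse(gateway_url).path.split("/") if p]
    # The CID is the path segment that immediately follows "ipfs".
    if "ipfs" not in parts or parts.index("ipfs") + 1 >= len(parts):
        raise ValueError(f"no CID found in {gateway_url!r}")
    return parts[parts.index("ipfs") + 1]


url = "https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt"
print(extract_cid(url))  # QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg
```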

Run the Bacalhau Job on the Fine-Tuned Model

To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run command.

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/model.ckpt --skip_grid --n_samples 1 --outdir ../outputs)

If you are facing difficulties using the above method, you can mount the whole output CID instead:

export JOB_ID=$(bacalhau docker run \
  --gpu 1 \
  --timeout 3600 \
  --wait-timeout-secs 3600 \
  --wait \
  --id-only \
  -i ipfs://QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV \
  jsacex/stable-diffusion-ckpt \
  -- conda run --no-capture-output -n ldm python scripts/txt2img.py --prompt "a photo of aronchick drinking coffee" --plms --ckpt ../inputs/outputs/model.ckpt --skip_grid --n_samples 1 --outdir ../outputs)

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.

To check the status of your job and download the results, refer back to the Checking the State of your Jobs section above.

We got an image like this as a result:

Training TensorFlow Model

Introduction

Training TensorFlow models Locally

This section is adapted from the TensorFlow 2 quickstart for beginners.

TensorFlow 2 quickstart for beginners

This short introduction uses Keras to:

Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.

Set up TensorFlow

Import TensorFlow into your program to check whether it is installed

mkdir -p /inputs
wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz -O /inputs/mnist.npz

# Importing TensorFlow fails if it is not installed
import os
import tensorflow as tf

mnist = tf.keras.datasets.mnist

CWD = '' if os.getcwd() == '/' else os.getcwd()
(x_train, y_train), (x_test, y_test) = mnist.load_data('/inputs/mnist.npz')
# Scale pixel values from the 0-255 range to 0-1
x_train, x_test = x_train / 255.0, x_test / 255.0

Build a machine-learning model

Build a tf.keras.Sequential model by stacking layers.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
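To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below uses plain Python rather than TensorFlow; the helper name `dense_params` is made up for illustration, and the layer sizes are taken from the model definition above.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


flattened = 28 * 28                     # Flatten turns each 28x28 image into 784 values
hidden = dense_params(flattened, 128)   # 784 * 128 + 128
output = dense_params(128, 10)          # 128 * 10 + 10
total = hidden + output                 # Flatten and Dropout add no trainable parameters
print(hidden, output, total)            # 100480 1290 101770
```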

For each example, the model returns a vector of logits or log-odds scores, one for each class.

predictions = model(x_train[:1]).numpy()
predictions

The tf.nn.softmax function converts these logits to probabilities for each class:

tf.nn.softmax(predictions).numpy()

Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.

This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.

loss_fn(y_train[:1], predictions).numpy()
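The relationship between logits, softmax probabilities, and this loss can be checked without TensorFlow. The following is a simplified pure-Python sketch for a single example, not the Keras implementation; the function names are chosen for illustration only.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, true_index):
    """Negative log probability of the true class."""
    return -math.log(softmax(logits)[true_index])


uniform_logits = [0.0] * 10  # an untrained model that treats all classes alike
print(round(sparse_categorical_crossentropy(uniform_logits, 3), 4))  # 2.3026

confident_logits = [0.0] * 10
confident_logits[3] = 20.0   # the model is almost certain of class 3
print(sparse_categorical_crossentropy(confident_logits, 3))  # very close to zero
```

With ten equally likely classes the loss is -log(1/10), and it approaches zero as the probability assigned to the true class approaches one.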

Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

Train and evaluate your model

Use the Model.fit method to adjust your model parameters and minimize the loss:

model.fit(x_train, y_train, epochs=5)

The Model.evaluate method checks the model's performance, usually on a validation set or a test set.

model.evaluate(x_test,  y_test, verbose=2)

The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.

If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:

probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

probability_model(x_test[:5])

mkdir /outputs

The following method can be used to save the model weights as a checkpoint:

model.save_weights('/outputs/checkpoints/my_checkpoint')

ls /outputs/

Running on Bacalhau

The dataset and the script are mounted into the TensorFlow container from URLs. We then run the script inside the container.

Declarative job description

The same job can be presented in the declarative format. In this case, the description will look like this:

name: Training ML model using tensorflow
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        WorkingDirectory: "/inputs"
        Image: "tensorflow/tensorflow" 
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python train.py
    InputSources:
      - Source:
          Type: urlDownload
          Params:
            URL: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
        Target: /inputs
      - Source:
          Type: urlDownload
          Params:
            URL: https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py
        Target: /inputs
    Resources:
      GPU: "1"

The job description should be saved in .yaml format, e.g. tensorflow.yaml, and then run with the command:

bacalhau job run tensorflow.yaml

Checking the State of your Jobs

Job status

You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID}

When it says Completed, that means the job is done, and we can get the results.

Job information

You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download

rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results

After the download has finished, you should see the job output in the results directory.

Viewing your Job Output

Now you can find the output files in the results/outputs folder. To list them, run the following command:

ls results/outputs/

Support

If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
