In this tutorial, we show how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a deep learning framework developed by Facebook AI Research. It offers beginner-friendly debugging tools as well as a high level of customization for advanced users, and it is used by researchers and practitioners at companies such as Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.
Running any type of PyTorch model with Bacalhau
To get started, install the Bacalhau client; see the Bacalhau installation instructions for more information.
To train our model locally, we start by cloning the PyTorch examples repo.
Install the following
Next, we run the command below to begin training the mnist_rnn model. We added the --save-model flag to save the model.
Next, the downloaded MNIST dataset is saved in the data folder.
Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to do this is to use a third-party service to "pin" the data to the IPFS network, which ensures that the data exists and stays available. You need an account with a pinning service such as web3.storage, Pinata, or NFT.Storage. Once registered, you can use the service's UI, API, or SDKs to upload files.
Once you have uploaded your data, copy its CID. Here is the dataset we have uploaded: https://gateway.pinata.cloud/ipfs/QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw/?filename=data
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
bacalhau docker run: call to Bacalhau
--gpu 1: request 1 GPU to train the model
pytorch/pytorch: use the official PyTorch Docker image
-i ipfs://QmdeQjz1HQQd.....: mount the uploaded dataset into the container
-i https://raw.githubusercontent.com/py..........: mount our training script; we use the URL of this PyTorch example
-w /outputs: set the working directory to /outputs. This is the folder where we save the model, because it is automatically uploaded to IPFS as the job's outputs
python ../inputs/main.py --save-model: run the training script, which is mounted from the URL into the /inputs folder in the container
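As a minimal sketch, the pieces listed above can be assembled into a single argument list in Python. The helper name `build_pytorch_job` is hypothetical. The `--` separator before the container command is assumed from the `-- bash finetune.sh ...` form used elsewhere in these docs, and the script URL is left as a placeholder because the tutorial truncates it.

```python
import shlex
from typing import List


def build_pytorch_job(dataset_cid: str, script_url: str) -> List[str]:
    """Assemble the argv for the training job from the parts listed above."""
    return [
        "bacalhau", "docker", "run",
        "--gpu", "1",                    # request one GPU
        "-i", f"ipfs://{dataset_cid}",   # dataset pinned on IPFS
        "-i", script_url,                # training script fetched by URL
        "-w", "/outputs",                # working directory inside the container
        "pytorch/pytorch",               # official PyTorch image
        "--",                            # assumed separator before the command
        "python", "../inputs/main.py", "--save-model",
    ]


# The CID comes from the gateway link shown earlier in this tutorial.
# "<RAW_URL_OF_main.py>" is a placeholder, not a real URL.
argv = build_pytorch_job(
    "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw",
    "<RAW_URL_OF_main.py>",
)
print(shlex.join(argv))
```

Building the command as a list rather than a string keeps quoting explicit, so the same list could be handed to `subprocess.run` on a machine where the Bacalhau client is installed.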
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list.
When it says Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe.
Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
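The status, describe, and download steps all reuse the same job ID, so a small Python sketch can keep them together. The function name `followup_commands` and the example job ID are hypothetical. The `--id-filter` and `--output-dir` flags are assumptions about the installed Bacalhau CLI version, so check `bacalhau list --help` and `bacalhau get --help` before relying on them.

```python
import os
from typing import Dict, List


def followup_commands(job_id: str, results_dir: str = "results") -> Dict[str, List[str]]:
    """Return the argv for each follow-up step on a submitted job."""
    return {
        # --id-filter is assumed; plain `bacalhau list` shows all recent jobs
        "status": ["bacalhau", "list", "--id-filter", job_id],
        "describe": ["bacalhau", "describe", job_id],
        # --output-dir is assumed; it points the download at results_dir
        "download": ["bacalhau", "get", job_id, "--output-dir", results_dir],
    }


# Store the job ID in an environment variable so later steps can reuse it.
os.environ["JOB_ID"] = "example-job-id"  # hypothetical ID, not a real job
commands = followup_commands(os.environ["JOB_ID"])
for step, argv in commands.items():
    print(step, "->", " ".join(argv))
```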
After the download has finished you should see the following contents in the results directory
To view the file, run the following command:
Stable Diffusion has revolutionized text-to-image models by producing high-quality images from a prompt. Dreambooth is an approach for personalizing text-to-image diffusion models. Given images of a subject as input, we can fine-tune a pretrained text-to-image model.
The Dreambooth paper used Imagen to fine-tune the pretrained model, but both the Imagen model and the Dreambooth code are closed source, so several open-source projects have emerged that use Stable Diffusion instead.
Dreambooth makes Stable Diffusion even more powerful, with the ability to generate realistic-looking pictures of humans, animals, or any other object after training on just 20-30 images.
In this tutorial, we fine-tune a pretrained Stable Diffusion model using images of a person and generate images of him drinking coffee.
The following command generates the following:
Subject: SBF
Prompt: a photo of SBF without hair
Output:
:::info You can skip this section entirely and directly go to running a job on Bacalhau :::
Building this container requires a supported GPU with at least 16 GB of memory, since the build can be resource intensive.
We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built and what extra requirements will be included.
This container uses the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel image and sets the working directory. Next, we add our custom code and pull the dependent repositories.
The shell script makes things much simpler: the command to train the model takes many parameters, and the model weights must later be converted to a checkpoint. You can edit this script and add your own parameters.
To download the models and run a test job in the Dockerfile, copy the following:
finetune.sh
We will run the docker build command to build the container.
Before running the command, replace:
repo-name with the name of the container; you can name it anything you want
tag: this is not required, but you can use the latest tag
Now you can push this repository to the registry designated by its name or tag.
The optimal dataset size is 20 to 30 images. Choose images of the subject in different positions: full body, half body, pictures of the face, and so on.
Only the subject should appear in each image, so crop the image to fit just the subject. Make sure that the images are 512x512 and follow a consistent naming pattern. Since the subject name is David Aronchick, we name the images in the following pattern:
After the subject dataset is created, we upload it to IPFS.
To upload your dataset using NFTup, just drag and drop your directory and it will be uploaded to IPFS.
After the dataset has been uploaded, copy its CID: bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a
Since there are many combinations that you can try, fine-tuning the model can take an hour or more to complete. Here are a few approaches that you can try based on your requirements.
Structure of the command
Number of GPUs: --gpu 1
CID of the subject images: -i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a
Name of our image: jsacex/dreambooth:latest
-- bash finetune.sh /inputs /outputs "a photo of aronchick man" "a photo of man" 3000 "/man"
Path to the subject images: /inputs
Path to save the generated outputs: /outputs
Subject name along with class: "a photo of < name of the subject > < class >" -> "a photo of aronchick man"
Name of the class: "a photo of < class >" -> "a photo of man"
Number of iterations: 3000. As a rule of thumb, use the number of subject images x 100, so 30 images give 3000 iterations. This takes around 32 minutes on a V100, but you can increase or decrease the number based on your requirements
Path to our class images: /man
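The parameter list is easier to get right when it is derived from a few inputs. The following sketch is illustrative only: the helper `finetune_args` is hypothetical and is not part of the container. It builds the arguments passed to finetune.sh, applying the images x 100 rule for the iteration count.

```python
from typing import List


def finetune_args(subject: str, class_name: str, num_images: int,
                  class_images_dir: str) -> List[str]:
    """Build the finetune.sh argument list from the subject and class."""
    iterations = num_images * 100  # rule of thumb from the tutorial
    return [
        "bash", "finetune.sh",
        "/inputs",                             # subject images
        "/outputs",                            # generated outputs
        f"a photo of {subject} {class_name}",  # subject prompt
        f"a photo of {class_name}",            # class prompt
        str(iterations),
        class_images_dir,                      # class (regularization) images
    ]


args = finetune_args("aronchick", "man", 30, "/man")
print(args)
```

With 30 subject images this reproduces the argument list shown in the command structure, including the 3000 iterations.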
Here is our command with our parameters replaced
If your subject fits the above class but has a different name, you only need to replace the input CID and the subject name, which in this case is SBF.
Use the /woman class images
Provide your own regularization images or use the mix class
Use the /mix class images if the class of the subject is mix
You can upload the model to IPFS, then create a gist and mount the model and script to the lightweight container.
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe.
In the next steps, we will be doing inference on the finetuned model
:::info
Refer to https://docs.bacalhau.org/examples/model-inference/Stable-Diffusion-CKPT-Inference for details on how to build an SD inference container :::
Bacalhau currently doesn't support mounting subpaths of a CID, so instead of mounting only the model.ckpt file we would need to mount the whole output CID, which is 6.4 GB and might result in errors like FAILED TO COPY /inputs. To avoid this, manually copy the CID of the model.ckpt file, which is 2 GB.
To get the CID of the model.ckpt file, go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/, for example:
https://gateway.ipfs.io/ipfs/QmcmD7M7pYLP8QgwjqpbP4dojRLiLuEBdhevuCD9kFmbdV/outputs/
If you use the Brave browser
Using IPFS CLI
Copy the link of model.ckpt, for example: https://gateway.ipfs.io/ipfs/QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt
Extract the CID portion of the link and copy it
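Extracting the CID by hand is error-prone, so here is a small sketch that pulls it out of a gateway link with the Python standard library. The function name `cid_from_gateway_url` is hypothetical, and the sketch assumes path-style gateway links of the form `/ipfs/<CID>/...`, as in the links shown in this section.

```python
from urllib.parse import urlparse


def cid_from_gateway_url(url: str) -> str:
    """Return the CID from a path-style IPFS gateway URL (/ipfs/<CID>/...)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "ipfs":
        raise ValueError(f"not a path-style IPFS gateway URL: {url}")
    return parts[1]


link = ("https://gateway.ipfs.io/ipfs/"
        "QmdpsqZn9BZx9XxzCsyPcJyS7yfYacmQXZxHzcuYwzmtGg?filename=model.ckpt")
model_cid = cid_from_gateway_url(link)
print(model_cid)
```

The query string (`?filename=model.ckpt`) is ignored because only the URL path is inspected.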
To run a Bacalhau job on the fine-tuned model, we will use the bacalhau docker run command.
If you have difficulties using the above method, you can mount the whole output CID instead.
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe.
Job download: You can download your job results directly by using bacalhau get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished, you should see the following contents in the results directory.
To view the file, run the following command:
ls results/outputs
To get started, install the Bacalhau client; see the Bacalhau installation instructions for more information.
hub-user with your Docker Hub username. If you don't have a Docker Hub account, create one and use the username of the account you created
You can view the for reference
In this case, we will be using (Recommended Option) to upload files and directories with
TensorFlow is an open-source machine learning library used to train neural networks. Computations are expressed as stateful dataflow graphs, in which each node represents an operation performed by the neural network on multi-dimensional arrays. These multi-dimensional arrays are commonly known as "tensors", hence the name TensorFlow. In this example, we will train an MNIST model.
Running any type of TensorFlow model with Bacalhau
This section is from the TensorFlow 2 quickstart for beginners.
This short introduction uses Keras to:
Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.
Import TensorFlow into your program to check whether it is installed
Build a tf.keras.Sequential model by stacking layers.
For each example, the model returns a vector of logits or log-odds scores, one for each class.
The tf.nn.softmax function converts these logits to probabilities for each class:
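To make the conversion concrete, here is a plain-Python sketch of the softmax formula. It is not the `tf.nn.softmax` implementation; it only illustrates the same computation on a hand-picked list of logits, which is an assumption for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for three classes
print(probs)
print(sum(probs))
```

Larger logits map to larger probabilities, and equal logits map to a uniform distribution, which is why an untrained 10-class model tends to output values near 1/10.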
Note: It is possible to bake the tf.nn.softmax function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.
Define a loss function for training using losses.SparseCategoricalCrossentropy, which takes a vector of logits and a True index and returns a scalar loss for each example.
This loss is equal to the negative log probability of the true class: the loss is zero if the model is sure of the correct class.
This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.
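The 2.3 figure can be checked with a few lines of plain Python. This sketch computes the negative log probability of the true class directly from logits using the log-sum-exp form. It illustrates the formula only and is not the Keras `SparseCategoricalCrossentropy` implementation; the logit values are assumptions chosen for the example.

```python
import math
from typing import List


def sparse_categorical_crossentropy(logits: List[float], true_index: int) -> float:
    """Negative log softmax probability of the true class, from raw logits."""
    m = max(logits)
    # log(sum(exp(logits))), computed stably by factoring out the max
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[true_index]


# Ten equal logits model an untrained classifier: every class gets 1/10.
uniform_loss = sparse_categorical_crossentropy([0.0] * 10, true_index=3)
# One dominant logit models a classifier that is sure of the correct class.
confident_loss = sparse_categorical_crossentropy([20.0] + [0.0] * 9, true_index=0)
print(uniform_loss, math.log(10))
print(confident_loss)
```

Working from logits rather than from softmax outputs is the numerically stable route, which is the reason the note above discourages baking softmax into the last layer.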
Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.
Use the Model.fit method to adjust your model parameters and minimize the loss:
The Model.evaluate method checks the model's performance, usually on a "Validation-set" or "Test-set".
The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.
If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:
The following method can be used to save the model as a checkpoint:
You can use a tool like nbconvert to convert your Python notebook into a script.
After that, you can create a gist of the training script at gist.github.com and copy the raw link of the gist.
Testing whether the script works
The dataset and the script are mounted to the TensorFlow container using URLs; we then run the script inside the container.
Structure of the command:
-i https://gist.githubusercontent.com/js-ts/e7d32c7d19ffde7811c683d4fcb1a219/raw/ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py: mount the training script
-i https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz: mount the dataset
tensorflow/tensorflow: specify the Docker image
python train.py: execute the script
By default, whatever URL you mount using the -i flag gets mounted at the path /inputs, so we choose that as our working directory with -w /inputs.
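The following sketch shows where each mounted URL would be expected to land inside the container. It assumes that a file keeps the last path segment of its URL as its name under /inputs, which is consistent with running `python train.py` from the /inputs working directory but is not confirmed here. The helper `mounted_path` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def mounted_path(url: str, mount_dir: str = "/inputs") -> str:
    """Guess the in-container path of a URL input (assumed: URL base name is kept)."""
    filename = posixpath.basename(urlparse(url).path)
    return posixpath.join(mount_dir, filename)


script_url = ("https://gist.githubusercontent.com/js-ts/"
              "e7d32c7d19ffde7811c683d4fcb1a219/raw/"
              "ff44ac5b157d231f464f4d43ce0e05bccb4c1d7b/train.py")
dataset_url = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz"

print(mounted_path(script_url))
print(mounted_path(dataset_url))
```

Under that assumption, both files sit side by side in /inputs, so the script can refer to the dataset by a relative file name.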
When it says Completed, that means the job is done, and we can get the results.
To find out more information about your job, run the following command: