Loading...
This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Below you will find a table of contents for a wide array of examples on how to use Bacalhau.
The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the fugacity of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.
In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.
To get started, you need to install the Bacalhau client, see more information here
For the purposes of this example we will use the SOCATv2022 dataset in the "Gridded" format from the SOCAT website and long-term global sea surface temperature data from NOAA - information about that dataset can be found here.
Next let's write the requirements.txt
. This file will also be used by the Dockerfile to install the dependencies.
We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.
To convert the data from fugacity of CO2 (fCO2) to partial pressure of CO2 (pCO2) we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the pyseaflux package.
Let's create a new file called main.py
and paste the following script in it:
This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y
.
We will create a Dockerfile
and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.
We will run docker build
command to build the container:
Before running the command replace:
hub-user
with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
Now you can push this repository to the registry designated by its name or tag.
For more information about working with custom containers, see the custom containers example.
Now that we have the data in IPFS and the Docker image pushed, next is to run a job using the bacalhau docker run
command
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
--input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
ghcr.io/bacalhau-project/examples/socat:0.0.11
: the name and the tag of the image we are using
python main.py
: execute the script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. oceanyaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.
Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.
To get started, you need to install the Bacalhau client, see more information here
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. This command allows to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run
command creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job.
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1
: Specifies the input data, which is stored in the S3 storage.
--entrypoint mogrify
: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.
dpokidov/imagemagick:7.1.0-47-ubuntu
: The name and the tag of the docker image we are using
-- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
: These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. image.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
To view the images, open the results/outputs/
folder:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Converting from CSV to parquet or avro reduces the size of the file and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on ipfs or on the web without the need to download files and install dependencies locally.
In this example tutorial we will convert a CSV file from a URL to parquet format and save the converted parquet file to IPFS
To get started, you need to install the Bacalhau client, see more information here
Let's download the transactions.csv
file:
You can use the CSV files from here
Write the converter.py
Python script, that serves as a CSV converter to Avro or Parquet formats:
You can find out more information about converter.py
here
In our case:
You can skip this section entirely and directly go to Running on Bacalhau
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
See more information on how to containerize your script/app here
We will run the docker build
command to build the container:
Before running the command replace:
hub-user
with your docker hub username. If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
With the command below, we are mounting the CSV file for transactions from IPFS
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
jsacex/csv-to-arrow-or-parque
: the name and the tag of the docker image we are using
../inputs/transactions.csv
: path to input dataset
/outputs/transactions.parquet parquet
: path to the output
python3 src/converter.py
: execute the script
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. convertcsv.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
Alternatively, you can do this:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even more in my experience) and importing and exporting data from the Ethereum node is slow.
For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can both now access the data without having to run another Ethereum node.
But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.
In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.
To get started, you need to install the Bacalhau client, see more information here
First let's download one of the IPFS files and inspect it locally:
You can see the full list of IPFS CIDs in the appendix at the bottom of the page.
If you don't already have the Pandas library, let's install it:
The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.
This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.
To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.
But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.
Next, let's make sure the file works as expected:
And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs
and /outputs
directories. The /inputs
directory is where the data will be read from and the /outputs
directory is where the results will be saved to.
We've already pushed the container, but for posterity, the following command pushes this container to GHCR.
To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run
command.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
The bacalhau docker run
command allows passing input data volume with --input
or -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run
command creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. blockchain.yaml
, and then run with the command:
Copy
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
To view the images, we will use glob to return all file paths that match a specific pattern.
Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.
See the appendix for the hashes.txt
file.
Now take a look at the job id's. You can use these to check the status of the jobs and download the results:
You might want to double-check that the jobs ran ok by doing a bacalhau job list
.
Wait until all of these jobs have been completed. And then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There's also been some issues in the past communicating with IPFS, so if you get an error, try again.
To view the images, we will use glob to return all file paths that match a specific pattern.
That's it! There are several years of Ethereum transaction volume data.
The following list is a list of IPFS CID's for the Ethereum data that we used in this tutorial. You can use these CID's to download the rest of the chain if you so desire. The CIDs are ordered by block number and they increase 50,000 blocks at a time. Here's a list of ordered CIDs:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
DuckDB is a relational table-oriented database management system and supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.
DuckDB is suited for the following use cases:
Processing and storing tabular datasets, e.g. from CSV or Parquet files
Interactive data analysis, e.g. Joining & aggregate multiple large tables
Concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
Large result set transfer to client
In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install, there is no need to download the datasets since the datasets are are already available on the server you are looking to run the query on.
To get started, you need to install the Bacalhau client, see more information here
We will be using Bacalhau setup with the standard setting up Bacalhau network.
We will also need to have a server with some data to run the query on. In this example, we will use a server with the Yellow Taxi Trips dataset.
If you do not already have this data on your server, you can download it using the scripts in the prep_data
directory. The command to download the data is ./prep_data/run_download_jobs.sh
- and you must have the /bacalhau_data
directory on your server.
To submit a job, run the following Bacalhau command:
This is a simple query that will return a single row with a single column - but the query will be executed in DuckDB, on a remote server.
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-e QUERY="select 1"
: the query to execute
docker.io/bacalhauproject/duckdb:latest
: the name and the tag of the docker image we are using
When a job is submitted, Bacalhau runs the query in DuckDB, and returns the results to the client.
After we run it, when we describe
the job, we can see the following in standard output:
What if you didn't want to run everything on the command line? You can use a YAML file to define the job. In simple_query.sql
, we have a simple query that will return the number of rows in the dataset.
To run this query, we can use the following YAML file:
Though this looks like a lot of code, it is actually quite simple. The Tasks
section defines the task to run, and the InputSources
section defines the input dataset. The Publisher
section defines where the results will be published, and the Resources
section defines the resources required for the job.
All the work is done in the environment variables, which are passed to the Docker image, and handed to DuckDB to execute the query.
To run this query, we can use the following command:
This breaks down into the following steps:
bacalhau job run
: call to bacalhau
duckdb_query_job.yaml
: the YAML file we are using
--template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet"
: the file to read
--template-vars="QUERY=$(cat simple_query.sql)"
: the query to execute
When we run this, we get back the following simple output:
Let's say we want to run a more complex query. In window_query_simple.sql
, we have a query that will return the average number of rides per 5 minute interval.
When we run this, we get back the following output:
The sql file needs to be run in a single line, otherwise the line breaks will cause some issues with the templating. We're working on improving this!
With this structure, you can now run virtually any query you want on remote servers, without ever having to download the data. Welcome to compute over data by Bacalhau!
If you get stuck or have questions:
Check out the official Bacalhau Documentation
Open an issue in our GitHub repository
Join our Slack
Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.
To get started, you need to install the Bacalhau client, see more information
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
.
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. The command allows one to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a . This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
: CIDs to use on the job. Mounts them at '/inputs' in the execution.
linuxserver/ffmpeg
: the name of the docker image we are using to resize the videos
-- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'
: the command that will be executed inside the container. It uses find
to locate all files with the extension ".mp4" within /inputs
and then uses ffmpeg
to resize each found file to 72 pixels in height, saving the results in the /outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. video.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the results open the results/outputs/
folder.
To upload a file from a URL we will use the bacalhau docker run
command.
The job has been submitted and Bacalhau has printed out the related job id.
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau using docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md
: URL path of the input data volumes downloaded from a URL source.
ghcr.io/bacalhau-project/examples/upload:v1
: the name and tag of the docker image we are using
The bacalhau docker run
command takes advantage of the --input
parameter. This will download a file from a public URL and place it in the /inputs
directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.
You can find out more about the which is designed to simplify the data uploading process.
For more details, see the
Job status: You can check the status of the job using bacalhau job list
, processing the json ouput with the jq
:
When the job status is Published
or Completed
, that means the job is done, and we can get the results using the job ID.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.
Each job result contains an outputs
subfolder and exitCode
, stderr
and stdout
files with relevant content. To view the execution logs execute following:
And to view the job execution result (README.md
file in the example case), which was saved as a job output, execute:
To get the output CID from a completed job, run the following command:
The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.
Now that we have the CID, we can use it in a new job. This time we will use the --input
parameter to tell Bacalhau to use the CID we just uploaded.
In this case, the only goal of our job is just to list the contents of the /inputs
directory. You can see that the "input" data is located under /inputs/outputs/README.md
.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
Use Bacalhau to process logs remotely across clouds and land the results on BigQuery.
This example demonstrates how to build a sophisticated distributed log processing pipeline using , Google , and . You'll learn how to process and analyze logs across distributed nodes, with progressively more advanced techniques for handling, sanitizing, and aggregating log data.
The combination of Bacalhau and BigQuery offers several key advantages:
Process logs directly where they are generated, eliminating the need for centralized collection
Scale processing across multiple nodes in different cloud providers
Leverage BigQuery's powerful analytics capabilities for processed data
Implement privacy-conscious data handling and efficient aggregation strategies
Through this example, you'll evolve from basic log collection to implementing a production-ready system with privacy protection and smart aggregation. Whether you're handling application logs, system metrics, or security events, this pipeline provides a robust foundation for distributed log analytics.
installed
A Google Cloud Project with BigQuery enabled
Service account credentials with BigQuery access
A running Bacalhau cluster with nodes across different cloud providers
Follow the
Ensure nodes are properly configured across your cloud providers (AWS, GCP, Azure). You can see more about setting up your nodes
Your nodes will have to have at least the following configurations installed to use the below jobs unchanged (in addition to anything else you may require).
DuckDB Processing: Process and analyze data using DuckDB's SQL capabilities
BigQuery Integration: Store processed results in Google BigQuery for further analysis
Make sure you have configured your config file to have the correct BigQuery project and dataset. You can do this by copying the config.yaml.example file to config.yaml and editing the values to match your BigQuery project and dataset.
Ensure your Google Cloud service account has these roles:
BigQuery Data Editor
BigQuery Job User
Have your service account key file (JSON format) ready
Configure your BigQuery settings:
Create a dataset for log analytics
Note your project ID and dataset name
We have provided some utility scripts to help you set up your BigQuery project and tables. You can run the following commands to set up your project and tables if you haven't already:
Interactive setup to set up your BigQuery project and tables. Will go through and create the necessary bigquery projects.
Creates sample tables in BigQuery for testing.
Checks the permissions of the service account specified in log_uploader_credentials.json to ensure it has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.
Confirms your BigQuery project and dataset, and creates the tables if they don't exist with the correct schema. This will also zero out the tables if they already exist, so be careful! (Useful for debugging)
Distributes the credentials to /bacalhau_data on all nodes in a Bacalhau network.
Ensures the service account specified in log_uploader_credentials.json has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.
One more thing to set up is the log faker on the nodes. This will generate logs for you to work with. You can run the following command to start the log faker:
Give it a couple of minutes to start up and then you can start processing the logs.
Let's walk through each stage of the demo, seeing how we can progressively improve our data processing pipeline!
Let's start by looking at the raw logs.
That will print out the logs to stdout, which we can then read from the job.
After running the job, you will see a job id, something like this:
When you run the describe
command, you will see the details of the job, including the output of the log information.
Now let's upload the raw logs to BigQuery. This is the simplest approach - just get the data there:
This will upload the python script to all the nodes which, in turn, will upload the raw logs from all nodes to BigQuery. When you check BigQuery, you'll see:
Millions of rows uploaded (depends on how many nodes you have and how long you let it run)
Each log line as raw text
No structure or parsing
To query the logs, you can use the following SQL:
Now let's do something more advanced, by parsing those logs into structured data before upload:
Your logs are now parsed into fields like:
IP Address
Timestamp
HTTP Method
Endpoint
Status Code
Response Size
To query the logs, you can use the following SQL:
Now let's handle the data responsibly by sanitizing PII (like IP addresses):
This:
Zeros out the last octet of IPv4 addresses
Zeros out the last 64 bits of IPv6 addresses
Maintains data utility while ensuring compliance
Again, to query the logs, you can use the following SQL:
Notice that the IP addresses are now sanitized.
Finally, let's be smart about what we upload:
This creates two streams:
Aggregated normal logs:
Grouped in 5-minute windows
Counts by status code
Average response sizes
Total requests per endpoint
Real-time emergency events:
Critical errors
Security alerts
System failures
To query the logs, you can use the following SQL:
Now that you've seen the power of distributed processing with Bacalhau:
Try processing your own log files
Experiment with different aggregation windows
Add your own privacy-preserving transformations
Scale to even more nodes!
Remember: The real power comes from processing data where it lives, rather than centralizing everything first. Happy distributed processing! 🚀
log_results (Main Table):
project_id
: STRING - Project identifier
region
: STRING - Deployment region
nodeName
: STRING - Node name
timestamp
: TIMESTAMP - Event time
version
: STRING - Log version
message
: STRING - Log content
sync_time
: TIMESTAMP - Upload time
remote_log_id
: STRING - Original log ID
hostname
: STRING - Source host
public_ip
: STRING - Sanitized public IP
private_ip
: STRING - Internal IP
alert_level
: STRING - Event severity
provider
: STRING - Cloud provider
log_aggregates (5-minute windows):
project_id
: STRING - Project identifier
region
: STRING - Deployment region
nodeName
: STRING - Node name
provider
: STRING - Cloud provider
hostname
: STRING - Source host
time_window
: TIMESTAMP - Aggregation window
log_count
: INT64 - Events in window
messages
: ARRAY - Event details
emergency_logs (Critical Events):
project_id
: STRING - Project identifier
region
: STRING - Deployment region
nodeName
: STRING - Node name
provider
: STRING - Cloud provider
hostname
: STRING - Source host
timestamp
: TIMESTAMP - Event time
version
: STRING - Log version
message
: STRING - Alert details
remote_log_id
: STRING - Original log ID
alert_level
: STRING - Always "EMERGENCY"
public_ip
: STRING - Sanitized public IP
private_ip
: STRING - Internal IP
How to pin data to public storage
If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.
To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files - usually services provide a web interface, CLI and code samples for integration into your application. Once you upload the files you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9
. Now you can access pinned data from the jobs via this CID.
Data source can be specified via --input
flag, see the for more details
Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.
To get started, you need to install the Bacalhau client, see more information
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1
: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01
from the bucket noaa-goes16
in us-east-1
region, and mount the objects under /inputs
path inside the docker job.
-- sh -c "cp -r /inputs/* /outputs/"
: copies all files under /inputs
to /outputs
, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the for more details.
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.
When the download is completed, the results of the job will be present in the directory. To view them, run the following command:
First you need to install jq
(if it is not already installed) to process JSON:
To extract the CIDs from output JSON, execute following:
The extracted CID will look like this:
You can publish your results to Amazon s3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.
To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.
For S3-compatible destinations, the configuration is as follows:
For Amazon S3, you can specify the PublisherSpec
configuration as shown below:
Let's explore some examples to illustrate how you can use this:
Publishing results to S3 using default settings
Publishing results to S3 with a custom endpoint and region:
Publishing results to S3 as a single compressed file
Utilizing naming placeholders in the object key
Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.
Here's an example of a sample result:
To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:
Environment variables, such as AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
Credentials file ~/.aws/credentials
IAM Roles for Amazon EC2 Instances
In this example tutorial, we use Bacalhau and Easy OCR to digitize paper records or for recognizing characters or extract text data from images stored on IPFS, S3 or on the web. is a ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic etc. With easy OCR, you use the pre-trained models or use your own fine-tuned model.
Install the required dependencies
Load the different example images
List all the images. You'll see an output like this:
Next, we create a reader to do OCR to get coordinates which represent a rectangle containing text and the text itself:
The docker build
command builds Docker images from a Dockerfile.
Before running the command replace:
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.
Now that we have an image in the docker hub (your own or an example image from the manual), we can use the container for running on Bacalhau.
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
-i ipfs://bafybeibvc......
Mounts the model from IPFS
-i https://raw.githubusercontent.com...
Mounts the Input Image from a URL
jsacex/easyocr
the name and the tag of the docker image we are using
-- easyocr -l ch_sim en -f ./inputs/chinese.jpg --detail=1 --gpu=True
execute script with following paramters:
-l ch_sim
: the name of the model
-f ./inputs/chinese.jpg
: path to the input Image or directory
--detail=1
: level of detail
--gpu=True
: we set this flag to true since we are running inference on a GPU. If you run this on a CPU - set this flag to false
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. easyocr.yaml
, and then run with the command:
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. You can view results by running following commands:
so we must run the full command after the --
argument. In this line you will list all of the mp4 files in the /inputs
directory and execute ffmpeg
against each instance.
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
For questions and feedback, please reach out in our
For questions, feedback, please reach out in our
You can skip this step and go straight to running a
We will use the Dockerfile
that is already created in the . Use the command below to clone the repo
hub-user with your docker hub username, If you don’t have a docker hub account follow to create docker account, and use the username of the account you created
To get started, you need to install the Bacalhau client, see more information .
Since the model and the image aren't present in the container we will mount the image from an URL and the model from IPFS. You can find models to download from . You can choose the model you want to use in this case we will be using the zh_sim_g2
model
The same job can be presented in the format. In this case, the description will look like this:
The identification and localization of objects in images and videos is a computer vision task called object detection. Several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmond et al.
Traditionally, models like YOLO required enormous amounts of training data to yield reasonable results. People might not have access to such high-quality labeled data. Thankfully, open-source communities and researchers have made it possible to utilize pre-trained models to perform inference. In other words, you can use models that have already been trained on large datasets to perform object detection on your own data.
Bacalhau is a highly scalable decentralized computing platform and is well suited to running massive object detection jobs. In this example, you can take advantage of the GPUs available on the Bacalhau Network and perform an end-to-end object detection inference, using the YOLOv5 Docker Image developed by Ultralytics.
Load your dataset into S3/IPFS, specify it and pre-trained weights via the --input
flags, choose a suitable container, specify the command and path to save the results - done!
To get started, you need to install the Bacalhau client, see more information here
To get started, let's run a test job with a small sample dataset that is included in the YOLOv5 Docker Image. This will give you a chance to familiarise yourself with the process of running a job on Bacalhau.
In addition to the usual Bacalhau flags, you will also see example of using the --gpu 1
flag in order to specify the use of a GPU.
Remember that by default Bacalhau does not provide any network connectivity when running a job. So you need to either provide all assets at job submission time, or use the --network=full
or --network=http
flags to access the data at task time. See the Internet Access page for more details
The model requires pre-trained weights to run and by default downloads them from within the container. Bacalhau jobs don't have network access so we will pass in the weights at submission time, saving them to /usr/src/app/yolov5s.pt
. You may also provide your own weights here.
The container has its own options that we must specify:
--input
to select which pre-trained weights you desire with details on the yolov5 release page
--project
specifies the output volume that the model will save its results to. Bacalhau defaults to using /outputs
as the output directory, so we save it there.
For more container flags refer to the yolov5/detect.py
file in the YOLO repository.
One final additional hack that we have to do is move the weights file to a location with the standard name. As of writing this, Bacalhau downloads the file to a UUID-named file, which the model is not expecting. This is because GitHub 302 redirects the request to a random file in its backend.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
export JOB_ID=$( ... )
exports the job ID as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --timeout
flag is set to make sure that if the job is not completed in the specified time, it will be terminated
The --wait
flag is set to wait for the job to complete before return
The --wait-timeout-secs
flag is set together with --wait
to define how long should app wait for the job to complete
The --id-only
flag is set to print only job id
The --input
flags are used to specify the sources of input data
-- /bin/bash -c 'find /inputs -type f -exec cp {} /outputs/yolov5s.pt \; ; python detect.py --weights /outputs/yolov5s.pt --source $(pwd)/data/images --project /outputs'
tells the model where to find input data and where to write output
This should output a UUID (like 59c59bfb-4ef8-45ac-9f4b-f0e9afd26e70
), which will be stored in the environment variable JOB_ID
. This is the ID of the job that was created. You can check the status of the job using the commands below.
The same job can be presented in the declarative format. In this case, the description will look like this:
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs/exp
folder.
Now let's use some custom images. First, you will need to ingest your images onto IPFS or S3 storage. For more information about how to do that see the data ingestion section.
This example will use the Cyclist Dataset for Object Detection | Kaggle dataset.
We have already uploaded this dataset to the IPFS storage under the CID: bafybeicyuddgg4iliqzkx57twgshjluo2jtmlovovlx5lmgp5uoh3zrvpm
. You can browse to this dataset via a HTTP IPFS proxy.
Let's run a the same job again, but this time use the images above.
Just as in the example above, this should output a UUID, which will be stored in the environment variable JOB_ID
. You can check the status of the job using the commands below.
To check the state of the job and view job output refer to the instructions above.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this example, we will demonstrate how to run inference on a model stored on Amazon S3. We will use a PyTorch model trained on the MNIST dataset.
Consider using the latest versions or use the docker method listed below in the article.
Python
PyTorch
Use the following commands to download the model and test image:
This script is designed to load a pretrained PyTorch model for MNIST digit classification from a tar.gz
file, extract it, and use the model to perform inference on a given input image. Ensure you have all required dependencies installed:
To use this script, you need to provide the paths to the tar.gz
file containing the pre-trained model, the output directory where the model will be extracted, and the input image file for which you want to perform inference. The script will output the predicted digit (class) for the given input image.
To get started, you need to install the Bacalhau client, see more information here
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
-w /inputs
Set the current working directory at /inputs
in the container
-i src=s3://sagemaker-sample-files/datasets/image/MNIST/model/pytorch-training-2020-11-21-22-02-56-203/model.tar.gz,dst=/model/,opt=region=us-east-1
: Mount the s3 bucket at the destination path provided - /model/
and specifying the region where the bucket is located opt=region=us-east-1
-i git://github.com/js-ts/mnist-test.git
: Flag to mount the source code repo from GitHub. It would mount the repo at /inputs/js-ts/mnist-test
in this case it also contains the test image
pytorch/pytorch
: The name of the Docker image
-- python3 /inputs/js-ts/mnist-test/inference.py --tar_gz_file_path /model/model.tar.gz --output_directory /model-pth --image_path /inputs/js-ts/mnist-test/image.png
: The command to run inference on the model. It consists of:
/model/model.tar.gz
is the path to the model file
/model-pth
is the output directory for the model
/inputs/js-ts/mnist-test/image.png
is the path to the input image
When the job is submitted Bacalhau prints out the related job id. We store that in an environment variable JOB_ID
so that we can reuse it later on.
Use the bacalhau job logs
command to view the job output, since the script prints the result of execution to the stdout:
You can also use bacalhau job get
to download job results:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Stable Diffusion is an open-source text-to-image model, which generates images from text. It's a cutting-edge alternative to DALL·E 2 and uses the Diffusion Probabilistic Model for image generation. At the core the model generates graphics from text using a Transformer.
This example demonstrates how to use stable diffusion online on a CPU and run it on the Bacalhau demo network. The first section describes the development of the code and the container. The second section demonstrates how to run the job using Bacalhau.
This model generated the images presented on this page.
The original text-to-image stable diffusion model was trained on a fleet of GPU machines, at great cost. To use this trained model for inference, you also need to run it on a GPU.
However, this isn't always desired or possible. One alternative is to use a project called OpenVINO from Intel that allows you to convert and optimize models from a variety of frameworks (and ONNX if your framework isn't directly supported) to run on a supported Intel CPU. This is what we will do in this example.
Heads up! This example takes about 10 minutes to generate an image on an average CPU. Whilst this demonstrates it is possible, it might not be practical.
In order to run this example you need:
A Debian-flavoured Linux (although you might be able to get it working on the newest machines)
First we convert the trained stable diffusion models so that they work efficiently on a CPU with OpenVINO. Choose the fine tuned version of Stable Diffusion you want to use. The example is quite complex, so we have created a separate repository to host the code. This is a fork from this Github repository.
In summary, the code downloads a pre-optimized OpenVINO version of the original pre-trained stable diffusion model. This model leverages OpenAI's CLIP transformer and is wrapped inside an OpenVINO runtime, which executes the model.
The core code representing these tasks can be found in the stable_diffusion_engine.py file. This is a mashup that creates a pipeline necessary to tokenize the text and run the stable diffusion model. This boilerplate could be simplified by leveraging the more recent version of the diffusers library. But let's continue.
Note that these dependencies are only known to work on Ubuntu-based x64 machines.
The following commands clone the example repository, and other required repositories, and install the Python dependencies.
Now that we have all the dependencies installed, we can call the demo.py
wrapper, which is a simple CLI, to generate an image from a prompt.
When the generation is complete, you can open the generated hello.png
and see something like this:
Lets try another prompt and see what we get:
Now we have a working example, we can convert it into a format that allows us to perform inference in a distributed environment.
First we will create a Dockerfile
to containerize the inference code. The Dockerfile can be found in the repository, but is presented here to aid understanding.
This container is using the python:3.9.9-bullseye
image and the working directory is set. Next, the Dockerfile installs the same dependencies from earlier in this notebook. Then we add our custom code and pull the dependent repositories.
We've already pushed this image to GHCR, but for posterity, you'd use a command like this to update it:
To run this example you will need Bacalhau installed and running
Bacalhau is a distributed computing platform that allows you to run jobs on a network of computers. It is designed to be easy to use and to run on a variety of hardware. In this example, we will use it to run the stable diffusion model on a CPU.
To submit a job, you can use the Bacalhau CLI. The following command passes a prompt to the model and generates an image in the outputs directory.
This will take about 10 minutes to complete. Go grab a coffee. Or a beer. Or both. If you want to block and wait for the job to complete, add the --wait
flag.
Furthermore, the container itself is about 15GB, so it might take a while to download on the node if it isn't cached.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
bacalhau docker run
: Run a job using docker executor.
--id-only
: Flag to print out only the job id
ghcr.io/bacalhau-project/examples/stable-diffusion-cpu:0.0.1
: The name and the tag of the Docker image.
The command to run inference on the model: python demo.py --prompt "First Humans On Mars" --output ../outputs/mars.png
. It consists of:
demo.py
: The Python script that runs the inference process.
--prompt "First Humans On Mars"
: Specifies the text prompt to be used for the inference.
--output ../outputs/mars.png
: Specifies the path to the output image.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs
folder.
In this example tutorial, we will show you how to generate realistic images with StyleGAN3 and Bacalhau. StyleGAN is based on Generative Adversarial Networks (GANs), which include a generator and discriminator network that has been trained to differentiate images generated by the generator from real images. However, during the training, the generator tries to fool the discriminator, which results in the generation of realistic-looking images. With StyleGAN3 we can generate realistic-looking images or videos. It can generate not only human faces but also animals, cars, and landscapes.
To get started, you need to install the Bacalhau client, see more information here
To run StyleGAN3 locally, you'll need to clone the repo, install dependencies and download the model weights.
Now you can generate an image using a pre-trained AFHQv2
model. Here is an example of the image we generated:
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
See more information on how to containerize your script/app here
We will run docker build
command to build the container:
Before running the command replace:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create docker account (https://docs.docker.com/docker-id/), and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
To submit a job run the Bacalhau command with following structure:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to Bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
jsacex/stylegan3
: the name and the tag of the docker image we are using
python gen_images.py
: execute the script with following parameters:
--trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
: The animation length is either determined based on the --seeds
value or explicitly specified using the --num-keyframes
option. When num keyframes are specified with --num-keyframes
, the output video length will be num_keyframes * w_frames
frames.
../outputs
: path to the output
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. stylegan3.yaml
, and then run with the command:
You can also run variations of this command to generate videos and other things. In the following command below, we will render a latent vector interpolation video. This will render a 4x2 grid of interpolations for seeds 0 through 31.
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
jsacex/stylegan3
the name and the tag of the docker image we are using
python gen_images.py
: execute the script with following parameters:
--trunc=1 --seeds=2 --network=stylegan3-r-afhqv2-512x512.pkl
: The animation length is either determined based on the --seeds
value or explicitly specified using the --num-keyframes
option. When num keyframes is specified with --num-keyframes
, the output video length will be num_keyframes * w_frames frames
. If --num-keyframes
is not specified, the number of seeds given with --seeds
must be divisible by grid size W*H (--grid
). In this case, the output video length will be # seeds/(w*h)*w_frames
frames.
../outputs
: path to the output
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.
This example demonstrates how to use stable diffusion using a finetuned model and run inference on it. The first section describes the development of the code and the container - it is optional as users don't need to build their own containers to use their own custom model. The second section demonstrates how to convert your model weights to ckpt. The third section demonstrates how to run the job using Bacalhau.
The following guide is using the fine-tuned model, which was finetuned on Bacalhau. To learn how to finetune your own stable diffusion model refer to this guide.
Convert your existing model weights to the ckpt
format and upload to the IPFS storage.
Create a job using bacalhau docker run
, relevant docker image, model weights and any prompt.
Download results using bacalhau job get
and the job id.
To get started, you need to install:
Bacalhau client, see more information here
NVIDIA GPU
CUDA drivers
NVIDIA docker
This part of the guide is optional - you can skip it and proceed to the Running a Bacalhau job if you are not going to use your own custom image.
To build your own docker container, create a Dockerfile
, which contains instructions to containerize the code for inference.
This container is using the pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
image and the working directory is set. Next the Dockerfile installs required dependencies. Then we add our custom code and pull the dependent repositories.
See more information on how to containerize your script/app here
We will run docker build
command to build the container.
Before running the command replace:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create the Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest
tag
So in our case, the command will look like this:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
Thus, in this case, the command would look this way:
After the repo image has been pushed to Docker Hub, you can now use the container for running on Bacalhau. But before that you need to check whether your model is a ckpt
file or not. If your model is a ckpt
file you can skip to the running on Bacalhau, and if not - the next section describes how to convert your model into the ckpt
format.
To download the convert script:
To convert the model weights into ckpt
format, the --half
flag cuts the size of the output model from 4GB to 2GB:
To do inference on your own checkpoint on Bacalhau you need to first upload it to your public storage, which can be mounted anywhere on your machine. In this case, we will be using NFT.Storage (Recommended Option). To upload your dataset using NFTup drag and drop your directory and it will upload it to IPFS.
After the checkpoint file has been uploaded copy its CID.
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
Let's look closely at the command above:
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://QmUCJuFZ2v7KvjBGHRP2K1TMPFce3reTkKVGF2BJY5bXdZ:/model.ckpt
: Path to mount the checkpoint
-- conda run --no-capture-output -n ldm
: since we are using conda we need to specify the name of the environment which we are going to use, in this case it is ldm
scripts/txt2img.py
: running the python script
--prompt "a photo of a person drinking coffee"
: the prompt you need to specify the session name in the prompt.
--plms
: the sampler you want to use. In this case we will use the plms
sampler
--ckpt ../model.ckpt
: here we specify the path to our checkpoint
--n_samples 1
: no of samples we want to produce
--skip_grid
: skip creating a grid of images
--outdir ../outputs
: path to store the outputs
--seed $RANDOM
: The output generated on the same prompt will always be the same for different outputs on the same prompt set the seed parameter to random
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished we can see the results in the results/outputs
folder. We received following image for our prompt:
Dolly 2.0, the groundbreaking, open-source, instruction-following Large Language Model (LLM) that has been fine-tuned on a human-generated instruction dataset, licensed for both research and commercial purposes. Developed using the EleutherAI Pythia model family, this 12-billion-parameter language model is built exclusively on a high-quality, human-generated instruction following dataset, contributed by Databricks employees.
Dolly 2.0 package is open source, including the training code, dataset, and model weights, all available for commercial use. This unprecedented move empowers organizations to create, own, and customize robust LLMs capable of engaging in human-like interactions, without the need for API access fees or sharing data with third parties.
A NVIDIA GPU
Python
Create an inference.py
file with following code:
You may want to create your own container for this kind of task. In that case, use the instructions for creating and publishing your own image in the docker hub. Use huggingface/transformers-pytorch-deepspeed-nightly-gpu
as base image, install dependencies listed above and copy the inference.py
into it. So your Dockerfile will look like this:
To get started, you need to install the Bacalhau client, see more information here
export JOB_ID=$( ... )
: Export results of a command execution as environment variable
bacalhau docker run
: Run a job using docker executor.
--gpu 1
: Flag to specify the number of GPUs to use for the execution. In this case, 1 GPU will be used.
-w /inputs
: Flag to set the working directory inside the container to /inputs
.
-i gitlfs://huggingface.co/databricks/dolly-v2-3b.git
: Flag to clone the Dolly V2-3B model from Hugging Face's repository using Git LFS. The files will be mounted to /inputs/databricks/dolly-v2-3b
.
-i https://gist.githubusercontent.com/js-ts/d35e2caa98b1c9a8f176b0b877e0c892/raw/3f020a6e789ceef0274c28fc522ebf91059a09a9/inference.py
: Flag to download the inference.py
script from the provided URL. The file will be mounted to /inputs/inference.py
.
jsacex/dolly_inference:latest
: The name and the tag of the Docker image.
The command to run inference on the model: python inference.py --prompt "Where is Earth located ?" --model_version "./databricks/dolly-v2-3b"
. It consists of:
inference.py
: The Python script that runs the inference process using the Dolly V2-3B model.
--prompt "Where is Earth located ?"
: Specifies the text prompt to be used for the inference.
--model_version "./databricks/dolly-v2-3b"
: Specifies the path to the Dolly V2-3B model. In this case, the model files are mounted to /inputs/databricks/dolly-v2-3b
.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished, we can see the results in the results/outputs
folder.
This example tutorial demonstrates how to use Stable Diffusion on a GPU and run it on the Bacalhau demo network. Stable Diffusion is a state of the art text-to-image model that generates images from text and was developed as an open-source alternative to DALL·E 2. It is based on a Diffusion Probabilistic Model and uses a Transformer to generate images from text.
To get started, you need to install the Bacalhau client, see more information here.
Here is an example of an image generated by this model.
This stable diffusion example is based on the Keras/Tensorflow implementation. You might also be interested in the Pytorch oriented diffusers library.
When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.
Based on the requirements here, we will install the following:
We have a sample code from this the Stable Diffusion in TensorFlow/Keras repo which we will use to check if the code is working as expected. Our output for this code will be a DSLR photograph of an astronaut riding a horse.
When you run this code for the first time, it will download the pre-trained weights, which may add a short delay.
When running this code, if you check the GPU RAM usage, you'll see that it's sucked up many GBs, and depending on what GPU you're running, it may OOM (Out of memory) if you run this again.
You can try and reduce RAM usage by playing with batch sizes (although it is only set to 1 above!) or more carefully controlling the TensorFlow session.
To clear the GPU memory we will use numba. This won't be required when running in a single-shot manner.
You need a script to execute when we submit jobs. The code below is a slightly modified version of the code we ran above which we got from here, however, this includes more things such as argument parsing argument parsing to be able to customize the generator.
For a full list of arguments that you can pass to the script, see more information here
After writing the code the next step is to run the script.
As a result, you will get something like this:
The following presents additional parameters you can try:
python main.py --p "cat with three eyes
- to set prompt
python main.py --p "cat with three eyes" --n 100
- to set the number of iterations to 100
python stable-diffusion.py --p "cat with three eyes" --b 2
to set batch size to 2 (№ of images to generate)
Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver. To containerize the inference code, we will create a Dockerfile
. The Dockerfile is a text document that contains the commands that specify how the image will be built.
The Dockerfile leverages the latest official TensorFlow GPU image and then installs other dependencies like git
, CUDA
packages, and other image-related necessities. See the original repository for the expected requirements.
See more information on how to containerize your script/app here
We will run docker build
command to build the container;
Before running the command replace following:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
Some of the jobs presented in the Examples section may require more resources than are currently available on the demo network. Consider starting your own network or running less resource-intensive jobs on the demo network
To submit a job run the Bacalhau command with following structure:
export JOB_ID=$( ... )
exports the job ID as environment variable
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
The --id-only
flag is set to print only job id
ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1
: the name and the tag of the docker image we are using
-- python main.py --o ./outputs --p "meme about tensorflow"
: The command to run inference on the model. It consists of:
main.py
path to the script
--o ./outputs
specifies the output directory
--p "meme about tensorflow"
specifies the prompt
The Bacalhau command passes a prompt to the model and generates an image in the outputs directory. The main difference in the example below compared to all the other examples is the addition of the --gpu X
flag, which tells Bacalhau to only schedule the job on nodes that have X
GPUs free. You can read more about GPU support in the documentation.
This will take about 5 minutes to complete and is mainly due to the cold-start GPU setup time. This is faster than the CPU version, but you might still want to grab some fruit or plan your lunchtime run.
Furthermore, the container itself is about 10GB, so it might take a while to download on the node if it isn't cached.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder:
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Creators are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. In this example, we will transcribe an audio clip locally, containerize the script and then run the container on Bacalhau.
The advantage of using Bacalhau over managed Automatic Speech Recognition services is that you can run your own containers which can scale to do batch process petabytes of videos or audio for automatic speech recognition
To get started, you need to install:
Bacalhau client, see more information here.
Whisper.
PyTorch.
Python Pandas.
Before we create and run the script we need a sample audio file to test the code. For that we download a sample audio clip:
We will create a script that accepts parameters (input file path, output file path, temperature, etc.) and set the default parameters. Also if the input file is in mp4
format, then the script converts it to wav
format. The transcript can be saved in various formats. Then the large model is loaded and we pass it the required parameters.
This model is not only limited to English and transcription, it supports many other languages.
Next, let's create an openai-whisper script:
Let's run the script with the default parameters:
To view the outputs, execute following:
To build your own docker container, create a Dockerfile
, which contains instructions on how the image will be built, and what extra requirements will be included.
We choose pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
as our base image.
And then install all the dependencies, after that we will add the test audio file and our openai-whisper script to the container, we will also run a test command to check whether our script works inside the container and if the container builds successfully
See more information on how to containerize your script/app here
We will run docker build
command to build the container;
Before running the command replace:
hub-user with your docker hub username, If you don’t have a docker hub account follow these instructions to create a Docker account, and use the username of the account you created
repo-name with the name of the container, you can name it anything you want
tag this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
We will transcribe the moon landing video, which can be found here: https://www.nasa.gov/multimedia/hd/apollo11_hdpage.html
Since the downloaded video is in mov format we convert the video to mp4 format and then upload it to our public storage in this case IPFS. To do this, a public IPFS network can be used, or you can create your own private IPFS network and use it to pin files.
After the dataset has been uploaded, copy the CID:
Let's look closely at the command below:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The-i ipfs://bafybeielf6z4cd2nuey5arckect5bjmelhouvn5r
: flag to mount the CID which contains our file to the container at the path /inputs
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
jsacex/whisper
: the name and the tag of the docker image we are using
python openai-whisper.py
: execute the script with following parameters:
-p inputs/Apollo_11_moonwalk_montage_720p.mp4
: the input path of our file
-o outputs
: the path where to store the outputs
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. To view it, run the following command:
In this example tutorial, we will show you how to train a PyTorch RNN MNIST neural network model with Bacalhau. PyTorch is a framework developed by Facebook AI Research for deep learning, featuring both beginner-friendly debugging tools and a high level of customization for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.
To get started, you need to install the Bacalhau client, see more information
Install the following:
Next, we run the command below to begin the training of the mnist_rnn
model. We added the --save-model
flag to save the model
Next, the downloaded MNIST dataset is saved in the data
folder.
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
export JOB_ID=$( ... )
exports the job ID as environment variable
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
pytorch/pytorch
: Using the official pytorch Docker image
The -i ipfs://QmdeQjz1HQQd.....
: flag is used to mount the uploaded dataset
-w /outputs:
Our working directory is /outputs. This is the folder where we will save the model as it will automatically get uploaded to IPFS as outputs
python ../inputs/main.py --save-model
: URL script gets mounted to the /inputs
folder in the container
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. torch.yaml
, and then run with the command:
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find results in the results/outputs
folder. To view them, run the following command:
In this example tutorial, we will look at how to run BIDS App on Bacalhau. BIDS (Brain Imaging Data Structure) is an emerging standard for organizing and describing neuroimaging datasets. is a container image capturing a neuroimaging pipeline that takes a BIDS formatted dataset as input. Each BIDS App has the same core set of command line arguments, making them easy to run and integrate into automated platforms. BIDS Apps are constructed in a way that does not depend on any software outside of the image other than the container engine.
To get started, you need to install the Bacalhau client, see more information
For this tutorial, download file ds005.tar
from this Bids dataset and untar it in a directory:
Let's take a look at the structure of the data
directory:
When you pin your data, you'll get a CID which is in a format like this QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz
. Copy the CID as it will be used to access your data
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i ipfs://QmaNyzSpJCt1gMCQLd3QugihY6HzdYmA8QMEa45LDBbVPz:/data
: mount the CID of the dataset that is uploaded to IPFS and mount it to a folder called data on the container
nipreps/mriqc:latest
: the name and the tag of the docker image we are using
../data/ds005
: path to input dataset
../outputs
: path to the output
participant --participant_label 01 02 03
: run the mriqc on subjects with participant labels 01, 02, and 03
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Copy
The job description should be saved in .yaml
format, e.g. bids.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
is an open-source machine learning software library, TensorFlow is used to train neural networks. Expressed in the form of stateful dataflow graphs, each node in the graph represents the operations performed by neural networks on multi-dimensional arrays. These multi-dimensional arrays are commonly known as “tensors”, hence the name TensorFlow. In this example, we will be training a MNIST model.
This section is from
This short introduction uses to:
Load a prebuilt dataset.
Build a neural network machine learning model that classifies images.
Train this neural network.
Evaluate the accuracy of the model.
Import TensorFlow into your program to check whether it is installed
Build a tf.keras.Sequential
model by stacking layers.
The tf.nn.softmax
function converts these logits to probabilities for each class:
Note: It is possible to bake the tf.nn.softmax
function into the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.
Define a loss function for training using losses.SparseCategoricalCrossentropy
, which takes a vector of logits and a True
index and returns a scalar loss for each example.
This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.
This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3
.
Use the Model.fit
method to adjust your model parameters and minimize the loss:
If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:
The following method can be used to save the model as a checkpoint
The dataset and the script are mounted to the TensorFlow container using an URL, we then run the script inside the container
The job description should be saved in .yaml
format, e.g. tensorflow.yaml
, and then run with the command:
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. To view it, run the following command:
Stable diffusion has revolutionalized text2image models by producing high quality images based on a prompt. Dreambooth is a approach for personalization of text-to-image diffusion models. With images as input subject, we can fine-tune a pretrained text-to-image model
Although the used to finetune the pre-trained model since both the Imagen model and Dreambooth code are closed source, several opensource projects have emerged using stable diffusion.
Dreambooth makes stable-diffusion even more powered with the ability to generate realistic looking pictures of humans, animals or any other object by just training them on 20-30 images.
In this example tutorial, we will be fine-tuning a pretrained stable diffusion using images of a human and generating images of him drinking coffee.
The following command generates the following:
Subject: SBF
Prompt: a photo of SBF without hair
Output:
Building this container requires you to have a supported GPU which needs to have 16gb+ of memory, since it can be resource intensive.
We will create a Dockerfile
and add the desired configuration to the file. Following commands specify how the image will be built, and what extra requirements will be included:
This container is using the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel
image and the working directory is set. Next, we add our custom code and pull the dependent repositories.
The shell script is there to make things much simpler since the command to train the model needs many parameters to pass and later convert the model weights to the checkpoint, you can edit this script and add in your own parameters
To download the models and run a test job in the Docker file, copy the following:
Then execute finetune.sh
with following commands:
We will run docker build
command to build the container:
Before running the command replace:
repo-name with the name of the container, you can name it anything you want.
tag this is not required but you can use the latest tag
Now you can push this repository to the registry designated by its name or tag.
The optimal dataset size is between 20-30 images. You can choose the images of the subject in different positions, full body images, half body, pictures of the face etc.
Only the subject should appear in the image so you can crop the image to just fit the subject. Make sure that the images are 512x512 size and are named in the following pattern:
After the Subject dataset is created we upload it to IPFS.
To upload your dataset using NFTup just drag and drop your directory it will upload it to IPFS:
After the checkpoint file has been uploaded, copy its CID which will look like this:
Since there are a lot of combinations that you can try, processing of finetuned model can take almost 1hr+ to complete. Here are a few approaches that you can try based on your requirements:
bacalhau docker run
: call to bacalhau
The --gpu 1
flag is set to specify hardware requirements, a GPU is needed to run such a job
-i ipfs://bafybeidqbuphwkqwgrobv2vakwsh3l6b4q2mx7xspgh4l7lhulhc3dfa7a
Mounts the data from IPFS via its CID
jsacex/dreambooth:latest
Name and tag of the docker image we are using
-- bash finetune.sh /inputs /outputs "a photo of David Aronchick man" "a photo of man" 3000 "/man"
execute script with following paramters:
/inputs
Path to the subject Images
/outputs
Path to save the generated outputs
"a photo of < name of the subject > < class >"
-> "a photo of David Aronchick man"
Subject name along with class
"a photo of < class >
" -> "a photo of man"
Name of the class
The number of iterations is 3000. This number should be no of subject images x 100. So if there are 30 images, it would be 3000. It takes around 32 minutes on a v100
for 3000 iterations, but you can increase/decrease the number based on your requirements.
Here is our command with our parameters replaced:
If your subject fits the above class, but has a different name you just need to replace the input CID and the subject name.
Use the /woman
class images
Here you can provide your own regularization images or use the mix class.
Use the /mix
class images if the class of the subject is mix
You can upload the model to IPFS and then create a gist, mount the model and script to the lightweight container
When a job is submitted, Bacalhau prints out the related job_id
. Use the export JOB_ID=$(bacalhau docker run ...)
wrapper to store that in an environment variable so that we can reuse it later on.
You can check the status of the job using bacalhau job list
.
When it says Completed
, that means the job is done, and we can get the results.
You can find out more information about your job by using bacalhau job describe
.
You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
After the download has finished you should see the following contents in results directory
Now you can find the file in the results/outputs
folder. You can view results by running following commands:
In the next steps, we will be doing inference on the finetuned model
Bacalhau currently doesn't support mounting subpaths of the CID, so instead of just mounting the model.ckpt
file we need to mount the whole output CID which is 6.4GB, which might result in errors like FAILED TO COPY /inputs
. So you have to manually copy the CID of the model.ckpt
which is of 2GB.
To get the CID of the model.ckpt
file go to https://gateway.ipfs.io/ipfs/< YOUR-OUTPUT-CID >/outputs/
. For example:
Or you can use the IPFS CLI:
Copy the link of model.ckpt
highlighted in the box:
Then extract the CID portion of the link and copy it.
To run a Bacalhau Job on the fine-tuned model, we will use the bacalhau docker run
command.
If you are facing difficulties using the above method you can mount the whole output CID
When a job is sumbitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
We got an image like this as a result:
Kipoi (pronounce: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for genomics. It currently contains 2201 different models, covering canonical predictive tasks in transcriptional and post-transcriptional gene regulation. Kipoi's API is implemented as a , and it is also accessible from the command line.
In this tutorial example, we will run a genomics model on Bacalhau.
To get started, you need to install the Bacalhau client, see more information
To run locally you need to install kipoi-veff2. You can find out the information about installing and usage
In our case this will be the following command:
To run Genomics on Bacalhau we need to set up a Docker container. To do this, you'll need to create a Dockerfile
and add your desired configuration. The Dockerfile is a text document that contains the commands that specify how the image will be built.
We will use the kipoi/kipoi-veff2:py37
image and perform variant-centered effect prediction using the kipoi_veff2_predict
tool.
The docker build
command builds Docker images from a Dockerfile.
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job for generating genomics data, run the following Bacalhau command:
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau
jsacex/kipoi-veff2:py37
: the name of the image we are using
kipoi_veff2_predict ./examples/input/test.vcf ./examples/input/test.fa ../outputs/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"
: the command that will be executed inside the container. It performs variant-centered effect prediction using the kipoi_veff2_predict tool
./examples/input/test.vcf
: the path to a Variant Call Format (VCF) file containing information about genetic variants
./examples/input/test.fa
: the path to a FASTA file containing DNA sequences. FASTA files contain nucleotide sequences used for variant effect prediction
../outputs/output.tsv
: the path to the output file where the prediction results will be stored. The output file format is Tab-Separated Values (TSV), and it will contain information about the predicted variant effects
-m "DeepSEA/predict"
: specifies the model to be used for prediction
-s "diff" -s "logit"
: indicates using two scoring functions for comparing prediction results. In this case, the "diff" and "logit" scoring functions are used. These scoring functions can be employed to analyze differences between predictions for the reference and alternative alleles.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. genomics.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
is a data subsetting method. Since the uncompressed datasets can get very large when compressed, it becomes much harder to train them as training time increases with the dataset size. To reduce training time and cut costs, we employ the coreset method; the coreset method can also be applied to other datasets. In this case, we use the coreset method which can lead to a fast speed in solving the k-means problem among the big data with high accuracy in the meantime.
We construct a small coreset for arbitrary shapes of numerical data with a decent time cost. The implementation was mainly based on the coreset construction algorithm that was proposed by Braverman et al. (SODA 2021).
For a deeper understanding of the core concepts, it's recommended to explore: 1. 2.
In this tutorial example, we will run compressed dataset with Bacalhau
To get started, you need to install the Bacalhau client, see more information
Clone the repo which contains the code
To download the dataset you should open Street Map, which is a public repository that aims to generate and distribute accessible geographic data for the whole world. Basically, it supplies detailed position information, including the longitude and latitude of the places around the world.
The dataset is a osm.pbf
(compressed format for .osm
file), the file can be downloaded from
The following command is installing Linux dependencies:
Ensure that the requirements.txt
file contains the following dependencies:
The following command is installing Python dependencies:
To run coreset locally, you need to convert from compressed pbf
format to geojson
format:
The following command is to run the Python script to generate the coreset:
To build your own docker container, create a Dockerfile
, which contains instructions on how the image will be built, and what extra requirements will be included.
We will use the python:3.8
image, we run the same commands for installing dependencies that we used locally.
We will run docker build
command to build the container:
Before running the command replace:
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name or tag.
In our case:
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
--input https://github.com/js-ts/Coreset/blob/master/monaco-latest.geojson
: mount the monaco-latest.geojson
file inside the container so it can be used by the script
jsace/coreset
: the name of the docker image we are using
python Coreset/python/coreset.py -f monaco-latest.geojson -o outputs
: the script initializes cluster centers, creates a coreset using these centers, and saves the results to the specified folder.
-k
: amount of initialized centers (default=5)
-n
: size of coreset (default=50)
-o
: the output folder
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The job description should be saved in .yaml
format, e.g. coreset.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
To view the output as a CSV file, run:
To train our model locally, we will start by cloning the Pytorch examples :
Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like or . Once registered you can use their UI or API or SDKs to upload files.
Once you have uploaded your data, you'll be finished copying the CID. is the dataset we have uploaded.
The -i https://raw.githubusercontent.com/py..........
: flag is used to mount our training script. We will use the URL to this
The same job can be presented in the format. In this case, the description will look like this:
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like or . Once registered, you can use their UI or API or SDKs to upload files.
Alternatively, you can upload your dataset to IPFS using , but the recommended approach is to use a pinning service as we have mentioned above.
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
For each example, the model returns a vector of or scores, one for each class.
Before you start training, configure and compile the model using Keras Model.compile
. Set the class to adam
, set the loss
to the loss_fn
function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics
parameter to accuracy
.
The Model.evaluate
method checks the models performance, usually on a "" or "".
The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the .
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
To get started, you need to install the Bacalhau client, see more information
You can skip this section entirely and directly go to
hub-user with your docker hub username, If you don’t have a docker hub account follow to create a Docker account, and use the username of the account you create.
You can view the for reference.
In this case, we will be using (Recommended Option) to upload files and directories with .
The same job can be presented in the format. In this case, the description will look like this. Change the command in the Parameters section and CID to suit your goals.
Refer to our for more details of how to build a SD inference container
If you use the browser, you can use following:
To check the status of your job and download results refer back to the .
See more information on how to containerize your script/app
hub-user
with your docker hub username. If you don’t have a docker hub account , and use the username of the account you created
In this example, a model from github.com
is downloaded during the job execution. In order to do this, use the flag when describing the job, and --job-selection-accept-networked
when starting the compute node on which the job will be executed.
Note, that in the demo network, nodes do not accept jobs that require full
network access. Consider creating your own .
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
coreset.py
contains the following script
See more information on how to containerize your script/app
hub-user
with your docker hub username, If you don’t have a docker hub account , and use the username of the account you created
After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. We've already converted the monaco-latest.osm.pbf
file from compressed pbf
format to geojson
format . To submit a job, run the following Bacalhau command:
The same job can be presented in the format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the (#general channel).
--network full
Log Management
Execute an ad-hoc log query using DuckDB
Process logs on hundreds of servers using DuckDB
Perform remote user analysis using log windowing
Upload logs from servers to MotherDuck
Upload logs from servers to Snowflake
Upload logs from servers to MongoDB
Upload logs from servers to Redshift
GROMACS is a package for high-performance molecular dynamics and output analysis. Molecular dynamics is a computer simulation method for analyzing the physical movements of atoms and molecules
In this example, we will make use of gmx pdb2gmx program to add hydrogens to the molecules and generates coordinates in Gromacs (Gromos) format and topology in Gromacs format.
In this example tutorial, our focus will be on running Gromacs package with Bacalhau
To get started, you need to install the Bacalhau client, see more information here
Datasets can be found here https://www.rcsb.org, In this example we use RCSB PDB - 1AKI dataset. After downloading, place it in a folder called “input”
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
Alternatively, you can upload your dataset to IPFS using IPFS CLI:
Copy the CID in the end which is QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9
Let's run a Bacalhau job that converts coordinate files to topology and FF-compliant coordinate files:
Lets look closely at the command above:
bacalhau docker run
: call to Bacalhau
-i ipfs://QmeeEB1YMrG6K8z43VdsdoYmQV46gAPQCHotZs9pwusCm9:/input
: here we mount the CID of the dataset we uploaded to IPFS to use on the job
gromacs/gromacs
: we use the official gromacs - Docker Image
gmx pdb2gmx
: command in GROMACS that performs the conversion of molecular structural data from the Protein Data Bank (PDB) format to the GROMACS format, which is used for conducting Molecular Dynamics (MD) simulations and analyzing the results. Additional parameters could be found here gmx pdb2gmx — GROMACS 2022.2 documentation
-f input/1AKI.pdb
: input file
-o outputs/1AKI_processed.gro
: output file
-water
Water model to use. In this case we use spc
For a similar tutorial that you can try yourself, check out KALP-15 in DPPC - GROMACS Tutorial
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. gromacs.yaml
, and then run with the command:
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
This guide provides an overview of using DuckDB with Bacalhau for remote log analysis. By leveraging these tools, you can perform detailed analyses without the need to download datasets locally.
DuckDB is a powerful in-memory SQL database management system ideal for data analytics. Bacalhau facilitates decentralized job execution, meaning you can run jobs remotely without having to log in or build complicated services. Together, they make a powerful tool for remote, ad-hoc server interaction.
You will need a Bacalhau cluster with the following configuration:
To run a DuckDB job on Bacalhau, all you need to do is use the DuckDB container. To submit a job, run the following command:
bacalhau docker run: command to run a Docker container on Bacalhau
davidgasquez/datadex:v0.2.0: name and tag of the Docker image
duckdb -s "select 1": the DuckDB CLI command to execute
When a job is submitted, Bacalhau prints out the related job_id
. We store that ID in an environment variable (JOB_ID
) so that we can reuse it later on.
The same job can be submitted in a declarative format. Create a YAML file (e.g., duckdb1.yaml
) with the following content:
Then run the command:
Job Status Check the status of the job:
When it says Published
or Completed
, the job is done, and we can fetch the results.
Job Information Get more details about your job:
Job Download Download your job results:
Each job creates 3 subfolders in your results directory:
combined_results
per_shard
raw
To view the output file:
Expected output:
Below is an example command to run arbitrary SQL queries against the NYC Yellow Taxi Trips dataset. This dataset is hosted on IPFS for demonstration purposes.
bacalhau docker run: command to run a Docker container on Bacalhau
-i ipfs://...: specifying IPFS CIDs so Bacalhau can mount the data at /inputs
--workdir /inputs: sets the working directory inside the container to /inputs
davidgasquez/duckdb:latest: Docker image with DuckDB installed
duckdb -s: the DuckDB CLI command to execute
You can also present the above job in YAML format. For example, duckdb2.yaml
:
Then run:
Job Status
Job Information
Job Download
Sample output might look like this:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (look for the #general channel).
This is bacalhau-airflow
, a Python package that integrates Bacalhau with Apache Airflow. The benefit is twofold. First, thanks to this package you can now write complex pipelines for Bacalhau. For instance, jobs can communicate their output's CIDs to downstream jobs, that can use those as inputs. Second, Apache Airflow provides a solid solution to reliably orchestrate your DAGs.
You may try this out using a local devstack until https://github.com/bacalhau-project/bacalhau/issues/2038 has been fixed. Please set the following environment variables AIRFLOW_VAR_BACALHAU_API_HOST
, AIRFLOW_VAR_BACALHAU_API_PORT
.
Create Airflow tasks that run on Bacalhau (via custom operator!)
Support for sharded jobs: output shards can be passed downstream (via XComs)
Coming soon...
Lineage (see OpenLineage proof-of-concept integration here)
Various working code examples
Hosting instructions
Python 3.8+
apache-airflow
2.3+
The integration automatically registers itself for Airflow 2.3+ if it's installed on the Airflow worker's Python.
Clone the public repository:
Once you have a copy of the source, you can install it with:
In a production environment you may want to follow the official Airflow's instructions or pick one of the suggested hosted solutions.
If you're just curious and want to give it a try on your local machine, please follow the steps below.
First, install and initialize Airflow:
Then, we need to point Airflow to the absolute path of the folder where your pipelines live. To do that we edit the dags_folder
field in ${AIRFLOW_HOME}/airflow.cfg
file. In this example I'm going to use the hello_world.py
DAG shipped with this repository; for the sake of completeness, the next section will walk you through the actual code.
My config file looks like what follows:
Optionally, to reduce clutter in the Airflow UI, you could disable the loading of the default example DAGs by setting load_examples
to False
.
Finally, we can launch Airflow locally:
While Airflow's pinwheel is warming up in the background, let's take a look at the hello_world.py
breakdown below.
In brief, the first task of this DAG prints out "Hello World" to stdout, then automatically pipe its output into the subsequent task as an input file. The second task will simply print out the content of its input file.
All you need to import from this package is the BacalhauSubmitJobOperator
. It allows you to submit a job spec comprised of the usual fields such as engine, image, etc.
This operator supports chaining multiple jobs without the need to manually pass any CID along, in this regards a special note goes to the input_volumes
parameter (see task_2
below). Every time the operator runs a task, it stores a comma-separated string with the output shard-CIDs in an internal key-value store under the cids
key. Thus, downstream tasks can read in those CIDs via the input_volumes
parameter.
All you need to do is (1) use the XComs syntax (in curly brackets) to specify the "sender" task ids and the cids
key (e.g. {{ task_instance.xcom_pull(task_ids='task_1', key='cids') }}
), (2) define a target mount point separated by a colon (e.g. :/task_1_output
).
Lastly, we define task dependencies simply with task_1 >> task_2
. To learn more about Airflow's DAG syntax please check out this page.
Now that we understand what the example DAG is supposed to do, let's just run it! Head over to http://0.0.0.0:8080, where Airflow UI is being served. The screenshot below shows our hello world has been loaded correctly.
When you inspect a DAG, Airflow will render a graph depicting a color-coded topology (see image below). For active (i.e. running) pipelines, this will be useful to oversee the status of each task.
To trigger a DAG please enable the toggle shown below.
When all tasks have been completed, we want to fetch the output of our pipeline. To do so we need to retrieve the job-id of the last task. Click on a green box in the task_2
line and then open the XCom tab.
Here we find the bacalhau_job_id
. Select that value and copy it into your clipboard.
Lastly, we can use the bacalhau cli get
command to fetch the output data as follows:
You can also skip using tox
and run pytest
on your own dev environment.
In this tutorial example, we will showcase how to containerize an OpenMM workload so that it can be executed on the Bacalhau network and take advantage of the distributed storage & compute resources. OpenMM is a toolkit for molecular simulation. It is a physic-based library that is useful for refining the structure and exploring functional interactions with other molecules. It provides a combination of extreme flexibility (through custom forces and integrators), openness, and high performance (especially on recent GPUs) that make it truly unique among simulation codes.
In this example tutorial, our focus will be on running OpenMM molecular simulation with Bacalhau.
To get started, you need to install the Bacalhau client, see more information here
We use a processed 2DRI dataset that represents the ribose binding protein in bacterial transport and chemotaxis. The source organism is the Escherichia coli bacteria.
Protein data can be stored in a .pdb
file, this is a human-readable format. It provides for the description and annotation of protein and nucleic acid structures including atomic coordinates, secondary structure assignments, as well as atomic connectivity. See more information about PDB format here. For the original, unprocessed 2DRI dataset, you can download it from the RCSB Protein Data Bank here.
The relevant code of the processed 2DRI dataset can be found here. Let's print the first 10 lines of the 2dri-processed.pdb
file. The output contains a number of ATOM records. These describe the coordinates of the atoms that are part of the protein.
To run the script above all we need is a Python environment with the OpenMM library installed. This script makes sure that there are no empty cells and to filter out potential error sources from the file.
This is only done to check whether your Python script is running. If there are no errors occurring, proceed further.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this, you need an account with a pinning service like Pinata or nft.storage. Once registered, you can use their UI or API or SDKs to upload files.
When you pin your data, you'll get a CID. Copy the CID as it will be used to access your data
To build your own docker container, create a Dockerfile
, which contains instructions to build your image.
See more information on how to containerize your script/app here
We will run docker build
command to build the container:
Before running the command, replace:
hub-user
with your docker hub username, If you don’t have a docker hub account follow these instructions to create docker account, and use the username of the account you created
repo-name
with the name of the container, you can name it anything you want
tag
this is not required but you can use the latest tag
In our case, this will be:
Next, upload the image to the registry. This can be done by using the Docker hub username, repo name, or tag.
Now that we have the data in IPFS and the docker image pushed, we can run a job on the Bacalhau network.
Lets look closely at the command above:
bacalhau docker run
: call to Bacalhau
bafybeig63whfqyuvwqqrp5456fl4anceju24ttyycexef3k5eurg5uvrq4
: here we mount the CID of the dataset we uploaded to IPFS to use on the job
ghcr.io/bacalhau-project/examples/openmm:0.3
: the name and the tag of the image we are using
python run_openmm_simulation.py
: the script that will be executed inside the container
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Lilypad is a distributed compute network for web3 based on Bacalhau. It currently enables the running of Bacalhau jobs from smart contracts.
Lilypad is aiming to build an internet-scale trustless distributed compute network for web3. Creating the infrastructure for use cases like AI inference, ML training, DeSci and more.
Lilypad serves as a verifiable, trustless, and decentralized computational network engineered to facilitate mainstream adoption of web3 applications. By extending unrestricted, global access to computational power, Lilypad strategically collaborates with decentralized infrastructure networks, such as Filecoin, to formulate a transparent, efficient, and accessible computational ecosystem. While Lilypad does not specifically resolve issues related to the accessibility of AI models, it significantly alleviates challenges associated with procuring high-performance AI hardware. In this context, Lilypad provides decentralized AI computational services. The network recently unveiled its second version, dubbed Lilypad V2 (Aurora), and is actively laying groundwork for multi-chain integration and the deployment of an incentivized test net.
Some of the key features of Lilypad include:
Verifiable Trustless Decentralized Compute Network: Lilypad is a decentralized compute network that aims to provide global, permissionless access to compute power. It leverages decentralized physical infrastructure networks like Filecoin to ensure trustlessness and verifiability.
Mainstream Web3 Application Support: Lilypad is designed to enable mainstream web3 applications to use its compute network. It aims to make decentralized AI more accessible, efficient, and transparent for developers and users.
Open Compute Network: Lilypad creates an open compute network that allows users to access and run AI models and jobs. It separates module creators from users and curates a set of deterministic modules for users to run, ensuring determinism in verification systems.
Multichain Support: Lilypad plans to go multichain, which means it will support multiple blockchain networks. This will increase the scalability and interoperability of the network, allowing users to choose the blockchain that best suits their needs.
Incentivized Test Net: Lilypad has plans to launch an incentivized test net, which will provide users with incentives to participate in testing and improving the network. This will help identify and address any issues or challenges before the mainnet launch.
Decentralization of Mediators: The team also aims to decentralize the mediators in the network. This means that the decision-making process and governance of the network will be distributed among multiple participants, ensuring a more decentralized and resilient system.
v0: September 2022 - Lilypad Bridge POC for triggering and returning Bacalhau compute jobs from a smart contract
v1: July 2023 - A modicum-based minimal testnet (EVM-based). See github
v2: September 2023 - A more robust trustless distributed testnet
v3 November 2023 - Lilypad AI Studio, Calibration Net for compute providers
v4: Q1 2024 - Lilypad Incentivized Testnet deployment
v5: tbd - Lilypad Mainnet
The architecture of Lilypad is inspired by the research paper titled "Mechanisms for Outsourcing Computation via a Decentralized Market." The paper introduces MODiCuM, a decentralized system that allows for computational outsourcing in an open market. Just like MODiCuM, Lilypad aims to create an open market of computational resources by introducing various decentralized services like solver, resource provider, job creator, mediator, and directory services. MODiCuM's unique approach to deterring misbehavior in a decentralized environment through dedicated mediators and enforceable fines has influenced Lilypad's own design, particularly in the areas of dispute resolution and system integrity.
Lilypad is a proof of concept bridge which runs off the demo (free to use) Bacalhau compute network. As such, the reliability of jobs on this network are not guaranteed.
If you have a need for reliable compute based on this infrastructure - get in touch with us.
Components of the Lilypad Ecosystem - From SaaS to Smart Contracts:
Main architecture of the Lilypad Ecosystem:
Services in the Lilypad Ecosystem:
See more about how Bacalhau & Lilypad are related below:
FVM Hackerbase Video
Note: Since this video was released some changes have been made to the underlying code, but the process and general architecture remains the same.
Install metamask Extension
Add the Lilypad Testnet chain to metamask:
To do this, open metamask then click on the Network
button > Add Network
> Add a network manually
See the Lilypad documentation for more details
To obtain funds, connect to the lilypad v2 Aurora testnet network on your wallet and head to the faucet at http://faucet.lilypad.tech:8080
to get ETH and LP. Copy your metamask wallet address into the bar and click request.
Only x86_64 Linux
platform is supported
There are two ways to install CLI:
With GO toolchain
You may then need to set:
Via officially released binaries
You may then need to set:
Verifying if the installation is successful: Execute lilypad
on your terminal and it should produce the following response.
Run the command below
Wait for the compute to take place and for the results to be published
View your results by executing
Check out the latest version of the official Lilypad documentation
Gain deeper insights into the WebAssembly (Wasm) jobs running on Bacalhau compute nodes using the , an open-source library that unlocks modern observability for WebAssembly. This feature supplements the host-level observability data with additional traces extracted from within the Wasm modules running on compute nodes in a Bacalhau network.
Extract telemetry data from Wasm workloads. Currently supports tracing, with logs and metrics coming soon.
Data can be sent to the same viewing destinations (i.e. sinks) that are supported for the host-level data.
Utilizes the same Trace ID as the host-level data, allowing for seamless visibility into the end-to-end execution of the job.
The Observe SDK is integrated with the default WebAssembly provided by Bacalhau, so node operators are not required to integrate the SDK itself as long as a custom / pluggable Executor is not being used.
(Optional) For node operators using a custom Executer see for instructions on how to integrate the Observe SDK.
The SDK uses the same environment variables noted here to send data to a viewing destination.
If you are running a Wasm-based workload on the Bacalhau network, your module must be instrumented to send its telemetry data to the provided by the Observe SDK. Modules can be instrumented automatically or manually.
Wasm modules can be automatically instrumented using an provided by . This method removes the need for manual instrumentation (see below) but does not preclude it (i.e., both approaches can be used together without issue)
Note: Currently, only the span_enter
, span_exit
, and span_tag
interfaces are implemented, with support for logging and metrics available in the future.
This is the official Python SDK for Bacalhau, named bacalhau-sdk.
The Bacalhau SDK changed with Bacalhau v.1.4.0 and has added/changed functionality!
It is a high-level SDK that ships the client-side logic (e.g. signing requests) needed to query the endpoints. Please take a look at for snippets to create, list and inspect jobs. Under the hood, this SDK uses bacalhau-apiclient
(autogenerated via /OpenAPI) to interact with the API.
Please make sure to use this SDK library in your Python projects, instead of the lower level bacalhau-apiclient
. The latter is listed as a dependency of this SDK and will be installed automatically when you follow the installation instructions below.
List, create and inspect Bacalhau jobs using Python objects
Use the production network, or set the following environment variables to target any Bacalhau network out there:
BACALHAU_API_HOST
BACALHAU_API_PORT
Generate a key pair used to sign requests stored in the path specified by the BACALHAU_DIR
env var (default: ~/.bacalhau
)
Clone the public repository:
Once you have a copy of the source, you can install it with:
Likewise the Bacalhau CLI, this SDK uses a key pair to be stored in BACALHAU_DIR
used for signing requests. If a key pair is not found there, it will create one for you.
Let's submit a Hello World job and then fetch its output data's CID. We start by importing this sdk, namely bacalhau_sdk
, used to create and submit a job create request. Then we import bacalhau_apiclient
(installed automatically with this sdk), it provides various object models that compose a job create request. These are used to populate a simple python dictionary that will be passed over to the submit
util method.
You have to set your API keys for the requestor node in the Environment variables first! These are stored as
The script above prints the following object, the job.metadata.id
value is our newly created job id!
We can then use the results
method to fetch, among other fields, the output data's CID. Please extract your own job_id
from the above output and hand it over to the results
function.
The line above prints the following dictionary:
When there wasn't some config specs specified, you may get messages about the config debugger working on them. This can look as the following:
You can set the environment variables BACALHAU_API_HOST
and BACALHAU_API_PORT
to point this SDK to your Bacalhau API local devstack.
To develop this SDK locally, create a dedicated poetry virtual environment and install the root package (i.e. bacalhau_sdk
) and its dependencies:
This outputs the following:
Note the line above installs the root package (i.e. bacalhau_sdk
) in editable mode, that is, any change to its source code is reflected immediately without the need for re-packaging and re-installing it. Easy-peasy!
Then install the pre-commit hooks and test it:
That's all folks .
To manually instrument a Wasm module, calls are made to the function interfaces through language bindings specific to the source language used to create the Wasm module. Examples of these bindings are available for and . Additional language bindings are planned but you can also build your own if desired. If going the former route, Dylibso can .
Congrats, that was a good start! Please find more code snippets in .
We use Poetry to manage this package, take a look at to install it. Note, all targets in the Makefile use poetry as well!
put
A request to put a job to bacalhau network. It encapsulates the job model. Once the job is successful put on bacalhau network, this returns the job details.
PutJobRequest
stop
Stops a certain job and takes optionally a reason why it was stopped.
job_id (str), reason (str=None)
executions
Gets Job Executions with the given parameters. Note that only job_id is required.
job_id (str), namespace (str=""), next_token (str =""), limit (int = 5), reverse (bool = False), order_by (str = "")
results
Get the results of the specified job_id.
job_id (str)
get
Gets Details/Specs of a Job by job_id and returns the job details.
job_id (str), include (str = ""), limit (int = 10)
history
Get History of a Job by job_id and return it.
job_id (str), event_type (str = "execution"), node_id (str = ""), execution_id (str = "")
list
Fetches and returns a list of all the Jobs, which abide the constraints given.
limit (int = 5), next_token (str = ""), order_by (str="created_at"), reverse (bool = False)