Data Ingestion

Pinning Data

How to pin data to public storage

If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

Basic steps

To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files; services usually provide a web interface, a CLI, and code samples for integration into your application. Once you upload a file you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9. You can now access the pinned data from your jobs via this CID.

The data source can be specified via the --input flag; see the CLI Guide for more details.
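As a sketch of that last step, the function below passes a pinned CID to a job through --input, using the ipfs:// form that appears elsewhere in this document. The function name list_pinned_input and the choice of the ubuntu image are illustrative assumptions, not part of any pinning service's API.

```shell
# Hypothetical helper: mount pinned data by CID and list it inside the job.
list_pinned_input() {
    cid="$1"
    if [ -z "$cid" ]; then
        echo "usage: list_pinned_input <CID>" >&2
        return 1
    fi
    bacalhau docker run \
        --wait \
        --input "ipfs://$cid" \
        ubuntu -- ls -l /inputs
}

# Usage (needs the bacalhau CLI):
#   list_pinned_input QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9
```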

Running a Job over S3 data

Here is a quick tutorial on how to copy data from S3 to public storage. In this tutorial, we will scrape all the links from a public AWS S3 bucket and then copy the data to IPFS using Bacalhau.

Prerequisite

To get started, you need to install the Bacalhau client; see the installation instructions for more information.

Running a Bacalhau Job

bacalhau docker run \
    -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
    --id-only \
    --wait \
    alpine \
    -- sh -c "cp -r /inputs/* /outputs/"

Structure of the Command

Let's look closely at the command above:

bacalhau docker run: call to Bacalhau using the Docker executor
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1": defines S3 objects as inputs to the job. In this case, it downloads all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01 from the bucket noaa-goes16 in the us-east-1 region and mounts them under the /inputs path inside the Docker job.
-- sh -c "cp -r /inputs/* /outputs/": copies all files under /inputs to /outputs. By default, /outputs is the result directory, and all of its content is published to the specified destination, which is IPFS by default.
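To make the layout of the -i value easier to see, the sketch below splits it into bucket, key prefix, mount path, and region with plain shell parameter expansion. It only illustrates how the string is laid out; it is not how Bacalhau itself parses the flag, and the function name parse_s3_input is hypothetical.

```shell
# Illustrative only: split an S3 input spec into its parts with parameter expansion.
parse_s3_input() {
    spec="$1"
    rest="${spec#s3://}"          # drop the scheme
    source_part="${rest%%:*}"     # bucket/key pattern, before the first ':'
    target_part="${rest#*:}"      # mount path plus options, after the first ':'
    bucket="${source_part%%/*}"
    key="${source_part#*/}"
    prefix="${key%\*}"            # strip a trailing '*' wildcard, if any
    mount="${target_part%%,*}"
    region=""
    case "$target_part" in
        *opt=region=*)
            region="${target_part##*opt=region=}"
            region="${region%%,*}"
            ;;
    esac
    printf 'bucket=%s\nprefix=%s\nmount=%s\nregion=%s\n' \
        "$bucket" "$prefix" "$mount" "$region"
}

parse_s3_input "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1"
```

For the spec used in this tutorial, the four printed fields are noaa-goes16, ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01, /inputs, and us-east-1.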

When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
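One way to capture the ID is to wrap the command shown above in a function and assign its output to JOB_ID. This is a minimal sketch: the function name submit_s3_copy_job is hypothetical, and it relies on --id-only making the job ID the only thing printed.

```shell
# Hypothetical wrapper around the command shown above; prints the job ID.
submit_s3_copy_job() {
    bacalhau docker run \
        -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
        --id-only \
        --wait \
        alpine \
        -- sh -c "cp -r /inputs/* /outputs/"
}

# Usage (needs the bacalhau CLI and network access):
#   export JOB_ID=$(submit_s3_copy_job)
#   echo "$JOB_ID"
```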

This works with datasets that are publicly available as well as with private datasets, provided that the nodes have the necessary credentials to access them. See the S3 Source Specification for more details.

Checking the State of your Jobs

Job status: You can check the status of the job using bacalhau job list.

bacalhau job list --id-filter ${JOB_ID} --wide

When it says Published or Completed, that means the job is done, and we can get the results.

Job information: You can find out more information about your job by using bacalhau job describe.

bacalhau job describe ${JOB_ID}

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.

rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau job get $JOB_ID --output-dir results # Download the results

Viewing your Job Output

When the download is completed, the results of the job will be present in the directory. To view them, run the following command:

ls -1 results/outputs

{
  "NextToken": "",
  "Results": [
    {
      "Type": "s3PreSigned",
      "Params": {
        "PreSignedURL": "https://bacalhau-test-datasets.s3.eu-west-1.amazonaws.com/integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUEMPQ7JFSLGEPHJG%2F20240129%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240129T060142Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=cea00578ae3b03a1b52dba2d65a1bab40f1901fb7cd4ee1a0a974dc05b595f2e",
        "SourceSpec": {
          "Bucket": "bacalhau-test-datasets",
          "ChecksumSHA256": "1tlbgo+q0TlQhJi8vkiWnwTwPu1zenfvTO4qW1D5yvI=",
          "Endpoint": "",
          "Filter": "",
          "Key": "integration-tests-publisher/walid-manual-test-j-46a23fe7-e063-4ba6-8879-aac62af732b0.tar.gz",
          "Region": "eu-west-1",
          "VersionID": "oS7n.lY5BYHPMNOfbBS1n5VLl4ppVS4h"
        }
      }
    }
  ]
}

Extract Result CID

First you need to install jq (if it is not already installed) to process JSON:

sudo apt update
sudo apt install jq

To extract the CIDs from the output JSON, run the following command:

bacalhau job describe ${JOB_ID} --json \
| jq -r '.State.Executions[].PublishedResults.CID | select (. != null)'

The extracted CID will look like this:

QmYFhG668yJZmtk84SMMdbrz5Uvuh78Q8nLxTgLDWShkhR

Publishing Results to S3-Compatible Destinations

You can publish your results to Amazon S3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.

Publisher Spec

To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.

For S3-compatible destinations, the configuration is as follows:

type PublisherSpec struct {
    Type   Publisher              `json:"Type,omitempty"`
    Params map[string]interface{} `json:"Params,omitempty"`
}

For Amazon S3, you can specify the PublisherSpec configuration as shown below:

PublisherSpec:
  Type: S3
  Params:
    Bucket: <bucket>              # Specify the bucket where results will be stored
    Key: <object-key>             # Define the object key (supports dynamic naming using placeholders)
    Compress: <true/false>        # Specify whether to publish results as a single gzip file (default: false)
    Endpoint: <optional>          # Optionally specify the S3 endpoint
    Region: <optional>            # Optionally specify the S3 region
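The same fields can also be written in command-line form. The sketch below composes a publisher URI from PublisherSpec-style values, assuming the s3://<bucket>/<object-key>,opt=<name>=<value> form used by the -p examples in this document. The function name s3_publisher_uri and the sample bucket and key are hypothetical.

```shell
# Hypothetical helper: compose an S3 publisher URI from PublisherSpec-style fields.
# Empty endpoint, region, or compress arguments are simply left out.
s3_publisher_uri() {
    bucket="$1"; key="$2"; endpoint="$3"; region="$4"; compress="$5"
    uri="s3://$bucket/$key"
    if [ -n "$endpoint" ]; then uri="$uri,opt=endpoint=$endpoint"; fi
    if [ -n "$region" ];   then uri="$uri,opt=region=$region"; fi
    if [ -n "$compress" ]; then uri="$uri,opt=compress=$compress"; fi
    printf '%s\n' "$uri"
}

# Sample values (placeholders): no custom endpoint, region us-east-1, compressed.
s3_publisher_uri my-bucket results/run1 "" us-east-1 true
```

With these sample values the function prints s3://my-bucket/results/run1,opt=region=us-east-1,opt=compress=true, which could then be passed to -p.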

Example Usage

Let's explore some examples to illustrate how you can use this:

Publishing results to S3 using default settings

bacalhau docker run -p s3://<bucket>/<object-key> ubuntu ...

Publishing results to S3 with a custom endpoint and region

bacalhau docker run \
-p s3://<bucket>/<object-key>,opt=endpoint=http://s3.example.com,opt=region=us-east-1 \
ubuntu ...

Publishing results to S3 as a single compressed file

bacalhau docker run -p s3://<bucket>/<object-key>,opt=compress=true ubuntu ...

Utilizing naming placeholders in the object key

bacalhau docker run -p s3://<bucket>/result-{date}-{jobID} ubuntu ...
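To see what such a key might resolve to, here is a small sketch that substitutes the two placeholders used above. It is an illustration only: the real substitution happens inside Bacalhau's publisher, and the YYYYMMDD date format, the function name expand_key, and the sample job ID are assumptions.

```shell
# Illustrative only: expand {date} and {jobID} in an object key with sed.
expand_key() {
    key="$1"; job_id="$2"
    today="${3:-$(date +%Y%m%d)}"   # assumed date format; optional override
    printf '%s\n' "$key" | sed -e "s/{date}/$today/g" -e "s/{jobID}/$job_id/g"
}

# Sample job ID and date are placeholders.
expand_key 'result-{date}-{jobID}' j-46a23fe7 20240129
```

With the sample arguments, the key becomes result-20240129-j-46a23fe7.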

Content Identification

Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.

Here's an example of a sample result:

{
    "NodeID": "QmYJ9QN9Pbi6gBKNrXVk5J36KSDGL5eUT6LMLF5t7zyaA7",
    "Data": {
        "StorageSource": "S3",
        "Name": "s3://<bucket>/run3.tar.gz",
        "S3": {
            "Bucket": "<bucket>",
            "Key": "run3.tar.gz",
            "Checksum": "e0uDqmflfT9b+rMfoCnO5G+cy+8WVTOPUtAqDMnXWbw=",
            "VersionID": "hZoNdqJsZxE_bFm3UGJuJ0RqkITe9dQ1"
        }
    }
}
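One way to use this checksum is to compare it with a locally computed digest. The Checksum value has the shape of a base64-encoded SHA-256 digest, which is the encoding S3 uses for its SHA-256 checksums, so a sketch along these lines should reproduce it for a copy of the published object. The function name s3_sha256 is hypothetical, and the sketch assumes that openssl and base64 are installed and that you fetched the object itself rather than an extracted copy.

```shell
# Hypothetical helper: print the base64-encoded SHA-256 digest of a file.
s3_sha256() {
    openssl dgst -sha256 -binary "$1" | base64
}

# Usage:
#   [ "$(s3_sha256 run3.tar.gz)" = "<Checksum from the result>" ] && echo "match"
```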

Support for the S3-compatible storage provider

To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:

Environment variables, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
Credentials file ~/.aws/credentials
IAM Roles for Amazon EC2 Instances
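Before submitting a job, it can help to check which of these sources is likely to be picked up on a given machine. The sketch below mirrors the order of the list above with simple local checks. It does not contact AWS, it cannot detect an EC2 instance role, and the function name aws_credential_source is hypothetical.

```shell
# Illustrative only: report which of the listed credential sources appears to be configured.
aws_credential_source() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "environment"
    elif [ -f "$HOME/.aws/credentials" ]; then
        echo "credentials-file"
    else
        # An IAM role on an EC2 instance may still provide credentials.
        echo "none-found"
    fi
}

aws_credential_source
```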

Need Support?

For questions and feedback, please reach out in our Slack.

Copy Data from URL to Public Storage

To upload a file from a URL we will use the bacalhau docker run command.
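A sketch of the submission command, assembled from the components listed in the breakdown later in this section (the --input URL and the upload:v1 image) together with the --id-only and --wait flags used by the other commands in this document, might look like the following. The function name submit_upload_job is hypothetical, and the exact original invocation may have differed.

```shell
# Hypothetical wrapper; the flags and arguments are the ones named in this document.
submit_upload_job() {
    bacalhau docker run \
        --id-only \
        --wait \
        --input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md \
        ghcr.io/bacalhau-project/examples/upload:v1
}

# Usage (needs the bacalhau CLI and network access):
#   export JOB_ID=$(submit_upload_job)
```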

The job has been submitted and Bacalhau has printed out the related job id.

Structure of the command

Let's look closely at the command above:

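bacalhau docker run: call to Bacalhau using the Docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md: URL of the input data, which is downloaded from a URL source and mounted as an input volume
ghcr.io/bacalhau-project/examples/upload:v1: the name and tag of the Docker image we are using

The bacalhau docker run command takes advantage of the --input parameter. This downloads a file from a public URL and places it in the /inputs directory of the container (by default). A helper container, which is designed to simplify the data uploading process, then moves that data to the /outputs directory.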

For more details, see the

Checking the State of Your Jobs

Job status: You can check the status of the job using bacalhau job list, processing the JSON output with jq:

bacalhau job list $JOB_ID --output=json | jq '.[0].Status.JobState.Nodes[] | .Shards."0" | select(.RunOutput)'

When the job status is Published or Completed, that means the job is done, and we can get the results using the job ID.

Job information: You can find out more information about your job by using bacalhau job describe.

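bacalhau job describe $JOB_ID

Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again, and download the job output into it.

rm -rf results && mkdir ./results
bacalhau job get --output-dir ./results $JOB_ID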

Viewing your Job Output

Each job result contains an outputs subfolder and exitCode, stderr, and stdout files with relevant content. To view the execution logs, run the following command:

head -n 15 ./results/stdout

To view the job execution result that was saved as a job output (the README.md file in this example), run:

tail ./results/outputs/README.md

Get the CID From the Completed Job

To get the output CID from a completed job, run the following command:

bacalhau job list $JOB_ID --output=json | jq -r '.[0].Status.JobState.Nodes[] | .Shards."0".PublishedResults | select(.CID) | .CID'

The job publishes its output to public storage via IPFS, and the CID identifies that output. We will store the CID in an environment variable so that we can reuse it later on.

Use the CID in a New Bacalhau Job

Now that we have the CID, we can use it in a new job. This time we will use the --input parameter to tell Bacalhau to use the data we just uploaded, referenced by its CID.

In this case, the only goal of our job is just to list the contents of the /inputs directory. You can see that the "input" data is located under /inputs/outputs/README.md.

bacalhau docker run \
    --id-only \
    --wait \
    --input ipfs://$CID \
    ubuntu -- \
    bash -c "set -x; ls -l /inputs; ls -l /inputs/outputs; cat /inputs/outputs/README.md"

The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.

Need Support?

For questions and feedback, please reach out in our Slack.
