How to pin data to public storage
If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.
To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access your files. Then you need to upload the files - usually services provide a web interface, a CLI, and code samples for integration into your application. Once you upload a file you will get its CID, which looks like this: `QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9`. Now you can access the pinned data from your jobs via this CID.
:::info
The data source can be specified via the `--input` flag; see the CLI Guide for more details.
:::
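As a minimal sketch of using a pinned CID as a job input, assuming the `ipfs://` scheme is accepted by `--input` on your Bacalhau version (older releases used a different syntax to mount CIDs):

```bash
# List the pinned data by mounting the CID under /inputs
bacalhau docker run \
  --input ipfs://QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9:/inputs \
  ubuntu -- ls -l /inputs
```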
Here is a quick tutorial on how to copy data from S3 to public storage. In this tutorial, we will scrape all of the links from a public AWS S3 bucket and then copy the data to IPFS using Bacalhau.
To get started, you need to install the Bacalhau client; see more information here.
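A command along these lines runs the copy job; note that the `alpine` image and the `--wait` and `--id-only` flags are assumptions here (any image with a shell works), so adjust them to your setup:

```bash
# Copy the matching S3 objects into the job's /outputs directory,
# capturing the job ID so we can reuse it later
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    -i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
    alpine \
    -- sh -c "cp -r /inputs/* /outputs/")
```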
Let's look closely at the command above:

* `bacalhau docker run`: call to Bacalhau
* `-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1"`: defines the S3 objects as inputs to the job. In this case, Bacalhau will download all objects that match the prefix `ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01` from the bucket `noaa-goes16` in the `us-east-1` region, and mount them under the `/inputs` path inside the Docker job.
* `-- sh -c "cp -r /inputs/* /outputs/"`: copies all files under `/inputs` to `/outputs`, which is the default result output directory. Everything in `/outputs` is published to the specified destination, which is IPFS by default.
When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on.
:::tip
This only works with datasets that are publicly accessible and do not require an AWS account or payment to access.
:::
Job status: You can check the status of the job using `bacalhau list`. When it says `Published` or `Completed`, the job is done and we can get the results.
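For example (the `--id-filter` flag is an assumption; check `bacalhau list --help` on your version):

```bash
# Show only our job in the listing
bacalhau list --id-filter ${JOB_ID}
```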
Job information: You can find out more information about your job by using `bacalhau describe`.
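For example:

```bash
# Print the full description of the job, including its executions and results
bacalhau describe ${JOB_ID}
```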
Job download: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we create a directory and download our job output into it.
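For example (the `--output-dir` flag is assumed to be available on your version):

```bash
# Remove any earlier results and create a clean directory, then download
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```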
After the download has finished, you should see the following contents in the `results` directory. To view your file, run the following command:
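A minimal sketch, assuming the published outputs land under `results/outputs/`:

```bash
# List the downloaded job outputs
ls -lh results/outputs/
```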
Install `jq` to extract the CID from the result description:
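A hedged sketch: the exact JSON path to the published CID differs between Bacalhau versions (and the `--json` flag is an assumption; some versions use `--output json`), so the `jq` filter below simply searches the job description for any `CID` field:

```bash
# Install jq (Debian/Ubuntu shown; use your platform's package manager)
sudo apt-get update && sudo apt-get install -y jq

# Extract the first CID found in the job description
bacalhau describe ${JOB_ID} --json | jq -r '.. | .CID? // empty' | head -n 1
```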
You can publish your results to Amazon S3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.
To facilitate publishing results, define publishers and their configurations using the `PublisherSpec` structure.
For S3-compatible destinations, the configuration is as follows:
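As a hedged sketch only: the parameter names (`Bucket`, `Key`, `Endpoint`, `Region`) and the exact schema are assumptions that may differ between Bacalhau versions, and the MinIO endpoint is an example value:

```yaml
PublisherSpec:
  Type: s3
  Params:
    Bucket: my-task-results            # destination bucket (example name)
    Key: task123/                      # object key prefix for the results
    Endpoint: http://127.0.0.1:9000    # S3-compatible endpoint, e.g. a local MinIO
    Region: us-east-1
```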
For Amazon S3, you can specify the PublisherSpec configuration as shown below:
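Again as a hedged sketch with example values; for Amazon S3 the endpoint can typically be omitted because it is derived from the region:

```yaml
PublisherSpec:
  Type: s3
  Params:
    Bucket: my-task-results
    Key: task123/
    Region: us-east-1
```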
Let's explore some examples to illustrate how you can use this; a combined sketch follows the list below:
Publishing results to S3 using default settings
Publishing results to S3 with a custom endpoint and region:
Publishing results to S3 as a single compressed file
Utilizing naming placeholders in the object key
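Here is a combined, hedged sketch of the four scenarios above. It assumes the publisher is selected with `-p`/`--publisher` and tuned with `opt=` key/value pairs, and that `{date}`, `{time}`, and `{jobID}` are supported naming placeholders; all of this may vary by version, so check `bacalhau docker run --help` for your release:

```bash
# 1. Default settings
bacalhau docker run -p s3://my-task-results/task123/ ubuntu -- echo hello

# 2. Custom endpoint and region (e.g. a local MinIO server)
bacalhau docker run \
  -p "s3://my-task-results/task123/,opt=endpoint=http://127.0.0.1:9000,opt=region=us-east-1" \
  ubuntu -- echo hello

# 3. Single compressed file
bacalhau docker run \
  -p "s3://my-task-results/task123.tar.gz,opt=compress=true" \
  ubuntu -- echo hello

# 4. Naming placeholders in the object key
bacalhau docker run \
  -p "s3://my-task-results/{date}/{time}/{jobID}/" \
  ubuntu -- echo hello
```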
Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.
Here's a sample result:
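The following is only an illustration of the shape such a result can take; the field names (`Bucket`, `Key`, `Checksum`, `VersionID`) are assumptions and the values are placeholders, not actual output:

```json
{
  "Type": "s3",
  "Params": {
    "Bucket": "my-task-results",
    "Key": "task123.tar.gz",
    "Checksum": "<sha-256 checksum of the published archive>",
    "VersionID": "<object version, if bucket versioning is enabled>"
  }
}
```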
To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:
* Environment variables, such as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
* Credentials file: `~/.aws/credentials`
* IAM Roles for Amazon EC2 Instances
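For example, the first option in the chain can be satisfied by exporting the credentials before starting the node or submitting the job:

```bash
# Standard AWS SDK environment variables (placeholders shown)
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```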
For questions and feedback, please reach out in our forum.
To upload a file from a URL we will use the `bacalhau docker run` command.
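Putting the pieces described below together, the command looks roughly like this; the `--wait` and `--id-only` flags are assumptions used here to capture the job ID:

```bash
# Download the README from a public URL and publish it via the helper image
export JOB_ID=$(bacalhau docker run \
    --wait \
    --id-only \
    --input=https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md \
    ghcr.io/bacalhau-project/examples/upload:v1)
```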
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
Let's look closely at the command above:

* `bacalhau docker run`: call to Bacalhau
* `ghcr.io/bacalhau-project/examples/upload:v1`: the name and tag of the Docker image we are using
* `--input=https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md`: the URL of the input data volume, downloaded from a URL source
The `bacalhau docker run` command takes advantage of the `--input` parameter. This will download a file from a public URL and place it in the `/inputs` directory of the container (by default). Then we use a helper container to move that data to the `/outputs` directory so that it is published to your public storage via IPFS. In our case, we are using Filecoin as our public storage.
You can find out more about the helper container in the examples repository.
Job status: You can check the status of the job using `bacalhau list`. When it says `Published` or `Completed`, the job is done and we can get the results.
Job information: You can find out more information about your job by using `bacalhau describe`.
Job download: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we create a directory and download our job output into it.
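As before (the `--output-dir` flag is assumed to be available on your version):

```bash
# Create a clean results directory and download the job output into it
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```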
Each job creates 3 subfolders: `combined_results`, `per_shard`, and `raw`. To view the file, run the following command:
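A minimal sketch, assuming the downloaded file sits under `combined_results/outputs/`:

```bash
# Print the README that the job published
cat results/combined_results/outputs/README.md
```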
To get the output CID from the completed job, we inspect the job description and store the CID in an environment variable so that we can reuse it later on. The job published its results to public storage via IPFS, and this CID points to them.
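A hedged sketch, reusing the version-agnostic `jq` filter from earlier (the `--json` flag and the `CID` field name are assumptions):

```bash
# Capture the first CID found in the job description
export CID=$(bacalhau describe ${JOB_ID} --json | jq -r '.. | .CID? // empty' | head -n 1)
echo ${CID}
```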
Now that we have the CID, we can use it in a new job. This time we will use the `--input` parameter to tell Bacalhau to use the CID we just uploaded.
In this case, our "job" is just to list the contents of the `/inputs` directory. You can see that the "input" data is located under `/inputs/outputs/README.md`.
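A minimal sketch, assuming the `ipfs://` scheme is accepted by `--input` (older releases mounted CIDs with a different syntax):

```bash
# Mount the published CID under /inputs and list the uploaded README
bacalhau docker run \
  --input ipfs://${CID}:/inputs \
  ubuntu -- ls -l /inputs/outputs/
```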
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
For questions and feedback, please reach out in our forum