Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.
To get started, you need to install the Bacalhau client, see more information here
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1
: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefix ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01
from the bucket noaa-goes16
in us-east-1
region, and mount the objects under /inputs
path inside the docker job.
-- sh -c "cp -r /inputs/* /outputs/"
: copies all files under /inputs
to /outputs
, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the S3 Source Specification for more details.
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.
When the download is completed, the results of the job will be present in the directory. To view them, run the following command:
First you need to install jq
(if it is not already installed) to process JSON:
To extract the CIDs from output JSON, execute following:
The extracted CID will look like this:
You can publish your results to Amazon s3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.
To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.
For S3-compatible destinations, the configuration is as follows:
For Amazon S3, you can specify the PublisherSpec
configuration as shown below:
Let's explore some examples to illustrate how you can use this:
Publishing results to S3 using default settings
Publishing results to S3 with a custom endpoint and region:
Publishing results to S3 as a single compressed file
Utilizing naming placeholders in the object key
Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.
Here's an example of a sample result:
To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:
Environment variables, such as AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
Credentials file ~/.aws/credentials
IAM Roles for Amazon EC2 Instances
For questions, feedback, please reach out in our Slack
How to pin data to public storage
If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Pinata, NFT.Storage, Thirdweb, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.
To use a pinning service, you will almost always need to create an account. After registration, you get an API token, which is necessary to control and access the files. Then you need to upload files - usually services provide a web interface, CLI and code samples for integration into your application. Once you upload the files you will get its CID, which looks like this: QmUyUg8en7G6RVL5uhyoLBxSWFgRMdMraCRWFcDdXKWEL9
. Now you can access pinned data from the jobs via this CID.
Data source can be specified via --input
flag, see the CLI Guide for more details
To upload a file from a URL we will use the bacalhau docker run
command.
The job has been submitted and Bacalhau has printed out the related job id.
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau using docker executor
--input https://raw.githubusercontent.com/filecoin-project/bacalhau/main/README.md
: URL path of the input data volumes downloaded from a URL source.
ghcr.io/bacalhau-project/examples/upload:v1
: the name and tag of the docker image we are using
The bacalhau docker run
command takes advantage of the --input
parameter. This will download a file from a public URL and place it in the /inputs
directory of the container (by default). Then we will use a helper container to move that data to the /outputs directory.
You can find out more about the helper container in the examples repository which is designed to simplify the data uploading process.
For more details, see the CLI commands guide
Job status: You can check the status of the job using bacalhau list
, processing the json ouput with the jq
:
When the job status is Published
or Completed
, that means the job is done, and we can get the results using the job ID.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we removed a directory in case it was present before, created it and downloaded our job output to be stored in that directory.
Each job result contains an outputs
subfolder and exitCode
, stderr
and stdout
files with relevant content. To view the execution logs execute following:
And to view the job execution result (README.md
file in the example case), which was saved as a job output, execute:
To get the output CID from a completed job, run the following command:
The job will upload the CID to the public storage via IPFS. We will store the CID in an environment variable so that we can reuse it later on.
Now that we have the CID, we can use it in a new job. This time we will use the --input
parameter to tell Bacalhau to use the CID we just uploaded.
In this case, the only goal of our job is just to list the contents of the /inputs
directory. You can see that the "input" data is located under /inputs/outputs/README.md
.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
For questions and feedback, please reach out in our Slack