Running a Job over S3 data
Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.
Prerequisite
To get started, you need to install the Bacalhau client, see more information here
Running a Bacalhau Job
Structure of the Command
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1
: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefixABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01
from the bucketnoaa-goes16
inus-east-1
region, and mount the objects under/inputs
path inside the docker job.-- sh -c "cp -r /inputs/* /outputs/"
: copies all files under/inputs
to/outputs
, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
This works either with datasets that are publicly available or with private datasets, provided that the nodes have the necessary credentials to access. See the S3 Source Specification for more details.
Checking the State of your Jobs
Job status: You can check the status of the job using bacalhau list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau describe
.
Job download: You can download your job results directly by using bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we remove the results directory if it exists, create it again and download our job output to be stored in that directory.
Viewing your Job Output
When the download is completed, the results of the job will be present in the directory. To view them, run the following command:
Extract Result CID
First you need to install jq
(if it is not already installed) to process JSON:
To extract the CIDs from output JSON, execute following:
The extracted CID will look like this:
Publishing Results to S3-Compatible Destinations
You can publish your results to Amazon s3 or other S3-compatible destinations like MinIO, Ceph, or SeaweedFS to conveniently store and share your outputs.
Publisher Spec
To facilitate publishing results, define publishers and their configurations using the PublisherSpec structure.
For S3-compatible destinations, the configuration is as follows:
For Amazon S3, you can specify the PublisherSpec
configuration as shown below:
Example Usage
Let's explore some examples to illustrate how you can use this:
Publishing results to S3 using default settings
Publishing results to S3 with a custom endpoint and region:
Publishing results to S3 as a single compressed file
Utilizing naming placeholders in the object key
Content Identification
Tracking content identification and maintaining lineage across different jobs' inputs and outputs can be challenging. To address this, the publisher encodes the SHA-256 checksum of the published results, specifically when publishing a single compressed file.
Here's an example of a sample result:
Support for the S3-compatible storage provider
To enable support for the S3-compatible storage provider, no additional dependencies are required. However, valid AWS credentials are necessary to sign the requests. The storage provider uses the default credentials chain, which checks the following sources for credentials:
Environment variables, such as
AWS_ACCESS_KEY_ID
andAWS_SECRET_ACCESS_KEY
Credentials file
~/.aws/credentials
IAM Roles for Amazon EC2 Instances
Need Support?
For questions, feedback, please reach out in our Slack