Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.
Bucket (string: <required>)
: The name of the S3 bucket where the task results will be stored.
Key (string: <required>)
: The object key within the specified bucket where the task results will be stored.
Endpoint (string: <optional>)
: The endpoint URL of the S3 service (useful for S3-compatible services).
Region (string: <optional>)
: The region where the S3 bucket is located.
Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using S3 Input Source. The published result specification includes the following parameters:
Bucket: Confirms the name of the bucket containing the stored results.
Key: Identifies the unique object key within the specified bucket.
Region: Notes the AWS region of the bucket.
Endpoint: Records the endpoint URL for S3-compatible storage services.
VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.
ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.
With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.
Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:
{executionID}
: Replaced with the specific execution ID.
{jobID}
: Replaced with the ID of the job.
{nodeID}
: Replaced with the ID of the node where the execution took place
{date}
: Replaced with the current date in the format YYYYMMDD
.
{time}
: Replaced with the current time in the format HHMMSS
.
Additionally, if you are publishing an archive and the object key does not end with .tar.gz
, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /
, a trailing slash will be added.
Example
Imagine you've specified the following object key pattern for publishing:
Given a job with ID abc123
, executed on 2023-09-26
at 14:05:30
, the published object key would be:
This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.
Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:
In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint
parameter with the appropriate URL.
The results will be compressed into a single object, and the published result specification will look like:
The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:
Basic Docker job writing to S3 with default configurations:
This command writes to the S3 bucket using default endpoint and region settings.
Docker job writing to S3 with a specific endpoint and region:
This command specifies a unique endpoint and region for the S3 bucket.
Using naming placeholders:
Dynamic naming placeholders like {date}
and {jobID}
allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.
Remember to replace the placeholders like bucket
, key
, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.
To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:
Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials
can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
Compute nodes must run with the following policies to publish to S3:
PutObject Permissions: The s3:PutObject
permission is necessary to publish objects to the specified S3 bucket.
Resource: The Resource
field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /*
suffix is necessary to allow publishing with any prefix within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow publishing to multiple buckets, or *
to allow publishing to all buckets in the account.
To enable downloading published results using bacalhau job get <job_id>
command, the requester node must run with the following policies:
GetObject Permissions: The s3:GetObject
permission is necessary for the requester node to provide a pre-signed URL to download the published results by the client.
For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
The IPFS Publisher in Bacalhau amplifies the versatility of task result storage by integrating with the InterPlanetary File System (IPFS). IPFS is a protocol and network designed to create a peer-to-peer method of storing and sharing hypermedia in a distributed file system. Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure.
IPFS
Publisher ParametersFor the IPFS publisher, no specific parameters need to be defined in the publisher specification. The user only needs to indicate the publisher type as IPFS, and Bacalhau handles the rest. Here is an example of how to set up an IPFS Publisher in a job specification.
Once the job is executed, the results are published to IPFS, and a unique CID (Content Identifier) is generated for each file or piece of data. This CID acts as an address to the file in the IPFS network and can be used to access the file globally.
CID (string)
: This is the unique content identifier generated by IPFS, which can be used to access the published content from anywhere in the world. Every data piece stored on IPFS has its unique CID. Here's a sample of how the published result might appear:
In this example, the task results will be stored in IPFS, and can be referenced and retrieved using the specified CID. This is indicative of Bacalhau's commitment to offering flexible, reliable, and decentralized options for result storage, catering to a diverse set of user needs and preferences.
Bacalhau's Local Publisher provides a useful option for storing task results on the compute node, allowing for ease of access and retrieval for testing or trying our Bacalhau.
The Local Publisher should not be used for Production use as it is not a reliable storage option. For production use, we recommend using a more reliable option such as an S3-compatible storage service.
The local publisher requires no specific parameters to be defined in the publisher specification. The user only needs to indicate the publisher type as "local", and Bacalhau handles the rest. Here is an example of how to set up a Local Publisher in a job specification.
Once the job is executed, the results are published to the local compute node, and stored as compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get
command. TAhis will download and extract the contents for the user from the remove compute node.
URL (string)
: This is the HTTP URL to the results of the computation, which is hosted on the compute node where it ran. Here's a sample of how the published result might appear:
In this example, the task results will be stored on the compute node, and can be referenced and retrieved using the specified URL.
By default the compute node will attempt to use a public address for the HTTP server delivering task output, but there is no guarantee that the compute node is accessible on that address. If the compute node is behind a NAT or firewall, the user may need to manually specify the address to use for the HTTP server in the config.yaml
file.
There is no lifecycle management for the content stored on the compute node. The user is responsible for managing the content and ensuring that it is removed when no longer needed before the compute node runs out of disk space.
If the address/port of the compute node changes, then previously stored content will no longer be accessible. The user will need to manually update the address in the config.yaml
file and re-publish the content to make it accessible again.