S3 Publisher Specification
Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.
S3 Publisher
Parameters
- Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.
- Key (string: <required>): The object key within the specified bucket where the task results will be stored.
- Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).
- Region (string: <optional>): The region where the S3 bucket is located.
- Compress (bool: false): Indicates whether the task results should be compressed before storage.
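As a rough illustration of how these parameters fit together, the sketch below assembles them into a publisher spec with the same Type/Params shape used in the YAML example in this document. The helper name `build_s3_publisher` is hypothetical and not part of Bacalhau; it only mirrors the required and optional fields listed above.

```python
def build_s3_publisher(bucket, key, endpoint=None, region=None, compress=False):
    """Build a publisher spec dict from the parameters listed above.

    Hypothetical helper for illustration only; not a Bacalhau API.
    """
    # Bucket and Key are the two required parameters.
    if not bucket or not key:
        raise ValueError("Bucket and Key are required")
    params = {"Bucket": bucket, "Key": key}
    # Optional parameters are included only when they are set.
    if endpoint:
        params["Endpoint"] = endpoint
    if region:
        params["Region"] = region
    if compress:
        params["Compress"] = True
    return {"Type": "s3", "Params": params}


spec = build_s3_publisher("my-task-results", "task123/result.tar.gz", compress=True)
print(spec)
```

Serializing the returned dict to YAML or JSON is left out here, since the exact job format depends on how the job is submitted.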
Publishing Flexibility
The S3 Publisher is adept at handling both individual files and full directories. Each file within a directory is uploaded as a separate S3 object. If the Compress option is enabled, the entire directory is compressed into a single object, enhancing efficiency and reducing storage requirements.
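To make the difference between the two modes concrete, here is a minimal Python sketch that lists the object keys a local result directory would map to. The function name `planned_object_keys` is hypothetical, and the layout it uses for uncompressed results (key prefix followed by each file's relative path) is an assumption for illustration, not a statement of how Bacalhau names objects internally.

```python
from pathlib import Path


def planned_object_keys(result_dir, key, compress):
    """Return the object keys a result directory would map to.

    Assumed layout for illustration: one object for the whole directory
    when compressing, otherwise one object per file under the key prefix.
    """
    if compress:
        # The whole directory becomes a single archive object.
        return [key]
    root = Path(result_dir)
    prefix = key if key.endswith("/") else key + "/"
    # One key per regular file, using forward slashes as S3 keys do.
    return sorted(
        prefix + path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

For a directory containing `a.txt` and `sub/b.txt`, the uncompressed mode yields two keys under the prefix, while the compressed mode yields exactly one.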
Published Result Spec
Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using S3 Input Source. The published result specification includes the following parameters:
- Bucket: Confirms the name of the bucket containing the stored results.
- Key: Identifies the unique object key within the specified bucket.
- Region: Notes the AWS region of the bucket.
- Endpoint: Records the endpoint URL for S3-compatible storage services.
- VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.
- ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.
Note: ChecksumSHA256 and VersionID are only returned when the Compress option is enabled, offering users a method to verify the integrity of the compressed data and to track different versions of the stored objects.
Dynamic Naming for Published S3 Objects
With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.
Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:
- {executionID}: Replaced with the specific execution ID.
- {jobID}: Replaced with the ID of the job.
- {date}: Replaced with the current date in the format YYYYMMDD.
- {time}: Replaced with the current time in the format HHMMSS.
Additionally, if you are publishing an archive and the object key does not end with .tar.gz, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /, a trailing slash will be added.
Example
Imagine you've specified the following object key pattern for publishing:
results/{jobID}/{date}/{time}/
Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:
results/abc123/20230926/140530/
This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.
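The substitution and suffix rules described in this section can be sketched in a few lines of Python. This is an illustration of the documented behavior under stated assumptions, not Bacalhau's actual implementation: the function name `resolve_key`, its arguments, and the execution ID `e-42` are all hypothetical.

```python
from datetime import datetime


def resolve_key(pattern, job_id, execution_id, now, archived):
    """Expand naming placeholders and normalize the key suffix."""
    replacements = {
        "{executionID}": execution_id,
        "{jobID}": job_id,
        "{date}": now.strftime("%Y%m%d"),  # YYYYMMDD
        "{time}": now.strftime("%H%M%S"),  # HHMMSS
    }
    key = pattern
    for placeholder, value in replacements.items():
        key = key.replace(placeholder, value)
    if archived:
        # Archives always end with .tar.gz.
        if not key.endswith(".tar.gz"):
            key += ".tar.gz"
    elif not key.endswith("/"):
        # Non-archived results are published under a prefix.
        key += "/"
    return key


when = datetime(2023, 9, 26, 14, 5, 30)
print(resolve_key("results/{jobID}/{date}/{time}/", "abc123", "e-42", when, archived=False))
# prints: results/abc123/20230926/140530/
```

Plain string replacement is used instead of `str.format` so that any other braces in a key pattern are left untouched.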
Example
Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:
Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Compress: true
In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint parameter with the appropriate URL.
The results will be compressed into a single object, and the published result specification will look like:
PublishedResult:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Region: "us-west-2"
    ChecksumSHA256: "0x9a3a..."
    VersionID: "3/L4kqtJlcpXroDTDmJ+rmDbwQaHWyOb..."
Example (Imperative/CLI)
The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:
- Basic Docker job writing to S3 with default configurations:
  bacalhau docker run -p s3://bucket/key ubuntu ...
  This command writes to the S3 bucket using default endpoint and region settings without compressing the result.
- Docker job writing to S3 with a specific endpoint and region:
  bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
  This command specifies a unique endpoint and region for the S3 bucket.
- Docker job writing a single archived file to S3:
  bacalhau docker run -p s3://bucket/key,opt=compress=true ubuntu ...
  This command compresses the result into a single archived file before writing it to the S3 bucket.
- Using naming placeholders:
  bacalhau docker run -p s3://bucket/result-{date}-{jobID} ubuntu ...
  Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.
Remember to replace the placeholders like bucket, key, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.
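When jobs are submitted from a script, the value passed to -p can be assembled programmatically. The Python sketch below follows the `s3://bucket/key,opt=name=value` pattern visible in the commands above. The helper `publisher_flag` is hypothetical, and it covers only the three options shown in this section.

```python
def publisher_flag(bucket, key, endpoint=None, region=None, compress=False):
    """Compose the value passed to -p, following the CLI examples above.

    Hypothetical helper; only the endpoint, region, and compress
    options shown in this section are handled.
    """
    parts = [f"s3://{bucket}/{key}"]
    if endpoint:
        parts.append(f"opt=endpoint={endpoint}")
    if region:
        parts.append(f"opt=region={region}")
    if compress:
        parts.append("opt=compress=true")
    return ",".join(parts)


print(publisher_flag("bucket", "key", endpoint="http://s3.example.com", region="us-east-1"))
# prints: s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1
```

The resulting string can be passed as a single argument to a subprocess call, which avoids shell quoting problems with braces in naming placeholders.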
Credential Requirements
The S3 Publisher requires no extra dependencies. However, valid AWS credentials are essential to sign the requests. The publisher employs the default credentials chain to retrieve credentials, primarily sourcing them from:
- Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
- Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
- IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
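Before starting a node, it can help to check which of the first two sources listed above appear to be present on the machine. The following Python sketch is a simple pre-flight check under stated assumptions: the function name `detect_credential_sources` and its return labels are hypothetical, it does not reproduce the AWS SDK's resolution logic, and it does not probe for EC2 IAM roles, which would require querying the instance metadata service.

```python
import os
from pathlib import Path


def detect_credential_sources(env=None, home=None):
    """Report which local credential sources appear to be configured.

    Illustrative pre-flight check only; it does not validate the
    credentials and does not detect EC2 IAM roles.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Both variables are needed to sign requests.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment")
    # Typical location of the shared credentials file.
    if (home / ".aws" / "credentials").is_file():
        sources.append("credentials-file")
    return sources


print(detect_credential_sources())
```

An empty list does not necessarily mean publishing will fail, since an IAM role or another provider in the default chain may still supply credentials.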