S3 Source Specification

The S3 Input Source provides a seamless way to utilize data stored in S3 or any S3-compatible storage service as input for Bacalhau jobs. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment. This capability ensures that your tasks have immediate access to the necessary data.

Source Specification Parameters

Here are the parameters that you can define for an S3 input source:

  • Bucket (string: <required>): The name of the S3 bucket where the data is stored.

  • Key(string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.

  • Filter(string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.

  • Region(string: <optional>): The AWS region where the S3 bucket is hosted.

  • Endpoint(string: <optional>): The endpoint URL of the S3 or S3-compatible service.

  • VersionID(string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

  • ChecksumSHA256(string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

Fetching Mechanism

  • Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt

  • Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/

  • Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*

Examples

Declarative Examples

When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.

Below is an example of how to define an S3 input source in YAML format.

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
  - Target: "/data"

Imperative Examples

When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:

  1. Mount an S3 object to a specific path:

    bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...
  2. Mount an S3 object with a specific endpoint and region:

    bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
  3. Mount an S3 object using long flag names:

    bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...

With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.

Credential Requirements

To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:

  1. Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

  2. Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.

  3. IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute nodes must run with the following policies to support S3 input source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}
  • ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.

  • GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.

  • Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or * to allow fetching from all buckets in the account.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

S3-Compatible Services

This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.

Using Google Cloud Storage

To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:

  1. Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.

  2. Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.

  3. Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
  - Target: "/data"