Other Specifications

engines

Docker Engine Specification

Docker Engine is one of the execution engines supported in Bacalhau. It allows users to run tasks inside Docker containers, offering an isolated and consistent environment for execution. Below are the parameters to configure the Docker Engine.

`Docker` Engine Parameters

Image (string: <required>): Specifies the Docker image to use for task execution. It should be an image that can be pulled by Docker.
Entrypoint (string[]: <optional>): Allows overriding the default entrypoint set in the Docker image. Each string in the array represents a segment of the entrypoint command.
Parameters (string[]: <optional>): Additional command-line arguments to be included in the container’s startup command, appended after the entrypoint.
EnvironmentVariables (string[]: <optional>): Sets environment variables within the Docker container during task execution. Each string should be formatted as KEY=value.
WorkingDirectory (string: <optional>): Sets the path inside the container where the task executes. If not specified, it defaults to the working directory defined in the Docker image.

Example

Here’s an example of configuring the Docker Engine within a job or task using YAML:

Engine:
  Type: "Docker"
  Params:
    Image: "ubuntu:20.04"
    Entrypoint:
      - "/bin/bash"
      - "-c"
    Parameters:
      - "echo Hello, World!"
    EnvironmentVariables:
      - "MY_ENV_VAR=myvalue"
    WorkingDirectory: "/app"

In this example, the task will be executed inside an Ubuntu 20.04 Docker container. The entrypoint is overridden to execute a bash shell that runs an echo command. An environment variable MY_ENV_VAR is set with the value myvalue, and the working directory inside the container is set to /app.
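As a rough illustration of how these fields combine, the Python sketch below assembles the startup command and environment from the example values. The helper names `build_command` and `parse_env` are hypothetical and are not part of Bacalhau; the sketch only assumes that Parameters are appended after the Entrypoint and that each variable follows the KEY=value format described above.

```python
from typing import Dict, List


def build_command(entrypoint: List[str], parameters: List[str]) -> List[str]:
    """Return the argv list: entrypoint segments first, then parameters."""
    return list(entrypoint) + list(parameters)


def parse_env(pairs: List[str]) -> Dict[str, str]:
    """Turn KEY=value strings into a mapping, splitting at the first '=' only."""
    env: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=value, got {pair!r}")
        env[key] = value
    return env


# Values taken from the YAML example above.
command = build_command(["/bin/bash", "-c"], ["echo Hello, World!"])
env = parse_env(["MY_ENV_VAR=myvalue"])
print(command)
print(env)
```

Because the whole echo command is a single Parameters entry, bash receives it as one argument to `-c`, which is why it is not split into separate words in the YAML.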

WebAssembly (WASM) Engine Specification

The WASM Engine in Bacalhau allows tasks to be executed in a WebAssembly environment, offering compatibility and speed. This engine supports WASM and WASI (WebAssembly System Interface) jobs, making it highly adaptable for various use cases. Below are the parameters for configuring the WASM Engine.

`WASM` Engine Parameters

EntryModule (InputSource: <required>): Specifies the WASM module that contains the start function or the main execution code of the task. The InputSource should point to the location of the WASM binary.
Entrypoint (string: <optional>): The name of the function within the EntryModule to execute. For WASI jobs, this should typically be _start. The entrypoint function should have zero parameters and zero results.
Parameters (string[]: <optional>): An array of strings containing arguments that will be supplied to the program as ARGV. This allows parameterized execution of the WASM task.
EnvironmentVariables (map[string]string: <optional>): A mapping of environment variable keys to their values, made available within the executing WASM environment.
ImportModules (InputSource[]: <optional>): An array of InputSources pointing to additional WASM modules. The exports from these modules will be available as imports to the EntryModule, enabling modular and reusable WASM code.

Example

Here’s a sample configuration of the WASM Engine within a task, expressed in YAML:

Engine:
  Type: "WASM"
  Params:
    EntryModule:
      Source:
        Type: "s3"
        Params:
          Bucket: "my-bucket"
          Key: "entry.wasm"
    Entrypoint: "_start"
    Parameters:
      - "--option"
      - "value"
    EnvironmentVariables:
      VAR1: "value1"
      VAR2: "value2"
    ImportModules:
      - Source:
          Type: "localDirectory"
          Params:
            Path: "/local/path/to/module.wasm"

In this example, the task is configured to run in a WASM environment. The EntryModule is fetched from an S3 bucket, the entrypoint is _start, and parameters and environment variables are passed into the WASM environment. Additionally, an ImportModule is loaded from a local directory, making its exports available to the EntryModule.

publishers

IPFS Publisher Specification

The IPFS Publisher in Bacalhau stores task results on the InterPlanetary File System (IPFS), a protocol and network designed to create a peer-to-peer method of storing and sharing hypermedia in a distributed file system. Publishing to IPFS gives users a decentralized option for their task results, improving accessibility and resilience while reducing dependence on a single point of failure.

`IPFS` Publisher Parameters

For the IPFS publisher, no specific parameters need to be defined in the publisher specification. The user only needs to indicate the publisher type as IPFS, and Bacalhau handles the rest. Here is an example of how to set up an IPFS Publisher in a job specification.

Publisher:
  Type: ipfs

Published Result Specification

Once the job is executed, the results are published to IPFS, and a unique CID (Content Identifier) is generated for each file or piece of data. This CID acts as an address to the file in the IPFS network and can be used to access the file globally.

Result Parameters

CID (string): This is the unique content identifier generated by IPFS, which can be used to access the published content from anywhere in the world. Every data piece stored on IPFS has its unique CID. Here's a sample of how the published result might appear:

PublishedResult:
  Type: ipfs
  Params:
    CID: "QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco"

In this example, the task results are stored in IPFS and can be referenced and retrieved using the specified CID.

Local Publisher Specification

Bacalhau's Local Publisher stores task results on the compute node, allowing for ease of access and retrieval when testing or trying out Bacalhau.

:::danger

The Local Publisher should not be used in production, as it is not a reliable storage option. For production use, we recommend a more reliable option such as an S3-compatible storage service.

:::

Local Publisher Parameters

The local publisher requires no specific parameters to be defined in the publisher specification. The user only needs to indicate the publisher type as "local", and Bacalhau handles the rest. Here is an example of how to set up a Local Publisher in a job specification.

Publisher:
  Type: local

Published Result Specification

Once the job is executed, the results are published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. This will download and extract the contents for the user from the remote compute node.

Result Parameters

URL (string): This is the HTTP URL to the results of the computation, which is hosted on the compute node where it ran.

The task results are stored on the compute node and can be referenced and retrieved using this URL.

Caveats

By default the compute node will attempt to use a public address for the HTTP server delivering task output, but there is no guarantee that the compute node is accessible on that address. If the compute node is behind a NAT or firewall, the user may need to manually specify the address to use for the HTTP server in the config.yaml file.
There is no lifecycle management for the content stored on the compute node. The user is responsible for managing the content and ensuring that it is removed when no longer needed before the compute node runs out of disk space.
If the address/port of the compute node changes, then previously stored content will no longer be accessible. The user will need to manually update the address in the config.yaml file and re-publish the content to make it accessible again.

S3 Publisher Specification

Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.

Publisher Parameters

Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.
Key (string: <required>): The object key within the specified bucket where the task results will be stored.
Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).
Region (string: <optional>): The region where the S3 bucket is located.

Published Result Spec

Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using S3 Input Source. The published result specification includes the following parameters:

Bucket: Confirms the name of the bucket containing the stored results.
Key: Identifies the unique object key within the specified bucket.
Region: Notes the AWS region of the bucket.
Endpoint: Records the endpoint URL for S3-compatible storage services.
VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.
ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.
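To show how a published checksum can be used for verification, here is a minimal Python sketch that recomputes the SHA-256 digest of downloaded bytes and compares it with the recorded value. This page does not state how ChecksumSHA256 is encoded, so the sketch assumes a hexadecimal string with an optional `0x` prefix; a base64-encoded checksum would need to be decoded differently. The function names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_checksum(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of data with an expected hex checksum.

    Assumption: the checksum is hex-encoded, optionally prefixed with "0x".
    """
    expected = expected_hex.lower()
    if expected.startswith("0x"):
        expected = expected[2:]
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(sha256_hex(data), expected)


payload = b"example results"
print(matches_checksum(payload, sha256_hex(payload)))
```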

Dynamic Naming

With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.

Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:

{executionID}: Replaced with the specific execution ID.
{jobID}: Replaced with the ID of the job.
{nodeID}: Replaced with the ID of the node where the execution took place.
{date}: Replaced with the current date in the format YYYYMMDD.
{time}: Replaced with the current time in the format HHMMSS.

Additionally, if you are publishing an archive and the object key does not end with .tar.gz, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /, a trailing slash will be added.

Example

Imagine you've specified the following object key pattern for publishing:

results/{jobID}/{date}/{time}/

Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:

results/abc123/20230926/140530/

This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.
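The substitution and suffix rules described above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not Bacalhau's implementation: `expand_key` and its arguments are hypothetical names, and whether the date and time are taken in UTC or local time is not stated here, so the caller passes the timestamp explicitly.

```python
from datetime import datetime


def expand_key(pattern: str, *, job_id: str, execution_id: str,
               node_id: str, now: datetime, archive: bool) -> str:
    """Expand the documented placeholders and apply the suffix rule."""
    values = {
        "{executionID}": execution_id,
        "{jobID}": job_id,
        "{nodeID}": node_id,
        "{date}": now.strftime("%Y%m%d"),  # YYYYMMDD
        "{time}": now.strftime("%H%M%S"),  # HHMMSS
    }
    key = pattern
    for placeholder, value in values.items():
        key = key.replace(placeholder, value)
    if archive:
        # Archives get a .tar.gz suffix when it is missing.
        if not key.endswith(".tar.gz"):
            key += ".tar.gz"
    elif not key.endswith("/"):
        # Non-archived results are published under a trailing slash.
        key += "/"
    return key


when = datetime(2023, 9, 26, 14, 5, 30)
key = expand_key("results/{jobID}/{date}/{time}/", job_id="abc123",
                 execution_id="e-1", node_id="n-1", now=when, archive=False)
print(key)
```

With the values from the example above, the pattern expands to results/abc123/20230926/140530/.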

Examples

Declarative Examples

Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint parameter with the appropriate URL.

The results will be compressed into a single object, and the published result specification will look like:

PublishedResult:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Region: "us-west-2"
    ChecksumSHA256: "0x9a3a..."
    VersionID: "3/L4kqtJlcpXroDTDmJ+rmDbwQaHWyOb..."

Imperative Examples

The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:

Basic Docker job writing to S3 with default configurations:
```
bacalhau docker run -p s3://bucket/key ubuntu ...
```
This command writes to the S3 bucket using default endpoint and region settings.
Docker job writing to S3 with a specific endpoint and region:
```
bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
```
This command specifies a unique endpoint and region for the S3 bucket.
Using naming placeholders:
```
bacalhau docker run -p s3://bucket/result-{date}-{jobID} ubuntu ...
```
Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.

Remember to replace the placeholders like bucket, key, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.
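To make the relationship between the CLI form and the declarative form concrete, the following Python sketch splits a `-p` value into the equivalent Publisher fields. It is not Bacalhau's parser: `parse_publisher_flag` is a hypothetical helper, and the mapping of `opt=endpoint` and `opt=region` onto the Endpoint and Region parameters is an assumption based on the examples above.

```python
from urllib.parse import urlparse


def parse_publisher_flag(value: str) -> dict:
    """Split 's3://bucket/key,opt=name=value,...' into a Publisher mapping."""
    uri, *options = value.split(",")
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket/key URI, got {uri!r}")

    params = {"Bucket": parsed.netloc, "Key": parsed.path.lstrip("/")}
    # Assumed mapping from CLI option names to specification parameters.
    option_names = {"endpoint": "Endpoint", "region": "Region"}
    for option in options:
        prefix, _, rest = option.partition("=")
        name, _, option_value = rest.partition("=")
        if prefix != "opt" or name not in option_names:
            raise ValueError(f"unsupported option {option!r}")
        params[option_names[name]] = option_value
    return {"Type": "s3", "Params": params}


spec = parse_publisher_flag(
    "s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1"
)
print(spec)
```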

Credential Requirements

To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
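When debugging a node that cannot publish, it can help to check which of the sources listed above is likely to supply credentials. The Python sketch below is a simplified diagnostic under stated assumptions: it only checks the two environment variables and a `[default]` profile in the credentials file, it does not query the EC2 metadata service, and it does not reproduce the full default credentials chain. The function name and return values are hypothetical.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping


def credential_source(env: Mapping[str, str], credentials_file: Path,
                      profile: str = "default") -> str:
    """Report the first listed source that appears to hold credentials."""
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file is silently skipped
    if parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id"):
        return "credentials-file"

    # Not verified here: an IAM role or another provider may still apply.
    return "instance-role-or-other"


print(credential_source(os.environ, Path.home() / ".aws" / "credentials"))
```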

Required IAM Policies

Compute Nodes

Compute nodes must run with the following policies to publish to S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

PutObject Permissions: The s3:PutObject permission is necessary to publish objects to the specified S3 bucket.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix allows publishing under any prefix within the bucket; replace it with a specific prefix to limit the scope of the policy. You can also list multiple resources in the policy to allow publishing to multiple buckets, or use * to allow publishing to all buckets in the account.
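As an example of scoping the Resource to a prefix, this Python sketch generates the same policy document for a given bucket and optional key prefix. The helper name `publish_policy`, the bucket `my-task-results`, and the prefix `results/` are illustrative values, not required names.

```python
import json


def publish_policy(bucket: str, prefix: str = "") -> dict:
    """Build a policy allowing s3:PutObject under bucket/prefix."""
    resource = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": resource,
            }
        ],
    }


# Limit publishing to keys that start with "results/".
print(json.dumps(publish_policy("my-task-results", "results/"), indent=4))
```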

Requester Node

To enable downloading published results using the bacalhau get <job_id> command, the requester node must run with the following policies:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

GetObject Permissions: The s3:GetObject permission is necessary for the requester node to provide a pre-signed URL that the client uses to download the published results.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

sources

IPFS Source Specification

The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.

Source Specification Parameters

Here are the parameters that you can define for an IPFS input source:

CID (string: <required>): The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.

Example

Below is an example of how to define an IPFS input source in YAML format.

InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
    Target: "/data"

In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.

Example (Imperative/CLI)

Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:

Mount an IPFS CID to the default /inputs directory:

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ...

Mount an IPFS CID to a custom /data directory:

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72:/data ubuntu ...

These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.

Local Source Specification

The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.

Source Specification Parameters

Here are the parameters that you can define for a Local input source:

SourcePath (string: <required>): The absolute path on the compute node where the local directory or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.
ReadWrite (bool: false): A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified local directory or file. If set to false, Bacalhau will have read-only access.

Allow-listing Local Paths

For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the --allow-listed-local-paths flag followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:

:rw - Read-Write access.
:ro - Read-Only access (default if no suffix is provided).

For instance:

bacalhau serve --allow-listed-local-paths "/etc/config:rw,/etc/*.conf:ro"
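The flag format above can be read as a list of (pattern, permission) pairs. The Python sketch below parses such a value and checks a requested path against it. It is only an approximation of the idea: the function names are hypothetical, and it assumes shell-style matching via `fnmatch`, where `*` also matches `/`, which may differ from how Bacalhau evaluates path patterns.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple


def parse_allow_list(flag_value: str) -> List[Tuple[str, bool]]:
    """Return (pattern, writable) pairs; no suffix means read-only."""
    entries = []
    for item in flag_value.split(","):
        if item.endswith(":rw"):
            entries.append((item[:-3], True))
        elif item.endswith(":ro"):
            entries.append((item[:-3], False))
        else:
            entries.append((item, False))
    return entries


def is_allowed(path: str, read_write: bool, entries: List[Tuple[str, bool]]) -> bool:
    """Check whether path may be mounted with the requested access."""
    for pattern, writable in entries:
        if fnmatchcase(path, pattern) and (writable or not read_write):
            return True
    return False


entries = parse_allow_list("/etc/config:rw,/etc/*.conf:ro")
print(is_allowed("/etc/config", True, entries))
print(is_allowed("/etc/nginx.conf", True, entries))
```

Here a read-write request for /etc/config is accepted, while a read-write request for /etc/nginx.conf is rejected because the matching pattern only grants read-only access.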

Example

Below is an example of how to define a Local input source in YAML format.

InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        ReadWrite: true
    Target: "/config"

In this example, Bacalhau is configured to access the local directory "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Setting the ReadWrite flag to false would give read-only access, preventing modifications to the local data from within the Bacalhau task.

Example (Imperative/CLI)

When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:

Mount a read-only local path to /config:

bacalhau docker run -i file:///etc/config:/config ubuntu ...

Mount a writable local path to /myCheckpoints:

bacalhau docker run -i file:///var/checkpoints:/myCheckpoints,opt=rw=true ubuntu ...

S3 Source Specification

The S3 Input Source provides a seamless way to utilize data stored in S3 or any S3-compatible storage service as input for Bacalhau jobs. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment. This capability ensures that your tasks have immediate access to the necessary data.

Source Specification Parameters

Here are the parameters that you can define for an S3 input source:

Bucket (string: <required>): The name of the S3 bucket where the data is stored.
Key (string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.
Filter (string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region (string: <optional>): The AWS region where the S3 bucket is hosted.
Endpoint (string: <optional>): The endpoint URL of the S3 or S3-compatible service.
VersionID (string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256 (string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

Fetching Mechanism

Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*
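The three cases above, together with the Filter parameter, can be sketched as a selection over a list of object keys. This Python example is illustrative only: `select_keys` is a hypothetical function working on an in-memory list rather than a real bucket listing, and it assumes the filter regex is matched against the part of each key that follows the prefix.

```python
import re
from typing import List, Optional


def select_keys(key: str, available: List[str],
                filter_pattern: Optional[str] = None) -> List[str]:
    """Pick the object keys that a Key (and optional Filter) would fetch."""
    if key and not key.endswith(("/", "*")):
        # Single object: exact match only.
        return [k for k in available if k == key]

    # Prefix matching ("dir/") or trailing wildcard ("dir/log-*").
    prefix = key[:-1] if key.endswith("*") else key
    matched = [k for k in available if k.startswith(prefix)]

    if filter_pattern is not None:
        regex = re.compile(filter_pattern)
        # Assumption: the filter sees only the text after the prefix.
        matched = [k for k in matched if regex.search(k[len(prefix):])]
    return matched


objects = [
    "dir/file-001.txt",
    "dir/log-2023-09-01.txt",
    "dir/log-2023-09-02.txt",
    "dir/log-2023-10-01.txt",
    "other/readme.md",
]
print(select_keys("dir/log-2023-09-*", objects))
```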

Examples

Declarative Examples

When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.

Below is an example of how to define an S3 input source in YAML format.

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
    Target: "/data"

Imperative Examples

When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:

Mount an S3 object to a specific path:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...

Mount an S3 object with a specific endpoint and region:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

Mount an S3 object using long flag names:

bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...

With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.

Credential Requirements

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute nodes must run with the following policies to support S3 input source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix allows fetching all objects within the bucket; replace it with a specific prefix to limit the scope of the policy. You can also list multiple resources in the policy to allow fetching from multiple buckets, or use * to allow fetching from all buckets in the account.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

S3-Compatible Services

This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.

Using Google Cloud Storage

To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:

Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
    Target: "/data"

URL Source Specification

The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the task's execution environment, enabling direct and efficient data utilization.

Source Specification Parameters

Here are the parameters that you can define for a URL input source:

URL (string: <required>): The HTTP/HTTPS URL pointing directly to the file or web content you want to retrieve. The content accessible at this URL will be fetched and made available in the task’s environment.

Example

Below is an example of how to define a URL input source in YAML format.

InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    Target: "/data"

In this setup, the content available at the specified URL is downloaded and stored at the "/data" path within the task's environment. This mechanism ensures that tasks can directly access a broad range of web-based resources, augmenting the adaptability and utility of Bacalhau jobs.
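Since the URL parameter must be an HTTP or HTTPS address, a job specification can be checked before submission with a small helper like the one below. This is a hypothetical pre-flight check written in Python, not validation performed by Bacalhau itself.

```python
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True when url has an http or https scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://example.com/data/file.txt"))
```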

Example (Imperative/CLI)

When using the Bacalhau CLI to define the URL input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the URL input source with various configurations:

Fetch data from an HTTP endpoint and mount it: This command demonstrates fetching data from a specific HTTP URL and mounting it to a designated path within the task's environment.
```
bacalhau docker run -i http://example.com/data.txt ubuntu -- cat /input
```
Fetch data from an HTTPS endpoint and mount it: Similarly, you can fetch data from secure HTTPS URLs. This example fetches a file from a secure URL and mounts it.
```
bacalhau docker run -i https://secure.example.com/data.txt:/data ubuntu -- cat /data
```

S3 Source Specification

Source Specification Parameters

Here are the parameters that you can define for an S3 input source:

Bucket (string: <required>): The name of the S3 bucket where the data is stored.
Key(string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.
Filter(string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region(string: <optional>): The AWS region where the S3 bucket is hosted.
Endpoint(string: <optional>): The endpoint URL of the S3 or S3-compatible service.
VersionID(string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256(string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

Fetching Mechanism

Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*

Examples

Declarative Examples

When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.

Below is an example of how to define an S3 input source in YAML format.

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
  - Target: "/data"

Imperative Examples

Mount an S3 object to a specific path:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...

Mount an S3 object with a specific endpoint and region:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

Mount an S3 object using long flag names:

bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...

With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.

Credential Requirements

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute nodes must run with the following policies to support S3 input source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

ListBucket Permission: The s3:ListBucket permission is required to list the objects in the specified S3 bucket, which is what allows a prefix or wildcard expression to be used as the S3 Key when fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject permission enables fetching object data, and s3:GetObjectVersion enables fetching specific versions of an object.
Resource: The Resource field specifies the Amazon Resource Name (ARN) that each statement applies to: the bucket itself for s3:ListBucket, and the objects within it for the Get permissions. The /* suffix allows fetching any object in the bucket; it can be replaced with a prefix to limit the scope of the policy. You can also list multiple resources to allow fetching from multiple buckets, or use * to allow fetching from all buckets in the account.
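As a minimal sketch of scoping the policy above to a prefix, the following helper builds the same two statements for a given bucket and optional key prefix. The function name s3_input_policy and the sample bucket and prefix are assumptions for illustration; only the policy structure comes from this page.

```python
import json


def s3_input_policy(bucket: str, prefix: str = "") -> dict:
    """Build the read-only policy shown above, optionally limited to a key prefix.

    Hypothetical helper; an empty prefix yields the bucket-wide /* resource.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Fetching applies to object ARNs under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Example: restrict fetching to keys under data/ in a bucket named my-bucket.
print(json.dumps(s3_input_policy("my-bucket", "data/"), indent=4))
```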

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

S3-Compatible Services

Using Google Cloud Storage

To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:

Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
    Target: "/data"

S3 Publisher Specification

Publisher Parameters

Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.
Key (string: <required>): The object key within the specified bucket where the task results will be stored.
Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).
Region (string: <optional>): The region where the S3 bucket is located.

Published Result Spec

Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs through the S3 Input Source. The published result specification includes the following parameters:

Bucket: Confirms the name of the bucket containing the stored results.
Key: Identifies the unique object key within the specified bucket.
Region: Notes the AWS region of the bucket.
Endpoint: Records the endpoint URL for S3-compatible storage services.
VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.
ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.
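One way to use ChecksumSHA256 for an integrity check is to hash a downloaded copy of the object and compare digests. This page does not state how the checksum is encoded, so the sketch below accepts either a hex or a base64 rendering of the digest; that choice, and the function name sha256_matches, are assumptions.

```python
import base64
import hashlib


def sha256_matches(path: str, expected: str) -> bool:
    """Compare a local file's SHA-256 digest against an expected checksum.

    Assumption: the expected value is either hex or base64 encoded.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large result archives do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    raw = digest.digest()
    hex_form = raw.hex()
    b64_form = base64.b64encode(raw).decode("ascii")
    # Hex is case-insensitive; base64 is not.
    return expected.lower() == hex_form or expected == b64_form
```

A typical use would be to call sha256_matches on the downloaded file with the ChecksumSHA256 value taken from the published result, and treat a False result as a corrupted or mismatched download.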

Dynamic Naming

Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:

{executionID}: Replaced with the specific execution ID.
{jobID}: Replaced with the ID of the job.
{nodeID}: Replaced with the ID of the node where the execution took place.
{date}: Replaced with the current date in the format YYYYMMDD.
{time}: Replaced with the current time in the format HHMMSS.

Example

Imagine you've specified the following object key pattern for publishing:

results/{jobID}/{date}/{time}/

Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:

results/abc123/20230926/140530/

Dynamic naming lets you keep published objects in S3 organized, for example by grouping results per job, date, and time, without hard-coding object keys.
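The substitution described above can be sketched as plain string replacement. This is an illustration of the documented behavior, not Bacalhau's implementation; the function name render_key and the execution and node IDs used below are hypothetical.

```python
from datetime import datetime


def render_key(pattern: str, *, job_id: str, execution_id: str,
               node_id: str, now: datetime) -> str:
    """Replace the documented placeholders in an object key pattern."""
    values = {
        "{executionID}": execution_id,
        "{jobID}": job_id,
        "{nodeID}": node_id,
        "{date}": now.strftime("%Y%m%d"),  # YYYYMMDD
        "{time}": now.strftime("%H%M%S"),  # HHMMSS
    }
    for placeholder, value in values.items():
        pattern = pattern.replace(placeholder, value)
    return pattern


key = render_key(
    "results/{jobID}/{date}/{time}/",
    job_id="abc123",
    execution_id="e-0001",  # hypothetical value, unused by this pattern
    node_id="n-0001",       # hypothetical value, unused by this pattern
    now=datetime(2023, 9, 26, 14, 5, 30),
)
print(key)  # results/abc123/20230926/140530/
```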

Examples

Declarative Examples

Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:

Publisher:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"

The results will be compressed into a single object, and the published result specification will look like:

PublishedResult:
  Type: "s3"
  Params:
    Bucket: "my-task-results"
    Key: "task123/result.tar.gz"
    Endpoint: "https://s3.us-west-2.amazonaws.com"
    Region: "us-west-2"
    ChecksumSHA256: "0x9a3a..."
    VersionID: "3/L4kqtJlcpXroDTDmJ+rmDbwQaHWyOb..."

Imperative Examples

The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:

Basic Docker job writing to S3 with default configurations:

```
bacalhau docker run -p s3://bucket/key ubuntu ...
```

This command writes to the S3 bucket using default endpoint and region settings.

Docker job writing to S3 with a specific endpoint and region:

```
bacalhau docker run -p s3://bucket/key,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
```

This command specifies a unique endpoint and region for the S3 bucket.

Using naming placeholders:

```
bacalhau docker run -p s3://bucket/result-{date}-{jobID} ubuntu ...
```

Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.

Credential Requirements

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: AWS credentials can also be read from the credentials file, typically located at ~/.aws/credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute Nodes

Compute nodes must run with the following policies to publish to S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

PutObject Permissions: The s3:PutObject permission is necessary to publish objects to the specified S3 bucket.
Resource: The Resource field specifies the Amazon Resource Name (ARN) of the objects the statement applies to. The /* suffix allows publishing under any prefix within the bucket; it can be replaced with a narrower prefix to limit the scope of the policy. You can also list multiple resources to allow publishing to multiple buckets, or use * to allow publishing to all buckets in the account.
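When narrowing the Resource to a prefix, it can help to sanity-check that the key you plan to publish to is still covered. The sketch below is a deliberately simplified check and not a full IAM evaluator: it ignores Deny statements, conditions, and NotAction/NotResource, and it approximates IAM wildcards with glob matching. The function name policy_allows and the sample bucket and keys are assumptions.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def policy_allows(policy: dict, action: str, resource: str) -> bool:
    """Return True if some Allow statement covers both the action and the resource.

    Simplification: Deny statements and conditions are not considered.
    """
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# The publishing policy above, narrowed to a hypothetical task123/ prefix.
publish_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-task-results/task123/*",
        }
    ],
}

inside = "arn:aws:s3:::my-task-results/task123/result.tar.gz"
outside = "arn:aws:s3:::my-task-results/other/result.tar.gz"
print(policy_allows(publish_policy, "s3:PutObject", inside))   # True
print(policy_allows(publish_policy, "s3:PutObject", outside))  # False
```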

Requester Node

To enable downloading published results using the bacalhau get <job_id> command, the requester node must run with the following policies:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

GetObject Permissions: The s3:GetObject permission is necessary for the requester node to provide a pre-signed URL that the client uses to download the published results.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
