1 of 5

sources

IPFS Source Specification

The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.

Source Specification Parameters

Here are the parameters that you can define for an IPFS input source:

CID (string: <required>): The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.

Example

Below is an example of how to define an IPFS input source in YAML format.

InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjY3fZ"
  - Target: "/data"

In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.

Example (Imperative/CLI)

Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:

Mount an IPFS CID to the default /inputs directory:

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72 ubuntu ...

Mount an IPFS CID to a custom /data directory:

bacalhau docker run -i ipfs://QmeZRGhe4PmjctYVSVHuEiA9oSXnqmYa4kQubSHgWbjv72:/data ubuntu ...

These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.

Local Source Specification

The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.

Source Specification Parameters

Here are the parameters that you can define for a Local input source:

SourcePath (string: <required>): The absolute path on the compute node where the Local or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.
ReadWrite (bool: false): A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified Local or file. If set to false, Bacalhau will have read-only access.

Allow-listing Local Paths

For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the --allow-listed-local-paths flag followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:

:rw - Read-Write access.
:ro - Read-Only access (default if no suffix is provided).

For instance:

Example

Below is an example of how to define a Local input source in YAML format.

In this example, Bacalhau is configured to access the Local "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Adjusting the ReadWrite flag to false would enable read-only access, preventing modifications to the local data from within the Bacalhau task.

Example (Imperative/CLI)

When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:

Mount readonly file to /config:
Mount writable file to default /input:

S3 Source Specification

The S3 Input Source provides a seamless way to utilize data stored in S3 or any S3-compatible storage service as input for Bacalhau jobs. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment. This capability ensures that your tasks have immediate access to the necessary data.

Source Specification Parameters

Here are the parameters that you can define for an S3 input source:

Bucket (string: <required>): The name of the S3 bucket where the data is stored.
Key(string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.
Filter(string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region(string: <optional>): The AWS region where the S3 bucket is hosted.
Endpoint(string: <optional>): The endpoint URL of the S3 or S3-compatible service.
VersionID(string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256(string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

Fetching Mechanism

Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*

Examples

Declarative Examples

When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.

Below is an example of how to define an S3 input source in YAML format.

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
  - Target: "/data"

Imperative Examples

When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:

Mount an S3 object to a specific path:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...

Mount an S3 object with a specific endpoint and region:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

Mount an S3 object using long flag names:

bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...

With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.

Credential Requirements

To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute nodes must run with the following policies to support S3 input source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or * to allow fetching from all buckets in the account.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

S3-Compatible Services

This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.

Using Google Cloud Storage

To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:

Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
  - Target: "/data"

URL Source Specification

The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the task's execution environment, enabling direct and efficient data utilization.

Source Specification Parameters

Here are the parameters that you can define for a URL input source:

URL (string: <required>): The HTTP/HTTPS URL pointing directly to the file or web content you want to retrieve. The content accessible at this URL will be fetched and made available in the task’s environment.

Example

Below is an example of how to define a URL input source in YAML format.

InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    Target: "/data"

In this setup, the content available at the specified URL is downloaded and stored at the "/data" path within the task's environment. This mechanism ensures that tasks can directly access a broad range of web-based resources, augmenting the adaptability and utility of Bacalhau jobs.

Example (Imperative/CLI)

When using the Bacalhau CLI to define the URL input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the URL input source with various configurations:

Fetch data from an HTTP endpoint and mount it: This command demonstrates fetching data from a specific HTTP URL and mounting it to a designated path within the task's environment.
```
bacalhau docker run -i http://example.com/data.txt ubuntu -- cat /input
```
Fetch data from an HTTPS endpoint and mount it: Similarly, you can fetch data from secure HTTPS URLs. This example fetches a file from a secure URL and mounts it.
```
bacalhau docker run -i https://secure.example.com/data.txt:/data ubuntu -- cat /data
```

S3 Source Specification

Source Specification Parameters

Here are the parameters that you can define for an S3 input source:

Bucket (string: <required>): The name of the S3 bucket where the data is stored.
Key(string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.
Filter(string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region(string: <optional>): The AWS region where the S3 bucket is hosted.
Endpoint(string: <optional>): The endpoint URL of the S3 or S3-compatible service.
VersionID(string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256(string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.

Fetching Mechanism

Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*

Examples

Declarative Examples

When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.

Below is an example of how to define an S3 input source in YAML format.

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
        ChecksumSHA256: "e3b0c44b542b..."
  - Target: "/data"

Imperative Examples

Mount an S3 object to a specific path:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...

Mount an S3 object with a specific endpoint and region:

bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...

Mount an S3 object using long flag names:

bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...

With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.

Credential Requirements

Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.

For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.

Required IAM Policies

Compute nodes must run with the following policies to support S3 input source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or * to allow fetching from all buckets in the account.

For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.

S3-Compatible Services

Using Google Cloud Storage

To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:

Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:

InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
  - Target: "/data"