The S3 Input Source lets Bacalhau jobs use data stored in S3, or in any S3-compatible storage service, as input. Users can specify individual files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment, so the data is available to the task as soon as it starts.
Here are the parameters that you can define for an S3 input source:
Bucket (string: <required>)
: The name of the S3 bucket where the data is stored.
Key (string: <optional>)
: The object key or prefix within the bucket. Supports a trailing wildcard for fetching multiple objects with matching prefixes.
Filter (string: <optional>)
: A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region (string: <optional>)
: The AWS region where the S3 bucket is hosted.
Endpoint (string: <optional>)
: The endpoint URL of the S3 or S3-compatible service.
VersionID (string: <optional>)
: The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256 (string: <optional>)
: The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*
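The three key forms above can be sketched as job spec input sources. This is a minimal sketch, assuming the `InputSources` layout (`Source.Type`, `Source.Params`, `Target`) used by Bacalhau job specifications; the bucket name, keys, and target paths are hypothetical placeholders.

```yaml
# Hypothetical bucket, keys, and target paths, for illustration only.
InputSources:
  # Single object: the key names exactly one object.
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "dir/file-001.txt"
    Target: "/inputs/single"
  # Prefix matching: the trailing slash fetches every object under dir/.
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "dir/"
    Target: "/inputs/dir"
  # Wildcard: the trailing * fetches every key that starts with dir/log-2023-09-.
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        Key: "dir/log-2023-09-*"
    Target: "/inputs/logs"
```

VersionID and ChecksumSHA256 would only make sense on the first entry, since they apply to a single object rather than to a prefix or pattern.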
When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.
Below is an example of how to define an S3 input source in YAML format.
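A minimal sketch of such a definition, assuming the `InputSources` layout of the Bacalhau job specification. The bucket name, key, filter pattern, region, and target path are illustrative assumptions, not values taken from this page.

```yaml
# Illustrative values only: replace the bucket, key, and region with your own.
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-bucket"
        # Prefix: fetch objects whose keys start with data/
        Key: "data/"
        # Regex applied to the part of each key after the prefix;
        # here it keeps only keys ending in .csv
        Filter: '.*\.csv$'
        Region: "us-west-2"
        Endpoint: "https://s3.us-west-2.amazonaws.com"
    # Path inside the task's execution environment where the data is mounted.
    Target: "/data"
```

Region and Endpoint are optional; they are spelled out here only to show where they go.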
When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:
Mount an S3 object to a specific path:
Mount an S3 object with a specific endpoint and region:
Mount an S3 object using long flag names:
With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.
To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:
Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
Compute nodes must run with the following policies to support S3 input source:
ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or * to allow fetching from all buckets in the account.
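The permissions above can be sketched as a policy document. IAM itself expects JSON; the sketch below writes the same structure in YAML (the form accepted in, for example, CloudFormation templates) to stay consistent with the other examples here. The bucket name is a hypothetical placeholder. Because s3:ListBucket is a bucket-level action, this sketch attaches it to the bucket ARN without the /* suffix and keeps the object-level actions on the /* resource.

```yaml
# Sketch of an IAM policy document; "my-bucket" is a placeholder.
Version: "2012-10-17"
Statement:
  # Listing is a bucket-level action, so it targets the bucket ARN itself.
  - Effect: "Allow"
    Action:
      - "s3:ListBucket"
    Resource: "arn:aws:s3:::my-bucket"
  # Fetching objects and object versions targets the objects in the bucket.
  - Effect: "Allow"
    Action:
      - "s3:GetObject"
      - "s3:GetObjectVersion"
    Resource: "arn:aws:s3:::my-bucket/*"
```

To narrow the scope, the `/*` suffix could be replaced with a prefix such as `/data/*`.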
For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.
To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:
Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:
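A minimal sketch of a Google Cloud Storage input, assuming the same `InputSources` layout as for AWS S3. The bucket name, key, and target path are hypothetical; only the endpoint value comes from the step above.

```yaml
# Bucket, key, and target are placeholders; the endpoint is the GCS one named above.
InputSources:
  - Source:
      Type: "s3"
      Params:
        Bucket: "my-gcs-bucket"
        Key: "data/"
        Endpoint: "https://storage.googleapis.com"
    Target: "/data"
```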
The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can have the required data, whether a single file or the content of a web page, retrieved and made available in the task's execution environment.
Here are the parameters that you can define for a URL input source:
URL (string: <required>)
: The HTTP/HTTPS URL pointing directly to the file or web content you want to retrieve. The content accessible at this URL will be fetched and made available in the task’s environment.
Below is an example of how to define a URL input source in YAML format.
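A minimal sketch, assuming the `InputSources` layout of the Bacalhau job specification and `urlDownload` as the source type name. The URL is a hypothetical placeholder.

```yaml
# The URL is a placeholder; point it at the file or page you want to fetch.
InputSources:
  - Source:
      Type: "urlDownload"
      Params:
        URL: "https://example.com/data/file.txt"
    # The downloaded content is made available at this path in the task.
    Target: "/data"
```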
In this setup, the content available at the specified URL is downloaded and stored at the "/data" path within the task's environment. This mechanism ensures that tasks can directly access a broad range of web-based resources, augmenting the adaptability and utility of Bacalhau jobs.
When using the Bacalhau CLI to define the URL input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the URL input source with various configurations:
Fetch data from an HTTP endpoint and mount it: This command demonstrates fetching data from a specific HTTP URL and mounting it to a designated path within the task's environment.
Fetch data from an HTTPS endpoint and mount it: Similarly, you can fetch data from secure HTTPS URLs. This example fetches a file from a secure URL and mounts it.
The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.
Here are the parameters that you can define for a Local input source:
SourcePath (string: <required>)
: The absolute path on the compute node where the directory or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.
ReadWrite (bool: false)
: A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified directory or file. If set to false, Bacalhau will have read-only access.
For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the --allow-listed-local-paths flag followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:
:rw - Read-Write access.
:ro - Read-Only access (default if no suffix is provided).
For instance:
Below is an example of how to define a Local input source in YAML format.
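A minimal sketch, assuming the `InputSources` layout of the Bacalhau job specification and `localDirectory` as the source type name. The paths are illustrative, and the source path would also need to be allow-listed on the compute node with read-write permission for this to be accepted.

```yaml
# Illustrative paths; /etc/config must be allow-listed (with :rw) on the compute node.
InputSources:
  - Source:
      Type: "localDirectory"
      Params:
        SourcePath: "/etc/config"
        # true grants read and write access; false (the default) is read-only.
        ReadWrite: true
    Target: "/config"
```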
In this example, Bacalhau is configured to access the local directory "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Setting the ReadWrite flag to false would instead give read-only access, preventing modifications to the local data from within the Bacalhau task.
When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:
Mount a read-only file to /config:
Mount a writable file to the default /inputs:
The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.
Here are the parameters that you can define for an IPFS input source:
CID (string: <required>)
: The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.
Below is an example of how to define an IPFS input source in YAML format.
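A minimal sketch, assuming the `InputSources` layout of the Bacalhau job specification and `ipfs` as the source type name. The CID is deliberately left as a placeholder rather than a real identifier.

```yaml
# Replace the placeholder with the CID of the file or directory you want to mount.
InputSources:
  - Source:
      Type: "ipfs"
      Params:
        CID: "<your-cid>"
    Target: "/data"
```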
In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.
Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:
Mount an IPFS CID to the default /inputs directory:
Mount an IPFS CID to a custom /data directory:
These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.