This section covers the queuing and timeouts for jobs in Bacalhau.
A Job represents a discrete unit of work that can be scheduled and executed. It carries all the necessary information to define the nature of the work, how it should be executed, and the resources it requires.
Job Parameters
Name (string: <optional>): A logical name to refer to the job. Defaults to job ID.
Namespace (string: "default"): The namespace in which the job is running. ClientID is used as a namespace in the public demo network.
Type (string: <required>): The type of the job, such as batch, ops, daemon or service. You can learn more about the supported job types in the Job Types guide.
Priority (int: 0): Determines the scheduling priority.
Count (int: <required>): Number of replicas to be scheduled. This is only applicable for jobs of type batch and service.
Meta (Meta: nil): Arbitrary metadata associated with the job.
Labels (Label[]: nil): Arbitrary labels associated with the job for filtering purposes.
Constraints (Constraint[]: nil): These are selectors which must be true for a compute node to run this job.
Tasks (Task[]: <required>): The tasks associated with the job, each defining a unit of work within the job. Currently only a single task per job is supported, with plans to extend this in the future.
The following parameters are generated by the server and should not be set directly.
ID (string): A unique identifier assigned to this job. It's auto-generated by the server and should not be set directly. Used for distinguishing between jobs with similar names.
State (State): Represents the current state of the job.
Version (int): A monotonically increasing version number incremented on job specification update.
Revision (int): A monotonically increasing revision number incremented on each update to the job's state or specification.
CreateTime (int): Timestamp of job creation.
ModifyTime (int): Timestamp of last job modification.
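The constraints stated in the job parameter list can be summarized as a small check. The following is only a sketch in Python: the function name `validate_job` and the use of a plain dictionary are assumptions made for illustration, and the checks mirror the parameter descriptions rather than Bacalhau's own validation code.

```python
# Illustrative only: these checks mirror the parameter descriptions in the text,
# not Bacalhau's actual validation logic.
JOB_TYPES = {"batch", "ops", "daemon", "service"}
COUNT_TYPES = {"batch", "service"}  # Count only applies to these job types


def validate_job(job: dict) -> list[str]:
    """Return a list of human-readable problems found in a job dictionary."""
    problems = []
    job_type = job.get("Type")
    if job_type not in JOB_TYPES:
        problems.append(f"Type must be one of {sorted(JOB_TYPES)}")
    if job_type in COUNT_TYPES and "Count" not in job:
        problems.append("Count is required for batch and service jobs")
    tasks = job.get("Tasks") or []
    if len(tasks) != 1:
        problems.append("exactly one task per job is currently supported")
    return problems
```

A job such as `{"Type": "batch", "Count": 1, "Tasks": [{"Name": "main"}]}` passes these checks and yields an empty list.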
The IPFS Publisher in Bacalhau amplifies the versatility of task result storage by integrating with the InterPlanetary File System (IPFS). IPFS is a protocol and network designed to create a peer-to-peer method of storing and sharing hypermedia in a distributed file system. Bacalhau's seamless integration with IPFS ensures that users have a decentralized option for publishing their task results, enhancing accessibility and resilience while reducing dependence on a single point of failure.
IPFS Publisher Parameters
For the IPFS publisher, no specific parameters need to be defined in the publisher specification. The user only needs to indicate the publisher type as IPFS, and Bacalhau handles the rest. Here is an example of how to set up an IPFS Publisher in a job specification.
Once the job is executed, the results are published to IPFS, and a unique CID (Content Identifier) is generated for each file or piece of data. This CID acts as an address to the file in the IPFS network and can be used to access the file globally.
CID (string): This is the unique content identifier generated by IPFS, which can be used to access the published content from anywhere in the world. Every data piece stored on IPFS has its unique CID. Here's a sample of how the published result might appear:
In this example, the task results will be stored in IPFS, and can be referenced and retrieved using the specified CID. This is indicative of Bacalhau's commitment to offering flexible, reliable, and decentralized options for result storage, catering to a diverse set of user needs and preferences.
Docker Engine is one of the execution engines supported in Bacalhau. It allows users to run tasks inside Docker containers, offering an isolated and consistent environment for execution. Below are the parameters to configure the Docker Engine.
Docker Engine Parameters
Image (string: <required>): Specifies the Docker image to use for task execution. It should be an image that can be pulled by Docker.
Entrypoint (string[]: <optional>): Allows overriding the default entrypoint set in the Docker image. Each string in the array represents a segment of the entrypoint command.
Parameters (string[]: <optional>): Additional command-line arguments to be included in the container’s startup command, appended after the entrypoint.
EnvironmentVariables (string[]: <optional>): Sets environment variables within the Docker container during task execution. Each string should be formatted as KEY=value.
WorkingDirectory (string: <optional>): Sets the path inside the container where the task executes. If not specified, it defaults to the working directory defined in the Docker image.
Here’s an example of configuring the Docker Engine within a job or task using YAML:
In this example, the task will be executed inside an Ubuntu 20.04 Docker container. The entrypoint is overridden to execute a bash shell that runs an echo command. An environment variable MY_ENV_VAR is set with the value myvalue, and the working directory inside the container is set to /app.
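The Docker engine takes environment variables as a list of KEY=value strings, whereas the WASM engine described later takes a key-to-value mapping. A minimal Python sketch of converting between the two shapes is shown here; the helper names `to_docker_env` and `from_docker_env` are hypothetical and are not part of Bacalhau.

```python
def to_docker_env(variables: dict[str, str]) -> list[str]:
    """Format a mapping as the KEY=value strings the Docker engine expects."""
    return [f"{key}={value}" for key, value in variables.items()]


def from_docker_env(entries: list[str]) -> dict[str, str]:
    """Parse KEY=value strings; only the first '=' separates key from value."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=value, got {entry!r}")
        parsed[key] = value
    return parsed
```

For the example above, `to_docker_env({"MY_ENV_VAR": "myvalue"})` produces the single entry `MY_ENV_VAR=myvalue`.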
The WASM Engine in Bacalhau allows tasks to be executed in a WebAssembly environment, offering compatibility and speed. This engine supports WASM and WASI (WebAssembly System Interface) jobs, making it highly adaptable for various use cases. Below are the parameters for configuring the WASM Engine.
WASM Engine Parameters
EntryModule (InputSource: required): Specifies the WASM module that contains the start function or the main execution code of the task. The InputSource should point to the location of the WASM binary.
Entrypoint (string: <optional>): The name of the function within the EntryModule to execute. For WASI jobs, this should typically be _start. The entrypoint function should have zero parameters and zero results.
Parameters (string[]: <optional>): An array of strings containing arguments that will be supplied to the program as ARGV. This allows parameterized execution of the WASM task.
EnvironmentVariables (map[string]string: <optional>): A mapping of environment variable keys to their values, made available within the executing WASM environment.
ImportModules (InputSource[]: optional): An array of InputSources pointing to additional WASM modules. The exports from these modules will be available as imports to the EntryModule, enabling modular and reusable WASM code.
Here’s a sample configuration of the WASM Engine within a task, expressed in YAML:
In this example, the task is configured to run in a WASM environment. The EntryModule is fetched from an S3 bucket, the entrypoint is _start, and parameters and environment variables are passed into the WASM environment. Additionally, an ImportModule is loaded from a local directory, making its exports available to the EntryModule.
A Task signifies a distinct unit of work within the broader context of a Job. It defines how the task should be executed, where the results should be published, and which environment variables are needed, among other configurations.
Task Parameters
Name (string: <required>): A unique identifier representing the name of the task.
Engine (SpecConfig: required): Configures the execution engine for the task, such as Docker or WebAssembly.
Publisher (SpecConfig: optional): Specifies where the results of the task should be published, such as S3 and IPFS publishers. Only applicable for tasks of type batch and ops.
Env (map[string]string: optional): A set of environment variables for the driver.
Meta (Meta: optional): Allows association of arbitrary metadata with this task.
InputSources (InputSource[]: optional): Lists remote artifacts that should be downloaded before task execution and mounted within the task, such as from S3 or HTTP/HTTPS.
ResultPaths (ResultPath[]: optional): Indicates volumes within the task that should be included in the published result. Only applicable for tasks of type batch and ops.
Resources (Resources: optional): Details the resources that this task requires.
Network (Network: optional): Configurations related to the networking aspects of the task.
Timeouts (Timeouts: optional): Configurations concerning any timeouts associated with the task.
Bacalhau's Local Publisher stores task results on the compute node, which makes them easy to access and retrieve when testing or trying out Bacalhau.
The Local Publisher should not be used in production, as it is not a reliable storage option. For production use, we recommend a more reliable option such as an S3-compatible storage service.
The local publisher requires no specific parameters to be defined in the publisher specification. The user only needs to indicate the publisher type as "local", and Bacalhau handles the rest. Here is an example of how to set up a Local Publisher in a job specification.
Once the job is executed, the results are published to the local compute node and stored as a compressed tar file, which can be accessed and retrieved over HTTP from the command line using the get command. This will download and extract the contents for the user from the remote compute node.
URL (string): This is the HTTP URL to the results of the computation, which is hosted on the compute node where it ran. Here's a sample of how the published result might appear:
In this example, the task results will be stored on the compute node, and can be referenced and retrieved using the specified URL.
By default, the compute node will attempt to use a public address for the HTTP server delivering task output, but there is no guarantee that the compute node is accessible on that address. If the compute node is behind a NAT or firewall, the user may need to manually specify the address to use for the HTTP server in the config.yaml file.
There is no lifecycle management for the content stored on the compute node. The user is responsible for managing the content and ensuring that it is removed when no longer needed before the compute node runs out of disk space.
If the address/port of the compute node changes, then previously stored content will no longer be accessible. The user will need to manually update the address in the config.yaml file and re-publish the content to make it accessible again.
The different job types available in Bacalhau
Bacalhau has recently introduced different job types in v1.1, providing more control and flexibility over the orchestration and scheduling of those jobs - depending on their type.
Despite the differences in job types, all jobs benefit from core functionalities provided by Bacalhau, including:
Node selection - the appropriate nodes are selected based on several criteria, including resource availability, priority and feedback from the nodes.
Job monitoring - jobs are monitored to ensure they complete, and that they stay in a healthy state.
Retries - within limits, Bacalhau will retry certain jobs a set number of times should they fail to complete successfully when requested.
Batch jobs are executed on demand, running on a specified number of Bacalhau nodes. These jobs either run until completion or until they reach a timeout. They are designed to carry out a single, discrete task before finishing. This was the only job type supported before v1.1.
Batch jobs are ideal for intermittent yet intensive data dives, for instance performing computation over large datasets before publishing the response. This approach eliminates the continuous processing overhead, focusing on specific, in-depth investigations and computation.
Similar to batch jobs, ops jobs have a broader reach. They are executed on all nodes that align with the job specification, but otherwise behave like batch jobs.
Ops jobs are perfect for urgent investigations, granting direct access to logs on host machines, where previously you may have had to wait for the logs to arrive at a central location before being able to query them. They can also be used for delivering configuration files for other systems should you wish to deploy an update to many machines at once.
Daemon jobs run continuously on all nodes that meet the criteria given in the job specification. Should any new compute nodes join the cluster after the job was started, and should they meet the criteria, the job will be scheduled to run on that node too.
A good application of daemon jobs is to handle continuously generated data on every compute node. This might come from edge devices like sensors or cameras, or from logs where they are generated. The data can then be aggregated and compressed before being sent onwards. For logs, the aggregated data can be relayed at regular intervals to platforms like Kafka or Kinesis, or directly to other logging services, with edge devices potentially delivering results via MQTT.
Service jobs run continuously on a specified number of nodes that meet the criteria given in the job specification. Bacalhau's orchestrator selects the optimal nodes to run the job and continuously monitors its health and performance. If required, it will reschedule the job on other nodes.
This job type is good for long-running consumers such as streaming or queuing services, or real-time event listeners.
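The main scheduling difference between the four job types is how many of the matching nodes run the job. The sketch below restates that rule in Python. The function `target_nodes` is hypothetical, and taking the first `count` nodes is a simplification: as described above, the real orchestrator selects nodes based on resource availability, priority and feedback from the nodes.

```python
def target_nodes(job_type: str, matching_nodes: list[str], count: int = 1) -> list[str]:
    """Pick the nodes a job would run on, given nodes that already match its spec."""
    if job_type in ("ops", "daemon"):
        # ops and daemon jobs run on every node that matches the job specification
        return list(matching_nodes)
    if job_type in ("batch", "service"):
        # batch and service jobs run on a specified number of nodes
        return matching_nodes[:count]
    raise ValueError(f"unknown job type: {job_type}")
```

The difference between the members of each pair is lifetime rather than placement: batch and ops jobs run to completion or timeout, while service and daemon jobs run continuously.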
This example shows a sample Daemon job description with all available parameters.
This example shows a sample Service job description with all available parameters.
The S3 Input Source provides a seamless way to utilize data stored in S3 or any S3-compatible storage service as input for Bacalhau jobs. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment. This capability ensures that your tasks have immediate access to the necessary data.
Here are the parameters that you can define for an S3 input source:
Bucket (string: <required>): The name of the S3 bucket where the data is stored.
Key (string: <optional>): The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes.
Filter (string: <optional>): A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix.
Region (string: <optional>): The AWS region where the S3 bucket is hosted.
Endpoint (string: <optional>): The endpoint URL of the S3 or S3-compatible service.
VersionID (string: <optional>): The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
ChecksumSHA256 (string: <optional>): The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g. s3://myBucket/dir/file-001.txt
Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g. s3://myBucket/dir/
Wildcard: Supports a trailing wildcard (*). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g. s3://myBucket/dir/log-2023-09-*
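The three key forms above, together with the Filter parameter, can be illustrated with a short Python sketch that works on an in-memory list of object keys. The function `select_keys` is hypothetical and does not call S3. Using `re.search` for the filter is an assumption, since the text does not say whether the pattern is anchored.

```python
import re
from typing import Optional


def select_keys(object_keys: list[str], key: str,
                filter_pattern: Optional[str] = None) -> list[str]:
    """Return the object keys that a Key (and optional Filter) would select."""
    if key.endswith("*"):
        prefix = key[:-1]          # trailing wildcard: match on the prefix
    elif key.endswith("/"):
        prefix = key               # trailing slash: treat as a "directory" prefix
    else:
        # single object: exact match only
        return [k for k in object_keys if k == key]

    selected = [k for k in object_keys if k.startswith(prefix)]
    if filter_pattern is not None:
        regex = re.compile(filter_pattern)
        # the filter is applied to the part of the key after the prefix
        selected = [k for k in selected if regex.search(k[len(prefix):])]
    return selected
```

With keys `dir/file-001.txt`, `dir/log-2023-09-01.txt` and `dir/log-2023-10-01.txt`, the key `dir/log-2023-09-*` selects only the September log.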
When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.
Below is an example of how to define an S3 input source in YAML format.
When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:
Mount an S3 object to a specific path:
Mount an S3 object with a specific endpoint and region:
Mount an S3 object using long flag names:
With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.
To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:
Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
Compute nodes must run with the following policies to support S3 input source:
ListBucket Permission: The s3:ListBucket permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching.
GetObject and GetObjectVersion Permissions: The s3:GetObject and s3:GetObjectVersion permissions enable the fetching of object data and its versions, respectively.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or * to allow fetching from all buckets in the account.
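As a rough illustration of the permissions just listed, the Python sketch below builds an IAM policy document as a dictionary and prints it as JSON. The bucket name `my-bucket` and the function `s3_input_policy` are placeholders. The policy lists both the bucket ARN and the `/*` object ARN because, in AWS IAM, s3:ListBucket is evaluated against the bucket while s3:GetObject and s3:GetObjectVersion are evaluated against objects.

```python
import json


def s3_input_policy(bucket: str) -> dict:
    """Build an IAM policy granting the read permissions described in the text."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:GetObjectVersion"],
                # bucket ARN for listing, bucket/* for fetching objects
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(s3_input_policy("my-bucket"), indent=2))
```

Replacing `/*` with a narrower prefix such as `/datasets/*` would limit fetching to objects under that prefix.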
For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.
To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:
Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to https://storage.googleapis.com, as shown in the example below:
The Local input source allows Bacalhau jobs to access files and directories that are already present on the compute node. This is especially useful for utilizing locally stored datasets, configuration files, logs, or other necessary resources without the need to fetch them from a remote source, ensuring faster job initialization and execution.
Here are the parameters that you can define for a Local input source:
SourcePath (string: <required>): The absolute path on the compute node where the directory or file is located. Bacalhau will access this path to read data, and if permitted, write data as well.
ReadWrite (bool: false): A boolean flag that, when set to true, gives Bacalhau both read and write access to the specified directory or file. If set to false, Bacalhau will have read-only access.
For security reasons, direct access to local paths must be explicitly allowed when running the Bacalhau compute node. This is achieved using the --allow-listed-local-paths flag followed by a comma-separated list of the paths, or path patterns, that should be accessible. Each path can be suffixed with permissions as well:
:rw - Read-Write access.
:ro - Read-Only access (default if no suffix is provided).
Check the default settings on your server: a path may be set to :ro, which leads to an error when the job requires write access.
For instance:
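The suffix convention for allow-listed paths can be sketched in Python as follows. The function `parse_allow_list` and the sample paths are hypothetical. The sketch only splits the flag value into path and permission pairs, and does not reproduce how Bacalhau itself matches path patterns.

```python
def parse_allow_list(value: str) -> list[tuple[str, str]]:
    """Split a comma-separated allow-list into (path, permission) pairs."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith(":rw"):
            entries.append((item[:-3], "rw"))
        elif item.endswith(":ro"):
            entries.append((item[:-3], "ro"))
        else:
            # no suffix: read-only is the default
            entries.append((item, "ro"))
    return entries
```

For example, the value `/etc/config:rw,/data/**` yields read-write access for `/etc/config` and read-only access for the `/data/**` pattern.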
Below is an example of how to define a Local input source in YAML format.
In this example, Bacalhau is configured to access the local directory "/etc/config" on the compute node. The contents of this directory are made available at the "/config" path within the task's environment, with read and write access. Setting the ReadWrite flag to false would enable read-only access, preventing modifications to the local data from within the Bacalhau task.
When using the Bacalhau CLI to define the local input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the local input source with various configurations:
Mount a read-only file to /config:
Mount a writable file to default /input:
The URL Input Source provides a straightforward method for Bacalhau jobs to access and incorporate data available over HTTP/HTTPS. By specifying a URL, users can ensure the required data, whether a single file or a web page content, is retrieved and prepared in the task's execution environment, enabling direct and efficient data utilization.
Here are the parameters that you can define for a URL input source:
URL (string: <required>): The HTTP/HTTPS URL pointing directly to the file or web content you want to retrieve. The content accessible at this URL will be fetched and made available in the task’s environment.
Below is an example of how to define a URL input source in YAML format.
In this setup, the content available at the specified URL is downloaded and stored at the "/data" path within the task's environment. This mechanism ensures that tasks can directly access a broad range of web-based resources, augmenting the adaptability and utility of Bacalhau jobs.
When using the Bacalhau CLI to define the URL input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the URL input source with various configurations:
Fetch data from an HTTP endpoint and mount it: This command demonstrates fetching data from a specific HTTP URL and mounting it to a designated path within the task's environment.
Fetch data from an HTTPS endpoint and mount it: Similarly, you can fetch data from secure HTTPS URLs. This example fetches a file from a secure URL and mounts it.
The IPFS Input Source enables users to easily integrate data hosted on the InterPlanetary File System (IPFS) into Bacalhau jobs. By specifying the Content Identifier (CID) of the desired IPFS file or directory, users can have the content fetched and made available in the task's execution environment, ensuring efficient and decentralized data access.
Here are the parameters that you can define for an IPFS input source:
CID (string: <required>): The Content Identifier that uniquely pinpoints the file or directory on the IPFS network. Bacalhau retrieves the content associated with this CID for use in the task.
Below is an example of how to define an IPFS input source in YAML format.
In this configuration, the data associated with the specified CID is fetched from the IPFS network and made available in the task's environment at the "/data" path.
Utilizing IPFS as an input source in Bacalhau via the CLI is straightforward. Below are example commands that demonstrate how to define the IPFS input source:
Mount an IPFS CID to the default /inputs directory:
Mount an IPFS CID to a custom /data directory:
These commands provide a seamless mechanism to fetch and mount data from IPFS directly into your task's execution environment using the Bacalhau CLI.
The Network object offers a method to specify the networking requirements of a Task. It defines the scope and constraints of the network connectivity based on the demands of the task.
Network Parameters:
Type (string: "None"): Indicates the network configuration's nature. There are several network modes available:
None: This mode implies that the task does not necessitate any networking capabilities.
Full: Specifies that the task mandates unrestricted, raw IP networking without any imposed filters.
HTTP: This mode constrains the task to only require HTTP networking with specific domains. In this model:
The job specifier puts forward a job, stipulating the domain(s) it intends to communicate with.
The compute provider assesses the inherent risk of the job based on these domains and bids accordingly.
At runtime, the network traffic remains strictly confined to the designated domain(s).
A typical command for this might resemble: bacalhau docker run --network=http --domain=crates.io --domain=github.com -i ipfs://Qmy1234myd4t4,dst=/code rust/compile
The primary risks for the compute provider center around possible violations of its terms, its hosting provider's terms, or even prevailing laws in its jurisdiction. This encompasses issues such as unauthorized access or distribution of illicit content and potential cyber-attacks.
Conversely, the job specifier's primary risk involves operating in a paid environment. External entities might seek to exploit this environment, for instance, through a compromised package download that initiates a crypto mining operation, depleting the allocated, prepaid job time. By limiting traffic strictly to the pre-specified domains, the potential for such cyber threats diminishes considerably.
While a compute provider might impose its limits through other means, having domains declared upfront allows it to selectively bid on jobs that it can execute without issues, improving the user experience for job specifiers.
Domains (string[]: <optional>): A list of domain strings, relevant primarily when the Type is set to HTTP. It dictates the specific domains the task can communicate with over HTTP.
Understanding and utilizing these configurations aptly can ensure that tasks are executed in an environment that aligns with their networking requirements, bolstering efficiency and security.
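The effect of the three network modes on an outbound request can be sketched in Python. This is an illustration of the policy described above, not of how Bacalhau enforces it. The function `is_allowed` is hypothetical, and matching only exact hostnames (rather than subdomains) is an assumption.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, network_type: str, domains: list[str]) -> bool:
    """Decide whether a request to `url` fits the task's declared network mode."""
    if network_type == "Full":
        return True                      # unrestricted networking
    if network_type == "None":
        return False                     # no networking at all
    if network_type != "HTTP":
        raise ValueError(f"unknown network type: {network_type}")

    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False                     # HTTP mode only covers HTTP(S) traffic
    host = (parts.hostname or "").lower()
    # assumption: exact hostname match against the declared domains
    return host in {d.lower() for d in domains}
```

With the domains `crates.io` and `github.com` declared, a request to `https://crates.io/api` passes and a request to any other host is rejected.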
Bacalhau's S3 Publisher provides users with a secure and efficient method to publish task results to any S3-compatible storage service. This publisher supports not just AWS S3, but other S3-compatible services offered by cloud providers like Google Cloud Storage and Azure Blob Storage, as well as open-source options like MinIO. The integration is designed to be highly flexible, ensuring users can choose the storage option that aligns with their needs, privacy preferences, and operational requirements.
Bucket (string: <required>): The name of the S3 bucket where the task results will be stored.
Key (string: <required>): The object key within the specified bucket where the task results will be stored.
Endpoint (string: <optional>): The endpoint URL of the S3 service (useful for S3-compatible services).
Region (string: <optional>): The region where the S3 bucket is located.
Results published to S3 are stored as objects that can also be used as inputs to other Bacalhau jobs by using S3 Input Source. The published result specification includes the following parameters:
Bucket: Confirms the name of the bucket containing the stored results.
Key: Identifies the unique object key within the specified bucket.
Region: Notes the AWS region of the bucket.
Endpoint: Records the endpoint URL for S3-compatible storage services.
VersionID: The version ID of the stored object, enabling versioning support for retrieving specific versions of stored data.
ChecksumSHA256: The SHA-256 checksum of the stored object, providing a method to verify data integrity.
With the S3 Publisher in Bacalhau, you have the flexibility to use dynamic naming for the objects you publish to S3. This allows you to incorporate specific job and execution details into the object key, making it easier to trace, manage, and organize your published artifacts.
Bacalhau supports the following dynamic placeholders that will be replaced with their actual values during the publishing process:
{executionID}: Replaced with the specific execution ID.
{jobID}: Replaced with the ID of the job.
{nodeID}: Replaced with the ID of the node where the execution took place.
{date}: Replaced with the current date in the format YYYYMMDD.
{time}: Replaced with the current time in the format HHMMSS.
Additionally, if you are publishing an archive and the object key does not end with .tar.gz, it will be automatically appended. Conversely, if you're not archiving and the key doesn't end with a /, a trailing slash will be added.
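The suffix rule for object keys can be written as a two-branch function. The Python sketch below is only a restatement of that rule. The name `normalize_key` and the `archive` flag are hypothetical.

```python
def normalize_key(key: str, archive: bool) -> str:
    """Apply the suffix rule: .tar.gz for archives, a trailing slash otherwise."""
    if archive:
        return key if key.endswith(".tar.gz") else key + ".tar.gz"
    return key if key.endswith("/") else key + "/"
```

An archive key such as `results/run1` becomes `results/run1.tar.gz`, while the same key without archiving becomes `results/run1/`.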
Example
Imagine you've specified the following object key pattern for publishing:
Given a job with ID abc123, executed on 2023-09-26 at 14:05:30, the published object key would be:
This dynamic naming feature offers a powerful way to create organized, intuitive naming conventions for your Bacalhau published objects in S3.
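To make the substitution concrete, here is a minimal Python sketch of placeholder expansion. The function `expand_key`, the sample IDs and the key pattern used in the usage note are all hypothetical. Only the placeholder names and the date and time formats come from the list above.

```python
from datetime import datetime


def expand_key(pattern: str, *, job_id: str, execution_id: str,
               node_id: str, now: datetime) -> str:
    """Replace the supported placeholders in an object key pattern."""
    values = {
        "{executionID}": execution_id,
        "{jobID}": job_id,
        "{nodeID}": node_id,
        "{date}": now.strftime("%Y%m%d"),   # YYYYMMDD
        "{time}": now.strftime("%H%M%S"),   # HHMMSS
    }
    for placeholder, value in values.items():
        pattern = pattern.replace(placeholder, value)
    return pattern
```

With the illustrative pattern `results/{jobID}/{date}-{time}`, job ID `abc123` and the timestamp 2023-09-26 14:05:30, the expanded key is `results/abc123/20230926-140530`.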
Here’s an example YAML configuration that outlines the process of using the S3 Publisher with Bacalhau:
In this configuration, task results will be published to the specified S3 bucket and object key. If you’re using an S3-compatible service, simply update the Endpoint parameter with the appropriate URL.
The results will be compressed into a single object, and the published result specification will look like:
The Bacalhau command-line interface (CLI) provides an imperative approach to specify the S3 Publisher. Below are a few examples showcasing how to define an S3 publisher using CLI commands:
Basic Docker job writing to S3 with default configurations:
This command writes to the S3 bucket using default endpoint and region settings.
Docker job writing to S3 with a specific endpoint and region:
This command specifies a unique endpoint and region for the S3 bucket.
Using naming placeholders:
Dynamic naming placeholders like {date} and {jobID} allow for organized naming structures, automatically replacing these placeholders with appropriate values upon execution.
Remember to replace the placeholders like bucket, key, and other parameters with your specific values. These CLI commands offer a quick and customizable way to submit jobs and specify how the results should be published to S3.
To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:
Environment variables: AWS credentials can be specified using AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
Credentials file: The credentials file typically located at ~/.aws/credentials can also be used to fetch the necessary AWS credentials.
IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
Compute nodes must run with the following policies to publish to S3:
PutObject Permissions: The s3:PutObject permission is necessary to publish objects to the specified S3 bucket.
Resource: The Resource field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The /* suffix is necessary to allow publishing with any prefix within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow publishing to multiple buckets, or * to allow publishing to all buckets in the account.
To enable downloading published results using the bacalhau job get <job_id> command, the requester node must run with the following policies:
GetObject Permissions: The s3:GetObject permission is necessary for the requester node to provide a pre-signed URL that the client uses to download the published results.
For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
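The two permission sets described above can be sketched as IAM policy documents. This is a minimal illustration built as Python dictionaries, not a policy shipped with Bacalhau: the bucket name `my-bucket`, the `results/` prefix, and the helper `s3_policy` are hypothetical and chosen only for the example.

```python
import json


def s3_policy(action: str, bucket: str, prefix: str = "") -> dict:
    """Build a minimal IAM policy document allowing one S3 action on a bucket."""
    # "/*" allows any key in the bucket; a non-empty prefix narrows the scope.
    resource = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": [action], "Resource": [resource]}
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
publish_policy = s3_policy("s3:PutObject", "my-bucket")
download_policy = s3_policy("s3:GetObject", "my-bucket", prefix="results/")

print(json.dumps(publish_policy, indent=2))
```

Passing a prefix limits the policy to keys under that prefix, which corresponds to replacing the `/*` suffix as described above.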
The Resources
object provides a structured way to detail the computational resources a Task
requires. By specifying these requirements, you ensure that the task is scheduled on a node with adequate resources, optimizing performance and avoiding potential issues linked to resource constraints.
Resources
Parameters:
CPU (string: <optional>)
: Defines the CPU resources required for the task. Units can be specified in cores (e.g., 2
for 2 CPU cores) or in milliCPU units (e.g., 250m
or 0.25
for 250 milliCPU units). For instance, if you have half a CPU core, you can represent it as 500m
or 0.5
.
Memory (string: <optional>)
: Specifies the amount of RAM needed for the task. You can specify the memory in various units such as:
Kb
for Kilobytes
Mb
for Megabytes
Gb
for Gigabytes
Tb
for Terabytes
Disk (string: <optional>)
: States the disk storage space needed for the task. Similarly, the disk space can be expressed in units like Gb
for Gigabytes, Mb
for Megabytes, and so on. As an example, 10Gb
indicates 10 Gigabytes of storage space.
GPU (string: <optional>)
: Denotes the number of GPU units required. For example, 2
signifies the requirement of 2 GPU units. This is crucial for tasks involving heavy computational processes, machine learning models, or tasks that leverage GPU acceleration.
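To make the unit conventions above concrete, here is a small sketch of how such strings could be interpreted. It is not Bacalhau's parser: the helper names are hypothetical, and the use of decimal multiples (1 Kb = 1000 bytes) is an assumption made for the example.

```python
import re

# Assumption: decimal multiples (1 Kb = 1000 bytes). The real parser may differ.
_UNITS = {"kb": 1000, "mb": 1000**2, "gb": 1000**3, "tb": 1000**4}


def parse_cpu(value: str) -> float:
    """Convert a CPU string such as "2", "0.5" or "250m" to a number of cores."""
    value = value.strip()
    if value.endswith("m"):
        # milliCPU: 1000m equals one full core
        return int(value[:-1]) / 1000
    return float(value)


def parse_size(value: str) -> int:
    """Convert a memory or disk string such as "10Gb" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2).lower()])


print(parse_cpu("250m"), parse_cpu("0.5"), parse_size("10Gb"))
```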
A ResultPath
denotes a specific location within a Task
that contains meaningful output or results. By specifying a ResultPath
, you can pinpoint which files or directories are essential and should be retained or published after the task's execution.
ResultPath
Parameters:
Name: A descriptive label or identifier for the result, allowing for easier referencing and understanding of the output's nature or significance.
Path: Specifies the exact location, either a file or a directory, within the task's environment where the result or output is stored. This ensures that after the task completes, the critical data at this path can be accessed, retained, or published as necessary.
A Constraint
represents a condition that must be met for a compute node to be eligible to run a given job. Operators can define node labels manually when starting a node with the bacalhau serve command. Bacalhau also supports automatic resource detection and dynamic labeling.
By defining constraints, you can ensure that jobs are scheduled on nodes that have the necessary requirements or conditions.
Constraint
Parameters:
Key: The name of the attribute or property to check on the compute node. This could be anything from a specific hardware feature, operating system version, or any other node property.
Operator: Determines the kind of comparison to be made against the Key
's value, which can be:
in
: Checks if the Key's value exists within the provided list of values.
notin
: Ensures the Key's value doesn't match any in the provided list of values.
exists
: Verifies that a value for the specified Key is present, regardless of its actual value.
!
: Confirms the absence of the specified Key (i.e., DoesNotExist).
gt
: Assesses if the Key's value is greater than the provided value.
lt
: Assesses if the Key's value is less than the provided value.
=
& ==
: Both are used to compare the Key's value for an exact match with the provided value.
!=
: Ensures the Key's value is not the same as the provided value.
Values (optional): A list of values that the node attribute, specified by the Key
, is compared against using the Operator
. This is not needed for operators like exists
or !
.
Consider a scenario where a job should only run on nodes with a GPU and an operating system version greater than 2.0
. The constraints for such a requirement might look like:
In this example, the first constraint checks that the node has a GPU, the second ensures the OS is linux, and the third requires the node to be deployed in eu-west-1 or eu-west-2.
Constraints are evaluated as a logical AND, meaning all constraints must be satisfied for a node to be eligible.
Using too many specific constraints can lead to a job not being scheduled if no nodes satisfy all the conditions.
It's essential to balance the specificity of constraints with the broader needs and resources available in the cluster.
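The logical-AND evaluation described above can be illustrated with a short sketch. It is a simplified model rather than Bacalhau's scheduler code: the `Constraint` class, the label names, and the choices to compare `gt`/`lt` numerically and to let `notin` and `!=` match when the key is absent are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    key: str
    operator: str
    values: list[str] = field(default_factory=list)


def matches(c: Constraint, labels: dict[str, str]) -> bool:
    """Check a single constraint against a node's labels."""
    present = c.key in labels
    value = labels.get(c.key)
    if c.operator == "exists":
        return present
    if c.operator == "!":
        return not present
    # Assumption: negative operators also match when the key is absent.
    if c.operator == "notin":
        return value not in c.values
    if c.operator == "!=":
        return value != c.values[0]
    if not present:
        return False
    if c.operator == "in":
        return value in c.values
    if c.operator in ("=", "=="):
        return value == c.values[0]
    if c.operator in ("gt", "lt"):
        # Assumption: values are compared as numbers.
        left, right = float(value), float(c.values[0])
        return left > right if c.operator == "gt" else left < right
    raise ValueError(f"unknown operator: {c.operator!r}")


def eligible(constraints: list[Constraint], labels: dict[str, str]) -> bool:
    """A node is eligible only if every constraint matches (logical AND)."""
    return all(matches(c, labels) for c in constraints)


# Hypothetical constraints and node labels, for illustration only.
constraints = [
    Constraint("gpu", "exists"),
    Constraint("os", "=", ["linux"]),
    Constraint("region", "in", ["eu-west-1", "eu-west-2"]),
]
gpu_node = {"gpu": "1", "os": "linux", "region": "eu-west-1"}
cpu_node = {"os": "linux", "region": "eu-west-1"}

print(eligible(constraints, gpu_node), eligible(constraints, cpu_node))
```

Because every constraint must hold, each added constraint can only shrink the set of eligible nodes, which is why overly specific constraint lists risk leaving a job unscheduled.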
The Labels
block within a Job
specification plays a crucial role in Bacalhau, serving as a mechanism for filtering jobs. By attaching specific labels to jobs, users can quickly and effectively filter and manage jobs via both the Command Line Interface (CLI) and Application Programming Interface (API) based on various criteria.
Labels
Parameters
Labels are essentially key-value pairs attached to jobs, allowing for detailed categorization and filtering. Each label consists of a Key
and a Value
. These labels can be filtered using operators to pinpoint specific jobs fitting certain criteria.
Jobs can be filtered using the following operators:
in
: Checks if the key's value matches any within a specified list of values.
notin
: Validates that the key's value isn’t within a provided list of values.
exists
: Checks for the presence of a specified key, regardless of its value.
!
: Validates the absence of a specified key. (i.e., DoesNotExist)
gt
: Checks if the key's value is greater than a specified value.
lt
: Checks if the key's value is less than a specified value.
= & ==
: Used for exact match comparisons between the key’s value and a specified value.
!=
: Validates that the key’s value doesn't match a specified value.
Filter jobs with a label whose key is "environment" and value is "development":
Filter jobs with a label whose key is "version" and value is greater than "2.0":
Filter jobs with a label "project" existing:
Filter jobs without a "project" label:
Job Management: Enables efficient management of jobs by categorizing them based on distinct attributes or criteria.
Automation: Facilitates the automation of job deployment and management processes by allowing scripts and tools to target specific categories of jobs.
Monitoring & Analytics: Enhances monitoring and analytics by grouping jobs into meaningful categories, allowing for detailed insights and analysis.
The Labels
block is instrumental in the enhanced management, filtering, and operation of jobs within Bacalhau. By understanding and utilizing the available operators and label parameters effectively, users can optimize their workflow, automate processes, and achieve detailed insights into their jobs.
In both the Job
and Task
specifications within Bacalhau, the Meta
block is a versatile element used to attach arbitrary metadata. This metadata isn't utilized for filtering or categorizing jobs; there's a separate block specifically designated for that purpose. Instead, the Meta
block is instrumental for embedding additional information for operators or external systems, enhancing clarity and context.
Meta
Parameters in Job and Task Specs
The Meta
block consists of key-value pairs, with both keys and values being strings. These pairs aren't constrained by a predefined structure, offering flexibility for users to annotate jobs and tasks with diverse metadata.
Users can incorporate any arbitrary key-value pairs to convey descriptive information or context about the job or task.
project: Identifies the associated project.
version: Specifies the version of the application or service.
owner: Names the responsible team or individual.
environment: Indicates the stage in the development lifecycle.
Beyond user-defined metadata, Bacalhau automatically injects specific metadata keys for identification and security purposes.
bacalhau.org/requester.id: A unique identifier for the orchestrator that handled the job.
bacalhau.org/requester.publicKey: The public key of the requester, aiding in security and validation.
bacalhau.org/client.id: The ID for the client submitting the job, enhancing traceability.
Identification: The metadata aids in uniquely identifying jobs and tasks, connecting them to their originators and executors.
Context Enhancement: Metadata can supplement jobs and tasks with additional data, offering insights and context that aren't captured by standard parameters.
Security Enhancement: Auto-generated keys like the requester's public key contribute to the secure handling and execution of jobs and tasks.
An InputSource
defines where and how to retrieve specific artifacts needed for a Task, such as files or data, and where to mount them within the task's context. This ensures the necessary data is present before the task's execution begins.
Bacalhau's InputSource
natively supports fetching data from remote sources like S3 and IPFS and can also mount local directories. It is intended to be flexible for future expansion.
InputSource
Parameters:Source (
: <required>)
: Specifies the origin of the artifact, which could be a URL, an S3 bucket, or other locations.
Alias (string: <optional>)
: An optional identifier for this input source. It's particularly useful for dynamic operations within a task, such as dynamically importing data in WebAssembly using an alias.
Target (string: <required>)
: Defines the path inside the task's environment where the retrieved artifact should be mounted or stored. This ensures that the task can access the data during its execution.
In this example, the first input source fetches data from an S3 bucket and mounts it at /my_s3_data
within the task. The second input source mounts a local directory at /my_local_data
and allows the task to read and write data to it.
While the Meta
block is distinct from the block used for filtering, its contribution to providing context, security, and traceability is integral in managing and understanding the diverse jobs and tasks within the Bacalhau ecosystem effectively.
The Timeouts
object provides a mechanism to impose timing constraints on specific task operations, particularly execution. By setting these timeouts, users can ensure tasks don't run indefinitely and align them with intended durations.
Timeouts
Parameters:
ExecutionTimeout (int: <optional>)
: Defines the maximum duration (in seconds) that a task is permitted to run. A value of zero indicates that there's no set timeout. This could be particularly useful for tasks that function as daemons and are designed to run indefinitely.
Used judiciously, the Timeouts
object helps manage resource utilization and ensures tasks adhere to expected timelines, making job executions more efficient and predictable.
State
Structure Specification
Within Bacalhau, the State
structure is designed to represent the status or state of an object (like a Job
), coupled with a human-readable message for added context. Below is a breakdown of the structure:
State
Parameters
StateType (T : <required>)
: Represents the current state of the object. This is a generic parameter that will take on a specific value from a set of defined state types for the object in question. For jobs, this will be one of the JobStateType
values.
Message (string : <optional>)
: A human-readable message giving more context about the current state. Particularly useful for states like Failed
to provide insight into the nature of any error.
When State
is used for a job, the StateType
can be one of the following:
Pending
: This indicates that the job is submitted but is not yet scheduled for execution.
Running
: The job is scheduled and is currently undergoing execution.
Completed
: This state signifies that a job has successfully executed its task. Only applicable for batch jobs.
Failed
: A state indicating that the job encountered errors and couldn't successfully complete.
Stopped
: The job has been intentionally halted by the user before its natural completion.
The inclusion of the Message
field can offer detailed insights, especially in states like Failed
, aiding in error comprehension and debugging.
Efficient job management and resource optimization are significant considerations. In our continued effort to support scalable distributed computing and data processing, we are excited to introduce job queuing in Bacalhau v1.4.0
.
Job queuing was added in Bacalhau version 1.4 and is not supported in earlier versions. Consider upgrading to the latest version to optimize resource usage with job queuing.
Job queuing handles the situation where no suitable nodes are available on the network to execute a job. In this case, a user-defined period can be configured for the job, during which it waits for suitable nodes to become available or free. This gives you more flexibility and reliability in managing distributed workloads.
Job queuing is not enabled by default. It must be explicitly set in your job specification or requester node configuration using the QueueTimeout
parameter. This parameter activates the queuing feature and defines the amount of time your job should wait for available nodes in the network.
Node availability in your network is determined by capacity as well as job constraints such as label selectors, engines or publishers. For example, jobs will be queued if all nodes are currently busy, as well as if idle nodes do not match parameters in your job specification.
Bacalhau compute nodes send their node, resource, and health information to the requester nodes every 30 seconds. Between updates, multiple jobs may be allocated to a node, oversubscribing it and potentially exceeding its immediately available capacity. In that case, a local job queue is created at the compute node, and the queued jobs run as resources become available.
At the requester node level, you can set default queuing behavior for all jobs by defining the QueueTimeout
parameter in the node's configuration file. Alternatively, within the job specification, you can include the QueueTimeout
parameter directly in the configuration YAML. This flexibility allows you to tailor the queuing behavior to meet the specific needs of your distributed computing environment, ensuring that jobs are efficiently managed and resources are optimally utilized.
Here’s an example requester node configuration that sets the default job queuing time to one hour:
The QueueBackoff
parameter determines the interval between retry attempts by the requester node to assign queued jobs.
Here’s a sample job specification setting the QueueTimeout
for this specific job, overriding any node defaults.
You can also define timeouts for your jobs directly through the CLI using the --queue-timeout
flag. This method provides a convenient way to specify queuing behavior on a per-job basis, allowing you to manage job execution dynamically without modifying configuration files.
For example, here is how you can submit a job with a specified queue timeout using the CLI:
Timeouts in Bacalhau are generally governed by the TotalTimeout
value for your yaml specifications and the --timeout
flag for your CLI commands. The default total timeout value is 30 minutes. Declaring any queue timeout that is larger than that without changing the total timeout value will result in a validation error.
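The relationship between the two timeouts can be sketched as a simple check. This is only an illustration of the rule stated above, not Bacalhau's validation code: the function name and the choice to express both values in seconds are assumptions.

```python
# Default total timeout of 30 minutes, as stated above, expressed in seconds.
DEFAULT_TOTAL_TIMEOUT_S = 30 * 60


def validate_timeouts(
    queue_timeout_s: int, total_timeout_s: int = DEFAULT_TOTAL_TIMEOUT_S
) -> None:
    """Reject a queue timeout that is larger than the total timeout."""
    if queue_timeout_s > total_timeout_s:
        raise ValueError(
            f"queue timeout ({queue_timeout_s}s) exceeds "
            f"total timeout ({total_timeout_s}s)"
        )


validate_timeouts(600)                         # 10 min queue, default total: accepted
validate_timeouts(3600, total_timeout_s=7200)  # 1 h queue, 2 h total: accepted
try:
    validate_timeouts(3600)                    # 1 h queue, default 30 min total
except ValueError as err:
    print(f"validation error: {err}")
```

In practice this means that a one-hour queue timeout also requires raising the total timeout above its 30-minute default.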
Jobs are queued when all available nodes are busy or when no node matches your job specification. Let’s take a look at how queuing works within your network.
Queued Jobs will initially display the Queued
status. Using the bacalhau job describe
command will showcase both the state of the job and the reason behind queuing.
For busy nodes:
For no matching nodes in the network:
Once appropriate node resources become available, these jobs will transition to either a Running
or Completed
status, allowing more jobs to be assigned to matching nodes.
As Bacalhau continues to evolve, our commitment to making distributed computing and data processing more accessible and efficient remains strong. We want to hear what you think about this feature so that we can make Bacalhau better and meet all the diverse needs and requirements of you, our users.
For questions or feedback, please reach out in our Slack.
Templating Support in Bacalhau Job Run
This documentation introduces templating support for bacalhau job run
, providing users with the ability to dynamically inject variables into their job specifications. This feature is particularly useful when running multiple jobs with varying parameters such as DuckDB queries, S3 buckets, prefixes, and time ranges, without the need to edit each job specification file manually.
The motivation behind this feature arises from the need to streamline the process of preparing and running multiple jobs with different configurations. Rather than manually editing job specs for each run, users can leverage placeholders and pass actual values at runtime.
The templating functionality in Bacalhau is built upon the Go text/template
package. This powerful library offers a wide range of features for manipulating and formatting text based on template definitions and input variables.
For more detailed information about the Go text/template
library and its syntax, please refer to the official documentation: Go text/template
Package.
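In Go text/template, a field lookup is written as an action such as `{{.name}}`. The idea of replacing placeholders with runtime values can be illustrated with a rough Python stand-in. This sketch is not Go text/template and not Bacalhau's implementation: it handles only the simplest `{{.name}}` lookups, the `render` helper and the variable names are hypothetical, and unlike Go's default behavior it raises an error when a value is missing.

```python
import re

# Matches simple field lookups such as {{.bucket}} or {{ .bucket }}.
_PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, variables: dict[str, str]) -> str:
    """Replace each {{.name}} placeholder with the matching variable value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value provided for {name!r}")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical spec fragment and values, for illustration only.
spec = "Bucket: {{.bucket}}\nPrefix: {{ .prefix }}"
print(render(spec, {"bucket": "my-bucket", "prefix": "logs/"}))
```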
You can also use environment variables for templating:
To preview the final templated job spec without actually submitting the job, you can use the --dry-run
flag:
This will output the processed job specification, showing you how the placeholders have been replaced with the provided values.
This is an ops
job that runs on all nodes that match the job selection criteria. It accepts the duckdb query
variable, and two optional start-time
and end-time
variables to define the time range for the query.
To run this job, you can use the following command:
This is a batch
job that runs on a single node. It accepts the duckdb query
variable, and four other variables to define the S3 bucket, prefix, and pattern for the logs and the AWS region.
To run this job, you can use the following command: