Mounting Input Data
This page explains how to feed external data into Bacalhau jobs from various sources. Bacalhau's modular architecture enables flexible data mounting from multiple storage providers, with S3-compatible storage, local directories, IPFS, and HTTP/HTTPS URLs supported out of the box.
What You'll Learn
How to mount data from different sources to your Bacalhau jobs
The syntax and options for each data source type
Best practices for efficient data handling
Input Mounting Basics
Bacalhau jobs often need access to input data. The general syntax for mounting input data is:
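A sketch of the pattern using the Bacalhau CLI's --input flag (the exact flag syntax can vary between Bacalhau versions; placeholders are shown in angle brackets):

```bash
# General pattern: mount SOURCE, addressed by its URI scheme, at TARGET inside the container
bacalhau docker run --input <URI>://<SOURCE>:<TARGET> <image> -- <command>
```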
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
SOURCE specifies the path to the data
TARGET is the path where the data will be mounted in the container
This pattern is consistent across all input types, making it easy to understand and use regardless of the data source.
Local Directories
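For example, a host directory can be mounted with the file:// scheme (a sketch; the path and image are placeholders, and local directory access must be enabled on the compute node):

```bash
# Mount a host directory into the container at /data (hypothetical path and image)
bacalhau docker run \
  --input file:///path/to/local/data:/data \
  ubuntu -- ls -la /data
```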
This mounts the directory /path/to/local/data from the host machine to /data inside the container.
S3-Compatible Storage
S3 integration connects to storage solutions compatible with the S3 API, such as AWS S3, Google Cloud Storage, and locally deployed MinIO.
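For example (a sketch; the bucket and key below are placeholders, and the compute node needs valid S3 credentials):

```bash
# Mount an S3 object at /data inside the container (hypothetical bucket and key)
bacalhau docker run \
  --input s3://my-bucket/path/to/object:/data \
  ubuntu -- ls -la /data
```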
This downloads and mounts the S3 object to the specified path in the container.
HTTP/HTTPS URLs
URL-based inputs provide access to web-hosted resources.
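For example (a sketch; the URL below is a placeholder):

```bash
# Fetch a file over HTTPS and mount it at /inputs/data.csv
bacalhau docker run \
  --input https://example.com/datasets/data.csv:/inputs/data.csv \
  ubuntu -- head /inputs/data.csv
```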
IPFS (InterPlanetary File System)
IPFS provides content-addressable, peer-to-peer storage for decentralized data sharing.
The IPFS CID (Content Identifier) points to the specific content you want to mount.
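For example (a sketch; the CID below is a placeholder, and the compute node must be able to reach an IPFS daemon):

```bash
# Mount the content behind an IPFS CID at /inputs (hypothetical CID)
bacalhau docker run \
  --input ipfs://QmYourContentIdentifier:/inputs \
  ubuntu -- ls /inputs
```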
Multiple Inputs
You can combine multiple inputs from different sources in a single job:
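For example (a sketch; the sources and paths are placeholders):

```bash
# Combine an S3 object, a web-hosted file, and IPFS content in one job
bacalhau docker run \
  --input s3://my-bucket/training-data:/inputs/s3 \
  --input https://example.com/reference.csv:/inputs/reference.csv \
  --input ipfs://QmYourContentIdentifier:/inputs/ipfs \
  ubuntu -- ls -R /inputs
```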
Working with Large Datasets
For very large datasets, consider these optimization strategies:
Best practices:
Increase resource allocations as needed (see the sketch after this list)
Use data locality to minimize transfer costs
Process data in chunks when possible
Choose efficient data formats (Parquet, Arrow, etc.)
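For instance, a job over a large dataset might request more disk and memory up front (a sketch; the flag names assume a recent Bacalhau CLI and the bucket and values are placeholders):

```bash
# Request extra disk and memory for a job that processes a large S3 prefix
bacalhau docker run \
  --input s3://my-bucket/large-dataset:/inputs \
  --disk 100gb \
  --memory 16gb \
  ubuntu -- du -sh /inputs
```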
Tips & Caveats
Credentials: Some mount sources (such as S3) require valid credentials and network connectivity on the compute node
Data Locality: Use Bacalhau label selectors to run jobs on nodes that hold the data or are close to it (see the sketch after this list)
IPFS Network: Compute nodes must be connected to an IPFS daemon to support this storage type
Size Limits: Very large inputs may require increased disk allocations using --disk
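For example, data locality can be expressed with a node selector (a sketch; the --selector flag is assumed to be available in your Bacalhau version, and the region label is a hypothetical label applied to your nodes):

```bash
# Prefer nodes labelled as being in the same region as the data (hypothetical label)
bacalhau docker run \
  --selector region=us-east-1 \
  --input s3://my-bucket/path/to/object:/data \
  ubuntu -- ls /data
```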
Next Steps
Learn how to retrieve and publish outputs from jobs
See a complete example workflow that includes input data
Explore resource constraints for jobs with large data processing needs