This page explains how to feed external data into Bacalhau jobs from various sources. Bacalhau's modular architecture enables flexible data mounting from multiple storage providers, with S3-compatible storage, local directories, IPFS, and HTTP/HTTPS URLs supported out of the box.
What You'll Learn
How to mount data from different sources to your Bacalhau jobs
The syntax and options for each data source type
Best practices for efficient data handling
Bacalhau jobs often need access to input data. The general syntax for mounting input data is:
Imperative:
bacalhau docker run \
  --input <URI><SOURCE>:<TARGET> \
  IMAGE -- COMMAND
Where:
URI is the protocol identifier (file://, s3://, ipfs://, http://, https://)
SOURCE is the path to the data at that location
TARGET is the path where the data will be mounted inside the container
This pattern is consistent across all input types, making it easy to understand and use regardless of the data source.
Declarative:
...
InputSources:
  - Alias: input
    Target: <TARGET>
    Source:
      Type: <TYPE>
      Params:
        key: value
Where:
TYPE is the source type (local, s3, urlDownload, or ipfs in the examples below)
TARGET is the path where the data will be mounted inside the container
PARAMS are key-value configuration options specific to the source type
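Declarative specs are typically saved to a file and submitted with the Bacalhau CLI; a minimal usage sketch, assuming the full job spec is saved as job.yaml:
bacalhau job run job.yaml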
Local Directories
Imperative:
bacalhau docker run \
  --input file:///path/to/local/data:/data \
  ubuntu:latest -- cat /data/input.txt
Declarative:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "ubuntu:latest"
        Parameters:
          - "cat"
          - "/data/input.txt"
    InputSources:
      - Alias: input_data
        Target: /data
        Source:
          Type: local
          Params:
            Path: /path/to/local/data
This mounts the directory /path/to/local/data from the compute node's host filesystem to /data inside the container.
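The same pattern can target a single file rather than a whole directory. A sketch, assuming the local source accepts single-file paths and that input.txt exists at this path on the compute node that runs the job:
bacalhau docker run \
  --input file:///path/to/local/data/input.txt:/data/input.txt \
  ubuntu:latest -- cat /data/input.txt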
S3-Compatible Storage
S3 integration connects to storage solutions compatible with the S3 API, such as AWS S3, Google Cloud Storage, and locally deployed MinIO.
Imperative:
bacalhau docker run \
  --input s3://my-bucket/datasets/sample.csv:/data/sample.csv \
  ubuntu:latest -- cat /data/sample.csv
Declarative:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "ubuntu:latest"
        Parameters:
          - "cat"
          - "/data/sample.csv"
    InputSources:
      - Alias: input_data
        Target: /data/sample.csv
        Source:
          Type: s3
          Params:
            Bucket: my-bucket
            Key: datasets/sample.csv
This downloads and mounts the S3 object to the specified path in the container.
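For S3-compatible services such as MinIO or Google Cloud Storage, the declarative source can point at a custom endpoint. A sketch, assuming the S3 source accepts Endpoint and Region parameters and that a MinIO server is reachable at the hypothetical address http://minio.local:9000:
InputSources:
  - Alias: input_data
    Target: /data/sample.csv
    Source:
      Type: s3
      Params:
        Bucket: my-bucket
        Key: datasets/sample.csv
        Endpoint: http://minio.local:9000  # assumed MinIO endpoint
        Region: us-east-1                  # many S3 clients still require a region value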
HTTP/HTTPS URLs
URL-based inputs provide access to web-hosted resources.
Imperative:
bacalhau docker run \
  --input https://example.com/data.csv:/data/data.csv \
  ubuntu:latest -- head -n 10 /data/data.csv
Declarative:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "ubuntu:latest"
        Parameters:
          - head
          - -n
          - "10"
          - /data/data.csv
    InputSources:
      - Alias: input_data
        Target: /data/data.csv
        Source:
          Type: urlDownload
          Params:
            URL: https://example.com/data.csv
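URL inputs fetch a single file, so compressed archives need to be unpacked by the job command itself. A sketch, assuming a hypothetical tarball at https://example.com/archive.tar.gz:
bacalhau docker run \
  --input https://example.com/archive.tar.gz:/inputs/archive.tar.gz \
  ubuntu:latest -- bash -c "tar -xzf /inputs/archive.tar.gz -C /tmp && ls /tmp"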
IPFS (InterPlanetary File System)
IPFS provides content-addressable, peer-to-peer storage for decentralized data sharing.
Imperative:
bacalhau docker run \
  --input ipfs://QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx:/data \
  ubuntu:latest -- cat /data/file.txt
Declarative:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "ubuntu:latest"
        Parameters:
          - cat
          - /data/file.txt
    InputSources:
      - Alias: input_data
        Target: /data
        Source:
          Type: ipfs
          Params:
            CID: QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx
The IPFS CID (Content Identifier) points to the specific content you want to mount.
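If you need a CID for your own data, you can publish it from a local IPFS (Kubo) node first; a sketch, assuming the ipfs CLI is installed and a daemon is running:
ipfs add -r ./my-dataset   # prints the CID of the directory
bacalhau docker run \
  --input ipfs://<returned-CID>:/data \
  ubuntu:latest -- ls /data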
Combining Multiple Input Sources
You can combine multiple inputs from different sources in a single job:
bacalhau docker run \
  --input file:///path/to/config:/config \
  --input s3://my-bucket/datasets/data.csv:/data/data.csv \
  --input https://example.com/reference.json:/data/reference.json \
  python:3.9 -- python /config/process.py
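The declarative equivalent simply lists several entries under InputSources in one task. A sketch, assuming a hypothetical process.py inside /path/to/config that reads the mounted files:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "python:3.9"
        Parameters:
          - python
          - /config/process.py
    InputSources:
      - Alias: config
        Target: /config
        Source:
          Type: local
          Params:
            Path: /path/to/config
      - Alias: dataset
        Target: /data/data.csv
        Source:
          Type: s3
          Params:
            Bucket: my-bucket
            Key: datasets/data.csv
      - Alias: reference
        Target: /data/reference.json
        Source:
          Type: urlDownload
          Params:
            URL: https://example.com/reference.json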
Working with Large Datasets
For very large datasets, consider these optimization strategies:
bacalhau docker run \
  --cpu 4 \
  --memory 8GB \
  --disk 100GB \
  --input s3://big-data-bucket/huge-dataset/:/data/ \
  python:3.9 -- python process_big_data.py
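In a declarative spec the same requests sit in the task's Resources block; a sketch, assuming the fields are named CPU, Memory, and Disk as in recent Bacalhau job specs:
Type: batch
Count: 1
Tasks:
  - Name: "task1"
    Engine:
      Type: "docker"
      Params:
        Image: "python:3.9"
        Parameters:
          - python
          - process_big_data.py
    Resources:
      CPU: "4"
      Memory: "8GB"
      Disk: "100GB"
    InputSources:
      - Alias: big_data
        Target: /data
        Source:
          Type: s3
          Params:
            Bucket: big-data-bucket
            Key: huge-dataset/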
Best practices:
Increase resource allocations as needed
Use data locality to minimize transfer costs
Process data in chunks when possible
Choose efficient data formats (Parquet, Arrow, etc.)
Tips & Caveats
Credentials: Some sources (such as S3) require proper credentials and network connectivity on the compute nodes
Data Locality: Use Bacalhau label selectors to run jobs on nodes that already hold the data or are close to it (see the sketch after this list)
IPFS Network: Compute nodes must be connected to an IPFS daemon to support this storage type
Size Limits: Very large inputs may require increased disk allocations using --disk
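To apply the data-locality tip, nodes can be labeled and jobs constrained with a selector. A sketch, assuming compute nodes carry a hypothetical region=us-east-1 label and that the CLI's --selector flag is available in your Bacalhau version:
bacalhau docker run \
  --selector "region=us-east-1" \
  --input s3://my-bucket/datasets/sample.csv:/data/sample.csv \
  ubuntu:latest -- cat /data/sample.csv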
Next Steps
Learn how to retrieve and publish outputs from jobs
See a complete example workflow that includes input data
Explore resource constraints for jobs with large data processing needs