Bacalhau Docs
GithubSlackBlogEnterprise
v1.5.x
  • Documentation
  • Use Cases
  • CLI & API
  • References
  • Community
v1.5.x
  • Welcome
  • Getting Started
    • How Bacalhau Works
    • Installation
    • Create Network
    • Hardware Setup
    • Container Onboarding
      • Docker Workloads
      • WebAssembly (Wasm) Workloads
  • Setting Up
    • Running Nodes
      • Node Onboarding
      • GPU Installation
      • Job selection policy
      • Access Management
      • Node persistence
      • Connect Storage
      • Configuring Transport Level Security
      • Limits and Timeouts
      • Test Network Locally
      • Bacalhau WebUI
      • Private IPFS Network Setup
    • Workload Onboarding
      • Container
        • Docker Workload Onboarding
        • WebAssembly (Wasm) Workloads
        • Bacalhau Docker Image
        • How To Work With Custom Containers in Bacalhau
      • Python
        • Building and Running Custom Python Container
        • Running Pandas on Bacalhau
        • Running a Python Script
        • Running Jupyter Notebooks on Bacalhau
        • Scripting Bacalhau with Python
      • R (language)
        • Building and Running your Custom R Containers on Bacalhau
        • Running a Simple R Script on Bacalhau
      • Run CUDA programs on Bacalhau
      • Running a Prolog Script
      • Reading Data from Multiple S3 Buckets using Bacalhau
      • Running Rust programs as WebAssembly (WASM)
      • Generate Synthetic Data using Sparkov Data Generation technique
    • Data Ingestion
      • Copy Data from URL to Public Storage
      • Pinning Data
      • Running a Job over S3 data
    • Networking Instructions
      • Accessing the Internet from Jobs
      • Utilizing NATS.io within Bacalhau
    • GPU Workloads Setup
    • Automatic Update Checking
    • Marketplace Deployments
      • Google Cloud Marketplace
  • Guides
    • (Updated) Configuration Management
    • Write a config.yaml
    • Write a SpecConfig
  • Examples
    • Data Engineering
      • Using Bacalhau with DuckDB
      • Ethereum Blockchain Analysis with Ethereum-ETL and Bacalhau
      • Convert CSV To Parquet Or Avro
      • Simple Image Processing
      • Oceanography - Data Conversion
      • Video Processing
    • Model Inference
      • EasyOCR (Optical Character Recognition) on Bacalhau
      • Running Inference on Dolly 2.0 Model with Hugging Face
      • Speech Recognition using Whisper
      • Stable Diffusion on a GPU
      • Stable Diffusion on a CPU
      • Object Detection with YOLOv5 on Bacalhau
      • Generate Realistic Images using StyleGAN3 and Bacalhau
      • Stable Diffusion Checkpoint Inference
      • Running Inference on a Model stored on S3
    • Model Training
      • Training Pytorch Model with Bacalhau
      • Training Tensorflow Model
      • Stable Diffusion Dreambooth (Finetuning)
    • Molecular Dynamics
      • Running BIDS Apps on Bacalhau
      • Coresets On Bacalhau
      • Genomics Data Generation
      • Gromacs for Analysis
      • Molecular Simulation with OpenMM and Bacalhau
  • References
    • Jobs Guide
      • Job Specification
        • Job Types
        • Task Specification
          • Engines
            • Docker Engine Specification
            • WebAssembly (WASM) Engine Specification
          • Publishers
            • IPFS Publisher Specification
            • Local Publisher Specification
            • S3 Publisher Specification
          • Sources
            • IPFS Source Specification
            • Local Source Specification
            • S3 Source Specification
            • URL Source Specification
          • Network Specification
          • Input Source Specification
          • Resources Specification
          • ResultPath Specification
        • Constraint Specification
        • Labels Specification
        • Meta Specification
      • Job Templates
      • Queuing & Timeouts
        • Job Queuing
        • Timeouts Specification
      • Job Results
        • State
    • CLI Guide
      • Single CLI commands
        • Agent
          • Agent Overview
          • Agent Alive
          • Agent Node
          • Agent Version
        • Config
          • Config Overview
          • Config Auto-Resources
          • Config Default
          • Config List
          • Config Set
        • Job
          • Job Overview
          • Job Describe
          • Job Exec
          • Job Executions
          • Job History
          • Job List
          • Job Logs
          • Job Run
          • Job Stop
        • Node
          • Node Overview
          • Node Approve
          • Node Delete
          • Node List
          • Node Describe
          • Node Reject
      • Command Migration
    • API Guide
      • Bacalhau API overview
      • Best Practices
      • Agent Endpoint
      • Orchestrator Endpoint
      • Migration API
    • Node Management
    • Authentication & Authorization
    • Database Integration
    • Debugging
      • Debugging Failed Jobs
      • Debugging Locally
    • Running Locally In Devstack
    • Setting up Dev Environment
  • Help & FAQ
    • Bacalhau FAQs
    • Glossary
    • Release Notes
      • v1.5.0 Release Notes
      • v1.4.0 Release Notes
  • Integrations
    • Apache Airflow Provider for Bacalhau
    • Lilypad
    • Bacalhau Python SDK
    • Observability for WebAssembly Workloads
  • Community
    • Social Media
    • Style Guide
    • Ways to Contribute
Powered by GitBook
LogoLogo

Use Cases

  • Distributed ETL
  • Edge ML
  • Distributed Data Warehousing
  • Fleet Management

About Us

  • Who we are
  • What we value

News & Blog

  • Blog

Get Support

  • Request Enterprise Solutions

Expanso (2025). All Rights Reserved.

On this page
  • 1. What Does a Job Failure Look Like?
  • 2. Inspecting the Status of the Job
  • 3. Common Error 1 - Executable file not found
  • 4. Common Error 2 - exit code was not zero: 1
  • 5. Debugging Via Sanity Checks
  • 6. Debugging Via Logging
  • 8. Debugging by Running Locally
  • 7. Debugging Via Simple Jobs
  • Support
  • Contributing

Was this helpful?

Export as PDF
  1. References
  2. Debugging

Debugging Failed Jobs

How to troubleshoot and debug failed Bacalhau jobs

"An expert is a person who has made all the mistakes that can be made in a very narrow field." ― Niels Bohr

Bacalhau is a decentralized compute network that anyone can join. The network comprises of a smorgasbord of hardware provided by a hodgepodge of providers. In addition, its users are diverse and their jobs are unique. The permutations involved mean that there's a pretty good chance that something will go wrong at some point.

Being decentralized also means that you can't follow standard debugging practices such as SSH'ing into a node or spinning up a REPL environment. This page describes a few hints and tips that we've found useful when debugging failed jobs.

1. What Does a Job Failure Look Like?

A failing job could be described as anything that didn't meet your expectations. But clearly much of that is outside of the scope of the network.

When it comes to Bacalhau, a failing job is one that has failed to complete successfully. If you run a job in the foreground you might see a message saying:

Error while executing the job.

Or when you list the jobs you might see a state of ERROR, like:

CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
11:05:47  bab5f64c  Docker ubuntu echo "...  Error

2. Inspecting the Status of the Job

When you first suspect that your job has failed, the first thing you should do is inspect the status. The bacalhau job describe $JOB_ID command presents everything that is known about a job from the perspective of the network.

Look through the Shards of the job and see if any of them have a State of Error. The RunOutput field provides the juicy details of what went wrong.

3. Common Error 1 - Executable file not found

One of the most common reasons for failure is that the entrypoint for a job doesn't exist. The stderr or runnerError will look something like:

JobState:
  Nodes:
    QmXMzb3GQRMyUyVvUB53nfkZ1sURTVxuR8BPowey7a3WKk:
      Shards:
        "0":
          NodeId: QmXMzb3GQRMyUyVvUB53nfkZ1sURTVxuR8BPowey7a3WKk
          PublishedResults: {}
          RunOutput:
            exitCode: 0
            runnerError: 'Executable file not found: Error response from daemon: failed
              to create shim task: OCI runtime create failed: runc create failed:
              unable to start container process: exec: "echo \"Something spooky\"
              &>2 && exit 1": executable file not found in $PATH: unknown: Executable
              file not found: Error response from daemon: failed to create shim task:
              OCI runtime create failed: runc create failed: unable to start container
              process: exec: "echo \"Something spooky\" &>2 && exit 1": executable
              file not found in $PATH: unknown'

This is usually caused by a mistake in the path to the executable or quotes. To fix this, you'll need to edit the command and make sure it's a valid command.

Try enclosing your command in a bash -c '...' (or equivalent shell) to make sure that your command is parsed by the process in the container, not your shell.

4. Common Error 2 - exit code was not zero: 1

If your program exits with a non-zero exit code, the job will report a failure. The exitCode field will present the code. Inspect the stderr or stdout to see what went wrong. For example:

JobState:
  Nodes:
    QmVAb7r2pKWCuyLpYWoZr9syhhFnTWeFaByHdb8PkkhLQG:
      Shards:
        "0":
          NodeId: QmVAb7r2pKWCuyLpYWoZr9syhhFnTWeFaByHdb8PkkhLQG
          PublishedResults: {}
          RunOutput:
            exitCode: 1
            runnerError: 'exit code was not zero: 1'
            stderr: |
              Something spooky happened and I got scared

Typically this is caused by a user error in the code. But you can sometimes see it due to a hardware (e.g. an out of memory error) or Docker error (e.g. wrong container architecture).

5. Debugging Via Sanity Checks

If you're not sure what went wrong, you can try adding some sanity checks to your command or code. Here is a list of common command line commands that we use to make sure everything is in its right place:

  1. ls -lah /inputs > /outputs/ls.txt - list the contents of a directory and write to the outputs (or stdout) to double check that files/binaries really exist

  2. md5sum /inputs/data.tar.gz > /outputs/checksum.txt - calculate the checksum of a file and write to the outputs (or stdout) to double check that the file is what you expect

Inside your code:

  • Use your language's assert functionality to check that the inputs are what you expect

Seriously, we've seen all sorts of wonderful things go wrong. Like CIDs presenting a corrupted file. It's worth checking everything!

More tips:

6. Debugging Via Logging

"The most effective debugging tool is still careful thought, coupled with judiciously placed print statements." — Brian Kernighan, "Unix for Beginners" (1979)

Since Bacalhau jobs have no external access, you can't rely on remote metric solutions or writing checkpoints to disk. Instead, liberally apply print statements like you're decorating a 1970's Christmas tree.

At the command line:

  1. cp /inputs/data.tar.gz /outputs/data.tar.gz - copy a file to the outputs so you can download and inspect it later

  2. Add echo or cat commands to list out pertinent information

Inside your code:

  1. Use a logging framework if you have one - structure the output to make it more searchable

  2. Add print-like debugging statements to trace the path of execution within your code. When you think you've added enough, add more.

  3. print out the hardware resources observed by your code, to ensure that hardware is visible and behaving as expected (e.g. GPU information)

  4. For longer-running or hardware intensive jobs, print status updates and current consumption metrics to ensure that the job is progressing as expected

More tips:

8. Debugging by Running Locally

It might sound obvious but run a test job locally first. You'll often have much better visibility into what's going on, and you can use your local tools to debug.

7. Debugging Via Simple Jobs

Before running a Bacalhau job for real, it's worth taking the time to slowly baby-step your way to the final command. This is especially true if you're new to Bacalhau or if you're not sure what the inputs will look like.

  1. Your first job should be a simple ubuntu based ls command to make sure that the inputs are where you expect them to be

  2. Your second job should be a similarly simple ls-like job, but using your code/container

  3. Your third job should use your code, but run some kind of "inspect" or "list" or "sanity check" like job to double check that your job has everything it needs to do before it starts. A "hello world" if you will.

  4. Finally, try and run the actual job.

  5. If the job fails, try to tailor a job that tests the specific issue you're facing.

In essence, you should try and minimize failure of the job by intentionally testing all the normal things that can go wrong, like data not being in the right place or in the wrong format.

Support

Contributing

PreviousDebuggingNextDebugging Locally

Was this helpful?

If you're still having trouble, please reach out to the

If you have any hints or tips to add, then please submit a PR to .

Wikipedia
Tips for debugging with print()
Debugging: print statements and logging
Flame war: The unreasonable effectiveness of print debugging
Bacalhau team via Slack (#bacalhau channel)
the Bacalhau Documentation repository