How to troubleshoot and debug failed Bacalhau jobs
"An expert is a person who has made all the mistakes that can be made in a very narrow field." ― Niels Bohr
Bacalhau is a decentralized compute network that anyone can join. The network comprises a smorgasbord of hardware provided by a hodgepodge of providers. In addition, its users are diverse and their jobs are unique. With that many permutations, there's a pretty good chance that something will go wrong at some point.
Being decentralized also means that you can't follow standard debugging practices such as SSH'ing into a node or spinning up a REPL environment. This page describes a few hints and tips that we've found useful when debugging failed jobs.
A failing job could be described as anything that didn't meet your expectations. But clearly much of that is outside of the scope of the network.
When it comes to Bacalhau, a failing job is one that has failed to complete successfully. If you run a job in the foreground you might see a message saying:
Or, when you list the jobs, you might see a state of ERROR, like:
When you suspect that your job has failed, the first thing you should do is inspect its status. The bacalhau job describe $JOB_ID command presents everything that is known about a job from the perspective of the network.
Look through the Shards of the job and see if any of them have a State of Error. The RunOutput field provides the juicy details of what went wrong.
Executable file not found
One of the most common reasons for failure is that the entrypoint for a job doesn't exist. The stderr or runnerError will look something like:
This is usually caused by a mistake in the path to the executable or in the quoting. To fix this, edit the command and make sure that it is valid inside the container.
Try enclosing your command in a bash -c '...' (or equivalent shell) to make sure that your command is parsed by the process in the container, not your shell.
exit code was not zero: 1
If your program exits with a non-zero exit code, the job will report a failure. The exitCode field will present the code. Inspect the stderr or stdout to see what went wrong. For example:
Typically this is caused by a user error in the code. But you can sometimes see it due to a hardware error (e.g. running out of memory) or a Docker error (e.g. the wrong container architecture).
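To make a non-zero exit code easier to diagnose, it can help to catch failures at the top of your program and write the details to stderr before exiting. The following is a minimal Python sketch, not Bacalhau code; run_job and main are hypothetical names used only for illustration.

```python
import sys
import traceback


def run_job(argv):
    """Hypothetical job body: fails when no input path is given."""
    if not argv:
        raise ValueError("expected an input path, got no arguments")
    print(f"processing {argv[0]}")


def main(argv):
    """Run the job and translate any exception into exit code 1."""
    try:
        run_job(argv)
    except Exception:
        # Write the full traceback to stderr so it is captured with the job output.
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0
```

A script would typically end with sys.exit(main(sys.argv[1:])) so that the return value becomes the process exit code.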
If you're not sure what went wrong, you can try adding some sanity checks to your command or code. Here is a list of common shell commands that we use to make sure everything is in its right place:
ls -lah /inputs > /outputs/ls.txt - list the contents of a directory and write to the outputs (or stdout) to double check that files/binaries really exist
md5sum /inputs/data.tar.gz > /outputs/checksum.txt - calculate the checksum of a file and write to the outputs (or stdout) to double check that the file is what you expect
Inside your code:
Use your language's assert functionality to check that the inputs are what you expect
Seriously, we've seen all sorts of wonderful things go wrong, like CIDs presenting a corrupted file. It's worth checking everything!
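As an illustration of checking inputs from inside your code, here is a minimal Python sketch that asserts a file exists, is non-empty and, optionally, matches a known checksum. The function names file_md5 and check_input are hypothetical, and the expected checksum is something you would have to supply yourself.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_input(path, expected_md5=None):
    """Assert that an input file exists, is non-empty and matches a checksum."""
    path = Path(path)
    assert path.is_file(), f"input not found: {path}"
    size = path.stat().st_size
    assert size > 0, f"input is empty: {path}"
    actual = file_md5(path)
    if expected_md5 is not None:
        assert actual == expected_md5, f"checksum mismatch for {path}: {actual}"
    # Print what was observed so it shows up in the job's stdout.
    print(f"{path}: {size} bytes, md5 {actual}")
    return actual
```

Calling check_input("/inputs/data.tar.gz") at the start of a job mirrors the md5sum command above. Note that Python skips assert statements when run with the -O flag, so avoid that flag if you rely on these checks.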
More tips:
"The most effective debugging tool is still careful thought, coupled with judiciously placed print statements." — Brian Kernighan, "Unix for Beginners" (1979)
Since Bacalhau jobs have no external access, you can't rely on remote metric solutions or writing checkpoints to disk. Instead, liberally apply print statements like you're decorating a 1970s Christmas tree.
At the command line:
cp /inputs/data.tar.gz /outputs/data.tar.gz - copy a file to the outputs so you can download and inspect it later
Add echo or cat commands to list out pertinent information
Inside your code:
Use a logging framework if you have one - structure the output to make it more searchable
Add print-like debugging statements to trace the path of execution within your code. When you think you've added enough, add more.
print out the hardware resources observed by your code, to ensure that hardware is visible and behaving as expected (e.g. GPU information)
For longer-running or hardware intensive jobs, print status updates and current consumption metrics to ensure that the job is progressing as expected
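The in-code tips above can be sketched in Python as follows. This is only an illustration using the standard library; log_event, hardware_report and process are hypothetical names, and the GPU check merely looks for the nvidia-smi tool on the PATH rather than querying a device.

```python
import json
import os
import platform
import shutil
import sys
import time


def log_event(event, **fields):
    """Print one JSON object per line so the job's stdout is easy to search."""
    record = {"ts": round(time.time(), 3), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    # flush=True keeps output visible even if the job is killed later.
    print(line, flush=True)
    return line


def hardware_report(path="."):
    """Collect the resources this process can see."""
    disk = shutil.disk_usage(path)
    return {
        "python": sys.version.split()[0],
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "disk_free_bytes": disk.free,
        # Only records whether nvidia-smi is on PATH; it does not query the GPU.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


def process(items):
    """Hypothetical workload that reports progress after every item."""
    log_event("start", **hardware_report())
    total = 0
    for index, item in enumerate(items, start=1):
        total += item
        log_event("progress", done=index, of=len(items), running_total=total)
    log_event("finished", total=total)
    return total
```

Emitting one JSON object per line is a design choice: it lets you filter the downloaded stdout by event name with ordinary text tools.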
More tips:
It might sound obvious, but run a test job locally first. You'll often have much better visibility into what's going on, and you can use your local tools to debug.
Before running a Bacalhau job for real, it's worth taking the time to slowly baby-step your way to the final command. This is especially true if you're new to Bacalhau or if you're not sure what the inputs will look like.
Your first job should be a simple ubuntu based ls command to make sure that the inputs are where you expect them to be
Your second job should be a similarly simple ls-like job, but using your code/container
Your third job should use your code, but run some kind of "inspect", "list" or "sanity check" job to double check that it has everything it needs before it starts. A "hello world" if you will.
Finally, try and run the actual job.
If the job fails, try to tailor a job that tests the specific issue you're facing.
In essence, you should try to minimize the chance of failure by intentionally testing all the normal things that can go wrong, like data being in the wrong place or in the wrong format.
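As a rough idea of what an "inspect" style job could run, here is a minimal Python sketch that lists every file under a directory with its size and saves the listing. The names inspect_tree and write_report are hypothetical; the /inputs and /outputs paths follow the examples earlier on this page.

```python
import os


def inspect_tree(root):
    """Return (relative path, size in bytes) for every file under root, sorted."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(entries)


def write_report(root, report_path):
    """Write the listing to a file and echo it to stdout."""
    entries = inspect_tree(root)
    lines = [f"{size:>12}  {rel}" for rel, size in entries]
    if not lines:
        # os.walk yields nothing for a missing or empty directory.
        lines = [f"no files found under {root}"]
    text = "\n".join(lines) + "\n"
    with open(report_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    print(text, end="")
    return entries
```

For example, write_report("/inputs", "/outputs/inspect.txt") would leave a listing you can download and compare against what you expected to be mounted.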
If you're still having trouble, please reach out to the Bacalhau team via Slack (#bacalhau channel).
If you have any hints or tips to add, then please submit a PR to the Bacalhau Documentation repository.
Useful tricks for debugging Bacalhau when developing it locally.
We use the zerolog library for logging, so the usual log levels are supported (debug, info, warn, error, fatal).
The log level is controlled by the LOG_LEVEL variable:
An example of printing a log at a certain level (this is just using the zerolog library directly):
We also have the LOG_TYPE variable, which controls the format in which log messages are printed:
text (default): Prints the log message using zerolog.ConsoleWriter so you see text output.
json: Prints line delimited JSON logs
event: Prints only the event logs
combined: Prints text, json and event logs
Event logs are useful when you need to understand the flow of events through the system.
They are much less noisy and are only emitted by the requester node and compute node when a job transitions to a new state.
To print only event logs, use the LOG_TYPE variable:
It's sometimes useful to see the text output on stdout but also write just the event log to a file. For this, the LOG_EVENT_FILE variable can be used:
An example of calling the event log library: