Simple Image Processing
Introduction
In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.
Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.
Prerequisite​
To get started, you need to install the Bacalhau client, see more information here
Running a Bacalhau Job​
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. This command allows to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run
command creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job.
Structure of the command​
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau-i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1
: Specifies the input data, which is stored in the S3 storage.--entrypoint mogrify
: Overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry.dpokidov/imagemagick:7.1.0-47-ubuntu
: The name and the tag of the docker image we are using-- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
: These arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the/outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Declarative job description​
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. image.yaml
, and then run with the command:
Checking the State of your Jobs​
Job status: You can check the status of the job using bacalhau job list
:
When it says Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
:
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
Display the image​
To view the images, open the results/outputs/
folder:
Support​
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Was this helpful?