Video Processing
Introduction
Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.
Prerequisite​
To get started, you need to install the Bacalhau client, see more information here
Upload the Data to IPFS​
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
.
Running a Bacalhau Job​
To submit a workload to Bacalhau, we will use the bacalhau docker run
command. The command allows one to pass input data volume with a -i ipfs://CID:path
argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Structure of the command​
Let's look closely at the command above:
bacalhau docker run
: call to Bacalhau-i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j
: CIDs to use on the job. Mounts them at '/inputs' in the execution.linuxserver/ffmpeg
: the name of the docker image we are using to resize the videos-- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'
: the command that will be executed inside the container. It usesfind
to locate all files with the extension ".mp4" within/inputs
and then usesffmpeg
to resize each found file to 72 pixels in height, saving the results in the/outputs
folder.
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
Bacalhau overwrites the default entrypoint so we must run the full command after the --
argument. In this line you will list all of the mp4 files in the /inputs
directory and execute ffmpeg
against each instance.
Declarative job description​
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml
format, e.g. video.yaml
, and then run with the command:
Checking the State of your Jobs​
Job status: You can check the status of the job using bacalhau job list
.
When it says Published
or Completed
, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe
.
Job download: You can download your job results directly by using bacalhau job get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results
) and downloaded our job output to be stored in that directory.
Viewing your Job Output​
To view the results open the results/outputs/
folder.
Support​
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Was this helpful?