
Parallel Video Resizing via File Sharding


Many data engineering workloads are embarrassingly parallel: you want to run a simple operation over a large number of files. In this notebook, we will use Bacalhau's sharding functionality to run a simple video filter over a large number of video files.

Although you would normally use your own container and script to make your workloads reproducible, in this example we use a pre-built container and CLI arguments so that you can make changes easily. You can find the container on Docker Hub.

Prerequisites

Make sure you have the latest bacalhau client installed by following the getting started instructions, or use the installation command below (which installs Bacalhau locally for this notebook).

!command -v bacalhau >/dev/null 2>&1 || (export BACALHAU_INSTALL_DIR=.; curl -sL https://get.bacalhau.org/install.sh | bash)
path=!echo $PATH
%env PATH=./:{path[0]}
Your system is linux_amd64
No BACALHAU detected. Installing fresh BACALHAU CLI...
Getting the latest BACALHAU CLI...
Installing v0.3.15 BACALHAU CLI...
Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz ...
Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into . successfully.
Client Version: v0.3.15
Server Version: v0.3.15
env: PATH=./:/home/gitpod/.pyenv/versions/3.8.13/bin:/home/gitpod/.pyenv/libexec:/home/gitpod/.pyenv/plugins/python-build/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/bin:/home/gitpod/.pyenv/plugins/pyenv-update/bin:/home/gitpod/.pyenv/plugins/pyenv-installer/bin:/home/gitpod/.pyenv/plugins/pyenv-doctor/bin:/home/gitpod/.pyenv/shims:/ide/bin/remote-cli:/home/gitpod/.nix-profile/bin:/home/gitpod/.local/bin:/home/gitpod/.sdkman/candidates/maven/current/bin:/home/gitpod/.sdkman/candidates/java/current/bin:/home/gitpod/.sdkman/candidates/gradle/current/bin:/workspace/.cargo/bin:/home/gitpod/.rvm/gems/ruby-3.1.2/bin:/home/gitpod/.rvm/gems/ruby-3.1.2@global/bin:/home/gitpod/.rvm/rubies/ruby-3.1.2/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/shims:/home/gitpod/.pyenv/shims:/workspace/go/bin:/home/gitpod/.nix-profile/bin:/ide/bin/remote-cli:/home/gitpod/go/bin:/home/gitpod/go-packages/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.yarn/bin:/home/gitpod/.pnpm:/home/gitpod/.pyenv/bin:/workspace/.rvm/bin:/home/gitpod/.cargo/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin/:/home/gitpod/.local/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.rvm/bin
bacalhau version
Client Version: v0.3.15
Server Version: v0.3.15

Submit the workload

To submit a workload to Bacalhau you can use the bacalhau docker run command. This allows you to pass an input data volume with a -v CID:path argument, just like Docker, except that the left-hand side of the argument is a content identifier (CID). Bacalhau then mounts that data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau also mounts a data volume to store output data. By default, bacalhau docker run creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job. See below for an example.
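
For illustration, here is a minimal sketch (separate from the video example below) that mounts a CID at /inputs and writes a file listing to /outputs. The CID, image, and command are placeholders, not part of this example's dataset:

bacalhau docker run \
-v <CID>:/inputs \
ubuntu -- \
bash -c 'ls /inputs > /outputs/files.txt'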

To shard across the files in the input directory, we need to pass three (optional) arguments to the command:

  • --sharding-base-path - the path to the directory you want to shard over
  • --sharding-glob-pattern - the pattern used to match files in that directory
  • --sharding-batch-size - the number of files to pass into each job

A Simple Video Resize Example

In this example, you will create 72px-high video thumbnails for all the videos in the inputs directory (the scale=-1:72 filter sets the height to 72 pixels and scales the width proportionally). The outputs directory will contain a thumbnail for each video. We will shard with one video per job and use the linuxserver/ffmpeg container to resize the videos.

Note that Bacalhau overrides the default container entrypoint, so the full command must come after the -- argument. The command below lists all of the MP4 files in the /inputs directory and runs ffmpeg against each one.

bacalhau docker run \
--wait \
--wait-timeout-secs 100 \
--id-only \
--sharding-base-path "/inputs" \
--sharding-glob-pattern "*.mp4" \
--sharding-batch-size 1 \
-v Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
linuxserver/ffmpeg -- \
bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'

Because we passed --id-only, the command prints just the job ID. The notebook captures that output in the job_id variable and exports it as an environment variable so the following cells can reference the job.

%env JOB_ID={job_id}
env: JOB_ID=c1ebae42-32b7-4d33-9885-704c7e6253b5
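
Before downloading the results, it is worth confirming that the job (and each of its shards) has finished. Assuming the standard Bacalhau CLI, a command along these lines reports the job state (the exact flags may vary between versions):

bacalhau list --id-filter ${JOB_ID}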

Get Results

Now let's download and display the results. We can use the bacalhau get command to download the contents of the job's output data volume. The --output-dir argument specifies the directory to download the results into.

mkdir -p ./results # Temporary directory to store the results
bacalhau get --output-dir ./results ${JOB_ID} # Download the results
Fetching results of job 'c1ebae42-32b7-4d33-9885-704c7e6253b5'...
Results for job 'c1ebae42-32b7-4d33-9885-704c7e6253b5' have been written to...
./results
# Copy the files to the local directory, to allow the documentation scripts to copy them to the right place
cp results/combined_results/outputs/* ./ && rm -rf results/combined_results/outputs/*
# Remove any spaces from the filenames
for f in *\ *; do mv "$f" "${f// /_}"; done
import glob
from IPython.display import Video, display

# Display each resized video downloaded from the job's outputs
for file in glob.glob('*.mp4'):
    display(Video(filename=file))