This directory contains examples relating to data engineering workloads. The goal is to provide a range of examples that show you how to work with Bacalhau in a variety of use cases.
Converting from CSV to Parquet or Avro reduces file size and allows for faster read and write speeds. With Bacalhau, you can convert your CSV files stored on IPFS or on the web without the need to download files and install dependencies locally.
In this example tutorial, we will convert a CSV file from a URL to Parquet format and save the converted Parquet file to IPFS.
To get started, you need to install the Bacalhau client; see more information here.
Let's download the transactions.csv file. You can use the CSV files from here:
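A minimal sketch of the download step (the URL below is a placeholder; substitute the actual location of the file):

```bash
# Download the CSV locally; replace the URL with the real location of transactions.csv
wget -O transactions.csv "https://example.com/path/to/transactions.csv"

# Quick sanity check on the first few rows
head -n 5 transactions.csv
```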
Write the converter.py Python script, which serves as a CSV converter to Avro or Parquet formats:
You can find out more information about converter.py here.
In our case:
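For illustration, the converter takes an input path, an output path, and a target format (the argument order matches the Bacalhau command used later in this guide); a hypothetical local run might look like:

```bash
# Convert a local CSV to Parquet: <input csv> <output file> <parquet|avro>
python3 converter.py transactions.csv transactions.parquet parquet
```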
You can skip this section entirely and directly go to Running on Bacalhau
To build your own Docker container, create a Dockerfile, which contains instructions to build your image.
See more information on how to containerize your script/app here
We will run the docker build command to build the container:
Before running the command, replace:
hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.
repo-name with the name of the container; you can name it anything you want.
tag: this is not required, but you can use the latest tag.
In our case:
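For example, a build command with placeholder values might look like this (the image used later in this guide is jsacex/csv-to-arrow-or-parquet):

```bash
# Build the image; replace hub-user with your Docker Hub username
docker build -t hub-user/csv-to-arrow-or-parquet:latest .
```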
Next, upload the image to the registry. This can be done by using the Docker Hub username, repo name or tag.
In our case:
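For example, a push with the same placeholder values:

```bash
# Push the image to Docker Hub; replace hub-user with your username
docker push hub-user/csv-to-arrow-or-parquet:latest
```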
With the command below, we are mounting the CSV file for transactions from IPFS:
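Assembled from the breakdown that follows, the command looks roughly like this:

```bash
bacalhau docker run \
  -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W \
  jsacex/csv-to-arrow-or-parquet \
  -- python3 src/converter.py ../inputs/transactions.csv /outputs/transactions.parquet parquet
```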
Let's look closely at the command above:
bacalhau docker run: call to Bacalhau
-i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W: CIDs to use on the job. Mounts them at '/inputs' in the execution.
jsacex/csv-to-arrow-or-parquet: the name and the tag of the Docker image we are using
../inputs/transactions.csv: path to the input dataset
/outputs/transactions.parquet parquet: path to the output and the output format
python3 src/converter.py: execute the script
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
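One way to capture the id at submission time, assuming your CLI version supports the --id-only flag:

```bash
# Submit the job and keep only the job id
export JOB_ID=$(bacalhau docker run --id-only \
  -i ipfs://QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W \
  jsacex/csv-to-arrow-or-parquet \
  -- python3 src/converter.py ../inputs/transactions.csv /outputs/transactions.parquet parquet)

echo "Submitted job: $JOB_ID"
```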
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml format, e.g. convertcsv.yaml, and then run with the command:
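A rough sketch of what such a spec and run command can look like; exact field names vary between Bacalhau versions, so treat the YAML as illustrative:

```bash
cat > convertcsv.yaml <<'EOF'
Name: Convert CSV to Parquet
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: jsacex/csv-to-arrow-or-parquet
        Parameters:
          - python3
          - src/converter.py
          - ../inputs/transactions.csv
          - /outputs/transactions.parquet
          - parquet
    InputSources:
      - Target: /inputs
        Source:
          Type: ipfs
          Params:
            CID: QmTAQMGiSv9xocaB4PUCT5nSBHrf9HZrYj21BAZ5nMTY2W
EOF

# Submit the declarative job
bacalhau job run convertcsv.yaml
```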
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
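A sketch of the commands described above, assuming JOB_ID holds the id printed at submission and that your CLI version supports the --output-dir flag:

```bash
# Check job status
bacalhau job list

# Inspect the job in detail
bacalhau job describe $JOB_ID

# Download the results into a local directory
mkdir -p results
bacalhau job get $JOB_ID --output-dir results
```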
To view the file, run the following command:
Alternatively, you can do this:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
In this example tutorial, we will show you how to use Bacalhau to process images on a Landsat dataset.
Bacalhau has the unique capability of operating at a massive scale in a distributed environment. This is made possible because data is naturally sharded across the IPFS network amongst many providers. We can take advantage of this to process images in parallel.
To get started, you need to install the Bacalhau client; see more information here.
To submit a workload to Bacalhau, we will use the bacalhau docker run command. This command allows you to pass an input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.
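Putting these pieces together, and reconstructing from the breakdown that follows, the command looks roughly like this:

```bash
bacalhau docker run \
  -i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1 \
  --entrypoint mogrify \
  dpokidov/imagemagick:7.1.0-47-ubuntu \
  -- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg'
```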
Let's look closely at the command above:
bacalhau docker run: call to Bacalhau
-i src=s3://landsat-image-processing/*,dst=/input_images,opt=region=us-east-1: specifies the input data, which is stored in S3
--entrypoint mogrify: overrides the default ENTRYPOINT of the image, indicating that the mogrify utility from the ImageMagick package will be used instead of the default entry
dpokidov/imagemagick:7.1.0-47-ubuntu: the name and the tag of the Docker image we are using
-- -resize 100x100 -quality 100 -path /outputs '/input_images/*.jpg': these arguments are passed to mogrify and specify operations on the images: resizing to 100x100 pixels, setting quality to 100, and saving the results to the /outputs folder
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml format, e.g. image.yaml, and then run with the command:
Job status: You can check the status of the job using bacalhau job list:
When it says Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe:
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
To view the images, open the results/outputs/ folder:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
The Surface Ocean CO₂ Atlas (SOCAT) contains measurements of the fugacity of CO₂ in seawater around the globe. But to calculate how much carbon the ocean is taking up from the atmosphere, these measurements need to be converted to the partial pressure of CO₂. We will convert the units by combining measurements of the surface temperature and fugacity. Python libraries (xarray, pandas, numpy) and the pyseaflux package facilitate this process.
In this example tutorial, our focus will be on running the oceanography dataset with Bacalhau, where we will investigate the data and convert the workload. This will enable the execution on the Bacalhau network, allowing us to leverage its distributed storage and compute resources.
To get started, you need to install the Bacalhau client; see more information here.
For the purposes of this example, we will use the SOCATv2022 dataset in the "Gridded" format from the SOCAT website and long-term global sea surface temperature data from NOAA; information about that dataset can be found here.
Next, let's write the requirements.txt. This file will also be used by the Dockerfile to install the dependencies.
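A minimal sketch based on the libraries named above; the original example may pin versions or include additional packages:

```bash
cat > requirements.txt <<'EOF'
xarray
pandas
numpy
pyseaflux
EOF
```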
We can see that the dataset contains latitude-longitude coordinates, the date, and a series of seawater measurements. Below is a plot of the average sea surface temperature (SST) between 2010 and 2020, where data have been collected by buoys and vessels.
To convert the data from fugacity of CO₂ (fCO₂) to partial pressure of CO₂ (pCO₂), we will combine the measurements of the surface temperature and fugacity. The conversion is performed by the pyseaflux package.
Let's create a new file called main.py and paste the following script in it:
This code loads and processes SST and SOCAT data, combines them, computes pCO2, and saves the results for further use.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y.
We will create a Dockerfile and add the desired configuration to the file. These commands specify how the image will be built, and what extra requirements will be included.
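A minimal sketch of such a Dockerfile, assuming a Python base image and the requirements.txt and main.py files created above:

```bash
cat > Dockerfile <<'EOF'
FROM python:3.10-slim

WORKDIR /project

# Install the Python dependencies first so the layer is cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis script into the image
COPY main.py .

CMD ["python", "main.py"]
EOF
```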
We will run the docker build command to build the container:
Before running the command, replace:
hub-user with your Docker Hub username. If you don't have a Docker Hub account, follow these instructions to create one, and use the username of the account you created.
repo-name with the name of the container; you can name it anything you want.
tag: this is not required, but you can use the latest tag.
Now you can push this repository to the registry designated by its name or tag.
For more information about working with custom containers, see the custom containers example.
Now that we have the data in IPFS and the Docker image pushed, the next step is to run a job using the bacalhau docker run command.
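Reconstructed from the breakdown below, the command looks roughly like this:

```bash
bacalhau docker run \
  --input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y \
  ghcr.io/bacalhau-project/examples/socat:0.0.11 \
  -- python main.py
```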
Let's look closely at the command above:
bacalhau docker run: call to Bacalhau
--input ipfs://bafybeidunikexxu5qtuwc7eosjpuw6a75lxo7j5ezf3zurv52vbrmqwf6y: CIDs to use on the job. Mounts them at '/inputs' in the execution.
ghcr.io/bacalhau-project/examples/socat:0.0.11: the name and the tag of the image we are using
python main.py: execute the script
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml format, e.g. ocean.yaml, and then run with the command:
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Use Bacalhau to process logs remotely across clouds and land the results on BigQuery.
This example demonstrates how to build a sophisticated distributed log processing pipeline using Bacalhau, Google BigQuery, and DuckDB. You'll learn how to process and analyze logs across distributed nodes, with progressively more advanced techniques for handling, sanitizing, and aggregating log data.
The combination of Bacalhau and BigQuery offers several key advantages:
Process logs directly where they are generated, eliminating the need for centralized collection
Scale processing across multiple nodes in different cloud providers
Leverage BigQuery's powerful analytics capabilities for processed data
Implement privacy-conscious data handling and efficient aggregation strategies
Through this example, you'll evolve from basic log collection to implementing a production-ready system with privacy protection and smart aggregation. Whether you're handling application logs, system metrics, or security events, this pipeline provides a robust foundation for distributed log analytics.
The Bacalhau CLI installed
A Google Cloud Project with BigQuery enabled
Service account credentials with BigQuery access
A running Bacalhau cluster with nodes across different cloud providers
Follow the Bacalhau cluster setup instructions and ensure nodes are properly configured across your cloud providers (AWS, GCP, Azure). You can see more about setting up your nodes in the Bacalhau documentation.
Your nodes must have at least the following configured to use the jobs below unchanged (in addition to anything else you may require).
DuckDB Processing: Process and analyze data using DuckDB's SQL capabilities
BigQuery Integration: Store processed results in Google BigQuery for further analysis
Make sure you have configured your config file to have the correct BigQuery project and dataset. You can do this by copying the config.yaml.example file to config.yaml and editing the values to match your BigQuery project and dataset.
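For example, starting from the provided template:

```bash
cp config.yaml.example config.yaml
# Edit config.yaml and set your BigQuery project id and dataset name
```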
Ensure your Google Cloud service account has these roles:
BigQuery Data Editor
BigQuery Job User
Have your service account key file (JSON format) ready
Configure your BigQuery settings:
Create a dataset for log analytics
Note your project ID and dataset name
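If you prefer the command line, the dataset can be created with the bq CLI; the project and dataset names below are placeholders:

```bash
# Create a dataset for log analytics (replace the project and dataset names)
bq mk --dataset --location=US my-project-id:log_analytics
```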
We have provided some utility scripts to help you set up your BigQuery project and tables. You can run the following commands to set up your project and tables if you haven't already:
Interactive setup for your BigQuery project and tables. It will walk you through creating the necessary BigQuery resources.
Creates sample tables in BigQuery for testing.
Checks the permissions of the service account specified in log_uploader_credentials.json to ensure it has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.
Confirms your BigQuery project and dataset, and creates the tables if they don't exist with the correct schema. This will also zero out the tables if they already exist, so be careful! (Useful for debugging)
Distributes the credentials to /bacalhau_data on all nodes in a Bacalhau network.
Ensures the service account specified in log_uploader_credentials.json has the necessary permissions to write to BigQuery tables from the Bacalhau nodes.
One more thing to set up is the log faker on the nodes. This will generate logs for you to work with. You can run the following command to start the log faker:
Give it a couple of minutes to start up and then you can start processing the logs.
Let's walk through each stage of the demo, seeing how we can progressively improve our data processing pipeline!
Let's start by looking at the raw logs.
That will print out the logs to stdout, which we can then read from the job.
After running the job, you will see a job id, something like this:
When you run the describe command, you will see the details of the job, including the log output.
Now let's upload the raw logs to BigQuery. This is the simplest approach - just get the data there:
This will upload the Python script to all the nodes, which in turn will upload the raw logs from all nodes to BigQuery. When you check BigQuery, you'll see:
Millions of rows uploaded (depends on how many nodes you have and how long you let it run)
Each log line as raw text
No structure or parsing
To query the logs, you can use the following SQL:
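For example, a query along these lines (with placeholder project and dataset names, and assuming the raw lines land in the message column of the log_results table described in the appendix) shows the most recent rows:

```bash
bq query --use_legacy_sql=false '
SELECT timestamp, nodeName, message
FROM `my-project-id.log_analytics.log_results`
ORDER BY timestamp DESC
LIMIT 20'
```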
Now let's do something more advanced, by parsing those logs into structured data before upload:
Your logs are now parsed into fields like:
IP Address
Timestamp
HTTP Method
Endpoint
Status Code
Response Size
To query the logs, you can use the following SQL:
Now let's handle the data responsibly by sanitizing PII (like IP addresses):
This:
Zeros out the last octet of IPv4 addresses
Zeros out the last 64 bits of IPv6 addresses
Maintains data utility while ensuring compliance
Again, to query the logs, you can use the following SQL:
Notice that the IP addresses are now sanitized.
Finally, let's be smart about what we upload:
This creates two streams:
Aggregated normal logs:
Grouped in 5-minute windows
Counts by status code
Average response sizes
Total requests per endpoint
Real-time emergency events:
Critical errors
Security alerts
System failures
To query the logs, you can use the following SQL:
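For example, against the log_aggregates table described in the appendix (project and dataset names are placeholders):

```bash
bq query --use_legacy_sql=false '
SELECT time_window, nodeName, log_count
FROM `my-project-id.log_analytics.log_aggregates`
ORDER BY time_window DESC
LIMIT 20'
```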
Now that you've seen the power of distributed processing with Bacalhau:
Try processing your own log files
Experiment with different aggregation windows
Add your own privacy-preserving transformations
Scale to even more nodes!
Remember: The real power comes from processing data where it lives, rather than centralizing everything first. Happy distributed processing! 🚀
log_results (Main Table):
project_id: STRING - Project identifier
region: STRING - Deployment region
nodeName: STRING - Node name
timestamp: TIMESTAMP - Event time
version: STRING - Log version
message: STRING - Log content
sync_time: TIMESTAMP - Upload time
remote_log_id: STRING - Original log ID
hostname: STRING - Source host
public_ip: STRING - Sanitized public IP
private_ip: STRING - Internal IP
alert_level: STRING - Event severity
provider: STRING - Cloud provider

log_aggregates (5-minute windows):
project_id: STRING - Project identifier
region: STRING - Deployment region
nodeName: STRING - Node name
provider: STRING - Cloud provider
hostname: STRING - Source host
time_window: TIMESTAMP - Aggregation window
log_count: INT64 - Events in window
messages: ARRAY - Event details

emergency_logs (Critical Events):
project_id: STRING - Project identifier
region: STRING - Deployment region
nodeName: STRING - Node name
provider: STRING - Cloud provider
hostname: STRING - Source host
timestamp: TIMESTAMP - Event time
version: STRING - Log version
message: STRING - Alert details
remote_log_id: STRING - Original log ID
alert_level: STRING - Always "EMERGENCY"
public_ip: STRING - Sanitized public IP
private_ip: STRING - Internal IP
Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even more in my experience) and importing and exporting data from the Ethereum node is slow.
For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can now access the data without having to run another Ethereum node.
But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.
In this tutorial example, we will run Ethereum-ETL tool on Bacalhau to extract data from an Ethereum node.
To get started, you need to install the Bacalhau client; see more information here.
First let's download one of the IPFS files and inspect it locally:
You can see the full list of IPFS CIDs in the appendix at the bottom of the page.
If you don't already have the Pandas library, let's install it:
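A quick way to do that:

```bash
pip install pandas
```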
The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.
This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.
To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.
But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.
Next, let's make sure the file works as expected:
And finally, package the code inside a Docker image to make the process reproducible. Here I'm passing the Bacalhau default /inputs and /outputs directories. The /inputs directory is where the data will be read from and the /outputs directory is where the results will be saved to.
We've already pushed the container, but for posterity, the following command pushes this container to GHCR.
To run our analysis on the Ethereum blockchain, we will use the bacalhau docker run command.
The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on.
Bacalhau also mounts a data volume to store output data. The bacalhau docker run command creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job.
The job description should be saved in .yaml format, e.g. blockchain.yaml, and then run with the command:
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
To view the file, run the following command:
To view the images, we will use glob to return all file paths that match a specific pattern.
Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.
See the appendix for the hashes.txt file.
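A sketch of that loop, assuming hashes.txt contains one CID per line; the image name is a placeholder for the container built earlier:

```bash
# Set IMAGE to the analysis container you built and pushed above (placeholder value)
IMAGE=ghcr.io/your-org/ethereum-analysis:latest

# Submit one job per data chunk and record the job ids
rm -f job_ids.txt
while read -r cid; do
  bacalhau docker run --id-only -i "ipfs://${cid}:/inputs" "$IMAGE" >> job_ids.txt
done < hashes.txt
```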
Now take a look at the job IDs. You can use these to check the status of the jobs and download the results:
You might want to double-check that the jobs ran ok by doing a bacalhau job list.
Wait until all of these jobs have been completed, then download all the results and merge them into a single directory. This might take a while, so this is a good time to treat yourself to a nice Dark Mild. There have also been some issues in the past communicating with IPFS, so if you get an error, try again.
To view the images, we will use glob to return all file paths that match a specific pattern.
That's it! There are several years of Ethereum transaction volume data.
The following is a list of IPFS CIDs for the Ethereum data that we used in this tutorial. You can use these CIDs to download the rest of the chain if you so desire. The CIDs are ordered by block number and increase 50,000 blocks at a time:
DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.
DuckDB is suited for the following use cases:
Processing and storing tabular datasets, e.g. from CSV or Parquet files
Interactive data analysis, e.g. joining and aggregating multiple large tables
Concurrent large changes to multiple large tables, e.g. appending rows, adding/removing/updating columns
Large result set transfer to client
In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don't need to install DuckDB locally, and there is no need to download the datasets since they are already available on the server you are looking to run the query on.
To get started, you need to install the Bacalhau client; see more information here.
We will be using a Bacalhau cluster with the standard setup.
We will also need to have a server with some data to run the query on. In this example, we will use a server with the Yellow Taxi Trips dataset.
If you do not already have this data on your server, you can download it using the scripts in the prep_data directory. The command to download the data is ./prep_data/run_download_jobs.sh, and you must have the /bacalhau_data directory on your server.
To submit a job, run the following Bacalhau command:
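Reconstructed from the breakdown that follows, the command looks roughly like this:

```bash
bacalhau docker run \
  -e QUERY="select 1" \
  docker.io/bacalhauproject/duckdb:latest
```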
This is a simple query that will return a single row with a single column - but the query will be executed in DuckDB, on a remote server.
Let's look closely at the command above:
bacalhau docker run: call to Bacalhau
-e QUERY="select 1": the query to execute
docker.io/bacalhauproject/duckdb:latest: the name and the tag of the Docker image we are using
When a job is submitted, Bacalhau runs the query in DuckDB, and returns the results to the client.
After we run it, when we describe the job, we can see the following in standard output:
What if you didn't want to run everything on the command line? You can use a YAML file to define the job. In simple_query.sql, we have a simple query that will return the number of rows in the dataset.
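A sketch of what simple_query.sql could contain; the path assumes the Parquet file used later in this guide, and the real file in the example may differ:

```bash
# Keep the query on a single line (see the note about templating below)
echo "SELECT COUNT(*) AS row_count FROM read_parquet('/bacalhau_data/yellow_tripdata_2020-02.parquet');" > simple_query.sql
```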
To run this query, we can use the following YAML file:
Though this looks like a lot of code, it is actually quite simple. The Tasks section defines the task to run, and the InputSources section defines the input dataset. The Publisher section defines where the results will be published, and the Resources section defines the resources required for the job.
All the work is done in the environment variables, which are passed to the Docker image, and handed to DuckDB to execute the query.
To run this query, we can use the following command:
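Assembled from the breakdown that follows:

```bash
bacalhau job run duckdb_query_job.yaml \
  --template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet" \
  --template-vars="QUERY=$(cat simple_query.sql)"
```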
This breaks down into the following steps:
bacalhau job run: call to Bacalhau
duckdb_query_job.yaml: the YAML file we are using
--template-vars="filename=/bacalhau_data/yellow_tripdata_2020-02.parquet": the file to read
--template-vars="QUERY=$(cat simple_query.sql)": the query to execute
When we run this, we get back the following simple output:
Let's say we want to run a more complex query. In window_query_simple.sql, we have a query that will return the average number of rides per 5 minute interval.
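A sketch of such a query, written on a single line (see the note below) and assuming the standard tpep_pickup_datetime column of the Yellow Taxi dataset:

```bash
echo "SELECT AVG(rides) AS avg_rides_per_5min FROM (SELECT time_bucket(INTERVAL '5 minutes', tpep_pickup_datetime) AS window_start, COUNT(*) AS rides FROM read_parquet('/bacalhau_data/yellow_tripdata_2020-02.parquet') GROUP BY window_start) t;" > window_query_simple.sql
```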
When we run this, we get back the following output:
The SQL file needs to be written on a single line; otherwise, the line breaks will cause issues with the templating. We're working on improving this!
With this structure, you can now run virtually any query you want on remote servers, without ever having to download the data. Welcome to compute over data by Bacalhau!
If you get stuck or have questions:
The bacalhau docker run command allows passing an input data volume with the --input or -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.
The same job can be presented in the declarative format. In this case, the description will look like this:
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).
Check out the Bacalhau documentation
Open an issue in our GitHub repository
Join our Slack community
Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.
To get started, you need to install the Bacalhau client; see more information here.
The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like NFT.storage or Pinata. Once registered you can use their UI or API or SDKs to upload files.
This resulted in the IPFS CID of Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j.
To submit a workload to Bacalhau, we will use the bacalhau docker run command. The command allows one to pass an input data volume with a -i ipfs://CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.
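Reconstructed from the breakdown that follows, the command looks roughly like this:

```bash
bacalhau docker run \
  -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j \
  linuxserver/ffmpeg \
  -- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'
```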
Let's look closely at the command above:
bacalhau docker run: call to Bacalhau
-i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j: CIDs to use on the job. Mounts them at '/inputs' in the execution.
linuxserver/ffmpeg: the name of the Docker image we are using to resize the videos
-- bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}': the command that will be executed inside the container. It uses find to locate all files with the extension ".mp4" within /inputs and then uses ffmpeg to resize each found file to 72 pixels in height, saving the results in the /outputs folder.
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
Bacalhau overwrites the default entrypoint, so we must run the full command after the -- argument. In this line you will list all of the mp4 files in the /inputs directory and execute ffmpeg against each instance.
The same job can be presented in the declarative format. In this case, the description will look like this:
The job description should be saved in .yaml format, e.g. video.yaml, and then run with the command:
Job status: You can check the status of the job using bacalhau job list.
When it says Published or Completed, that means the job is done, and we can get the results.
Job information: You can find out more information about your job by using bacalhau job describe.
Job download: You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (results) and downloaded our job output to be stored in that directory.
To view the results, open the results/outputs/ folder.
If you have questions or need support or guidance, please reach out to the Bacalhau team via Slack (#general channel).