It is possible to run Bacalhau completely disconnected from the main Bacalhau network, so that you can run private workloads without risking execution on public nodes or inadvertently sharing your data outside of your organization. The isolated network will not connect to the public Bacalhau network or any other public network. To do this, we will run our network in-process rather than externally.
:::info A private network and storage are easier to set up, but a separate public server is better for production. The private network and storage use a temporary directory for their repository, so the contents will be lost on shutdown. :::
The first step is to start up the initial node, which we will use as the requester node. This node will connect to nothing but will listen for connections.
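As a sketch, starting the requester looks like the following (the `--node-type requester` flag is quoted later in this guide; any additional flags depend on your setup):

```shell
# Start the first node as the requester. It dials nothing,
# but listens for incoming connections from compute nodes.
bacalhau serve --node-type requester
```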
This will produce output similar to this:
To connect another node to this private one, run the following command in your shell:
:::tip The exact command will be different on each computer and is output by the `bacalhau serve --node-type requester ...` command. :::
The `bacalhau serve --private-internal-ipfs --peer ...` command starts up a compute node and adds it to the cluster.
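A hedged sketch of joining a second machine; the `--peer` multiaddress below is a placeholder, and the real value is printed by the requester node at startup:

```shell
# Join the private cluster as a compute node.
# The --peer value is a placeholder; copy the exact command the
# requester printed under "To connect another node to this private one...".
bacalhau serve --node-type compute --private-internal-ipfs \
  --peer /ip4/192.0.2.10/tcp/1235/p2p/QmPlaceholderPeerID
```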
To use this cluster from the client, run the following commands in your shell:
:::tip The exact command will be different on each computer and is output by the `bacalhau serve --node-type requester ...` command. :::
The `export BACALHAU_IPFS_SWARM_ADDRESSES=...` command configures your client to send jobs into the cluster from the command line.
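A sketch of the client-side setup, assuming placeholder values; the real values are printed by the requester at startup (`BACALHAU_API_HOST` and `BACALHAU_API_PORT` tell the client which requester API to talk to):

```shell
# Placeholder values -- use the ones printed by your requester node.
export BACALHAU_IPFS_SWARM_ADDRESSES=/ip4/192.0.2.10/tcp/1235/p2p/QmPlaceholderPeerID
export BACALHAU_API_HOST=192.0.2.10
export BACALHAU_API_PORT=1234
```

With these set, any `bacalhau docker run ...` invocation from this shell targets the private cluster instead of the public network.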
Instructions for connecting to the public IPFS network via the private Bacalhau cluster:
On all nodes, start ipfs:
Then run the following command in your shell:
On the first node execute the following:
Monitor the output log for: 11:16:03.827 | DBG pkg/transport/bprotocol/compute_handler.go:39 > ComputeHandler started on host QmWXAaSHbbP7mU4GrqDhkgUkX9EscfAHPMCHbrBSUi4A35
On all other nodes execute the following:
Replace the values in the command above with your own values.
Here is our example:
Then from any client set the following before invoking your Bacalhau job:
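For example (the multiaddress is a placeholder; use your node's IPFS swarm address from the steps above):

```shell
# Point the client at the cluster's IPFS swarm (placeholder address).
export BACALHAU_IPFS_SWARM_ADDRESSES=/ip4/203.0.113.5/tcp/4001/p2p/QmPlaceholderPeerID
```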
A private cluster is a network of Bacalhau nodes completely isolated from any public node. That means you can safely process private jobs and data on your cloud or on-premise hosts!
1. Install Bacalhau with `curl -sL https://get.bacalhau.org/install.sh | bash` on every host
2. Run `bacalhau serve` on one host only; this will be our "bootstrap" machine
3. Copy and paste the command it outputs under the "To connect another node to this private one, run the following command in your shell..." line to the other hosts
4. Copy and paste the env vars it outputs under the "To use this requester node from the client, run the following commands in your shell..." line to a client machine
5. Run `bacalhau docker run ubuntu echo hello` on the client machine
6. Optionally, set up systemd units to make the Bacalhau daemons permanent; here's an example systemd service file.
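A minimal sketch of such a unit, assuming Bacalhau is installed at `/usr/local/bin/bacalhau` (paths and flags are illustrative; adjust them for your hosts):

```ini
[Unit]
Description=Bacalhau daemon (sketch; adjust flags per host role)
After=network-online.target

[Service]
# On the bootstrap host this is plain `bacalhau serve`; on other hosts,
# use the join command the bootstrap machine printed.
ExecStart=/usr/local/bin/bacalhau serve
Restart=on-failure

[Install]
WantedBy=multi-user.target
```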
Please contact us on the #bacalhau Slack channel for questions and feedback!
Good news: spinning up a private cluster is really a piece of cake!
That's all folks!
Bacalhau uses libp2p under the hood to communicate with other nodes on the network.
Because Bacalhau is built using libp2p, the concept of peer identity is used to identify nodes on the network.
When you start a Bacalhau node using `bacalhau serve`, it will look for an RSA private key in the `~/.bacalhau` directory. If it doesn't find one, it will generate a new one and save it there.
You can override the directory where the private key is stored using the `BACALHAU_PATH` environment variable.
Private keys are named after the port used for the libp2p connection, which defaults to 1235. By default, when first starting a node, the private key will be stored in `~/.bacalhau/private_key.1235`.
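For example, to keep the repository (and therefore the private key) in a custom location (the path below is illustrative):

```shell
# Subsequent `bacalhau serve` invocations will read and write the
# repository, including private_key.1235, under this directory.
export BACALHAU_PATH=/var/lib/bacalhau
```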
The peer identity is derived from the private key and is used to identify the node on the network. You can get the peer identity of a node by running `bacalhau id`:
By default, running `bacalhau serve` will connect to the following nodes (which are the default bootstrap nodes run by Protocol Labs):
Bacalhau uses libp2p multiaddresses to identify nodes on the network.
If you want to connect to other nodes, and you know their Peer IDs, you can use the `--peer` flag to specify additional peers to connect to (comma-separated list).
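A sketch of connecting to two known peers at startup (the multiaddresses are placeholders):

```shell
# Dial two known peers on startup; values are illustrative.
bacalhau serve \
  --peer /ip4/192.0.2.10/tcp/1235/p2p/QmPeerOne,/ip4/192.0.2.11/tcp/1235/p2p/QmPeerTwo
```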
If you want to connect to a requester node, and you know its IP but not its Peer ID, you can use the following, which will contact the requester API directly and ask for the current Peer ID instead.
The default port the libp2p swarm listens on is 1235.
You can configure the swarm port using the `--port` flag:
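For example, to listen for libp2p connections on a non-default swarm port:

```shell
# Use swarm port 1236 instead of the default 1235. The private key
# will then be stored as private_key.1236.
bacalhau serve --port 1236
```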
To ensure that the node can communicate with other nodes on the network, make sure the swarm port is open and accessible by other nodes.
The Bacalhau node exposes a REST API that can be used to query the node for information.
The default port the REST API listens on is 1234.
The default network interface the REST API listens on is 0.0.0.0.
You can configure the REST API port using the `--api-port` flag:
You can also configure which network interface the REST API will bind to using the `--host` flag:
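A combined sketch of both flags (values are illustrative):

```shell
# Serve the REST API on port 8080, bound to the loopback interface only,
# so the API is not reachable from other machines.
bacalhau serve --api-port 8080 --host 127.0.0.1
```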
:::tip You can use the `--host` flag to restrict network access to the REST API. :::
You can call `http://dashboard.bacalhau.org:1000/api/v1/run` with the POST body as a JSON serialized spec.
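A hedged sketch of such a call with `curl`; the payload file name is a placeholder, and the actual job spec schema is not reproduced here:

```shell
# POST a JSON-serialized job spec to the endpoint above.
# job-spec.json is a placeholder for your serialized spec.
curl -X POST http://dashboard.bacalhau.org:1000/api/v1/run \
  -H 'Content-Type: application/json' \
  -d @job-spec.json
```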
Once you run the command above, you'll get a CID output:
By default, Bacalhau jobs do not have any access to the internet. This is to keep both compute providers and users safe from malicious activities.
However, by using data volumes you can read and access your data from within jobs and write back results.
When you submit a Bacalhau job, you'll need to specify the internet locations to download data from and write results to. Both Docker and WebAssembly jobs support these features.
When submitting a Bacalhau job, you can specify the CID (Content IDentifier) or HTTP(S) URL to download data from. The data will be retrieved before the job starts and made available to the job as a directory on the filesystem. When running Bacalhau jobs, you can specify as many CIDs or URLs as needed using `--input`, which is accepted by both `bacalhau docker run` and `bacalhau wasm run`. See command line flags for more information.
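A sketch of mounting two inputs (the CID, URL, and mount paths are placeholders; check `bacalhau docker run --help` for the exact `--input` syntax on your version):

```shell
# Mount an IPFS CID and a URL as read-only inputs before the job starts.
bacalhau docker run \
  --input ipfs://QmExampleCID \
  --input https://example.com/dataset.csv \
  ubuntu ls /inputs
```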
You can write back results from your Bacalhau jobs to your public storage location. By default, jobs will write results to the storage provider using the `--publisher` command line flag. See command line flags on how to configure this.
To use these features, the data to be downloaded has to be known before the job starts. For some workloads the required data is computed as part of the job, for example when the purpose of the job is to process web results. In these cases, networking may be possible during job execution.
To run Docker jobs on Bacalhau that access the internet, you'll need to specify one of the following:
- `full`: unfiltered networking for any protocol (`--network=full`)
- `http`: HTTP(S)-only networking to a specified list of domains (`--network=http`)
- `none`: no networking at all, the default (`--network=none`)
:::tip Specifying `none` will still allow Bacalhau to download and upload data before and after the job. :::
Jobs using `http` must specify the domains they want to access when the job is submitted. When the job runs, only HTTP requests to those domains will be possible, and data transfer will be rate limited to 10 Mbit/s in either direction to prevent DDoS.
Jobs will be provided with `http_proxy` and `https_proxy` environment variables, which contain a TCP address of an HTTP proxy to connect through. Most tools and libraries will use these environment variables by default. If not, they must be used by user code to configure HTTP proxy usage.
The required networking can be specified using the `--network` flag. For `http` networking, the required domains can be specified using the `--domain` flag, multiple times for as many domains as required. Specifying a domain starting with a `.` means that all sub-domains will be included. For example, specifying `.example.com` will cover `some.thing.example.com` as well as `example.com`.
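Putting those flags together, a hedged sketch of an HTTP-restricted job (the image and URL are illustrative):

```shell
# HTTP-only networking, limited to example.com and all its sub-domains.
# Requests to any other domain will be blocked by the proxy.
bacalhau docker run \
  --network=http \
  --domain=.example.com \
  curlimages/curl curl -s https://api.example.com/
```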
:::caution Bacalhau jobs are explicitly prevented from starting other Bacalhau jobs, even if a Bacalhau requester node is specified on the HTTP allowlist. :::
Bacalhau has support for describing jobs that can access the internet during job execution. The ability for compute nodes to run jobs that require internet access depends on what compute nodes are currently part of the network.
Compute nodes that join the Bacalhau network do not accept networked jobs by default (i.e. they only accept jobs that specify `--network=none`, which is also the default).
The public compute nodes provided by the Bacalhau network will accept jobs that require HTTP networking as long as the domains are from this allowlist.
If you need to access a domain that isn't on the allowlist, you can make a request to the Bacalhau Project team to include your required domains. You can also set up your own compute node that implements the allowlist you need.
This directory contains instructions on how to set up the networking in Bacalhau.