Copy Data from S3 to Public Storage
Here is a quick tutorial on how to copy Data from S3 to a public storage. In this tutorial, we will scrape all the links from a public AWS S3 buckets and then copy the data to IPFS using Bacalhau.
Prerequisite
To get started, you need to install the Bacalhau client, see more information here
Running a Bacalhau Job
%%bash --out job_id
bacalhau docker run \
-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1" \
--id-only \
--wait \
alpine \
-- sh -c "cp -r /inputs/* /outputs/"
Structure of the Command
Let's look closely at the command above:
bacalhau docker run
: call to bacalhau-i "s3://noaa-goes16/ABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01*:/inputs,opt=region=us-east-1
: defines S3 objects as inputs to the job. In this case, it will download all objects that match the prefixABI-L1b-RadC/2000/001/12/OR_ABI-L1b-RadC-M3C01
from the bucketnoaa-goes16
inus-east-1
region, and mount the objects under/inputs
path inside the docker job.-- sh -c "cp -r /inputs/* /outputs/"
: copies all files under/inputs
to/outputs
, which is by default the result output directory which all of its content will be published to the specified destination, which is IPFS by default
When a job is submitted, Bacalhau prints out the related job_id
. We store that in an environment variable so that we can reuse it later on.
This only works with datasets that are publicly accessible and don't require an AWS account or pay to use buckets.
Checking the State of your Jobs
- Job status: You can check the status of the job using
bacalhau list
.
%%bash
bacalhau list --id-filter ${JOB_ID} --wide
When it says Published
or Completed
, that means the job is done, and we can get the results.
- Job information: You can find out more information about your job by using
bacalhau describe
.
%%bash
bacalhau describe ${JOB_ID}
- Job download: You can download your job results directly by using
bacalhau get
. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory.
%%bash
rm -rf results && mkdir -p results # Temporary directory to store the results
bacalhau get $JOB_ID --output-dir results # Download the results
After the download has finished you should see the following contents in results directory.
Viewing your Job Output
To view your file, run the following command:
%%bash
ls -1 results/outputs
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170671748180.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170691603180.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170751219598.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170752149454.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170752204183.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170752234173.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170901216521.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20170951807462.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171000619157.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171061215161.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171071918365.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171091517487.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171152112459.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171221432456.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171232313205.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171301618116.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171572234151.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171592127442.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171801512461.nc
OR_ABI-L1b-RadC-M3C01_G16_s20000011200000_e20000011200000_c20171941452463.nc
Extract Result CID
Installing jq to extract CID from the result description
%%bash
sudo apt update
sudo apt install jq
Extracting the CIDs from output json
%%bash
bacalhau describe ${JOB_ID} --json \
| jq -r '.State.Executions[].PublishedResults.CID | select (. != null)'
QmYFhG668yJZmtk84SMMdbrz5Uvuh78Q8nLxTgLDWShkhR
Need Support?
For questions, feedback, please reach out in our forum