Exercise 3 step 3: A Software Package for Calvalus
As you have noticed during the last step, the segmentation script runs for a little while. This is fine if you want to create a single segmented image, but it quickly becomes cumbersome if you intend to create hundreds or thousands of segmented images! For large-scale tasks, it is better to run the same program many times on different inputs in parallel. With the Calvalus processing system on EstHub you need to set up your software only once to execute it on many computers in parallel. Calvalus takes care of distributing software and data to all the computers needed and of collecting the results when they are ready.
In order to deploy our Python script on Calvalus, we need to put our software on the cluster. "Our software" includes:
- The Python code for segmentation, including all files ending with .py
- The Python packages that the script depends on, as a packed Conda environment
- The pretrained machine learning models (.pth files)
- The configuration file config.yaml
- A small wrapper script that sets paths before the main script is called (we still need to create this)
With these parts, we can create a software package that we deploy on Calvalus. Once that is done, we can create a request to ask Calvalus to run this software package on any number of inputs.
It may help to recall that we distinguish a few artefacts during this exercise:
- a conda environment conda-segmentation
- the segmentation software sander-script provided by Sander Tars
- a processor that we call segmentation
- a processor package that we call sander-segmentation-1.0
Before we create the processor software package, we modify the Python code for segmentation such that the input is passed as a parameter and the output name is derived from it, instead of updating a configuration file for each processor call. This simplifies later execution.
Copying the software directory
Unless you already did so in the previous step, copy the directory tree of sander-script into your working directory to have a copy that you can modify and pack.
cp -r /home/martin/training/orthophotos/sander-script .
ls -l sander-script
cat sander-script/config.yaml
New parameter for the input
Because we want to run the segmentation script on many inputs on the cluster, it is not very convenient to create a new configuration file (config.yaml) for each input. Instead, we add another parameter to the main.py script to specify the input. This way, we only minimally change the original software and keep the configuration file to hold the other configuration values that do not change depending on the input.
We will also remove the output file name from the configuration, because this name depends on the input. Instead, it should be generated in the script. To make things as simple as possible, we also change the output directory to the current working directory (.).
Example: If we have an input 588541.tif, the output should be called 588541_segments.tif. Because we want to relate the name of the output to the name of the input, it is easiest to determine the output name in the script itself.
Third, we change the working directory from which the algorithm is called to the directory one level above sander-script. The idea is that we can run the segmentation algorithm with an example input in the current working directory, as in:
python sander-script/main.py --conf sander-script/config.yaml --input 588541.tif
where 588541.tif is the example input data.
First, edit sander-script/config.yaml:
- Set output_path to "." (the working directory on the compute node)
- Remove the lines specifying the input and output file names in config.yaml, i.e. those with output_name and input_img_path
- Add sander-script/ in front of the path where the script looks for our model weights in config.yaml
Your configuration file should start like this:
output_path: "."
channels: [1, 2, 3]
img_pixels_detection: 512
margin: 128
output_type: "argmax"
n_classes: 19
model_weights: "sander-script/models/RGB_resnet34_unet_weights_finetuned.pth"
...
Second, edit sander-script/main.py:
- Add a new parameter --input that can be used to specify an input file
- Update the configuration dictionary in the setup() function in main.py so that this dictionary contains the input file name specified through the new parameter
- Update the configuration dictionary in the setup() function in main.py so that this dictionary contains a suitable output file name
Your main.py should look like this:
...
#### CONF FILE
argParser = argparse.ArgumentParser()
argParser.add_argument("--conf", help="Path to the .yaml config file")
argParser.add_argument("--input", help="Input file")
...
def setup(args):
    config = read_config(args.conf)
    # take the input file name from the new --input parameter
    config["input_img_path"] = args.input
    # derive the output name from the input name, e.g. 588541.tif -> 588541_segments.tif
    # (Path is pathlib.Path; add "from pathlib import Path" if main.py does not import it yet)
    config["output_name"] = Path(args.input).stem + "_segments.tif"
...
Packing the software
We already created an archive of our conda
environment. Now we create another archive with
the segmentation software and auxiliary files.
As a reminder, the components of the software are:
- The Python script main.py and all files ending with .py that are imported by it
- The pre-trained machine learning models (.pth files)
- The configuration file config.yaml
We include neither the wrapper script segmentation.py (see below; it will live in the working directory) nor the conda environment (it already has an archive and can be used separately).
cd sander-script
tar cvzf ../sander-script.tar.gz *.py config.yaml models
cd ..
Note that sander-script.tar.gz is packed from inside the sander-script/ folder: the archive must not contain the prefix path sander-script/ in its entries. At runtime on a compute node, the content of this package will be made available in a directory named after the stem of the .tar.gz file, i.e. in sander-script/ below the working directory.
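If you want to verify this, you can list the contents of the archive. The entries should show up as main.py, config.yaml, models/... and so on, without a leading sander-script/ prefix (this is only an optional sanity check):
tar tzf sander-script.tar.gz | head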
Adding a wrapper script
The purpose of the wrapper script is to translate between the calling convention of Calvalus and the call of our specific software. In our case, we could almost skip this piece and call the main script directly instead:
python sander-script/main.py --conf sander-script/config.yaml --input *.tif
There are two pieces missing with this solution:
- The PYTHONPATH needs to be extended to contain sander-script in order to find the modules imported by main.py
- PyTorch needs to be configured with the directory in which it can find the pretrained base model.
We could solve that by modifying main.py. But instead, we use a more generic approach with a wrapper script. This wrapper script shall have the name of our processor, segmentation.py. segmentation.py shall make sure Python finds our Python files by modifying the sys.path variable (this is how we access the PYTHONPATH from within Python), tell torch where to find our pretrained model weights, and call the main() function of main.py.
Create a Python script segmentation.py. The script should conform to the following requirements:
- has a shebang line that specifies the Python interpreter as the first line
- sets the directory in which PyTorch searches for pretrained models to sander-script/models (via torch.hub.set_dir(...))
- adds the directory containing our software to the PYTHONPATH (in Python: sys.path)
- calls the main() method from the main.py script
Your new file shall be called segmentation.py, be stored in your working directory, and look like the following:
#! /usr/bin/env python
# set the model search path
import torch
torch.hub.set_dir("sander-script/models")
# Change the PYTHONPATH to include sander-script
import sys
sys.path.append("sander-script")
# run the main method
import main
main.main()
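If you like, you can test the wrapper locally before packing and deploying it. The following sketch assumes that you run it from your working directory one level above sander-script, that the conda-segmentation environment is activated, and that the example input 588541.tif from above is present; it should write 588541_segments.tif into the current directory.
# optional local test of the wrapper script
python segmentation.py --conf sander-script/config.yaml --input 588541.tif
ls 588541_segments.tif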
Deployment
Now that we have packaged our software, we need to deploy it on the cluster.
For deployment, we create a suitable folder on the /calvalus file system from ehproduction02. Remember from the previous exercises that the convention for this directory is /calvalus/home/<user>/software/<package-name>-<version>.
Create the folder for your user:
# on ehproduction02
mkdir -p /calvalus/home/<user>/software/sander-segmentation-1.0
# example
# mkdir -p /calvalus/home/martin/software/sander-segmentation-1.0
After the folder has been created, we can copy our software to this directory.
Remember that copying large files to the distributed file system works faster if you use the hdfs dfs -put command instead of cp.
The components of our software that we need to copy are:
- The conda environment conda-segmentation.tar.gz
- The packed segmentation software sander-script.tar.gz
- The wrapper script segmentation.py
If you created these files on your local machine, you need to move them to ehproduction02 first if you want to use the hdfs dfs -put command. If you worked on ehproduction02 for the previous steps, you can skip this step:
First, create a working directory on ehproduction02 if you haven't already:
# on ehproduction02
mkdir ~/segmentation-wd
Now, back on your own machine, copy the files to ehproduction02. As an alternative to these commands, you can use a tool like FileZilla to copy files to ehproduction02.
# in your local working directory on your local computer
# copy the files that should be deployed to ehproduction02
scp conda-segmentation.tar.gz ehproduction02:~/segmentation-wd
scp sander-script.tar.gz ehproduction02:~/segmentation-wd
scp segmentation.py ehproduction02:~/segmentation-wd
At this point all the necessary files are on ehproduction02 and you can continue with deployment to the distributed file system, no matter where you created the files originally.
hdfs dfs -put -f conda-segmentation.tar.gz /calvalus/home/<user>/software/sander-segmentation-1.0/
hdfs dfs -put -f sander-script.tar.gz /calvalus/home/<user>/software/sander-segmentation-1.0/
hdfs dfs -put -f segmentation.py /calvalus/home/<user>/software/sander-segmentation-1.0/
ls -l /calvalus/home/<user>/software/sander-segmentation-1.0
Creating a request
In our case, the directory /calvalus/home/martin/orthophotos2 contains a few orthophotos that we can run our software on. In principle, nothing stops you from uploading your own input data to some directory below /calvalus/home/<user>/ in the same way that you uploaded the software.
ls /calvalus/home/martin/orthophotos2
Do not replace martin with your user name in this particular path!
Here, we only use a few inputs to illustrate how Calvalus works. The same procedure can be used to run software on a very large number of inputs. The only thing that needs to change is the input directory: you would need to find or create one that contains all the desired input data.
We use instances on the production machines to collect files related to a particular project on the cluster. Every instance also contains a script that sets environment variables to specify, among other things, the Calvalus version.
You already created an instance, training3-inst, in the previous exercises. In order to submit a request, you need to source this script first, before running the cht tool which submits the request.
The names of the instance scripts start with my by convention; our script is called mytraining3. Also by convention, our hand-written requests are collected in a subdirectory of the instance directory named special-requests.
Now that the software is deployed, we want to execute it on several inputs concurrently. Executing software is done via requests. A request is just a .json file instructing Calvalus to run our software.
For our script, we need to set a number of parameters. Some of the parameters in the template are already set, as they are not very specific to our case and would be the same for any other request for a similar project.
Some parameters are specified by patterns. A pattern lets us specify a group of files whose locations share something in common, instead of a single file. We can make use of patterns to identify all the input files that we want to use without listing them all individually.
Because Calvalus forwards the parameters to various tools, the syntax to specify patterns can differ between parameters if the tools use different pattern syntaxes. In our case, be aware that the inputPath parameter uses the pattern .*.tif to identify all .tif files in a directory (. denotes any character, and * repeats it any number of times), whereas the outputPattern and processorCall parameters use the pattern *.tif (the so-called glob pattern, where * denotes any number of arbitrary characters except /), without the leading . character!
The outputPattern parameter depends on how you specified the names of the outputs in the exercise above. The example for this parameter will work if all output file names end in _segments.tif.
Parameter | Example | Meaning |
---|---|---|
inputPath | /calvalus/home/martin/orthophotos2/.*.tif | Location of the inputs on the /calvalus file system. Supports patterns to specify multiple inputs, e.g. /calvalus/<...>/mydir/.*.tif to specify all .tif files in the directory mydir |
processorName | segmentation | Prefix of the processor call Python script; makes sure the processor is recognised as a Python processor |
processorCall | segmentation.py --conf sander-script/config.yaml --input *.tif | Command line with which Calvalus will execute the software. Calvalus will add the input after this command line on each execution |
condaEnv | conda-segmentation | Name of the conda environment to be activated by Calvalus before the processor is called |
outputDir | /calvalus/home/martin/orthophoto-segmentation | Where to store the outputs |
outputPattern | *_segments.tif | Pattern to determine the output files. This should match how the output names are constructed in our modified main.py |
processorBundles | /calvalus/home/martin/software/sander-segmentation-1.0 | Path to the sander-segmentation-1.0 package containing your software |
{
"productionType" : "processing",
"productionName" : "",
"inputPath" : "",
"checkIntersection" : "false",
"processorName" : "segmentation",
"processorCall" : "",
"condaEnv" : "",
"outputDir" : "",
"outputPattern" : "",
"queue" : "training30",
"attempts" : "1",
"failurePercent" : "0",
"timeout" : "3600",
"executableMemory" : "6144",
"mapreduce.map.cpu.vcores" : "4",
"processorBundles" : "",
"calvalus" : "calvalus-2.26",
"snap" : "snap-9.3cv"
}
- Create a file orthophoto-segmentation.json in the special-requests/ subdirectory of the training instance training3-inst and copy-paste the text above.
- Fill in the missing fields in the template request
Remember that you can use the inputs in the user martin's directory /calvalus/home/martin/orthophotos2/, but that you should use your own bundle and output directory. If you compare your request with the examples in the table, you will need to replace martin with your username except in the input path, which you can use unmodified.
Except for the values of productionName, outputDir, queue, and processorBundles, your request may finally look like ours:
{
"productionType" : "processing",
"productionName" : "Orthophoto segmentation 1.2 Hannes",
"inputPath" : "/calvalus/home/martin/orthophotos2/.*.tif",
"checkIntersection" : "false",
"processorName" : "segmentation",
"processorCall" : "segmentation.py --conf sander-script/config.yaml --input *.tif",
"condaEnv" : "conda-segmentation",
"outputDir" : "/calvalus/home/martin/orthophoto-segmentation",
"outputPattern" : "*_segments.tif",
"queue" : "training30",
"attempts" : "1",
"failurePercent" : "0",
"timeout" : "3600",
"executableMemory" : "6144",
"mapreduce.map.cpu.vcores" : "4",
"processorBundles" : "/calvalus/home/martin/software/sander-segmentation-1.0",
"calvalus" : "calvalus-2.26",
"snap" : "snap-9.3cv"
}
Submitting the request
When the request is written, we can submit it with the Calvalus-Hadoop-Tool cht.
Make sure that you have "sourced" the mytraining3 script first, otherwise the cht program cannot be found.
# on ehproduction02 in the training3-inst instance directory
source mytraining3 # if you didn't yet
cht special-requests/orthophoto-segmentation.json
Alternatively, if you don't want to stay logged in to wait for the request to complete, you can add a -a option to the cht command to run it in the background.
cht -a special-requests/orthophoto-segmentation.json
Next steps
You just submitted a request to run your own software package on Calvalus. Now, we need to wait until the production is finished to look at our results.
In the meantime, we can monitor the production while it is running to verify that everything is working well.