Exercise 3 step 3: A Software Package for Calvalus

As you have noticed during the last step, the segmentation script runs for a little while. This is fine if you want to create a single segmented image, but it quickly becomes cumbersome if you intend to create hundreds or thousands of segmented images! For large-scale tasks, it is better to run the same program many times on different inputs in parallel. With the Calvalus processing system on EstHub you need to set up your software only once to execute it on many computers in parallel. Calvalus takes care of distributing software and data to all needed computers and of collecting the results when they are ready.

In order to deploy our Python script on Calvalus, we need to put our software on the cluster. "Our software" includes:

  • The Python code for segmentation, including all files ending with .py
  • The Python packages that the script depends on as a packed Conda environment
  • The pretrained machine learning models (.pth files)
  • The configuration file config.yaml
  • A small wrapper script that sets paths before the main script is called (we still need to create this)

With these parts, we can create a software package that we deploy on Calvalus. Once that is done, we can create a request to ask Calvalus to run this software package on any number of inputs.

It may help to recall that we distinguish a few artefacts during this exercise:

  • a conda environment conda-segmentation
  • the segmentation software sander-script provided by Sander Tars
  • a processor that we call segmentation
  • a processor package we call sander-segmentation-1.0

Before we create the processor software package, we modify the Python code for segmentation so that input and output are passed as parameters instead of updating a configuration file for each processor call. This simplifies later execution.

Copying the software directory

Unless you have already done so in the previous step, copy the directory tree of sander-script into your working directory so that you have a copy that you can modify and pack.

cp -r /home/martin/training/orthophotos/sander-script .
ls -l sander-script
cat sander-script/config.yaml

New parameter for the input

Because we want to run the segmentation script on many inputs on the cluster, it is not very convenient to create a new configuration file (config.yaml) for each input. Instead, we add another parameter to the main.py script to specify the input. This way, we only minimally change the original software and keep the configuration file to hold the other configuration values that do not change depending on the input.

We will also remove the output file name from the configuration, because this name depends on the input. Instead, it should be generated in the script. To keep things as simple as possible, we also change the output directory to the current working directory (.).

Example: If we have an input 588541.tif, the output should be called 588541_segments.tif. Because we want to relate the name of the output to the name of the input, it is easiest to determine the output name in the script itself.
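
For illustration, Python's pathlib module derives such a name in one line; this is the same approach we will use in setup() below:

from pathlib import Path

input_name = "588541.tif"
output_name = Path(input_name).stem + "_segments.tif"
print(output_name)  # prints: 588541_segments.tif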

Third, we change the working directory from which the algorithm is called to the directory one level above sander-script. The idea is that we can run the segmentation algorithm with an example input in the current working directory like this:

python sander-script/main.py --conf sander-script/config.yaml --input 588541.tif

where 588541.tif is the example input data.

First, edit sander-script/config.yaml:

  • Set output_path to . (the working directory on the compute node)
  • Remove the lines specifying the input and output file names in config.yaml, i.e. the lines with output_name and input_img_path
  • Add sander-script/ in front of the path where the script looks for our model weights in config.yaml

Your configuration file should start like this:

output_path: "."
channels: [1, 2, 3]

img_pixels_detection: 512
margin: 128
output_type: "argmax"
n_classes: 19

model_weights: "sander-script/models/RGB_resnet34_unet_weights_finetuned.pth"

...

Second, edit sander-script/main.py:

  • Add a new parameter --input that can be used to specify an input file
  • Update the configuration dictionary in the setup() function in main.py so that it contains the input file name specified through the new parameter
  • Also update the dictionary in setup() so that it contains a suitable output file name

Your main.py should look like this:

...

#### CONF FILE
argParser = argparse.ArgumentParser()
argParser.add_argument("--conf", help="Path to the .yaml config file")
argParser.add_argument("--input", help="Input file")

...

def setup(args):
    config = read_config(args.conf)
    # the input file name now comes from the new command line parameter
    config["input_img_path"] = args.input
    # derive the output name from the input name, e.g. 588541.tif -> 588541_segments.tif
    # (requires "from pathlib import Path" among the imports)
    config["output_name"] = Path(args.input).stem + "_segments.tif"
    ...

Packing the software

We already created an archive of our conda environment. Now we create another archive with the segmentation software and auxiliary files.

As a reminder, the components of the software are:

  • The Python script main.py and all files ending with .py that are imported by it
  • The pre-trained machine learning models (.pth files)
  • The configuration file config.yaml

We include neither the wrapper script segmentation.py (see below; it will live in the working directory) nor the conda environment (it already has its own archive and is deployed separately).

cd sander-script
tar cvzf ../sander-script.tar.gz *.py config.yaml models
cd ..

Note that sander-script.tar.gz is packed from inside the sander-script/ folder, so the archive does not contain the prefix path sander-script/. At runtime on a compute node, the content of this package will be made available in a directory named after the stem of the .tar.gz file, i.e. sander-script/ below the working directory.
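
If you want to double-check the archive, list its contents; the entries should start with main.py, config.yaml and models/ rather than with sander-script/:

tar tzf sander-script.tar.gz | head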

Adding a wrapper script

The purpose of the wrapper script is to translate between the calling convention of Calvalus and the call of the specific software. In our case, we could almost do without it and call the main script directly:

python sander-script/main.py --conf sander-script/config.yaml --input *.tif

There are two pieces missing with this solution:

  • The PYTHONPATH needs to be extended to contain sander-script so that the modules imported by main.py can be found
  • PyTorch needs to be configured with the directory where it can find the pretrained base model.

We could solve this by modifying main.py, but instead we use a more generic approach with a wrapper script. The wrapper script shall have the name of our processor, segmentation.py. It shall make sure Python finds our Python files by modifying the sys.path variable (this is how the PYTHONPATH is accessed from within Python), tell torch where to find our pretrained model weights, and call the main() function of main.py.

Create a Python script segmentation.py. The script should conform to the following requirements:

  • has a shebang line that specifies the Python interpreter as the first line
  • sets the directory in which PyTorch searches for pretrained models to sander-script/models (via torch.hub.set_dir(...))
  • adds the directory containing our software to the PYTHONPATH (in Python: sys.path)
  • calls the main() method from the main.py script

Your new file shall be called segmentation.py, be stored in your working directory, and look like the following:

#! /usr/bin/env python

# set the model search path
import torch
torch.hub.set_dir("sander-script/models")

# Change the PYTHONPATH to include sander-script
import sys
sys.path.append("sander-script")

# run the main method
import main
main.main()
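
Before deploying, you can test the wrapper locally with the same call that Calvalus will later issue, assuming the conda-segmentation environment is activated and the example input 588541.tif from the previous step is present in your working directory:

python segmentation.py --conf sander-script/config.yaml --input 588541.tif
ls -l 588541_segments.tif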

Deployment

Now that we have packaged our software, we need to deploy it on the cluster. For deployment, we create a suitable folder on the /calvalus file system from ehproduction02. Remember from the previous exercises that the convention for this directory is /calvalus/home/<user>/software/<package-name>-<version>. Create the folder for your user:

# on ehproduction02
mkdir -p /calvalus/home/<user>/software/sander-segmentation-1.0
# example
# mkdir -p /calvalus/home/martin/software/sander-segmentation-1.0

After the folder has been created, we can copy our software to this directory. Remember that copying large files to the distributed file system works faster if you use the hdfs dfs -put command instead of cp.

The components of our software that we need to copy are:

  • The conda environment conda-segmentation.tar.gz
  • The software archive sander-script.tar.gz
  • The wrapper script segmentation.py

If you created these files on your local machine, you need to move them to ehproduction02 first in order to use the hdfs dfs -put command. If you worked on ehproduction02 for the previous steps, you can skip this part:

First, create a working directory on ehproduction02 if you haven't already:

# on ehproduction02
mkdir ~/segmentation-wd

Now, back on your own machine, copy the files to ehproduction02. As an alternative to these commands, you can use a tool like FileZilla to copy the files to ehproduction02.

# in your local working directory on your local computer

# copy the files that should be deployed to ehproduction02
scp conda-segmentation.tar.gz ehproduction02:~/segmentation-wd
scp sander-script.tar.gz ehproduction02:~/segmentation-wd
scp segmentation.py ehproduction02:~/segmentation-wd

At this point all the necessary files are on ehproduction02 and you can continue with deployment to the distributed file system, no matter where you created the files originally.

hdfs dfs -put -f conda-segmentation.tar.gz /calvalus/home/<user>/software/sander-segmentation-1.0/
hdfs dfs -put -f sander-script.tar.gz /calvalus/home/<user>/software/sander-segmentation-1.0/
hdfs dfs -put -f segmentation.py /calvalus/home/<user>/software/sander-segmentation-1.0/
ls -l /calvalus/home/<user>/software/sander-segmentation-1.0

Creating a request

In our case, the directory /calvalus/home/martin/orthophotos2 contains a few orthophotos that we can run our software on. In principle, nothing stops you from uploading your own input data to some directory below /calvalus/home/<user>/ in the same way that you uploaded the software.

ls /calvalus/home/martin/orthophotos2

Do not replace martin with your user name in this particular path!

Here, we only use a few inputs to illustrate how Calvalus works. The same procedure can be used to run software on a very large number of inputs. The only thing that needs to change is the input path, which you would point at a directory containing all the desired input data.

We use instances on the production machines to collect files related to a particular project on the cluster. Every instance also contains a script that sets environment variables to specify, among other things, the Calvalus version. You already created an instance, training3-inst, in the previous exercises. In order to submit a request, you need to source this script first, before running the cht tool which submits the request.

By convention, the names of the instance scripts start with my; our script is called mytraining3.

Also by convention, our hand-written requests are collected in a subdirectory of the instance directory named special-requests.

Now that the software is deployed, we want to execute it on several inputs concurrently. Executing software is done via requests. A request is just a .json file instructing Calvalus to run our software.

For our script, we need to set a number of parameters. Some of the parameters in the template are already set, as they are not very specific to our case and would be the same for any other request for a similar project.

Some parameters are specified by patterns. Patterns allow us to specify a group of files whose locations share something in common (a pattern) instead of a single file. We can make use of patterns to identify all the input files that we want to use without listing them individually.

Because Calvalus forwards the parameters to various tools, and these tools may use different pattern syntaxes, the syntax to specify patterns can differ between parameters. In our case, be aware that the inputPath parameter uses the regular expression .*.tif to identify all .tif files in a directory (. denotes any single character, and * repeats the preceding element any number of times), whereas the outputPattern and processorCall parameters use the glob pattern *.tif (where * denotes any number of any character except /), without the leading .!
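
Purely for illustration, a small Python sketch of the two syntaxes using the standard re and fnmatch modules (Calvalus does the matching itself; this only demonstrates the difference):

import fnmatch
import re

# inputPath uses regular expression syntax: . is any character, * repeats it
print(re.fullmatch(r".*.tif", "588541.tif") is not None)         # True

# outputPattern and processorCall use glob syntax: * matches any characters
print(fnmatch.fnmatch("588541_segments.tif", "*_segments.tif"))  # True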

The outputPattern parameter depends on how you specified the name of the outputs in the exercise above. The example for this parameter will work if all output file names end in _segments.tif.

The parameters to fill in, with example values:

inputPath
  Example: /calvalus/home/martin/orthophotos2/.*.tif
  Meaning: Location of the inputs on the /calvalus file system. Supports patterns to specify multiple inputs, e.g. /calvalus/<...>/mydir/.*.tif to specify all .tif files in the directory mydir.

processorName
  Example: segmentation
  Meaning: Prefix of the processor call Python script; makes sure the processor is recognised as a Python processor.

processorCall
  Example: segmentation.py --conf sander-script/config.yaml --input *.tif
  Meaning: Command line with which Calvalus will execute the software. Calvalus will add the input after this command line on each execution.

condaEnv
  Example: conda-segmentation
  Meaning: Name of the conda environment to be activated by Calvalus before the processor is called.

outputDir
  Example: /calvalus/home/martin/orthophoto-segmentation
  Meaning: Where to store the outputs.

outputPattern
  Example: *_segments.tif
  Meaning: Pattern to determine the output files. This should match how the output names are constructed in our modified main.py.

processorBundles
  Example: /calvalus/home/martin/software/sander-segmentation-1.0
  Meaning: Path to the sander-segmentation-1.0 package containing your software.

The template request looks like this:

{
    "productionType"    : "processing",
    "productionName"    : "",

    "inputPath"         : "",
    "checkIntersection" : "false",

    "processorName"     : "segmentation",
    "processorCall"     : "",
    "condaEnv"          : "",

    "outputDir"         : "",
    "outputPattern"     : "",

    "queue"             : "training30",
    "attempts"          : "1",
    "failurePercent"    : "0",
    "timeout"           : "3600",
    "executableMemory"  : "6144",
    "mapreduce.map.cpu.vcores" : "4",

    "processorBundles"  : "",
    "calvalus"          : "calvalus-2.26",
    "snap"              : "snap-9.3cv"
}

  • Create a file orthophoto-segmentation.json in the special-requests/ subdirectory of the training instance training3-inst and copy-paste the template above
  • Fill in the missing fields in the template request

Remember that you can use the inputs in the user martin's directory /calvalus/home/martin/orthophotos2/, but that you should use your own bundle and output directory. If you compare your request with the examples in the table, you will need to replace martin with your username everywhere except in the input path, which you can use unmodified.

Except for the values of productionName, outputDir, queue, and processorBundles, your finished request may look like ours:

{
    "productionType"    : "processing",
    "productionName"    : "Orthophoto segmentation 1.2 Hannes",

    "inputPath"         : "/calvalus/home/martin/orthophotos2/.*.tif",
    "checkIntersection" : "false",

    "processorName"     : "segmentation",
    "processorCall"     : "segmentation.py --conf sander-script/config.yaml --input *.tif",
    "condaEnv"          : "conda-segmentation",

    "outputDir"         : "/calvalus/home/martin/orthophoto-segmentation",
    "outputPattern"     : "*_segments.tif",

    "queue"             : "training30",
    "attempts"          : "1",
    "failurePercent"    : "0",
    "timeout"           : "3600",
    "executableMemory"  : "6144",
    "mapreduce.map.cpu.vcores" : "4",

    "processorBundles"  : "/calvalus/home/martin/software/sander-segmentation-1.0",
    "calvalus"          : "calvalus-2.26",
    "snap"              : "snap-9.3cv"
}

Submitting the request

When the request is written, we can submit it with the Calvalus-Hadoop-Tool cht. Make sure that you have sourced the mytraining3 script first, otherwise the cht program will not be found.

# on ehproduction02 in the training3-inst instance directory 
source mytraining3  # if you didn't yet
cht special-requests/orthophoto-segmentation.json

Alternatively, if you do not want to stay logged in and wait for the request to complete, you can add the -a option to the cht command to run it in the background.

cht -a special-requests/orthophoto-segmentation.json

Next steps

You have just submitted a request to run your own software package on Calvalus. Now we need to wait until the production is finished before we can look at the results.

In the meantime, we can monitor the production while it is running to verify that everything is working.
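
One generic possibility on the Hadoop cluster that Calvalus runs on, assuming the yarn client is available on ehproduction02, is to list the running YARN applications:

# generic Hadoop command, not specific to Calvalus
yarn application -list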