Exercise 3 step 4: Monitoring production

While the production is running, you can monitor its status to see how far it has progressed and whether any errors have occurred.

When you submit the request, you should see a response line that starts with a time stamp and then

INFO com.bc.calvalus: Production successfully ordered with ID <job-id>

where <job-id> is a string starting with job_ followed by two numbers separated by an underscore. The commands below need the application ID, which is the same string with the job_ prefix replaced by application_. Fill in your application ID wherever <application-id> is written in the commands below.
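As a minimal sketch (the job ID value below is hypothetical), shell parameter expansion can derive the application ID from the job ID:

```shell
# Hypothetical job ID taken from the submission output
job_id="job_1615976456000_0042"

# Strip the "job_" prefix and prepend "application_"
application_id="application_${job_id#job_}"
echo "$application_id"
# application_1615976456000_0042
```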

Below, a number of commands are listed that can be used to monitor different aspects of your currently running production, each with a short explanation of its use.

Listing all running applications

With this command, you can see all currently running applications. If everything went well, you will see your own application listed in the table.

It can be useful to reduce the font size in your terminal when running this command, as the returned table is quite wide and will look strange if each table row is broken up into multiple lines of text.

The output will contain the Application ID, so if you forgot it, this is your way to recover it. You can identify your production by your user name and the name of the production.


yarn application -list
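The application ID is the first column of each row in the returned table, so it can be cut out with awk. A small sketch, using a hypothetical sample row:

```shell
# Hypothetical table row as returned by "yarn application -list"
row="application_1615976456000_0042  my-production  CALVALUS_L2  martin  default  RUNNING"

# The first whitespace-separated field is the application ID
echo "$row" | awk '{print $1}'
# application_1615976456000_0042
```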

By default, this command only lists applications that are in one of the states SUBMITTED, ACCEPTED or RUNNING. This will include your production if everything went well. However, if there has been an error, or if the production has already finished successfully, it will not show up.

You can filter the output to include applications in a particular state by adding the -appStates flag.

Possible states include the three states included by default (see above), FINISHED (if everything went well), FAILED (if it didn't) and KILLED (if the application was terminated externally).

If your application encountered an error, it should be visible if you run

yarn application -list -appStates FAILED

If you don't know the status of the application, you can use the yarn application command explained in the next section.

Determining the status of an application

This command lists the status of a particular application. In order to run it, you will need to know your application ID (see above) and fill it into the example instead of <application-id>.

yarn application -status <application-id>

This returns a status report on your application, including its name (Application-Name), User and State.
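If you only care about one field, you can filter the report with grep. A minimal sketch, using a hypothetical sample status report:

```shell
# Hypothetical excerpt of a "yarn application -status" report
status_report="Application-Id : application_1615976456000_0042
Application-Name : my-production
State : RUNNING"

# Pick out the State line
echo "$status_report" | grep 'State :'
# State : RUNNING
```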

Accessing logs

If your application is running successfully, you might want to access its log files to determine how far the application has progressed. Conversely, if the application failed, the logs might give you a hint to diagnose the problem.

The yarn logs command returns all logs for a specific application.

yarn logs -applicationId <application-id> | less

Because this command potentially returns a large amount of information, it is often more useful to look at specific log files instead.

You can see the types of log files by running the command with the -show_container_log_info flag.

yarn logs -applicationId <application-id> -show_container_log_info

If you want to monitor progress, or if the application encountered an error but managed to start correctly, stderr is often the most useful log file. It contains the log lines written by the application itself. You will recognize these lines from the local execution of the Python script, when they were printed in your terminal.

yarn logs -applicationId <application-id> -log_files <log-file-type> | less

<log-file-type> could be stderr or any of the LogFile types listed by the previous command.

For more information on the yarn logs command, Cloudera has a tutorial that goes into greater depth.

Monitor your application

  • Determine your application ID
  • Verify that your application is running correctly and ask for help if it isn't.
  • Access the stderr log file of your application to see how far the application has progressed

Accessing results

In the request, we have specified an outputDir, where our results will be collected.

Once the production has successfully completed, we can look at the files in the directory and check that everything looks good. Because all results from different executions of the script are collected in the same directory, it was important to generate output files whose names depend on the input, so that different outputs never share a name and overwrite each other.
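The naming idea can be sketched like this (all paths and names below are hypothetical): derive the output name from the input file's base name, so that parallel executions on different inputs never write to the same file.

```shell
# Hypothetical input image processed by one execution of the script
input="/calvalus/eodata/S2/T32UNE_20210101.tif"

# Strip the directory and the .tif suffix to get the base name
base="$(basename "$input" .tif)"

# Output name depends on the input, so outputs never collide
echo "${base}_segmented.tif"
# T32UNE_20210101_segmented.tif
```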

To see which files are in the output directory:

# on ehproduction02

ls <output-dir>

where <output-dir> is the directory you specified in the request under outputDir.

For example:

ls /calvalus/home/martin/segmentation2

You should see a list of output files.

Now you can copy these files to your local computer for inspection, again with scp:

# on the training machine

mkdir outputs
cd outputs
scp martin@ehproduction02:/calvalus/home/martin/segmentation2/* .

If you are working on ehproduction02 directly, you may want to copy the files to your local machine in order to inspect the results with a tool like QGIS.

Inspect your results

  • Download the outputs of your application
  • Inspect the results with QGIS or SNAP. Do they look the way you expect?

Celebrate!

We are done. In this tutorial, you learned how to take a Python machine learning script from local, manual execution on a single image to large-scale, massively parallel execution on an arbitrary number of inputs on a cluster, using Calvalus.

You learned to create a software package for Calvalus:

  • Create a conda virtual environment from a list of packages
  • Make a conda environment relocatable, using conda-pack
  • Parameterize the Python script to take an input file as a parameter
  • Create a wrapper script that calls the main software

Then, you learned to deploy the software package, consisting of

  • the software archive
  • the packaged conda environment
  • the wrapper script
  • (optional) auxiliary data

In our case, we added the auxiliary data (the pre-trained models) to the software package.

Finally, you learned to run software on Calvalus:

  • Write a .json request
  • Submit a request by sourcing an instance script (mytraining3) and using cht
  • Monitor production and access logs
  • Retrieve outputs