Exercise 3 step 4: Monitoring production
When the production is running, you can monitor its status to see how far the production has progressed and if there are any errors.
When you submit the request, you should see a response line that starts with a time stamp and then
INFO com.bc.calvalus: Production successfully ordered with ID <job-id>
where <job-id> is a string starting with job_ followed by two numbers separated by an underscore.
In the following, we need the application ID, which is almost the same but with job_ replaced by application_. The application ID in the commands below is therefore application_ followed by the same two numbers with a single underscore in between, just like in the job ID. Fill in your application ID wherever <application-id> is written in the commands below.
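For example, if the response reports a job ID such as (the numbers here are made up)
job_1672531200000_0042
then the corresponding application ID would be
application_1672531200000_0042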
The following sections list a number of commands that can be used to monitor different aspects of the currently running production, each with a short explanation of its use.
Listing all running applications
With this command, you can see all currently running applications. If everything went well, you will see your own application listed in the table.
It can be useful to reduce the font size in your terminal when running this command, as the returned table is quite wide and will look strange if each table row is broken up into multiple lines of text.
The output contains the application ID, so if you have forgotten it, this is your way to recover it. You can identify your production by your username and the name of the production.
yarn application -list
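If the table is long, you can filter it for your own entries, assuming the standard Unix tools are available on the machine; <username> below is a placeholder for your cluster user name:
yarn application -list | grep <username>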
By default, yarn application -list only lists applications that are in one of the states SUBMITTED, ACCEPTED, or RUNNING. This will include your production if everything went well. However, if there has been an error, or if the production has already finished successfully, it will not show up. You can filter the output to include applications in a particular state by adding the -appStates flag. Possible states include the three states included by default (see above), as well as FINISHED (if everything went well), FAILED (if it didn't), and KILLED (if the application was terminated externally).
If your application encountered an error, it should be visible if you run
yarn application -list -appStates FAILED
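If you are unsure whether your production is still running, has finished, or has failed, you can also query several states at once, since -appStates accepts a comma-separated list:
yarn application -list -appStates FINISHED,FAILED,KILLED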
If you don't know the status of the application, you can use the yarn application -status command explained in the next section.
Determining the status of an application
This command lists the status of a particular application.
In order to run it, you will need to know your application ID (see above) and fill it into the example instead of <application-id>.
yarn application -status <application-id>
This returns a status report on your application, including its name (Application-Name), its User, and its State.
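If you are only interested in the current state, you can filter the report with standard Unix tools, for example:
yarn application -status <application-id> | grep -i state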
Accessing logs
If your application is running successfully, you might want to access its log files to determine how far the application has progressed. Conversely, if the application failed, the logs might give you a hint to diagnose the problem.
The yarn logs command returns all logs for a specific application.
yarn logs -applicationId <application-id> | less
Because this command potentially returns a large amount of information, it is useful to look at specific log files instead.
You can see the types of log files by running the command with the -show_container_log_info flag.
yarn logs -applicationId <application-id> -show_container_log_info
If you want to monitor progress, or if the application encountered an error but managed to start correctly, stderr is often the most useful log file. It contains the log lines created by the application itself. You will recognize these lines from your local execution of the Python script, when they were printed in your terminal.
yarn logs -applicationId <application-id> -log_files <log-file-type> | less
Here, <log-file-type> can be stderr or any of the log file types listed by the previous command.
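For longer-running productions it can be convenient to save the stderr log to a file and search it there, for example with standard shell redirection and grep (the file name is arbitrary):
yarn logs -applicationId <application-id> -log_files stderr > stderr.log
grep -i error stderr.log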
For more information on the yarn logs command, Cloudera has a tutorial that goes into greater depth.
Monitor your application
- Determine your application ID
- Verify that your application is running correctly and ask for help if it isn't.
- Access the stderr log file of your application to see how far the application has progressed (the relevant commands are recapped below)
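As a recap, these steps correspond to the commands introduced above; fill in your own application ID for <application-id>:
yarn application -list
yarn application -status <application-id>
yarn logs -applicationId <application-id> -log_files stderr | less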
Accessing results
In the request, we specified an outputDir where our results will be collected. Once the production has successfully completed, we can look at the files in that directory and check that everything looks good. Because the results from all the different executions of the script are collected in the same directory, it was important to generate output files whose names depend on the input, so that different outputs never share the same name and overwrite each other.
To see which files are in the output directory:
# on ehproduction02
ls <output-dir>
where <output-dir> is the directory you specified in the request under outputDir.
For example:
ls /calvalus/home/martin/segmentation2
You should see a list of output files.
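As a quick sanity check, you can count the output files with standard Unix tools and compare the number with the number of inputs in your request, for example:
# on ehproduction02
ls /calvalus/home/martin/segmentation2 | wc -l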
Now you can copy these files to your local computer for inspection, again with scp:
# on the training machine
mkdir outputs
cd outputs
scp martin@ehproduction02:/calvalus/home/martin/segmentation2/* .
If you are working on ehproduction02 directly, you may want to copy the files to your local machine in order to use a tool like QGIS for inspection of the results.
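In that case, the copy is run from your local machine instead, for example (assuming your local machine can reach ehproduction02 over SSH; adjust the user name and output directory to your own production):
# on your local machine
mkdir outputs
scp martin@ehproduction02:/calvalus/home/martin/segmentation2/* outputs/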
Inspect your results
- Download the outputs of your application
- Inspect the results with QGIS or SNAP. Do they look the way you expect?
Celebrate!
We are done. In this tutorial, you learned how to take a Python machine-learning script from local, manual execution on a single image to large-scale, massively parallel execution on an arbitrary number of inputs on a cluster, using Calvalus.
You learned to create a software package for Calvalus:
- Create a conda virtual environment from a list of packages
- Make a conda environment relocatable, using conda-pack
- Parameterize the Python script to take an input file as a parameter
- Create a wrapper script that calls the main software
Then, you learned to deploy the software package, consisting of
- the software archive
- the packaged conda environment
- the wrapper script
- (optional) auxiliary data
In our case, we added the auxiliary data (the pre-trained models) to the software package.
Finally, you learned to run your software on Calvalus:
- Write a .json request
- Submit a request by sourcing an instance script (mytraining3) and using cht
- Monitor the production and access logs
- Retrieve outputs