Watchdog
Although Orion Context Broker is highly stable, it may still fail (see the section on diagnosis procedures for more information about detecting problems with the broker). It is therefore advisable to use a watchdog process that detects when the broker process has stopped running, so that it can be restarted automatically and/or you are notified of the problem.
You can write the watchdog program yourself (e.g. a script invoked by cron on a regular basis that checks /etc/init.d/contextBroker status, starts the broker again if it is not running and/or sends you a notification email) or use existing tools. This section describes a procedure using Monit.
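For the do-it-yourself option, a minimal sketch of a cron-invoked watchdog script could look like the following. It assumes that /etc/init.d/contextBroker status returns exit status 0 only when the broker is running. The INIT_SCRIPT variable and the check_and_restart function are illustrative names, not part of Orion.

```shell
#!/bin/bash
# Hypothetical watchdog script, meant to be invoked periodically by cron.
# Assumption: the init script exits with status 0 for "status" only when
# the broker is running.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/contextBroker}"

check_and_restart() {
    # Nothing to do if the broker reports that it is running.
    if "$INIT_SCRIPT" status >/dev/null 2>&1; then
        return 0
    fi
    # Written to stderr; cron mails job output to the address in MAILTO, if set.
    echo "$(date): contextBroker is not running, starting it" >&2
    "$INIT_SCRIPT" start
}

# Only act when the init script is actually installed on this host.
if [ -x "$INIT_SCRIPT" ]; then
    check_and_restart
else
    echo "init script $INIT_SCRIPT not found or not executable" >&2
fi
```

Saved under a path of your choice (for instance the hypothetical /home/orion/watchdog_CB.sh) and made executable, it could be scheduled from root's crontab with an entry such as "* * * * * /home/orion/watchdog_CB.sh" to run once per minute.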
First of all, install one of the monit RPMs available at http://rpmfind.net/linux/rpm2html/search.php?query=monit. The following procedure was prepared with monit-5.1.1-4.el6.x86_64.rpm, although it should also work with other versions of the RPM; in the command below, use the file name of the RPM you actually downloaded.
sudo rpm -i monit-5.2.5-1.el5.rf.x86_64.rpm
Create a directory for the monit files, e.g.:
/home/orion/monit_CB
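Create a monitBROKER.conf file in that directory. In this example, monit is configured to send an alert if the CPU load of contextBroker is greater than 60% for two cycles, and to restart contextBroker if the CPU load is greater than 80% for five cycles or if the allocated memory is greater than 200 MB for five cycles (which would be a symptom of a memory leak). In addition to resource checking, monit restarts the process if it is down, stops it (so that it gets restarted) if the check on http://localhost:1026/version fails for three cycles, and removes its own log file when it grows beyond 50 MB. The duration of a cycle is defined using a monit command line parameter (described below).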
###############################################################################
## Monit control file
###############################################################################
##
## Comments begin with a '#' and extend through the end of the line. Keywords
## are case insensitive. All path's MUST BE FULLY QUALIFIED, starting with '/'.
##
##
###############################################################################
## Global section
###############################################################################
##
set logfile /var/log/contextBroker/monitBROKER.log
set statefile /var/log/contextBroker/monit.state
###############################################################################
## Services
###############################################################################
##
check host localhost with address localhost
    if failed (url http://localhost:1026/version and content == '<version>') for 3 cycles then
        exec "/etc/init.d/contextBroker stop"
check file monitBROKER.log with path /var/log/contextBroker/monitBROKER.log
    if size > 50 MB then
        exec "/bin/bash -c '/bin/rm /var/log/contextBroker/monitBROKER.log; monit -c /home/orion/monit_CB/monitBROKER.conf -p /var/log/contextBroker/monit.pid reload'"
check process contextBroker with pidfile /var/log/contextBroker/contextBroker.pid
    start program = "/etc/init.d/contextBroker start"
    stop program = "/etc/init.d/contextBroker stop"
    if cpu > 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 200.0 MB for 5 cycles then restart
Make root the owner of that file and grant permissions to the owner only (monit refuses to use a control file whose permissions are looser than 0700):
sudo chown root:root monitBROKER.conf
sudo chmod 0700 monitBROKER.conf
Create the monit start script start_monit_BROKER.sh with the content shown below. The "-d" command line parameter specifies the duration of the checking cycle in seconds (10 seconds in this example).
monit -v -c /home/orion/monit_CB/monitBROKER.conf -d 10 -p /var/log/contextBroker/monit.pid
Make root the owner of that file and set execution permissions:
sudo chown root:root start_monit_BROKER.sh
sudo chmod a+x start_monit_BROKER.sh
To run monit:
cd /home/orion/monit_CB
sudo ./start_monit_BROKER.sh
To check that monit is working properly, check that the monit and contextBroker processes exist, e.g.:
# ps -ef | grep contextBroker
500 27175 1 0 21:06 ? 00:00:00 monit -v -c /home/orion/monit_CB/monitBROKER.conf -d 10 -p /var/log/contextBroker/monit.pid
500 27205 1 0 21:06 ? 00:00:00 /usr/bin/contextBroker -port 1026 -logDir /var/log/contextBroker -pidpath /var/log/contextBroker/contextBroker.pid -dbhost localhost -db orion;
Then kill contextBroker, e.g.:
# kill 27205
and check with ps that contextBroker is up again after a while (less than 30 seconds).
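Instead of repeating ps by hand, the recovery check can be scripted. The following is a minimal sketch, assuming that the broker's start script rewrites /var/log/contextBroker/contextBroker.pid on restart and that the check runs as root (kill -0 fails for processes owned by another user). The wait_for_pidfile function is an illustrative helper, not part of Orion or monit.

```shell
#!/bin/bash
# Hypothetical helper: wait until the PID stored in a pidfile belongs to a
# live process, polling once per second up to a timeout (default 30 seconds).
wait_for_pidfile() {
    local pidfile="$1" timeout="${2:-30}" waited=0 pid
    while [ "$waited" -lt "$timeout" ]; do
        if [ -r "$pidfile" ]; then
            pid=$(cat "$pidfile")
            # kill -0 sends no signal; it only tests that the process exists.
            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
                return 0
            fi
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

After killing the broker, running "wait_for_pidfile /var/log/contextBroker/contextBroker.pid 30 && echo restarted" should print "restarted" once monit has brought contextBroker back; a non-zero exit status means the process did not come back within the timeout.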