My IBM MQ Cluster Ping Nagios script

Monitoring cluster health can be quite tricky. I have created a small Nagios script to help with the task. It uses the sample programs amqsput and amqsget, which are delivered with the IBM MQ installation.

Preparations:
In this example I’m only going to use two queue managers: QMBASE (full repository) and QMEXTERNAL (partial repository). The commands below are for runmqsc.
QMEXTERNAL

DEFINE QA(QA.CLUSTER.PING) TARGET(TOP.CLUSTER.PING) TARGTYPE(TOPIC)
DEFINE QL(QL.CLUSTER.PING) CLUSTER(EXTERNAL)

QMBASE

DEFINE TOP(TOP.CLUSTER.PING) TOPICSTR('ping') CLUSTER(EXTERNAL) 
DEFINE SUB(SUB.CLUSTER.PING) TOPICOBJ(TOP.CLUSTER.PING) TOPICSTR('#') DEST(QL.CLUSTER.PING)
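
If you would rather apply the definitions from a script than type them into runmqsc, something along these lines could work. This is only a sketch: the function names define_ping_external and define_ping_base are made up for illustration, the cluster name is passed in as a parameter, and it assumes the script runs as a user allowed to administer the local queue manager.

```shell
#!/bin/bash
# Sketch: feed the ping definitions to runmqsc through a here-document.
# Both function names are hypothetical helpers, not part of IBM MQ.

# Objects on the external queue manager (partial repository).
define_ping_external() {
  local qmgr=$1 cluster=$2
  runmqsc "${qmgr}" <<EOF
DEFINE QA(QA.CLUSTER.PING) TARGET(TOP.CLUSTER.PING) TARGTYPE(TOPIC)
DEFINE QL(QL.CLUSTER.PING) CLUSTER(${cluster})
EOF
}

# Objects on the base queue manager (full repository).
define_ping_base() {
  local qmgr=$1 cluster=$2
  runmqsc "${qmgr}" <<EOF
DEFINE TOP(TOP.CLUSTER.PING) TOPICSTR('ping') CLUSTER(${cluster})
DEFINE SUB(SUB.CLUSTER.PING) TOPICOBJ(TOP.CLUSTER.PING) TOPICSTR('#') DEST(QL.CLUSTER.PING)
EOF
}

# Example calls, each on the host of the respective queue manager:
#   define_ping_external QMEXTERNAL EXTERNAL
#   define_ping_base QMBASE EXTERNAL
```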

How it works:
It puts a message, using amqsput, on a queue alias that points to a cluster topic on the base queue manager.
A subscription to the topic picks up the message and puts it on the cluster queue on the external queue manager.
The message is then picked up by amqsget and checked. I also check the length of the output to detect whether more than one message was returned.
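
As an alternative to checking the length, the messages in the amqsget output could be counted directly. The sketch below assumes that amqsget prints each message on its own line as "message <text>"; count_messages and has_message are hypothetical helper names, not part of the script in this post.

```shell
#!/bin/bash
# Sketch: inspect captured amqsget output.
# Assumption: every message appears on its own line as "message <text>".

# Print the number of messages found in the output passed as $1.
count_messages() {
  printf '%s\n' "$1" | grep -c '^message <' || true
}

# Succeed if one message is exactly the expected payload passed as $2.
has_message() {
  printf '%s\n' "$1" | grep -qxF "message <$2>"
}

# Example use with the variables of a ping script:
#   if has_message "${msg}" "${timestamp}" && [ "$(count_messages "${msg}")" -eq 1 ]; then
#     echo "OK - Message received from cluster"
#   fi
```

Matching the whole line avoids a false positive when the timestamp merely appears inside some other text.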

And here is the script:

#!/bin/bash

if [ $# -lt 1 ]; then
  echo "********************"
  echo "Cluster Ping"
  echo "********************"
  echo "NOTE!"
  echo "This script requires all the MQ objects to be in place (see below)"
  echo ""
  echo "QMEXTERNAL                       QMBASE (REPOSITORY)"
  echo "QA.CLUSTER.PING            ->    TOP.CLUSTER.PING (CLUSTER)"
  echo "QL.CLUSTER.PING (CLUSTER)  <-    SUB.CLUSTER.PING(TOP.CLUSTER.PING)"
  echo ""
  echo "Usage: $0 <external queue manager>"
  echo ""
  echo "External Queue Manager: ex. QMEXTERNAL"
  echo ""
  echo "Ex. $0 QMEXTERNAL"
  exit 3  # UNKNOWN: Nagios would read exit code 1 as WARNING
fi

# Define variables
qmanager=$1
inqueue=QA.CLUSTER.PING
outqueue=QL.CLUSTER.PING
timestamp=$(date +%s)
match=false
normal=true

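# Send the timestamp as the ping message; the empty line ends amqsput's input
printf "%s\n\n" "${timestamp}" | amqsput "${inqueue}" "${qmanager}" > /dev/null

# Read whatever has arrived on the cluster queue
msg=$(amqsget "${outqueue}" "${qmanager}")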

if [[ ${msg} == *"${timestamp}"* ]]; then
  match=true
fi

# Output longer than expected for a single message suggests extra messages
if [ ${#msg} -gt 80 ]; then
  normal=false
fi

if [[ ${match} == "true" && ${normal} == "true" ]]; then
  echo "OK - Message received from cluster"
  exit 0
fi

if [[ ${match} == "true" && ${normal} == "false" ]]; then
  echo "WARNING - More than one message was received"
  exit 1
fi

if [[ ${match} == "false" ]]; then
  echo "ERROR - No message received!"
  exit 2
fi

echo "UNKNOWN - Script has not run correctly"
exit 3
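
One possible hardening, not part of the script above, is to put a time limit on the MQ sample programs so that a hanging call ends up as UNKNOWN instead of blocking the check. The sketch below relies on timeout from GNU coreutils, which exits with status 124 when the limit is reached. run_or_unknown is a hypothetical helper name, the 30 second limit in the example is an arbitrary value, and whether amqsput and amqsget signal every failure through their exit status is an assumption you should verify on your installation.

```shell
#!/bin/bash
# Sketch: run a command with a time limit and map problems to UNKNOWN (3).
# Usage: run_or_unknown <seconds> <command> [args...]
# The command must be an external program, since timeout cannot run
# shell functions.
run_or_unknown() {
  local limit=$1
  shift
  local output status
  output=$(timeout "${limit}" "$@" 2>&1)
  status=$?
  if [ "${status}" -eq 124 ]; then
    echo "UNKNOWN - '$1' timed out after ${limit}s"
    return 3
  elif [ "${status}" -ne 0 ]; then
    echo "UNKNOWN - '$1' failed with exit code ${status}"
    return 3
  fi
  # Success: pass the captured output on to the caller
  printf '%s\n' "${output}"
  return 0
}

# Example use inside a ping script:
#   msg=$(run_or_unknown 30 amqsget "${outqueue}" "${qmanager}") || { echo "${msg}"; exit 3; }
```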

Now, this will tell us that the cluster can transport messages to and from the base queue manager, but we need more to be able to feel safe (at least I do), so here are a few other things to look at:
* Keep track of the transmission queues to make sure that their messages are being processed. So far I have only worked with small clusters, so monitoring the queue depth of SYSTEM.CLUSTER.TRANSMIT.QUEUE has been enough, but if you have more transmission queues you need to monitor them all.
* Keep track of the command queue SYSTEM.CLUSTER.COMMAND.QUEUE. Messages on this queue should also be processed, so its depth should not just keep growing.
* Look for error messages in the queue manager error log (AMQERR01.LOG). Here I look for the following codes (short description in parentheses):

- AMQ9465 (Failed republishing of cluster information)
- AMQ5534E (User ID 'anyuser' authentication failed)
- AMQ5542I (The failed authentication check was caused by the queue manager)
- AMQ9202E (Remote host 'anyhost (anyip) (anyport)' not available, retry)
- AMQ9259E (Connection timed out from host 'anyhost')
- AMQ9469W (Update not received for CLUSRCVR channel 'any channel')
- AMQ9492E (The TCP/IP responder program encountered an error)
- AMQ9558E (The remote channel 'any channel' encountered a problem)
- AMQ9633E (Bad SSL certificate for channel 'any channel')
- AMQ9637E (Channel is lacking a certificate)
- AMQ9716E (Remote SSL certificate revocation status check failed for channel)
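
The queue depth and error log checks above could be scripted roughly as follows. This is a sketch under a few assumptions: runmqsc reports the depth as CURDEPTH(n), the error log sits in the default Linux location /var/mqm/qmgrs/<queue manager>/errors, and parse_curdepth and scan_error_log are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: helpers for the extra checks. Both function names are hypothetical.

# Read runmqsc output on stdin and print the number from the first
# CURDEPTH(n) attribute found.
parse_curdepth() {
  sed -n 's/.*CURDEPTH(\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Print each watched message code that occurs in the given log file, once.
# Only the numeric part is matched, so a code is found with or without
# its severity letter.
scan_error_log() {
  local logfile=$1
  grep -oE 'AMQ(9465|5534|5542|9202|9259|9469|9492|9558|9633|9637|9716)[A-Z]?' "${logfile}" | sort -u
}

# Example use:
#   depth=$(echo "DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH" | runmqsc QMEXTERNAL | parse_curdepth)
#   scan_error_log /var/mqm/qmgrs/QMEXTERNAL/errors/AMQERR01.LOG
```

A single depth reading cannot show that a queue is being processed, so compare readings over time. The log scan also reports old entries again on every run unless you keep track of what has already been seen.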

After all of the tests above I usually feel pretty confident that we have a working environment.

Tested on IBM MQ 9 and Red Hat 7
