Had the strangest problem with Nagios today. I noticed that I was not receiving email notifications when services went down. Nagios would log the problem and update the web page correctly, but when it came to sending an email notification I got nothing. Its log claimed the emails went out, but watching for them on the Nagios machine showed nothing. The log looked like this:
Oct 27 16:02:02 nagios nagios: SERVICE ALERT: mail;SSH;CRITICAL;HARD;5;CRITICAL - Socket timeout after 10 seconds
Oct 27 16:02:02 nagios nagios: SERVICE NOTIFICATION: tech1;mail;SSH;CRITICAL;notify-by-email;CRITICAL - Socket timeout after 10 seconds
Postfix had not logged any outgoing mail, and tcpdump showed nothing leaving the machine at the time the notification was supposedly sent. I was confused, to say the least.
Nagios uses regular Unix programs (printf and mail) to send its email. I ran the command line Nagios uses to send mail by hand on the machine and the message went out fine. I finally broke down and compiled Nagios with ultra (all) debug turned on. The web page will not work with debug turned on, but the notifications and checks will. When it came time to send the mail, this is what I saw:
/tmp/RshkRO1F: Permission denied
Permissions on /tmp??? WTF? Sure enough, /tmp's ownership and permissions were screwed up:
drwxr-xr-x 5 100 users 4096 Oct 27 15:58 tmp
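Setting them back to the correct ownership and permissions (below) fixed the problem right up. With /tmp owned by UID 100 and set to mode 755, nobody but that owner (and root) could create files there, so mail could not create the temp file it needs to send a message. Nagios does not seem to check whether the mail command succeeded, so the failure is not logged anywhere.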
chown root:root /tmp
chmod 1777 /tmp
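
Since nothing logged this failure, it may be worth checking /tmp from time to time. Here is a minimal sketch of such a check. The function name check_dir_perms is made up for illustration, and it assumes GNU stat (the -c option); BSD stat uses different flags. It compares the numeric owner, group and mode of a directory against an expected string, which defaults to "0:0 1777" (root:root, sticky and world-writable).

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios: report whether a directory has
# the expected "uid:gid mode" triple. Assumes GNU coreutils stat.
check_dir_perms() {
    dir="${1:-/tmp}"
    expected="${2:-0:0 1777}"

    # %u = numeric owner, %g = numeric group, %a = octal mode
    actual=$(stat -c '%u:%g %a' "$dir") || return 2

    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is $actual"
        return 0
    fi

    echo "bad: $dir is $actual, expected $expected"
    return 1
}
```

Running check_dir_perms with no arguments checks /tmp against the default. The non-zero return code on a mismatch means it could be wired into cron or into a Nagios check of its own, though that part is left out here.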