SaltStack Minion communication and missing returns
Posted on 05-30-2016 04:42:57 UTC | Updated on 08-02-2016 00:27:12 UTC

Setting up SaltStack is a fairly easy task, and there is plenty of documentation available. This is not an install tutorial. It is an explanation of, and a troubleshooting guide for, what is going on with SaltStack Master and Minion communication, mostly when using the CLI to send commands from the Master to the Minions.

Basic Check List

After you have installed the Salt Master and Salt Minion software, the first thing to do after starting your Master is to open each Minion's config file, /etc/salt/minion, and fill out the line "master: " to tell the Minion where his Master is. Then start or restart the Salt Minion. Do this for all your Minions.

Go back to the Master and accept all of the Minions' keys. The salt-key command handles this: salt-key -L lists the keys and salt-key -A accepts all pending ones. If you don't see a certain Minion's key, here are some things you should check.

  1. Are your Minion and Master running the same software version? The Master can usually work at a higher version than the Minion. Try to keep them the same if possible.
  2. Is your salt-minion service running? Make sure it is set to run on start as well.
  3. Has the Minion's key been accepted by the Master? If you don't even see a key request from the Minion, then the Minion is not talking to the Master at all.
  4. Does the Minion have an unobstructed network path back to TCP ports 4505 and 4506 on the Master? The Minions initiate the TCP connections to the Master, so the Minions themselves don't need any inbound ports open. Watch out for those firewalls.
  5. Check your Minion's log file in /var/log/salt/minion for key issues or any other issues.
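Item 4 in the list above can be checked from the Minion with a few lines of Python. This is only a minimal sketch: it assumes the Master uses Salt's default ports (4505 for publishing jobs, 4506 for returns), and the helper names can_reach and check_master are made up for this example.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def check_master(host, ports=(4505, 4506)):
    """Map each Master port to whether it is reachable from this machine.

    The default ports are Salt's defaults; change them if your Master
    config overrides publish_port or ret_port.
    """
    return {port: can_reach(host, port) for port in ports}
```

Run it on the Minion with the same value you put on the "master: " line, for example check_master("salt") if your Master is reachable under the name salt. A False for either port usually points at a firewall or a Master that is not running.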

Basic Communication

Now let's say you have all of the basic network and key issues worked out and would like to send some jobs to your Minions. You can do this via the Salt CLI, with something like salt \* cmd.run 'echo HI'. This is considered a job by Salt. The Minions get this request, run the command, and return the job information to the Master. The CLI talks to the Master, which listens for the return messages as they come in on the ZMQ bus. The CLI then reports back the status and output of the job.

That is a basic view of the process. But sometimes Minions don't return job information, and you ask yourself what the heck happened. You know the Minion is running fine. Eventually you find out you don't really understand Master/Minion job communication at all.

Detailed Breakdown of Master Minion CLI Communication

By default, when the job information gets returned to the Master, it is stored on disk in the job cache. We will assume this is the case below.

The Salt CLI is just a small bit of code that interfaces with the API SaltStack has written, which allows anyone to send commands to the Minions programmatically. The CLI is not connected directly to the Minions when the job request is made. When the CLI makes a job request, it is handed to the Master to fulfill.
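As a rough illustration of that API, the sketch below sends the same 'echo HI' job from Python through salt.client.LocalClient, which is the Python client Salt ships. It assumes Salt is installed and that the code runs on the Master as a user allowed to publish jobs. The helper names build_job and run_job are made up for this example and are not part of Salt.

```python
def build_job(target, command, timeout=5):
    """Collect the keyword arguments for LocalClient.cmd().

    target  -- Minion glob, like the first argument to the salt CLI
    command -- shell command handed to cmd.run
    timeout -- same meaning as -t on the CLI (seconds)
    """
    return {"tgt": target, "fun": "cmd.run", "arg": [command], "timeout": timeout}


def run_job(job):
    """Hand the job to the Master and return a dict of {minion_id: output}."""
    # Imported here so build_job() can be used on machines without Salt.
    import salt.client

    client = salt.client.LocalClient()
    return client.cmd(**job)
```

Calling run_job(build_job("*", "echo HI")) is roughly what salt \* cmd.run 'echo HI' does. Minions that never returned are simply missing from the resulting dict.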

There are 2 key timeout periods you need to be aware of before we go into an explanation of how a job request is handled. They are "timeout" and "gather_job_timeout".

When the CLI command is issued, the Master gathers a list of Minions with valid keys so it knows which Minions are on the system. It validates and filters the targeting information from the given target list and sets that as its list (targets) of Minions for the job. Now the Master has a list of who should return information when queried. The Master takes the requested command, target list, job id, and a few other pieces of info, and broadcasts a message on the ZeroMQ bus to all of the Minions. When the Minions get the message, they look at the target list and decide if they should execute the job or not. If a Minion sees he is in the target list, he executes the job. If a Minion sees he is not part of the target list, he just ignores the message. Each Minion that decides to run the command creates a local job id for the job and then performs the work.
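The targeting step can be pictured with a small toy model. This is not Salt's code: it only handles glob targets (real Salt supports many target types), the message fields are simplified, the jid is a made-up value, and every function name here is hypothetical.

```python
import fnmatch


def master_targets(accepted_keys, target_glob):
    """Master side: expand a glob target against the accepted Minion keys."""
    return sorted(m for m in accepted_keys if fnmatch.fnmatch(m, target_glob))


def minion_should_run(minion_id, message):
    """Minion side: run the job only if this Minion matches the target."""
    return fnmatch.fnmatch(minion_id, message["tgt"])


def broadcast(message, connected_minions):
    """Publish once to whoever is connected right now; nothing is queued."""
    return [m for m in connected_minions if minion_should_run(m, message)]


accepted = ["web1", "web2", "db1"]
# Simplified job message; the jid below is invented for illustration.
message = {"jid": "20160530044257000000", "tgt": "web*",
           "fun": "cmd.run", "arg": ["echo HI"]}

expected = master_targets(accepted, message["tgt"])   # ['web1', 'web2']
ran = broadcast(message, ["web1", "db1"])             # web2 is offline
missing = [m for m in expected if m not in ran]       # ['web2']
print("expected:", expected, "ran:", ran, "missing:", missing)
```

Here db1 receives the message but ignores it, and web2 never sees the job because it was not connected when the message went out. The Master still expects a return from web2, which is how a "did not return" line ends up in the CLI output.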

While the Minions are working their jobs, the CLI waits for the initial timeout period (-t or timeout:) to expire. When that hits, the CLI sends the first "find_job" query. This kicks off the gather_job_timeout timer. The Minions receive the first find_job request with the original job_id. Each Minion responds to the find_job request with a status of "still working" if it is still running the job, or "Job Finished" if it is not. If a Minion does not respond to the request within the gather_job_timeout time period (10 secs), the CLI marks the Minion as "non responsive" for the polling interval. All Minions will keep being queried on the gather_job_timeout interval. If the Minions do not reply within this timeout, or all report that they are no longer running the job in question, the CLI command will return. If one or more Minions reply that they are still running the job, the initial timeout is triggered again and the cycle repeats.
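The polling cycle described above can be sketched as a small simulation. It is a toy model of this description, not Salt's implementation: the clock is simulated (nothing actually sleeps), returns are only noticed at the end of each waiting period, and the function name wait_for_returns is made up. A Minion is described by how many seconds its job takes, or None if it never answers anything.

```python
def wait_for_returns(minions, timeout=5, gather_job_timeout=10):
    """Simulate the CLI wait loop.

    minions maps minion id -> seconds the job takes, or None for a Minion
    that never returns and never answers find_job.
    Returns (seconds_waited, returned_ids, missing_ids).
    """
    def collect(now):
        # Minions whose job has finished and been returned by time `now`.
        return {m for m, d in minions.items() if d is not None and d <= now}

    clock = 0
    while True:
        clock += timeout                      # wait one timeout period
        pending = set(minions) - collect(clock)
        if not pending:
            break                             # everybody has returned
        # find_job round: silent Minions cost the full gather_job_timeout.
        if any(minions[m] is None for m in pending):
            clock += gather_job_timeout
        pending = set(minions) - collect(clock)
        still_running = {m for m in pending if minions[m] is not None}
        if not still_running:
            break                             # nobody claims the job anymore
        # Someone is still working: start another timeout period.
    returned = collect(clock)
    return clock, sorted(returned), sorted(set(minions) - returned)


print(wait_for_returns({"web1": 2, "web2": 12}))    # (15, ['web1', 'web2'], [])
print(wait_for_returns({"web1": 2, "db1": None}))   # (15, ['web1'], ['db1'])
print(wait_for_returns({"web1": 2, "db1": None}, timeout=3600))
```

In the second call web1 is done after 2 seconds, but the CLI still waits out the timeout and then the gather_job_timeout for db1 before giving up. The third call shows why a huge -t hurts: in this model the command only returns after 3610 simulated seconds.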

The CLI will show the output from the Minions as they finish their jobs. For the Minions that did not respond, but are connected to the Master, you will see the message "Minion did not return". If a Minion does not even look like it has a TCP connection with the Master, you will see "Minion did not return. [Not connected]".

By this time the Master should have marked the job as finished. The job's info should now be available in the job cache. The above is a high-level explanation of how the Master and Minions communicate. There are more details to this process, but this should give you a basic idea of how it works.

Takeaways From This Info

  1. There is no defined period on how long a job will take. The job will finish when the last responsive Minion has said it is done.
  2. If a Minion is not up or connected when a job request is sent out, then the Minion just misses that job. It is _not_ queued by the Master and sent at a later time.
  3. Currently there is no hard timeout to force the Master to stop listening after a certain amount of time.
  4. If you set your timeout (-t) to be something silly like 3600, then if even one Minion is not responding the CLI will wait the full 3600 seconds to return. Beware!

Missing Returns

Sometimes you know there are Minions up and working, but you get "Minion did not return" or you did not see any info from the Minion at all before the CLI timed out. It is frustrating, as you can send the same Minion that just failed a job and it finishes it with no problem. There can be many reasons for this. Try/check the following things.
