SaltStack Minion communication and missing returns

Setting up SaltStack is a fairly easy task, and there is plenty of documentation here. This is not an install tutorial. It is an explanation of, and a troubleshooting guide for, what goes on in SaltStack Master and Minion communication, mostly when using the CLI to send commands from the Master to the Minions.

Basic Check List

After you have installed the Salt Master and Salt Minion software and started your Master, the first thing to do is open each Minion's config file at /etc/salt/minion and fill out the line "master: " to tell the Minion where his Master is. Then start or restart the Salt Minion. Do this for all your Minions.

Go back to the Master and accept all of the Minions' keys. See here on how to do this. If you don't see a certain Minion's key, here are some things you should check.

Are your Minion and Master running the same software version? The Master can usually work at a higher version. Try to keep them the same if possible.
Is your salt-minion service running? Make sure it is set to run on start as well.
Has the Minion's key been accepted by the Master? If you don't even see a key request from the Minion, then the Minion is not even talking to the Master.
Does the Minion have an unobstructed network path back to TCP ports 4505 and 4506 on the Master? The Minions initiate the TCP connections back to the Master, so the Minions themselves don't need any inbound ports open. Watch out for those firewalls.
Check your Minion's log file in /var/log/salt/minion for key issues or any other issues.
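
The network path check above can be scripted from the Minion side. This is only a minimal sketch: it assumes the Master uses Salt's default ports (4505 for publishing, 4506 for returns), and the helper names can_connect and check_master are made up for this example. It tests plain TCP reachability and nothing else, so a passing result says nothing about keys or versions.

```python
import socket

# Assumed defaults: 4505 is the publish port, 4506 is the return port.
DEFAULT_MASTER_PORTS = (4505, 4506)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_master(host, ports=DEFAULT_MASTER_PORTS, timeout=3.0):
    """Map each port to True/False depending on whether it is reachable."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Run it on a Minion with something like check_master("salt-master.example.com"), where the hostname is a placeholder for whatever you put in the "master: " line. A False for either port usually points at a firewall or a Master that is not running.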

Basic Communication

Now let's say you have all of the basic network and key issues worked out and would like to send some jobs to your Minions. You can do this via the Salt CLI with something like salt \* cmd.run 'echo HI'. This is considered a job by Salt. The Minions get this request, run the command, and return the job information to the Master. The CLI talks to the Master, which is listening for the return messages as they come in on the ZMQ bus. The CLI then reports back the status and output of the job.

That is a basic view of this process. But, sometimes Minions don't return job information. Then you ask yourself what the heck happened. You know the Minion is running fine. Eventually you find out you don't really understand Minion Master job communication at all.

Detailed Breakdown of Master Minion CLI Communication

By default, when the job information gets returned to the Master it is stored on disk in the job cache. We will assume this is the case below.

The Salt CLI is just a small bit of code that interfaces with the API SaltStack has written, which allows anyone to send commands to the Minions programmatically. The CLI is not connected directly to the Minions when the job request is made. When the CLI makes a job request, it is handed to the Master to fulfill.
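
To make "programmatically" concrete, here is a rough sketch of sending the same echo job from Python. It assumes Salt's Python client (salt.client.LocalClient and its cmd method) is installed and that the script runs on the Master with enough privileges. The wrapper name run_on_minions is invented for this example, and main is defined but never called, so nothing is sent unless you call it yourself.

```python
def run_on_minions(client, tgt, fun, args=(), timeout=5):
    """Send one job through a LocalClient-like object and return its result dict."""
    # timeout here plays the same role as the CLI's -t flag.
    return client.cmd(tgt, fun, arg=list(args), timeout=timeout)


def main():
    # Assumption: Salt is installed and this runs on the Master.
    import salt.client

    client = salt.client.LocalClient()
    results = run_on_minions(client, "*", "cmd.run", ["echo HI"])
    for minion_id, output in results.items():
        print(minion_id, output)
```

Just like the CLI, this code only talks to the Master. It never opens a connection to a Minion.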

There are 2 key timeout periods you need to be aware of before we go into an explanation of how a job request is handled. They are "timeout" and "gather_job_timeout".

In the Master config, the "timeout:" setting (the initial timeout) tells the Master how often to poll the Minions about their job status. After the initial timeout period expires, the Master fires off a "find_job" query to all the Minions that were targeted. If you did this on the command line it would look like salt \* saltutil.find_job <jid>. This asks all of the Minions if they are still running the job they were assigned. This setting can also be set from the CLI with the "-t" flag. Do not mistake this timeout for how long the CLI command will wait before it times out and kills itself. The default value for "timeout" is 5 seconds.
In the Master config, the "gather_job_timeout" setting defines how long the Master waits for a Minion to respond to the "find_job" query issued after the initial timeout mentioned above. If a Minion does not respond within the gather_job_timeout period, it is marked by the Master as "non responsive" for that polling period. This setting cannot be set from the CLI. It can only be set in the Master config file. The default is 10 seconds.

When the CLI command is issued, the Master gathers a list of Minions with valid keys so it knows which Minions are on the system. It validates and filters the targeting information from the given target list and sets that as its list (targets) of Minions for the job. Now the Master has a list of who should return information when queried. The Master takes the requested command, target list, job id, and a few other pieces of info, and broadcasts a message on the ZeroMQ bus to all of the Minions. When the Minions get the message, they each look at the target list and decide if they should execute the job or not. If a Minion sees he is in the target list, he executes the job. If a Minion sees he is not part of the target list, he just ignores the message. Each Minion that decides to run the command creates a local job id for the job and then performs the work.
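
The broadcast-then-filter idea above can be modeled in a few lines. This is a toy sketch, not Salt's code: it assumes glob targeting only, skips encryption and the real message format, and the Minion class and publish function are names invented for this example.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class Minion:
    minion_id: str
    executed: list = field(default_factory=list)

    def on_publish(self, message):
        """Every Minion sees the broadcast; each one decides whether to act."""
        if not fnmatch.fnmatchcase(self.minion_id, message["tgt"]):
            return False  # not targeted, so the message is ignored
        self.executed.append((message["jid"], message["fun"]))
        return True


def publish(accepted_minions, tgt, fun, jid):
    """Master side: work out who should reply, then broadcast to everyone."""
    expected = {
        m.minion_id
        for m in accepted_minions
        if fnmatch.fnmatchcase(m.minion_id, tgt)
    }
    message = {"tgt": tgt, "fun": fun, "jid": jid}
    for minion in accepted_minions:
        minion.on_publish(message)  # the broadcast reaches targeted and untargeted alike
    return expected
```

The point of the model is that the Master's expected list and the Minions' own decisions are made separately. The Master only learns that a Minion took the job when something comes back.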

While the Minions are working on their jobs, the CLI is waiting for the initial timeout period (-t or timeout:) to expire. When it does, the CLI sends the first "find_job" query. This kicks off the gather_job_timeout timer. The Minions receive the first find_job request with the original job id. Each Minion responds to the "find_job" request with a status of "still working" or "Job Finished". If a Minion does not respond to the request within the gather_job_timeout period (10 seconds by default), the CLI marks the Minion as "non responsive" for that polling interval. All Minions will keep being queried on the gather_job_timeout interval. If the Minions do not reply within this timeout, or all report that they are no longer running the job in question, the CLI command will return. If one or more Minions reply that they are still running the job, the initial timeout is triggered again and the cycle repeats.
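
The polling cycle above is easier to reason about with numbers. The sketch below is a simplified simulation on a fake clock, not Salt's implementation. It assumes that find_job replies from responsive Minions arrive instantly, that a silent Minion uses up the whole gather_job_timeout window, and that the CLI returns as soon as every targeted Minion has returned. The function name simulate_cli_wait is made up for this example.

```python
def simulate_cli_wait(finish_times, timeout=5, gather_job_timeout=10):
    """Return (seconds_waited, returned_ids, missing_ids) for one simulated job.

    finish_times maps a Minion id to the number of seconds its job takes,
    or to None for a Minion that never answers anything.
    """
    if timeout <= 0:
        raise ValueError("timeout must be positive")

    responsive = {m: t for m, t in finish_times.items() if t is not None}
    silent = sorted(m for m, t in finish_times.items() if t is None)
    last_finish = max(responsive.values(), default=0)
    clock = 0

    while True:
        deadline = clock + timeout
        # Every targeted Minion returns before the timeout expires: done.
        if not silent and last_finish <= deadline:
            return max(clock, last_finish), sorted(responsive), silent

        # The timeout expired, so a find_job query goes out to the targets.
        clock = deadline
        still_running = any(t > clock for t in responsive.values())
        if silent:
            # Silent Minions use up the whole gather_job_timeout window.
            clock += gather_job_timeout
        if not still_running:
            # Nobody claims to be working on the job, so the CLI returns.
            return clock, sorted(responsive), silent
```

Under these assumptions, simulate_cli_wait({"a": 2, "b": None}) waits 15 seconds: 5 for the initial timeout plus 10 for the silent Minion. With timeout=3600 the same call waits 3610 seconds, even though the only working Minion finished almost immediately.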

The CLI will show the output from the Minions as they finish their jobs. For the Minions that did not respond, but are connected to the Master, you will see the message "Minion did not return". If a Minion does not even look like it has a TCP connection with the Master, you will see "Minion did not return. [Not connected]".

By this time the Master should have marked the job as finished. The job's info should now be available in the job cache. The above is a high-level explanation of how Master and Minions communicate. There are more details to this process than covered here, but this should give you a basic idea of how it works.

Takeaways From This Info

There is no defined period on how long a job will take. The job will finish when the last responsive Minion has said it is done.
If a Minion is not up or connected when a job request is sent out, then the Minion just misses that job. It is _not_ queued by the Master and sent at a later time.
Currently there is no hard timeout to force the Master to stop listening after a certain amount of time.
If you set your timeout (-t) to be something silly like 3600, then if even one Minion is not responding the CLI will wait the full 3600 seconds to return. Beware!

Missing Returns

Sometimes you know there are Minions up and working, but you get "Minion did not return" or you did not see any info from the Minion at all before the CLI timed out. It is frustrating, as you can send another job to the same Minion that just failed and it finishes with no problem. There can be many reasons for this. Try/check the following things.

Did the Minion actually get the job request message? Start the Minion with log level "info" or higher. Then check the Minion log /var/log/salt/minion for job acceptance and completion.
After the CLI has exited, check the job cache. Use the jobs.list_jobs runner to find the job id, then list the output of the job with the jobs.lookup_jid runner. The CLI can miss returns, especially when the Master is overloaded. See the next bit about an overloaded Master.
The Master server getting overloaded is often the answer to missed returns. The Master is bound mostly by CPU and disk IO. Make sure these are not being starved. Bump up the threads in the Master config with the setting "worker_threads". Don't bump the threads past the number of available CPU cores.
Bump your timeout (-t) values higher to give the Minions a longer start up period.
Bump your gather_job_timeout value in the Master to give the Minions more time in between find_job polling periods. This value becomes more important when your number of Minions gets very high, as your polling period might be too short and your CLI/Master can't process all of the returns during the polling period. Then you end up overlapping polling periods. If your CLI or Master can't process the messages coming in fast enough, you will start getting missing returns.
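
When you compare the job cache against what the CLI showed you, a small helper can list exactly which Minions never returned. This sketch assumes you have already loaded the job's returns into a dict keyed by Minion id, which is the shape assumed here for the jobs.lookup_jid output. The function name missing_returns and the sample data are hypothetical.

```python
def missing_returns(expected_minions, job_returns):
    """Return the sorted ids of targeted Minions with no entry in the job's returns."""
    return sorted(set(expected_minions) - set(job_returns))


# Hypothetical sample: three Minions were targeted, two show up in the job cache.
targeted = ["web1", "web2", "db1"]
cached_returns = {"web1": "HI", "db1": "HI"}
print(missing_returns(targeted, cached_returns))
```

If a Minion is missing from the CLI output but present in the job cache, the Minion did its work and the CLI simply stopped listening too early. If it is missing from both, check the Minion's own log.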
