I wanted to share a problem I’ve been having for the past few weeks with deployments via Capistrano.
We have a few web servers and a few resque servers which we deploy our code to. Web servers have app and web roles while resque servers have app and resque roles.
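To make the setup concrete, the role layout could be sketched like this in a Capistrano 2-style deploy.rb (the hostnames here are made up for illustration; they are not from the original setup):

```ruby
# config/deploy.rb — hypothetical hostnames, Capistrano 2-style role definitions
role :app,    "web1.example.com", "web2.example.com",
              "resque1.example.com", "resque2.example.com"
role :web,    "web1.example.com", "web2.example.com"
role :resque, "resque1.example.com", "resque2.example.com"
```

With this layout, `cap ROLES=web production deploy` touches only the web machines, while a plain `cap production deploy` runs against every server.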
When we ran cap production deploy we were getting a Net::SSH::Disconnect: connection closed by remote host error in the create_symlink deploy task. We had no errors when running cap ROLES=resque production deploy or cap ROLES=web production deploy. Somehow, deploying to both roles together made Capistrano lose its SSH connection to some of the servers.
To verify that this was indeed the problem, we added the web role to our resque servers, which made cap production deploy pass without any errors. But since the resque servers now had the web role, they redundantly ran Nginx and Unicorn. This caused other problems, such as too many connections to our database (Unicorn holds one per process) and insufficient memory on those machines.
When I finally had enough time to handle this, I knew the problem was somehow connected to an SSH timeout, but couldn’t pinpoint exactly what. I then noticed that the SSH error happened somewhere after assets:precompile. What made me more suspicious was the fact that this task runs on web servers only – meaning our resque servers did not run it. After a few minutes I had a full picture of the problem: when deploying to both the resque and web roles, the SSH connection to the resque servers sits idle while assets:precompile runs on the web servers. This makes the resque servers close the SSH connection and fail the next task that involves them: create_symlink. This did not happen when deploying only to resque servers, since the assets precompilation task was not run at all, so the connection would not time out.
When you search the internet for “SSH timeout” you usually find two ways to handle the problem:
1. The first involves setting ServerAliveInterval 60 in your SSH client settings file. This configures your SSH client to send a keep-alive packet to the server every 60 seconds.
2. The second involves setting two variables on your server in /etc/ssh/sshd_config:
* ClientAliveInterval 60 – This makes the server send a keep-alive packet to the client every 60 seconds to keep the connection open.
* ClientAliveCountMax 50 – This determines how many of those keep-alive packets may go unanswered before the server terminates the connection.
Note: you will have to restart your SSH server via sudo service ssh restart for these settings to take effect.
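For concreteness, the two options above look like this in the standard OpenSSH configuration files (the values are the ones used in this post, not recommendations):

```
# Option 1 – client side, in ~/.ssh/config:
Host *
    ServerAliveInterval 60

# Option 2 – server side, in /etc/ssh/sshd_config:
ClientAliveInterval 60
ClientAliveCountMax 50
```

After editing the server file and restarting sshd, you can confirm the effective values with sudo sshd -T | grep -i clientalive.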
As an example, if I have the settings:

ClientAliveInterval 60
ClientAliveCountMax 5

then an SSH connection to that server would be kept alive for 5 * 60 seconds, which adds up to 5 minutes. Take into account that the default value of ClientAliveCountMax is only 3, so most of the time setting only ClientAliveInterval is pointless.
Of those two solutions, the second wins without a fight, for a couple of reasons:
1. I don’t know whether Capistrano intentionally ignores the ServerAliveInterval setting on the client machine, but that option simply does not work here; you will still get your SSH connection terminated.
2. The client-side option forces you to configure every client that deploys with this setting. The server-side configuration is much more maintainable: once you create an AMI from the machine, you will never have to touch this setting again.
This also solves the problem of a Capistrano deploy hanging at a certain task – most of these hangs happen during assets precompilation, which takes a long time and causes the SSH connection to be terminated while the task is running.
I wish you all happy and easy deployments!