Thursday, January 17, 2019

Troubleshooting AWS Instance Profile, Role, and SSM Agent

During an AWS troubleshooting session, I noticed that an EC2 instance can end up with a stale role attached to it.

My scenario is as follows. We launched a new EC2 instance, which by default has no role attached. We then programmatically created a role, let's call it EC2Role, and associated it with a new Instance Profile. We then attached the Instance Profile to our new EC2 instance. In our case, EC2Role gives the SSM Agent permission to run commands.

For some reason, we decided to delete EC2Role and programmatically recreate a role with the same name, associated with a new Instance Profile. We noticed that if we don't detach the old role (which has the same name as the new one) from the EC2 instance, the old role stays attached even though the role itself has been deleted. This is why we were confused: the instance appeared to have the right role attached, yet it would not run commands sent by SSM. The SSM Agent log kept saying the token was invalid.
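One way to spot this situation is to compare the unique ID of the instance profile attached to the instance with the ID that IAM currently reports for the profile of the same name. The sketch below is only an illustration under several assumptions: the recreated Instance Profile reuses the old name, the association still carries the old profile's ID, and ec2 and iam are boto3 clients (for example boto3.client("ec2") and boto3.client("iam")). The function name find_stale_profile is hypothetical.

```python
def find_stale_profile(ec2, iam, instance_id):
    """Return details of a stale instance profile association, or None.

    Assumption: `ec2` and `iam` are boto3 clients. A profile that was
    deleted and recreated under the same name keeps its ARN but gets a
    new unique ID, so an ID mismatch hints at a stale attachment.
    """
    resp = ec2.describe_iam_instance_profile_associations(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )
    for assoc in resp["IamInstanceProfileAssociations"]:
        if assoc["State"] not in ("associating", "associated"):
            continue  # ignore associations that are already going away
        attached = assoc["IamInstanceProfile"]
        # The profile name is the last path segment of the ARN.
        name = attached["Arn"].rsplit("/", 1)[-1]
        # Note: this call raises an error if the profile no longer exists;
        # that case is not handled in this sketch.
        current = iam.get_instance_profile(InstanceProfileName=name)
        current_id = current["InstanceProfile"]["InstanceProfileId"]
        if current_id != attached["Id"]:
            return {
                "association_id": assoc["AssociationId"],
                "profile_name": name,
                "attached_id": attached["Id"],
                "current_id": current_id,
            }
    return None
```

A return value of None means the attached profile matches what IAM currently has under that name; anything else suggests the instance is still holding on to the deleted one.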

I tried a couple of troubleshooting steps. First, I restarted the SSM Agent, which didn't solve the problem. Second, on a different instance with the same problem, I detached and reattached the role, which didn't work either. Third, I combined the previous two steps on a single instance: I detached and reattached the role, then restarted the agent, and it worked afterwards.
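The detach and reattach half of that fix can be scripted. The following is a minimal sketch, not the exact procedure we used: it assumes ec2 is a boto3 EC2 client, and the function name reattach_instance_profile, the polling interval, and the retry count are all made up for illustration.

```python
import time

# Association states that still occupy the instance's single profile slot.
ACTIVE_STATES = ("associating", "associated", "disassociating")


def reattach_instance_profile(ec2, instance_id, profile_name,
                              poll_seconds=2, max_polls=30):
    """Detach whatever profile is attached, then attach `profile_name`.

    Assumption: `ec2` is a boto3 EC2 client. Returns the new association ID.
    """
    filters = [{"Name": "instance-id", "Values": [instance_id]}]

    def active_associations():
        resp = ec2.describe_iam_instance_profile_associations(Filters=filters)
        return [a for a in resp["IamInstanceProfileAssociations"]
                if a["State"] in ACTIVE_STATES]

    # Detach every association that is not already on its way out.
    for assoc in active_associations():
        if assoc["State"] != "disassociating":
            ec2.disassociate_iam_instance_profile(
                AssociationId=assoc["AssociationId"])

    # Wait until the instance no longer has an active association.
    for _ in range(max_polls):
        if not active_associations():
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("old instance profile is still attached")

    resp = ec2.associate_iam_instance_profile(
        IamInstanceProfile={"Name": profile_name},
        InstanceId=instance_id,
    )
    return resp["IamInstanceProfileAssociation"]["AssociationId"]
```

This only covers the role side. The agent still has to be restarted on the instance itself afterwards, since in my case neither step worked on its own.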




AWS Systems Manager (SSM) Run Command Troubleshooting

I have been working with AWS SSM for a couple of months, and I found that the troubleshooting documentation on the AWS website lacks straightforward answers. So here are the problems I encountered and the solutions that worked for me.

Problem #1: The instance is not visible in the AWS Systems Manager console, although the documentation says the agent is installed by default.

Problem #2: The instance is visible, but "Run Command" takes too long or even times out.

Solution:
  1. The first thing I would check is whether the instance has a role attached to it.
  2. If so, make sure the role has the AmazonEC2RoleforSSM policy attached, since the agent needs these permissions to perform its health check.
  3. If all of the above has been confirmed, check whether the latest SSM Agent is installed and running.
  4. If the SSM Agent is up to date and running, check whether it is hibernating. The hibernation logic uses exponential backoff, so the agent might not respond for a long time.
  5. If it is hibernating, we can simply restart the agent.
    • On Windows, we can run the Restart-Service AmazonSSMAgent PowerShell command.
    • On Linux, we can run the sudo restart amazon-ssm-agent shell command (on systemd-based distributions, sudo systemctl restart amazon-ssm-agent).
  6. If all of the above fails, it is time to look at the log files.
    • On Windows:
      • %PROGRAMDATA%\Amazon\SSM\Logs\amazon-ssm-agent.log
      • %PROGRAMDATA%\Amazon\SSM\Logs\errors.log
    • On Linux:
      • /var/log/amazon/ssm/amazon-ssm-agent.log
      • /var/log/amazon/ssm/errors.log
  7. If the log files don't give enough information, we can enable debug logging to get more detail. This requires quite a number of steps, so refer to the reference link below.
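Steps 1 to 3 above can also be checked from outside the instance through the AWS APIs. The following is a rough sketch under stated assumptions: ec2, iam, and ssm are boto3 clients, the function name check_ssm_prerequisites and the wording of the findings are made up, and only the first page of attached policies is inspected. It does not detect hibernation; that still requires looking at the agent on the instance.

```python
def check_ssm_prerequisites(ec2, iam, ssm, instance_id,
                            policy_name="AmazonEC2RoleforSSM"):
    """Return a list of human-readable findings; an empty list means OK.

    Assumption: `ec2`, `iam`, and `ssm` are boto3 clients.
    """
    findings = []

    # Step 1: is an instance profile (and therefore a role) attached?
    assocs = ec2.describe_iam_instance_profile_associations(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )["IamInstanceProfileAssociations"]
    attached = [a for a in assocs if a["State"] == "associated"]
    if not attached:
        return ["no instance profile is attached to the instance"]

    # Step 2: does each role in the profile have the managed policy?
    profile_name = attached[0]["IamInstanceProfile"]["Arn"].rsplit("/", 1)[-1]
    profile = iam.get_instance_profile(InstanceProfileName=profile_name)
    roles = profile["InstanceProfile"]["Roles"]
    if not roles:
        findings.append("instance profile has no role")
    for role in roles:
        # Only the first page of results is checked in this sketch.
        policies = iam.list_attached_role_policies(
            RoleName=role["RoleName"])["AttachedPolicies"]
        if policy_name not in [p["PolicyName"] for p in policies]:
            findings.append(
                "role %s lacks policy %s" % (role["RoleName"], policy_name))

    # Step 3: is the agent registered, reachable, and up to date?
    info = ssm.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )["InstanceInformationList"]
    if not info:
        findings.append("instance is not registered with Systems Manager")
    else:
        if info[0].get("PingStatus") != "Online":
            findings.append("agent ping status is %s" % info[0].get("PingStatus"))
        if not info[0].get("IsLatestVersion", False):
            findings.append("agent is not the latest version")
    return findings
```

An instance that is missing from the Systems Manager console (Problem #1) would show up here as "not registered", which usually points back at steps 1 and 2.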
Reference