Debugging slow ssh connections

A while ago I was asked to look into an issue in which Ansible runs were taking a very long time on a few hosts. It quickly became obvious that the problem was with ssh and was not specific to Ansible. The initial ssh connection to a subset of hosts took 2+ minutes, but subsequent logins were immediate (which is why the person who originally looked into this thought the issue was Ansible, not ssh).

I checked the normal things that can cause slow ssh logins, specifically DNS and Kerberos. All hosts had valid A and PTR records and GSSAPI was functioning as expected. I ended up building ssh from source with debugging symbols enabled and stepping through the connection with gdb. It turns out that there is a [Debian patch](https://sources.debian.net/patches/openssh/1:7.5p1-5/user-group-modes.patch/) that applies when an ssh configuration file, such as ~/.ssh/authorized_keys, is group writable. The patch checks whether the group assigned to the file has exactly one member, and whether that member is the file's owner. If so, the authentication process can proceed as normal. All of the Ubuntu-based hosts in this environment were running into this issue, whereas all of the CentOS machines behaved perfectly normally.

This is the specific function that was causing issues:

static int
secure_permissions(struct stat *st, uid_t uid)
{
	if (!platform_sys_dir_uid(st->st_uid) && st->st_uid != uid)
		return 0;
	if ((st->st_mode & 002) != 0)
		return 0;
	if ((st->st_mode & 020) != 0) {
		/* If the file is group-writable, the group in question
		 * must have exactly one member, namely the file's owner.
		 * (Zero-member groups are typically used by setgid
		 * binaries, and are unlikely to be suitable.)
		 */
		struct passwd *pw;
		struct group *gr;
		int members = 0;

		gr = getgrgid(st->st_gid);
		if (!gr)
			return 0;

		/* Check primary group memberships. */
		while ((pw = getpwent()) != NULL) {
			if (pw->pw_gid == gr->gr_gid) {
				++members;
				if (pw->pw_uid != uid)
					return 0;
			}
		}
		endpwent();

		pw = getpwuid(st->st_uid);
		if (!pw)
			return 0;

		/* Check supplementary group memberships. */
		if (gr->gr_mem[0]) {
			++members;
			if (strcmp(pw->pw_name, gr->gr_mem[0]) ||
			    gr->gr_mem[1])
				return 0;
		}

		if (!members)
			return 0;
	}
	return 1;
}

These systems were in an environment in which group lookups queried Active Directory, and this Active Directory environment had hundreds of thousands of groups spread across multiple domain controllers in different locations. The system took nearly 2 minutes to find the proper group in Active Directory and determine whether it had only a single user. After the first successful lookup, the group was cached by nscd, which is why subsequent logins behaved normally.
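To see where the time goes without attaching gdb, one option is to time the same name service calls that the function above makes. The following is a minimal sketch, not code from OpenSSH or the Debian patch: `time_group_lookup` and `time_passwd_walk` are hypothetical helper names, and the sketch assumes a POSIX system with `clock_gettime`.

```c
#define _XOPEN_SOURCE 700
#include <grp.h>
#include <pwd.h>
#include <sys/types.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC readings. */
static double
seconds_between(const struct timespec *a, const struct timespec *b)
{
	return (double)(b->tv_sec - a->tv_sec) +
	    (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/*
 * Hypothetical helper: time a single getgrgid() call, the same
 * lookup the patched function performs for the file's group.
 * Sets *found to 1 if the group resolved, 0 otherwise.
 */
static double
time_group_lookup(gid_t gid, int *found)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	*found = getgrgid(gid) != NULL;
	clock_gettime(CLOCK_MONOTONIC, &end);
	return seconds_between(&start, &end);
}

/*
 * Hypothetical helper: time a full walk of the passwd database,
 * counting users whose primary group is gid. This mirrors the
 * getpwent() loop in the patched function.
 */
static double
time_passwd_walk(gid_t gid, int *members)
{
	struct timespec start, end;
	struct passwd *pw;

	*members = 0;
	clock_gettime(CLOCK_MONOTONIC, &start);
	setpwent();
	while ((pw = getpwent()) != NULL) {
		if (pw->pw_gid == gid)
			++*members;
	}
	endpwent();
	clock_gettime(CLOCK_MONOTONIC, &end);
	return seconds_between(&start, &end);
}
```

Calling both helpers with the `st_gid` of the group-writable file and printing the returned seconds should show which lookup is slow on a given host. Because nscd caches results, only the first run after the cache expires is likely to reproduce the delay.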

The quick fix was to simply remove the group write permissions from everything under ~/.ssh/ for the ansible user. The longer-term fix was to re-architect the authentication system such that these systems weren’t searching the entire AD tree for groups.
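The quick fix amounts to clearing the group-write bit (the `020` bit the function tests) so that the slow branch is never entered. A minimal sketch of that in C follows, assuming a flat directory such as ~/.ssh. The helper names `clear_group_write` and `clear_group_write_in_dir` are hypothetical, and the sketch does not recurse into subdirectories.

```c
#define _XOPEN_SOURCE 700
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: clear the group-write bit on one path.
 * Returns 1 if the mode was changed, 0 if the bit was already
 * clear, and -1 on error.
 */
static int
clear_group_write(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	if ((st.st_mode & S_IWGRP) == 0)
		return 0;
	if (chmod(path, (st.st_mode & 07777) & ~(mode_t)S_IWGRP) != 0)
		return -1;
	return 1;
}

/*
 * Hypothetical helper: apply clear_group_write() to dir itself and
 * to every entry directly inside it (no recursion). Returns the
 * number of paths whose mode changed, or -1 if dir itself could
 * not be processed. Errors on individual entries are skipped.
 */
static int
clear_group_write_in_dir(const char *dir)
{
	char path[PATH_MAX];
	struct dirent *de;
	DIR *d;
	int changed = 0, n, rc;

	if ((rc = clear_group_write(dir)) < 0)
		return -1;
	changed += rc;

	if ((d = opendir(dir)) == NULL)
		return -1;
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue;	/* path too long, skip it */
		if ((rc = clear_group_write(path)) > 0)
			changed += rc;
	}
	closedir(d);
	return changed;
}
```

Passing the ansible user's ~/.ssh directory (as a full path) to `clear_group_write_in_dir` would have roughly the same effect as the quick fix described above. Other permission bits are left as they are.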