Kidsnet has not been immune to our recent connection woes (see previous posts). Our magilla.mnkids.net is still hosted in NCIS’s datacenter, and as such was cut off from the Internet on Friday along with our other equipment.
As you may know, Magilla is the Web, mail, primary DNS, and LDAP server for Kidsnet. We have been in the process of moving the LDAP service to Scooby (a Family Pathways-owned machine fed by Sherbtel lines), but the outage cut our efforts short.
When the outage began, our secondary nameserver (puck.nether.net) began taking on the network’s DNS load. But with the LDAP database still on the now-disconnected Magilla, there was no chance for users to carry on as normal (unless they already had a session open – and once they logged out, they were out for good).
We ended up doing two things to try and resolve the situation: moving the LDAP database to Scooby by physically going to Magilla’s console and copying it to disk, and giving Magilla an address on the temporary connection (see previous posts). But it was all in vain.
Turns out that our terminal servers and secondary nameserver are configured in such a way as to create gridlock in this situation.
Whenever a user tries to log in, the LDAP client attempts to connect to “ldap.mnkids.net” (an alias for Magilla) and authenticate. So every login attempt sent out a DNS query, which our secondary nameserver answered with the Onvoy-routed IP address of Magilla (the very address that was now unreachable).
“No big deal”, you say, “just change the A-record on your secondary nameserver and you’re good to go.” If only it were that easy!
You see, our secondary nameservice is not provided by a machine we control. It’s a free service. And it only accepts updates from… wait for it… the primary nameserver. Which is Magilla. Which is disconnected.
“Okay, then,” you ask, “why not just SSH into the terminal servers and make an entry in /etc/hosts for ldap.mnkids.net, or put Magilla’s new IP into resolv.conf, or ldap.conf?” Because the terminal servers don’t allow root to log in via SSH (security, of course!), and there are no other users in /etc/passwd that can log in at all.
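For the curious, here is roughly what that /etc/hosts fix would have amounted to, had we been able to log in. This is only a sketch in Python: the function name pin_host is made up for illustration, and 192.0.2.10 is a documentation placeholder, not Magilla's real temporary address.

```python
def pin_host(hosts_text: str, name: str, ip: str) -> str:
    """Return hosts-file text in which `name` maps only to `ip`."""
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments when looking at the fields.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the address; everything after it is a hostname or alias.
        if name in fields[1:]:
            # Drop any older mapping for this name. A line that also carries
            # other aliases is dropped whole, which is fine for a sketch.
            continue
        kept.append(line)
    kept.append(f"{ip}\t{name}")
    return "\n".join(kept) + "\n"


# Hypothetical input: a hosts file containing only the loopback entry.
sample = "127.0.0.1\tlocalhost\n"
print(pin_host(sample, "ldap.mnkids.net", "192.0.2.10"), end="")
```

On a typical setup where the hosts file is consulted before DNS, an entry like this would have sidestepped the stale A-record entirely. It still has to be written as root on every terminal server, which is exactly what we could not do remotely.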
So, we’ve got ourselves a pickle. There were only two solutions to the problem:
1) Visit each site, log in as root, change the settings. For free. While dozens of other (paying) customers are having issues. Not gonna happen.
2) Change the A-records on Magilla, call up the registrar of mnkids.net and ask them to change the IP they have listed for Magilla, wait 24 hours for the changes to propagate, then wait another 12 or so hours for our secondary nameserver to notice and refresh (we can’t force it into a zone transfer – the admin doesn’t seem to allow it).
Needless to say, we had to take Choice #2. Nobody here’s happy about it, and I’m sure the children aren’t exactly thrilled either, but it’s all we can reasonably do right now – especially considering that Kidsnet is unofficially Not My Problem as of 12/22/2009 (it’s only official once Magilla is retired… and I was soooo close!).
The registrar was contacted this morning. Now we wait.
UPDATE 1: 2/3/10 9:00a – the changes have propagated to most of the Internet’s nameservers. About the only one that doesn’t seem to have noticed is Sherbtel’s (208.38.65.35). Unfortunately, most Kidsnet equipment is configured to use Magilla (at its old IP) and Sherbtel for their nameservers… so things won’t be back to normal until the changes are reflected there. Needless to say, we won’t be using Sherbtel’s nameserver for lookups anymore after this – its shortcomings have been perhaps the biggest hurdle in this entire situation (even bigger than getting the temporary connection for Magilla!). Google’s 8.8.8.8 will likely be substituted – once I can log in to the many hosts involved, that is. Kidtime is 6 hours away… and the clock is ticking.
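Checking whether a given nameserver has picked up the change means asking that server directly rather than whatever the local machine is configured to use. A minimal Python sketch of that check follows, using only the standard library. The function names are my own, it handles plain A-record answers over UDP only, and it is not the tool actually used here.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def skip_name(packet: bytes, offset: int) -> int:
    """Return the offset just past the (possibly compressed) name at `offset`."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer: two bytes, ends the name
            return offset + 2
        offset += 1 + length


def parse_a_records(packet: bytes) -> list[str]:
    """Extract the IPv4 addresses from the answer section of a DNS response."""
    qdcount, ancount = struct.unpack("!HH", packet[4:8])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10]
        )
        offset += 10
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return addresses


def ask(resolver: str, name: str, timeout: float = 3.0) -> list[str]:
    """Send the query to one resolver over UDP and return its A records."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        packet, _ = sock.recvfrom(512)
    return parse_a_records(packet)
```

Calling ask("208.38.65.35", "ldap.mnkids.net") and ask("8.8.8.8", "ldap.mnkids.net") side by side would show whether the two resolvers agree; a mismatch means one of them is still serving a cached record.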
UPDATE 2: 2/3/10 6:00p – our new DNS settings have (partially) propagated to Sherbtel’s nameserver… several records are still cached incorrectly, but fortunately ldap.mnkids.net is not among them. Logged into each and every Kidsnet host and changed the primary nameserver to 8.8.8.8. Things now appear to be working correctly. The other Kidsnet-related domains (warehouse214.org, stacyteencenter.com, etc) are not yet working, but that’ll be a project for tomorrow.
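The per-host change itself is small. Assuming the hosts keep their resolver list in a standard resolv.conf (an assumption on my part for this sketch, as are the function name and the sample file contents), swapping the primary nameserver looks something like this in Python:

```python
def set_primary_nameserver(resolv_text: str, new_primary: str) -> str:
    """Return resolv.conf text whose first nameserver entry is `new_primary`."""
    out = []
    replaced = False
    for line in resolv_text.splitlines():
        fields = line.split()
        is_ns = len(fields) >= 2 and fields[0] == "nameserver"
        if is_ns and not replaced:
            # The first nameserver line is the primary: replace it.
            out.append(f"nameserver {new_primary}")
            replaced = True
        elif is_ns and fields[1] == new_primary:
            continue  # avoid listing the new primary twice
        else:
            out.append(line)
    if not replaced:
        # No nameserver line at all: add one.
        out.append(f"nameserver {new_primary}")
    return "\n".join(out) + "\n"


# Hypothetical file: 192.0.2.1 stands in for Magilla's old address.
sample = "search mnkids.net\nnameserver 192.0.2.1\nnameserver 208.38.65.35\n"
print(set_primary_nameserver(sample, "8.8.8.8"), end="")
```

This version leaves the second entry in place as a fallback; removing it altogether would be a one-line change.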