A client of mine had an issue this weekend due to a lack of execution by another of their service providers, and it just so happened that the failure occurred overnight and on a weekend. This provider was hosting and managing my client's DNS records, nothing more. DNS is certainly mission critical to an online presence, and one would assume that when changes are made to it, someone can be reached for support if something goes wrong.
Well, this weekend we were migrating our client's business-critical application servers from in-house into our cloud, and the migration itself went very smoothly. Backup processes had been replicating into our cloud for the past week or so, so all we really had to do was disconnect the existing servers, perform a final backup to our cloud, change the DNS entries, and spin up the new cloud servers. I never would have guessed that the DNS task would be the one to derail us, but it did. A backup process might fail and force you to a plan B; sometimes a new virtual server in the cloud needs some tweaks. But changing an IP address in DNS sure seemed like a safe bet.
My client's provider sent us a notice close to midnight the night of the migration that the entries had been updated and we should be good to go. We brought the servers up in our environment and began some external testing. It took a few minutes for the DNS changes to take effect, but within a handful of minutes our sites were back up and running. The only problem was that, because it was overnight, we were testing with a very small subset of locally based users. As the early hours of Sunday morning turned into Sunday afternoon, we began to get support calls that the customer-facing order-entry portal was inaccessible to many public users. The client also has a fleet of roughly 600 mobile devices that communicate back through web services on that web server, and those services weren't working for about 80% of the fleet.
We spent many hours digging into the applications themselves and the web server setup in the new cloud environment to see if the migration was to blame. Then we came back to DNS. Surely it had fully propagated by now, almost 24 hours later. But sure enough, their primary DNS server had the new IP address while their secondary still served the old one. The intermittent connection problems hit whichever clients happened to resolve against that stale secondary.
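This kind of split-brain state is easy to catch if you query each authoritative nameserver directly instead of going through your local resolver's cache. A minimal sketch using `dig` (the nameserver and host names below are placeholders, not the actual provider's):

```shell
# Pure comparison helper: takes the two answers and reports agreement.
compare_answers() {
  if [ "$1" = "$2" ]; then
    echo "in sync: $1"
  else
    echo "MISMATCH: got '$1' vs '$2'"
  fi
}

# Query each authoritative server directly (bypassing caching resolvers),
# then compare. Requires dig and network access, so shown commented out
# with placeholder names:
#   a1=$(dig +short @ns1.provider.example portal.example.com A)
#   a2=$(dig +short @ns2.provider.example portal.example.com A)
#   compare_answers "$a1" "$a2"
```

Comparing the zone's SOA serial the same way (`dig +short SOA` against each server) also tells you whether the secondary has even transferred the latest copy of the zone.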
So what do I do? I call their support phone number at about 1 AM. No answer. I leave a message. No problem: most overnight support services have some sort of paging system to alert their on-call staff, and the SLA for emergencies is usually a response within the hour. So I wait. Then I submit an online ticket, just to push it a bit. Then I call back. Again. And again. In all, about 25 times between 1 AM and 6:30 AM, all while our client's customers and employees continue to have trouble accessing their system. I finally receive a call at about 6:50 AM. "Sorry about the issue, and we have someone looking into it," the technician said. Ten minutes later the DNS was updated and everyone was back in business.
I questioned their 24/7 support process. The response I got? "We don't offer a 24/7 support line. Our techs have cell phones that you could call in an emergency." WHAT?!? How in the world would I know their techs' cell numbers, and how does anyone who controls a portion of a client's IT infrastructure justify not supporting it 24/7? Needless to say, less than eight hours later they no longer controlled my client's DNS. We do! Oh, and we offer 24/7/365 support.
Visit our contact page and get in touch if you are looking for a cloud provider who has your back.