Am I the only one who can't think of a time DNS has caused a production outage on a platform I worked on?
Lots of other problems over the years, but never DNS.
I have a coworker who always forgets TTL is a thing, and never plans ahead. On multiple occasions they've moved a database, updated DNS to reflect the change, and are confused why everything is broken for 10-20 minutes.
I really wish the first time they learned, but every once and a while they come to me to troubleshoot the same issue.
How would you prevent that?
While planning your change (or project requiring such change), check the relevant(* see edit) DNS TTL. Figure out the point in the future you want to do the actual change (time T), and set the TTL to 60 seconds at T-(TTL*2) or earlier. Then when it comes to the point where you need to make your DNS change, the TTL is reasonable and you can verify your change in some small amounts of minutes instead of wondering for hours.
Edit: literally check all host names involved. They are all suspect
This. For example, if you have a DNS entry for your DB and the TTL is set to 1 hour, an hour before you intend to make the changes, just lower the TTL of the record to a minute. This allows all clients to be told to only cache for a minute and to do lookups every minute. Then after an hour, make the necessary changes to the record. Within a minute of the changes, the clients should all be using the new record. Once you've confirmed that everything is good, you can then raise TTL to 1 hour again.
This approach does require some more planning and two or three updates to DNS, but minimizes downtime. The reason you may need to keep TTL high is if you have thousands of clients and you know the DNS won't be updated often. Since most providers charge per thousand or million lookups, that adds up quickly when you have thousands of clients who would be doing unnecessary lookups often. Also a larger TTL would minimize the impact of a loss of DNS servers.
Set it to 5 seconds
???
Profit
??? Is when the underwear gnomes send you a massive bill because you're paying per 1k lookups. They profit, you don't
"yes boss we need another 20 dns servers"
"idk why dns traffic is so heavy these days"
Am I the only one who can't think of a time DNS has caused a production outage on a platform I worked on?
Lots of other problems over the years, but never DNS.
I have a coworker who always forgets TTL is a thing, and never plans ahead. On multiple occasions they've moved a database, updated DNS to reflect the change, and are confused why everything is broken for 10-20 minutes.
I really wish the first time they learned, but every once and a while they come to me to troubleshoot the same issue.
How would you prevent that?
While planning your change (or project requiring such change), check the relevant(* see edit) DNS TTL. Figure out the point in the future you want to do the actual change (time T), and set the TTL to 60 seconds at T-(TTL*2) or earlier. Then when it comes to the point where you need to make your DNS change, the TTL is reasonable and you can verify your change in some small amounts of minutes instead of wondering for hours.
Edit: literally check all host names involved. They are all suspect
This. For example, if you have a DNS entry for your DB and the TTL is set to 1 hour, an hour before you intend to make the changes, just lower the TTL of the record to a minute. This allows all clients to be told to only cache for a minute and to do lookups every minute. Then after an hour, make the necessary changes to the record. Within a minute of the changes, the clients should all be using the new record. Once you've confirmed that everything is good, you can then raise TTL to 1 hour again.
This approach does require some more planning and two or three updates to DNS, but minimizes downtime. The reason you may need to keep TTL high is if you have thousands of clients and you know the DNS won't be updated often. Since most providers charge per thousand or million lookups, that adds up quickly when you have thousands of clients who would be doing unnecessary lookups often. Also a larger TTL would minimize the impact of a loss of DNS servers.
Set it to 5 seconds ??? Profit
??? Is when the underwear gnomes send you a massive bill because you're paying per 1k lookups. They profit, you don't
"yes boss we need another 20 dns servers" "idk why dns traffic is so heavy these days"