When I switched over mail infrastructure earlier this week, I might have been under a few mistaken impressions — in hindsight, one might qualify at least some of them as embarrassing.
Firstly, MX records never work the way you suspect — in part because you expect them to work as specified. Forget that.
In a hosted environment, you serve the receiving side of email traffic for many more domains than you own. As a result, you only control a very limited number of the MX records, and an equally limited number of TXT v=spf1 records involved with other people’s traffic.
Moreover, at some point in the past, you instructed these customers to define their SPF records in a particular way.
This might have included something smart, but strict, that allows you to migrate infrastructure so long as you pay sufficient attention — that’s where things go wrong. In operations, you don’t matter to the business. The business will just happily go sideways and run off in a completely different direction.
We had not documented beforehand why we did this particular thing this one way, thoughtful as it had been, nor that it was meant to let us follow up on it in one other, specific way. That’s where, later on in life, resources turn out to be lacking, and with them the ability to follow through on the originally outlined process.
Nor had we, by the way, sorted out a procedure for maintaining the DNS TLSA resource records when the set of MX resource records expands. This can be considered entirely my mistake, and that’s fair enough.
So, what happened?
I wanted the Internet-facing, inbound mail exchangers (ext-mx-in) to receive most if not all of the traffic, without violating commonly applied SPF restrictions for existing mail in the deferred queue on the old outbound mail exchangers. That queue takes 5 days to drain, so give it 6.
You would point the IN A RRs behind all existing MX record values at the new IP addresses, and choose one of them to hold on to the old IP addresses. Right? Nope.
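To make the SPF constraint concrete, here is a minimal sketch of the check a receiving server might apply to mail leaving the old deferred queue. It is an assumption-laden toy: the record, the IP addresses and the function name `spf_allows` are made up for illustration, and only `ip4:`/`ip6:` mechanisms are evaluated (no `include:`, `a`, `mx` or `redirect=`).

```python
import ipaddress


def spf_allows(spf_record: str, sender_ip: str) -> bool:
    """Return True if sender_ip matches an ip4:/ip6: mechanism.

    Simplification: include:, a, mx, redirect= and explicit
    qualifiers (+, -, ~, ?) are ignored in this sketch.
    """
    ip = ipaddress.ip_address(sender_ip)
    for term in spf_record.split():
        mechanism, _, value = term.partition(":")
        if mechanism.lower() not in ("ip4", "ip6") or not value:
            continue
        network = ipaddress.ip_network(value, strict=False)
        if ip.version == network.version and ip in network:
            return True
    return False


# Hypothetical strict customer record that only lists the old range.
CUSTOMER_SPF = "v=spf1 ip4:192.0.2.0/28 -all"
OLD_OUTBOUND_IP = "192.0.2.10"     # assumed old outbound exchanger
NEW_ADDRESS = "198.51.100.25"      # assumed address on the new side

print(spf_allows(CUSTOMER_SPF, OLD_OUTBOUND_IP))  # True
print(spf_allows(CUSTOMER_SPF, NEW_ADDRESS))      # False
```

With a record like this, queued mail only keeps passing SPF as long as it still leaves from the old address, which is why the old outbound exchangers have to keep their IPs until the queue is empty.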
This is where things, in part, went awry. See, our technical debt includes a former incomplete transition — a rebrand from MyKolab (mykolab.com) to Kolab Now (kolabnow.com). Thousands of customer domains still point to mykolab.com. Another group of thousands refer to kolabnow.com.
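To get a feel for the size of each group, one could bucket customer domains by the zone their MX targets live in. The sketch below works on an in-memory mapping rather than live DNS; the customer domains and the `mx.*` host names are invented for illustration and are not the real exchanger names.

```python
from collections import defaultdict


def mx_zone(target, zones):
    """Return the zone from `zones` that `target` belongs to, or None."""
    host = target.rstrip(".").lower()
    for zone in zones:
        if host == zone or host.endswith("." + zone):
            return zone
    return None


def group_by_mx_zone(mx_by_domain, zones):
    """Map each zone to the sorted customer domains whose MX points into it."""
    groups = defaultdict(set)
    for domain, targets in mx_by_domain.items():
        for target in targets:
            zone = mx_zone(target, zones)
            if zone is not None:
                groups[zone].add(domain)
    return {zone: sorted(domains) for zone, domains in groups.items()}


# Hypothetical customer data; the host names are made up.
CUSTOMER_MX = {
    "example.org": ["mx.mykolab.com."],
    "example.net": ["mx.kolabnow.com."],
    "example.com": ["mx.kolabnow.com.", "mx.mykolab.com."],
}

groups = group_by_mx_zone(CUSTOMER_MX, ["mykolab.com", "kolabnow.com"])
for zone, domains in groups.items():
    print(zone, domains)
```

A domain that lists exchangers in both zones shows up in both groups, which is exactly the kind of customer that makes a clean cut-over awkward.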
The nature of MX records is that you’re not allowed to point them at a CNAME and have clients “chase” it: the target needs a direct IN A and otherwise, well, basically the recipient address is rewritten. ’nuff about that.
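The rule that an MX target must not be an alias (RFC 2181, section 10.3) is easy to check mechanically. Here is a rough sketch against a toy in-memory record set; the `.example` names and the `mx_targets_behind_cname` helper are hypothetical, and a real check would query DNS instead.

```python
def mx_targets_behind_cname(records):
    """Return (owner, target) pairs where an MX target is itself a CNAME.

    `records` maps a lowercase owner name (no trailing dot) to a list
    of (rtype, value) tuples. This layout is an assumption of the sketch.
    """
    offenders = []
    for owner, rrset in records.items():
        for rtype, value in rrset:
            if rtype != "MX":
                continue
            # MX value looks like "10 mx.host.example."; take the host part.
            target = value.split()[-1].rstrip(".").lower()
            target_types = {t for t, _ in records.get(target, [])}
            if "CNAME" in target_types:
                offenders.append((owner, target))
    return offenders


# Hypothetical zone data: the legacy name is aliased to the current one.
RECORDS = {
    "example.org": [("MX", "10 mx.legacy.example.")],
    "example.net": [("MX", "10 mx.current.example.")],
    "mx.legacy.example": [("CNAME", "mx.current.example.")],
    "mx.current.example": [("A", "198.51.100.25")],
}

print(mx_targets_behind_cname(RECORDS))
# [('example.org', 'mx.legacy.example')]
```

In other words, the legacy name cannot simply be aliased to the new one; it needs its own address records.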
Furthermore, when you are not allowed to “chase” addresses, a number of peripheral changes will also need to be made — in the case at hand, it included the TLSA records for the MX RRs involved for the legacy domain.
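Since TLSA records for SMTP are published per MX host name at `_25._tcp.<mx host>`, every exchanger name that customers reference needs its own set. A small sketch of the bookkeeping, assuming hypothetical `.example` host names and a hand-maintained list of published TLSA owner names:

```python
def tlsa_owner(mx_host, port=25, proto="tcp"):
    """Build the TLSA owner name for an MX host, e.g. _25._tcp.mx.example."""
    return f"_{port}._{proto}.{mx_host.rstrip('.').lower()}."


def missing_tlsa(mx_hosts, published_tlsa_owners):
    """Return the MX hosts that have no TLSA owner name published."""
    published = {name.rstrip(".").lower() for name in published_tlsa_owners}
    return [h for h in mx_hosts if tlsa_owner(h).rstrip(".") not in published]


# Hypothetical inventory: the legacy name was forgotten.
MX_HOSTS = ["mx.current.example.", "mx.legacy.example."]
PUBLISHED = ["_25._tcp.mx.current.example."]

print(tlsa_owner("mx.legacy.example."))   # _25._tcp.mx.legacy.example.
print(missing_tlsa(MX_HOSTS, PUBLISHED))  # ['mx.legacy.example.']
```

Running something like this whenever the MX set changes would have flagged the legacy names before a DANE-validating sender did.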