
Gradually Shift Traffic With AWS Route 53 Weighted Routing Policy

by devgrowth.tech, March 10th, 2023


🤔 Why Do We Need to Shift Traffic Gradually?

It is common in software to migrate an existing service to a new infrastructure such as moving to the cloud. The business logic remains the same, but it is still a drastic change since the new service stack will be on new infrastructure.


We can use extensive integration tests, load tests, etc., to ensure the new service is working as expected. With critical services, however, it is still safer to shift production traffic to the new service stack gradually. This lets the service owner verify, in a safe and incremental way, that the new service is robust and scalable, and it gives clients time to adjust if necessary.


The high-level idea of traffic shifting


You may think there are already tools for this "gradual shifting" scenario. For example, many feature flag tools, either built in-house or from third-party vendors, probably offer functionality like this, and they could be the right solution for some cases, especially for experimenting with a new feature.


But for other scenarios, you have to consider whether it's the right solution. Will it add latency that could break your service's SLA? Can the tool handle all of the traffic that goes to the service? How much cleanup work will the feature flag need after all traffic is migrated?


More often, if there is no plan to maintain the existing service stack, "redirecting" the traffic using DNS resolution is common, and this is what this article will focus on (I've also seen people do a one-time flip using DNS, but this is usually not recommended considering the risk).


💡How Do We Use DNS to Solve This Problem?

Before we get into the details of using AWS Route53, we need to understand a bit more about DNS and AWS Route 53:


Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) service. It lets you customize DNS routing policies to reduce latency.


So, how does DNS (Domain Name System) work?

AWS - What is DNS?


In a nutshell, DNS is the phone book that translates a human-friendly domain name such as example.com into a machine-readable IP address such as 192.0.2.244.


When a request is initiated, the DNS lookup goes through a hierarchical name resolution architecture, in which different name servers each resolve part of the DNS name.


For example, in the above diagram, the query for www.example.com goes first to the DNS root name server, then to the name server for the .com TLD, and finally to the Route53 name server, which holds the record for www.example.com.


The machine-readable IP address is then returned so that the client can make a request to the host. The resolution result is heavily cached along this path.
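The delegation chain above can be sketched with a toy model. This is only an illustration under made-up assumptions: the `ZONES` table, the server labels, and the `resolve` helper are hypothetical, and a real resolver talks to real name servers over the network and caches answers.

```python
# Toy model of hierarchical DNS resolution. All zone data below is made up
# for illustration; it only mirrors the root -> TLD -> Route53 path in the text.
ZONES = {
    "root": {"delegations": {"com.": "tld-com"}},
    "tld-com": {"delegations": {"example.com.": "route53-example"}},
    "route53-example": {"records": {"www.example.com.": "192.0.2.244"}},
}


def resolve(name, zones=ZONES):
    """Follow delegations from the root until a server holds the record.

    Returns a tuple of (ip_address, list_of_servers_asked).
    """
    server = "root"
    path = []
    while True:
        path.append(server)
        zone = zones[server]
        # An authoritative server answers directly if it has the record.
        if name in zone.get("records", {}):
            return zone["records"][name], path
        # Otherwise it refers us to the server for the longest matching suffix.
        matches = [
            suffix
            for suffix in zone.get("delegations", {})
            if name == suffix or name.endswith("." + suffix)
        ]
        if not matches:
            raise LookupError(f"{name} cannot be resolved")
        server = zone["delegations"][max(matches, key=len)]


ip, path = resolve("www.example.com.")
print(ip, "via", " -> ".join(path))
```

Each hop only knows who is responsible for the next part of the name, which is why changing the final name server (as in scenario B below) changes who answers for the domain.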


Now that we know how DNS works, let's see how to implement gradual traffic shifting with AWS Route53:

Scenario A: Ask Clients to Use a New API Domain/URL

If it's easy to ask clients to use a new API URL, then it is relatively straightforward to add weighted routing policy records. First, have the new domain delegated from either your company's internal infrastructure or a third-party service provider, then create a new (public) hosted zone in Route53:


Add a new hosted zone in Route 53


Click on the hosted zone line; you should see an NS (name server) record and an SOA (start of authority) record that were created automatically.


For this scenario, you don't need to change these records; just keep them as they are and know that they hold administrative information for DNS resolution. We will talk more about them in another scenario.


The next step is to create records with a weighted routing policy:


create a weighted record to shift traffic


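After creating a record that points to the new service (with 155 as the weight), we can create another record that points to the existing service with a weight of 100. Route53 answers with each record in proportion to its weight divided by the sum of all weights, so here the new service receives 155/255, or roughly 61%, of DNS responses. We should then see these two records in the hosted zone, as below: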


weighted records in the hosted zone
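To make the effect of the two weights concrete, here is a minimal sketch of the share calculation. The `traffic_share` helper and the record labels are hypothetical names for illustration; the sketch deliberately rejects the case where every weight is zero rather than modelling it.

```python
def traffic_share(weights):
    """Return the fraction of DNS responses each weighted record should get."""
    total = sum(weights.values())
    if total == 0:
        # Not modelled here; see the Route 53 documentation for this case.
        raise ValueError("at least one weight must be greater than zero")
    return {name: weight / total for name, weight in weights.items()}


shares = traffic_share({"new-service": 155, "existing-service": 100})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# new-service: 60.8%
# existing-service: 39.2%
```

Note that these are shares of DNS responses, not of requests; because of caching, the request split only approaches these numbers across many clients and over time.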


Once this is set up, you can gradually change the weight config so that eventually, the new service stack can get all the traffic.
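If you prefer to script the weight changes instead of using the console, the sketch below builds the request body for the Route 53 `ChangeResourceRecordSets` API. The domain names, set identifiers, weights, TTL, and hosted zone ID are placeholders I made up, not values from this article; the helper only builds a dictionary, and the actual API call is left as a comment because it needs boto3 and AWS credentials.

```python
def weighted_change_batch(record_name, targets, ttl=60):
    """Build a ChangeBatch that upserts one weighted CNAME record per target.

    `targets` maps a SetIdentifier to a (dns_name, weight) pair.
    """
    changes = []
    for set_id, (dns_name, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")
        changes.append(
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": set_id,  # distinguishes records with the same name
                    "Weight": weight,
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": dns_name}],
                },
            }
        )
    return {"Comment": "Shift traffic gradually", "Changes": changes}


# Hypothetical first step: send a small share of responses to the new stack.
batch = weighted_change_batch(
    "api.example.com.",
    {
        "new": ("new-stack.example.net", 25),
        "existing": ("old-stack.example.net", 230),
    },
)

# To apply it (the zone ID below is a placeholder):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z_PLACEHOLDER", ChangeBatch=batch)
```

Running the same helper with a higher weight for `new` and a lower one for `existing` at each step gives a repeatable way to walk the traffic over.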

Scenario B: No Change Required From Clients to Use the Weighted Record

In reality, production services often serve a wide range of clients, and it could be challenging to ask every client to use the new API domain/URL. Luckily, we can still control which endpoints clients are using behind the scenes.


To understand how this approach works, we need to understand the role of the name server:


An NS record (or nameserver record) is a DNS record that contains the name of the authoritative name server within a domain or DNS zone. When a client queries for an IP address, it can find the IP address of their intended destination from an NS record via a DNS lookup.


In other words, a name server or DNS server contains all of the DNS zone files and records for a domain. As we mentioned in scenario A, when you create a hosted zone in Route53, by default, an NS record and SOA record will be created. Basically, these name servers know all the records you create in this hosted zone.


How do we make clients use the weighted records we create in Route53 without any changes on their side? For example, say the clients are using the endpoint medium.com.


This domain is managed by some internal infra or a third-party tool, and you can get the corresponding name server record with this command:


dig medium.com +noall +answer NS

; <<>> DiG 9.10.6 <<>> medium.com +noall +answer NS
;; global options: +cmd
medium.com.  86400 IN NS alina.ns.cloudflare.com.
medium.com.  86400 IN NS kip.ns.cloudflare.com.


In this case, we want to use the name server in Route53 instead of the Cloudflare name server so that the DNS resolution will flow through Route53 instead of Cloudflare; then the weighted records we set up in Route53 hosted zone will be effective.
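After the switch, it is worth confirming which name servers the domain actually resolves through. The sketch below parses `dig ... NS` answer lines like the ones shown above. The helper names are hypothetical, and the check for Route53 relies on my assumption that its name servers follow the `ns-NNN.awsdns-NN.<tld>` naming pattern; compare against the NS record in your hosted zone to be sure.

```python
DIG_OUTPUT = """\
; <<>> DiG 9.10.6 <<>> medium.com +noall +answer NS
;; global options: +cmd
medium.com.  86400 IN NS alina.ns.cloudflare.com.
medium.com.  86400 IN NS kip.ns.cloudflare.com.
"""


def parse_ns_records(dig_answer):
    """Return the name server host names found in dig's answer section."""
    servers = []
    for line in dig_answer.splitlines():
        fields = line.split()
        # Skip dig's comment lines (';') and anything that is not an NS row:
        # name, TTL, class, type, target.
        if len(fields) == 5 and not line.startswith(";") and fields[3] == "NS":
            servers.append(fields[4].rstrip(".").lower())
    return servers


def looks_like_route53(server):
    # Assumption: Route 53 name servers are named like ns-123.awsdns-45.com.
    return ".awsdns-" in server


servers = parse_ns_records(DIG_OUTPUT)
print(servers)
print("served by Route53:", all(looks_like_route53(s) for s in servers))
```

For the output above, the second line reports that the domain is not yet served by Route53, which is the state before the name server records are replaced.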


📝 Lessons Learned

  • DNS is cached heavily; changes are not instant, and traffic is going to linger for a while after making changes


  • A shorter TTL on the records makes rollback easier. Once the migration is completed and there is no need for rollback, you can change the TTL back to the default value or another value you see fit.


  • During the testing phase, if you have a load test set up, remember that a host will resolve to one of the record destinations and keep using it for the duration of the TTL. It's therefore better to run the load tests from several hosts, and perhaps for longer than the DNS TTL; only then should you expect the traffic to follow the DNS resolution weights.


  • If the domains in the DNS record are not publicly accessible, you cannot use the commonly used Route53 health check since it requires the domain to be accessible publicly. There are other metrics such as DNS query metrics, but if the domains can only be reached internally, it's better to find other ways to measure the health of the request.


  • To make your application more resilient, Route53 Application Recovery Controller (ARC) can help, and AWS recently released a feature called zonal shift that you can use to mitigate grey failures.


  • One thing I noticed: if the weights are set so that 100% of traffic goes to A and 0% goes to B, you would in theory never see traffic go to B. However, if A cannot be resolved, Route53 will fall back to other records such as B.


  • When you replace the name server record as described in scenario B, remember to do it in one single operation (append and delete together); otherwise, there will be a period of time when the domain cannot be resolved.
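The load-testing point above can be illustrated with a small simulation. It is a simplified sketch under assumptions of my own: every host resolves once, caches the answer for the entire run (so the run is shorter than the TTL), and the weights are the 155/100 split from scenario A. The function names and the seed are arbitrary.

```python
import random

WEIGHTS = {"new": 155, "existing": 100}


def pick_destination(rng, weights=WEIGHTS):
    """Pick one record the way a weighted policy would, in proportion to weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(hosts, requests_per_host, seed=0):
    """Return the share of requests per destination when each host resolves
    once and reuses the cached answer for all of its requests."""
    rng = random.Random(seed)
    counts = {name: 0 for name in WEIGHTS}
    for _ in range(hosts):
        counts[pick_destination(rng)] += requests_per_host
    total = hosts * requests_per_host
    return {name: count / total for name, count in counts.items()}


# Same number of requests, very different distribution:
print("1 host:      ", simulate(hosts=1, requests_per_host=10_000))
print("10,000 hosts:", simulate(hosts=10_000, requests_per_host=1))
```

With a single host, all requests land on one destination regardless of the weights; with many hosts, the split moves toward the configured 61/39 ratio.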

If you are interested in learning more about DNS or networking in general, this Coursera course from Google and this course from the University of Colorado give you an overview of computer networking. If you prefer books, there is a classic computer networking book, Computer Networking: A Top-down Approach. Hope this helps!


