Six design principles to help mitigate latency

Author: Rajesh Vargheese

The cloud has enabled global access, massive scale, simplified management, and cost reduction. While these are significant advantages, one side effect of moving workloads to the cloud is increased latency, introduced by the expanded distance between cloud-based applications and endpoint devices.

Today, enterprise cloud architects are challenged to address the latency issues that arise from the distributed, global nature of cloud applications.

Previously we considered the five C’s of latency (connection, closeness, capacity, contention, consistency). In this article we will review six design principles that can help mitigate latency challenges.

Minimize the hops and serve the request at the closest possible point

With the wide adoption of cloud-based applications, the number of hops has increased significantly compared to on-premises applications, because a wide area network now sits between the requestor and the service processing the request. Every hop is a potential contributor to latency. As the number of hops increases, latency is likely to increase; moreover, it becomes harder to keep latency consistent, because each hop has its own variable performance characteristics.
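The effect of extra hops can be sketched with a simple additive model: if hops are independent, both their mean latencies and their variances add, so a longer path is not just slower on average but also more variable. The numbers below are illustrative, not measurements.

```python
# Illustrative model: end-to-end latency over independent hops.
# Per-hop means add up, and so do per-hop variances -- so more hops
# means both higher expected latency and more jitter.

def path_latency(hops):
    """hops: list of (mean_ms, variance_ms2) tuples, one per hop."""
    mean = sum(m for m, _ in hops)
    variance = sum(v for _, v in hops)
    return mean, variance

# A short on-premises path vs. the same path extended across the WAN.
short_path = [(1.0, 0.2), (2.0, 0.5)]
long_path = short_path + [(8.0, 4.0), (15.0, 9.0), (5.0, 2.0)]

print(path_latency(short_path))  # low mean, low jitter
print(path_latency(long_path))   # mean and jitter both grow
```

The model is a simplification (real hops are neither independent nor Gaussian), but it captures why consistency, not just average latency, degrades as hops multiply.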

Cloud architects have focused their efforts on minimizing the number of hops and on processing requests at the closest possible point once they reach the cloud provider's network. Approaches such as caching, direct routes, and geo-based routing are commonly used. These approaches improve backend processing latency, but the path from the end user to the cloud still consists of many more hops through carrier networks and the internet.

To decrease the overall latency further, 5G and edge computing have emerged as potential game changers. Together they not only reduce the number of hops but can also influence the entire journey of the request through the last mile, the transport network, and the backend cloud services.

Mitigate contention by designing a horizontally scalable elastic infrastructure

Customer demand and traffic patterns change over time. How does this impact latency? If the design allows resources to be scaled dynamically to handle increases or decreases in traffic, the application is likely to provide consistent latency.

Cloud architects commonly use constructs such as auto-scaling groups, load balancers, read replicas, and API and server caching to mitigate latency. Auto-scaling groups launch additional compute instances as traffic increases or as policy thresholds are reached, creating additional capacity to handle requests. Load balancers distribute the load among instances and reduce contention at individual servers. API caching and resource caching allow commonly invoked requests to be handled at the first possible point in the transaction flow without creating additional traffic to backend servers and databases, reducing contention at downstream resources as traffic grows.
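The scaling logic behind an auto-scaling group can be sketched as a target-tracking policy: size the fleet so that per-instance utilization stays near a target. The function name and the 50% utilization target below are assumptions for illustration, not any cloud provider's API.

```python
import math

# Minimal sketch of a target-tracking scaling policy (hypothetical
# names; real auto-scaling groups expose similar knobs as configuration).

def desired_instances(current, cpu_utilization, target=0.5,
                      min_instances=2, max_instances=20):
    """Size the fleet so per-instance utilization approaches the target."""
    if current == 0:
        return min_instances
    # Proportional scaling: total demand / target per-instance load,
    # clamped to the configured fleet bounds.
    needed = math.ceil(current * cpu_utilization / target)
    return max(min_instances, min(max_instances, needed))

print(desired_instances(current=4, cpu_utilization=0.9))   # scale out to 8
print(desired_instances(current=10, cpu_utilization=0.3))  # scale in to 6
```

Because the policy scales in as well as out, capacity tracks demand in both directions, which is what keeps latency consistent rather than merely surviving peaks.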

To ensure low latency, scalable resources should not be limited to compute servers; the design must take each network resource into account. 5G, with its high data throughput, reduces contention right at the last mile, and processing requests at the edge further reduces traffic to the backend.

Identify bottlenecks and delegate requests to location and application-aware optimized resources

Bottlenecks limit the ability to reach and consume available resources and impact latency. However, if the load at potential bottleneck points is delegated to resources that are optimal for handling the request, processing performance can be improved.

Commonly used capabilities include geo-based routing and request-type-specific instances. For applications with more read requests than write requests, reads can be delegated to a pool of read replicas, eliminating the bottleneck on the primary server for write operations. Certain traffic is best processed at the edge, while other traffic is best served in the cloud; a well-architected design uses the ideal location for each workload.
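Read/write splitting can be sketched in a few lines: writes go to the primary, reads are spread across the replica pool. The endpoint names and the SELECT-prefix heuristic are illustrative assumptions; a production router would classify statements more carefully.

```python
import random

# Illustrative read/write splitting: writes go to the primary,
# reads are spread across read replicas (endpoint names are hypothetical).

PRIMARY = "db-primary"
REPLICAS = ["db-replica-1", "db-replica-2", "db-replica-3"]

def route(query: str) -> str:
    """Pick a database endpoint based on the statement type."""
    is_read = query.lstrip().lower().startswith("select")
    return random.choice(REPLICAS) if is_read else PRIMARY

print(route("SELECT * FROM orders"))    # one of the replicas
print(route("INSERT INTO orders ..."))  # db-primary
```

Spreading reads this way removes read traffic from the write path, so the primary's capacity is reserved for the operations only it can serve.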

In addition to scalability, application-aware infrastructure design is important. For example, using graphics processing unit (GPU) compute resources for video analytics and machine learning workloads can speed up processing compared to general-purpose compute, thereby limiting processing latency.

Cache whenever possible, closer to the customer and at multiple levels

One of the key factors in avoiding contention (and minimizing latency) at a processing server is to limit the requests the server needs to process. Caching is the process of storing data in a location that is different from the origin server and closer to the end user. For commonly used requests, backend services can cache the output instead of recreating it for each additional request.

Caching can occur at different levels and for different assets. Architects have used Content Delivery Networks (CDNs), which cache at the edge, closer to customers, thereby reducing latency for data asset requests. API caching allows the results of commonly invoked requests to be cached without requiring backend servers to process the requests.

The caches can be replicated across multiple regions to provide faster response times for users in specific regions.

Ensure closeness of dependent infrastructure through location and connectivity

In a distributed architecture, processing an application request can involve multiple dependent services (for example, a web server talking to a database). If these services are in close proximity to each other, their contribution to latency can be contained. If data must travel a great distance through many routers, or if there is congestion on the network, latency is likely to increase.

Having a scalable enterprise backbone is important to keep traffic routed through the preferred path instead of being routed off-network and through networks where latency characteristics are difficult to control.

Architects have used network routing concepts to keep traffic contained in a specific network and have used placement and other strategies to keep interdependent resources closer to each other.

Understand segmentation of traffic and define policies to prioritize and process

Not all traffic is the same; it differs in type and importance. A well-architected design must account for the type of traffic and its priority, and establish policies to help reduce latency. Examples include control-plane traffic, which determines which path to use, versus data-plane traffic, which forwards packets and frames; read requests versus write requests; and an initial request for an asset versus a repeat request for the same asset.

Policies can be of different types. Routing requests to backend resources with geolocation-based or latency-based routing policies can provide lower latency, since requests are processed by resources closer to the user.
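A latency-based routing policy reduces to a simple selection: send each request to the region with the lowest measured round-trip time. The region names and RTT figures below are static placeholders standing in for real, continuously updated probes.

```python
# Sketch of a latency-based routing policy: choose the region with the
# lowest measured round-trip time (static numbers stand in for probes).

measured_rtt_ms = {
    "us-east": 12.0,
    "eu-west": 85.0,
    "ap-south": 140.0,
}

def pick_region(rtts):
    """Return the region name with the smallest RTT."""
    return min(rtts, key=rtts.get)

print(pick_region(measured_rtt_ms))  # us-east
```

A geolocation-based policy works the same way structurally, except the selection key is the requester's location rather than a measured RTT.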

Network slicing ensures that capacity is allocated to prioritized traffic, so quality of service can be maintained even in the last-mile transport. The ability to prioritize traffic is critical to maintaining low latency, especially when there is contention.
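Prioritization under contention can be sketched with a priority queue: when more work arrives than can be served, high-priority traffic is dequeued first. The traffic classes and priority values below are illustrative.

```python
import heapq

# Sketch of priority-based processing: under contention, control-plane
# and other high-priority traffic is dequeued before bulk traffic
# (priority values are illustrative; lower number = higher priority).

queue = []
heapq.heappush(queue, (2, "bulk data transfer"))
heapq.heappush(queue, (0, "control-plane update"))
heapq.heappush(queue, (1, "interactive read"))

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # control-plane first, bulk transfer last
```

Real networks apply the same idea in hardware and in slice configuration, but the principle is identical: latency-sensitive classes never wait behind bulk classes.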

In summary, delivering application performance requires careful design planning to mitigate latency. Application performance also depends on a strategic network plan that considers the complete end-to-end data journey to deliver the edge computing experience businesses demand.

Learn more about 5G Edge and the design options that can help jumpstart your journey in delivering low-latency applications.

Rajesh Vargheese is a Technology Strategist & Distinguished Architect for Verizon's 5G/MEC Professional Services organization. Rajesh brings 20+ years of expertise in technology strategy, engineering, product management, and consulting to help customers innovate and drive business outcomes.