Scaling Real-Time Messaging for Live Chat Experiences: Challenges and Best Practices
Live chat is the most common type of realtime Web experience. Embedded in our everyday lives through messaging platforms (e.g., WhatsApp and Slack) and chat features across e-commerce, live streaming, and e-learning, it has led end users to expect (near) instant message delivery and receipt. Meeting these expectations requires a robust realtime messaging system that delivers at any scale. Here, I’ll outline the challenges involved in delivering this, and ways to overcome them if you decide to build.
Ensuring Message Delivery Across Disconnections
All messaging systems will experience client disconnections. What’s important is ensuring that data integrity is preserved (no message is lost, delivered multiple times, or out of order) — particularly as your system scales and the volume of disconnects grows. Here are some best practices for preserving data integrity:
- Ensure disconnected clients can reconnect automatically, without any user action. The standard approach is exponential backoff: increase the wait time after each failed reconnection attempt, up to a maximum delay. This buys time to add capacity to the system so it can handle the reconnection attempts that might arrive simultaneously. When deciding how to handle reconnections, you should also consider the impact that frequent reconnect attempts have on the battery of user devices.
- Ensure data integrity by persisting messages somewhere, so they can be re-sent if needed. This means deciding where to store messages and how long to store them.
- Keep track of the last message received on the client side. To achieve this, you can add sequencing information to each message (e.g., a serial number that specifies its position in an ordered sequence of messages). This enables delivery of the backlog of undelivered messages to resume exactly where it left off when the client reconnects, as shown in the sketch after this list.
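To make this concrete, here’s a minimal client-side sketch that combines two of the ideas above: exponential backoff with jitter, and resuming from the last received serial number. The endpoint URL, the serial field on each message, and the resumeFrom query parameter are illustrative assumptions, not a real API.

```typescript
// A minimal reconnecting chat client sketch. The URL, the "serial" field,
// and the "resumeFrom" query parameter are assumptions for illustration.
class ReconnectingChatClient {
  private ws?: WebSocket;
  private lastSerial = 0; // highest serial received so far
  private attempts = 0;
  private readonly baseDelayMs = 1_000;
  private readonly maxDelayMs = 30_000;

  constructor(
    private url: string,
    private onMessage: (msg: { serial: number; text: string }) => void,
  ) {
    this.connect();
  }

  private connect(): void {
    // Ask the server to replay everything after the last message we saw.
    this.ws = new WebSocket(`${this.url}?resumeFrom=${this.lastSerial}`);

    this.ws.onopen = () => {
      this.attempts = 0; // reset backoff on a successful connection
    };

    this.ws.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.serial <= this.lastSerial) return; // drop replayed duplicates
      this.lastSerial = msg.serial;              // track position in the stream
      this.onMessage(msg);
    };

    this.ws.onclose = () => this.scheduleReconnect();
  }

  private scheduleReconnect(): void {
    // Exponential backoff with jitter, capped at maxDelayMs, so a mass
    // disconnection doesn't become a synchronized reconnection storm.
    const delay = Math.min(this.baseDelayMs * 2 ** this.attempts, this.maxDelayMs);
    this.attempts++;
    setTimeout(() => this.connect(), delay + Math.random() * 1_000);
  }
}
```

Note that tracking the serial number also gives you deduplication: any message replayed at or below lastSerial is discarded, so a resumed backlog can’t deliver the same message twice.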
Achieving Consistently Low Latencies
Low-latency data delivery is the cornerstone of any realtime messaging system. Most people perceive a response time of 100ms as instantaneous, so messages delivered in 100ms or less are received in realtime from a user perspective. However, delivering low latency at scale is no easy feat, since latency is affected by a range of factors, notably:
- Network congestion.
- Processing power.
- The physical distance between the server and client.
To achieve low latency, you need the ability to dynamically increase the capacity of your server layer and reassign load, so there is always enough processing power and your servers are never slowed down or overrun.
You should also consider using an event-driven protocol optimized for low latency (e.g., WebSocket) and aim to counteract the effect of latency variation by deploying your realtime messaging system in different regions and routing traffic to the region that provides the lowest latency.
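In production, latency-based DNS or anycast routing usually picks the region for you, but the idea can be sketched client-side: measure the round-trip time to each region’s endpoint and connect to the fastest. The region URLs and the /health endpoint below are assumptions for illustration.

```typescript
// Pick the lowest-latency region by timing a small request to each.
// The region URLs and /health endpoint are illustrative assumptions.
const REGIONS = [
  'https://us-east.chat.example.com',
  'https://eu-west.chat.example.com',
  'https://ap-south.chat.example.com',
];

async function measureRtt(base: string): Promise<number> {
  const start = performance.now();
  await fetch(`${base}/health`, { cache: 'no-store' }); // avoid cached responses
  return performance.now() - start;
}

async function pickFastestRegion(): Promise<string> {
  const results = await Promise.all(
    REGIONS.map(async (region) => ({ region, rtt: await measureRtt(region) })),
  );
  results.sort((a, b) => a.rtt - b.rtt);
  return results[0].region; // open your WebSocket connection against this region
}
```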
While WebSocket is a better choice than HTTP for low-latency communication, WebSocket connections are harder to scale than HTTP because they persist for long periods of time. This is particularly tricky to handle if you scale horizontally. You need a way for existing servers to shed WebSocket connections onto any servers you might spin up (in contrast, with HTTP, you can simply route each incoming request to new resources). This is already difficult when your servers are in a single data center (region), let alone when you’re building a globally distributed, multi-region WebSocket-based messaging system.
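One way to shed connections is to close a portion of them gracefully so that clients reconnect through the load balancer and land on newly provisioned servers. Below is a rough server-side sketch using the Node.js ws package; the connection target, the periodic trigger, and the close code are assumptions, and in practice you’d tie shedding to your autoscaler’s scaling events.

```typescript
import { WebSocketServer } from 'ws';

// Shed WebSocket connections above a target so reconnecting clients are
// routed (via the load balancer) onto newly added servers. The target and
// trigger below are illustrative assumptions.
const wss = new WebSocketServer({ port: 8080 });
const TARGET_CONNECTIONS = 10_000;

function shedExcessConnections(): void {
  const excess = wss.clients.size - TARGET_CONNECTIONS;
  if (excess <= 0) return;

  let closed = 0;
  for (const client of wss.clients) {
    if (closed >= excess) break;
    // 4000-range close codes are application-defined; clients should treat
    // this one as "reconnect through the load balancer".
    client.close(4001, 'shedding load, please reconnect');
    closed++;
  }
}

// Re-check periodically; a real system would react to scaling events instead.
setInterval(shedExcessConnections, 10_000);
```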
Dealing with Volatile Demand
Any system that’s accessible over the public internet should expect to deal with an unknown (but potentially high) and quickly changing number of users. For example, if you offer a commercial chat solution in specific geographies, you’ll want to avoid being overprovisioned globally: scale up only when you expect high traffic in a given geography (during working hours, for instance) and scale down at other times, while still being able to absorb unexpected out-of-hours activity.
Therefore, to operate your messaging service cost-effectively, you need to scale up and down dynamically, depending on load, and avoid being overprovisioned at all times. Ensuring your realtime messaging system can handle this involves two key things: scaling the server layer and architecting your system for scale.
Scaling the Server Layer
At first glance, vertical scaling seems attractive. It’s easier to implement and maintain than horizontal scaling, especially if you’re using a stateful protocol like WebSocket. However, with vertical scaling there’s a single point of failure, a technical ceiling on scale set by your cloud host or hardware supplier, and a higher risk of congestion. Plus, it requires up-front planning to avoid the end-user impact of adding capacity.
Horizontal scaling is a more dependable model since you are able to protect your system’s availability using other nodes in the network if a server crashes or needs to be upgraded. The downside is the complexity that comes with having an entire server farm to manage and optimize, plus a load-balancing layer. You’ll have to decide on things like:
- The best load-balancing algorithm for your use case (e.g., round-robin, least-connected, hashing); see the sketch after this list.
- How to redistribute load evenly across your server farm — including shedding and reassigning existing load during a scaling event.
- How to handle disconnections and reconnections.
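To illustrate why the algorithm choice matters for long-lived connections, here’s a minimal least-connected selection sketch. The Backend shape and the counts are made up, and real load balancers (NGINX, HAProxy, and the like) implement this for you.

```typescript
// Choose the backend with the fewest active connections. With long-lived
// WebSocket connections, least-connected beats round-robin because a server
// that accumulated connections hours ago is still carrying them now.
interface Backend {
  host: string;
  activeConnections: number;
}

function pickLeastConnected(backends: Backend[]): Backend {
  return backends.reduce((least, candidate) =>
    candidate.activeConnections < least.activeConnections ? candidate : least,
  );
}

// Usage: route the next incoming connection to the chosen backend.
const backends: Backend[] = [
  { host: '10.0.0.1', activeConnections: 4_210 },
  { host: '10.0.0.2', activeConnections: 3_975 },
  { host: '10.0.0.3', activeConnections: 512 }, // freshly scaled up
];
console.log(pickLeastConnected(backends).host); // "10.0.0.3"
```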
Supporting a fallback transport adds to the complexity of horizontal scaling. For example, if you use WebSocket as your main transport, you need to consider whether users will connect from environments where WebSockets might not be available (e.g., restrictive corporate networks and certain browsers). If they will, then fallback support (e.g., HTTP long polling) is required. When handling fundamentally different protocols, your scaling parameters change, since you need a strategy to scale both. You might even need separate server farms to handle WebSocket vs. HTTP traffic.
Architecting Your System for Scale
Given the unpredictability of user volumes, you should architect your realtime messaging system using a pattern designed for scale. A popular and dependable choice is the publish/subscribe (pub/sub) pattern, which provides a framework for exchanging messages between any number of publishers and subscribers. Both publishers and subscribers are unaware of each other. They’re decoupled by a message broker that groups messages into channels (or topics) — publishers send messages to channels, while subscribers receive messages by subscribing to them.
As long as the message broker can scale predictably, you shouldn’t have to make other changes to deal with unpredictable user volumes.
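Here’s a minimal in-memory pub/sub broker sketch to make the pattern concrete. It assumes a single process; a production broker adds persistence, delivery guarantees, and fan-out across servers.

```typescript
// A tiny single-process pub/sub broker: channels map to sets of subscribers.
type Subscriber<T> = (message: T) => void;

class PubSubBroker<T> {
  private channels = new Map<string, Set<Subscriber<T>>>();

  subscribe(channel: string, fn: Subscriber<T>): () => void {
    if (!this.channels.has(channel)) this.channels.set(channel, new Set());
    this.channels.get(channel)!.add(fn);
    // Return an unsubscribe function so subscribers can detach cleanly.
    return () => this.channels.get(channel)?.delete(fn);
  }

  publish(channel: string, message: T): void {
    // Publishers never know who is listening: zero, one, or many subscribers.
    this.channels.get(channel)?.forEach((fn) => fn(message));
  }
}

// Usage: a chat room is just a channel name both sides agree on.
const broker = new PubSubBroker<string>();
const unsubscribe = broker.subscribe('room:42', (msg) => console.log('received:', msg));
broker.publish('room:42', 'hello, world'); // logs "received: hello, world"
unsubscribe();
```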
That being said, pub/sub comes with its complexities. For any publisher, there could be one, many, or no subscriber connections listening for messages on the same channel. If you’re using WebSockets and you’ve spread all connections across multiple frontend servers as part of your horizontal scaling strategy, you now need a way to route messages between your own servers, such that they’re delivered to the corresponding frontends holding the WebSocket connections to the relevant subscribers.
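A common solution, sketched below, is a shared message bus as a backplane between frontend servers, such as Redis pub/sub. Each server relays messages from its local WebSocket clients into Redis and delivers anything arriving from Redis to the local sockets subscribed to that channel, so a message published on any server reaches subscribers on every server. The channel naming and setup here are assumptions for illustration, using the redis and ws packages.

```typescript
import { createClient } from 'redis';
import { WebSocketServer, WebSocket } from 'ws';

// Redis pub/sub as a backplane between frontend servers. The channel name
// below is hard-coded for illustration; derive it from the URL or auth data.
const publisher = createClient(); // assumes Redis on localhost:6379
const subscriber = publisher.duplicate(); // pub/sub needs its own connection

const localSubscribers = new Map<string, Set<WebSocket>>();
const wss = new WebSocketServer({ port: 8080 });

async function main(): Promise<void> {
  await publisher.connect();
  await subscriber.connect();

  // Whenever any server publishes to Redis, deliver the message to the
  // WebSocket clients connected to *this* server.
  await subscriber.pSubscribe('chat:*', (message, channel) => {
    localSubscribers.get(channel)?.forEach((ws) => ws.send(message));
  });

  wss.on('connection', (ws) => {
    const channel = 'chat:room:42'; // illustrative assumption
    if (!localSubscribers.has(channel)) localSubscribers.set(channel, new Set());
    localSubscribers.get(channel)!.add(ws);

    // An incoming message fans out through Redis to every frontend server.
    ws.on('message', (data) => publisher.publish(channel, data.toString()));
    ws.on('close', () => localSubscribers.get(channel)?.delete(ws));
  });
}

main();
```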
Making Your System Fault-Tolerant
To deliver live chat experiences at scale, you need to think about the fault tolerance of the underlying realtime messaging system.
Fault-tolerant systems assume that component failures will occur and ensure that the system has enough redundancy to continue operating. The larger the system, the more likely failures are, and the more important fault tolerance becomes.
To make your system fault-tolerant, you must ensure it’s redundant against any kind of failure (software, hardware, network, or otherwise). This could mean things like:
- Having the ability to elastically scale your server layer;
- Operating with extra capacity on standby;
- Distributing your infrastructure across multiple regions (sometimes entire regions do fail, so to provide high availability and superior uptime guarantees, you shouldn’t rely on any single region).
Note that implementing fault-tolerant mechanisms creates complexity around preserving data integrity (guaranteed message ordering and delivery). Making sure that operations fail over across regions or availability zones automatically when there’s an outage is very tricky. Ensuring this happens without the user being sent the same message twice, dropping a message, or delivering things out of order is particularly difficult.
Six Best Practices for Scaling Real-Time Messaging
Given the challenges associated with scaling realtime messaging, it’s critical to make the right choices up front to ensure your chat system is dependable at scale.
Some best practices to remember are:
- Preserve data integrity with mechanisms that allow you to enforce message ordering and delivery at all times.
- Use a low-overhead protocol like WebSocket that’s designed and optimized for low-latency communication.
- Choose horizontal over vertical scaling. Although more complex, horizontal scaling provides higher availability in the long run.
- Use an architecture pattern designed for scale like the pub/sub pattern, which provides a framework for exchanging messages between any number of publishers and subscribers.
- Ensure your system is dynamically elastic. The ability to automatically add more capacity to your realtime messaging infrastructure to deal with spikes is key to handling the ebb and flow of traffic.
- Use a multi-region setup. A globally distributed, multi-region setup puts you in a better position to ensure consistently low latencies and avoid single points of failure.
Ultimately, prepare for things to go wrong. Whenever you engineer a large-scale realtime messaging system, something will fail eventually. Plan for the inevitable by building redundancy into every layer of your realtime infrastructure.
About the author: Matthew O’Riordan is CEO and co-founder of Ably, a realtime experience infrastructure provider. He has been a software engineer for over 20 years, many of those as a CTO. He first started working on commercial internet projects in the mid-1990s, when Internet Explorer 3 and Netscape were still battling it out. While he enjoys coding, the challenges he faces as an entrepreneur starting and scaling businesses are what drive him. Matthew has previously started and successfully exited from two tech businesses.