We recently onboarded a new customer into our Managed Database Service for SQL Server and after rolling out monitoring across their estate we picked up several alerts. This is not unusual when onboarding new environments as there is normally a period during which we consolidate and remediate any number of issues – small and large.
This customer has a large estate with multiple transactional replication publications and subscriptions and in amongst the alerts was an interesting one that would have been otherwise difficult to spot without our monitoring platform.
The alerts indicated that 3 publications (across 2 different publishers) were not synchronizing. The first step was to log onto the publishers and check out SQL Server Replication Monitor where we were presented with this view:
All looks good. You can see that the subscription performance is “Excellent”, latency is minimal, and the Last Synchronization was very recent (relative to the Last refresh time of the replication monitor).
Running the following query in the distribution database revealed no recent errors so everything seemed OK from that perspective:
select top 10 * from MSrepl_errors order by time desc;
However, the following query revealed some interesting output:
select top 10 * from MSdistribution_history order by time desc;
You can see that two of the distribution agents are running fine and seemingly delivering commands successfully. However, two publications are reporting that the Initial snapshot for the publication is not yet available. We can also see that the runstatus of the distribution agents that are not synchronizing are set to 4, which indicates that the agents are idle.
Visit this link for more details on SQL Server’s MSdistribution_history table.
The total_delivered_commands column in MSdistribution_history also tells us that zero commands have been delivered since the agent was started on the respective dates, meaning that replication has not been delivering transactions to the subscribers for over a year in both publications.
Returning to Replication Monitor, if we double-click on the subscription shown above, we see more detail relating to the distribution agent:
Here, we can see the same message reported in MSdistribution_history. Admittedly, when I first glanced at replication monitoring, because everything was green and seemingly fine, I didn’t see any reason to double-click into the subscription.
I suspect, that from SQL Server’s perspective, nothing is broken. Replication has been set up correctly and the log reader agent is doing its thing, as is the distribution agent. It just happens to be that no one got around to running the snapshot agent for that initial snapshot to be delivered to the subscribers.
Even so, it’s a fair assumption that if you went to the trouble of setting up replication that you are expecting commands to be replicated.
In this case, that assumption was correct, and our customer was thankful for us having spotted the problem.
Why is SQL Server Replication still used today?
We often get asked whether replication still has a place in SQL Server today. Some often call out AlwaysOn as offering similar functionality while being easier to manage. Here are some of the reasons why we believe replication still has a place in modern SQL Server architectures:
- You can replicate a subset of the database (e.g., just one table) with replication, whereas with AlwaysOn it’s the whole database or nothing.
- You can have an inordinately larger number of subscribers (secondaries) with replication.
- The subscription databases can be read-write instead of just read-only.
- You can implement a different indexing strategy on subscribers to suit the workload of that subscriber.
- You can replicate to a different version of SQL Server. Or even to a non-SQL Server destination (this is being deprecated in the latest versions of SQL Server).
- With replication, you can configure security on your subscriber(s) independently from your publisher.
- Maintenance tasks (e.g., index rebuilds) do not get replicated to your subscribers.
Replication is not a HA/DR tool in the same way that AlwaysOn is and whilst it can pass for such a tool, it isn’t intended to provide such functionality.
It is important to evaluate your requirements and choose the right tool that fits the job. In fact, there’s no reason why you cannot leverage AlwaysOn and replication at the same time.
Monitoring SQL Server Replication
Replication, like many other tools in SQL Server offers wonderful functionality. It does, however, have more moving parts than many others which means there’s more that can go wrong.
If you’re not constantly monitoring your SQL Server environments it’s very easy to miss those things that aren’t working as expected, even if you regularly use the built-in monitoring tools provided by SQL Server. In some cases, the impact of missing something can have a dramatic impact on the overall availability of your production databases. For example, if the log reader agent had been failing instead of the distribution agent, this could result in the transaction log filling up on your publication database.
Our Managed Database Service monitors the health of all replication agents as well as the health of distributors, publishers, publications, subscribers & subscriptions and we take the time to define what “healthy” looks like for each customer. If you have 2000 pending commands waiting to be delivered by the distribution agent, that might be normal for one publication and excessive for another. Or it might be that you only care about your subscription latency.
Our customers rely on us to proactively identify when things are not quite right so that we can fix things before they become serious incidents. We invest a lot of time in ensuring that every tiny aspect and angle of SQL Server is covered so that they have the confidence to focus on their own customers. Read more about our partnership stories.