RemotePresenceView and Connectivity Failures

Recently I've been investigating how presence subscriptions from UCMA applications are affected by losses of network connectivity. In some applications, having up-to-date presence information is critical to the proper functioning of the application, and it is important to be sure that the application can discover and react to interruptions in its presence notifications. Various kinds of network connectivity disruptions can interfere with the delivery of presence notifications:

  • The application server itself could lose network connectivity
  • A Front End Server could lose network connectivity or go down
  • Connectivity between pools could be interrupted or another pool could go down
  • An Edge Server could go down or lose network connectivity, or federation with a partner could fail

So, what effect will network connectivity problems such as these have on presence subscriptions from your UCMA 3.0 or UCMA 4.0 application? How will the application find out about them? And how should it handle recovering from network connectivity problems?

No Immediate Sign of Problems

The first important point to be aware of is that, in most circumstances, your application won't have any immediate way of knowing that it has stopped receiving presence notifications because of a network interruption. To understand why, we need to look at how Lync handles presence subscriptions.

When an instance of RemotePresenceView in your UCMA application starts subscribing to a user's presence, it sends a SIP SUBSCRIBE request to the SIP URI of the user whose presence it is subscribing to. An XML document in the body of the message identifies the specific information it wants to be notified about. It then gets back a 200 OK response acknowledging the subscription, with an XML body containing the user's current presence information. From that point forward, when something relevant changes in the user's presence, a SIP NOTIFY message gets sent to the UCMA application from the SIP URI of the user. This NOTIFY message has the same call ID (in the Call-ID header) as the SUBSCRIBE message, tying it back to the presence subscription. It contains XML with the updated presence information in its body.
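To make the Call-ID correlation concrete, here is a minimal sketch in Python. It is a toy model, not UCMA code: the SipMessage and SubscriptionTable names are hypothetical, the presence body is a plain string rather than XML, and real SIP dialog matching also involves the From and To tags, which are left out here.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


@dataclass
class SipMessage:
    """Simplified SIP message; only the fields needed for this sketch."""
    method: str       # "SUBSCRIBE" or "NOTIFY"
    call_id: str      # value of the Call-ID header
    target_uri: str   # SIP URI of the user whose presence is watched
    body: str = ""    # stands in for the XML presence document


class SubscriptionTable:
    """Hypothetical helper that tracks subscriptions by Call-ID."""

    def __init__(self) -> None:
        self._by_call_id: Dict[str, str] = {}
        self.latest_presence: Dict[str, str] = {}

    def subscribe(self, target_uri: str) -> SipMessage:
        # Each new subscription gets its own Call-ID.
        call_id = uuid.uuid4().hex
        self._by_call_id[call_id] = target_uri
        return SipMessage("SUBSCRIBE", call_id, target_uri)

    def handle_notify(self, notify: SipMessage) -> bool:
        """Apply a NOTIFY only if its Call-ID matches a known subscription."""
        target = self._by_call_id.get(notify.call_id)
        if target is None:
            return False  # no matching subscription, so the message is ignored
        self.latest_presence[target] = notify.body
        return True


table = SubscriptionTable()
sub = table.subscribe("sip:alice@example.com")
matched = table.handle_notify(
    SipMessage("NOTIFY", sub.call_id, sub.target_uri, "Busy"))
stray = table.handle_notify(
    SipMessage("NOTIFY", "unknown-call-id", sub.target_uri, "Away"))
print(matched, stray, table.latest_presence)
```

The NOTIFY carrying the original Call-ID updates the stored presence, while the one with an unknown Call-ID is dropped.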

The subscription target (the user whose presence we've subscribed to) doesn't send any kind of "heartbeat" messages back to the subscriber. This means that if a network link goes down, and those NOTIFY messages aren't getting through, your UCMA application has no way of knowing immediately what has happened. There is no way for it to distinguish between not receiving presence notifications because there haven't been any updates and not receiving notifications because there's been a network failure.

Now, of course there are some obvious cases where the UCMA application will experience other failures immediately because of network connectivity going down. If the application server where the UCMA app is located loses all network connectivity, existing calls will drop, new outbound calls will fail, newly created presence subscriptions won't work, and so forth. But more subtle network failures can go unnoticed for a while, impacting presence notifications the whole time. Examples include loss of connectivity between Lync pools or between federated partners.

Presence Subscription Refreshes

Eventually, after several hours, the application will try to refresh the presence subscription. This is a built-in behaviour -- the Lync client does it as well -- and the way it works is that it sends a new SUBSCRIBE request to the server where the notifications have been coming from, with the same call ID as before, effectively saying "Please keep this subscription active." It gets back another 200 OK with the user's current presence, and continues receiving notifications -- assuming everything is working normally, of course.

If the application tries to refresh the subscription while the subscription target is unreachable because of network issues, the refresh will fail. This is the first indication to the application that there is a problem.

The failed refresh isn't surfaced immediately by UCMA, though. First, it will try completely re-establishing the presence subscription from scratch, using an entirely new SUBSCRIBE message to the user's SIP URI with a new call ID. It will retry this a few times, with short delays in between. Finally, it will stop trying, and the state of the presence subscription will go from Subscribed to Terminating and then Terminated.
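The sequence described above can be modelled as a small state machine. The following Python sketch is only an illustration of that description, not UCMA's actual implementation: the function name, the two callables, and the retry count of 3 are assumptions ("a few times" is all that is known), and the short delays between retries are omitted.

```python
from enum import Enum
from typing import Callable, List


class State(Enum):
    SUBSCRIBED = "Subscribed"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


def refresh_cycle(
    send_refresh: Callable[[], bool],
    send_new_subscribe: Callable[[], bool],
    max_retries: int = 3,  # assumed value; the real retry count is not documented here
) -> List[State]:
    """Return the subscription states visited during one refresh attempt."""
    states = [State.SUBSCRIBED]

    # Step 1: refresh the existing subscription (same call ID).
    if send_refresh():
        return states

    # Step 2: re-establish from scratch (new call ID), a few times.
    for _ in range(max_retries):
        if send_new_subscribe():
            return states  # recovered; in this model the state stays Subscribed

    # Step 3: give up. Only now does the failure become visible.
    states.extend([State.TERMINATING, State.TERMINATED])
    return states


unreachable = refresh_cycle(lambda: False, lambda: False)
healthy = refresh_cycle(lambda: True, lambda: False)
print([s.value for s in unreachable])
print([s.value for s in healthy])
```

In this model a target that stays unreachable ends in Terminated, while a successful refresh or a successful re-subscription leaves the subscription in Subscribed with no visible state change.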

You can be notified of these presence subscription state changes by subscribing to the RemotePresenceView.SubscriptionStateChanged event.

Responding to Network Failures

Two takeaways from these points about RemotePresenceView are useful when designing an application in which presence notifications are important.

First of all, make sure to monitor state changes in the presence subscriptions using the RemotePresenceView.SubscriptionStateChanged event. If you notice a subscription terminating unexpectedly (when you did not intentionally terminate it), you may need to re-establish it, either right away or after some time has passed and the connectivity problem has resolved itself. If you do not monitor this event, your application may lose presence subscriptions during long connectivity interruptions, and these presence subscriptions will not automatically recover.
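One way to structure the recovery decision is sketched below in Python. It is a language-neutral model, not UCMA code: ResubscribePlanner, its method names, and the 30-second base and 600-second cap are all hypothetical. In a real application, on_state_changed would be driven from a SubscriptionStateChanged handler, and the returned delay would schedule a new subscription for that target.

```python
from typing import Dict, Optional, Set


class ResubscribePlanner:
    """Decides whether and when to re-establish a terminated subscription."""

    def __init__(self, base_delay: float = 30.0, max_delay: float = 600.0) -> None:
        self.base_delay = base_delay      # assumed starting delay, in seconds
        self.max_delay = max_delay        # assumed upper bound, in seconds
        self._intentional: Set[str] = set()
        self._failures: Dict[str, int] = {}

    def mark_intentional(self, uri: str) -> None:
        """Call before the application itself removes a subscription."""
        self._intentional.add(uri)

    def on_state_changed(self, uri: str, new_state: str) -> Optional[float]:
        """Return seconds to wait before resubscribing, or None for no action."""
        if new_state == "Subscribed":
            self._failures.pop(uri, None)  # healthy again, reset the backoff
            return None
        if new_state != "Terminated":
            return None                    # e.g. Terminating: wait for the final state
        if uri in self._intentional:
            self._intentional.discard(uri)
            return None                    # we asked for this termination
        failures = self._failures.get(uri, 0)
        self._failures[uri] = failures + 1
        # Double the delay after each consecutive failure, up to the cap.
        return min(self.base_delay * (2 ** failures), self.max_delay)


planner = ResubscribePlanner()
first = planner.on_state_changed("sip:alice@example.com", "Terminated")
second = planner.on_state_changed("sip:alice@example.com", "Terminated")
print(first, second)
```

The growing delay is a design choice, not something UCMA requires: it avoids hammering an unreachable pool or federated partner with SUBSCRIBE requests during a long outage.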

Secondly, if real-time presence is a critical component of your application, it may be problematic that there will be no sign of missed notifications in the event of a network failure. You can get around this to some extent by forcing more frequent "refreshes" of the presence subscriptions. To do this, you can call the LocalEndpoint.PresenceServices.BeginRefreshRemotePresenceViews method. This will force all RemotePresenceView instances associated with your UserEndpoint or ApplicationEndpoint to send a new SUBSCRIBE request for each subscription target and get current presence information. If there has been a loss of connectivity, this will allow your application to discover it at the time of the forced refresh rather than waiting for an automatic refresh, possibly several hours later.
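The scheduling side of a forced refresh can be sketched as follows. This Python model is an assumption-laden illustration rather than UCMA code: RefreshScheduler and its tick method are hypothetical, the five-minute interval is arbitrary, and the refresh callable stands in for whatever wrapper your application puts around the refresh method named above.

```python
from typing import Callable, List


class RefreshScheduler:
    """Calls a refresh function whenever the configured interval has elapsed."""

    def __init__(self, interval_seconds: float, refresh: Callable[[], None],
                 start_time: float = 0.0) -> None:
        self.interval = interval_seconds
        self.refresh = refresh
        self.next_due = start_time + interval_seconds

    def tick(self, now: float) -> bool:
        """Run the refresh if it is due; return True when a refresh was sent."""
        if now < self.next_due:
            return False
        self.next_due = now + self.interval
        self.refresh()
        return True


refresh_times: List[float] = []
clock = {"now": 0.0}  # fake clock so the example runs instantly

scheduler = RefreshScheduler(
    interval_seconds=300.0,  # assumed interval: five minutes
    refresh=lambda: refresh_times.append(clock["now"]),
)

# Simulate one tick per minute for fifteen minutes.
for minute in range(1, 16):
    clock["now"] = minute * 60.0
    scheduler.tick(clock["now"])

print(refresh_times)
```

The interval is a trade-off. Every forced refresh sends a SUBSCRIBE for each subscription target, so a shorter interval shortens the time a failure can go unnoticed but increases signalling load on the servers.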

Taking these two steps will make the presence-dependent parts of your UCMA application handle network failures much more gracefully.