Magic Multiplexing

Inside Roam’s Patent-Pending Selective Forwarding Unit (SFU)
When we started Roam, I figured video conferencing was a utility you could simply turn on from a service provider like Twilio or AWS. That assumption turned out to be badly wrong. Yes, you can easily get something low-grade. But it falls far short of the standard people have come to expect from Zoom, the industry leader with the best technology. And the gap widens as meetings get larger: past a few dozen participants, quality deteriorates quickly, and scaling from 50 people to hundreds is an order of magnitude more complicated. The reason is that it’s inefficient to send, say, 200 video and audio stream pairs to a single computer and have it process them all at once. There’s more: a screen can normally fit only 9, maybe 12, videos. How do you decide which ones to show? And if someone off the main screen starts talking, how do you catch that fast enough to make sure their audio isn’t clipped?
No open source or commercially available software or service comes close to doing this at internet scale for a product as heavily used as Roam. The Roam !nvention team rose to the challenge and set out to solve these complex technical problems from scratch. What follows is the result of 3 years of hardcore R&D effort.

Background

First, some background on our system and the constraints our !nventors insisted we could not compromise on:
Roam must work seamlessly in a web browser. While the majority of our use comes from members who download our desktop and mobile apps, guests and external users join from browsers. This is non-negotiable for us.
We use WebRTC for media. This gets us great browser support, but makes track negotiation complex at scale. Some parts of negotiation grow linearly with meeting size; others grow quadratically.
Single-track AV pair sync must always be respected. This is a subtle but critical architectural decision: we always carry the same person’s audio and video in a single pair of tracks. If you shuffle a person’s audio and video across different tracks, desynchronization becomes a problem, subtle but noticeable to users. This decision makes the system much more complicated, but it makes audio and video synchronization much smoother for our members, especially when internet conditions are changing.
Name Tags must always stay in sync. It is never acceptable, even for a second, for a name tag to be incorrect.
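To make the negotiation-scaling constraint above concrete, here is a small illustrative sketch (not Roam’s actual signaling code) contrasting an SFU topology, where publish legs grow linearly with meeting size, against a full-mesh topology, where peer connections grow with the square:

```python
def sfu_streams(n: int) -> int:
    """With a central SFU, each client uploads its media once and the
    server fans it out: publish legs grow linearly with n participants."""
    return n


def mesh_streams(n: int) -> int:
    """Without a central server, every client must negotiate a peer
    connection with every other client: n * (n - 1) / 2 pairs."""
    return n * (n - 1) // 2


for n in (10, 50, 200):
    print(f"{n:>4} participants: SFU legs={sfu_streams(n):>4}, "
          f"mesh connections={mesh_streams(n):>6}")
```

At 200 participants a full mesh would need 19,900 peer connections, which is why a forwarding server is the only practical topology at this scale.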

Magic Multiplexing

At the foundation of our system is our proprietary Selective Forwarding Unit (SFU), powered by our patent-pending Magic Multiplexing. The SFU is a central server that receives AV media from every client. The magic is in deciding which tracks to send to which clients, and there is major complexity in making sure the right tracks are delivered seamlessly with minimal packet loss. Clients come in many shapes and sizes: some are mobile phones, some are computers with large screens, others are laptops, and each has different CPU and GPU capacity. Even the strongest computers can’t handle hundreds of AV streams at the same time.
Magic Multiplexing intelligently selects which tracks to forward to each client. For example, consider a meeting with 150 people and a laptop client that can show 9 people on screen. Magic Multiplexing must decide, based on activity, which 9 to send and show. What happens if someone off the screen starts talking? And what happens when a user paginates to the next page of the conference room and a new set of 9 people must be shown?
Magic Multiplexing solves the first problem by forwarding extra tracks. In our example, 9 videos appear on screen, but we send roughly 10 additional tracks, intelligently ordered. If someone who speaks frequently goes quiet, we keep forwarding their tracks in case they start speaking again; then we can simply show them, without the overhead of renegotiating a new track.
Magic Multiplexing maintains, for each client in the meeting, precisely what they are seeing and what they should be seeing. As activity changes, Magic Multiplexing makes real-time decisions on which tracks to send to a client to display, and which extra tracks to keep on standby for a fast switch.
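The selection idea above can be sketched as follows. This is an illustrative simplification, not Roam’s implementation; the slot counts and function names are assumptions:

```python
VISIBLE_SLOTS = 9    # videos the example laptop can display
STANDBY_TRACKS = 10  # extra tracks kept flowing for fast switches


def select_tracks(activity_rank: list[str], pinned: list[str]) -> dict:
    """activity_rank: participant ids ordered from most to least active.
    pinned: participants this client must display (e.g. the current page).
    Returns which AV track pairs to forward to this client."""
    # Pinned participants always occupy display slots first.
    display = list(pinned)[:VISIBLE_SLOTS]
    # Fill the remaining visible slots with the most active others.
    for pid in activity_rank:
        if len(display) >= VISIBLE_SLOTS:
            break
        if pid not in display:
            display.append(pid)
    # Keep the next-most-active people on standby: their tracks are
    # already flowing, so promoting one to the screen when they speak
    # requires no renegotiation.
    standby = [p for p in activity_rank if p not in display][:STANDBY_TRACKS]
    return {"display": display, "standby": standby}
```

Per client, the server would diff the result of this selection against what that client currently receives and adjust the forwarded tracks in real time.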
But what happens if someone becomes active who is not on the main screen or in the extra tracks? How can we make sure this person is heard?

Making Sure Everyone is Heard: Signal Processing

The SFU receives audio packets from every client. The key is deciding which tracks to forward to all the other clients in the meeting, and ensuring that what gets forwarded is actual human speech, not random noise. Noise-suppression models catch most non-speech sound. Our SFU then looks at the incoming signals and their volume. Our empirical testing determined that you can clip roughly the first 10 to 20 ms of a word with only minor audible impact. Beyond that, the start of the word sounds cut off, which is a poor experience for our users; by the time you’ve lost 3 to 4 packets (60 to 80 ms), it’s quite bad. The challenge is to decide quickly and correctly whether a new person is speaking, then deliver their packets fast (and in order) so everyone is heard.
In addition to this “starting” problem, people naturally pause when they speak, leaving stretches of total quiet between utterances. We must handle this as well, so we don’t cut someone off during a brief pause. Through trial and error, we arrived at an approach that uses an asymmetric exponential moving average of volume levels to rank the tracks. I won’t share the full weighted equation here (though it’s in the patent application), but here’s the base equation:
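The original equation did not survive into this copy of the post. An asymmetric exponential moving average has a standard form, so a plausible reconstruction of the base equation (the coefficient names here are mine, not the ones in the patent filing) is:

```latex
v_t =
\begin{cases}
\alpha_{\text{rise}}\, x_t + (1 - \alpha_{\text{rise}})\, v_{t-1} & \text{if } x_t > v_{t-1},\\[4pt]
\alpha_{\text{fall}}\, x_t + (1 - \alpha_{\text{fall}})\, v_{t-1} & \text{otherwise,}
\end{cases}
```

where \(x_t\) is the instantaneous volume, \(v_t\) is the smoothed score used to rank tracks, and \(\alpha_{\text{rise}} > \alpha_{\text{fall}}\), so the score reacts quickly when someone starts speaking but decays slowly across natural pauses.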

Delta between the Speed of Sound and the Speed of Light

It seems paradoxical to anticipate that someone is about to speak before they do, and thereby avoid clipping their audio. But we use a little trick. When Magic Multiplexing decides to swap a new person into a track pair, and we have already missed the “delivery opportunity” for their first 20 ms of audio, we can deliver it anyway, slightly delayed. In most cases this doesn’t hurt the experience: there is always some delay on the internet as packets travel to and from servers around the world, so a small temporary increase goes unnoticed. We can then make up for that delay by “compressing” silences and pauses that are sufficiently long, removing a corresponding amount of silence to get back to real time.
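The “compress silence to pay back the delay” step can be sketched like this. The frame size and minimum-pause threshold are illustrative assumptions, not Roam’s tuned values:

```python
FRAME_MS = 20        # one audio packet covers roughly 20 ms
MIN_PAUSE_MS = 300   # only compress pauses at least this long


def compress_silence(pause_ms: int, playback_debt_ms: int) -> tuple[int, int]:
    """Given a detected pause and the extra delay we still owe the
    listener, return (shortened_pause_ms, remaining_debt_ms)."""
    if pause_ms < MIN_PAUSE_MS or playback_debt_ms <= 0:
        # Short pauses are part of natural speech; leave them alone.
        return pause_ms, playback_debt_ms
    # Never shrink the pause below the minimum, and never reclaim
    # more time than we actually owe.
    reclaimable = min(playback_debt_ms, pause_ms - MIN_PAUSE_MS)
    return pause_ms - reclaimable, playback_debt_ms - reclaimable
```

So a 40 ms late start (two frames) disappears inside the next half-second pause, and playback is back at real time without the listener noticing.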

Active Speaker

Roam now also supports Active Speaker mode, which automatically pins the primary active speaker in the prominent focus window. This is related to audio multiplexing, but a bit different, because the stakes are higher. You don’t want to jump the gun and promote someone to active speaker because their dog just barked. And you don’t want to give users a headache by constantly switching as all kinds of little noises happen.

Noise Gate with Hysteresis

Through empirical testing, we found that simply showing the most recent person who talked as the active speaker was way too janky and unreliable. People talk, they pause, they talk again. Being the active speaker during a moment in a meeting isn’t just about being the person talking. The greatest speakers use silence and long pauses during their speeches to incredible dramatic effect! We would be remiss in shifting the active speaker view away from the Reverend Martin Luther King Jr during a great pause because someone in the audience made an encouraging shout! In a way, being the “active speaker” at the moment is an intuitive feeling that combines several factors.
We were stuck until we had a critical breakthrough: we combined a signal-processing technique known as a noise gate with hysteresis with a collection of other custom filters to produce a selection algorithm in the “goldilocks” zone (waiting long enough to switch to avoid blips, but switching fast enough to feel snappy). The noise gate looks at the volume and pattern of the speech waveform to decide whether someone is a candidate to become the active speaker. This is then combined with the smoothed volume ranking and other sources of information to choose which candidate should actually become the active speaker, and whether it is appropriate to replace the current one.
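A minimal noise gate with hysteresis looks like the sketch below. The thresholds and hold time are assumed values for illustration, not Roam’s tuning:

```python
OPEN_THRESHOLD = -40.0   # dBFS level that opens the gate
CLOSE_THRESHOLD = -50.0  # lower level required before the gate can close
HOLD_FRAMES = 25         # ~500 ms at 20 ms frames: ride out brief pauses


class NoiseGate:
    """Tracks whether one participant is a candidate active speaker."""

    def __init__(self) -> None:
        self.open = False
        self.quiet_frames = 0

    def update(self, level_db: float) -> bool:
        """Feed one frame's smoothed level; returns True while the
        participant remains a candidate for active speaker."""
        if level_db >= OPEN_THRESHOLD:
            self.open = True
            self.quiet_frames = 0
        elif self.open and level_db < CLOSE_THRESHOLD:
            # Hysteresis: closing requires sustained quiet below the
            # *lower* threshold, so a dramatic pause (or a level hovering
            # between the two thresholds) doesn't drop the speaker.
            self.quiet_frames += 1
            if self.quiet_frames > HOLD_FRAMES:
                self.open = False
        else:
            self.quiet_frames = 0
        return self.open
```

The two-threshold gap plus the hold timer is what keeps Dr. King pinned through his great pause: a level between the open and close thresholds, or a short stretch of silence, never closes the gate.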

Concluding Thoughts

Zoom set the technical standard for video conferencing quality. Roam sets the standard for instant collaboration, elite culture and AI productivity with our Virtual Office Platform. With today’s Magic Multiplexing and Active Speaker release, our technology has made a quantum leap forward in narrowing the video conferencing technology gap for meetings up to 300 people.
This breakthrough new architecture is capable of supporting 1,000+ participants in a video meeting with top-quality speed and smoothness. (Note: our Theater supports up to 3,000 simultaneous participants in a different type of audience setting.) It also dramatically improves our performance for meetings of nearly any size on our iOS and Android platforms. The Roam !nvention team looks forward to continued near-term improvements to deliver the highest-quality and most affordable video conferencing on the planet within your Virtual Office Platform.