by Kumba » Wed Apr 18, 2018 8:10 pm
Pretty much what Frequency just described. The first issue you run into with virtualization is with the RTP audio. This works on a 20 ms timing period, or P-Time. This means that the audio stream is chopped up into 20 ms chunks of audio and sent one chunk per packet to the far end. You are also receiving a packet every 20 ms from the far end with their audio stream. This is why you can get one-way audio: one side's stream is making it to the other side, but that side is not receiving a stream itself. This results in roughly 100 packets per second per remote channel (50 in each direction). Each agent is a remote channel and each customer call attempt is a remote channel.
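As a quick sanity check on that arithmetic, here is a minimal sketch. The function names are made up for illustration; only the 20 ms P-Time and the "one packet each way per period" model come from the description above.

```python
def packets_per_second_one_way(ptime_ms: float = 20.0) -> float:
    """Packets sent per second in ONE direction for a given P-Time."""
    return 1000.0 / ptime_ms


def packets_per_second_per_channel(ptime_ms: float = 20.0) -> float:
    """Packets per second for one remote channel, counting both directions."""
    return 2 * packets_per_second_one_way(ptime_ms)


if __name__ == "__main__":
    print(packets_per_second_one_way())      # 50.0
    print(packets_per_second_per_channel())  # 100.0
```

A shorter P-Time means more, smaller packets per second for the same audio, so the per-packet handling cost goes up even though the audio itself has not changed.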
So if you have 20 agents dialing 50 lines, that is 70 remote channels and a potential for roughly 7,000 packets per second. On outbound this is skewed somewhat and probably more like 60% of that due to pre-answer and ringing, but on inbound this is exactly the kind of traffic you'll be seeing on a single dialer. Now let's say you have 4 guests on that host, each doing 20 agents at 50 lines; that means you have the Host OS receiving and handling roughly 28,000 packets per second. Since all computer networks inherently have some measurable form of lag on them, you have to buffer this transfer with a jitter buffer. A good jitter buffer will be 80 to 100 ms or less. This means you have 80 to 100 ms of lag that you can compensate for from end to end on this call before your agent or your customer will start to have audio quality issues. Bad jitter generally will sound like either skips/cut-outs in audio or like the person is under water. Packet loss usually sounds like chirps or blips or stutters. This isn't always universal, but it's a pretty good rule of thumb when you hear audio quality issues.
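To make the jitter buffer idea concrete, here is a deliberately simplified sketch. It assumes a fixed-size buffer in which every packet is played out a fixed time after it was sent; real jitter buffers (including adaptive ones) are more involved. The function names and the sample delay values are hypothetical.

```python
def buffer_depth_packets(buffer_ms: float, ptime_ms: float = 20.0) -> int:
    """How many whole P-Time packets fit in the jitter buffer."""
    return int(buffer_ms // ptime_ms)


def late_packets(delays_ms, buffer_ms: float = 100.0):
    """Return the indexes of packets that arrive too late to be played.

    Simplified model (an assumption): a packet is usable only if its
    end-to-end delay is no larger than the jitter buffer.
    """
    return [i for i, delay in enumerate(delays_ms) if delay > buffer_ms]


if __name__ == "__main__":
    delays = [30, 45, 120, 60, 95]          # made-up per-packet delays in ms
    print(buffer_depth_packets(100))        # 5
    print(late_packets(delays, 100))        # [2]
    print(late_packets(delays, 80))         # [2, 4]
```

Packets that miss the window are effectively lost to the listener, which is why a delay spike bigger than the buffer turns into audible skips rather than just extra lag.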
Now let's go back to this VM host. ALL virtualization by its very nature operates on a sort of time-slice CPU sharing mechanism. So when the CPU is underutilized everything can run at near realtime, almost like it was on bare-metal hardware. This works great for development where you are more focused on features and general testing, since you have no real load to speak of. I myself develop ViciBox and other ViciDial things using VirtualBox because it's fast and easy.
The problem comes into play when you start introducing real load to the guests in production. What happens is that once the host reaches saturation (Host Overhead + VM Overhead) you will start seeing your guests put into small suspended states so that the other guests can run. Now we have an 80 to 100 ms jitter buffer that can be used to help buffer this, but that assumes that the guest VM is suspended for less than 80 to 100 ms and that, when it does start running again, the CPU can play catch-up fast enough before it gets suspended again. So for every 20 ms a guest is suspended, that's 140 packets (20 agents plus 50 customers, one packet in each direction) the CPU hasn't processed that it now has to process in addition to the 140 packets queued up behind them. This also reduces your allowable jitter to 60 ms (the 100 ms jitter buffer, minus the 20 ms of suspended packets, minus the 20 ms for the normal delivery of the next packets). If the CPU and I/O subsystem in the guest/host don't have enough power to deliver those 280 packets, reassemble them into a coherent stream, do codec translation, and deliver them to the endpoints in 60 ms, you get audio issues.
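The numbers in that example can be reproduced with a short sketch. It follows the same accounting as above (one channel per agent and per customer, one packet each way per 20 ms period); the function name and return layout are made up for illustration.

```python
def suspension_backlog(agents: int, lines: int, suspended_ms: float,
                       ptime_ms: float = 20.0, buffer_ms: float = 100.0):
    """Estimate the catch-up work after a guest has been suspended.

    Returns (packets_due_on_resume, remaining_jitter_budget_ms).
    """
    channels = agents + lines
    packets_per_period = channels * 2            # one in, one out per channel
    missed_periods = suspended_ms / ptime_ms
    backlog = packets_per_period * missed_periods
    # On resume the guest owes the backlog plus the current period's packets.
    due_on_resume = backlog + packets_per_period
    # Budget left: buffer minus the time suspended minus the current period.
    remaining_ms = buffer_ms - suspended_ms - ptime_ms
    return due_on_resume, remaining_ms


if __name__ == "__main__":
    print(suspension_backlog(20, 50, 20))   # (280.0, 60.0)
    print(suspension_backlog(20, 50, 40))   # (420.0, 40.0)
```

Note how a longer suspension both grows the pile of packets to process and shrinks the time left to process them, so the two effects work against you at once.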
This also results in a sort of load oscillation, because now you go from processing a more or less steady stream of 140 packets every 20 ms to processing twice that number of packets, which causes the guest to use MORE CPU, which causes the host to try and allocate more CPU, which results in it potentially putting other guests in a suspended state to provide that CPU power. Now add to this scenario Asterisk, which is known for being somewhat troublesome on its own at high loads. When Asterisk gets loaded up it tends to lock up and start causing CPU spikes on its own. CPU spikes and high load generally do not work well on virtualization, whereas bare-metal hardware will just keep running instead of potentially being stopped, and will eventually catch up transparently by virtue of the jitter buffer.
After a long enough time the spikes can get bad enough that you progress from audio issues to packet loss and strange program behaviour. Usually once the realtime clock starts to skew you see issues with Asterisk missing or dropping calls, actions not happening like hanging up a call, the web interface delaying or not seeing a click action from the user's browser, etc. Basically you get all the tell-tale signs of an overloaded system, but it happens at a significantly lower load. You lose anywhere from 5 to 15% of your capacity in just overhead between the Host and Guests, depending upon the hardware architecture and VM environment. Once you introduce high steady loads to the VM and then have bursts of activity, you compound the issue, resulting in the host being over-utilized. This is why, at load, in all of our testing and the testing other clients have done, your best case scenario is 50% of normal bare-metal capacity due to the extra overhead you need to ensure you don't have any issues. The other issue with Asterisk is that it does not scale much past a quad-core CPU whether it's virtualized or bare metal.
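For rough planning, those percentages can be turned into a back-of-the-envelope sketch. The 40-agent bare-metal figure below is purely hypothetical, as are the function names; only the 5 to 15% overhead range and the 50% best-case figure come from the paragraph above.

```python
def after_overhead(bare_metal_agents: int, overhead_pct: int) -> int:
    """Agents left after subtracting host/guest overhead (whole agents)."""
    return bare_metal_agents * (100 - overhead_pct) // 100


def planned_virtual_capacity(bare_metal_agents: int, derate_pct: int = 50) -> int:
    """Agents to plan for on a VM, using the best-case derating under load."""
    return bare_metal_agents * derate_pct // 100


if __name__ == "__main__":
    bare_metal = 40  # hypothetical agents a bare-metal dialer handles
    print(after_overhead(bare_metal, 15))        # 34
    print(after_overhead(bare_metal, 5))         # 38
    print(planned_virtual_capacity(bare_metal))  # 20
```

The gap between the overhead-only numbers and the 50% figure is the headroom needed to absorb load bursts without the host becoming over-utilized.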
So, long story short, VMs generally do not work well for high load environments with load bursts. VMs do work great for HA setups and where hardware can be vertically scaled to meet your load or clustered out so more servers are utilized to meet your requirements.
You can safely virtualize an archive server, but that only needs to provide FTP and HTTP, so literally anything can be an archive server. You could potentially virtualize the web server or database, but your performance will be noticeably better on bare-metal hardware if you run any significant loads on these. If your web and database loads are more consistent then you might have decent results. It is worth noting that doing something as simple as loading leads on a virtualized web or database server can cause a significant enough load spike that the guest wobbles a little as the host tries to allocate resources to it.