Do not rely on verbal or written assurances "from someone else" to determine whether the connections are gigabit. Check each server directly:
Code:
ethtool eth0
...
Speed: 1000Mb/s
I do know that they are all on the same subnet - so their traffic should not have to pass through any router/firewall to reach each other.
Subnets can be virtual/logical rather than physical. We have VPN servers that allow agents to be on the "local/same" subnet as the servers for many clients. But that sort of connection can result in far too many delays and dropped packets for server-to-server connections. If you cannot visibly confirm the servers are on a hard-wired local physical network, try passing massive files between them and see what your throughput is. If you perform this task when not otherwise using the servers, your throughput should be spectacular, right? If it's not, then that bottleneck could cap your system capacity.

And note that the pathways change with the tide. Servers pass calls to other servers, reports are run ... it's a storm that's constantly moving, so confirming 1G to, from, and between all servers is not a bad thing.

Then comes the challenge of colocation: is any of this shared physically? Some colos will combine all networks into one system and then split it out later with routers. But that means your 1G connection at night with nobody using it ... has sharing issues during the day. So check it again during a break and see if you're sharing anything (i.e., transfer massive files during a quick down-moment in the middle of the day, if possible, and see if those results are worse than in the middle of the night). If it's dedicated inter-server switching, you shouldn't see any difference.
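To put actual numbers on that file-transfer test, here's a minimal sketch. It assumes iperf3 is installed on both boxes, and "serverB" is a placeholder hostname; a plain dd-over-ssh push works too if iperf3 isn't available:
Code:
# On the receiving server, start a listener:
iperf3 -s
# On the sending server ("serverB" is a placeholder), run a 30-second test:
iperf3 -c serverB -t 30
# No iperf3? Push 2 GB of zeroes over ssh; dd prints the MB/s rate when it finishes:
dd if=/dev/zero bs=1M count=2000 | ssh serverB 'cat > /dev/null'
Run it once overnight and again during a midday lull; on dedicated inter-server switching the two numbers should be nearly identical.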
In case that didn't quite get through: just because each of the servers is directly connected to a gigabit switch does not prove that they are gigabit all the way to the other server(s). Anything 10/100 in the path becomes a bottleneck. Anyone sharing that hardware (even on a different logical subnet) reduces capacity WHEN THEY ARE USING IT. We have had incidents where another facility firing up something major for a few minutes daily would generate a problem for a client ONLY during those few minutes. And let's not forget the airplane interference from the MetroNet (microwave internet a client was using), lol.
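One quick sanity check along those lines: confirm the negotiated link speed on every server in the cluster, not just the one you're logged into. A hedged sketch, assuming root ssh keys are in place and that the hostnames and the eth0 interface name match your setup:
Code:
# Hostnames and interface name are placeholders; adjust for your cluster:
for srv in server1 server2 server3; do
    echo "== $srv =="
    ssh "$srv" "ethtool eth0 | grep -E 'Speed|Duplex'"
done
Note this only proves each server's own NIC negotiated 1000Mb/s Full Duplex; the switch ports and anything between them still need the file-transfer test above.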
ALL of the connected servers in the cluster show up as OK (1 ms) to (3 ms) maximum in the status column - so - I am pretty sure they all have a decent connection to each other.
Good ... was this during a situation when the problem was happening? Or during a "slow time" when you could check such things? If it was during a slow time, then it has no meaning. You need to check settings (all of them) during the problem. Things Change.
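Since you can't schedule the problem, the practical move is to log continuously so the data is already there when it hits. A rough sketch (hostnames and log path are placeholders):
Code:
# Append a timestamped latency line for each server every 10 seconds:
while true; do
    for srv in server1 server2 server3; do
        echo "$(date '+%F %T') $srv $(ping -c 1 -W 1 "$srv" | grep 'time=')"
    done
    sleep 10
done >> /tmp/cluster_latency.log
When the next incident happens, line its time up against that log and see whether the 1 ms to 3 ms numbers actually held.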
we get a handful of agents that show up with QUEUE as their status - some disappear after a bit
Check your MySQL errors (logs and status variables) and see if those grow when these abnormalities occur. We've seen situations like this where the (fairly few, but still notable) error counts in the logs grow each time an abnormality occurs, generally indicating that the problem was in fact logged. Then comes the "OK, then what caused this?" battle, of course. Churning RDP ports faster, opening more ports, removing process limitations ... all sorts of things become necessary.
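For the MySQL side, a hedged sketch of what to compare before and during an incident (it assumes your login credentials are handled, e.g. via ~/.my.cnf, and the error log path varies by distro):
Code:
# Snapshot the error-related status counters; run again during an incident and diff:
mysql -e "SHOW GLOBAL STATUS LIKE 'Aborted%';"
# (Connection_errors% counters require MySQL 5.6+ / MariaDB 10+)
mysql -e "SHOW GLOBAL STATUS LIKE 'Connection_errors%';"
# Watch the error log live while the abnormality is happening
# (path is an assumption; confirm with: mysql -e "SHOW VARIABLES LIKE 'log_error';"):
tail -f /var/log/mysqld.log
If Aborted_connects or the logged error counts jump only when agents flip to QUEUE, you've at least confirmed the problem is being recorded, and the "what caused this?" hunt has a starting point.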