Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Tue Aug 07, 2018 10:25 am

OK, this is now happening to me on multiple systems (all hosted in datacenters), but I'll give the details of one of them here.

System Configuration:
Multi-Server System, 1 DB/Web Server, 7 dialing servers
ViciBox v.7.0.4-170113
VERSION: 2.14-667a
BUILD: 180331-1715
Asterisk 11.25.1-Vici on all servers

Dual quad-core Xeons in each server at minimum (slightly varying speeds). The DB server has 128GB of RAM; each dialing server has 16GB. SSD drives in all servers.

This particular system has 50 agents spread across the servers, dialing at approximately 7 to 1.

The problem is essentially that occasionally, at various dialing levels, a server will go "Red" in the summary screen. When that happens, all the agents with their phones on that server get the dreaded Timesync Error and are unable to log back in. The Asterisk command line shows hundreds of "channel.c:1310 __ast_queue_frame: Exceptionally long voice queue length queuing" warnings.

After a reboot, the "Red" is cleared and phones can re-register, agents can log in, and things will continue as normal.

Here is what I have noticed that does/does not affect the problem and what I have done to date:
  • Seems to happen most often when a remote client has a particularly bad Internet connection, but not always: the connection can be great and it will still happen
  • All servers are in time synchronization with each other
  • All servers whitelist only the remote clients' IPs for HTTP, SIP, and RTP
  • System uses SIP trunks, and their IPs are whitelisted
  • All other access to the system on all ports is closed
  • Trunks are balanced across all dialing servers
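
One way to test the bad-connection hunch is to tally the warnings by channel and see whether a handful of agent phones account for most of them. A minimal Python sketch, assuming the full warning line ends in "queuing to <channel>" (the fragment quoted above is cut off before that point); `tally_long_queue_warnings` is an illustrative name, not part of Vicidial or Asterisk:

```python
import re
from collections import Counter

# Assumption: the warning continues as "... queue length queuing to <channel>".
WARNING_RE = re.compile(
    r"Exceptionally long (?:voice )?queue length queuing to (\S+)"
)


def tally_long_queue_warnings(lines):
    """Count 'Exceptionally long ... queue length' warnings per channel name."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            # Strip the per-call suffix (e.g. "SIP/1001-0000abcd" -> "SIP/1001")
            # so repeated calls on the same phone are grouped together.
            peer = match.group(1).rsplit("-", 1)[0]
            counts[peer] += 1
    return counts
```

Feeding it the lines of the Asterisk log from an affected server and printing `counts.most_common(10)` would show whether the warnings cluster on a few phones or are spread evenly.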

Yes I have read the manual and searched the forums, and while I see others mentioning this issue, no real solution was presented (apart from an NTPdate sync in the crontab - which doesn't work) and that was 2 years ago.

For my clients that I recommend Vicidial to as a dialing server, this is pretty much the ONLY issue they ever run into, but it is quickly becoming the issue that causes them to leave Vicidial.

I would really appreciate ANY help or direction.

Thank you.
dgroth02
 
Posts: 20
Joined: Wed May 06, 2015 2:24 pm

Re: Exceptionally Long Queue Length & Timesync error

Postby uncapped_shady » Tue Aug 07, 2018 12:47 pm

Hi what timing devices are you using in your servers? Vicidial recommends the Amfeltec PCI Express cards.
uncapped_shady
 
Posts: 27
Joined: Sat Jan 20, 2018 5:51 pm
Location: South Africa Gauteng

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Tue Aug 07, 2018 2:17 pm

Just the standard dahdi_dummy internal timing. I've been told by a friend that this may have to do with Asterisk 11.25.1 and that I should upgrade to 11.25.3 - Anyone else have this issue with this resolution?

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Thu Aug 09, 2018 6:10 pm

Have upgraded Asterisk and am still having this issue. I still suspect Internet-related timeouts. Any suggestions on making Vicidial close connections faster or making it less sensitive?

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Thu Aug 09, 2018 6:26 pm

Time sync error is tossed during two basic categories of fault:

1) Time is out of sync between servers. Solution: Sync all the servers to ONE of the servers in the cluster (with iburst) so they are always in sync. Use NTP, don't "set the time" periodically. Note that this is not something that would affect individual agents; it would take down an entire server, and the time would have to be off by about 6 seconds to cause the issue. 5% of the time this is the problem.

2) The time field of the agent's session is not updated at all (instead of being 'out of sync', it's just 'not updating', but it tosses the same error!). This can be caused by a bad connection between the agent's web screen and the agent's web server, although it can also happen if processes fail on the agent's dialer (the server to which the agent has registered his phone). There are screens (screen -list) running which must update various fields continually. If one of those terminates OR just stops working even though it's still listed, the results can be deadly to the cluster. The one that ordinarily causes "red server" in the "Reports" page is the update script. Often just killing that script (and allowing it to regenerate on its own, at the 1 minute mark from the keepalive script) will remove the redness and bring the server back online.
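
To catch a screen that has gone missing before agents notice, the `screen -list` output can be compared against the set you expect. A rough Python sketch: the screen names in `EXPECTED` are an assumption based on a typical dialer and should be replaced with what `screen -list` shows on one of your healthy servers. It only detects screens that have terminated, not ones that are still listed but stuck.

```python
import re
import subprocess

# Assumption: replace with the names `screen -list` shows on a healthy dialer.
EXPECTED = {"ASTupdate", "ASTsend", "ASTlisten", "ASTVDauto"}

# `screen -list` prints entries such as "\t12345.ASTupdate\t(Detached)".
SCREEN_RE = re.compile(r"^\s*\d+\.(\S+)\s+\(", re.MULTILINE)


def missing_screens(screen_list_output, expected=EXPECTED):
    """Return the expected screen names that are absent from the listing."""
    running = set(SCREEN_RE.findall(screen_list_output))
    return sorted(set(expected) - running)


def check_local_screens():
    """Run `screen -list` on this machine and report missing screens."""
    # screen can exit non-zero (e.g. when no sessions exist), so no check=True.
    result = subprocess.run(["screen", "-list"], capture_output=True, text=True)
    return missing_screens(result.stdout)
```

Running `check_local_screens()` from cron on each dialer and logging any non-empty result gives a timestamp to compare against the moment a server goes red.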

Sounds like you have more than one of the 2) problems going if you have bad internet for some agent and red servers.
Vicidial Installation and Repair, plus Hosting and Colocation
SugarCRM integration - Customization and Add-ons - We Bring It All Together.
http://www.PoundTeam.com # 352-269-0000 # +44 (203) 769-2294 # +506 4001-8914
williamconley
 
Posts: 17872
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Fri Aug 10, 2018 9:32 am

I understand timesync, but this still doesn't explain LongQueue length errors on Asterisk console. Would the update script cause this to happen?

If the time field is sensitive to updates - is there a way to make it less sensitive?

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Fri Aug 10, 2018 2:35 pm

It's not "sensitive", it's a purposebuilt "symptom" indicating "you have a problem, you must fix this". Connectivity or time, one of them is broken. If you make it "not sensitive", then the system will stop working and you'll never know. Data will be incorrect. Calls won't transfer. All sorts of bad things will happen. But you will receive NO warnings.

Long data queues have their own cause, unrelated to Vicidial code. Likely also due to connectivity issues. Another symptom. Treat the cause.

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Tue Oct 30, 2018 9:41 am

Still having major issues with this, and the timesync was a red herring. Of course it's throwing a timesync error when Asterisk is having issues. Makes perfect sense. Still doesn't explain why Asterisk is going nuts.

Basically the servers in the cluster will operate fine for hours to days. Then one (specifically with agents on it) will start this issue.

Anyone have any ideas?

Re: Exceptionally Long Queue Length & Timesync error

Postby thephaseusa » Tue Oct 30, 2018 10:34 am

I’m glad you posted again, Mr. Roth. I wanted to tell you that using the kernel parameters Kumba gave us:

nopti nospectre_v2 nospec

It reduced the load on my db server by half.

As for this current problem, I remember you saying before that you had Vicidial 7.0.4 on your cluster and everything was working perfectly. You upgraded all boxes to 8.1.2, you weren’t happy, so you went back to 7.0.4. Were you able to restore all your boxes to 7.0.4 from backups?

John
thephaseusa
 
Posts: 307
Joined: Tue May 16, 2017 2:23 pm

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Tue Oct 30, 2018 12:11 pm

Nope - that was a COMPLETELY different issue - this is the same issue on v7.0.4 - and I have it with several different clients running 7.0.4

BUT - I will try that kernel parameter and see how it looks.

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Wed Oct 31, 2018 11:04 pm

1) Do you reboot these servers nightly?

2) Does this happen on all servers or only certain servers?

3) Do the servers upon which this happens have inbound calls on them?

4) Time Sync isn't a red herring: It's a symptom, as stated earlier. A warning sign that "now is a good time to look at this system to see what's wrong". Whatever is causing your time sync errors may be causing all your other problems, too. Or at least some of them.

So check logs when you get time sync. Check for dropped packets. Check load. Use the time sync as a trigger and see if you can clear the time sync. If that fixes your problem you can do a little dance. If fixing the time sync doesn't fix your other problem(s), at least you don't get the time sync problem any more.
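
A small Python sketch of the "check for dropped packets, check load" step, meant to be run on the affected dialer at the moment the time sync error appears. It reads the standard Linux `/proc/net/dev` counters and the load averages; the function names are illustrative only.

```python
import os


def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: {error and drop counters}}."""
    stats = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        # Receive columns come first (bytes packets errs drop ...);
        # the eight transmit columns start at index 8.
        stats[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def snapshot():
    """Return the 1/5/15-minute load averages and per-interface counters."""
    with open("/proc/net/dev") as handle:
        interfaces = parse_net_dev(handle.read())
    return {"loadavg": os.getloadavg(), "interfaces": interfaces}
```

The counters are cumulative since boot, so the useful number is the difference between a snapshot taken during normal operation and one taken when the server goes red.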

And to be clear: Time sync isn't generated by asterisk. It's perl script based unless time is actually out of sync (which is worth checking: it happens! and it can cause all sorts of issues!)

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Thu Nov 01, 2018 3:17 pm

1) Tried both rebooting and not rebooting, it didn't make a difference
2) Mostly those with agents on them, the other "dialing only" servers (they aren't using cross server extensions) don't seem to have issues, but to be sure, they do occasionally have this problem, just much less often.
3) No, except for one server.
4) I'm sorry about that - what I meant to say is that I understand Time Sync is an issue with communication and the system, and Asterisk isn't involved in generating that error per se. All the servers are synced to the Web/DB server, which is synced to the ntp.org servers - that's what I meant by red herring. It's totally a warning sign that something isn't right, but so is the big red bar on the "Reports" page.

HOWEVER - I'm pleased to report that as of two days ago, I made the kernel boot parameter changes suggested (nopti nospectre_v2 nospec) on both the dialing/agent servers and the DB server - and as of two days of operation, not a single issue. Noticed also that the server load has drastically reduced.

Will continue to monitor and keep everyone up to date.

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Thu Nov 01, 2018 4:21 pm

Excellent Postback!!

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Mon Nov 19, 2018 11:44 am

update:

The saga continues.

While the servers are a BIT more stable, they still, approximately once every 3 days, have the "Exceptionally Long Queue Length" error and need to be rebooted.

Before anyone asks, yes I have tried turning on auto reboot at night - no difference.

I'm tearing my hair out here. I can't believe that an acceptable level of performance is "oh well reboot it when it happens and it will be fine".

I have noticed ONE trend - that 95% of the time - the servers I have to reboot are the servers with agent phones on them. Remember I have agents spread across servers, and only 4 of the 9 servers have agents on them.

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Mon Nov 19, 2018 8:25 pm

williamconley wrote:3) Do the servers upon which this happens have inbound calls on them?

4) Time Sync isn't a red herring: It's a symptom, as stated earlier. A warning sign that "now is a good time to look at this system to see what's wrong". Whatever is causing your time sync errors may be causing all your other problems, too. Or at least some of them.

So check logs when you get time sync. Check for dropped packets. Check load. Use the time sync as a trigger and see if you can clear the time sync. If that fixes your problem you can do a little dance. If fixing the time sync doesn't fix your other problem(s), at least you don't get the time sync problem any more.

And to be clear: Time sync isn't generated by asterisk. It's perl script based unless time is actually out of sync (which is worth checking: it happens! and it can cause all sorts of issues!)

So ... inbound? Has time sync been resolved? Do you have anything else in the logs from the moment of an occurrence? For instance do they turn "red" on the Reports page? vicidial/admin.php?ADD=999999

Does it happen on ALL servers at the same time? On all servers but one at a time? On some servers? Are you logging these occurrences to get a handle on it?

Re: Exceptionally Long Queue Length & Timesync error

Postby Op3r » Tue Nov 20, 2018 10:08 am

Timesync means your servers are not in sync with NTP.

Time sync gets fixed if you have ntpd running.

Check the load of the servers. Everything goes to shit when it's overloaded.

Furthermore, long queue length and timesync errors may also be connected with your DB. Check the health of your DB. Separate the web server from the DB and gain a little more time watching YouTube videos during your shift.
Get paid for US outbound Toll Free calls. PM me.
Op3r
 
Posts: 1373
Joined: Wed Jun 07, 2006 7:53 pm
Location: Manila

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Tue Nov 20, 2018 10:39 am

Ok - maybe I wasn't completely clear.

With respect to the inbound calls, no they do not (there is ONE server on the cluster that handles inbound calls, but that's it).

NTPD IS RUNNING AND IN SYNC, even when this is happening.

The "load" of any of the servers never goes over 40%.

The Database server load is consistently at 10% or less.

The servers "go red" in the reports page when this is happening most of the time, but often I will hear of problems from the agents (can't log in, no "you are the only one in this conference") and dialing issues long before that happens.

As far as logging goes, yes I have logging enabled across all the servers. However, with a thousand calls simultaneously and a hundred agents, where would you suggest I begin looking for this issue? I've gone through the Asterisk log - the Exceptionally long queue length errors just appear - no error precedes them.
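
For what it's worth, this is roughly how I have been pulling the stretch of log just before the first warning on each server. It is only a sketch: the search string is taken from the warning text, and the window size is an arbitrary choice.

```python
from collections import deque


def context_before_first_warning(lines, needle="Exceptionally long", before=50):
    """Return up to `before` lines preceding the first line that contains
    `needle`, followed by that line itself. Returns None if it never appears.
    """
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    for line in lines:
        if needle in line:
            return list(recent) + [line]
        recent.append(line)
    return None
```

Passing an open log file as `lines` keeps memory use flat even on a very large log, since only the last `before` lines are held at any time.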

Re: Exceptionally Long Queue Length & Timesync error

Postby Op3r » Tue Nov 20, 2018 1:21 pm

dgroth02 wrote:Ok - maybe I wasn't completely clear.

With respect to the inbound calls, no they do not (there is ONE server on the cluster that handles inbound calls, but that's it)

NTPD IS RUNNING AND IN SYNC, even when this is happening.

-- Don't think so. Time sync errors come when ntpd is not running on all the servers. Double check it. The resolution of that issue lies there.

The "load" of any of the servers never goes over 40%

The Database server load is consistently at 10% or less.

-- Check for locks and waiting.

The servers "go red" in the reports page when this is happening most of the time, but often I will hear of problems from the agents (can't log in, no "you are the only one in this conference") and dialing issues long before that happens.

-- Dmesg. Something's wrong with your server.

As far as logging goes, yes I have logging enabled across all the servers. However, with a thousand calls simultaneously and a hundred agents, where would you suggest I begin looking for this issue? I've gone through the Asterisk log - the Exceptionally long queue length errors just appear - no error precedes them.

-- Just told you to look at the locks on the DB.

Separate the web server from the DB. Remove the crontab too.

Oh, btw, check your crontabs too. People are so used to the automated scripts now, which let them do a "cluster" install without checking whether it is installed properly or not.

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Tue Nov 20, 2018 3:25 pm

Op3r wrote:timesync means your servers are not in sync with ntp.

time sync gets fixed if you have ntpd running.


You havin' a bad day? Neither of those statements is technically true, and I'm certain you are aware of it.

Time Sync errors can also be caused by anything that blocks the agent session's "every second" packet from updating a specific value in the DB. This can be caused by overloaded workstations, bad networking, even a crashed "screen" on the dialer.

Time sync can still be OFF even with ntp running if ntp is not configured properly OR (in some wacky circumstances) has decided that it doesn't want to sync properly during startup. We've had several machines (some still in use) that must have ntp restarted about a minute after boot to "reset" time sync so it will actually sync instead of "syncing with itself" which doesn't keep it in sync with the DB server for the cluster. Odd, but this is linux. Strange shit happens and a blanket "NTPD is running so time sync must be fixed" is not a good approach.

Just like politics: Trust but verify.

Code:
ntpq -p
This should show sync symbols in the far left column indicating sync with the master (or internet masters) and an offset in single digits. Anything else could be out of sync.
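
The same check can be scripted so it runs on every server in the cluster. A minimal Python sketch, assuming the usual ten-column `ntpq -p` layout and treating "single digits" as an absolute offset under 10 ms; `check_ntpq` is just an illustrative name:

```python
def check_ntpq(output, max_offset_ms=10.0):
    """Inspect `ntpq -p` output.

    Returns (synced, peer, offset_ms). synced is True when a peer is marked
    with '*' (the selected system peer) and its absolute offset is below
    max_offset_ms. Returns (False, None, None) if no peer is selected.
    """
    for line in output.splitlines():
        if not line.startswith("*"):
            continue
        fields = line[1:].split()
        # Columns: remote refid st t when poll reach delay offset jitter
        offset_ms = float(fields[8])
        return abs(offset_ms) < max_offset_ms, fields[0], offset_ms
    return False, None, None
```

On a dialer the returned peer should be the cluster's time master; a machine that reports no selected peer, or only itself, is the "syncing with itself" case described above.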

Re: Exceptionally Long Queue Length & Timesync error

Postby Op3r » Tue Nov 20, 2018 4:56 pm

Hahaha, yep, I know that ntpd is not always the cause. However, the most common reason for a timesync error is actually ntpd not running, or not syncing properly even when you think it's running. Restarting the ntpd service does the trick. If it doesn't, then something more drastic has to be done, because production time comes before dicking around finding out why shit is broken.

Considering he is in a hosted dedicated server environment, I would really advise him to check his NTP connectivity. Most likely it's not connecting to pool.ntp.org or whatnot.

If it's not ntpd causing the problem, then it's time to freaking reinstall VICIDIAL, and on another set of boxes. That cluster has issues that are not worth the trouble of troubleshooting, and a much faster resolution comes from reinstalling everything (ask the ops: a single day of downtime or hindered output costs more than the freaking cost of a reinstall).

Re: Exceptionally Long Queue Length & Timesync error

Postby frequency » Tue Nov 20, 2018 10:45 pm

Normally it takes place when one of the applications takes on load and crashes. I have seen multiple instances with 11.25.1 where Asterisk takes the load, the load average goes haywire (around 100), the time sync error appears, and the server needs to be rebooted. 11.25.3 is a bit more stable from what I have seen.

also, are you taking inbound calls? webrtc?
frequency
 
Posts: 98
Joined: Mon Jun 13, 2016 11:18 am

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Wed Nov 21, 2018 11:25 am

I have had this problem on both Asterisk 11.25.1 and 11.25.3

As I stated before, there is only 1 server in this cluster taking inbound calls - and it just so happens it is not one of the servers having these problems. It seems to be the servers with agents on them that are having these issues.

Additionally, I ran ntpq -p during production and received the following:

Code:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+pyrrha.fi.muni. 147.231.2.6      2 u  849 1024  377  167.201    1.929   1.261
-ns3.switch.ca   206.108.0.134    3 u  695 1024  377   73.601    9.841   1.123
*t2.time.gq1.yah 10.211.121.37    2 u  908 1024  377   33.333   -0.320   1.277
+t1.time.bf1.yah 98.139.133.62    2 u  940 1024  377   79.310   -4.701   1.287


However:

I did manage to capture a dmesg and am seeing a ton of

Code:
[710303.849121] net_ratelimit: 96 callbacks suppressed
[710758.909989] net_ratelimit: 143 callbacks suppressed
[710989.582516] net_ratelimit: 241 callbacks suppressed
[710994.622822] net_ratelimit: 242 callbacks suppressed
[711234.727354] net_ratelimit: 51 callbacks suppressed
[711524.170307] net_ratelimit: 58 callbacks suppressed
[714007.497591] net_ratelimit: 167 callbacks suppressed
[770058.038706] net_ratelimit: 64 callbacks suppressed
[770617.699543] net_ratelimit: 115 callbacks suppressed
[771170.784027] net_ratelimit: 121 callbacks suppressed
[771678.128427] net_ratelimit: 81 callbacks suppressed
[772783.721849] net_ratelimit: 57 callbacks suppressed


I know this has to do with the number of logged network errors, but is it possible that a network issue is causing this?
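
To line these up against the outages, I put together a quick Python sketch that pulls the timestamp and suppressed count out of each line. The bracketed dmesg timestamps are seconds since boot, so they still need converting to wall-clock time before comparing with the Asterisk log; `summarize_ratelimit` is just a name I made up.

```python
import re

RATELIMIT_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+net_ratelimit: (\d+) callbacks suppressed"
)


def summarize_ratelimit(dmesg_lines):
    """Return (event_count, total_suppressed, events).

    events is a list of (uptime_seconds, suppressed_count) tuples, one per
    net_ratelimit line, in the order they appear.
    """
    events = []
    for line in dmesg_lines:
        match = RATELIMIT_RE.match(line)
        if match:
            events.append((float(match.group(1)), int(match.group(2))))
    total = sum(count for _, count in events)
    return len(events), total, events
```

These lines only say that the kernel held back that many network-related messages, so the messages printed just before each one in the full dmesg output are the ones worth reading.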

These are all Dell servers - identical.

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Thu Nov 29, 2018 2:56 pm

Anyone????

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Thu Nov 29, 2018 3:16 pm

dgroth02 wrote:but is it possible that a network issue is causing this?


Yes. As stated earlier:
Time Sync errors can also be caused by anything that blocks the agent session's "every second" packet from updating a specific value in the DB. This can be caused by overloaded workstations, bad networking, even a crashed "screen" on the dialer.

The Vicidial Agent Screen uses AJAX to send/receive a packet every second. That packet doesn't just collect information, it also causes a value to be updated in the DB. Without that value update: "Time Sync Error" is tossed. Thus networking or any other crash during that process from agent web page to db update can cause time sync error.

This could be DB socket shortfall, code errors in any of the scripts, literally anything in that path.

This error exists to make it a requirement that communication from the agent to the DB is impeccable, as anything short of perfect WILL lead to incorrect data storage and missing data for the agent. If your networking packets are dropping, you'll have to fix it to get rid of that error. Even "slow" is acceptable: just NOT dropped packets.
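
To illustrate the idea (this is NOT Vicidial's actual code or schema, just a hypothetical sketch of a heartbeat staleness check), assume each agent session stores the time of its last successful update and anything older than a few seconds counts as a failure. The 6-second default is an assumed threshold for illustration.

```python
from datetime import datetime, timedelta


def stale_agents(last_updates, now, max_age_seconds=6):
    """Return agents whose last heartbeat is older than max_age_seconds.

    last_updates maps an agent name to the datetime of its most recent
    update (hypothetical structure; in practice this would come from the DB).
    """
    limit = timedelta(seconds=max_age_seconds)
    return sorted(
        agent for agent, seen in last_updates.items() if now - seen > limit
    )
```

The point of the sketch is that the check cannot tell a wrong clock from a missing update: either one makes `now - seen` too large, which is why both faults produce the same error.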

Re: Exceptionally Long Queue Length & Timesync error

Postby dgroth02 » Thu Nov 29, 2018 4:46 pm

I get that about the Time Sync error - but that doesn't explain the long string of "Exceptionally Long Queue Length" errors that get thrown, keep going, and cause the server to stop working properly (i.e. agents can't log in, calls can't be made on that server) until I reboot it. That causes a server to "go red", then throw timesync errors for agents on that server.

Re: Exceptionally Long Queue Length & Timesync error

Postby williamconley » Thu Nov 29, 2018 6:09 pm

dgroth02 wrote:I get that about the Time Sync error - but that doesn't explain the large string "Exceptionally Long Queue Length" errors that get thrown and keep going and cause the server not to properly work (i.e. agents can't login, calls can't be made on that server) until I reboot it. That causes a server to "go red", then throw timesync errors for agents on that server.

So fix your networking problem causing the time sync error (if there is one) and hope that the network error is also causing the queue issue.

Also:

http://forums.asterisk.org/viewtopic.php?f=1&t=91780

