The use of LIMIT 10000 causing major performance issues


The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sat Feb 18, 2017 5:21 pm

We are using:
VERSION: 2.14-584a
BUILD: 170113-1637
2 Xeon hexa-core processors
96 GB RAM
Samsung 500 GB SSD drives in RAID 1 on an LSI Logic MegaRAID controller
Currently a single-server config. We are preparing to cluster and will dedicate the database server.
Currently only running 30 agents at a low dial level of about 3:1 to 5:1, with all-force recording.

We do have large log tables in the 10 to 20 million row range.

What I find strange is that there is no sign of server stress: CPU is at about 2% and RAM usage is around 6 GB.
Yet the MySQL slow log is full of 10- and 20-second queries.

I modified user_stats.php by removing the LIMIT 10000 from the recording_log query.
That changed a disastrous one-minute load time to SUB-SECOND!

So, after looking at the MySQL slow log again, I see the same pattern: all of the queries that have LIMIT nnnnn on them are taking a lifetime to execute.
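
For anyone hitting the same symptom, one way to see whether the LIMIT clause is actually changing the execution plan (rather than just being a bystander) is to EXPLAIN the same statement with and without it. This is only a sketch: the query below is a simplified stand-in with assumed column names, not the exact statement from user_stats.php, and it assumes the dialer schema is named "asterisk" with local credentials available to the mysql client.

Code: Select all
# Sketch: compare plans for a simplified recording_log query with and without LIMIT.
mysql -e "EXPLAIN SELECT recording_id, user, start_time, filename
  FROM asterisk.recording_log WHERE user='6666' ORDER BY recording_id DESC;"
mysql -e "EXPLAIN SELECT recording_id, user, start_time, filename
  FROM asterisk.recording_log WHERE user='6666' ORDER BY recording_id DESC LIMIT 10000;"
# If the chosen key or the 'Using filesort' note differs between the two plans,
# the LIMIT is steering the optimizer; if the plans are identical, the slowness
# is coming from somewhere else (I/O, key cache, locking).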

Is there something I can do other than figuring out which queries to remove the LIMIT nnnnn from to save the day?

I am aware of archiving, but I don't see why that should be required if things are properly configured; it would be a last resort.

Any suggestions would be greatly appreciated.

Thank you in advance,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Sat Feb 18, 2017 10:29 pm

Interesting, I haven't run into that one before.

What MySQL/MariaDB version are you using?
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sun Feb 19, 2017 2:16 am

The version info is pasted below.
Maybe the title of the post is a little misleading; the LIMIT 10000 behavior might just point to the type of tuning that needs to happen.
This is a new installation. We are upgrading our customer to this more powerful server. After the install, we imported the database from the old server to this new server. We do this for a living and have many successful installations. My technician used your most recent version. We have had 40 agents on a piece of junk, non-server-grade, single-core processor with 8 GB of RAM working better than this.

version 10.1.20-MariaDB
"protocol_version" "10"
"version_comment" "openSUSE package"
"version_compile_machine" "x86_64"
"version_compile_os" "Linux"
"version_malloc_library" "system jemalloc"
"version_ssl_library" "OpenSSL 1.0.1i-fips 6 Aug 2014"
"wsrep_patch_version" "wsrep_25.16"
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Sun Feb 19, 2017 8:14 am

What my.cnf file settings are you using?

Have you tried running the "extras/mysql-tuning.sh" script yet?
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sun Feb 19, 2017 2:00 pm

mflorell wrote:What my.cnf file settings are you using?

Have you tried running the "extras/mysql-tuning.sh" script yet?


The first day this dialer went into production, it was pausing the agents on and off, in a cycle of every 1 or 2 minutes. The customer was going crazy.
We made a big improvement by changing:
thread_concurrency to 24
max_connections and query_cache_size

Then:
just last night, a change to key_buffer_size=16G

NOTE: in the config below, a commented-out (#) line shows the previous value of a setting we changed:

[mysqld]
log_error = /var/log/mysql/error.log
log_bin=/var/lib/mysql/mysql-bin
binlog_format=mixed
server-id=1
relay-log = /var/lib/mysql/mysqld-relay-bin
slave-skip-errors = 1032,1690,1062
datadir=/var/lib/mysql
sql_mode=NO_ENGINE_SUBSTITUTION
port = 3306
socket = /var/run/mysql/mysql.sock
skip-external-locking
skip-name-resolve
connect_timeout=60
long_query_time=3
slow_query_log=1
slow-query-log-file=/var/log/mysql/mysqld-slow.log
max_connections=2048
#max_connections=512
key_buffer_size=16G
#key_buffer_size=2G
max_allowed_packet=16M
table_open_cache=512
table_definition_cache=2048
open_files_limit=24576
sort_buffer_size=4M
net_buffer_length=8K
read_buffer_size=4M
read_rnd_buffer_size=16M
myisam_sort_buffer_size=128M
join_buffer_size=1M
thread_cache_size=100
query_cache_size=0
#query_cache_size=32M
thread_concurrency=24
#thread_concurrency=8
default-storage-engine=MyISAM
expire_logs_days=3
concurrent_insert=2
myisam_repair_threads=1
myisam_use_mmap=1
skip-innodb
delay_key_write=ALL
max_write_lock_count=1
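
Before raising key_buffer_size into the tens of gigabytes, it is worth checking whether the MyISAM key cache is even the bottleneck. A sketch of the usual checks, assuming the mysql client can log in locally with default credentials:

Code: Select all
# Key cache miss ratio: Key_reads / Key_read_requests (should stay very small)
mysql -e "SHOW GLOBAL STATUS LIKE 'Key_read%';"
# How much of the key buffer is actually in use (blocks * key_cache_block_size bytes)
mysql -e "SHOW GLOBAL STATUS LIKE 'Key_blocks_used';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';"
# Connection head-room: high-water mark vs. the max_connections setting
mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
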
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sun Feb 19, 2017 2:09 pm

greg@byteworth.com wrote:
mflorell wrote:What my.cnf file settings are you using?

Have you tried running the "extras/mysql-tuning.sh" script yet?


It's looking a little better today; the log has a few queries at 3 seconds, but this one took 19:

SET timestamp=1487509320;
select count(*) from vicidial_list where phone_code='1' and phone_number LIKE "615%";
# Time: 170219 7:02:04
# User@Host: cron[cron] @ localhost []
# Thread_id: 76676 Schema: asterisk QC_hit: No
# Query_time: 3.940227 Lock_time: 0.000030 Rows_sent: 1 Rows_examined: 1350657
# Rows_affected: 0
SET timestamp=1487509324;
select count(*) from vicidial_list where phone_code='1' and phone_number LIKE "615%" and (gmt_offset_now != '-6' or gmt_offset_now IS NULL);
# Time: 170219 12:23:13
# User@Host: cron[cron] @ localhost []
# Thread_id: 392012 Schema: asterisk QC_hit: No
# Query_time: 19.363740 Lock_time: 0.000093 Rows_sent: 1 Rows_examined: 1055853
# Rows_affected: 0
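
Both of those statements filter on phone_code and phone_number, and a leading-prefix LIKE ('615%') can normally use an index on phone_number, so the million-plus Rows_examined counts suggest they are scanning instead. A quick way to confirm against the live schema (sketch only, reusing the exact query from the slow log and assuming the schema is named "asterisk"):

Code: Select all
mysql -e "SHOW INDEX FROM asterisk.vicidial_list;"
mysql -e "EXPLAIN SELECT count(*) FROM asterisk.vicidial_list
  WHERE phone_code='1' AND phone_number LIKE '615%';"
# 'key: NULL' with a rows estimate near the full table size means a table scan.
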
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sun Feb 19, 2017 5:06 pm

After making the adjustment to key_buffer_size = 16G:

The server ran for 75 minutes or so and died. It appeared that Apache was still working, but MySQL was dead.

Reset it back to 2G.

Not sure what is happening. Starting to think it could be bad memory.

Greg
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Mon Feb 20, 2017 10:17 am

That's just what I was going to suggest. We have seen a few cases over the years where there was bad memory, and that caused similar issues with MySQL.
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby rockgeneral » Wed Feb 22, 2017 2:06 pm

greg@byteworth.com

I am having the same problem with queries that use limit 10000 as well. Have you made any headway figuring out the cause in your case?
System Info:
ViciBox_v9.x86_64-9.0.2.iso | Version: 2.14b0.5 SVN: 3551 | DB Schema Version: 1650 | Asterisk 13.29.2-vici | Single Server
rockgeneral
 
Posts: 93
Joined: Thu Mar 04, 2010 9:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Thu Feb 23, 2017 11:46 am

We will know if the RAM fixes it today; the RAM is scheduled to be delivered to my customer sometime today.
Do you have large tables?
RAID?

I will let you know.

Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby rockgeneral » Thu Feb 23, 2017 1:16 pm

Let me know if the memory upgrade fixes it.

> Do you have large tables?
It depends on what you consider large, I suppose. The table where I first noticed the issue is recording_log, which has 7,884,814 records. When I run the query WITHOUT LIMIT 10000 it returns results almost instantly. When I run it WITH LIMIT 10000 the query hangs at "Sending data" (per the SHOW PROCESSLIST command).
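
One way to narrow down where that "Sending data" time goes is to profile a single session running the query and, in a second terminal, watch the thread state while it runs. Sketch only: the SELECT below is a placeholder with assumed column names, not the actual user_stats.php statement, and it assumes a local "asterisk" schema.

Code: Select all
# Profile one run of a LIMIT'ed recording_log query and dump the phase timings.
mysql -e "SET profiling=1;
  SELECT recording_id, user, start_time, filename
    FROM asterisk.recording_log ORDER BY recording_id DESC LIMIT 10000;
  SHOW PROFILE;" > /tmp/limit_profile.txt
tail -40 /tmp/limit_profile.txt   # phases such as 'Sending data' or 'Creating sort index'
# Meanwhile, from another shell, see what state the thread is stuck in:
mysql -e "SHOW FULL PROCESSLIST\G" | grep -i -B3 -A4 recording_log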

> RAID?
No, we don't have a RAID setup; however, we are a small shop running 10 agents outbound dialing, usually at around 3:1. We have another 10 employees logged in, but they make strictly manual-dial calls. Also, I've been watching stats via iostat, and I/O doesn't seem to be a problem.

We have an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (8 core) processor with 32GB of RAM. The server is hardly under any load.

Regards,

Michael George
System Info:
ViciBox_v9.x86_64-9.0.2.iso | Version: 2.14b0.5 SVN: 3551 | DB Schema Version: 1650 | Asterisk 13.29.2-vici | Single Server
rockgeneral
 
Posts: 93
Joined: Thu Mar 04, 2010 9:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Thu Feb 23, 2017 1:54 pm

Are your agents getting paused out?
Are there any other symptoms?
Can you click around in the administration GUI without any delays, or do you experience delays when clicking around while all 10 agents are active?

Thank you,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby rockgeneral » Thu Feb 23, 2017 2:19 pm

> Are your agents getting paused out?
I've heard occasional complaints from a few users that they were paused but I haven't really witnessed it.

> Are there any other symptoms?
I get a couple of calls each day that get stuck in vicidial_auto_calls. I see them when I run the AST_timeonVDAD.php report. Everything else with these calls seems normal, i.e. they get transferred to agents, recorded, etc. I don't even know that this is related; however, it is something that began happening post-upgrade.

> Can click around in the administrator gui without any delays or are you experiencing any delay when click around when all 10 agents are active?
During normal operations, which is most of the time, clicking around in the admin interface is responsive. However, during one of these events, when one of these LIMIT 10000 queries locks tables, it can become unresponsive depending upon which table is locked. Fortunately, in my case it seems mostly confined to the recording_log table.

Regards,

Michael George
System Info:
ViciBox_v9.x86_64-9.0.2.iso | Version: 2.14b0.5 SVN: 3551 | DB Schema Version: 1650 | Asterisk 13.29.2-vici | Single Server
rockgeneral
 
Posts: 93
Joined: Thu Mar 04, 2010 9:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Fri Feb 24, 2017 8:13 pm

Put the new RAM in last night.
Sunday, the full shift will log in and dial.
However, I did test user_stats.php:
I restored user_stats.php to include the LIMIT 10000.

It is smoking fast with the LIMIT 10000 included.

Had a small shift today and didn't hear complaints, BUT I can't say with 100% certainty until I see the full roster of about 30 agents logged in and running.

I am optimistic that the RAM replacement will be the fix.

I will update this post on or after Sunday.

Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby rockgeneral » Sat Feb 25, 2017 6:33 am

I'm crossing my fingers, hoping Sunday goes well for you :) If it does I'm going to replace our memory as well.

Regards,

Michael George
System Info:
ViciBox_v9.x86_64-9.0.2.iso | Version: 2.14b0.5 SVN: 3551 | DB Schema Version: 1650 | Asterisk 13.29.2-vici | Single Server
rockgeneral
 
Posts: 93
Joined: Thu Mar 04, 2010 9:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Sun Feb 26, 2017 7:41 pm

It didn't fix the problem at all.
Major issues with agents getting paused out.
Major lag on any and all admin options.

I tuned my.cnf to use larger amounts of memory today but got no benefit from it.

Something is drastically wrong.

I have a VICIdial 4.0 system with log tables double this size KICKING BUTT with no issues.

Right now everything is pointing to this latest version. Even the previous version is slower than the older versions of VICIdial.

The customer is fed up with us and wants to start separating servers before we can get this resolved.

I will have to speak with the owner of the system shortly to try to get him to let us roll the tables. However, I am not convinced that is the problem, considering that the older versions have no issues like this.

Is anyone else experiencing issues with the newer versions?
Is there a setting that we don't know about that is causing it to run like a turtle?
I mean, we have been delivering VICIdial solutions since 2011 and have never encountered something like this.

Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Sun Feb 26, 2017 7:58 pm

greg@byteworth.com wrote:It didn't fix the problem at all.
Major issues with agents getting paused out.
Major lag on any and all admin options.
...


1) Please always post your installer with full version. This can be a serious help, obviously, in bringing problems with an installer to light.

2) I note that you have said you have problems with limit in place, but resolved this problem by removing limit. But now your client is upset ... which begs the question: If you removed the limit to resolve the issue, why is the client upset? Or did you have to put the limit back in place for some other reason? Or are you saying that removing limit didn't actually resolve the issue at all? (note that this is the opposite of the expected behavior of that command: it ordinarily drastically improves the response time of sql requests on large tables with over a million records)

3) You've said you have no problems with older systems but do have problems with newer systems. Were these newer systems upgraded from the prior ones? Or were they fresh installs that did not involve even bringing the DB along for the ride? If so, do these newer systems have the same lead-count and call volume as the prior ones?

4) I don't mean to be insulting with this one, but here it goes anyway (lol): You sound like someone well-versed enough in Linux and Vicidial to ... outsmart himself. Have you verified the original problem on an unmodified stock installation? Or did you "tune it" before checking? I've bumped into this many times over the years.

5) You seem to describe, recently, some enterprise level issues which are often linked to available ports and services (ports for any enterprise-level server, services if the backlog is from mysql or apache2). We also bump into this a lot in larger systems. Best to begin with all your logs and attack each connection problem at a time. Do not assume they are all related, and definitely don't assume they are related to the "limit" in mysql.
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Sun Feb 26, 2017 8:10 pm

We have one of our largest clients running on the newest version of VICIdial, placing up to 1.5 million calls a day with 250+ agents, so we can confirm that there is nothing wrong with the VICIdial software itself.

Problems like you are mentioning are almost always hardware, network or configuration issues. But sometimes they can be difficult to diagnose.

For example, we've seen issues like "slow DNS server", "failing network switch", "port exhaustion", etc... all cause similar issues recently.
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Mon Feb 27, 2017 10:55 pm

williamconley wrote:
greg@byteworth.com wrote:It didn't fix the problem at all.
Major issues with agents getting paused out.
Major lag on any and all admin options.
...


1) Please always post your installer with full version. This can be a serious help, obviously, in bringing problems with an installer to light.
I believe I posted the system info in the first post of this thread: a ViciBox install of your most recent version, done 30 days ago. See my initial post.


2) I note that you have said you have problems with limit in place, but resolved this problem by removing limit. But now your client is upset ... which begs the question: If you removed the limit to resolve the issue, why is the client upset? Or did you have to put the limit back in place for some other reason? Or are you saying that removing limit didn't actually resolve the issue at all? (note that this is the opposite of the expected behavior of that command: it ordinarily drastically improves the response time of sql requests on large tables with over a million records)

There is more to my post than what you are quoting.
Removing the LIMIT on one single statement turned a slow query into an extremely fast one.
But I stated that I don't think that is the source of the problem, so I have been going back to the drawing board. Very confused. The server shows zero stress, yet agents are getting paused and the system becomes unresponsive.
It gets worse when the agent count gets upwards of 25 users.
It also doesn't seem to happen at morning startup; it seems to hit about an hour into the shift.

3) You've said you have no problems with older systems but do have problems with newer systems. Were these newer systems upgraded from the prior ones? Or were they fresh installs that did not involve even bringing the DB along for the ride? If so, do these newer systems have the same lead-count and call volume as the prior ones?

The database was ported from the previous version; we do this on many systems.


4) I don't mean to be insulting with this one, but here it goes anyway (lol): You sound like someone well-versed enough in Linux and Vicidial to ... outsmart himself. Have you verified the original problem on an unmodified stock installation? Or did you "tune it" before checking? I've bumped into this many times over the years.

No offense taken; no tuning was done until the trouble started. It was a stock install. I posted earlier that I only disabled the query cache and increased threads to match the CPU core count of 24.
We only did this after we were frustrated and started troubleshooting.

I thought that maybe it was MariaDB 10.2 and was thinking of downgrading.


5) You seem to describe, recently, some enterprise level issues which are often linked to available ports and services (ports for any enterprise-level server, services if the backlog is from mysql or apache2). We also bump into this a lot in larger systems. Best to begin with all your logs and attack each connection problem at a time. Do not assume they are all related, and definitely don't assume they are related to the "limit" in mysql.


Yeah, I am done with the LIMIT, it was a symptom of the true problem.

I am not aware of port limitations outside of the router. We understand ports in relation to router configuration.
Can you guys point me in the right direction for adjusting ports? Is there something in Apache2, or something on the Linux server, that needs tweaking?

NOTE: When we ping to or from this VICIdial server, whether to an internal node or an external node, we never see bad ping times.

Is it possible to see pings looking good when the network is the cause of system delays?
I have always had bad pings when the network was the issue.

I am pushing the client to replace the switch; now I am thinking networking. Please see my question about ports.

Thank you,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Mon Feb 27, 2017 11:04 pm

You can test Apache port exhaustion with the following command:

“netstat -a | grep WAIT | wc -l”

If you run out of ports, there really isn't anything you can do but get more servers to spread the load. Although, depending on how Apache is configured, you may not be using as many ports as the system is capable of using(65,000)
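
For reference, a quick way to compare how many simultaneous connections Apache is configured to allow against how many it is actually holding open (a sketch; config paths and directive names vary by Apache version and distribution):

Code: Select all
# Which MPM is loaded, and what worker limit is configured for it
apachectl -M 2>/dev/null | grep -i mpm
grep -ri "MaxClients\|MaxRequestWorkers" /etc/apache2/ /etc/httpd/ 2>/dev/null
# Established HTTP connections right now (assumes the web GUI is on port 80)
netstat -ant | awk '$4 ~ /:80$/ && $6=="ESTABLISHED"' | wc -l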

As for network, have you tried loading Wireshark on this network to see if there was any strange network activity?

Have you tried enabling Agent Screen Debug Logging in System Settings to see if there are any strange issues going on?
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Tue Feb 28, 2017 12:16 am

Code: Select all
netstat -n | grep TIME_WAIT | wc -l


Showing ... hundreds is normal, thousands Not So Much. This ONLY counts when the server is experiencing the problem. You can check the present settings here when the server is idle:

Code: Select all
cat /etc/sysctl.conf |grep "net.ipv4.tcp_fin_timeout"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_recycle"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_reuse"


Temporary solution:

Code: Select all
echo '1' > /proc/sys/net/ipv4/tcp_tw_recycle


There are several other fixes that can assist as well; with these fixes we've been able to reduce this problem to nothing. However, there will be a "hiccup" when this takes effect in some cases, and then it will smooth out. Also:

Note that a log would ordinarily lead you here before the netstat command. A timeout failure connecting to mysql from a dialer, for instance, when the DB server has not run out of connections and has no error in its log ... indicating that mysql never received the request but the connection failed.

There are several other similar issues in the same "Enterprise" level of problems.
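
One caveat worth adding: an echo into /proc only lasts until the next reboot. To make this kind of change permanent it has to go into /etc/sysctl.conf and be reloaded with sysctl -p. A sketch with example values; note that tcp_tw_reuse is generally the safer knob, since tcp_tw_recycle is known to break clients behind NAT and was removed from later kernels:

Code: Select all
# Persist the TIME_WAIT tuning across reboots (values are examples only)
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
EOF
sysctl -p   # apply the file now without rebooting
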
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Tue Feb 28, 2017 10:09 pm

mflorell wrote:You can test Apache port exhaustion with the following command:

“netstat -a | grep WAIT | wc -l”

If you run out of ports, there really isn't anything you can do but get more servers to spread the load. Although, depending on how Apache is configured, you may not be using as many ports as the system is capable of using(65,000)

As for network, have you tried loading Wireshark on this network to see if there was any strange network activity?

Have you tried enabling Agent Screen Debug Logging in System Settings to see if there are any strange issues going on?


Thank you, and William, for the info. I am seeing a count of 300 to 400; that seems about normal, right?

By the way, the customer is going to have a network company (not sure how competent) come out to make more "home run" cable connections and also to eliminate several extra switches. Then they will put in a new 48-port switch.
I am crossing my fingers that the problem is something in their network config and gear.

I turned on agent logging (thank you for that feature); it could prove to be very handy.

Regarding Wireshark: that tool is advanced beyond my ability to intelligently use and interpret.
I asked the customer to request that from their network guys. Not sure if they will be able to get that done.

We use the pfSense router; we have analyzed the local traffic and have not seen anything obvious there. We do see occasional 1 to 2 MB transfers. We don't know who has internet access and who doesn't.

The customer is trying to convince me that it is a bug with using multiple campaigns: they pull some people out of a campaign, have them all go into one campaign, and they say the problem goes away. I am not convinced.

Thank you guys again, I hope we get to the bottom of this soon.
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Tue Feb 28, 2017 10:35 pm

I hope the servers have direct access to each other on a Gigabit switch (no bouncing, all on one switch)?
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Wed Mar 01, 2017 1:12 am

williamconley wrote:
Code: Select all
netstat -n | grep TIME_WAIT | wc -l


Showing ... hundreds is normal, thousands Not So Much. This ONLY counts when the server is experiencing the problem. You can check the present settings here when the server is idle:

Code: Select all
cat /etc/sysctl.conf |grep "net.ipv4.tcp_fin_timeout"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_recycle"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_reuse"


Temporary solution:

Code: Select all
echo '1' > /proc/sys/net/ipv4/tcp_tw_recycle


There are several other fixes that can assist as well; with these fixes we've been able to reduce this problem to nothing. However, there will be a "hiccup" when this takes effect in some cases, and then it will smooth out. Also:

Note that a log would ordinarily lead you here before the netstat command. A timeout failure connecting to mysql from a dialer, for instance, when the DB server has not run out of connections and has no error in its log ... indicating that mysql never received the request but the connection failed.

There are several other similar issues in the same "Enterprise" level of problems.


Am I understanding you correctly: if I run the "netstat -n | grep TIME_WAIT | wc -l" command and it returns a number under 1,000, we are not exhausting the ports; however, if it returns upwards of 1,000, it is possible that agents are getting paused out because of it. Is that correct?

The results of running these commands are as follows:
350 average for the first command
15 for net.ipv4.tcp_fin_timeout
0 for "net.ipv4.tcp_tw_recycle"

Is it likely that the 15 timeouts were agent pauses or the double calls occurring at the agents' stations?

As for the "echo '1' > /proc/sys/net/ipv4/tcp_tw_recycle": I was scared to run it. I am not sure what it will do or whether it needs to be run.
If it does cure the problem, will it persist after a reboot?

Thank you again.

Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Wed Mar 01, 2017 1:56 am

williamconley wrote:I hope the servers have direct access to each other on a Gigabit switch (no bouncing, all on one switch)?


It will be one switch serving the incoming additional server.

As it stands right now, it is a stand alone / single server.

The customer has been hesitant to bring the additional server into a problematic environment, especially since his newly upgraded server is not running any better than the previous one.
Our plan was to upgrade the previous server to the current version of the software and then cluster them, making this current high-powered server the database server and the previous server the Asterisk/web server (this we have done several times with success).

Now, for the first time, I am questioning whether it should be a database server (as we normally do) or an Asterisk/web server.

We are only dealing with a maximum of 35 agents and are seeing problems with 20 agents.
The dial level is 3 on the low end and 5 on the high end. This shouldn't be happening with these numbers.

Maybe the results of the network change will reveal which way to go with the type of server cluster.

Thank you,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Wed Mar 01, 2017 2:17 am

I think you are operating under the false premise that "clustering" is something you should do after building the server. Clustering is best done during the installation of Vicidial on a fresh server. Properly done, with the questions answered correctly during the installation script, the fresh install will cluster everything except the NTP process (which should be configured after the install so the new slave server gets its time from the DB, assuming the DB has the correct time).

You would do much better to perform a fresh install on the new server and install Dialer/Web on it rather than trying to cluster a previously installed server.

You can NOT have two database servers for a cluster, FYI, so you don't have the option of adding the DB role to the new server.
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Wed Mar 01, 2017 8:24 pm

williamconley wrote:I think you are operating under the false premise that "clustering" is something you should do after building the server. Clustering is best done during the installation of Vicidial on a fresh server. Properly done, with the questions answered correctly during the installation script, the fresh install will cluster everything except the NTP process (which should be configured after the install so the new slave server gets its time from the DB, assuming the DB has the correct time).

You would do much better to perform a fresh install on the new server and install Dialer/Web on it rather than trying to cluster a previously installed server.

You can NOT have two database servers for a cluster, FYI, so you don't have the option of adding the DB role to the new server.


If that is what you got out of my post, I have done a bad job of communicating.
The clustered box (the "previous server") will be wiped clean and then installed as a clustered Asterisk/web server attached to the "upgraded new server".

Did you see the question I asked about the 15 timeouts? I asked if those were a concern, or even possibly related to the double calls and/or paused-out agents.

Also, I was asking about that last command, which you said could cause hiccups but would fix something. I wasn't clear on what it fixes or whether it should be run.

Let me know.
Thank you,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Thu Mar 02, 2017 12:00 am

Is it likely that the 15 timeouts were agent pause's or the double calls occurring at the agents stations?

That is a configuration option (the value of net.ipv4.tcp_fin_timeout), not a "counter".
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Thu Mar 02, 2017 4:38 am

mflorell wrote:You can test Apache port exhaustion with the following command:

“netstat -a | grep WAIT | wc -l”

If you run out of ports, there really isn't anything you can do but get more servers to spread the load. Although, depending on how Apache is configured, you may not be using as many ports as the system is capable of using(65,000)

As for network, have you tried loading Wireshark on this network to see if there was any strange network activity?

Have you tried enabling Agent Screen Debug Logging in System Settings to see if there are any strange issues going on?



Hey Matt,
I just encountered a reported bug that is nearly identical to ours. Bug ID: 0000989.
I didn't post this behaviour before because the customer was adding agents after approximately 1 hour (so we attributed the issue to the added agents).
It does get worse after one hour (give or take 10 minutes).
Like the reporter, we have a Supermicro server.

Are you getting any other reports with these symptoms?

top -c showed asterisk at the top using 30% of CPU while idle. I checked our other dialers on the older version and noted only fractional levels of usage. I will watch it during production.

This customer complains of double calls (2 calls joining the conference) at odd times.
This same customer has been talking with their counterpart (a fundraiser company in California), who complains of the same problems with Vicidial 7.4.
Is there a known issue with 2 calls merging randomly?

Your code and help is greatly appreciated.

Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby mflorell » Thu Mar 02, 2017 7:02 am

There can be all kinds of problems when Asterisk is at high load.

As for that specific issue, we have determined that it is the result of a specially crafted remote SIP registration attack that affects Asterisk 11 and higher. It uses a malformed SIP registration message that is not reported by Asterisk, so it cannot be blocked by a log-processor like fail2ban.

We started encountering this issue on our hosted platform a few months ago and were confused by what was going on initially, but after analyzing the incoming SIP traffic with HOMER, we were able to write a process that was able to pinpoint the attacking IPs and broadcast out a blacklist to be blocked by iptables on all of our dialers.

The other option to fix the symptoms of this is to downgrade to Asterisk 1.8 (there are instructions for how to do that on the VICIbox forum). Although that does not prevent the attacks from happening, it will prevent the instability that Asterisk has because of the attacks.

Over the last 2 months since we started analyzing these specific attacks, the attacks come from about 200-400 different IPs over the course of a week's time.
mflorell
Site Admin
 
Posts: 18379
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Thu Mar 02, 2017 3:59 pm

mflorell wrote:There can be all kinds of problems when Asterisk is at high load.

As for that specific issue, we have determined that it is the result of a specially crafted remote SIP registration attack that affects Asterisk 11 and higher. It uses a malformed SIP registration message that is not reported by Asterisk, so it cannot be blocked by a log-processor like fail2ban.

We started encountering this issue on our hosted platform a few months ago and were confused by what was going on initially, but after analyzing the incoming SIP traffic with HOMER, we were able to write a process that was able to pinpoint the attacking IPs and broadcast out a blacklist to be blocked by iptables on all of our dialers.

The other option to fix the symptoms of this is to downgrade to Asterisk 1.8 (there are instructions for how to do that on the VICIbox forum). Although that does not prevent the attacks from happening, it will prevent the instability that Asterisk has because of the attacks.

Over the last 2 months since we started analyzing these specific attacks, the attacks come from about 200-400 different IPs over the course of a week's time.
[quote="williamconley"][quote]

Good news on CPU/asterisk:
I don't see it at an alarming level with 30 agents logged in. It is only at 100% average. It was weird to see it at 30% last night, though.
////////////
Hacking SIP:
If VICIdial admins whitelist their SIP connections in their router, there wouldn't be any SIP hacking at all. There is no way to hack something that doesn't exist (to the outside world).
There is the outside chance that a station has a virus on one of the computers of the private LAN network; hacking could happen from there.
However, we are not seeing any of that activity on the Asterisk console.
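
As an illustration of that whitelist approach applied on the dialer itself rather than at the router (a sketch only: the carrier address is a documentation placeholder and the LAN range is assumed; test rules like this carefully so you don't lock out your own phones):

Code: Select all
# Allow SIP (UDP 5060) only from the known carrier and the office LAN, drop everything else.
iptables -A INPUT -p udp --dport 5060 -s 203.0.113.10 -j ACCEPT    # carrier IP (placeholder)
iptables -A INPUT -p udp --dport 5060 -s 192.168.1.0/24 -j ACCEPT  # internal phones/agents (assumed range)
iptables -A INPUT -p udp --dport 5060 -j DROP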

Thank you,
Greg Hill

greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Thu Mar 02, 2017 4:54 pm

There is the outside chance that a station has a virus on one of the computers of the Private LAN network, that hacking could happen from there.

My *personal* favorite: A very secure system allowing only VPN tunnel access to the dialers. One of the call centers has a "dual vpn router" and of course is only using one of the two VPN connections to link to the VPN tunnel in the Vicidial location's colo.

Someone hacked the second VPN port on the router, used that access to "bounce" back out the primary vpn port and into the colo through the VPN tunnel.

We traced the IP of the access to the secondary VPN port to ... a "VPN Expert" in another country. Of course, at this point we merely permanently disabled the secondary VPN port and got back to work (one of three centers was "down" while we worked out the attack problem). But it goes to show that when there's a deep pocket, hackers will squeeze through any hole they can find.
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Thu Mar 02, 2017 8:07 pm

williamconley wrote:
There is the outside chance that a station has a virus on one of the computers of the Private LAN network, that hacking could happen from there.

My *personal* favorite: A very secure system allowing only VPN tunnel access to the dialers. One of the call centers has a "dual vpn router" and of course is only using one of the two VPN connections to link to the VPN tunnel in the Vicidial location's colo.

Someone hacked the second VPN port on the router, used that access to "bounce" back out the primary vpn port and into the colo through the VPN tunnel.

We traced the IP of the access to the secondary VPN port to ... a "VPN Expert" in another country. Of course, at this point we merely permanently disabled the secondary VPN port and got back to work (one of three centers was "down" while we worked out the attack problem). But it goes to show that when there's a deep pocket, hackers will squeeze through any hole they can find.


That is an interesting case for the books.

When hackers can't see an open port, they don't even stop to try. That is why whitelisting is an essential part of security, but of course much more can be done.

They can sometimes get onto the SSH port (even though we change the port number), and from there wreak havoc on the server and much more.

We are seeing improvements.
We believe that the customer had some extra routers that were being used as switches and "possibly serving DHCP", causing IP conflicts within the private network. We can't be 100% sure because we are working remotely.
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Thu Mar 02, 2017 8:52 pm

Do you use Dynamic Good Guys?

http://viciwiki.com/index.php/DGG
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Fri Mar 03, 2017 12:06 am

williamconley wrote:Do you use Dynamic Good Guys?

http://viciwiki.com/index.php/DGG


No; as long as the router supports whitelisting, it is easy enough.
We use the pfSense open-source router. It is limitless as far as I know.

When I talk about people getting hacked, I am referring to our clients who have their own network solutions and equipment in place and don't follow our recommendations.

Thank you,
Greg Hill
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

Re: The use of LIMIT 10000 causing major performance issues

Postby williamconley » Fri Mar 03, 2017 12:28 am

greg@byteworth.com wrote:When I talk about people getting hacked, I am referring to our clients who have their own network solutions and equipment in place and don't follow our recommendations.

That's why we recommend DGG. The server provides its own firewall. And if the server then has its own IP address, the client router is not involved in the process. Best of both worlds. Whitelist and full control from the CLI. Plus, if a workstation gets infected or uses too much bandwidth, they can unplug the router without any adverse effect on Vicidial. :)

To date I have not found a client who can run pfSense or Untangle properly. Normal routers are often a challenge, but those two completely baffle most who try to use them when they hit Vicidial (great for everything else, though).
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20229
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)

Re: The use of LIMIT 10000 causing major performance issues

Postby greg@byteworth.com » Fri Mar 03, 2017 4:00 am

williamconley wrote:
greg@byteworth.com wrote:When I talk about people getting hacked, I am referring to our clients who have their own network solutions and equipment in place and don't follow our recommendations.

That's why we recommend DGG. The server provides its own firewall. And if the server then has its own IP address, the client router is not involved in the process. Best of both worlds. Whitelist and full control from the CLI. Plus, if a workstation gets infected or uses too much bandwidth, they can unplug the router without any adverse effect on Vicidial. :)

To date I have not found a client who can pfSense or Untangle properly. Normal routers are often a challenge, but those two completely baffle most who try to use them when they hit Vicidial (great for everything else, though).


I am sure DGG is a good solution. The only time we help clients with networking is when they don't already have someone to do it. We don't let them into the router; they are not equipped to properly manage routers in general and would more than likely cause us more havoc than we already have.

We have been very happy with pfSense and have been using it with VICIdial without issue for over 5 years.
greg@byteworth.com
 
Posts: 29
Joined: Wed Sep 23, 2015 8:28 pm

