mflorell wrote:What my.cnf file settings are you using?
Have you tried running the "extras/mysql-tuning.sh" script yet?
greg@byteworth.com wrote:It didn't fix the problem at all.
Major issues with agents getting paused out.
Major lag on any and all admin options.
...
williamconley wrote:
1) Please always post your installer with full version. This can be a serious help, obviously, in bringing problems with an installer to light.
I believe I posted the system info in the first post of this issue. ViciBox was installed with the most recent version 30 days ago. See my initial post.
2) I note that you have said you have problems with the limit in place, but resolved this problem by removing the limit. But now your client is upset ... which raises the question: if you removed the limit to resolve the issue, why is the client upset? Or did you have to put the limit back in place for some other reason? Or are you saying that removing the limit didn't actually resolve the issue at all? (Note that this is the opposite of the expected behavior of that clause: it ordinarily drastically improves the response time of SQL requests on large tables with over a million records.)
There is more to my post than what you are quoting.
Removing LIMIT on one single statement did turn a slow query into an extremely fast one.
But as I stated, I don't think that is the source of the problem. So I have been going back to the drawing board. Very confused. The server shows zero stress, yet agents are getting paused and the system becomes unresponsive.
It gets worse when the agent count gets upwards of 25 users.
It also doesn't seem to happen at morning startup; it seems to hit about an hour into the shift.
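(For anyone trying to pin down which statements are actually slow during those episodes, one option is the slow query log; this is only a sketch assuming a stock MariaDB/MySQL setup, not something prescribed in this thread:)
# log any statement that takes longer than 1 second (values are examples)
mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"
# default log name is <hostname>-slow.log in the datadir; adjust if yours differs
tail -f /var/lib/mysql/$(hostname)-slow.log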
3) You've said you have no problems with older systems but do have problems with newer systems. Were these newer systems upgraded from the prior ones? Or were they fresh installs that did not involve even bringing the DB along for the ride? If so, do these newer systems have the same lead-count and call volume as the prior ones?
The database was ported from the previous version; we do this on many systems.
4) I don't mean to be insulting with this one, but here it goes anyway (lol): You sound like someone well-versed enough in Linux and Vicidial to ... outsmart himself. Have you verified the original problem on an unmodified stock installation? Or did you "tune it" before checking? I've bumped into this many times over the years.
No offense taken; no tuning was done until the trouble started. It was a stock install. I posted earlier that I only disabled the query cache and increased threads to match the 24 CPU cores.
We only did this after we got frustrated and started troubleshooting.
I thought that maybe it was MariaDB 10.2 and was thinking of downgrading.
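(For reference, the two tuning changes described would normally sit in the [mysqld] section of my.cnf, roughly as below. This is only a sketch; the exact thread variable isn't stated above, so innodb_thread_concurrency is an assumption and thread_pool_size is another common candidate.)
[mysqld]
# disable the query cache entirely
query_cache_type = 0
query_cache_size = 0
# assumed thread setting matched to the 24 cores mentioned above
innodb_thread_concurrency = 24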
5) You seem to describe, recently, some enterprise-level issues which are often linked to available ports and services (ports for any enterprise-level server, services if the backlog is from mysql or apache2). We also bump into this a lot in larger systems. Best to begin with all your logs and attack each connection problem one at a time. Do not assume they are all related, and definitely don't assume they are related to the "limit" in mysql.
netstat -n | grep TIME_WAIT | wc -l
cat /etc/sysctl.conf |grep "net.ipv4.tcp_fin_timeout"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_recycle"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_reuse"
echo '1' > /proc/sys/net/ipv4/tcp_tw_recycle
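(While the problem is actually happening, it can also help to see the whole breakdown of socket states rather than only TIME_WAIT; a quick one-liner for that, assuming the classic net-tools netstat:)
# count connections grouped by TCP state, skipping the two header lines
netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn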
mflorell wrote:You can test Apache port exhaustion with the following command:
netstat -a | grep WAIT | wc -l
If you run out of ports, there really isn't anything you can do but get more servers to spread the load. Although, depending on how Apache is configured, you may not be using as many ports as the system is capable of using (65,000).
As for network, have you tried loading Wireshark on this network to see if there was any strange network activity?
Have you tried enabling Agent Screen Debug Logging in System Settings to see if there are any strange issues going on?
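(If the exhaustion turns out to be on outbound connections, for example Apache/PHP opening sockets to MySQL, the kernel's ephemeral port range is also worth a look. A hedged example of checking and temporarily widening it; the numbers are illustrations, not recommendations:)
# range of local ports the kernel hands out for outgoing connections
cat /proc/sys/net/ipv4/ip_local_port_range
# widen it until the next reboot (example values only)
echo '1024 65000' > /proc/sys/net/ipv4/ip_local_port_range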
williamconley wrote:
netstat -n | grep TIME_WAIT | wc -l
If this shows hundreds, that's normal; thousands, not so much. This ONLY counts if you run it while the server is actually experiencing the problem. You can check the present settings when the server is idle:
cat /etc/sysctl.conf |grep "net.ipv4.tcp_fin_timeout"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_recycle"
cat /etc/sysctl.conf |grep "net.ipv4.tcp_tw_reuse"
Temporary solution:
echo '1' > /proc/sys/net/ipv4/tcp_tw_recycle
There are several other fixes that can assist as well; with these fixes we've been able to reduce this problem to nothing. However: there will be a "hiccup" when this takes effect in some cases. Then it will smooth out. Also:
Note that a log would ordinarily lead you here before the netstat command. A timeout failure connecting to mysql from a dialer, for instance, when the DB server has not run out of connections and has no error in the log ... indicating that mysql never received the request but the connection failed.
There are several other similar issues in the same "Enterprise" level of problems.
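(To make those settings survive a reboot rather than only echoing into /proc, the usual approach is to add them to /etc/sysctl.conf and reload. A sketch with example values; note that tcp_tw_recycle is known to misbehave behind NAT and was removed from newer kernels, so treat it as the temporary measure described above:)
# persist the TIME_WAIT related settings (values are examples)
cat >> /etc/sysctl.conf << 'EOF'
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
EOF
# apply without rebooting
sysctl -p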
williamconley wrote:I hope the servers have direct access to each other on a Gigabit switch (no bouncing, all on one switch)?
williamconley wrote:I think you are operating under the false premise that "clustering" is something you should do after building the server. Clustering is best done during the installation of Vicidial on a fresh server. Properly done, a fresh install, with the installation script's questions answered correctly, will cluster everything except the NTP process (which should be configured after the install so the new slave server gets its time from the DB server, on the assumption that the DB has the correct time).
You would do much better to perform a fresh install on the new server and install Dialer/Web on it rather than trying to cluster a previously installed server.
You can NOT have two database servers for a cluster, FYI, so you don't have the option of adding the DB role to the new server.
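(For the NTP step mentioned above, the post-install change is typically just pointing the new slave server's time daemon at the DB server. A minimal sketch assuming a classic ntpd setup and a hypothetical DB address of 10.0.0.2:)
# /etc/ntp.conf on the new dialer/web server
server 10.0.0.2 iburst    # 10.0.0.2 = hypothetical DB server address
# restart the daemon so it starts syncing from the DB box
systemctl restart ntpd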
Is it likely that the 15 timeouts were the agent pauses or the double calls occurring at the agents' stations?
mflorell wrote:There can be all kinds of problems when Asterisk is at high load.
As for that specific issue, we have determined that it is the result of a specially crafted remote SIP registration attack that affects Asterisk 11 and higher. It uses a malformed SIP registration message that is not reported by Asterisk, so it cannot be blocked by a log-processor like fail2ban.
We started encountering this issue on our hosted platform a few months ago and were confused by what was going on initially, but after analyzing the incoming SIP traffic with HOMER, we were able to write a process that was able to pinpoint the attacking IPs and broadcast out a blacklist to be blocked by iptables on all of our dialers.
The other option to fix the symptoms of this is to downgrade to Asterisk 1.8 (there are instructions for how to do that on the VICIbox forum). Although that does not prevent the attacks from happening, it will prevent the instability that Asterisk has because of the attacks.
Over the last 2 months since we started analyzing these specific attacks, the attacks come from about 200-400 different IPs over the course of a week's time.
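(As an illustration only of how a blacklist like that could be applied on each dialer, and not VICIdial's actual mechanism, an ipset-backed drop rule keeps the iptables ruleset small even with hundreds of attacker IPs; the set name and address below are made up:)
# one-time setup on each dialer: a set of bad IPs plus a rule that drops them
ipset create sip_blacklist hash:ip
iptables -I INPUT -m set --match-set sip_blacklist src -j DROP
# the broadcast step then only has to add addresses to the set
ipset add sip_blacklist 203.0.113.45    # hypothetical attacker IP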
williamconley wrote:There is the outside chance that a station has a virus on one of the computers of the Private LAN network, that hacking could happen from there.
My *personal* favorite: A very secure system allowing only VPN tunnel access to the dialers. One of the call centers has a "dual vpn router" and of course is only using one of the two VPN connections to link to the VPN tunnel in the Vicidial location's colo.
Someone hacked the second VPN port on the router, used that access to "bounce" back out the primary vpn port and into the colo through the VPN tunnel.
We traced the IP of the access to the secondary VPN port to ... a "VPN Expert" in another country. Of course, at this point we merely permanently disabled the secondary VPN port and got back to work (one of three centers was "down" while we worked out the attack problem). But it goes to show that when there's a deep pocket, hackers will squeeze through any hole they can find.
greg@byteworth.com wrote:When I talk about people getting hacked, I am referring to our clients who have their own network solutions and equipment in place and don't follow our recommendations.
williamconley wrote:
That's why we recommend DGG. The server provides its own firewall. And if the server then has its own IP address, the client router is not involved in the process. Best of both worlds. Whitelist and full control from the CLI. Plus, if a workstation gets infected or uses too much bandwidth, they can unplug the router without any adverse effect on Vicidial.
To date I have not found a client who can pfSense or Untangle properly. Normal routers are often a challenge, but those two completely baffle most who try to use them when they hit Vicidial (great for everything else, though).
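(A bare-bones illustration of the whitelist-from-the-CLI idea; purely a sketch with example addresses, since a real build would also need to allow SIP/IAX and agent traffic and guard against locking yourself out:)
# allow loopback, established traffic, and one whitelisted office/VPN address
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -s 198.51.100.10 -j ACCEPT    # example address only
# everything else gets dropped
iptables -P INPUT DROP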