
Nginx load balancing: bad configuration or bad behaviour? [Resolved]

I'm currently using Nginx as a load balancer to distribute network traffic across 3 nodes, each running a NodeJS API.

The Nginx instance runs on node1 and every request hits node1 first. I see peaks of about 700k requests in 2 hours, and Nginx is configured to distribute them, in a round-robin manner, between node1, node2 and node3. Here is conf.d/deva.conf:

upstream deva_api {
    server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
    server localhost:5555;
    keepalive 300;
}

server {

        listen 8000;

        location /log_pages {

                proxy_redirect off;
                proxy_set_header Host $host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

                proxy_http_version 1.1;
                proxy_set_header Connection "";

                add_header 'Access-Control-Allow-Origin' '*';
                add_header 'Access-Control-Allow-Methods' 'GET, POST, PATCH, PUT, DELETE, OPTIONS';
                add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Origin,X-Auth-Token';
                add_header 'Access-Control-Allow-Credentials' 'true';

                if ($request_method = OPTIONS ) {
                        return 200;
                }

                proxy_pass http://deva_api;
                proxy_set_header Connection "Keep-Alive";
                proxy_set_header Proxy-Connection "Keep-Alive";

                auth_basic "Restricted";                                #For Basic Auth
                auth_basic_user_file /etc/nginx/.htpasswd;  #For Basic Auth
        }
}
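
As far as I understand the nginx docs, upstream keepalive only needs the proxy_http_version 1.1 / empty Connection header pair, so the extra Keep-Alive and Proxy-Connection headers above may be redundant. A minimal sketch of the keepalive-friendly location (illustrative, not my exact production config):

location /log_pages {
        proxy_http_version 1.1;
        proxy_set_header Connection "";       # empty value: let nginx reuse idle upstream connections
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://deva_api;
}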

And here is the nginx.conf configuration:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

worker_rlimit_nofile 65535;
events {
        worker_connections 65535;
        use epoll;
        multi_accept on;
}

http {

        ##
        # Basic Settings
        ##

        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 120;
        send_timeout 120;
        types_hash_max_size 2048;
        server_tokens off;

        client_max_body_size 100m;
        client_body_buffer_size  5m;
        client_header_buffer_size 5m;
        large_client_header_buffers 4 1m;

        open_file_cache max=200000 inactive=20s;
        open_file_cache_valid 30s;
        open_file_cache_min_uses 2;
        open_file_cache_errors on;

        reset_timedout_connection on;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        ##
        # SSL Settings
        ##

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;

        ##
        # Logging Settings
        ##

        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;

        ##
        # Gzip Settings
        ##

        gzip on;
        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}

The problem is that, with this configuration, I get hundreds of errors in error.log like the following:

upstream prematurely closed connection while reading response header from upstream

but only on node2 and node3. I have already tried the following tests:

  1. increase the number of concurrent API processes on each node (I'm currently using PM2 as the intra-node balancer)
  2. remove one node to make Nginx's job easier
  3. apply weights to the Nginx upstream servers (see the sketch right after this list)
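
For reference, the weighted test was along these lines (the weight values here are illustrative, not the exact ones I used):

upstream deva_api {
    server 10.8.0.30:5555 weight=1 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 weight=1 fail_timeout=5s max_fails=3;
    server localhost:5555 weight=2;
    keepalive 300;
}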

Nothing improved the results. During those tests I noticed that the errors appeared only on the 2 remote nodes (node2 and node3), so I tried removing them from the equation. With them gone, that error disappeared, but I started getting 2 different errors:

recv() failed (104: Connection reset by peer) while reading response header from upstream

and

writev() failed (32: Broken pipe) while sending request to upstream

It seems the problem was a shortage of API processes on node1: they probably cannot serve all the inbound traffic before the client's timeout expires (this was, and still is, my guess). Given that, I increased the number of concurrent API processes on node1 and the results improved, but I keep getting the latter 2 errors and I cannot increase the concurrency on node1 any further.
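
For reference, these are the proxy timeout directives that control when nginx gives up on an upstream; the values below are just the nginx defaults, shown for context rather than as a fix:

proxy_connect_timeout 60s;            # time allowed to establish a connection to the upstream
proxy_send_timeout    60s;            # max time between two successive writes to the upstream
proxy_read_timeout    60s;            # max time between two successive reads from the upstream
proxy_next_upstream   error timeout;  # conditions under which the request is retried on the next server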

So, the question is: why can't I use Nginx as a load balancer with all my nodes? Am I making mistakes in the Nginx configuration? Are there other problems I haven't noticed?

EDIT: I ran some network tests between the 3 nodes. The nodes communicate with each other via OpenVPN:

PING:

node1->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.85 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=1.85 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=3.17 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=3.21 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.68 ms

node1->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=3.08 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=3.11 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=3.25 ms

node2->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=2.30 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=8.30 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.37 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.42 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.37 ms

node2->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.86 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=4.01 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=5.37 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=2.78 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.87 ms

node3->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=8.24 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=2.72 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.63 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.91 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.14 ms

node3->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=3.22 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=2.76 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=2.97 ms

Bandwidth check, via iperf:

node1 -> node2
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   229 MBytes   192 Mbits/sec

node2->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   182 MBytes   152 Mbits/sec

node3->node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   160 MBytes   134 Mbits/sec

node3->node2
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   260 MBytes   218 Mbits/sec

node2->node3
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   241 MBytes   202 Mbits/sec

node1->node3
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   187 MBytes   156 Mbits/sec

There seems to be a bottleneck in the OpenVPN tunnel, because the same test over the plain Ethernet interface reaches about 1 Gbit/s. That said, I followed this guide from community.openvpn.net, but only roughly doubled the bandwidth measured before.
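
For context, the tweaks in that guide boil down to OpenVPN options along these lines (the values are illustrative, not necessarily what I ended up using):

# OpenVPN tuning knobs (illustrative values)
proto udp       # avoid TCP-over-TCP overhead
sndbuf 0        # let the OS size the socket send buffer
rcvbuf 0        # let the OS size the socket receive buffer
tun-mtu 6000    # larger tunnel MTU
fragment 0      # disable OpenVPN's internal fragmentation
mssfix 0        # disable MSS clamping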

I would like to keep OpenVPN on, so: are there any other tweaks that would increase the tunnel bandwidth, or any other adjustments to the Nginx configuration that would make it work properly?


Question Credit: Gappa
Asked March 13, 2019
Posted Under: Network
1 Answer

The problems were caused by the slowness of the OpenVPN network. By routing the requests over the public internet instead, after adding authentication on each server, we got the errors down to 1-2 per day; those remaining errors are probably caused by other issues.
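
Concretely, that means the upstream now points at each node's public endpoint instead of the VPN addresses, roughly like this (the hostnames below are placeholders):

upstream deva_api {
    server localhost:5555;                                       # node1, local backend
    server node2.example.com:5555 fail_timeout=5s max_fails=3;   # placeholder public endpoint for node2
    server node3.example.com:5555 fail_timeout=5s max_fails=3;   # placeholder public endpoint for node3
    keepalive 300;
}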


credit: Dege
Answered March 13, 2019