Problems with proxying DNS through nginx

Background

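DNS resolution is currently served by BIND behind an nginx layer-4 (stream) proxy. Some problems came up with this setup, and this post is a record of them.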

Configuration and the problem

This was the original configuration. DNS resolution tested essentially fine with it.

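    # Both blocks belong in the stream {} context of nginx.conf
    upstream dns {
        server 192.168.10.120:53;
        server 192.168.10.121:53;
        server 192.168.10.122:53;
    }
    server {
        listen 53;        # DNS over TCP
        listen 53 udp;    # DNS over UDP
        proxy_connect_timeout 2s;
        proxy_timeout 2s;    # idle timeout; a UDP session ends only after this elapses
        proxy_next_upstream on;
        proxy_pass dns;
        error_log /usr/local/nginx/logs/dns.log info;
    }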

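Logging was a problem, though. An entry was only written once the session timed out, that is, after proxy_timeout had elapsed. That makes it hard to troubleshoot alongside BIND later on, because the timestamps do not line up. On top of that, BIND can only record the IP address of the nginx machine.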

After some searching I added the following configuration. The key directive is proxy_responses.

    server {
        listen 53;
        listen 53 udp;
        proxy_connect_timeout 2s;
        proxy_responses 1;    # newly added: expect one response datagram per client datagram
        proxy_timeout 2s;
        proxy_next_upstream on;
        proxy_pass dns;
        error_log /usr/local/nginx/logs/dns.log info;
    }

The official nginx documentation describes this directive as follows:

Sets the number of datagrams expected from the proxied server in response to a client datagram if the UDP protocol is used. The number serves as a hint for session termination. By default, the number of datagrams is not limited.

If zero value is specified, no response is expected. However, if a response is received and the session is still not finished, the response will be handled.

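In short, it sets how many response datagrams nginx expects for each client datagram. With a value of 1, nginx treats the session as finished as soon as one response datagram arrives and writes the log entry right away. This did fix the delayed logging.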
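As a complement to the error log, a stream access log can record the client address and the selected upstream for each session, which may help when matching entries against BIND's query log. The following is a minimal sketch and not part of the configuration used in this post: the format name dns_access and the log file path are made up for illustration, and it assumes nginx 1.11.4 or later, where the stream log module is available. Like the error log entry, the access log line is written when the session ends.

    # in the stream {} context
    log_format dns_access '$time_local $protocol client=$remote_addr:$remote_port '
                          'upstream=$upstream_addr status=$status '
                          'rx=$bytes_received tx=$bytes_sent '
                          'session_time=$session_time';

    server {
        listen 53 udp;
        proxy_pass dns;
        # hypothetical path and format name, for illustration only
        access_log /usr/local/nginx/logs/dns_access.log dns_access;
    }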

A new problem

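A few days after deploying this configuration a new problem appeared. Browsing, logging in to servers, and similar operations were noticeably slow much of the time, and monitoring even started reporting an error that had not been seen for a long time: domain name resolution failed. I suspected the recent change to the DNS proxy configuration.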

I checked the logs on the nginx server and found the following:

    2024/07/22 11:00:02 [error] 17500#0: *705 upstream timed out (110: Connection timed out) while proxying connection, udp client: 192.168.7.21, server: 0.0.0.0:53, upstream: "192.168.10.120:53", bytes from/to client:86/43, bytes from/to upstream:43/86
    2024/07/22 11:00:04 [error] 17500#0: *1173 upstream timed out (110: Connection timed out) while proxying connection, udp client: 192.168.3.103, server: 0.0.0.0:53, upstream: "192.168.10.122:53", bytes from/to client:100/176, bytes from/to upstream:176/100
    2024/07/22 11:00:08 [error] 17500#0: *1313 upstream timed out (110: Connection timed out) while proxying connection, udp client: 172.23.126.234, server: 0.0.0.0:53, upstream: "192.168.10.121:53", bytes from/to client:76/232, bytes from/to upstream:232/76
    2024/07/22 11:00:08 [error] 17500#0: *38157 no live upstreams while connecting to upstream, udp client: 172.23.127.238, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:37/0, bytes from/to upstream:0/0
    2024/07/22 11:00:08 [error] 17500#0: *38158 no live upstreams while connecting to upstream, udp client: 192.168.3.103, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:29/0, bytes from/to upstream:0/0
    2024/07/22 11:00:08 [error] 17500#0: *38159 no live upstreams while connecting to upstream, udp client: 172.23.126.242, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:41/0, bytes from/to upstream:0/0
    2024/07/22 11:00:08 [error] 17500#0: *38160 no live upstreams while connecting to upstream, udp client: 172.23.126.254, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:33/0, bytes from/to upstream:0/0
    2024/07/22 11:00:09 [error] 17500#0: *38161 no live upstreams while connecting to upstream, udp client: 192.168.3.100, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:40/0, bytes from/to upstream:0/0

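Evidently, since proxy_responses 1 was added, sessions to the backend servers frequently end with an upstream timeout. Once every server in the upstream group has been marked as failed, nginx reports "no live upstreams", which matches the symptoms.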

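I then toggled the directive on and off several times to verify. Whenever it is enabled, a large batch of upstream timeout entries shows up after a short while.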

So the trigger is identified, but I have not yet found the root cause or a fix. This directive leads to upstream timeouts, so for now it stays disabled.
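One direction that might be worth testing, offered as an untested assumption and not a confirmed fix: in the timeout entries the client sent more than a single small query's worth of bytes (86 bytes in against 43 out, for example), which would be consistent with a client sending two queries from the same source port while nginx counts them as one session that never receives all the responses it expects. nginx 1.15.7 added proxy_requests, which limits how many client datagrams are bound to one session, and the max_fails=0 server parameter disables failure accounting so a timed-out session does not take a server out of rotation. A sketch under those assumptions:

    upstream dns {
        # max_fails=0 disables failure accounting for each server
        server 192.168.10.120:53 max_fails=0;
        server 192.168.10.121:53 max_fails=0;
        server 192.168.10.122:53 max_fails=0;
    }
    server {
        listen 53 udp;
        proxy_requests 1;     # assumption: nginx 1.15.7+; one client datagram per session
        proxy_responses 1;    # one response datagram ends the session
        proxy_timeout 2s;
        proxy_pass dns;
        error_log /usr/local/nginx/logs/dns.log info;
    }

Whether this removes the timeouts in this environment would have to be verified the same way as before, by toggling the change and watching the error log.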

The real client IP problem for DNS

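UDP proxying does not support proxy_protocol, so the real client IP cannot be passed to the upstream that way. I did learn that proxy_responses 0 combined with DSR (Direct Server Return) can pass the real IP. If I go down that route later, I will first move the DNS servers onto dedicated machines. For now things stay as they are. Reference: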

https://www.nginx-cn.net/blog/ip-transparency-direct-server-return-nginx-plus-transparent-proxy/
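
The approach described in the referenced article can be sketched roughly as follows. This has not been tested in this environment and rests on assumptions taken from that article: the nginx worker processes have enough privilege to send from non-local source addresses (the article runs them as root), the DNS servers reply directly to the clients, and routing or NAT on the network makes those replies appear to come from the address the clients queried. None of that network-side setup is shown here.

    user root;    # main context; needed for transparent proxy_bind

    stream {
        upstream dns {
            server 192.168.10.120:53;
            server 192.168.10.121:53;
            server 192.168.10.122:53;
        }
        server {
            listen 53 udp;
            # send datagrams upstream with the client's own address and port as the source
            proxy_bind $remote_addr:$remote_port transparent;
            # expect no replies through nginx; BIND answers the client directly
            proxy_responses 0;
            proxy_timeout 1s;
            proxy_pass dns;
        }
    }

With this arrangement BIND would see the client's address as the source of each query, at the cost of nginx no longer seeing the responses.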
