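Background
DNS resolution is currently served by BIND behind an nginx layer-4 (stream) proxy. Some problems came up with this setup, so this post is a record of them.
Configuration and the problem
This was the original configuration. DNS resolution tested fine with it for the most part: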
upstream dns {
    server 192.168.10.120:53;
    server 192.168.10.121:53;
    server 192.168.10.122:53;
}
server {
    listen 53;
    listen 53 udp;
    proxy_connect_timeout 2s;
    proxy_timeout 2s;
    proxy_next_upstream on;
    proxy_pass dns;
    error_log /usr/local/nginx/logs/dns.log info;
}
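Logging, however, was a persistent problem: an entry was only written once the session timed out, that is, after proxy_timeout had elapsed. That makes later troubleshooting together with the BIND service difficult, because the timestamps do not line up. On top of that, BIND can only record the IP address of the nginx machine.
After some searching, I added the following configuration. The key directive is proxy_responses: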
server {
    listen 53;
    listen 53 udp;
    proxy_connect_timeout 2s;
    proxy_responses 1;
    proxy_timeout 2s;
    proxy_next_upstream on;
    proxy_pass dns;
    error_log /usr/local/nginx/logs/dns.log info;
}
The official nginx documentation describes this directive as follows:
Sets the number of datagrams expected from the proxied server in response to a client datagram if the UDP protocol is used. The number serves as a hint for session termination. By default, the number of datagrams is not limited.
If zero value is specified, no response is expected. However, if a response is received and the session is still not finished, the response will be handled.
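In short, it sets how many response datagrams nginx expects for each client datagram. With a value of 1, nginx treats the session as finished as soon as one response datagram arrives and writes the log entry right away. This did fix the delayed logging.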
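For lining up nginx entries with the BIND query log, a stream access log can be easier to read than the info-level error log. The following is a minimal sketch, not something tested in this environment: the format name dns_proxy and the path dns_access.log are made up for illustration, while the variables are standard nginx stream variables. The access log is also written when the session ends, so the timing behaviour described above still applies to it.

```nginx
stream {
    # Hypothetical format name; one line per finished session.
    log_format dns_proxy '$time_iso8601 client=$remote_addr proto=$protocol '
                         'status=$status upstream=$upstream_addr '
                         'session=$session_time '
                         'from_client=$bytes_received to_client=$bytes_sent';

    upstream dns {
        server 192.168.10.120:53;
        server 192.168.10.121:53;
        server 192.168.10.122:53;
    }

    server {
        listen 53;
        listen 53 udp;
        proxy_connect_timeout 2s;
        proxy_responses 1;
        proxy_timeout 2s;
        proxy_next_upstream on;
        proxy_pass dns;

        # Hypothetical log path, next to the existing error log.
        access_log /usr/local/nginx/logs/dns_access.log dns_proxy;
        error_log /usr/local/nginx/logs/dns.log info;
    }
}
```

The client address and the upstream address appear on the same line, which gives something to match against BIND's own log even though BIND only sees the nginx address.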
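A new problem
After a few days with this configuration, a new problem showed up. Browsing the web, logging in to servers and similar operations were noticeably slow at times, and monitoring even started reporting an error last seen long ago: domain name resolution failure. I suspected the recent change to the DNS proxy configuration.
The log on the nginx server showed the following: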
2024/07/22 11:00:02 [error] 17500#0: *705 upstream timed out (110: Connection timed out) while proxying connection, udp client: 192.168.7.21, server: 0.0.0.0:53, upstream: "192.168.10.120:53", bytes from/to client:86/43, bytes from/to upstream:43/86
2024/07/22 11:00:04 [error] 17500#0: *1173 upstream timed out (110: Connection timed out) while proxying connection, udp client: 192.168.3.103, server: 0.0.0.0:53, upstream: "192.168.10.122:53", bytes from/to client:100/176, bytes from/to upstream:176/100
2024/07/22 11:00:08 [error] 17500#0: *1313 upstream timed out (110: Connection timed out) while proxying connection, udp client: 172.23.126.234, server: 0.0.0.0:53, upstream: "192.168.10.121:53", bytes from/to client:76/232, bytes from/to upstream:232/76
2024/07/22 11:00:08 [error] 17500#0: *38157 no live upstreams while connecting to upstream, udp client: 172.23.127.238, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:37/0, bytes from/to upstream:0/0
2024/07/22 11:00:08 [error] 17500#0: *38158 no live upstreams while connecting to upstream, udp client: 192.168.3.103, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:29/0, bytes from/to upstream:0/0
2024/07/22 11:00:08 [error] 17500#0: *38159 no live upstreams while connecting to upstream, udp client: 172.23.126.242, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:41/0, bytes from/to upstream:0/0
2024/07/22 11:00:08 [error] 17500#0: *38160 no live upstreams while connecting to upstream, udp client: 172.23.126.254, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:33/0, bytes from/to upstream:0/0
2024/07/22 11:00:09 [error] 17500#0: *38161 no live upstreams while connecting to upstream, udp client: 192.168.3.100, server: 0.0.0.0:53, upstream: "dns", bytes from/to client:40/0, bytes from/to upstream:0/0
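Evidently, once proxy_responses 1 was added, sessions to the upstream servers often ended with "upstream timed out", and once every upstream server had been marked as failed, nginx reported "no live upstreams". This matches the symptoms.
I then toggled the directive on and off several times to verify. Whenever it was enabled, a large batch of upstream timeout entries appeared after a short while.
So the trigger has been identified, but the root cause and a fix have not: this directive leads to upstream timeouts, so for now it stays disabled.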
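One possible line of investigation, not verified here: the "no live upstreams" entries suggest that all three servers were taken out of rotation by nginx's passive failure accounting, whose defaults are max_fails=1 and fail_timeout=10s. Assuming the timeouts seen with proxy_responses 1 are counted as failed attempts, switching that accounting off should at least keep the servers available. The sketch below is something to test, not a confirmed fix, and it does not explain why the expected responses go missing in the first place.

```nginx
upstream dns {
    # max_fails=0 disables failure accounting for a server, so a timed-out
    # session no longer marks it unavailable for fail_timeout (10s by default).
    server 192.168.10.120:53 max_fails=0;
    server 192.168.10.121:53 max_fails=0;
    server 192.168.10.122:53 max_fails=0;
}
```

If the "no live upstreams" entries disappear with this change while "upstream timed out" entries remain, that would point at the failure accounting as the amplifier rather than at BIND itself.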
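The real client IP problem for DNS
As far as I could find, UDP proxying does not support proxy_protocol, so the real client IP cannot be passed to the upstream servers. I did learn, however, that proxy_responses 0 combined with DSR (Direct Server Return) can preserve the real client IP. If I go that route later, the DNS servers should probably be moved onto dedicated machines first. I am leaving things as they are for now. Reference: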
https://www.nginx-cn.net/blog/ip-transparency-direct-server-return-nginx-plus-transparent-proxy/
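For orientation, the nginx side of such a DSR setup might look roughly like the sketch below. It is untested in this environment and assumes the approach from the referenced article: nginx forwards each datagram with the client's own address and port as the source, and expects no reply to come back through it.

```nginx
# Transparent binding needs elevated privileges; running the workers
# as root is the blunt way to get them (assumption, depends on the setup).
user root;

events {}

stream {
    upstream dns {
        server 192.168.10.120:53;
        server 192.168.10.121:53;
        server 192.168.10.122:53;
    }

    server {
        listen 53 udp;
        # Use the client's address and port as the source of the proxied datagram.
        proxy_bind $remote_addr:$remote_port transparent;
        # No response is expected via nginx; the upstream answers the client directly.
        proxy_responses 0;
        proxy_timeout 2s;
        proxy_pass dns;
    }
}
```

This is only half of the picture. The upstream servers also need their own network changes so that replies reach the client and appear to come from the address the client queried; the referenced article covers that part.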