Random timeout failures on Java SFTP Pipeline and SO_KEEPALIVE

This post is about a customer case I worked on some time ago. The customer was using our Java-based ETL application where they had a Pipeline to read files from a local server directory, parse some information read from these files, and finally send the files to a remote SFTP server. It was a streaming Pipeline (not a batch Pipeline), which means that it’s continuously running, checking every few seconds the origin directory for new files, and processing/sending them to the destination as soon as they arrive. The Pipeline does not open a new SFTP connection every time it sends a new file. Instead, the Pipeline keeps a SFTP connection open permanently against the SFTP server, which is more efficient because there are no new connections overhead.

The customer stated that the Pipeline was working fine, but sometimes randomly starts to fail and new files in the origin directory are not being sent to the destination, taking 10 minutes or more to be copied onto the destination. On the Pipeline application log we found several traces with the error message ‘Connection timed out (Read failed)‘ related to the SFTP connection, and we also noticed several retries from the Pipeline trying to list the remote SFTP directory. The customer also mentioned that if they restart the Pipeline when they start having these errors, then the Pipeline continues processing the pending files to be processed without issues.

java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_261]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_261]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_261]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_261]
        at net.schmizz.sshj.transport.Reader.run(Reader.java:50) [sshj-0.38.0.jar:?]

After these errors, the customer also verified that the remote SFTP server works without issues, by manually opening a new connection with a SFTP client and checking that they can list, read, and write in the SFTP server directory using the same SFTP user that they use on the Pipeline. The issue on the Pipeline happens randomly at some point on an already established SFTP connection.

In order to know at which stage of the connection happens the issue, and what happens with the network packets around that time, we need the complete network flow. In this case, the best way to get the network capture is by running a tcpdump command in the background, and stopping the capture once the error happens. Given we are going to run a tcpdump capture in the background, it’s important to limit the size of the capture in order to avoid filling the server disk; we can do that by using the tcpdump flags ‘-C 100 -W 5’ (a maximum of 5 files of 100 MB each one). I asked the customer to run tcpdump captures on both the application server, and also on the SFTP server, filtering through port 22, and using the mentioned flags.

Once the customer sent us the tcpdump pcap files, by opening the capture file taken on the application server side, and filtering with ‘grep’ the time around the issue reoccurred, I found an existing connection from the application server (10.200.70.45:42154) destined to the SFTP server (10.100.90.201:22) which was experiencing a network retransmission. The packet with ‘seq 0:68‘ is being sent to the remote side, but we are not receiving any response.

$ tcpdump -nn -tttt -r /tmp/tcpdump_AppServer_port22_UTC.pcap0 host 10.100.90.201 and port 42154
reading from file /tmp/tcpdump_AppServer_port22_UTC.pcap0, link-type LINUX_SLL (Linux cooked)
...
2024-06-12 12:23:36.071936 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:36.297941 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:36.751002 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:37.656939 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:39.468937 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:43.096930 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:23:50.344928 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:24:04.824949 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:24:33.816954 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:25:31.800946 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:27:27.768949 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:29:28.088956 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
2024-06-12 12:31:28.408931 IP 10.200.70.45.42154 > 10.100.90.201.22: Flags [P.], seq 0:68, ack 1, win 378
...

Looking at the tcpdump capture taken on the SFTP server side, we saw that the mentioned packet never arrived on the SFTP server side, which means that there is an issue at network level caused by a network device like a network router or network switch. The issue at this point is that if no data is sent for a period (often 5–30 minutes), the firewall or the network device “forgets” the connection and silently drops subsequent packets. The solution is to enable Keep-Alive messages in the sshj configuration to send periodic “heartbeat” packets that keep the connection active, but this change requires a change to the source code of our Java application, which can not be done immediately.

Another option is to modify the ‘net.ipv4.tcp_keepalive_time‘ at OS/kernel level (using sysctl) on the application server side to reduce it from the default of 2 hours (7200) to 2 minutes (120) for example. Before telling to the customer to apply this solution, I tested this change in my POC environment but I didn’t get the expected behavior. I created a similar Pipeline as the customer had, set ‘tcp_keepalive_time’ to 2 minutes, restarted the Java application, started the Pipeline, waited for the Pipeline to process the current files, and after waiting more than 2 minutes on the idle SFTP connection, I didn’t see any keepalive packet (ACK packet) sent from my application server side during this interval. Why? The reason is that the ‘tcp_keepalive_time‘ doesn’t apply to all sockets, it applies to sockets with the ‘SO_KEEPALIVE‘ option.

Regarding the socket option ‘SO_KEEPALIVE‘, as an example, a strace capture on a Linux SFTP client process, focusing on the connection part, looks as follows, where ‘SO_KEEPALIVE‘ option is set.

# LinuxSFTP client
8663  03:42:46.722665 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
8663  03:42:46.722898 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
8663  03:42:46.723081 gettimeofday({tv_sec=1718250166, tv_usec=723119}, NULL) = 0
8663  03:42:46.723177 connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("51.79.160.8")}, 16) = 0
8663  03:42:46.933472 setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
8663  03:42:46.933646 getpeername(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("51.79.160.8")}, [128->16]) = 0
8663  03:42:46.933765 getpeername(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("51.79.160.8")}, [128->16]) = 0
8663  03:42:46.933942 getsockname(3, {sa_family=AF_INET, sin_port=htons(56022), sin_addr=inet_addr("10.10.63.247")}, [128->16]) = 0
8663  03:42:46.934053 getsockname(3, {sa_family=AF_INET, sin_port=htons(56022), sin_addr=inet_addr("10.10.63.247")}, [128->16]) = 0

As a comparison, below you have a strace capture attached to the Java application, focusing on the part where the SFTP connection is created. Notice that ‘SO_KEEPALIVE’ is not set on the ‘setsockopt‘ syscall, so OS ‘tcp_keepalive_time’ won’t be applied in this connection.

# JavaSSHJ client
6672  03:47:42.793776 socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 8672
6672  03:47:42.793833 setsockopt(8672, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
6672  03:47:42.793991 connect(8672, {sa_family=AF_INET6, sin6_port=htons(22), inet_pton(AF_INET6, "::ffff:51.79.160.8", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28 <unfinished ...>

The final solution is to set the OS Keepalive (tcp_keepalive_time) on the SFTP server side to 2 minutes (sshd server sets SO_KEEPALIVE option on the socket for the connections it receives), and this will make to generate keepalive probes/packets from SFTP server side when the connection is inactive for more than 2 minutes, and this will prevent the connections being dropped by the network devices. As a final example output, below you have thetcpdump capture of a SFTP connection which has been inactive for 2 minutes; we can see that OS keepalive is working as expected from the SFTP server side, by generating ACK packets when the connection is inactive for more than 2 minutes. .

# Keepalive probes from SFTP server side
...
05:28:32.521873 IP 51.79.160.8.22 > 35.162.35.89.18755: Flags [P.], seq 2974:3042, ack 2906, win 262, options [nop,nop,TS val 2396797466 ecr 1186115941], length 68
05:28:32.747090 IP 35.162.35.89.18755 > 51.79.160.8.22: Flags [P.], seq 2906:2958, ack 3042, win 291, options [nop,nop,TS val 1186116167 ecr 2396797466], length 52
05:28:32.747431 IP 51.79.160.8.22 > 35.162.35.89.18755: Flags [P.], seq 3042:3110, ack 2958, win 262, options [nop,nop,TS val 2396797691 ecr 1186116167], length 68
05:28:33.012247 IP 35.162.35.89.18755 > 51.79.160.8.22: Flags [.], ack 3110, win 291, options [nop,nop,TS val 1186116432 ecr 2396797691], length 0
---
05:30:33.023772 IP 51.79.160.8.22 > 35.162.35.89.18755: Flags [.], ack 2958, win 262, options [nop,nop,TS val 2396917968 ecr 1186116432], length 0
05:30:33.261010 IP 35.162.35.89.18755 > 51.79.160.8.22: Flags [.], ack 3110, win 291, options [nop,nop,TS val 1186236684 ecr 2396797691], length 0
---
05:32:33.391751 IP 51.79.160.8.22 > 35.162.35.89.18755: Flags [.], ack 2958, win 262, options [nop,nop,TS val 2397038336 ecr 1186236684], length 0
05:32:33.628734 IP 35.162.35.89.18755 > 51.79.160.8.22: Flags [.], ack 3110, win 291, options [nop,nop,TS val 1186357055 ecr 2396797691], length 0
...

Adib Ahmed Akhtar

Random timeout failures on Java SFTP Pipeline and SO_KEEPALIVE

More posts

Random timeout failures on Java SFTP Pipeline and SO_KEEPALIVE

How Kubernetes CoreDNS works

Understanding Kubernetes node network interfaces

Auditd rules for Kubernetes nodes