After reducing UDP stack traversal overhead via GSO and GRO,
runtime.chanrecv() began to account for a high percentage (20% in one
environment) of perf samples during a throughput benchmark. The
individual packet channel ops with the crypto goroutines was the primary
contributor to this overhead.
Updating these channels to pass vectors, which the device package
already handles at its ends, reduced this overhead substantially, and
improved throughput.
The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly ~13us of round
trip latency between them.
The first result is with UDP GSO and GRO, and with single element
channels.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 sender
[ 5] 0.00-10.04 sec 12.3 GBytes 10.6 Gbits/sec receiver
The second result is with channels updated to pass a slice of
elements.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 13.2 GBytes 11.3 Gbits/sec 182 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 13.2 GBytes 11.3 Gbits/sec 182 sender
[ 5] 0.00-10.04 sec 13.2 GBytes 11.3 Gbits/sec receiver
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
StdNetBind probes for UDP GSO and GRO support at runtime. UDP GSO is
dependent on checksum offload support on the egress netdev. UDP GSO
will be disabled in the event sendmmsg() returns EIO, which is a strong
signal that the egress netdev does not support checksum offload.
The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly ~13us of round
trip latency between them.
The first result is from commit 052af4a without UDP GSO or GRO.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 9.85 GBytes 8.46 Gbits/sec 1139 3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 9.85 GBytes 8.46 Gbits/sec 1139 sender
[ 5] 0.00-10.04 sec 9.85 GBytes 8.42 Gbits/sec receiver
The second result is with UDP GSO and GRO.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 sender
[ 5] 0.00-10.04 sec 12.3 GBytes 10.6 Gbits/sec receiver
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
If an IPC operation is in flight while close starts, it is possible for
both processes to deadlock. Prevent this by taking the IPC lock at the
start of close and for the duration.
Signed-off-by: James Tucker <jftucker@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
We can't have the netlink listener socket, so it's not possible to
support it. Plus, android networking stack complexity makes it a bit
tricky anyway, so best to leave it disabled.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There's not really a use at the moment for making this configurable, and
once bind_windows.go behaves like bind_std.go, we'll be able to use
constants everywhere. So begin that simplification now.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Implement TCP offloading via TSO and GRO for the Linux tun.Device, which
is made possible by virtio extensions in the kernel's TUN driver.
Delete conn.LinuxSocketEndpoint in favor of a collapsed conn.StdNetBind.
conn.StdNetBind makes use of recvmmsg() and sendmmsg() on Linux. All
platforms now fall under conn.StdNetBind, except for Windows, which
remains in conn.WinRingBind, which still needs to be adjusted to handle
multiple packets.
Also refactor sticky sockets support to eventually be applicable on
platforms other than just Linux. However Linux remains the sole platform
that fully implements it for now.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Accept packet vectors for reading and writing in the tun.Device and
conn.Bind interfaces, so that the internal plumbing between these
interfaces now passes a vector of packets. Vectors move untouched
between these interfaces, i.e. if 128 packets are received from
conn.Bind.Read(), 128 packets are passed to tun.Device.Write(). There is
no internal buffering.
Currently, existing implementations are only adjusted to have vectors
of length one. Subsequent patches will improve that.
Also, as a related fixup, use the unix and windows packages rather than
the syscall package when possible.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
For some reason, this was omitted for response messages.
Reported-by: z <dzm@unexpl0.red>
Fixes: 8c34c4c ("First set of code review patches")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
We missed a function exit point. This was exacerbated by e3134bf
("device: defer state machine transitions until configuration is
complete"), but the bug existed prior. Minus provided the following
useful reproducer script:
#!/usr/bin/env bash
set -eux
make wireguard-go || exit 125
ip netns del test-ns || true
ip netns add test-ns
ip link add test-kernel type wireguard
wg set test-kernel listen-port 0 private-key <(echo "QMCfZcp1KU27kEkpcMCgASEjDnDZDYsfMLHPed7+538=") peer "eDPZJMdfnb8ZcA/VSUnLZvLB2k8HVH12ufCGa7Z7rHI=" allowed-ips 10.51.234.10/32
ip link set test-kernel netns test-ns up
ip -n test-ns addr add 10.51.234.1/24 dev test-kernel
port=$(ip netns exec test-ns wg show test-kernel listen-port)
ip link del test-go || true
./wireguard-go test-go
wg set test-go private-key <(echo "WBM7qimR3vFk1QtWNfH+F4ggy/hmO+5hfIHKxxI4nF4=") peer "+nj9Dkqpl4phsHo2dQliGm5aEiWJJgBtYKbh7XjeNjg=" allowed-ips 0.0.0.0/0 endpoint 127.0.0.1:$port
ip addr add 10.51.234.10/24 dev test-go
ip link set test-go up
ping -c2 -W1 10.51.234.1
Reported-by: minus <minus@mnus.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The deferred RUnlock calls weren't executing until all peers
had been processed. Add an anonymous function so that each
peer may be unlocked as soon as it is completed.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There are more places where we'll need to add it later, when Go 1.18
comes out with support for it in the "net" package. Also, allowedips
still uses slices internally, which might be suboptimal.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
A peer.endpoint never becomes nil after being not-nil, so creation is
the only time we actually need to set this. This prevents a race from
when the variable is actually used elsewhere, and allows us to avoid an
expensive atomic.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Forgetting to seed the unsafe rng, the jitter before followed a fixed
pattern, which didn't help when a fleet of computers all boot at once.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Trying this for every peer winds up being very slow and precludes it
from acceptable runtime in the CI, so reduce this to 4.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Now that we have parent pointers hooked up, we can simply go right to
the node and remove it in place, rather than having to recursively walk
the entire trie.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This makes the insertion algorithm a bit more efficient, while also now
taking on the additional task of connecting up parent pointers. This
will be handy in the following commit.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Heavier network extensions might require the wireguard-go component to
use less ram, so let users of this reduce these as needed.
At some point we'll put this behind a configuration method of sorts, but
for now, just expose the consts as vars.
Requested-by: Josh Bleecher Snyder <josh@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
On Linux we can run `ip link del wg0`, in which case the fd becomes
stale, and we should exit. Since this is an intentional action, don't
treat it as an error.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When debugging, it's useful to know why a receive func exited.
We were already logging that, but only in the "death spiral" case.
Move the logging up, to capture it always.
Reduce the verbosity, since it is not an error case any more.
Put the receive func name in the log line.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
Note: this bug is "hidden" by avoiding "death spiral" code path by
6228659 ("device: handle broader range of errors in RoutineReceiveIncoming").
If the code reached "death spiral" mechanism, there would be multiple
double frees happening. This results in a deadlock on iOS, because the
pools are fixed size and goroutine might stop until somebody makes
space in the pool.
This was almost 100% repro on the new ARM Macbooks:
- Build with 'ios' tag for Mac. This will enable bounded pools.
- Somehow call device.IpcSet at least couple of times (update config)
- device.BindUpdate() would be triggered
- RoutineReceiveIncoming would enter "death spiral".
- RoutineReceiveIncoming would stall on double free (pool is already
full)
- The stuck routine would deadlock 'device.closeBindLocked()' function
on line 'netc.stopping.Wait()'
Signed-off-by: Kristupas Antanavičius <kristupas.antanavicius@nordsec.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Instead of hard-coding exactly two sources from which
to receive packets (an IPv4 source and an IPv6 source),
allow the conn.Bind to specify a set of sources.
Beneficial consequences:
* If there's no IPv6 support on a system,
conn.Bind.Open can choose not to return a receive function for it,
which is simpler than tracking that state in the bind.
This simplification removes existing data races from both
conn.StdNetBind and bindtest.ChannelBind.
* If there are more than two sources on a system,
the conn.Bind no longer needs to add a separate muxing layer.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
RoutineReceiveIncoming exits immediately on net.ErrClosed,
but not on other errors. However, for errors that are known
to be permanent, such as syscall.EAFNOSUPPORT,
we may as well exit immediately instead of retrying.
This considerably speeds up the package device tests right now,
because the Bind sometimes (incorrectly) returns syscall.EAFNOSUPPORT
instead of net.ErrClosed.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>