Commit Graph

55 Commits

Author SHA1 Message Date
4f8b7857f9 change imports to personal fork 2024-01-07 22:03:11 +03:00
Jason A. Donenfeld
542e565baa device: do atomic 64-bit add outside of vector loop
Only bother updating the rxBytes counter once we've processed a whole
vector, since additions are atomic.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-12-11 16:35:57 +01:00
Jordan Whited
7c20311b3d device: reduce redundant per-packet overhead in RX path
Peer.RoutineSequentialReceiver() deals with packet vectors and does not
need to perform timer and endpoint operations for every packet in a
given vector. Changing these per-packet operations to per-vector
improves throughput by as much as 10% in some environments.

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-12-11 16:34:09 +01:00
Jordan Whited
1ec454f253 device: move Queue{In,Out}boundElement Mutex to container type
Queue{In,Out}boundElement locking can contribute to significant
overhead via sync.Mutex.lockSlow() in some environments. These types
are passed throughout the device package as elements in a slice, so
move the per-element Mutex to a container around the slice.

Reviewed-by: Maisem Ali <maisem@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-10-10 15:07:36 +02:00
Jordan Whited
4201e08f1d device: distribute crypto work as slice of elements
After reducing UDP stack traversal overhead via GSO and GRO,
runtime.chanrecv() began to account for a high percentage (20% in one
environment) of perf samples during a throughput benchmark. The
individual packet channel ops with the crypto goroutines was the primary
contributor to this overhead.

Updating these channels to pass vectors, which the device package
already handles at its ends, reduced this overhead substantially, and
improved throughput.

The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly ~13us of round
trip latency between them.

The first result is with UDP GSO and GRO, and with single element
channels.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   sender
[  5]   0.00-10.04  sec  12.3 GBytes  10.6 Gbits/sec        receiver

The second result is with channels updated to pass a slice of
elements.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   sender
[  5]   0.00-10.04  sec  13.2 GBytes  11.3 Gbits/sec        receiver

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-10-10 15:07:36 +02:00
Jason A. Donenfeld
0ad14a89f5 global: buff -> buf
This always struck me as kind of weird and non-standard.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-03-13 17:55:53 +01:00
Jordan Whited
3bb8fec7e4 conn, device, tun: implement vectorized I/O plumbing
Accept packet vectors for reading and writing in the tun.Device and
conn.Bind interfaces, so that the internal plumbing between these
interfaces now passes a vector of packets. Vectors move untouched
between these interfaces, i.e. if 128 packets are received from
conn.Bind.Read(), 128 packets are passed to tun.Device.Write(). There is
no internal buffering.

Currently, existing implementations are only adjusted to have vectors
of length one. Subsequent patches will improve that.

Also, as a related fixup, use the unix and windows packages rather than
the syscall package when possible.

Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-03-10 14:52:13 +01:00
Jason A. Donenfeld
ebbd4a4330 global: bump copyright year
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2023-02-07 20:39:29 -03:00
Jason A. Donenfeld
bb719d3a6e global: bump copyright year
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-20 17:21:32 +02:00
Brad Fitzpatrick
b51010ba13 all: use Go 1.19 and its atomic types
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-04 12:57:30 +02:00
Jason A. Donenfeld
ef8d6804d7 global: use netip where possible now
There are more places where we'll need to add it later, when Go 1.18
comes out with support for it in the "net" package. Also, allowedips
still uses slices internally, which might be suboptimal.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-11-23 22:03:15 +01:00
Jason A. Donenfeld
841756e328 device: simplify allowedips lookup signature
The inliner should handle this for us.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-03 16:29:43 +02:00
Jason A. Donenfeld
7121927b87 device: add ID to repeated routines
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-05-07 12:21:21 +02:00
Jason A. Donenfeld
efb8818550 device: avoid verbose log line during ordinary shutdown sequence
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-05-07 09:39:06 +02:00
Josh Bleecher Snyder
60a26371f4 device: log all errors received by RoutineReceiveIncoming
When debugging, it's useful to know why a receive func exited.

We were already logging that, but only in the "death spiral" case.
Move the logging up, to capture it always.
Reduce the verbosity, since it is not an error case any more.
Put the receive func name in the log line.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-05-06 11:22:13 +02:00
Jason A. Donenfeld
54dbe2471f conn: reconstruct v4 vs v6 receive function based on symtab
This is kind of gross but it's better than the alternatives.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-12 15:35:32 -06:00
Kristupas Antanavičius
d2fd0c0cc0 device: allocate new buffer in receive death spiral
Note: this bug is "hidden" by avoiding "death spiral" code path by
6228659 ("device: handle broader range of errors in RoutineReceiveIncoming").

If the code reached "death spiral" mechanism, there would be multiple
double frees happening. This results in a deadlock on iOS, because the
pools are fixed size and goroutine might stop until somebody makes
space in the pool.

This was almost 100% repro on the new ARM Macbooks:

- Build with 'ios' tag for Mac. This will enable bounded pools.
- Somehow call device.IpcSet at least couple of times (update config)
- device.BindUpdate() would be triggered
- RoutineReceiveIncoming would enter "death spiral".
- RoutineReceiveIncoming would stall on double free (pool is already
  full)
- The stuck routine would deadlock 'device.closeBindLocked()' function
  on line 'netc.stopping.Wait()'

Signed-off-by: Kristupas Antanavičius <kristupas.antanavicius@nordsec.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-12 11:14:53 -06:00
Josh Bleecher Snyder
10533c3e73 all: make conn.Bind.Open return a slice of receive functions
Instead of hard-coding exactly two sources from which
to receive packets (an IPv4 source and an IPv6 source),
allow the conn.Bind to specify a set of sources.

Beneficial consequences:

* If there's no IPv6 support on a system,
  conn.Bind.Open can choose not to return a receive function for it,
  which is simpler than tracking that state in the bind.
  This simplification removes existing data races from both
  conn.StdNetBind and bindtest.ChannelBind.
* If there are more than two sources on a system,
  the conn.Bind no longer needs to add a separate muxing layer.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-02 11:07:08 -06:00
Josh Bleecher Snyder
6228659a91 device: handle broader range of errors in RoutineReceiveIncoming
RoutineReceiveIncoming exits immediately on net.ErrClosed,
but not on other errors. However, for errors that are known
to be permanent, such as syscall.EAFNOSUPPORT,
we may as well exit immediately instead of retrying.

This considerably speeds up the package device tests right now,
because the Bind sometimes (incorrectly) returns syscall.EAFNOSUPPORT
instead of net.ErrClosed.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-03-30 12:41:43 -07:00
Jason A. Donenfeld
4e439ea10e conn: bump to 1.16 and get rid of NetErrClosed hack
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-16 21:05:25 +01:00
Jason A. Donenfeld
da32fe328b device: handshake routine writes into encryption queue
Since RoutineHandshake calls peer.SendKeepalive(), it potentially is a
writer into the encryption queue, so we need to bump the wg count.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 19:26:45 +01:00
Jason A. Donenfeld
6ac1240821 device: do not attach finalizer to non-returned object
Before, the code attached a finalizer to an object that wasn't returned,
resulting in immediate garbage collection. Instead return the actual
pointer.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Josh Bleecher Snyder
d8dd1f254f device: remove mutex from Peer send/receive
The immediate motivation for this change is an observed deadlock.

1. A goroutine calls peer.Stop. That calls peer.queue.Lock().
2. Another goroutine is in RoutineSequentialReceiver.
   It receives an elem from peer.queue.inbound.
3. The peer.Stop goroutine calls close(peer.queue.inbound),
   close(peer.queue.outbound), and peer.stopping.Wait().
   It blocks waiting for RoutineSequentialReceiver
   and RoutineSequentialSender to exit.
4. The RoutineSequentialReceiver goroutine calls peer.SendStagedPackets().
   SendStagedPackets attempts peer.queue.RLock().
   That blocks forever because the peer.Stop
   goroutine holds a write lock on that mutex.

A background motivation for this change is that it can be expensive
to have a mutex in the hot code path of RoutineSequential*.

The mutex was necessary to avoid attempting to send elems on a closed channel.
This commit removes that danger by never closing the channel.
Instead, we send a sentinel nil value on the channel to indicate
to the receiver that it should exit.

The only problem with this is that if the receiver exits,
we could write an elem into the channel which would never get received.
If it never gets received, it cannot get returned to the device pools.

To work around this, we use a finalizer. When the channel can be GC'd,
the finalizer drains any remaining elements from the channel and
restores them to the device pool.

After that change, peer.queue.RWMutex no longer makes sense where it is.
It is only used to prevent concurrent calls to Start and Stop.
Move it to a more sensible location and make it a plain sync.Mutex.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 13:02:52 -08:00
Josh Bleecher Snyder
0bcb822e5b device: overhaul device state management
This commit simplifies device state management.
It creates a single unified state variable and documents its semantics.

It also makes state changes more atomic.
As an example of the sort of bug that occurred due to non-atomic state changes,
the following sequence of events used to occur approximately every 2.5 million test runs:

* RoutineTUNEventReader received an EventDown event.
* It called device.Down, which called device.setUpDown.
* That set device.state.changing, but did not yet attempt to lock device.state.Mutex.
* Test completion called device.Close.
* device.Close locked device.state.Mutex.
* device.Close blocked on a call to device.state.stopping.Wait.
* device.setUpDown then attempted to lock device.state.Mutex and blocked.

Deadlock results. setUpDown cannot progress because device.state.Mutex is locked.
Until setUpDown returns, RoutineTUNEventReader cannot call device.state.stopping.Done.
Until device.state.stopping.Done gets called, device.state.stopping.Wait is blocked.
As long as device.state.stopping.Wait is blocked, device.state.Mutex cannot be unlocked.
This commit fixes that deadlock by holding device.state.mu
when checking that the device is not closed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder
9c75f58f3d device: remove device.state.stopping from RoutineHandshake
It is no longer necessary.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Josh Bleecher Snyder
84a42aed63 device: remove device.state.stopping from RoutineDecryption
It is no longer necessary, as of 454de6f3e64abd2a7bf9201579cd92eea5280996
(device: use channel close to shut down and drain decryption channel).

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Jason A. Donenfeld
beb25cc4fd device: use new model queues for handshakes
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:24:45 +01:00
Jason A. Donenfeld
9263014ed3 device: simplify peer queue locking
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 16:21:53 +01:00
Jason A. Donenfeld
d4112d9096 global: bump copyright
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 17:52:15 +01:00
Jason A. Donenfeld
1b092ce584 device: get rid of nonce routine
This moves to a simple queue with no routine processing it, to reduce
scheduler pressure.

This splits latency in half!

benchmark                  old ns/op     new ns/op     delta
BenchmarkThroughput-16     2394          2364          -1.25%
BenchmarkLatency-16        259652        120810        -53.47%

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 18:38:27 +01:00
Jason A. Donenfeld
d669c78c43 device: combine debug and info log levels into 'verbose'
There are very few cases, if any, in which a user only wants one of
these levels, so combine it into a single level.

While we're at it, reduce indirection on the loggers by using an empty
function rather than a nil function pointer. It's not like we have
retpolines anyway, and we were always calling through a function with a
branch prior, so this seems like a net gain.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-26 23:05:48 +01:00
Josh Bleecher Snyder
7139279cd0 device: change logging interface to use functions
This commit overhauls wireguard-go's logging.

The primary, motivating change is to use a function instead
of a *log.Logger as the basic unit of logging.
Using functions provides a lot more flexibility for
people to bring their own logging system.

It also introduces logging helper methods on Device.
These reduce line noise at the call site.
They also allow for log functions to be nil;
when nil, instead of generating a log line and throwing it away,
we don't bother generating it at all.
This spares allocation and pointless work.

This is a breaking change, although the fix required
of clients is fairly straightforward.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 22:40:20 +01:00
Jason A. Donenfeld
294d3bedf9 device: allow compiling with Go 1.15
Until we depend on Go 1.16 (which isn't released yet), alias our own
variable to the private member of the net package. This will allow an
easy find replace to make this go away when we eventually switch to
1.16.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-20 20:12:32 +01:00
Josh Bleecher Snyder
a86492a567 device: remove QueueInboundElement.dropped
Now that we block when enqueueing to the decryption queue,
there is only one case in which we "drop" a inbound element,
when decryption fails.

We can use a simple, obvious, sync-free sentinel for that, elem.packet == nil.
Also, we can return the message buffer to the pool slightly later,
which further simplifies the code.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:06 +01:00
Josh Bleecher Snyder
2fe19ce54d device: remove selects from encrypt/decrypt/inbound/outbound enqueuing
Block instead. Backpressure here is fine, probably preferable.
This reduces code complexity.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:00 +01:00
Josh Bleecher Snyder
48c3b87eb8 device: use channel close to shut down and drain decryption channel
This is similar to commit e1fa1cc556,
but for the decryption channel.

It is an alternative fix to f9f655567930a4cd78d40fa4ba0d58503335ae6a.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:56:54 +01:00
Jason A. Donenfeld
ea6c1cd7e6 device: receive: do not exit immediately on transient UDP receive errors
Some users report seeing lines like:

> Routine: receive incoming IPv4 - stopped

Popping up unexpectedly. Let's sleep and try again before failing, and
also log the error, and perhaps we'll eventually understand this
situation better in future versions.

Because we have to distinguish between the socket being closed
explicitly and whatever error this is, we bump the module to require Go
1.16.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-08 14:30:04 +01:00
Jason A. Donenfeld
29b0477585 device: receive: drain decryption queue before exiting RoutineDecryption
It's possible for RoutineSequentialReceiver to try to lock an elem after
RoutineDecryption has exited. Before this meant we didn't then unlock
the elem, so the whole program deadlocked.

As well, it looks like the flush code (which is now potentially
unnecessary?) wasn't properly dropping the buffers for the
not-already-dropped case.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 17:08:41 +01:00
Josh Bleecher Snyder
b5f966ac24 device: remove QueueInboundElement leak with stopped peers
This is particularly problematic on mobile,
where there is a fixed number of elements.
If most of them leak, it'll impact performance;
if all of them leak, the device will permanently deadlock.

I have a test that detects element leaks, which is how I found this one.
There are some remaining leaks that I have not yet tracked down,
but this is the most prominent by far.

I will commit the test when it passes reliably.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Brad Fitzpatrick
e9edc16349 device: fix error shadowing before log print
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
c8faa34cde device: always name *Queue*Element variables elem
They're called elem in most places.
Rename a few local variables to make it consistent.
This makes it easier to grep the code for things like elem.Drop.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
41cd68416c device: simplify copying counter to nonce
Since we already have it packed into a uint64
in a known byte order, write it back out again
the same byte order instead of copying byte by byte.

This should also generate more efficient code,
because the compiler can do a single uint64 write,
instead of eight bounds checks and eight byte writes.

Due to a missed optimization, it actually generates a mishmash
of smaller writes: 1 byte, 4 bytes, 2 bytes, 1 byte.
This is https://golang.org/issue/41663.
The code is still better than before, and will get better yet
once that compiler bug gets fixed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
c9e4a859ae device: remove starting waitgroups
In each case, the starting waitgroup did nothing but ensure
that the goroutine has launched.

Nothing downstream depends on the order in which goroutines launch,
and if the Go runtime scheduler is so broken that goroutines
don't get launched reasonably promptly, we have much deeper problems.

Given all that, simplify the code.

Passed a race-enabled stress test 25,000 times without failure.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
d3ff2d6b62 device: clear pointers when returning elems to pools
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2020-12-08 14:25:02 -08:00
Haichao Liu
913f68ce38 device: add write queue mutex for peer
fix panic: send on closed channel when remove peer

Signed-off-by: Haichao Liu <liuhaichao@bytedance.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-18 14:22:15 +01:00
Jason A. Donenfeld
db0aa39b76 global: update header comments and modules
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-02 02:08:26 -06:00
David Crawshaw
203554620d conn: introduce new package that splits out the Bind and Endpoint types
The sticky socket code stays in the device package for now,
as it reaches deeply into the peer list.

This is the first step in an effort to split some code out of
the very busy device package.

Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
2020-05-02 01:46:42 -06:00
Jason A. Donenfeld
f361e59001 device: receive: uniform message for source address check 2019-07-01 15:24:50 +02:00
Jason A. Donenfeld
dd8817f50e device: receive: simplify flush loop 2019-07-01 15:23:24 +02:00
Jason A. Donenfeld
3371f8dac6 device: update transfer counters correctly
The rule is to always update them to the full packet size minus UDP/IP
encapsulation for all authenticated packet types.
2019-06-11 18:13:52 +02:00