Commit Graph

49 Commits

Author SHA1 Message Date
Jason A. Donenfeld
da32fe328b device: handshake routine writes into encryption queue
Since RoutineHandshake calls peer.SendKeepalive(), it potentially is a
writer into the encryption queue, so we need to bump the wg count.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 19:26:45 +01:00
Josh Bleecher Snyder
4eab21a7b7 device: make RoutineReadFromTUN keep encryption queue alive
RoutineReadFromTUN can trigger a call to SendStagedPackets.
SendStagedPackets attempts to protect against sending
on the encryption queue by checking peer.isRunning and device.isClosed.
However, those are subject to TOCTOU bugs.

If that happens, we get this:

goroutine 1254 [running]:
golang.zx2c4.com/wireguard/device.(*Peer).SendStagedPackets(0xc000798300)
        .../wireguard-go/device/send.go:321 +0x125
golang.zx2c4.com/wireguard/device.(*Device).RoutineReadFromTUN(0xc000014780)
        .../wireguard-go/device/send.go:271 +0x21c
created by golang.zx2c4.com/wireguard/device.NewDevice
        .../wireguard-go/device/device.go:315 +0x298

Fix this with a simple, big hammer: Keep the encryption queue
alive as long as it might be written to.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 09:53:00 -08:00
Josh Bleecher Snyder
cae090d116 device: clarify device.state.state docs (again)
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:29:01 +01:00
Jason A. Donenfeld
9e728c2eb0 device: rename unsafeRemovePeer to removePeerLocked
This matches the new naming scheme of upLocked and downLocked.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 16:11:33 +01:00
Jason A. Donenfeld
eaf664e4e9 device: remove deviceStateNew
It's never used and we won't have a use for it. Also, move to go-running
stringer, for those without GOPATHs.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:39:19 +01:00
Jason A. Donenfeld
a816e8511e device: fix comment typo and shorten state.mu.Lock to state.Lock
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld
02138f1f81 device: fix typo in comment
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld
d7bc7508e5 device: fix alignment on 32-bit machines and test for it
The test previously checked the offset within a substruct, not the
offset within the allocated struct, so this adds the two together.

It then fixes an alignment crash on 32-bit machines.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld
d6e76fdbd6 device: do not log on idempotent device state change
Part of being actually idempotent is that we shouldn't penalize code
that takes advantage of this property with a log splat.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Josh Bleecher Snyder
57aadfcb14 device: create channels.go
We have a bunch of stupid channel tricks, and I'm about to add more.
Give them their own file. This commit is 100% code movement.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 12:38:19 -08:00
Josh Bleecher Snyder
3516ccc1e2 device: remove device.state.stopping from RoutineTUNEventReader
The TUN event reader does three things: Change MTU, device up, and device down.
Changing the MTU after the device is closed does no harm.
Device up and device down don't make sense after the device is closed,
but we can check that condition before proceeding with changeState.
There's thus no reason to block device.Close on RoutineTUNEventReader exiting.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder
0bcb822e5b device: overhaul device state management
This commit simplifies device state management.
It creates a single unified state variable and documents its semantics.

It also makes state changes more atomic.
As an example of the sort of bug that occurred due to non-atomic state changes,
the following sequence of events used to occur approximately every 2.5 million test runs:

* RoutineTUNEventReader received an EventDown event.
* It called device.Down, which called device.setUpDown.
* That set device.state.changing, but did not yet attempt to lock device.state.Mutex.
* Test completion called device.Close.
* device.Close locked device.state.Mutex.
* device.Close blocked on a call to device.state.stopping.Wait.
* device.setUpDown then attempted to lock device.state.Mutex and blocked.

Deadlock results. setUpDown cannot progress because device.state.Mutex is locked.
Until setUpDown returns, RoutineTUNEventReader cannot call device.state.stopping.Done.
Until device.state.stopping.Done gets called, device.state.stopping.Wait is blocked.
As long as device.state.stopping.Wait is blocked, device.state.Mutex cannot be unlocked.
This commit fixes that deadlock by holding device.state.mu
when checking that the device is not closed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder
9c75f58f3d device: remove device.state.stopping from RoutineHandshake
It is no longer necessary.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Josh Bleecher Snyder
84a42aed63 device: remove device.state.stopping from RoutineDecryption
It is no longer necessary, as of 454de6f3e64abd2a7bf9201579cd92eea5280996
(device: use channel close to shut down and drain decryption channel).

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Josh Bleecher Snyder
8a374a35a0 device: tie encryption queue lifetime to the peers that write to it
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 00:57:57 +01:00
Jason A. Donenfeld
4846070322 device: use a waiting sync.Pool instead of a channel
Channels are FIFO which means we have guaranteed cache misses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-02 19:32:13 +01:00
Jason A. Donenfeld
de51129e33 device: use int64 instead of atomic.Value for time stamp
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:57:03 +01:00
Jason A. Donenfeld
beb25cc4fd device: use new model queues for handshakes
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:24:45 +01:00
Jason A. Donenfeld
9263014ed3 device: simplify peer queue locking
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 16:21:53 +01:00
Jason A. Donenfeld
d4112d9096 global: bump copyright
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 17:52:15 +01:00
Jason A. Donenfeld
6a128dde71 device: do not allow get to run while set runs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 15:26:22 +01:00
Jason A. Donenfeld
ace50a0529 device: avoid deadlock when changing private key and removing self peers
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 15:53:21 +01:00
Jason A. Donenfeld
8cc99631d0 device: use linked list for per-peer allowed-ip traversal
This makes the IpcGet method much faster.

We also refactor the traversal API to use a callback so that we don't
need to allocate at all. Avoiding allocations we do self-masking on
insertion, which in turn means that split intermediate nodes require a
copy of the bits.

benchmark               old ns/op     new ns/op     delta
BenchmarkUAPIGet-16     3243          2659          -18.01%

benchmark               old allocs     new allocs     delta
BenchmarkUAPIGet-16     35             30             -14.29%

benchmark               old bytes     new bytes     delta
BenchmarkUAPIGet-16     1218          737           -39.49%

This benchmark is good, though it's only for a pair of peers, each with
only one allowedips. As this grows, the delta expands considerably.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 01:48:58 +01:00
Jason A. Donenfeld
d669c78c43 device: combine debug and info log levels into 'verbose'
There are very few cases, if any, in which a user only wants one of
these levels, so combine it into a single level.

While we're at it, reduce indirection on the loggers by using an empty
function rather than a nil function pointer. It's not like we have
retpolines anyway, and we were always calling through a function with a
branch prior, so this seems like a net gain.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-26 23:05:48 +01:00
Josh Bleecher Snyder
7139279cd0 device: change logging interface to use functions
This commit overhauls wireguard-go's logging.

The primary, motivating change is to use a function instead
of a *log.Logger as the basic unit of logging.
Using functions provides a lot more flexibility for
people to bring their own logging system.

It also introduces logging helper methods on Device.
These reduce line noise at the call site.
They also allow for log functions to be nil;
when nil, instead of generating a log line and throwing it away,
we don't bother generating it at all.
This spares allocation and pointless work.

This is a breaking change, although the fix required
of clients is fairly straightforward.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 22:40:20 +01:00
Josh Bleecher Snyder
cecb41515d device: serialize access to IpcSetOperation
Interleaves IpcSetOperations would spell trouble.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:38:09 -08:00
Josh Bleecher Snyder
7c5d1e355e device: remove unnecessary zeroing
Newly allocated objects are already zeroed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:07 +01:00
Josh Bleecher Snyder
0cc15e7c7c device: put handshake buffer in pool in FlushPacketQueues
This appears to have been an oversight.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:56:59 +01:00
Josh Bleecher Snyder
48c3b87eb8 device: use channel close to shut down and drain decryption channel
This is similar to commit e1fa1cc556,
but for the decryption channel.

It is an alternative fix to f9f655567930a4cd78d40fa4ba0d58503335ae6a.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:56:54 +01:00
Jason A. Donenfeld
29b0477585 device: receive: drain decryption queue before exiting RoutineDecryption
It's possible for RoutineSequentialReceiver to try to lock an elem after
RoutineDecryption has exited. Before this meant we didn't then unlock
the elem, so the whole program deadlocked.

As well, it looks like the flush code (which is now potentially
unnecessary?) wasn't properly dropping the buffers for the
not-already-dropped case.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 17:08:41 +01:00
Josh Bleecher Snyder
1481e72107 all: use ++ to increment
Make the code slightly more idiomatic. No functional changes.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Jason A. Donenfeld
ad73ee78e9 device: add missing colon to error line
People are actually hitting this condition, so make it uniform. Also,
change a printf into a println, to match the other conventions.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
f7bbdc31a0 device: fix data race in peer.timersActive
Found by the race detector and existing tests.

To avoid introducing a lock into this hot path,
calculate and cache whether any peers exist.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
63066ce406 device: fix persistent_keepalive_interval data races
Co-authored-by: David Anderson <danderson@tailscale.com>
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
e1fa1cc556 device: use channel close to shut down and drain encryption channel
The new test introduced in this commit used to deadlock about 1% of the time.

I believe that the deadlock occurs as follows:

* The test completes, calling device.Close.
* device.Close closes device.signals.stop.
* RoutineEncryption stops.
* The deferred function in RoutineEncryption drains device.queue.encryption.
* RoutineEncryption exits.
* A peer's RoutineNonce processes an element queued in peer.queue.nonce.
* RoutineNonce puts that element into the outbound and encryption queues.
* RoutineSequentialSender reads that elements from the outbound queue.
* It waits for that element to get Unlocked by RoutineEncryption.
* RoutineEncryption has already exited, so RoutineSequentialSender blocks forever.
* device.RemoveAllPeers calls peer.Stop on all peers.
* peer.Stop waits for peer.routines.stopping, which blocks forever.

Rather than attempt to add even more ordering to the already complex
centralized shutdown orchestration, this commit moves towards a
data-flow-oriented shutdown.

The device.queue.encryption gets closed when there will be no more writes to it.
All device.queue.encryption readers always read until the channel is closed and then exit.
We thus guarantee that any element that enters the encryption queue also exits it.
This removes the need for central control of the lifetime of RoutineEncryption,
removes the need to drain the encryption queue on shutdown, and simplifies RoutineEncryption.

This commit also fixes a data race. When RoutineSequentialSender
drains its queue on shutdown, it needs to lock the elem before operating on it,
just as the main body does.

The new test in this commit passed 50k iterations with the race detector enabled
and 150k iterations with the race detector disabled, with no failures.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder
c9e4a859ae device: remove starting waitgroups
In each case, the starting waitgroup did nothing but ensure
that the goroutine has launched.

Nothing downstream depends on the order in which goroutines launch,
and if the Go runtime scheduler is so broken that goroutines
don't get launched reasonably promptly, we have much deeper problems.

Given all that, simplify the code.

Passed a race-enabled stress test 25,000 times without failure.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Dmytro Shynkevych
4369db522b device: wait for routines to stop before removing peers
Peers are currently removed after Device's goroutines are signaled to stop,
but without waiting for them to actually do so, which is racy.

For example, RoutineHandshake may be in Peer.SendKeepalive
when the corresponding peer is removed, which closes its nonce channel.
This causes a send on a closed channel, as observed in tailscale/tailscale#487.

This patch seems to be the correct synchronizing action:
Peer's goroutines are receivers and handle channel closure gracefully,
so Device's goroutines are the ones that should be fully stopped first.

Signed-Off-By: Dmytro Shynkevych <dmytro@tailscale.com>
2020-07-04 20:29:31 +10:00
David Crawshaw
b84f1d4db2 device: export Bind and remove socketfd shims for android
Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
2020-06-22 10:42:28 +10:00
Jason A. Donenfeld
db0aa39b76 global: update header comments and modules
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-02 02:08:26 -06:00
David Crawshaw
203554620d conn: introduce new package that splits out the Bind and Endpoint types
The sticky socket code stays in the device package for now,
as it reaches deeply into the peer list.

This is the first step in an effort to split some code out of
the very busy device package.

Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
2020-05-02 01:46:42 -06:00
Jason A. Donenfeld
4739708ca4 noise: unify zero checking of ecdh 2020-03-17 23:07:14 -06:00
Jason A. Donenfeld
6ed56ff2df device: fix private key removal logic 2020-02-04 22:02:53 +01:00
Jason A. Donenfeld
4be9630ddc device: drop lock before expiring keys 2019-08-05 17:46:34 +02:00
Jason A. Donenfeld
a961aacc9f device: immediately rekey all peers after changing device private key
Reported-by: Derrick Pallas <derrick@pallas.us>
2019-07-11 17:37:35 +02:00
Matt Layher
1f48971a80 tun: remove TUN prefix from types to reduce stutter elsewhere
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-06-14 18:35:57 +02:00
Jason A. Donenfeld
d9f995209c device: add SendKeepalivesToPeersWithCurrentKeypair for handover 2019-05-30 15:16:16 +02:00
Jason A. Donenfeld
c718f3940d device: fail to give bind if it doesn't exist 2019-05-17 15:35:20 +02:00
Jason A. Donenfeld
3bf41b06ae global: regroup all imports 2019-05-14 09:09:52 +02:00
Jason A. Donenfeld
69f0fe67b6 global: begin modularization 2019-03-03 05:00:40 +01:00