Tunneled LSO/CSO Support #614

Open
wants to merge 47 commits into
base: master

FelixMcFelix
Collaborator

@FelixMcFelix FelixMcFelix commented Nov 21, 2024

This PR uses in-progress illumos bits to detect the offloads common to the underlay devices, and advertises LSO/CSO capabilities on opte ports. It relies on illumos's emulation functionality to split packets and insert checksums when not on cxgbe (or another compatible NIC).

This allows for a few nice things:

  • Guests will send us TCP packets without an L4 checksum and up to ~64KiB in size.
    • The guest and OPTE spend less time computing/updating checksums.
    • OPTE has to process fewer packets – hooray! Ultimately, that's less processing time per payload byte.
    • If sending out over the NIC, then either the NIC or illumos's LSO emulation path is responsible for splitting the packet into TCP packets which will not violate the MTU. See below on what we tweak here.
    • If sending in the loopback path, we get to hand these packets directly to the target guest without splitting them apart. We do need to insert the checksum ourselves, however.
  • Guests will send us UDP packets without an L4 checksum.
    • Again, the guest and OPTE spend less time computing/updating checksums.
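
For the loopback case above, where the checksum has to be filled in by hand, the arithmetic is the standard Internet checksum (RFC 1071). The following is a minimal sketch of that computation for TCP over IPv4, not the code in this PR: the function names are illustrative, and the real path may be IPv6 and would operate on message blocks rather than byte slices.

```rust
// Illustrative only: a plain RFC 1071 checksum, not OPTE's or illumos's code.

/// Adds `data` to a running sum as big-endian 16-bit words.
/// An odd trailing byte is padded with a zero byte on the right.
fn sum_words(data: &[u8], mut sum: u64) -> u64 {
    let mut words = data.chunks_exact(2);
    for w in &mut words {
        sum += u64::from(u16::from_be_bytes([w[0], w[1]]));
    }
    if let [last] = words.remainder() {
        sum += u64::from(*last) << 8;
    }
    sum
}

/// Folds carries back into the low 16 bits and returns the one's complement.
fn finish(mut sum: u64) -> u16 {
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// TCP checksum over an IPv4 pseudo-header plus the TCP segment.
/// The caller must zero the segment's checksum field (bytes 16..18) first.
fn tcp_checksum_v4(src: [u8; 4], dst: [u8; 4], segment: &[u8]) -> u16 {
    assert!(segment.len() <= usize::from(u16::MAX), "segment too large");
    let mut sum = sum_words(&src, 0);
    sum = sum_words(&dst, sum);
    sum += 6; // zero byte followed by the protocol number (TCP = 6)
    sum += segment.len() as u64; // TCP length field of the pseudo-header
    finish(sum_words(segment, sum))
}
```

A quick sanity check on any such routine: once the computed value is written into the header, recomputing over the same segment should yield zero.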

Ultimately, we have control over the MSS we advertise to the NIC for LSO. The useful part of this is that when we know we are using a purely rack-internal path, we can elevate the MSS up to MTU - overheads. I've added a system of 'well-known' ActionMeta KV pairs which allow for layers like overlay to propagate this knowledge out. Given that the use of a larger MTU vastly reduces inbound packet rate on the receive half (the main bottleneck today), this gets us to around 8Gbps iPerf for rack-local traffic (and 17Gbps for two or more parallel streams) on glasgow.
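
To make the "MTU - overheads" point concrete, here is a rough sketch of the arithmetic. The MTU values (1500 and 9000) and the 40-byte overhead (IPv4 plus TCP without options) are assumptions chosen for illustration, not the values OPTE uses, and both function names are hypothetical.

```rust
// Illustrative only: the MTUs and overheads fed to these helpers are assumptions.

/// Largest TCP payload per segment for a given MTU, after subtracting every
/// header that must also fit. Returns `None` if no payload room is left.
fn lso_mss(mtu: u32, overheads: u32) -> Option<u32> {
    mtu.checked_sub(overheads).filter(|mss| *mss > 0)
}

/// Number of wire segments produced when `payload_len` bytes are split at `mss`.
fn segment_count(payload_len: u32, mss: u32) -> u32 {
    assert!(mss > 0, "MSS must be non-zero");
    payload_len.div_ceil(mss)
}
```

Under those assumed numbers, a 65536-byte send splits into 45 segments at an MSS of 1460 but only 8 at an MSS of 8960, which is the kind of receive-side packet-rate reduction described above.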

On dublin with a full control plane, this resolves to around 4Gbps sled-to-sled and 14Gbps sled-local between Linux VMs – we want to investigate the drop here once this is on dogfood. illumos as a guest doesn't do much better than before since it does not advertise GRO (illumos-as-host will fragment the packets) or LSO (it will never send TCP packets > MTU).

  • Cleanup.
  • Finalise illumos interfaces. stlouis#663
  • Manually fill checksums in guest loopback cases.
  • Have viona perform LSO if guests do not advertise/expect LRO. illumos#17032

Closes #328, closes #329.

Based on #688.

@FelixMcFelix FelixMcFelix added this to the 13 milestone Nov 21, 2024
@FelixMcFelix FelixMcFelix self-assigned this Nov 21, 2024
Needed for older propolis (falcon), which does not preallocate
sufficient headroom for us to push into.
This makes sure that if we're receiving a packet which has benefited
from real/pseudo GRO, viona is able to split it up if the guest can't
actually take those packets.
@morlandi7 morlandi7 modified the milestones: 13, 14 Feb 11, 2025
At the very least, this compiles.

We needed to regenerate `ip.rs`, on account of the many `extern` ->
`unsafe extern` block changes.
CI on recent PRs is breaking, due to rustup 1.28.0+ no longer
autoinstalling the correct rust toolchain version. This hurts us
immediately since we have *two* toolchains (pinned nightly and stable),
and deliberately specified the nightly for some tooling.

This PR changes this over to use buildomat's auto-installation for the
stable variant, and the new toolchain show -> install pattern for
nightly. This also lets us place `$NIGHTLY` into most of our `cargo fmt`
invocations, which should reduce the busywork in future compiler bumps
for XDE.
@FelixMcFelix FelixMcFelix changed the base branch from master to rust-2024 March 5, 2025 13:43
@@ -0,0 +1,107 @@
+// This Source Code Form is subject to the terms of the Mozilla Public
Collaborator Author

Most of this file is moved out of port/mod.rs (formerly port.rs). The discussion of "opte:" reserved keys and the use of INTERNAL_TARGET are new.
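
As a loose illustration of the idea, a layer can record a well-known key in the action metadata and the port can read it back when choosing an MSS. Everything in this sketch is hypothetical: `Meta`, the key string behind `INTERNAL_TARGET`, and `choose_mss` are stand-ins, and the real types, key values, and handling of reserved keys in the moved file may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical key string under the reserved "opte:" prefix; the real
/// constant's value is not shown in this thread.
const INTERNAL_TARGET: &str = "opte:internal-target";

/// Hypothetical stand-in for the per-flow action metadata KV store.
#[derive(Default)]
struct Meta {
    inner: BTreeMap<String, String>,
}

impl Meta {
    /// Called by a layer (e.g. the overlay) that knows the target is rack-internal.
    fn set_internal_target(&mut self, internal: bool) {
        self.inner
            .insert(INTERNAL_TARGET.to_string(), internal.to_string());
    }

    /// Read back by the port once all layers have run.
    fn is_internal_target(&self) -> bool {
        matches!(
            self.inner.get(INTERNAL_TARGET).map(String::as_str),
            Some("true")
        )
    }
}

/// Picks the larger MSS only when a layer has marked the path as internal.
fn choose_mss(meta: &Meta, external_mss: u32, internal_mss: u32) -> u32 {
    if meta.is_internal_target() {
        internal_mss
    } else {
        external_mss
    }
}
```

Defaulting to the external MSS when the key is absent keeps the conservative behaviour for any layer that does not know about the key.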

@@ -204,7 +204,7 @@ impl DlsStream {
     /// but for now we pass only a single packet at a time.
     pub fn tx_drop_on_no_desc(
         &self,
-        pkt: MsgBlk,
+        pkt: impl AsMblk,
Collaborator Author

The AsMblk changes here would ordinarily support later functionality like Tx/Rx of batches of packets (blocked on the flows rework, naturally). I've pulled it forward here because, strictly speaking, mac_hw_emul can turn (1..n) packets into (0..m) packets even if we're only using it to emulate checksums one-at-a-time.
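
A toy model of why the parameter is generic: an emulation step may take one packet or a chain and return zero or more. The names below (`Packet`, `IntoChain`, `emulate`) are hypothetical stand-ins, not the real `MsgBlk`, `AsMblk`, or `mac_hw_emul`, and the split ignores headers entirely.

```rust
/// Hypothetical stand-in for a packet; not OPTE's `MsgBlk`.
#[derive(Debug, Clone, PartialEq)]
struct Packet {
    payload: Vec<u8>,
}

/// Hypothetical stand-in for `AsMblk`: anything usable as a packet chain.
trait IntoChain {
    fn into_chain(self) -> Vec<Packet>;
}

impl IntoChain for Packet {
    fn into_chain(self) -> Vec<Packet> {
        vec![self]
    }
}

impl IntoChain for Vec<Packet> {
    fn into_chain(self) -> Vec<Packet> {
        self
    }
}

/// Toy emulation step: splits each payload into chunks of at most `mss`
/// bytes. An empty packet yields no output, so n inputs can become 0..m outputs.
fn emulate(pkts: impl IntoChain, mss: usize) -> Vec<Packet> {
    assert!(mss > 0, "MSS must be non-zero");
    pkts.into_chain()
        .into_iter()
        .flat_map(|p| {
            p.payload
                .chunks(mss)
                .map(|c| Packet { payload: c.to_vec() })
                .collect::<Vec<_>>()
        })
        .collect()
}
```

Because both a single `Packet` and a `Vec<Packet>` implement the trait, callers that still pass one packet at a time do not need to change when batching arrives.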

@FelixMcFelix FelixMcFelix marked this pull request as ready for review March 7, 2025 17:25
Base automatically changed from rust-2024 to master March 11, 2025 19:49
Successfully merging this pull request may close these issues.

Performance: T6 Geneve aware checksumming
Performance: T6 Geneve aware TSO
2 participants