cjdelisle · April 14, 2019 20:59
diff --git a/gistfile1.txt b/gistfile1.txt
 # Cjdns Packet Prioritization System

 The packet prioritization system is designed to allow access to a link to be exercised by it's
 owner, thus allowing for bandwidth to be sold and used.

 * A packet SHOULD NOT be dropped unless there is actually not enough bandwidth to send everything.
 * A bandwidth lease MUST be able to send at least its bandwidth limit.
 * Unused bandwidth SHOULD be divided between those leases which are currently sending traffic in a
 fair way based on the relative size of the leases.

 What follows is a description of how bandwidth leases are used to decide which packet has priority
 when there is contention and some packet must be dropped.

 ## Background
 Every network switch is a series of incoming paths linked to a series of outgoing paths. Data
 coming *in* to the switch follows the incoming path for the interface which that data comes from
 and then based on the address of the packet, the data is sent down one of the outgoing paths.
 There is always a potential problem that data is coming from all incoming paths and it is all
 destine for the same outgoing path, unless the bandwidth of the outgoing path is at least as much
 as the sum of bandwidth of all incoming paths, the switch needs to make a decision which packets
 will be *dropped*. In our design, there is a
 [multiplexer](https://en.wikipedia.org/wiki/Multiplexer) which merges data coming from the
 different incoming paths. Data is merged into one outgoing path that *is* as fast as the sum of
 bandwidth of all incoming paths, but this fast path is only on-chip and the priority rules are
 applied in order to allow everything to fit on the final outgoing link. See this blog series for
 more information about
 [how network switches work](https://etherealmind.com/whats-happening-inside-an-ethernet-switch-or-network-switches-for-virtualization-people/).

 ## Outgoing path
 The outgoing path of the switch begins with a multiplexer (mux) which converges traffic coming from
 multiple interfaces down to a single *high speed* outgoing path (denoted by the === path) the high
 speed path is fast enough that it can carry as much traffic as all incoming interfaces can send at
 the same time. The objective of drop1, drop2 and ringbuf is to limit that traffic to what the actual
 network link can handle.

 ```
 incomingA---+++++++
            |     |
 incomingB---+     +
            | mux |===< prioritizer >===<drop1>[ ringbuf ]<drop2>--link-->
 incomingC---+     +
            |     |
 incomingD---+++++++
 ```

 * **incoming[A-D]**: Incoming traffic is tagged with the bandwidth lease which that traffic should
 be accounted on, see: Virtual Switches to understand how that tag is decided.
 * **mux**: This is a multiplexer which takes traffic from *n* incoming links and merges it into
 one highspeed outgoing path, the high speed path is fast enough to accommodate as much traffic as
 all incoming links can send at the same time.
 * **prioritizer**: The prioritizer reads the bandwidth lease tag and converts this into a priority tag.
 Supposing you have a switch supporting 8 leases per link, the prioritizer will contain 8 bandwidth
 meters (implemented as [IIR functions](https://en.wikipedia.org/wiki/Infinite_impulse_response)).
 In order to allow network tuning, the alpha for the IIR functions, henceforth `prioritizer_alpha`
 will be configured by the switch owner (or his supernode) and will be public information. When the
 amount of bandwidth used on a bandwidth lease is exactly the amount of the lease, the prioritizer
 will tag packets on that lease with priority 1. If the amount of bandwidth used is more, it will be
 a fractional priority and if it is less, it will be a priority greater than 1.
 * **Drop1 and drop2**: Ideally, packets would be placed in a
 [priority queue](https://en.wikipedia.org/wiki/Priority_queue) and the lowest priority packet would
 be *replaced* when the queue is full and more traffic comes than can be handled. However, this
 would require continuously ordering packets by priority, a significant processing overhead. Rather
 than a queue, we will use a [ring buffer](https://en.wikipedia.org/wiki/Circular_buffer) which is
 the norm in network device queues. A ring buffer can be thought of like a simple queue: You can
 add to the tail and remove from the head but you cannot alter the middle. Therefore we have one
 packet dropped in each place where the queue can be easily manipulated, the head and the tail.
 The packet droppers simply discard any packet with priority lower than `min_priority` where 
 `min_priority` rises as the ring buffer fills. If the ring buffer remains less than or equal to half
 full, `min_priority` should remain at zero. The size of the ring buffer will be determined by the
 hardware maker who designs the switch chip, the mean and the highest value of `min_priority` in a
 given period should be published as well as drop counts for each bandwidth lease.

 ### Packet Priority
 The actual packet priority used by the droppers is a 32 bit value made up of 2 16 bit elements, the
 most significant 16 bits are the priority as defined by the prioritizer, the least significant 16
 bits is the packet priority as defined by a priority field on the packet header. The priority field
 allows bandwidth leaseholders to control how their "virtual networks" are used.

 ```
 real_priority = priority_from_prioritizer * 2**16 +
    priority_from_packet_header
 ```

 ## Incoming path
 The incoming path is where packets are received from an external interface.

 ```
                        ++++++++++++++---> outgoingA
                        |            |
                        +            +---> outgoingB
 --< priority updater >--|cjdns switch|
                        +            +---> outgoingC
                        |            |
                        ++++++++++++++---> outgoingD
 ```

 The 2 main components of the incoming path are the cjdns switch and the priority updater. The cjdns
 switch is already well documented elsewhere so we will focus on the priority updater.

 All incoming traffic is tagged with the bandwidth lease with which it is associated, if
 (for example) the switch supports 8 leases per link, the priority updater will have 8 priority
 meters. While the bandwidth meters in the outgoing path are intended to protect leaseholders
 from each other, the priority updater is intended to ensure fairness within the virtual network
 created by a single bandwidth leaseholder. Also unlike bandwidth meters, priority meters measure
 kilobits TIMES priority per second (Kbp/s). In order to allow a bandwidth leaseholder to
 interconnect with other leaseholders and customers, there is a configurable called
 `bandwidth_priority_limit`. If for example a customer has a 10Gb link to a bandwidth leaseholder
 but they only pay for the equivilant of 1Gb of "network access", the leaseholder can configure
 `bandwidth_priority_limit` to 1Gbp/s for that link, allowing the customer to use all 10Gb when
 the network is idle but limiting them to the equivilent of 1Gb/s of network access during times
 of network contention.

 The way `bandwidth_priority_limit` works is that when each packet comes in, it's bandwidth-priority
 is measured in a IIR function, a number called `priority_multiplier` is initialized at 1 and
 decreased toward zero as the measured bandwidth-priority exceeds `bandwidth_priority_limit`, when
 bandwidth-priority goes below `bandwidth_priority_limit`, `priority_multiplier` is increased until
 it reaches 1 again. For each incoming packet, the priority is multiplied by `priority_multiplier`,
 thus keeping the total bandwidth-priority per second less than or equal to
 `bandwidth_priority_limit`. The leaseholder is clearly able to change the `bandwidth_priority_limit`
 but they are also able to change the alpha used in the IIR function used for measuring
 bandwidth-priority (aka: `bandwidth_priority_alpha`)

 ## Free tier
 Every interface will have a "free tier" lease for which the priority decided by the prioritizer is
 always zero. The priority updater for the free tier will function as normal but the configuration
 of the `bandwidth_priority_alpha` and `bandwidth_priority_limit` will be configurable only by the
 switch owner.

 ## Virtual Switches
 In order to determine which bandwidth lease a packet will sent on, we introduce the idea of the
 virtual switch. Each physical switch device will have a capability of some number of virtual
 switches that will be determined by the maker of the hardware. Virtual switches will be a resource
 which can be leased to bandwidth leaseholders in order for them to use their bandwidth leases to build
 networks.

 An owner of a virtual switch will be able to authorize associating their switch with a given lease,
 while the holder of a bandwidth lease will also be able to authorize associating their lease with a
 given virtual switch. When both sides authorize the other, the lease will be associated with the
 virtual switch. When a lease and a virtual switch are associated:
 * Any incoming traffic on that lease will go to that that virtual switch and
 * Any outgoing traffic from that switch that is destine for that link will use that bandwidth lease

 Any traffic destine for an interface for which there is no associated lease will use the free tier.

 ## Configurables and data-points

 * Hardware defined:
  * Ring buffer size
  * Max leases per interface
  * Max virtual switches
 * Data points (read-only):
  * Per hard interface:
    * Highest `min_priority` in time-period
    * Average `min_priority` over time-period
    * Packets lost in transit over time-period
  * Per lease:
    * Bytes sent in time-period
    * Bytes dropped in time-period
    * Highest `priority_multiplier` in time-period
    * Average `priority_multiplier` in time-period
  * Per virtual switch
    * Bytes sent in time-period (to each interface)
 * Configurables
  * Global (defined by switch device owner)
    * `prioritizer_alpha`
    * Free tier `bandwidth_priority_limit`
    * Free tier `bandwidth_priority_alpha`
  * Per lease (defined by leaseholder)
    * `bandwidth_priority_limit` (Kbp/s)
    * `bandwidth_priority_alpha`
    * Associated virtual switch ID
  * Per virtual switch (defined by virtual switch leaseholder)
    * Associated lease ID (for each interface)


 ## Notes
 * We need to be careful of the possibility for oscillations between ring buffers filling and
 emptying and min_priority rising and falling.
  * Perhaps we should introduce a configurable amount of randomized jitter in the IIR functions,
  in order to frustrate attempts to create oscillations in the network.
 * "priority_multiplier is initialized at 1 and decreased toward zero as the measured bandwidth-priority exceeds bandwidth_priority_limit" the actual formula is not defined
 * We need to find a good alpha to define for the free tier (and to use as a default for leases)
	# Cjdns Packet Prioritization System

	The packet prioritization system is designed to allow access to a link to be exercised by it's
	owner, thus allowing for bandwidth to be sold and used.

	* A packet SHOULD NOT be dropped unless there is actually not enough bandwidth to send everything.
	* A bandwidth lease MUST be able to send at least its bandwidth limit.
	* Unused bandwidth SHOULD be divided between those leases which are currently sending traffic in a
	fair way based on the relative size of the leases.

	What follows is a description of how bandwidth leases are used to decide which packet has priority
	when there is contention and some packet must be dropped.

	## Background
	Every network switch is a series of incoming paths linked to a series of outgoing paths. Data
	coming in to the switch follows the incoming path for the interface which that data comes from
	and then based on the address of the packet, the data is sent down one of the outgoing paths.
	There is always a potential problem that data is coming from all incoming paths and it is all
	destine for the same outgoing path, unless the bandwidth of the outgoing path is at least as much
	as the sum of bandwidth of all incoming paths, the switch needs to make a decision which packets
	will be dropped. In our design, there is a
	[multiplexer](https://en.wikipedia.org/wiki/Multiplexer) which merges data coming from the
	different incoming paths. Data is merged into one outgoing path that is as fast as the sum of
	bandwidth of all incoming paths, but this fast path is only on-chip and the priority rules are
	applied in order to allow everything to fit on the final outgoing link. See this blog series for
	more information about
	[how network switches work](https://etherealmind.com/whats-happening-inside-an-ethernet-switch-or-network-switches-for-virtualization-people/).

	## Outgoing path
	The outgoing path of the switch begins with a multiplexer (mux) which converges traffic coming from
	multiple interfaces down to a single high speed outgoing path (denoted by the === path) the high
	speed path is fast enough that it can carry as much traffic as all incoming interfaces can send at
	the same time. The objective of drop1, drop2 and ringbuf is to limit that traffic to what the actual
	network link can handle.

	```
	incomingA---+++++++
	\| \|
	incomingB---+ +
	\| mux \|===< prioritizer >===<drop1>[ ringbuf ]<drop2>--link-->
	incomingC---+ +
	\| \|
	incomingD---+++++++
	```

	* incoming[A-D]: Incoming traffic is tagged with the bandwidth lease which that traffic should
	be accounted on, see: Virtual Switches to understand how that tag is decided.
	* mux: This is a multiplexer which takes traffic from n incoming links and merges it into
	one highspeed outgoing path, the high speed path is fast enough to accommodate as much traffic as
	all incoming links can send at the same time.
	* prioritizer: The prioritizer reads the bandwidth lease tag and converts this into a priority tag.
	Supposing you have a switch supporting 8 leases per link, the prioritizer will contain 8 bandwidth
	meters (implemented as [IIR functions](https://en.wikipedia.org/wiki/Infinite_impulse_response)).
	In order to allow network tuning, the alpha for the IIR functions, henceforth `prioritizer_alpha`
	will be configured by the switch owner (or his supernode) and will be public information. When the
	amount of bandwidth used on a bandwidth lease is exactly the amount of the lease, the prioritizer
	will tag packets on that lease with priority 1. If the amount of bandwidth used is more, it will be
	a fractional priority and if it is less, it will be a priority greater than 1.
	* Drop1 and drop2: Ideally, packets would be placed in a
	[priority queue](https://en.wikipedia.org/wiki/Priority_queue) and the lowest priority packet would
	be replaced when the queue is full and more traffic comes than can be handled. However, this
	would require continuously ordering packets by priority, a significant processing overhead. Rather
	than a queue, we will use a [ring buffer](https://en.wikipedia.org/wiki/Circular_buffer) which is
	the norm in network device queues. A ring buffer can be thought of like a simple queue: You can
	add to the tail and remove from the head but you cannot alter the middle. Therefore we have one
	packet dropped in each place where the queue can be easily manipulated, the head and the tail.
	The packet droppers simply discard any packet with priority lower than `min_priority` where
	`min_priority` rises as the ring buffer fills. If the ring buffer remains less than or equal to half
	full, `min_priority` should remain at zero. The size of the ring buffer will be determined by the
	hardware maker who designs the switch chip, the mean and the highest value of `min_priority` in a
	given period should be published as well as drop counts for each bandwidth lease.

	### Packet Priority
	The actual packet priority used by the droppers is a 32 bit value made up of 2 16 bit elements, the
	most significant 16 bits are the priority as defined by the prioritizer, the least significant 16
	bits is the packet priority as defined by a priority field on the packet header. The priority field
	allows bandwidth leaseholders to control how their "virtual networks" are used.

	```
	real_priority = priority_from_prioritizer * 2**16 +
	priority_from_packet_header
	```

	## Incoming path
	The incoming path is where packets are received from an external interface.

	```
	++++++++++++++---> outgoingA
	\| \|
	+ +---> outgoingB
	--< priority updater >--\|cjdns switch\|
	+ +---> outgoingC
	\| \|
	++++++++++++++---> outgoingD
	```

	The 2 main components of the incoming path are the cjdns switch and the priority updater. The cjdns
	switch is already well documented elsewhere so we will focus on the priority updater.

	All incoming traffic is tagged with the bandwidth lease with which it is associated, if
	(for example) the switch supports 8 leases per link, the priority updater will have 8 priority
	meters. While the bandwidth meters in the outgoing path are intended to protect leaseholders
	from each other, the priority updater is intended to ensure fairness within the virtual network
	created by a single bandwidth leaseholder. Also unlike bandwidth meters, priority meters measure
	kilobits TIMES priority per second (Kbp/s). In order to allow a bandwidth leaseholder to
	interconnect with other leaseholders and customers, there is a configurable called
	`bandwidth_priority_limit`. If for example a customer has a 10Gb link to a bandwidth leaseholder
	but they only pay for the equivilant of 1Gb of "network access", the leaseholder can configure
	`bandwidth_priority_limit` to 1Gbp/s for that link, allowing the customer to use all 10Gb when
	the network is idle but limiting them to the equivilent of 1Gb/s of network access during times
	of network contention.

	The way `bandwidth_priority_limit` works is that when each packet comes in, it's bandwidth-priority
	is measured in a IIR function, a number called `priority_multiplier` is initialized at 1 and
	decreased toward zero as the measured bandwidth-priority exceeds `bandwidth_priority_limit`, when
	bandwidth-priority goes below `bandwidth_priority_limit`, `priority_multiplier` is increased until
	it reaches 1 again. For each incoming packet, the priority is multiplied by `priority_multiplier`,
	thus keeping the total bandwidth-priority per second less than or equal to
	`bandwidth_priority_limit`. The leaseholder is clearly able to change the `bandwidth_priority_limit`
	but they are also able to change the alpha used in the IIR function used for measuring
	bandwidth-priority (aka: `bandwidth_priority_alpha`)

	## Free tier
	Every interface will have a "free tier" lease for which the priority decided by the prioritizer is
	always zero. The priority updater for the free tier will function as normal but the configuration
	of the `bandwidth_priority_alpha` and `bandwidth_priority_limit` will be configurable only by the
	switch owner.

	## Virtual Switches
	In order to determine which bandwidth lease a packet will sent on, we introduce the idea of the
	virtual switch. Each physical switch device will have a capability of some number of virtual
	switches that will be determined by the maker of the hardware. Virtual switches will be a resource
	which can be leased to bandwidth leaseholders in order for them to use their bandwidth leases to build
	networks.

	An owner of a virtual switch will be able to authorize associating their switch with a given lease,
	while the holder of a bandwidth lease will also be able to authorize associating their lease with a
	given virtual switch. When both sides authorize the other, the lease will be associated with the
	virtual switch. When a lease and a virtual switch are associated:
	* Any incoming traffic on that lease will go to that that virtual switch and
	* Any outgoing traffic from that switch that is destine for that link will use that bandwidth lease

	Any traffic destine for an interface for which there is no associated lease will use the free tier.

	## Configurables and data-points

	* Hardware defined:
	* Ring buffer size
	* Max leases per interface
	* Max virtual switches
	* Data points (read-only):
	* Per hard interface:
	* Highest `min_priority` in time-period
	* Average `min_priority` over time-period
	* Packets lost in transit over time-period
	* Per lease:
	* Bytes sent in time-period
	* Bytes dropped in time-period
	* Highest `priority_multiplier` in time-period
	* Average `priority_multiplier` in time-period
	* Per virtual switch
	* Bytes sent in time-period (to each interface)
	* Configurables
	* Global (defined by switch device owner)
	* `prioritizer_alpha`
	* Free tier `bandwidth_priority_limit`
	* Free tier `bandwidth_priority_alpha`
	* Per lease (defined by leaseholder)
	* `bandwidth_priority_limit` (Kbp/s)
	* `bandwidth_priority_alpha`
	* Associated virtual switch ID
	* Per virtual switch (defined by virtual switch leaseholder)
	* Associated lease ID (for each interface)


	## Notes
	* We need to be careful of the possibility for oscillations between ring buffers filling and
	emptying and min_priority rising and falling.
	* Perhaps we should introduce a configurable amount of randomized jitter in the IIR functions,
	in order to frustrate attempts to create oscillations in the network.
	* "priority_multiplier is initialized at 1 and decreased toward zero as the measured bandwidth-priority exceeds bandwidth_priority_limit" the actual formula is not defined
	* We need to find a good alpha to define for the free tier (and to use as a default for leases)