- 1. Week 1
- 2. Week 2
- 2.1. Locking
- 2.1.1. Historic Synchronization
- 2.1.2. Lock Hierarchy
- 2.1.3. Turnstiles
- 2.1.4. Turnstiles implementation
- 2.1.5. Sleep queues
- 2.1.6. Sleep interface
- 2.1.7. Critical Sections
- 2.1.8. Hardware requirements for lock
- 2.1.9. Spin mutex
- 2.1.10. Blocking mutex
- 2.1.11. Pool Mutex
- 2.1.12. Reader and Writer locks
- 2.1.13. Read-Mostly lock
- 2.1.14. Shared-Exclusive locks
- 2.1.15. Condition variables
- 2.1.16. Locking Manager locks
- 2.1.17. Witness
- 2.2. Processes
- 3. Week 3
- 4. Week 4
- 4.1. Security
- 4.1.1. Security mindset
- 4.1.2. Trusted Computing Base
- 4.1.3. Process credentials
- 4.1.4. Set-User-Identification Functionality
- 4.1.5. Immutable and Append-only flags
- 4.1.6. Jails
- 4.1.7. Jails rules
- 4.1.8. Random number generation
- 4.1.9. Access control lists (ACL)
- 4.1.10. Access control list semantics
- 4.1.11. Privileges (Kernel only, no userland interface)
- 4.1.12. Privileges applied
- 4.1.13. Auditing
- 4.1.14. Auditing handling
- 4.1.15. Capsicum
- 4.1.16. Sample Capsicum capability
- 4.2. Virtual Memory
- 5. Week 5
- 5.1. Virtual Memory
- 5.1.1. Physical memory and swap space
- 5.1.2. Virtual Memory organization
- 5.1.3. Process Address Space
- 5.1.4. Process Address Space Sharing
- 5.1.5. Shadow object chains
- 5.1.6. Machine-dependent virtual-memory description
- 5.1.7. Kernel memory allocator hierarchy
- 5.1.8. Kernel Maps and Submaps
- 5.1.9. Vmem Data Structures (Arena)
- 5.1.10. The Slab Allocator
- 5.1.11. The Keg Allocator
- 5.1.12. The Zone Allocator
- 5.1.13. Fast Kernel malloc implementation
- 5.1.14. Interface look and feel
- 5.1.15. Types of pages
- 5.1.16. Paging strategies
- 5.1.17. Paging strategies (part II)
- 5.1.18. Superpage Motivation
- 5.1.19. Superpage Solution
- 5.1.20. Superpage Reservation
- 5.1.21. Superpage promotion
- 5.1.22. Superpage fragmentation control
- 5.1.23. Filesystem interaction with the virtual memory
- 5.1.24. Kernel structure
- 6. Week 6
- 6.1. I/O System Overview
- 6.1.1. Kernel I/O Structure
- 6.1.2. I/O Descriptors
- 6.1.3. Descriptor I/O Structure
- 6.1.4. File Entry
- 6.1.5. File Entry Types
- 6.1.6. Multiplexing I/O
- 6.1.7. Select Top-Level
- 6.1.8. Select Bottom-Level
- 6.1.9. Select Notification
- 6.1.10. Kqueues and Kevents
- 6.1.11. Types of kqueue events
- 6.1.12. Kevent notification
- 6.1.13. Kevent Implementation
- 6.1.14. Semantics of Non-Blocking
- 6.1.15. Uio Structure
- 6.1.16. Level of Filesystem Interface
- 6.1.17. Contents of a Vnode
- 6.1.18. Vnode interface - Name Translation
- 6.1.19. Vnode interface - Vnode Management
- 6.1.20. Vnode interface - Name Caching
- 6.1.21. Vnode linkages
- 6.1.22. Vnode Mount interface
- 6.1.23. Buffers
- 6.1.24. Stackable filesystems
- 6.1.25. Loopback mounts
- 6.1.26. Union mounts
- 6.1.27. Union mount naming
- 6.1.28. Union mount issues
- 6.1.29. Union implementation
- 6.1.30. Dead file system - DFS
- 7. Week 7
- 7.1. Devices
- 7.1.1. Special Devices
- 7.1.2. Device Naming and Access
- 7.1.3. Character devices entry points
- 7.1.4. Line disciplines entry points
- 7.1.5. Termios
- 7.1.6. Changing the termios structure
- 7.1.7. Allocation of controlling terminal
- 7.1.8. Revocation of controlling terminal
- 7.1.9. Network driver entry points
- 7.1.10. Network driver control messages
- 7.1.11. Network packet reception
- 7.1.12. Disk volume management
- 7.1.13. Mirroring a filesystem
- 7.1.14. GEOM Operation
- 7.1.15. Disk sorting
- 7.1.16. Communication from the top to the bottom of the kernel
- 7.1.17. From bottom to the top
- 7.1.18. FreeBSD Disk I/O Subsystem
- 7.1.19. Physical Drive Management
- 7.1.20. Autoconfiguration
- 7.1.21. Autoconfiguration Functions
- 7.1.22. Device driver bidding
- 7.1.23. Autoconfiguration data structures for pci0
- 7.1.24. Device virtualization
- 7.1.25. Bhyve virtualization
- 7.1.26. Xen virtualization
- 7.1.27. Xen Grant tables
- 7.2. Fast Filesystem
- 8. Week 8
- 8.1. Filesystems
- 8.1.1. Quota
- 8.1.2. Semantics of quota
- 8.1.3. Current UNIX Filesystems
- 8.1.4. Inode
- 8.1.5. A small filesystem
- 8.1.6. UNIX Filesystem
- 8.1.7. I/O in UNIX
- 8.1.8. Contents of a cylinder group
- 8.1.9. Optimizing storage utilization
- 8.1.10. Filesystem parametrization
- 8.1.11. Layout policies
- 8.1.12. Layout policies (part 2)
- 8.1.13. Dynamic block reallocation
- 8.1.14. Implementation of dynamic block reallocation
- 8.1.15. Optimized Metadata Layout
- 8.1.16. Implementation of allocation
- 8.1.17. Filesystem Consistency
- 8.1.18. Adding journaling to soft updates
- 8.1.19. Additional requirements of journaling
- 8.1.20. Tracking file removal dependencies
- 8.1.21. Journaled Crash recovery
- 8.1.22. Snapshots
- 8.1.23. Snapshot implementation
- 9. Week 9
- 9.1. ZFS filesystem
- 9.1.1. ZFS overview
- 9.1.2. ZFS Modules
- 9.1.3. Functional organization
- 9.1.4. Structural organization
- 9.1.5. Uberblock
- 9.1.6. Dnode
- 9.1.7. ZFS Block pointers
- 9.1.8. ZFS block management
- 9.1.9. ZFS Structure
- 9.1.10. ZFS Relationships
- 9.1.11. ZFS Checkpoints
- 9.1.12. Writing new data
- 9.1.13. Flushing dirty data
- 9.1.14. Logging
- 9.1.15. RAIDZ
- 9.1.16. RAIDZ Recovery
- 9.1.17. Storage pool allocator
- 9.1.18. Snapshots
- 9.1.19. Freeing filesystem blocks
- 9.1.20. Deduplication
- 9.1.21. Remote replication
- 9.1.22. ZFS strengths
- 9.1.23. ZFS weaknesses
- 10. Week 10
- 10.1. NFS
- 10.1.1. Sun Microsystems Network File System
- 10.1.2. NFS v3
- 10.1.3. NFS v4
- 10.1.4. NFS File identification
- 10.1.5. NFS v3 daemons
- 10.1.6. NFS v3 mount operation
- 10.1.7. NFS v3 I/O
- 10.1.8. NFS v4 goals
- 10.1.9. NFS v4 changes
- 10.1.10. NFS v4 daemons
- 10.1.11. NFS v4 mount operation
- 10.1.12. NFS v4 sessions
- 10.1.13. NFS v4 delegation/recall
- 10.1.14. NFS v4 crash recovery
- 11. Week 11
- 11.1. Networking and Interprocess Communication
- 11.1.1. Graph of communicating processes
- 11.1.2. Name server
- 11.1.3. The ISO networking model
- 11.1.4. Packet traffic
- 11.1.5. Socket system call interface
- 11.1.6. Types of network address
- 11.1.7. Use of IPC/Networking facilities
- 11.1.8. System layering
- 11.1.9. UDP
- 11.1.10. TCP
- 11.1.11. SCTP
- 11.1.12. IP
- 11.1.13. Mbufs
- 11.1.14. Mbufs continued
- 11.1.15. Socket data queuing
- 11.1.16. Socket interface protocols
- 11.1.17. Network interface addresses
- 11.1.18. Network interface to link layer
- 11.1.19. Net to protocol interface
- 12. Week 12
- 12.1. The Network Layer
- 12.1.1. Internet Addressing - IP Version 4
- 12.1.2. Internet Addressing - Subnets
- 12.1.3. IP Multicast
- 12.1.4. Address Resolution Protocol
- 12.1.5. Internet Addressing - IP Version 6
- 12.1.6. IPv6 addressing
- 12.1.7. IPv6 autoconfiguration
- 12.1.8. Internet Protocols - ICMP
- 12.1.9. ICMP Redirect processing
- 12.1.10. IP Forwarding
- 12.1.11. Routing goals
- 12.1.12. Routing design
- 12.1.13. Kernel routing table
- 12.1.14. Routing: radix search algorithm
- 12.1.15. Kernel routing lookup
- 12.1.16. Kernel routing socket
- 12.1.17. Routing strategies
- 12.1.18. Routing Daemon (routed)
- 12.1.19. OSPF routing protocol
- 12.1.20. IPSec Goals
- 12.1.21. IPSec transport mode
- 12.1.22. IPSec tunnel mode
- 12.1.23. IPSec key management
- 12.1.24. Packet-processing frameworks
- 12.1.25. Berkeley Packet Filter
- 12.1.26. IPFW
- 12.1.27. Dummynet
- 12.1.28. Packet filter (PF)
- 12.1.29. Netgraph
- 12.1.30. Netmap
- 12.1.31. Netmap receive ring
- 12.1.32. Netmap transmit ring
- 13. Week 13
- 13.1. The Transport Layer
- 13.1.1. Protocol Control Block
- 13.1.2. Internet Protocol control block
- 13.1.3. TCP Sequence space
- 13.1.4. TCP Packet header
- 13.1.5. TCP timers
- 13.1.6. TCP Output states
- 13.1.7. TCP Round-trip timing
- 13.1.8. Connection establishment
- 13.1.9. Socket connection queueing
- 13.1.10. Handling SYN attacks
- 13.1.11. Connection metrics
- 13.1.12. Silly window syndrome avoidance
- 13.1.13. Delayed acknowledgement
- 13.1.14. Small packet avoidance (Nagle algorithm)
- 13.1.15. Slow start
- 13.1.16. Acknowledgement clocking
- 13.1.17. Congestion avoidance
- 13.1.18. TCP retransmission
- 13.1.19. Selective acknowledgement
- 13.1.20. Modular congestion control
- 13.1.21. Modular congestion control (II)
- 13.1.22. Basic TCP send policy
- 13.1.23. Stream Control Transmission Protocol
- 13.1.24. SCTP chunks
- 13.1.25. SCTP multiple paths
- 14. Week 14
- 14.1. System startup and shutdown
- 14.1.1. Booting overview
- 14.1.2. Firmware bootstrap
- 14.1.3. First level bootstrap
- 14.1.4. Second level bootstrap
- 14.1.5. Final stage bootstrap
- 14.1.6. Kernel loadable modules
- 14.1.7. Kernel boot
- 14.1.8. Kernel module initialization
- 14.1.9. Kernel modules (part I)
- 14.1.10. Kernel modules (part II)
- 14.1.11. Kernel threads
- 14.1.12. Mounting the root and /dev filesystems
- 14.1.13. User-space startup
- 14.1.14. System shutdown
- 14.1.15. Performance and monitoring system tuning (tools)
- 14.1.16. Kernel tracing
- 14.1.17. Kernel trace display
- 14.1.18. Administration of kernel information
- 14.1.19. Getting process information
- 14.1.20. Program interface
- 14.1.21. Kernel debugger
- 14.1.22. Kernel crashes
- 14.1.23. Netstat memory statistics
- 14.1.24. System malloc memory statistic
- 14.1.25. System malloc memory statistics (part II)
- 14.1.26. System zone memory statistics
- 14.1.27. Example ps output
- 14.1.28. Example fstat output
- 14.1.29. Example of netstat output
- 14.1.30. DTrace
- 14.1.31. Performance monitoring counters
- 15. Week 15
- Info about Hillside
- Class times
- Man pages
- Outline
- Listed in the printed material
- Unix was about portability
- The VM is built upon 5 properties
- paged virtual addr space
- If you want to allocate more than physical memory
- software interrupts
- Define hardware events and turn it into a software event
- Catch signal interrupt and deal with things like division by zero
- Call your signal handler
- Unified interface for hardware events
- timers and counters
- identifiers for accounting, protection and scheduling
- descriptor
- Handle of some kind of I/O
- Before that, each thing had its own descriptor
- Fundamental to the unix philosophy
- Input/Output connected through pipes creates a pipeline of execution
- Process is made of 4 main pieces:
- CPU Time
- Scheduling made by kernel
- Asynchronous Events
- Interruptions have to be dealt with, signals, network traffic
- Memory
- How virtual memory is managed
- How to make a hardware provide an address space
- I/O Descriptors
- Any useful process is going to have I/O Descriptors
- Most fundamental service provided by UNIX
- CPU Time
- Top half of the kernel
- Top half has its own stack, kept resident in physical memory, so it is there whenever the kernel needs it
- More malloc and free in the kernel because only a limited stack is available
- Bottom half
- Scheduled by interrupts
- Top half will be preempted when the bottom half interrupts
- You can block in the top half, but generally don't block in the bottom half
-
Part 1:
- Hardware traps
- System calls
- Special case of hardware traps
- Hardware interrupts
- Async (disk io complete, user typed a key)
- Higher priority
- Software-initiated traps or interrupts
- Lower priority
-
Part 2: Switch to kernel mode
- Push PC and PSW into the kernel per-process stack
- Push trap type or syscall number onto the stack
- Save general purpose registers
- Call handler (syscall()) for a syscall, trap() for a trap
-
Timers
- 3 per-process interval timers, with an optional value to be reloaded when one expires
- ITIMER_REAL
- Decrements in real time
- Sends SIGALRM on expiration
- ITIMER_VIRTUAL
- Decrements in user virtual time
- Sends SIGVTALRM on expiration
- ITIMER_PROF
- Decrements in user + system virtual time
- Sends SIGPROF on expiration
- Select timer is independent from the interval timers
- Kqueue can also create timers
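A minimal userland sketch of the interval timers above, assuming nothing beyond POSIX: it arms ITIMER_REAL with setitimer(2) and catches the SIGALRM it delivers.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void
on_alarm(int sig)
{
	fired = 1;
}

int
main(void)
{
	struct sigaction sa;
	struct itimerval it;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigaction(SIGALRM, &sa, NULL);

	memset(&it, 0, sizeof(it));
	it.it_value.tv_sec = 1;			/* first expiration */
	it.it_interval.tv_sec = 1;		/* value reloaded on expiration */
	setitimer(ITIMER_REAL, &it, NULL);	/* decrements in real time */

	for (int i = 0; i < 3; i++) {
		pause();			/* wait for SIGALRM */
		if (fired) {
			fired = 0;
			printf("tick %d\n", i);
		}
	}
	return (0);
}
```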
-
Timeouts
- Arrange a timeout in the kernel
- handle = timeout(function, arg, ticks)
- Cancel it
- untimeout(handle)
- Compute the number of ticks
- ticks = hzto(timevalp)
- Timeout routines are called from softclock at splsoftclock
- When a lock is not available, sleep on it
- wchan (the wait channel) receives a pointer to the resource
- processes waiting for that specific resource will be awakened
- From the most simple to the most complex
- Hardware
- Spin Locks
- Locks that block briefly
- blocking mutex
- pool mutex
- reader-writer locks
- read-mostly
- Locks using the sleep-queue interface
- Witness
- sys/sys/proc.h
- data structures using locks to protect the access into the fields
- Designed for short periods
- Used to protect read and write
- Track current lock holder
- If you hold a short-term lock and go to sleep, the kernel will panic
- Priority propagation
- Hash header to quickly find locks
- A turnstile is needed each time a thread blocks
- A thread can only block on one thing
- Used by shared-exclusive locks, condition variables
- No propagation
- You are not allowed to own a turnstile when requesting a sleep-queue lock
- The queue tracks the exclusive lock holder
- May be recursive
- wchan and priority
- msleep
- sleep on lbolt (lightning bolt)
- wait syscall waits for children to exit and puts the parent to sleep in its own context
- sigpause
- The only way to exit a sigpause is a longjmp
- sleep implementation
- lock sleep hash header
- record wchan, hash to find the sleep queue
- set td_priority (for scheduling)
- place on appropriate sleep queue
- call sched_switch() to run the next thread
- sleep thread is not running until awakened
- if PCATCH is set check for signals
- wakeup implementation
- awaken ALL sleepers on wchan starting from the front
- wakeup_one
- awaken first sleeper on wchan
- To awaken a thread
- remove from sleep queue
- recompute kg_user_pri for long sleepers
- sleeping, make runnable, place on run queue
- stopped: "unsleep", remain stopped
- request CPU rescheduling by setting NEEDRESCHED
- protected by locks
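A hedged kernel-style sketch of the sleep/wakeup pattern just described; the softc structure and its fields are hypothetical, but msleep(9) and wakeup(9) are the interfaces the notes refer to.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct my_softc {			/* hypothetical driver state */
	struct mtx	mtx;
	int		data_ready;
};

static int
my_wait_for_data(struct my_softc *sc)
{
	int error = 0;

	mtx_lock(&sc->mtx);
	while (sc->data_ready == 0 && error == 0)
		/* wchan is &sc->data_ready; PCATCH lets signals
		 * interrupt the sleep, as noted above. */
		error = msleep(&sc->data_ready, &sc->mtx,
		    PCATCH, "mywait", 0);
	mtx_unlock(&sc->mtx);
	return (error);
}

static void
my_data_arrived(struct my_softc *sc)
{
	mtx_lock(&sc->mtx);
	sc->data_ready = 1;
	wakeup(&sc->data_ready);	/* awaken ALL sleepers on wchan */
	mtx_unlock(&sc->mtx);
}
```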
- Locking is not possible without assistance from hardware
- Test-and-set
- Very expensive instructions
- FreeBSD uses a compare-and-swap
- instead of a simple flag value, the lock word holds a pointer
- Owner field for a free lock has MTX_UNOWNED
- Owner field for a held lock has the pointer to owning thread
- Exclusive access only
- Loops waiting for the mutex to become available
- Run inside the critical section
- More expensive than a blocking mutex
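A userland sketch, using C11 atomics, of the compare-and-swap lock word described above: the owner field holds either an unowned sentinel (mirroring MTX_UNOWNED) or a pointer to the owning thread, and the lock loops waiting for the mutex to become available. Illustrative only, not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

#define UNOWNED	((uintptr_t)0)		/* stand-in for MTX_UNOWNED */

struct spin_mtx {
	_Atomic uintptr_t owner;	/* UNOWNED or owning-thread pointer */
};

static void
spin_lock(struct spin_mtx *m, uintptr_t self)
{
	uintptr_t expected;

	for (;;) {
		expected = UNOWNED;
		/* Atomically: if owner == UNOWNED, set it to self. */
		if (atomic_compare_exchange_weak(&m->owner, &expected, self))
			return;		/* acquired */
		/* else loop, waiting for the mutex to become available */
	}
}

static void
spin_unlock(struct spin_mtx *m)
{
	atomic_store(&m->owner, UNOWNED);
}
```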
- Exclusive only
- Adaptive spinning, which only spins if the owner is running
- All waiters are awakened when the lock is released
- Cheaper to release an uncontested lock
- Often end up scheduling sequentially
- Used for small short-lived data structures
- Just need a pointer to a mutex rather than a large mutex itself
- Mutexes are preallocated, avoiding high creation and destruction costs
- Example is the poll syscall, which needs a struct to track poll requests
- Exclusive and shared mutex
- Uses a turnstile, so it cannot be held while sleeping
- Priority propagation for exclusive not for shared access
- May specify permission to recurse
- Same properties as reader-writer
- Priority propagation for shared
- Designed for fast access
- Read without locking, then check whether a write happened (see the sketch below)
- if so, redo the read
- The routing table is a good example of this lock
- IBM holds a patent on RCU
- It allows GPL code to use it
- FreeBSD is not GPL
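A sketch of the read-then-check pattern above (a seqcount-style reader): read without locking, then redo the read if a writer intervened. The scheme and names here are illustrative, not FreeBSD's rmlock implementation.

```c
#include <stdatomic.h>

struct rm_data {
	_Atomic unsigned seq;	/* writers make this odd while updating,
				 * and bump it again when done */
	int		 value;
};

static int
rm_read(struct rm_data *d)
{
	unsigned s1, s2;
	int v;

	do {
		s1 = atomic_load(&d->seq);
		v = d->value;			/* unlocked read */
		s2 = atomic_load(&d->seq);
		/* redo if a write was in progress (odd) or completed */
	} while (s1 != s2 || (s1 & 1) != 0);
	return (v);
}
```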
- Fastest and simplest of the locks that can sleep
- Shared and exclusive
- Permission to recurse
- May request interruptions (PCATCH)
- No priority propagation
- wrapper on traditional sleep and wakeup
- Allows waking one or all waiters
- Must hold a mutex before awakening or waiting, as the sketch below shows
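A kernel-style sketch of the condition-variable wrapper using the cv(9) interface; the work-queue structure is hypothetical. The mutex is held around both the wait and the wakeup, as required above.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

struct workq {				/* hypothetical work queue */
	struct mtx	mtx;
	struct cv	nonempty;
	int		nitems;
};

static void
workq_put(struct workq *q)
{
	mtx_lock(&q->mtx);
	q->nitems++;
	cv_signal(&q->nonempty);	/* wake one waiter; cv_broadcast()
					 * would wake all */
	mtx_unlock(&q->mtx);
}

static void
workq_take(struct workq *q)
{
	mtx_lock(&q->mtx);
	while (q->nitems == 0)
		cv_wait(&q->nonempty, &q->mtx);	/* drops and reacquires mtx */
	q->nitems--;
	mtx_unlock(&q->mtx);
}
```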
- Most full-featured of the locks that can sleep
- Downgrade (from exclusive to shared)
- Upgrade (from shared to exclusive)
- An upgrade request prevents other exclusive locks
- Can pass ownership of a lock from a thread to the kernel
- You need to clean up before leaving the kernel
- Ability to drain all accessing threads for deallocation
- No propagation
- Partial ordering requires:
- Thread acquires only one lock in a class
- Thread may acquire only a lock in a higher-numbered class
- Class 1 > Class 2 - thread with locks on class 1 can acquire class 2 locks
- Programmers can define lock classes
- Witness code observes actual lock ordering and complains when either rule is violated
- Very expensive
- Debate exists about enabling in production
- CPU Time
- What runs when
- Async events
- timers
- hardware exceptions
- external events
- I/O
- Memory
- text/data
- heap
- runtime stack
- I/O Descriptors
- files
- pipes
- socket
- stream
- Process entry
- All UNIX systems have one; Solaris and Linux process entries may differ
- Process structures dynamically allocated
- Placed in linked list
- filedesc structure
- Open file array is reallocated when it overflows
- Process limits (plimit)
- array of rlimit
- copied lazily
- Process accounting/statistics (pstats)
- rusage (Resource usage)
- virtual-time timer
- profiling
- start time
- Signal actions
- Pending signals
- sigacts per-signal actions
- signal stack info
- current state during sigpause
- kthread model (started in BSD world)
- One process structure
- Multiple threads
- Fork model (started in the Linux world)
- Each thread has its own process structure
- same model as that used by Linux
- kthread context switch is much cheaper
- rfork slower for context switch
- runnable
- sleeping
- stopped
- stopped by job control or tracing
- return to sleep or run
- minor states
- new
- zombie
- Zombies won't close socket descriptors because tcp may have data sitting around
- Init's job is to wait for orphans left by other processes
- by process ID
- hashed, linked through p_hash index
- pfind()
- by state
- run queue: td_runq
- sleep queue: td_slpq
- lock queue: td_lockq
- by type
- on one of allproc or zombproc
- the p_ prefix always refers to the proc structure
- REALTIME
- ULE
- 4BSD
- IDLE
- UNIX was designed with some security in mind
- Users have existed in UNIX since the beginning
- Windows didn't have users until NT
- Things that have to be secure
- Kernel
- Boot scripts
- Core utilities (shell, login, ifconfig)
- Libraries used by core utilities (libc)
- Solid crypto base
- process credentials (ucred structure)
- effective uid (setuid)
- group set (gid)
- real uid, gid
- saved uid, gid from set*id program
- exported credentials (xucred structure) - outside kernel
- effective uid
- group set
- seteuid
- setegid
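A userland sketch of how a set-user-ID program uses the saved UID described above: drop the effective UID for normal work and raise it only around privileged operations.

```c
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	uid_t ruid = getuid();	/* real uid: the invoking user */
	uid_t euid = geteuid();	/* effective uid: the file's owner */

	seteuid(ruid);		/* drop privilege; saved uid remembers euid */
	printf("running unprivileged as uid %d\n", (int)geteuid());

	seteuid(euid);		/* raise privilege again from the saved uid */
	/* ... privileged operation would go here ... */
	seteuid(ruid);		/* and drop it once more */
	return (0);
}
```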
-
Immutable limitation
- Immutable files can be changed only in single-user mode
- Append-only files can only be rotated in single-user mode
- Direct hardware access is restricted
- All startup activities need to be protected
- Started from chroot
- Jails were created by demand from the user base (ISPs)
- vnet
- vem0a - vem0b (master slave virtual net)
- Permitted
- running or signaling process within the jail
- change files within jail
- bind ports
- access raw, divert, or routing sockets within the jail's virtual network
- Not permitted
- get info from processes outside jail
- change kernel variables
- mount or unmount filesystem
- modify phys net interface
- reboot
- Apps access randomness via /dev/random
- Uses Yarrow, a PRNG
- Yarrow
- reuse existing crypto primitives such as hash and block encryption
- framework for entropy acquisition
- an entropy accumulator
- a reseed mechanism
- a generation mechanism (SHA-256 and AES)
- Yarrow will be replaced by Fortuna on FreeBSD 11
- File permission bits
- ACL capabilities
- Look up
- Administration permissions
- Change acl on the file
- List of users
- List of groups
- Default ACL Inheritable
- ZFS uses NFSv4 ACL model (compatible with windows ACL)
- NFSv4 uses inheritable ACLs rather than the default ACLs in POSIX
- Nearly 200 privileges
- In the kernel, checks for uid 0 now call priv_check()
- Jails use privileges to control what a process can do
- MAC is a loadable kernel module
- Used to work out what happened when things go bad
- Good auditing should include intrusion detection
- OpenBSM
- Looks like truss or strace
- auditd daemon
- starts a kernel thread that controls record distribution
- Sandbox
- More powerful than jails
- Untrusted code runs in a separate process with limited access
- cap_enter()
- No access to filesystem
- Defined at /sys/sys/capability.h
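A minimal Capsicum sketch: open a file, limit the descriptor to CAP_READ, then cap_enter(); after that, global namespaces (such as opening files by path) are no longer reachable, so only the capabilities already held can be used.

```c
#include <sys/capsicum.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	char buf[128];
	cap_rights_t rights;
	int fd = open("/etc/motd", O_RDONLY);

	if (fd < 0)
		return (1);
	cap_rights_init(&rights, CAP_READ);		/* read-only capability */
	if (cap_rights_limit(fd, &rights) < 0 || cap_enter() < 0)
		return (1);

	/* Sandboxed: only operations permitted by fd's rights work now;
	 * open("/etc/passwd", ...) here would fail with ECAPMODE. */
	ssize_t n = read(fd, buf, sizeof(buf));
	return (n >= 0 ? 0 : 1);
}
```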
- Top part of the address space used by the kernel
- malloc() memory, kernel thread stacks
- no page faults in the kernel
- load everything, really
- Bottom part for userland
- Context switch happens here
- Trampoline code for signals
- User stack pages are not executable
- First page left invalid to catch null-pointer dereferences
- Heap grows up, Stack grows down
- Shared libraries sit a couple of pages past the heap limit
- Virtual address to Physical address
- MMU
- L1 cache will probably have a page cached
- TLB - Translation Lookaside Buffer
- Page_fault - interrupt
- MMU must be informed that a new page is loaded
- Context switch is expensive because a whole new page directory needs to be loaded
- A process address space is made of regions
- A region is a collection of address space that is treated the same way
- Shared memory
- File Mapping
- Copy on write after fork
- VM design is architecture independent
- mmap()
- maps file into memory
- munmap()
- removes mapping from memory
- msync()
- sync hardware caches to main memory
- mprotect()
- control access to parts of the address space
- madvise()
- advise the kernel on how an area of memory will be used
- e.g., a garbage collector advises random access
- mlock()
- force pages to be resident, not paged out
- used in databases for transactions, to prevent pages from going away
- call msync() to sync main memory from caches
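A sketch tying the calls above together: map a (hypothetical) file shared, store through the mapping, push the change back with msync(), and unmap.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	struct stat st;
	int fd = open("data.bin", O_RDWR);	/* hypothetical file */

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return (1);
	char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);			/* map file into memory */
	if (p == MAP_FAILED)
		return (1);

	madvise(p, st.st_size, MADV_SEQUENTIAL); /* advise usage pattern */
	p[0] = 'X';				/* store through the mapping */
	msync(p, st.st_size, MS_SYNC);		/* write back to the file */
	munmap(p, st.st_size);			/* remove the mapping */
	close(fd);
	return (0);
}
```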
- when malloc fails you have probably used up all your address space
- linux does not enforce address space size
- Data Structures
- Per process: shared per address space, per thread
- Per address space: collection of objects
- most of the code is machine independent
- vmspace
- vm_map
- vm_pmap (p for physical)
- statistics (about the vm system, page faults, workset size)
- vm_map_entry
- anonymous object
- vnode / object
- start and end address
- the VM asks the object whether it has the page
- Private mapping
- Shadow objects between the vm_entry and the file object/vnode
- In low-memory scenarios, pages are taken from the vnode and put into memory
- When a process with a MAP_PRIVATE mapping forks, its child must get its own shadow object
- On 32-bit, the kernel can only have 2GB of address space
- Memory in kernel space
- submaps
- vmem arena
- slabs out of the arena
- bucket out of slabs
- buckets are per-CPU
- kernel described by map entries
- bt boundary tag
- power-of-2 free list
- BESTFIT or INSTANTFIT
- hash list for used pages, pointed to by the bt
- the slab allocator is responsible for allocating a zone of a particular thing
- all objects are the same size
- manages a particular zone
- track slab objects
- lists for full, partially allocated, and empty slabs
- request a new slab when everything is in use
- Lock per zone
- Each CPU has its own lockless list protected by a critical section
- Each CPU has its own bucket of memory
- Two buckets per CPU
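A hedged kernel-style sketch of the zone allocator interface (uma(9)) described above; struct foo and the zone name are hypothetical. Allocations are normally satisfied from the per-CPU bucket without taking the zone lock.

```c
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

struct foo {				/* hypothetical fixed-size object */
	int	state;
};

static uma_zone_t foo_zone;

static void
foo_init_zone(void)
{
	/* All objects in the zone are the same size: sizeof(struct foo). */
	foo_zone = uma_zcreate("foo", sizeof(struct foo),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static struct foo *
foo_alloc(void)
{
	/* Usually served lockless from this CPU's bucket. */
	return (uma_zalloc(foo_zone, M_WAITOK | M_ZERO));
}

static void
foo_free(struct foo *fp)
{
	uma_zfree(foo_zone, fp);
}
```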
- Memory pool characteristics
- Size is fixed
- Max size is no bigger than phys mem
- Memory mapping hardware can be directly controlled
- Known usage patterns
- Most allocations are small
- Many allocations are for a power-of-2 sized piece of memory
- Simple and familiar, similar to C malloc() and free()
- Small allocation
- use power of two
- called using malloc and free
- Large common structures
- use zone-based allocation
- more compact than power-of-two
- avoid cache line problems
- avoid initializing structures
- must reclaim unused memory from zones
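A kernel-style sketch of the malloc(9) interface just described; the M_MYBUF type is hypothetical. Unlike C malloc()/free(), each call names a malloc type used for accounting.

```c
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

MALLOC_DEFINE(M_MYBUF, "mybuf", "example buffers");	/* hypothetical type */

static void *
grab_small(void)
{
	/* A small request like this is rounded to a power of 2, as
	 * noted above. */
	return (malloc(100, M_MYBUF, M_WAITOK));
}

static void
drop_small(void *p)
{
	free(p, M_MYBUF);	/* the type is passed on free as well */
}
```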
- Wired - kernel
- Active - In use
- Inactive - valid contents, dirty
- Cache - valid contents, clean
- Free - invalid contents, ready for use
- Necessary to have higher priority page access
- free pages go on the free list
- when no longer explicitly referenced
- on process exit, anon pages go to the free list
- To add inactive pages to the cache list
- remove the oldest page from the inactive list
- if first inspection mark scanned
- if dirty, put back on the inactive list
- pages on the cache list need to be clean
- Inactive and cache are LRU
- To add to inactive list
- Remove oldest page from active list
- if in use, increment active count and put on the end of active list
- else not in use, decrement active count
- if active count greater than zero, put on end of active list
- else clear scanned flag and move to end of inactive list
- The problem is the limited size of the TLB
- if a TLB lookup takes more than a cycle it is slow
- 4k pages
- 2MB pages with PAE
- 4MB pages without PAE
- On FreeBSD the kernel decides when to use superpages
- Others allow programs to decide
- Starts at the beginning (first fault) because small pages are already in use
- Anonymous memory is always eligible for superpages
- a mapped file of at least superpage size is eligible
- Only saves TLB misses
- if you are touching all pages in your reservation
- promotion to read-only superpage
- Kernel memory is always superpages
- Kernel memory is not paged out
- Cache and free are kept in buddy lists (pair)
- Aggregate small pages until a full superpage is available
- filesystems own "buffer cache" pages
- an anonymous object owns its memory pages
- anonymous objects don't have a filesystem
- pages may be both mapped and cached (copy on write)
- management of swap space
- page cluster
- p 315 (book reference picture)
- GEOM handles the disk IO
- Can do all sorts of stuff: RAID, striping, etc.
- The CAM layer handles disk I/O in a generic way
- Individual drivers are much smaller
- Based on SCSI but extended to ATA and SATA
- File Descriptor
- An array of bytes
- Move the file offset
- Read at the offset
- Write at the offset
- Access to other hardware
- Terminals
- Tape drives
- Printers
- Disks
- Modems
- Pipes
- Reliable byte stream
- Per-process file descriptor index (integer array)
- One per open file descriptor, shared across dup()
- VNODE
- SOCKET
- PIPE
- FIFO
- KQUEUE, MQUEUE, ETC
- Use select() or poll() system call
- Useful when the I/O usually blocks
- Expensive when I/O succeeds
- Stateless
- Use non-blocking mode
- Useful when the I/O usually succeeds
- Expensive when I/O blocks
- Use signal driven I/O
- Useful for infrequent stuff
- Expensive for high-volume I/O
- Context switch for each signal
- Use kqueue to monitor kevent
- Useful for monitoring many events
- Only in BSD and MacOS
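A minimal kqueue sketch of the register-then-poll model above: register interest in read events on stdin once, then block in kevent() until data is ready.

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct kevent change, event;
	int kq = kqueue();

	if (kq < 0)
		return (1);
	/* Register once: identifier = fd 0, filter = EVFILT_READ. */
	EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
	if (kevent(kq, &change, 1, NULL, 0, NULL) < 0)
		return (1);

	/* Poll: blocks until stdin becomes readable. */
	int n = kevent(kq, NULL, 0, &event, 1, NULL);
	if (n > 0)
		printf("%d bytes ready on fd %d\n",
		    (int)event.data, (int)event.ident);
	close(kq);
	return (0);
}
```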
- Select will sit on top of the underlying bottom half to ask about states
- Pool of mutexes
- Select on async devices
- Select notification when I/O becomes possible
- selwakeup() is passed the selinfo, which is used to traverse the linked list
- The device removes the horizontal list; the thread removes the vertical
- General event notification facility
- Stateful interface
- Register events of interest
- May modify the event (add, delete, enable, disable)
- Tracked events are deleted when resource is gone
- Poll for events using kevent()
- Supported kevents on FreeBSD
- R/W as in select
- AIO Async I/O events
- VNODE files information has changed
- PROC status of processes
- SIGNAL posted to processes
- TIMER an event-based timer
- USER application defined events
- Each event has an identifier
- Each event specifies a filter flag
- An event may be another kqueue descriptor
- Hash list for free-form identifiers
- Array indexed by descriptor number
- List of pending events
- Non-block flag is stored in the file table
- IO vector
- Increment Offset
- Decrement Resid
- Segment flag tells you where the I/O is going to or coming from
- USERSPACE
- SYSSPACE
- NOCOPY
- Pointer to the thread that requested the I/O
- system call
- open/close, read/write, link/unlink
- file table
- open/close, read/write, ioctl
- virtual node (vnode)
- VOP_
- inode
- namei, iget/iput
- block I/O
- bread/bwrite
- type
- pointer to type specific control block
- flags
- reference count
- pointer to operations
- pointer to containing file system
- list of associated clean buffers
- list of associated dirty buffers
- count of pending I/O operations
- Allows but does not require state
- Steps in translating pathname
- Single system-wide vnode table
- vn_inactive() called when references return to zero
- vn_reclaim() recycles the vnode
- General set of utility routines is provided to manage vnodes for all filesystems
- Single system-wide soft-reference name cache with common routines
- cache_lookup()
- cache_enter()
- cache_purge()
- supports negative caching
- /etc/mtab info is now maintained in the kernel
- statfs(2)
- getfsstat(2) returns what is mounted
- getmntinfo(3) is a saner interface than the above
- Generic mount information can be updated at any time
- Getting buffers
- bp = bread(vnode_ptr, file_offset, size);
- Releasing buffers
- brelse(bp)
- bwrite(bp)
- bawrite(bp)
- bdwrite(bp)
- Born at UCLA
- allows arbitrary directories to be mounted anywhere else
- implemented as a filesystem layer
- basis for other filesystem implementations
- Merge filesystems together instead of hiding the underlying fs
- Remove files from lower layers
- a whiteout is written in the top layer
- Removed directory must be marked as opaque for whiteout of directories
- it basically deals with naming
- all other operations are passed down to the lower layers
- Character devices
- pseudo-terminals
- frame buffer
- printer
- Network devices
- network interfaces
- Disk devices
- disk management (GEOM)
- filesystem interface
- Historic device nodes
- /dev filesystem
- devd
- Operations supported by any character device
- Used by tty devices
- Divided into 2 parts
- One for things going down from the kernel
- One for things coming out of the device
- Programmatic interface to terminal drivers
- All state contained in one structure
- struct termios{}
- tcgetattr()
- tcsetattr()
- command flags passed to tcsetattr()
- TCSANOW - make change immediate
- TCSADRAIN - drain output, then change
- TCSAFLUSH - drain output, flush input
- TCSASOFT - only interested in soft state
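A small sketch of the tcgetattr()/tcsetattr() interface above: save the current termios state, switch the terminal to raw mode, and restore it afterward.

```c
#include <termios.h>
#include <unistd.h>

int
main(void)
{
	struct termios saved, raw;

	if (tcgetattr(STDIN_FILENO, &saved) < 0)
		return (1);
	raw = saved;
	cfmakeraw(&raw);			  /* clear canonical/echo bits */
	tcsetattr(STDIN_FILENO, TCSADRAIN, &raw); /* drain output, then change */

	/* ... read single keystrokes here ... */

	tcsetattr(STDIN_FILENO, TCSANOW, &saved); /* immediate restore */
	return (0);
}
```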
- Historically a side effect of open
- Now an explicit ioctl: TIOCSCTTY
- Only session leaders can acquire controlling terminal
- only one per session
- SIGHUP when the controlling process disappears
- revoke the vnode that references the pty
- Entry points in the ifnet structure
- if_ operations
- Control messages sent using if_ioctl
- SIO* various operations of the network interface
- D buffer owned by the device; may be filled
- K buffer owned by the kernel; may be read
- When K is done the kernel stops; when D is done the network drops packets
- GEOM layer
- disk partitioning and labelling
- disk aggregation
- collection of I/O statistics
- I/O optimization such as disk sorting
- GEOM mirror layer
- mirror underlying partition
- can be inserted to copy a filesystem
- Downwards is handled by g_down thread (single threaded)
- Upwards is handled by g_up thread
- Module cannot sleep or compute excessively
- Modules providing locking can allow direct dispatch
- When the provider goes away, error is propagated
- When provider changes, change is propagated up the stack
Organization of buffers awaiting I/O for a disk
- currently working on buffer for block 25
- next buffer to be processed is at block 40
- blocks to be done in next pass start at switch_point_list
- Acquire a queue mutex and block on CPU
- Place request on queue
- If device is idle, start it
- If synchronous request, sleep
- Release mutex
- Acquire the mutex
- Pop next request from queue
- Release the mutex
- Do required work
- If synchronous, wakeup thread(s)
- If asynchronous, free resources
- Disk can connect in several ways
- Non-disk USB and Firewire devices bypass CAM
- CAM Layer handles disk devices
- routing through I/O busses to the drive
- PCI
- Cardbus
- Firewire
- tag queueing
- serializing requests
- allocating and freeing DMA resources
- routing through I/O busses to the drive
- A DMA map works similarly to a TLB, but is much simpler
- policy
- support widest possible range of CPU types and configuration
- minimize information that must be specified in advance
- strategy
- decode CPU type
- check memory controller type
- find types and locations of possible main I/O busses
- descend I/O busses, testing for presence of controllers and devices
- attach and configure discovered controllers and devices
- device_probe
- device_identify
- device_attach
- device_detach
- device_shutdown
- device_suspend
- device_resume
- Bid process for drivers
- Highest bid will get it
- probe - bid - attach
- the pci0 device scans its bus and finds an ATA disk controller
- scanning its drivers, the "atapci" driver wins the auction
- Sample autoconfiguration hierarchy (devinfo)
- full virtualization lets guest operating system run without change
- paravirtualization requires guest operating system to use virtualized devices and CPU features
- virtio model splits the device driver
- guest operating system front end sends device commands to hypervisor
- hypervisor back-end operates physical devices
- communication is often done using shared memory rings
- FreeBSD embeds bhyve to provide virtualized devices, versus Xen, which is a standalone hypervisor that allows guest operating systems to interact with each other
- Bhyve exposes a virtual PCI bus to the guest
- guest can probe and attach back-end devices
- grant tables are shared memory used to communicate between a guest operating system and Xen or another guest
- grant table operations
- permit_access
- accept_transfer
- may limit export or import page to read-only
- grant table states
- reading
- writing
- all entries start on a 4-byte boundary
- active entries are linked together
- It is a place to put metadata about the file
- Header of each attribute has:
- 4-bytes length
- 1-byte namespace class
- 1-byte content pad length
- 1-byte name length
- name padded so that the contents start on an 8-byte boundary
- Advisory file locking
- no lock
- shared access
- exclusive access
- locks are not tied to file access mode
- locks may be ignored
- locks may change after the file is open
- each "uid" may have quotas associated
- quotas may be specified by filesystems
- quota may be imposed on both users and groups
- quota policy
- Users should stay below their "quota"
- process descriptor
- open file entry
- vnode
- inode
- disk
- inode
- data
- inode
- vnode
- open file entry
- Contains all the metadata about the file
- Symlinks stored directly in the inode
- direct blocks
- single indirect (pointers to pointers to data)
- double indirect (pointers to pointers to pointers to data)
- triple indirect (one more level of indirection; see the sketch below)
- align operations to block boundaries
- inodes map logical block numbers to disk blocks
- write to system buffer and copy back to blocks
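A sketch of how the pointer scheme above maps a logical block number to a direct or indirect level; NDADDR = 12 direct pointers matches classic FFS, and the pointers-per-indirect-block count used here is an assumption for illustration.

```c
#include <stdio.h>

#define NDADDR	12		/* direct block pointers in the inode */

static const char *
level_for(long lbn, long nptrs)
{
	if (lbn < NDADDR)
		return ("direct");
	lbn -= NDADDR;
	if (lbn < nptrs)
		return ("single indirect");
	lbn -= nptrs;
	if (lbn < nptrs * nptrs)
		return ("double indirect");
	return ("triple indirect");
}

int
main(void)
{
	/* Assume 8KB blocks with 8-byte pointers: 1024 pointers/block. */
	long nptrs = 8192 / 8;

	printf("block 10 is %s\n", level_for(10, nptrs));
	printf("block 2000 is %s\n", level_for(2000, nptrs));
	printf("block 2000000 is %s\n", level_for(2000000, nptrs));
	return (0);
}
```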
- Superblock - static parameter of the filesystem
- block size
- fragment size
- disk characteristics
- layout policy
- Cylinder group map - dynamic parameters of the cylinder group
- free blocks and fragments
- free inodes
- allocation summaries
- recent allocation history
- Inode - per file information
- type/owner
- access/mode times
- array of pointers to data blocks
- Data blocks
- Time/space tradeoff
- Time - big blocks reduce the number of reads/writes to access a file
- Space - big blocks waste space for small files
- UNIX files are predominantly small, so use a hybrid block size
- Maintain system characteristics
- time to schedule an I/O operation
- ability to chain I/O operations
- Use this information to do block layout ASSUMING sequential access.
- Laying out blocks in a rotationally optimal fashion.
- need to localize allocation of resource that are accessed together
- need to spread out unrelated allocations
- Inode layout policy
- localize
- try to place all files contained in a directory in the same cylinder group as the directory
- spread out
- Place new directories in a cylinder group with a greater-than-average number of free inodes but the fewest directories. Cluster leaf directories.
- localize
Data block layout policy
- Localize
- Place data blocks together in the same cylinder group as the inode resides
- Spread
- Redirect block allocation for a file to a new cylinder group every ten to twenty Megabytes
- Policy aim
- Keep some free blocks in every cylinder group
- Localize all references when many small files
- Size of a file is not known when the file is opened
- first 4% of the data area of each cylinder group is held for metadata
- Use requested block
- Use closest forward block within cylinder group
- Use quadratic rehash to pick a different cylinder group
- Brute force search of cylinder group
To maintain reasonable options, 5%-8% of the disk is kept free
- Metadata must be consistent
- directories
- inodes
- bitmaps
- Rules
- Never point to a structure before it is initialized
- Never reuse a resource before nullifying all previous pointers to it
- Never reset an old pointer to a live resource before the new pointer has been set
-
Keep metadata consistent 1
- Synchronous writes
- simple and effective
- create/delete intensive is slow
- Non-volatile RAM
- usually runs all operations at memory speed
- expensive hardware, unavailable on many machines
- Atomic updates (journaling and logging)
- create/remove do not slow down under heavy load
- extra I/O generated, little speed-up for light loads
-
Keep metadata consistent 2
- Copy-on-write filesystems (LFS, ZFS, WAFL, etc)
- write throughput, cheap snapshots, always consistent
- disk fragmentation, memory overhead
- Soft updates
- most operations run at memory speed, reduced system I/O, instant recovery after a crash
- complex code, background fsck, and increased memory loading
- Only need to journal operations that orphan resources
- Journal needs only 16Mbyte independent of filesystem size
- Filesystem operations that require journaling
- Increased link count
- Decreased link count
- Unlink while referenced
- Change of directory offset
- Cylinder group updates of freed blocks and inodes
- Additional soft updates tracking
- Reclaiming journal space
- Ordering constraints
- log of the location in the directory that has the name to be deleted
- the filename in the on-disk copy of the directory must be deleted
- log the blocks to be deleted and deallocate the file, zeroing out its on-disk dinode
- the blocks formerly referenced by the inode for the file must be released to the free-space bitmap
- log the successful completion of removal
- Crash recovery is done by fsck
- Recovery steps
- Scan journal
- Link count increases
- Link count decreases
- Free inodes with zero link count
- Free inodes that were unlinked but busy
- Free unallocated blocks
- Create a copy-on-write image of a filesystem image
- suspend processes initiating system calls that modify the filesystem
- allow all modifications in progress to complete
- write out all dirty buffers to disk
- create an empty snapshot file the size of the filesystem partition
- mark the blocks that are currently in use
- resume write operations on the filesystem
- on each disk write, check whether the block has been copied, making a copy if the write is for an in-use block that has not yet been copied
- Never over-write an existing block
- It is always consistent
- State atomically advances at checkpoints
- Snapshots(read-only) and clones(read-write) are cheap
- Metadata redundancy and data checksums
- Don't need fsck
- Selective data compression and deduplication
- Filesystem shared pool
- Quota
- Mirroring and multiple parity RAIDZ
- ZFS is monolithic
- GEOM used to provide access to disks
- GEOM used also to present zvol as disks
- Multiple layers
- Meta-object-set
- Object-set
- Array of inodes or dnodes
- The uberblock anchors the pool
- Meta-object-set (MOS) describes array of filesystems, clones, snapshots, and ZVOLs
- Each MOS object references an object-set that describes its objects
- Space map is at the MOS layer; if an object-set needs more space it just asks the MOS
- The uberblock describes the entire ZFS pool
- Areas are used as a circular queue
- Equivalent of inode in FFS
- Describe files and directories
- But also snapshots, clones, etc.
- A znode is embedded in the dnode when it describes a file
- Blocks range in size from 4 to 128Kbyte
- It is huge, much larger than an FFS block pointer
- Disk blocks are kept in the pool
- Huge picture page 532-535
- Clones can only be taken of a snapshot
- Can make multiple clones of the same snapshot
- Clones can be promoted to the filesystem
- Can also take snapshots and clone ZVOLs
- Checkpoint affects all filesystems, clones, snapshots and ZVOL
- Checkpoints are taken when one of:
- five seconds have passed
- 64Mb of dirty data accumulate
- administrative task (like a snapshot)
- To take a checkpoint
- finish in-progress system calls
- flush all dirty data
- complete by writing new uberblock
- Handled by ZIL
- Used to record changes between checkpoints
- Not only metadata but content (it is a log, not a journal)
- Data over 32KB is written to disk and a pointer put in the log
- Traditional RAID has fixed stripe size
- RAIDZ has variable stripe size
- rebuilt by traversing all MOS objects and rebuilding their blocks
- never need to recalculate and write parity
- Manages the space in the pool
- It is a bitmap
- Allocation proceeds as:
- Find disk with the most free space
- select the least-fragmented fixed-size map
- allocate space closest to previous allocation
- SPA also handles
- compression
- deduplication
- take a checkpoint
- allocate new dnode in the MOS
- copy the block pointer from the fs dnode to the snapshot dnode
- link the snapshot dnode into the head of filesystem's snapshot list
- move the filesystem deadlist to the snapshot
- Blocks are tracked using space maps, birth time and deadlist
- When a block is allocated, its birth time is set to the current transaction number
- Over time snapshots are taken which also reference the block
- When a file is overwritten, truncated, or deleted, its blocks are released
- For each freed block, the kernel must determine if a snapshot still references it.
- if born after most recent snapshot, it can be freed
- otherwise it is added to the filesystem's deadlist
- It is done on a pool-wide basis
- Blocks are identified as duplicates by hash (using the checksums)
- backup of pool either locally or remotely
- can do full backup
- stored as a single file
- expanded back to its original form
- can do an incremental backup since a backup of an earlier snapshot
- tracks the progress of partially completed backups
- high write throughput
- fast RAIDZ reconstruction on pools with less than 40% usage
- avoid RAID "write hole"
- blocks move between filesystem as needed
- integration eases administration (mount points, exports, etc)
- Slowly written files scattered on the disk
- That is why it uses a lot of memory
- Slow reconstruction on pools with greater than 40% usage
- Block cache must fit in the kernel's address space, thus works well only on 64-bit
- Needs under 75% utilization for good performance
- RAIDZ has high overhead for small block size such as 4kb blocks typically used by databases and ZVOLs
- blocks cached in memory are not part of the unified memory cache, so it is inefficient for files and executables using mmap() or sendfile()
- Widely available
- Divided into two halves
- fileserver exports locally attached disks to client hosts
- client hosts that import filesystems from central server
- NFS basically started open source
- Sun promoted connectathons to make sure every implementation was interoperable
- stateless protocol
- benefits
- can continue to operate when some systems fail
- fast recovery when failed components return
- drawbacks
- sync writes
- high network traffic
- stateful protocol
- state is bounded to speed recovery
- benefits
- can continue limited operations when some systems fail
- bounded recovery time when failed components return
- minimal network traffic validating clients cache
- less need for separate lock and mount protocols
- sessions (added in nfs 4.1) ensure idempotency of requests
- drawbacks
- sync writes
- server required to maintain and manage states
- file handles
- created by server to identify files when first looked up
- used by client in all NFS requests after open
- file handle contents
- filesystem identifier
- inode number
- generation number
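A hedged sketch of the file-handle contents just listed; the field names are illustrative, not FreeBSD's actual fhandle_t layout. The generation number lets the server reject stale handles for an inode number that has since been freed and reused.

```c
#include <stdint.h>

/* Illustrative NFS file handle: opaque to the client, meaningful
 * only to the server that created it. */
struct nfs_fhandle {
	uint64_t fsid;		/* which exported filesystem */
	uint64_t fileno;	/* inode number within that filesystem */
	uint32_t gen;		/* generation number: incremented when the
				 * inode is reused, so old handles go stale */
};
```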
- rpcbind (all)
- mountd (server)
- nfsd (server)
- nfsiod (client)
- lockd (all)
- statd (all)
- good performance on the internet
- strong security with negotiation built into the protocol
- good cross-platform interoperability
- designed for protocol extension
- provide protocol support for clustered-server deployment
- compound RPCs
- lock operations included in the protocol
- explicit open and close operations
- pseudo-filesystem hierarchy to support multiple roots
- extensive attribute support
- required attributes (16)
- recommended attributes (43)
- extensible attributes (key/value pairs)
- access control list
- delegation and callbacks (leases)
- rpcbind (all)
- nfsd (server)
- nfscbd (client)
- nfsuserd (client)
- after client crash
- client attempts first mount from server
- server finds previous client state (if not timed out yet) and releases it
- after server crash
- server sends BAD_SESSION replies to out-of-date client IDs
- client realizes the server has crashed and reestablishes its state
- server gives clients time (default 15 seconds) to rebuild state before handing out new state
- Domain A has P1 and P2 processes
- Domain B has P2, P3 and P4
- Protocols used for Addressing Domains
- Domain addressing is the problem of finding a process within a domain
- No centralized authority
- Domain based distributed name server
- gethostbyname() does an RPC to a name server daemon, usually run on a gateway
- BIND was made by one of his students, who got an A because it was deployable
- Came about to create a standard way for interoperability
- OSI layers came out to standardize the communication protocol layers
- The Internet in the EU was born from a guy buying modems and distributing them to friends, before Internet protocols (UUCP)
- mbuf designed to hold network data
- socket, bind, listen, write, read, etc…
- local domain
- IPv4
- IPv6
- Outline of a simple "remote write"
- sockets
- protocols
- net interface and hardware
- it is the absolute minimum you can get away with
- layer 4 transport, datagram protocol
- demultiplex on port number
- optional checksum
- layer 4, connection oriented
- provides everything that UDP does not
- flow control (windowing)
- port to port connection
- acknowledgements
- may be piggybacked or coalesced
- 3-way handshake on open, 2-way handshake on close
- Reliable multiple-stream transport protocol
- provides associations between up to 65k independent streams
- good for multi-streams applications, no need for multiple handshakes
- layer 3
- designed for packet-switched network
- both IPv4 and IPv6
- host-to-host addressing
- routing
- time to live
- only IPv4
- fragmentation and reassembly
- type of service
- options: source route, record route, timestamp
- limited to 224 bytes of data
- can have data written at the beginning and at the end of the buffer
- use the mbuf pool, allocated in a separate page map
- pages are allocated when the mbuf is allocated and returned free of last reference
- mbuf data segment was converted into other fields
- mbuf data sits outside the mbuf for larger data sizes
- external mbufs
- stream socket
- packets are linked to the next
- no boundaries without application interaction
- datagram socket
- are record based
- if you read a datagram you get the entire datagram
- 13 is the number of root servers because that is how many fit in a UDP packet
- pru*
- ifnet
- pf1_addr
- pf2_addr
- driver, one per device type
- link-layer drivers are like normal drivers but easier
- there is not much in the link layer that really had to be done
- input
- common routines, copy or map data into mbufs
- save interface id with packet
- IHL = IP Header Length
- number of 32-bit words after the header
- ID field is mostly used by fragmentation
- TTL
- Header checksum is for the IP header only
- 3 layer routing, network -> subnet -> host
- 22-bit network
- Class D reserved for multicast
- Interesting how multicast state is created on gateways along the path from the publisher
- Gateways have to have multicast support if they are doing multicast
- IGMP
- add, leave group
- programming interface: setsockopt/getsockopt
- on Ethernet group address is added to interface filter
- Multicast agent
- listens to IGMP
- forwards multicasts to remote members
- Problem
- how to map network-protocol-level addresses to a link-level address
- Solution
- broadcast an "ARP" request containing the host being sought
- all the machines on the wire get the request
- the machine that matches sends a response back to the broadcaster containing the link-level address to use
- the network keeps a cache of ARP translations (in the routing table)
- 128-bits was planned way ahead
- Autoconfiguration
- router discovery
- neighbor discovery
- Security (IPSec)
- Version
- traffic class and flow
- next-header used to describe the type of routing among a lot of other things like authentication and encapsulation
- Router discovery
- send router solicitation message to router "anycast" address
- receive router advertisement messages containing the router IP address and router link-level address
- mimicked in IPv4 by DHCP
- Neighbor discovery
- send neighbor solicitation
- receive neighbor advertisement containing the neighbor IP address and neighbor link-level address
- Mimicked in IPv4 by ARP
- part of IPv4, IPv6 has something similar
- error messages (include header of problem + 8 bytes)
- destination unreachable
- service unavailable
- time-to-live expired
- header problem
- control messages
- redirect
- source quench
- low level network functions
- echo (ping)
- net address request/response
- net mask request/response
- timestamp request/response
- An ICMP redirect causes direct modification of the kernel route, changing the gateway address
- Redirects on FreeBSD accepted only from current router
- First thing is to check if forwarding is enabled
- check destination, no loopback, no broadcast
- routing lookup
- if from the same wire, ICMP redirect message will be sent
From the 1980s:
- most machines should only need one network interface
- network topology should not be visible to users
- routing should have dynamic reconfiguration
- multi-interface machines should potentially be able to act as gateways
Separation of policy and mechanism
- kernel implements mechanism
- intelligent routing process determines policy
- simple hosts may be simple
- sophistication should be easily configurable
- routing can be arbitrarily complex
- User level process: routed
- routing daemons exchange information to determine best routes
- routing daemons add, delete and replace routes in the kernel table to reflect best routing
- kernel limited to looking up routes in its table
- all routes are in a single hierarchical table organized as a radix tree
- routes include variable-length destination, mask and host or gateway
- routing table holds/caches round-trip time, hop count, packet size
- link layer may allocate space for link-layer information/cache (ifaddr structure augmented with link-layer calls for route management)
- "cloning" routes create host-specific routes dynamically; may use external resolution agents
Example radix tree:
- circles are internal node
- bit position tested is shown in circle
- leaf nodes are rectangles containing a key and mask
- some interior nodes are associated with masks
- Picture here of radix tree routing algorithm
- Callers set destination in route structure, then rtalloc() looks up an rtentry, returning a reference
- rtalloc1() does a lookup, optionally cloning the target route if necessary
- Routing ioctl commands replaced with message-oriented routing socket
- Changes to the kernel routing table are requested by messages via routing sockets
- Responses include message ID, errno, are sent to all listeners
- Routing socket also reports other routing changes, redirects, lookup failures, interface up/down
- DHCP
- assigns IP address
- installs route to gateway
- Routed (RIP)
- dynamic local routing
- best strategy for clusters of local networks
- Gated
- OSPF
- EGP
- BGP
- dynamic local routing using variant of Xerox NS routing information protocol
- Bellman-Ford (vector-distance) algorithm
- Gateways broadcast destination and hop for known routes
- Routes may "count to infinity" when destinations become unreachable; infinity is chosen to be 16
- Monitors network links for proper function, drops routes through down links
- On host with one interface no updates are sent
- Open Shortest-Path-First routing protocol
- shortest path first, or link-state routing algorithm
- each router maintains topology database, monitor link state
- two-level hierarchy: routing areas connected by backbone
- internal routers have complete topology for their area
- backbone routers have complete information about areas and backbones
Goals:
- Authentication - knowledge of a key proves you are who you claim to be
- Confidentiality - data is encrypted across the connection
- Replay protection - packets are sequenced at the application level so that replay attempts can be detected
Key:
- AH - authentication header
- ESP - encapsulating-security payload
- SPI - security-parameter index
- Router to router connection
- Encryption happens only on both connected routers
- Host to router connection is possible, the host has to have the key
- Keys stored in kernel managed with key socket with messages:
- getspi
- update
- add
- delete
- get
- acquire
- register
- expire
- flush
- dump
- Packet processing frameworks used to:
- implement firewalls
- NAT
- debug network problems
- provide framework for network-research testbeds
- Available filter frameworks:
- BPF - identify and copy selected packets
- IP firewall (IPFW) - modify or drop selected packets
- Dummynet - traffic shaping
- Packet Filter (PF) - drop selected packets
- Netgraph - provides data-flow between network processing modules
- Netmap - direct access to network packet rings
- Generally used via tcpdump, which uses pcap to express filters
- Scans all incoming packets and copies those that match the selection criterion
- Never drops or modifies packets but does not allow packets to be injected into input stream
- pcap compiles a high-level description to simple byte codes (which a JIT turns into native machine code)
- Can avoid copying by using shared user-space/kernel buffers
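A minimal pcap sketch of that flow: compile a high-level filter expression to BPF byte codes and attach it; "em0" is a placeholder interface name.

```c
#include <pcap/pcap.h>
#include <stdio.h>

int
main(void)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;
	pcap_t *p = pcap_open_live("em0", 65535, 0, 1000, errbuf);

	if (p == NULL)
		return (1);
	/* pcap compiles this expression into BPF byte codes. */
	if (pcap_compile(p, &prog, "tcp port 80", 1,
	    PCAP_NETMASK_UNKNOWN) < 0 ||
	    pcap_setfilter(p, &prog) < 0)
		return (1);

	struct pcap_pkthdr *hdr;
	const u_char *data;
	if (pcap_next_ex(p, &hdr, &data) == 1)	/* copy of one matching packet */
		printf("captured %u bytes\n", hdr->caplen);
	pcap_close(p);
	return (0);
}
```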
- Single central dispatch receives all incoming packets and each packet can be:
- passed through unchanged
- copied to additional packet streams
- diverted to an alternate packet stream
- subject to NAT
- reassembled for further inspection
- sent to dummynet and/or netgraph
- dropped
Decisions on what to do with packets are described by rulesets loaded by the system administrator
Packet-processing framework that provides:
- traffic shaping
- packet delay emulation
- packet scheduling
Pipe emulates a communications link with a scheduler arbitrating multiple independent queues that features:
- programmable bandwidth and delay
- selectable queue priorities
- variable number of queues
- scalable to many thousands of pipes and queues
-
Dummynet scheduling policies
- FIFO - constant running time with no service guarantees
- weighted round robin - constant run time with limited service guarantees
- quick fair queueing - constant run time with bounded service guarantees
- weighted fair queueing - logarithmic run time (in number of queues) with optimal service guarantees
- FreeBSD version is PFv1, PFv2 still to be ported
- Similar functionality to IPFW
- Developed under OpenBSD and ported to FreeBSD
- Packets are normalized then run through a set of rules provided by the system administrator
- Packets that pass the first set of rules run through another ruleset specific to their protocol type (TCP, UDP, ICMP, etc)
- Packets that pass second set of rules are passed through for normal processing
Built around a data-flow model:
- Nodes provide data processing
- Edges provide packet flow between nodes
Currently 50 nodes are available, ranging from a simple node that echoes its input to a complex node that implements Bluetooth
Nodes have a separate control channel to receive configuration (as well as input and output)
Nodes can be added as loadable kernel modules
-
Netgraph example
- Ethernet bridge
- network interface packet rings directly mapped into the application address space
- system calls to request packet transmission and be notified of packet reception
- access via command to/from: /dev/netmap
- NIOCREGIF command to attach to an interface
- attaching stops the flow of packets into the kernel stack
- software ring provided to selectively copy packets into stack and accept packets from the stack
- tail and head sync with kernel/hardware -> userland
- similar to the receive ring; sync with the kernel periodically to sync head and tail on the list
- TCP example
- No other connection may have the same tuple (local address, local port, foreign address, foreign port)
- Connections are kept in a hash table keyed on that 4-tuple
- Create control block
- Bind local address and port
- Set remote address (connect, sendto)
- source is chosen according to outgoing network interface after routing lookup
- Look up control block with matching address and port
- Deallocate control block
- so_snd.sb_hiwat - Socket buffer high-water mark
- maximum number of bytes accepted from the application
- so_snd.sb_cc - Socket buffer character count
- bytes already written by the application
- snd_wnd - Send window
- flow control uses the window; when it is full, no more data may be sent
- snd_nxt - next sequence number to send; data up to it has been sent but not yet acked
- All flow is controlled by sequence numbers
- The window cannot shrink, though its right edge can move forward
- Sequence numbers are negotiated during the handshake
- source port
- destination port
- IP already has the local and external address
- sequence number
- acknowledgment number
- data offset
- Urgent pointer only used when URG flag set
- TCP protocol limits options to 40 bytes
- Window scaling can be negotiated
- a scaling factor is picked by which advertised windows are multiplied (e.g., a shift of 3 makes an advertised 8192 mean 65536 bytes)
- Checksums
- Five tcp timers
- 2MSL (twice maximum segment lifetime), 60 seconds, used while waiting for close
- KEEP (keep alive) 75 seconds, send keep alive or drop dormant connection
- PERSIST (persist) 5 to 60 seconds, force a connection to persist
- REXMT (retransmit), 30 milliseconds to 64 seconds, called when retransmission is necessary
- DELACK (delayed acknowledgement), 100 milliseconds, send a delayed ack to the peer
- Idle
- no data has been sent that has not yet been acked, send queue is empty
- Transmitting
- data have been sent and not yet acked
- REXMT timer is running
- Persisting
- Window is zero or too small to send
- PERSIST timer is running
RTT timing
- A counter (rtt) is started when the first data segment is sent from the idle state
- The counter is sampled when the timed sequence number is acked
- Smoothed round-trip time (srtt) and variance (rttvar) are computed as a moving average of rtt samples using fixed-point arithmetic (a sketch follows this list)
- Sample is discarded if timed data is retransmitted
- If timestamps are in use, round-trip time of every packet can be measured
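A sketch of the classic fixed-point smoothing described above, with srtt kept scaled by 8 and rttvar by 4; the variable names are illustrative rather than the kernel's exact fields:

    /* update the smoothed rtt and variance with a new sample */
    void
    rtt_sample(int *srtt, int *rttvar, int rtt)
    {
        int delta = rtt - (*srtt >> 3);    /* error against smoothed rtt */

        *srtt += delta;                    /* srtt += delta/8, scaled form */
        if (delta < 0)
            delta = -delta;
        *rttvar += delta - (*rttvar >> 2); /* rttvar += (|delta|-rttvar)/4 */
    }

    /* retransmit timeout: srtt + 4*rttvar, with the scaling removed */
    int
    rexmt_timeout(int srtt, int rttvar)
    {
        return ((srtt >> 3) + rttvar);
    }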
- Opening 3-way handshake
- client sends SYN to server
- server sends SYN+ACK to client
- client sends ACK to server
- Closing 2-way handshake
- send FIN
- received ACK
- FIN_WAIT_2 until the other side does its own 2-way close
- During opening handshake each side defines:
- initial sequence number
- max segment size (MSS)
- window size
- Options that both sides must offer to be usable:
- window scaling
- timestamp request with units used
- selective acknowledgement (SACK)
- Socket in accept mode
- Complete and Incomplete list
- sonewconn1() duplicates head socket
- soqinsque() adds to a queue
- soqremque() removes from queue
- so_incomp links partially set up connections
- so_head links connections ready for acceptance
- Need to cope with syn-flooding attacks
The SYN cache is a hash table holding SYN state
- lookup when SYN received
- if not found create entry
- if found, resend SYN/ACK
- never ACK any data sent in initial SYN packet
SYN cookies are used when the SYN cache is full
- maintain no state
- encode the local and foreign addresses, ports, and MSS, and place the result in the initial sequence number of the SYN/ACK
- if timestamps are supported, also encode the send and receive window scaling and SACK support and place them in the timestamp field
- the peer must respond within 16 seconds with a valid ACK (subtract 1 from the sequence number, re-encode, and check for a match)
Metrics cached about a host
- rmx_mtu - MTU for this path
- rmx_ssthresh - outbound gateway buffer limit
- rmx_rtt - estimated rtt
- rmx_rttvar - rtt variance
- rmx_bandwidth - estimated bandwidth
- rmx_sendpipe - outbound delay bandwidth product
- rmx_recvpipe - inbound delay bandwidth product
Looked up at connection establishment
Updated at connection shutdown
Cached entries kept for one hour after last use
- Silly window syndrome - offering or accepting windows less than maximum segment size when max window is large
- Receiver silly window avoidance: if the offered window would be less than 1/4 of the receive buffer and less than a max segment, offer a zero window
- Sender silly window avoidance: if window is too small for all data to be sent, and window is less than max segment, wait for window update or persist timeout.
- Delayed ack is used to piggyback ack on window updates and application-level responses
- Delayed ack also allows multiple packets to be acked at once
- Implementation:
- Delayed-ack flag is set when new data arrives
- After process takes data, if window update would move window by 2 max-size segments, window update and ack are sent.
- If data is returned, the ack is piggybacked
- The DELACK timer runs every 100 ms and forces out acknowledgments not yet sent
- Problem: how to coalesce small segments into larger one without delaying transmission too long
- Obvious solution: set a "short" timer when the first character is received and transmit when the timer expires. (The timer isn't long enough to help on slow nets and is just long enough to irritate users on local nets.)
- Nagle algorithm: send immediately when a small amount of data is presented; then hold additional data until a full packet can be sent or until the first segment is acked
- Certain clients send small segments that receive no response but expect rapid action; an option is needed to defeat the algorithm (TCP_NODELAY, sketched below)
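A minimal sketch of defeating Nagle with the TCP_NODELAY socket option; fd is assumed to be an already-connected TCP socket:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* disable the Nagle algorithm so small writes go out immediately */
    int
    set_nodelay(int fd)
    {
        int one = 1;

        return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
    }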
- Van Jacobson congestion control
- Observation from Van Jacobson: in steady state, ACKs from the receiver should tell the transmitter when to send new data
- Problem: how to reach steady state.
- Slow start doesn't solve congestion
- Solution: maintain an estimate of the window size acceptable to the network (ssthresh)
- Whenever a packet is lost, ssthresh is set to half the current window
- Slow start opens the window exponentially up to ssthresh
- Window is opened linearly after slow start, probing network capacity
- The REXMT timer is set to its initial value when data is sent
- If timer expires:
- snd_nxt is set to snd_una
- ssthresh is set to half the current window
- cwnd is set to 1 segment
- tcp_output is called, transmits one segment
- Timer is restarted, backed off at exponential rate
- After 12 retransmissions the connection is dropped; in fast recovery, one additional packet is sent for each additional duplicate ack (window arithmetic sketched below)
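A sketch of the window arithmetic described above, applied on each incoming ACK and on a retransmit timeout; the names mirror the discussion rather than exact kernel fields:

    /* per-ACK growth: exponential in slow start, ~1 segment/RTT afterward */
    void
    cc_ack(unsigned long *cwnd, unsigned long ssthresh, unsigned long maxseg)
    {
        if (*cwnd < ssthresh)
            *cwnd += maxseg;                  /* slow start */
        else
            *cwnd += maxseg * maxseg / *cwnd; /* congestion avoidance */
    }

    /* on REXMT expiry: halve ssthresh, collapse the window to one segment */
    void
    cc_timeout(unsigned long *cwnd, unsigned long *ssthresh,
        unsigned long maxseg)
    {
        *ssthresh = *cwnd / 2;
        *cwnd = maxseg;
    }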
Fast retransmission
- the receiver keeps acking the last in-order packet until the missing packet is retransmitted
- SACK tells the sender which packets it has received beyond the currently ACK'ed data
- Acked to 500
- SACK indicates that 1000-2000 and 3000-3500 have been received
- Sender will respond by sending 500-1000 and 2000-3000, then resume at 3500
- Until FreeBSD 7, the kernel had a single congestion-control algorithm (New Reno, from 4.4BSD)
- Congestion control algorithms can now be added as a loadable kernel module
- Currently five available:
- New Reno (default)
- Vegas
- CUBIC
- Hamilton Institute's delay based
- H-TCP
- Each connection gets to choose its algorithm (using TCP_CONGESTION option to setsockopt())
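A sketch of that per-connection selection via the TCP_CONGESTION socket option; it assumes the corresponding module (e.g. cc_cubic) has been loaded:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* ask for CUBIC on this connection instead of the system default */
    int
    use_cubic(int fd)
    {
        const char *name = "cubic";

        return (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
            name, strlen(name)));
    }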
- Every time an ACK is received, the chosen algorithm is invoked with this connection information
Algorithms
- New Reno
- detection based on retransmission timer
- exponential backoff
- Vegas
- On each ACK measures the difference between expected and actual rate
- if difference below threshold, increase the congestion window
- if above, decrease
- if within threshold, make no change
- CUBIC
- Designed to quickly find network capacity on long fast links (a 1 Gbit US-to-Japan link has a 110 ms RTT and takes 10 minutes to reach capacity)
- SCTP: reliable multiple-stream transport protocol
- Provides associations between hosts with up to 65k independent streams
- Each stream can be message- or stream-based
- Associations setup with 4-way handshake to avoid SYN attack
- No server state required until third handshake
- Data can start flowing on third handshake
- Streams setup within an association do not require handshake
- SCTP packets look like TCP packets
- with the addition of multi-stream handling
- transmission sequence number
- stream identifier tells which stream
- stream sequence number tells which packet within stream
- though the length field is 16 bits, messages may be made up of multiple chunks
- SCTP allows an association to have multiple addresses/routes using sctp_bindx() and sctp_connectx()
- One route selected for use
- Backup routes send heartbeat packets to confirm viability
- If the selected route fails, switch to an alternate (a connect sketch follows)
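A hedged sketch of setting up a multihomed association with sctp_connectx(); the two peer addresses and the port are illustrative, and error handling is omitted:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    connect_multihomed(void)
    {
        int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
        struct sockaddr_in addrs[2];
        sctp_assoc_t assoc;

        memset(addrs, 0, sizeof(addrs));
        addrs[0].sin_len = addrs[1].sin_len = sizeof(struct sockaddr_in);
        addrs[0].sin_family = addrs[1].sin_family = AF_INET;
        addrs[0].sin_port = addrs[1].sin_port = htons(5001);
        inet_pton(AF_INET, "192.0.2.10", &addrs[0].sin_addr);    /* primary */
        inet_pton(AF_INET, "198.51.100.10", &addrs[1].sin_addr); /* backup */
        /* hand the kernel both peer addresses for a single association */
        return (sctp_connectx(sd, (struct sockaddr *)addrs, 2, &assoc));
    }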
- Firmware initializes the hardware and loads the first-level bootstrap
- The first-level bootstrap identifies the disk label and loads the second-level bootstrap (8 KB to 64 KB)
- Second level bootstrap interprets filesystem to load full bootstrap
- Final stage loads kernel and kernel modules and starts the kernel
- Kernel runs autoconfig and starts remaining CPUs
- Process 1 (/sbin/init)
- Initialize low-level hardware
- Optionally run hardware diagnostics
- Pre-boot services such as RAID configuration and remote management
- Select boot CPU and put remaining CPUs in suspend mode
- Build a list of attached I/O infrastructure to export to the kernel using ACPI or FDT
- Identify possible boot devices and select one
- Read in the first-level bootstrap from the front of the boot device and branch to it to start it running
- Provide rudimentary I/O capabilities until kernel can operate devices by itself
- Contains either a (historic) MBR or more modern GPT label
- Use the label to identify the location of the FreeBSD bootable partition
- Load first 64K of FreeBSD partition and branch to it to start it running
- gptboot to load from UFS
- gptzfsboot to load from ZFS
Actions taken by the second-level bootstrap
- enable access to full memory
- fetch boot.config and interpret any flags
- load /boot/loader
Full featured interactive loader
- choices of kernel to load
- inspect and set kernel env vars
- preload kernel modules
- preload security keys
- preload memory-disk images
- filesystem access
- Forth script interpreter
Default at startup is to run /boot/loader.4th
- load /boot/kernel/kernel and any modules in /boot/loader.conf
- at end of countdown, boot loaded kernel
Modules can be loaded into the kernel
- when it is compiled
- when it is loaded
- when it is running (if not disabled)
Modules register an event handler
- called at startup to initialize the module
- called at shutdown to clean up the module
Has a version number to allow field upgrades. Can register new system calls that it supports. (A module skeleton follows.)
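A minimal loadable-module skeleton showing the event handler and version registration; this is the canonical FreeBSD KLD pattern, with an illustrative module name:

    #include <sys/param.h>
    #include <sys/module.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    /* called at load to initialize and at unload to clean up */
    static int
    example_handler(module_t mod, int what, void *arg)
    {
        switch (what) {
        case MOD_LOAD:
            printf("example module loaded\n");
            return (0);
        case MOD_UNLOAD:
            printf("example module unloaded\n");
            return (0);
        default:
            return (EOPNOTSUPP);
        }
    }

    static moduledata_t example_mod = {
        "example",        /* module name */
        example_handler,  /* event handler */
        NULL              /* extra data */
    };

    DECLARE_MODULE(example, example_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
    MODULE_VERSION(example, 1); /* version number enabling field upgrades */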
Stages of kernel startup
- Assembly-language setup and enabling memory management unit
- Platform-specific C-language startup of architecture specific interrupt vectors, stacks, and crafting first kernel process and thread
- Set up basic kernel services: memory allocation, scheduling, locking
- Run through kernel module initialization to initialize higher-level services (filesystem, network stack, start /sbin/init)
Process 1 is forked from process 0 (the swapper)
Register with:
- SYSINIT(name, subsystem, order, function, identifier) to start (a sketch follows this block)
- SYSUNINIT(name, subsystem, order, function, identifier) to stop
Ordering based on 2-level hierarchy:
- subsystem is first level listed in /sys/sys/kernel.h
- order is second level to differentiate modules in same subsystem
Module invoked as function(identifier)
Sorted at startup, then executed in order
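A sketch of the registration itself (kernel code); the subsystem and order values are illustrative picks from /sys/sys/kernel.h:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    hello_start(void *ident)
    {
        printf("SYSINIT ran: %s\n", (char *)ident);
    }

    static void
    hello_stop(void *ident)
    {
        printf("SYSUNINIT ran: %s\n", (char *)ident);
    }

    static char hello_id[] = "hello";

    /* sorted by (subsystem, order) at startup, then run as function(identifier) */
    SYSINIT(hello, SI_SUB_KLD, SI_ORDER_ANY, hello_start, hello_id);
    SYSUNINIT(hello, SI_SUB_KLD, SI_ORDER_ANY, hello_stop, hello_id);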
Defined in /sys/sys/kernel.h
- SI_SUB_TUNABLES
- set subsystem parameters from loader tunable
- SI_SUB_VM
- early virtual memory
- SI_SUB_KMEM
- kernel malloc() initialization
- SI_SUB_HYPERVISOR
- Xen HVM hypervisor detection
- SI_SUB_WITNESS
- configure witness lock-debugging facility
- SI_SUB_MTX_POOL_DYNAMIC
- initialize dynamically allocated mutex pools
- SI_SUB_LOCK
- initialize statically declared or allocated locks
- SI_SUB_EVENTHANDLER
- register statically declared event handlers
- SI_SUB_VNET_PRELINK
- SI_SUB_KLD
- SI_SUB_CPU
- SI_SUB_RACCT
- SI_SUB_RANDOM
- SI_SUB_KDTRACE
- SI_SUB_MAC
- SI_SUB_MAC_POLICY
- SI_SUB_MAC_LATE
- SI_SUB_INTRINSIC - hand-craft process 0
- SI_SUB_AUDIT - create audit worker thread
- SI_SUB_CREATE_INIT - fork process 1
- SI_SUB_SCHED_IDLE - create idle thread
- SI_SUB_INTR - create device-driver threads
- SI_SUB_SOFTINTR - create callout thread, netisr thread
- SI_SUB_DRIVERS - GEOM thread, random device
- SI_SUB_CLOCKS - create and start system clock threads
- SI_SUB_KTHREAD_PAGE - create vm_pageout daemon thread
- SI_SUB_KTHREAD_VM - create vm_daemon thread
- SI_SUB_KTHREAD_BUF - create buf_daemon
- SI_SUB_KTHREAD_UPDATE - create vnlru_proc daemon
- SI_SUB_KTHREAD_UPDATE - create sched_sync daemon
- SI_SUB_KTHREAD_INIT - schedule the init process to run
When the init thread starts running in the kernel, it must get the root filesystem mounted
- Mount devfs filesystem as root ("/") filesystem
- Search through "rootdevnames" for a viable root filesystem
- Mount selected root filesystem on /root
- Swap filesystems: /root => / and / => /dev
Find and load /sbin/init
Exit the kernel into the init process
Load additional kernel modules
In single-user mode, init spawns a shell
In multi-user mode, init spawns a shell to run the commands in /etc/rc
- Check filesystems as needed and then mount them
- Run rcorder to select order that scripts in /etc/rc.d and /usr/local/etc/rc.d should run based on REQUIRE and PROVIDE declarations in each script
- Run the scripts using /etc/rc.conf and /etc/defaults/rc.conf to configure the subsystems to be run (such as sshd) and the options with which they should run
Shutdown happens in 3 steps
- Shutdown services that need to store data in the filesystem
- Shutdown the filesystem
- Shutdown of the remaining services
Services register using EVENTHANDLER_REGISTER(name, function, argument, priority)
- The name identifies which shutdown step
- shutdown_pre_sync
- shutdown_post_sync
- shutdown_final
- The priority works like the order parameter of SYSINIT() (a registration sketch follows this list)
- Event is invoked as function(argument)
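A sketch of registering for one of the shutdown steps (kernel code); the handler signature matches the shutdown events, which pass the howto flags, and the registration is done from an illustrative SYSINIT:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    static eventhandler_tag example_tag;

    /* shutdown handlers receive the registered argument plus howto flags */
    static void
    example_sync(void *arg, int howto)
    {
        printf("flushing state before filesystems sync (howto=0x%x)\n", howto);
    }

    static void
    example_init(void *unused)
    {
        /* the event name picks the shutdown step; priority orders handlers */
        example_tag = EVENTHANDLER_REGISTER(shutdown_pre_sync, example_sync,
            NULL, SHUTDOWN_PRI_DEFAULT);
    }
    SYSINIT(example_shutdown, SI_SUB_KLD, SI_ORDER_ANY, example_init, NULL);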
Tuning filesystems with tunefs
- -m minimum percentage of free space
- -e maximum number of blocks (2048 = 32 megabytes)
- -a enable or disable access control lists (disabled)
- -l enable or disable MAC
- -j enable or disable journaling
- -s specify average number of files in a directory (64)
- -f specify average file size (16384)
- -o optimization: minimize system time or minimize space used on disk (time)
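An illustrative invocation (hypothetical device name), run on an unmounted filesystem: tunefs -m 5 -o space /dev/ada0p2 lowers the free-space reserve to 5% and optimizes for space rather than time.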
System monitoring with systat
- Output of "systat -vmstat"
- Load average should be less than the number of CPUs
- Output of "systat -ifstat"
Ktrace and truss (the following focuses on ktrace)
- Provide kernel tracing logging
- system calls
- pathname translation
- signal processing
- I/O
- Tracing can be flexibly applied
- selectively for running processes
- for individual commands
- for process groups
- through inheritance to all current or future children
- Trace files are generated in binary format
- Translation to readable form is done by kdump
- Useful translations are optionally added
- system-call numbers to names
- ioctl values
- system errors with standard error string
- output can display running time between and during system calls
- trace example
- most information is available through sysctl() (a retrieval sketch follows this block)
- extensible MIB-style interface
- allows selective retrieval of kernel data structures
- process table
- routing table
- vnode table
- allows selective setting of kernel variables
- networking: IP forwarding, redirect, enable, time-to-live
- debugging: filesystem cluster enabling
- limits: maxproc, maxfiles, maxvnodes
MIB = Management Information Base
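A minimal retrieval sketch using sysctlbyname(3) to read one of the limit variables listed above:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int maxproc;
        size_t len = sizeof(maxproc);

        /* look the variable up by its MIB-style name */
        if (sysctlbyname("kern.maxproc", &maxproc, &len, NULL, 0) == 0)
            printf("kern.maxproc = %d\n", maxproc);
        return (0);
    }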
- Process structures are no longer contiguous; other info is scattered.
- Solution: define kinfo_proc structure:
- useful fields from proc
- xucred
- useful fields from vmspace
- session and pgrp info (login, pgid, tty)
- wchan message
- Interface provides selective retrieval of the process table (a sketch follows this list):
- by proc id
- by proc group id
- by session
- by effective user id
- by real user id
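A sketch of the by-process-id retrieval path via the MIB interface, fetching our own kinfo_proc:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>   /* struct kinfo_proc */
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kinfo_proc kp;
        size_t len = sizeof(kp);
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, (int)getpid() };

        /* ask the kernel for the kinfo_proc of this pid */
        if (sysctl(mib, 4, &kp, &len, NULL, 0) == 0)
            printf("comm: %s  pgid: %d\n", kp.ki_comm, (int)kp.ki_pgid);
        return (0);
    }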
- Goals
- Want all programs that access kernel tables to work on a running or dead kernel
- Want to keep machine dependent information in a single place
- Provide a "kernel virtual memory" (kvm) library
- uses sysctl for running kernels
- knows how to extract tables for dead kernel
- maintains a cache of kernel namelist entries for running kernel to avoid repeated nlist() calls
- Provided that programs do not use machine-dependent fields in the tables they extract, they are portable from one architecture to another once the kvm library has been ported (a usage sketch follows)
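A sketch of the library in use on the running kernel (compile with -lkvm); passing kernel and core file paths instead of the NULLs examines a dead kernel:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        int cnt;

        /* NULLs mean "the running kernel" */
        kvm_t *kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
            fprintf(stderr, "%s\n", errbuf);
            return (1);
        }
        struct kinfo_proc *kp = kvm_getprocs(kd, KERN_PROC_PROC, 0, &cnt);
        if (kp != NULL)
            printf("%d processes\n", cnt);
        kvm_close(kd);
        return (0);
    }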
FreeBSD does kernel debugging remarkably well
- Better than Solaris
- Better than Linux (mostly because of Linus's attitude of "read the code" and figure it out)
Low-level built-in debugger runs on console
High-level kernel debugging facility provides complete symbolic debugging using kgdb(1) or lldb(1)
- symbolically examine and modify memory locations
- set breakpoint in the kernel
- single step the kernel
Usage - enable the debugger when booting, then stop:
- at first instruction
- when debugger is attached
- on kernel trap/panic
Enable crash dumps
- location specified using dumpon(8)
- usually configured for primary swap area
- default is "mini-dump" which saves only kernel memory
Recovering a crash dump
- after reboot run savecore(8) to extract
- can then run kgdb(1) or lldb(1) to peruse the crash
- general hints given in crash(8)
- wealth of information collected using crashinfo(8) and textdump(4)
netstat -m -N kernel.80 -M vmcore.80
vmstat -m -N kernel.80 -M vmcore.80
vmstat -z -N kernel.80 -M vmcore.80
- ps -axlw
- fstat | more
- netstat -N kernel.80 -M vmcore.80
DTrace provides detailed information throughout the application stack
- system calls
- libraries
- application code
many handy one-line DTrace scripts: https://wiki.freebsd.org/DTrace/One-Liners
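One such one-liner, counting system calls by program name until interrupted (a standard example from that page):

    dtrace -n 'syscall:::entry { @[execname] = count(); }'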
- pmc
- details of all activity on a CPU
- activity of a process
- Things that can be measured:
- interrupts taken
- instructions executed
- branches executed and mispredicted
- number of non-idle processor cycles
- instruction cache misses and prefetches
- data cache misses, prefetches and writebacks
Using PMC
- pmc(3) describes types of counters
- pmccontrol -L - lists names of available counters for current machine
- pmcstat -T -S event-spec - gives a top-like display of the listed counters
- Understand the operating system
- While reading about it do not turn a page without understanding it completely
- Good observability tools will help you find unnecessary work
- They help you eliminate this unnecessary work and tune parameters
Recap tools:
- systat
- pidstat
- vmstat
- iostat
- truss
- tcpdump
- Anti-pattern
- Workload Identification
- who
- what
- why
- how
- USE Method
- Utilization
- Saturation
- Errors
- Off-CPU
- 100% of benchmarks are wrong
- Use only for experimental analysis
- Types:
- micro benchmarks
- macro benchmarks
- bad benchmarks
- Great for understanding what a benchmark is doing
- pmcstat
- dtrace
- pmcstat
- Cycles per instruction
- unhalted cycle count
- instruction count
- cycles per instruction = unhalted cycles / instructions
- dtrace
- truss
- ktrace
- DTrace
- sysctl -aW
- /etc/sysctl.conf
- Treat tunable parameters as someone else's medicine cabinet