- 1. Week 1
- 2. Week 2
- 2.1. Locking
- 2.1.1. Historic Synchronization
- 2.1.2. Lock Hierarchy
- 2.1.3. Turnstiles
- 2.1.4. Turnstiles implementation
- 2.1.5. Sleep queues
- 2.1.6. Sleep interface
- 2.1.7. Critical Sections
- 2.1.8. Hardware requirements for lock
- 2.1.9. Spin mutex
- 2.1.10. Blocking mutex
- 2.1.11. Pool Mutex
- 2.1.12. Reader and Writer locks
- 2.1.13. Read-Mostly lock
- 2.1.14. Shared-Exclusive locks
- 2.1.15. Condition variables
- 2.1.16. Locking Manager locks
- 2.1.17. Witness
- 2.2. Processes
- 3. Week 3
- 4. Week 4
- 4.1. Security
- 4.1.1. Security mindset
- 4.1.2. Trusted Computing Base
- 4.1.3. Process credentials
- 4.1.4. Set-User-Identification Functionality
- 4.1.5. Immutable and Append-only flags
- 4.1.6. Jails
- 4.1.7. Jails rules
- 4.1.8. Random number generation
- 4.1.9. Access control lists (ACL)
- 4.1.10. Access control list semantics
- 4.1.11. Privileges (Kernel only, no userland interface)
- 4.1.12. Privileges applied
- 4.1.13. Auditing
- 4.1.14. Auditing handling
- 4.1.15. Capsicum
- 4.1.16. Sample Capsicum capability
- 4.2. Virtual Memory
- 5. Week 5
- 5.1. Virtual Memory
- 5.1.1. Physical memory and swap space
- 5.1.2. Virtual Memory organization
- 5.1.3. Process Address Space
- 5.1.4. Process Address Space Sharing
- 5.1.5. Shadow object chains
- 5.1.6. Machine-dependent virtual-memory description
- 5.1.7. Kernel memory allocator hierarchy
- 5.1.8. Kernel Maps and Submaps
- 5.1.9. Vmem Data Structures (Arena)
- 5.1.10. The Slab Allocator
- 5.1.11. The Keg Allocator
- 5.1.12. The Zone Allocator
- 5.1.13. Fast Kernel malloc implementation
- 5.1.14. Interface look and feel
- 5.1.15. Types of pages
- 5.1.16. Paging strategies
- 5.1.17. Paging strategies (part II)
- 5.1.18. Superpage Motivation
- 5.1.19. Superpage Solution
- 5.1.20. Superpage Reservation
- 5.1.21. Superpage promotion
- 5.1.22. Superpage fragmentation control
- 5.1.23. Filesystem interaction with the virtual memory
- 5.1.24. Kernel structure
- 6. Week 6
- 6.1. I/O System Overview
- 6.1.1. Kernel I/O Structure
- 6.1.2. I/O Descriptors
- 6.1.3. Descriptor I/O Structure
- 6.1.4. File Entry
- 6.1.5. File Entry Types
- 6.1.6. Multiplexing I/O
- 6.1.7. Select Top-Level
- 6.1.8. Select Bottom-Level
- 6.1.9. Select Notification
- 6.1.10. Kqueues and Kevents
- 6.1.11. Types of kqueue events
- 6.1.12. Kevent notification
- 6.1.13. Kevent Implementation
- 6.1.14. Semantics of Non-Blocking
- 6.1.15. Uio Structure
- 6.1.16. Level of Filesystem Interface
- 6.1.17. Contents of a Vnode
- 6.1.18. Vnode interface - Name Translation
- 6.1.19. Vnode interface - Vnode Management
- 6.1.20. Vnode interface - Name Caching
- 6.1.21. Vnode linkages
- 6.1.22. Vnode Mount interface
- 6.1.23. Buffers
- 6.1.24. Stackable filesystems
- 6.1.25. Loopback mounts
- 6.1.26. Union mounts
- 6.1.27. Union mount naming
- 6.1.28. Union mount issues
- 6.1.29. Union implementation
- 6.1.30. Dead file system - DFS
- 7. Week 7
- 7.1. Devices
- 7.1.1. Special Devices
- 7.1.2. Device Naming and Access
- 7.1.3. Character devices entry points
- 7.1.4. Line disciplines entry points
- 7.1.5. Termios
- 7.1.6. Changing the termios structure
- 7.1.7. Allocation of controlling terminal
- 7.1.8. Revocation of controlling terminal
- 7.1.9. Network driver entry points
- 7.1.10. Network driver control messages
- 7.1.11. Network packet reception
- 7.1.12. Disk volume management
- 7.1.13. Mirroring a filesystem
- 7.1.14. GEOM Operation
- 7.1.15. Disk sorting
- 7.1.16. Communication from the top to the bottom of the kernel
- 7.1.17. From bottom to the top
- 7.1.18. FreeBSD Disk I/O Subsystem
- 7.1.19. Physical Drive Management
- 7.1.20. Autoconfiguration
- 7.1.21. Autoconfiguration Functions
- 7.1.22. Device driver bidding
- 7.1.23. Autoconfiguration data structures for pci0
- 7.1.24. Device virtualization
- 7.1.25. Bhyve virtualization
- 7.1.26. Xen virtualization
- 7.1.27. Xen Grant tables
- 7.2. Fast Filesystem
- 8. Week 8
- 8.1. Filesystems
- 8.1.1. Quota
- 8.1.2. Semantics of quota
- 8.1.3. Current UNIX Filesystems
- 8.1.4. Inode
- 8.1.5. A small filesystem
- 8.1.6. UNIX Filesystem
- 8.1.7. I/O in UNIX
- 8.1.8. Contents of a cylinder group
- 8.1.9. Optimizing storage utilization
- 8.1.10. Filesystem parametrization
- 8.1.11. Layout policies
- 8.1.12. Layout policies (part 2)
- 8.1.13. Dynamic block reallocation
- 8.1.14. Implementation of dynamic block reallocation
- 8.1.15. Optimized Metadata Layout
- 8.1.16. Implementation of allocation
- 8.1.17. Filesystem Consistency
- 8.1.18. Adding journaling to soft updates
- 8.1.19. Additional requirements of journaling
- 8.1.20. Tracking file removal dependencies
- 8.1.21. Journaled Crash recovery
- 8.1.22. Snapshots
- 8.1.23. Snapshot implementation
- 9. Week 9
- 9.1. ZFS filesystem
- 9.1.1. ZFS overview
- 9.1.2. ZFS Modules
- 9.1.3. Functional organization
- 9.1.4. Structural organization
- 9.1.5. Uberblock
- 9.1.6. Dnode
- 9.1.7. ZFS Block pointers
- 9.1.8. ZFS block management
- 9.1.9. ZFS Structure
- 9.1.10. ZFS Relationships
- 9.1.11. ZFS Checkpoints
- 9.1.12. Writing new data
- 9.1.13. Flushing dirty data
- 9.1.14. Logging
- 9.1.15. RAIDZ
- 9.1.16. RAIDZ Recovery
- 9.1.17. Storage pool allocator
- 9.1.18. Snapshots
- 9.1.19. Freeing filesystem blocks
- 9.1.20. Deduplication
- 9.1.21. Remote replication
- 9.1.22. ZFS strengths
- 9.1.23. ZFS weaknesses
- 10. Week 10
- 10.1. NFS
- 10.1.1. Sun Microsystems Network File System
- 10.1.2. NFS v3
- 10.1.3. NFS v4
- 10.1.4. NFS File identification
- 10.1.5. NFS v3 daemons
- 10.1.6. NFS v3 mount operation
- 10.1.7. NFS v3 I/O
- 10.1.8. NFS v4 goals
- 10.1.9. NFS v4 changes
- 10.1.10. NFS v4 daemons
- 10.1.11. NFS v4 mount operation
- 10.1.12. NFS v4 sessions
- 10.1.13. NFS v4 delegation/recall
- 10.1.14. NFS v4 crash recovery
- 11. Week 11
- 11.1. Networking and Interprocess Communication
- 11.1.1. Graph of communicating processes
- 11.1.2. Name server
- 11.1.3. The ISO networking model
- 11.1.4. Packet traffic
- 11.1.5. Socket system call interface
- 11.1.6. Types of network address
- 11.1.7. Use of IPC/Networking facilities
- 11.1.8. System layering
- 11.1.9. UDP
- 11.1.10. TCP
- 11.1.11. SCTP
- 11.1.12. IP
- 11.1.13. Mbufs
- 11.1.14. Mbufs continued
- 11.1.15. Socket data queuing
- 11.1.16. Socket interface protocols
- 11.1.17. Network interface addresses
- 11.1.18. Network interface to link layer
- 11.1.19. Net to protocol interface
- 12. Week 12
- 12.1. The Network Layer
- 12.1.1. Internet Addressing - IP Version 4
- 12.1.2. Internet Addressing - Subnets
- 12.1.3. IP Multicast
- 12.1.4. Address Resolution Protocol
- 12.1.5. Internet Addressing - IP Version 6
- 12.1.6. IPv6 addressing
- 12.1.7. IPv6 autoconfiguration
- 12.1.8. Internet Protocols - ICMP
- 12.1.9. ICMP Redirect processing
- 12.1.10. IP Forwarding
- 12.1.11. Routing goals
- 12.1.12. Routing design
- 12.1.13. Kernel routing table
- 12.1.14. Routing: radix search algorithm
- 12.1.15. Kernel routing lookup
- 12.1.16. Kernel routing socket
- 12.1.17. Routing strategies
- 12.1.18. Routing Daemon (routed)
- 12.1.19. OSPF routing protocol
- 12.1.20. IPSec Goals
- 12.1.21. IPSec transport mode
- 12.1.22. IPSec tunnel mode
- 12.1.23. IPSec key management
- 12.1.24. Packet-processing frameworks
- 12.1.25. Berkeley Packet Filter
- 12.1.26. IPFW
- 12.1.27. Dummynet
- 12.1.28. Packet filter (PF)
- 12.1.29. Netgraph
- 12.1.30. Netmap
- 12.1.31. Netmap receive ring
- 12.1.32. Netmap transmit ring
- 13. Week 13
- 13.1. The Transport Layer
- 13.1.1. Protocol Control Block
- 13.1.2. Internet Protocol control block
- 13.1.3. TCP Sequence space
- 13.1.4. TCP Packet header
- 13.1.5. TCP timers
- 13.1.6. TCP Output states
- 13.1.7. TCP Round-trip timing
- 13.1.8. Connection establishment
- 13.1.9. Socket connection queueing
- 13.1.10. Handling SYN attacks
- 13.1.11. Connection metrics
- 13.1.12. Silly window syndrome avoidance
- 13.1.13. Delayed acknowledgement
- 13.1.14. Small packet avoidance (Nagle algorithm)
- 13.1.15. Slow start
- 13.1.16. Acknowledgement clocking
- 13.1.17. Congestion avoidance
- 13.1.18. TCP retransmission
- 13.1.19. Selective acknowledgement
- 13.1.20. Modular congestion control
- 13.1.21. Modular congestion control (II)
- 13.1.22. Basic TCP send policy
- 13.1.23. Stream Control Transmission Protocol
- 13.1.24. SCTP chunks
- 13.1.25. SCTP multiple paths
- 14. Week 14
- 14.1. System startup and shutdown
- 14.1.1. Booting overview
- 14.1.2. Firmware bootstrap
- 14.1.3. First level bootstrap
- 14.1.4. Second level bootstrap
- 14.1.5. Final stage bootstrap
- 14.1.6. Kernel loadable modules
- 14.1.7. Kernel boot
- 14.1.8. Kernel module initialization
- 14.1.9. Kernel modules (part I)
- 14.1.10. Kernel modules (part II)
- 14.1.11. Kernel threads
- 14.1.12. Mounting the root and /dev filesystems
- 14.1.13. User-space startup
- 14.1.14. System shutdown
- 14.1.15. Performance and monitoring system tuning (tools)
- 14.1.16. Kernel tracing
- 14.1.17. Kernel trace display
- 14.1.18. Administration of kernel information
- 14.1.19. Getting process information
- 14.1.20. Program interface
- 14.1.21. Kernel debugger
- 14.1.22. Kernel crashes
- 14.1.23. Netstat memory statistics
- 14.1.24. System malloc memory statistic
- 14.1.25. System malloc memory statistics (part II)
- 14.1.26. System zone memory statistics
- 14.1.27. Example ps output
- 14.1.28. Example fstat output
- 14.1.29. Example of netstat output
- 14.1.30. DTrace
- 14.1.31. Performance monitoring counters
- 15. Week 15
- Info about Hillside
- Class times
- Man pages
- Outline
- Listed in the printed material
- Unix was about portability
- The VM is built upon 5 properties
- paged virtual addr space
- If you want to allocate more than physical memory
- software interrupts
- Define hardware events and turn it into a software event
- Catch signal interrupt and deal with things like division by zero
- Call your signal handler
- Unified interface for hardware events
- timers and counters
- identifiers for accounting, protection and scheduling
- descriptor
- Handle of some kind of I/O
- Before that, each thing had its own descriptor
- Fundamental to the unix philosophy
- Input/Output connected through pipes creates a pipeline of execution
- Process is made of 4 main pieces:
- CPU Time
- Scheduling made by kernel
- Asynchronous Events
- Interruptions have to be dealt with, signals, network traffic
- Memory
- How virtual memory is managed
- How to make a hardware provide an address space
- I/O Descriptors
- Any useful process is going to have I/O Descriptors
- Most fundamental service provided by UNIX
- CPU Time
- Top half of the kernel
- Top half has its own stack, kept resident in physical memory, so it is there whenever the kernel needs it
- More malloc and free in the kernel because only a limited stack is available
- Bottom half
- Scheduled by interrupts
- Top half will be preempted when the bottom half interrupts
- You can block in the top half, but generally don't block in the bottom half
-
Part 1:
- Hardware traps
- System calls
- Special case of hardware traps
- Hardware interrupts
- Async (disk io complete, user typed a key)
- Higher priority
- Software-initiated traps or interrupts
- Lower priority
-
Part 2: Switch to kernel mode
- Push PC and PSW into the kernel per-process stack
- Push trap type or syscall number onto the stack
- Save general purpose registers
- Call handler (syscall()) for a syscall, trap() for a trap
-
Timers
- 3 per-process interval timers, with an optional value to be reloaded when one expires
- ITIMER_REAL
- Decrements in real time
- Sends SIGALRM on expiration
- ITIMER_VIRTUAL
- Decrements in user virtual time
- Sends SIGVTALRM on expiration
- ITIMER_PROF
- Decrements in user + system virtual time
- Sends SIGPROF on expiration
- Select timer is independent from the interval timers
- Kqueue can also create timers
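A minimal userland sketch of the interval timers above, assuming nothing beyond POSIX: it arms ITIMER_REAL with setitimer(2) and catches the SIGALRM it delivers.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void
on_alarm(int sig)
{
	fired = 1;
}

int
main(void)
{
	struct sigaction sa;
	struct itimerval it;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigaction(SIGALRM, &sa, NULL);

	memset(&it, 0, sizeof(it));
	it.it_value.tv_sec = 1;			/* first expiration */
	it.it_interval.tv_sec = 1;		/* value reloaded on expiration */
	setitimer(ITIMER_REAL, &it, NULL);	/* decrements in real time */

	for (int i = 0; i < 3; i++) {
		pause();			/* wait for SIGALRM */
		if (fired) {
			fired = 0;
			printf("tick %d\n", i);
		}
	}
	return (0);
}
```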
-
Timeouts
- Arrange a timeout in the kernel
- handle = timeout(function, arg, ticks)
- Cancel it
- untimeout(handle)
- Compute the number of ticks
- ticks = hzto(timevalp)
- Timeout routines are called from softclock at splsoftclock
- When a lock is not available, sleep on it
- wchan (the wait channel) receives a pointer to the resource
- processes waiting for that specific resource will be awakened
- From the most simple to the most complex
- Hardware
- Spin Locks
- Locks that block briefly
- blocking mutex
- pool mutex
- reader-writer locks
- read-mostly
- Locks using the sleep-queue interface
- Witness
- sys/sys/proc.h
- data structures using locks to protect the access into the fields
- Designed for short periods
- Used to protect read and write
- Track current lock holder
- If you hold a short-term lock and go to sleep, the kernel will panic
- Priority propagation
- Hash header to quickly find locks
- A turnstile is needed each time a thread blocks
- A thread can only block on one thing
- Used by shared-exclusive locks, condition variables
- No propagation
- You are not allowed to own a turnstile when requesting a sleep-queue lock
- The queue tracks the exclusive lock holder
- May be recursive
- wchan and priority
- msleep
- sleep on lbolt (lightning bolt)
- wait syscall waits for children to exit and puts the parent to sleep in its own context
- sigpause
- The only way to exit a sigpause is a longjmp
- sleep implementation
- lock sleep hash header
- record wchan, hash to find the sleep queue
- set td_priority (for scheduling)
- place on appropriate sleep queue
- call sched_switch() to run the next thread
- sleep thread is not running until awakened
- if PCATCH is set check for signals
- wakeup implementation
- awaken ALL sleepers on wchan starting from the front
- wakeup_one
- awaken first sleeper on wchan
- To awaken a thread
- remove from sleep queue
- recompute kg_user_pri for long sleepers
- sleeping, make runnable, place on run queue
- stopped: "unsleep", remain stopped
- request CPU rescheduling by setting NEEDRESCHED
- protected by locks
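A hedged kernel-style sketch of the sleep/wakeup pattern just described; the softc structure and its fields are hypothetical, but msleep(9) and wakeup(9) are the interfaces the notes refer to.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct my_softc {			/* hypothetical driver state */
	struct mtx	mtx;
	int		data_ready;
};

static int
my_wait_for_data(struct my_softc *sc)
{
	int error = 0;

	mtx_lock(&sc->mtx);
	while (sc->data_ready == 0 && error == 0)
		/* wchan is &sc->data_ready; PCATCH lets signals
		 * interrupt the sleep, as noted above. */
		error = msleep(&sc->data_ready, &sc->mtx,
		    PCATCH, "mywait", 0);
	mtx_unlock(&sc->mtx);
	return (error);
}

static void
my_data_arrived(struct my_softc *sc)
{
	mtx_lock(&sc->mtx);
	sc->data_ready = 1;
	wakeup(&sc->data_ready);	/* awaken ALL sleepers on wchan */
	mtx_unlock(&sc->mtx);
}
```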
- Locking is not possible without assistance from hardware
- Test-and-set
- Very expensive instructions
- FreeBSD uses a compare-and-swap
- instead of a simple flag value, the lock word holds a pointer
- Owner field for a free lock has MTX_UNOWNED
- Owner field for a held lock has the pointer to owning thread
- Exclusive access only
- Loops waiting for the mutex to become available
- Run inside the critical section
- More expensive than a blocking mutex
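A userland sketch, using C11 atomics, of the compare-and-swap lock word described above: the owner field holds either an unowned sentinel (mirroring MTX_UNOWNED) or a pointer to the owning thread, and the lock loops waiting for the mutex to become available. Illustrative only, not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

#define UNOWNED	((uintptr_t)0)		/* stand-in for MTX_UNOWNED */

struct spin_mtx {
	_Atomic uintptr_t owner;	/* UNOWNED or owning-thread pointer */
};

static void
spin_lock(struct spin_mtx *m, uintptr_t self)
{
	uintptr_t expected;

	for (;;) {
		expected = UNOWNED;
		/* Atomically: if owner == UNOWNED, set it to self. */
		if (atomic_compare_exchange_weak(&m->owner, &expected, self))
			return;		/* acquired */
		/* else loop, waiting for the mutex to become available */
	}
}

static void
spin_unlock(struct spin_mtx *m)
{
	atomic_store(&m->owner, UNOWNED);
}
```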
- Exclusive only
- Adaptive spinning, which only spins if the owner is running
- All waiters are awakened when the lock is released
- Cheaper to release an uncontested lock
- Often end up scheduling sequentially
- Used for small short-lived data structures
- Just need a pointer to a mutex rather than a large mutex itself
- Mutexes are preallocated, avoiding high creation and destruction costs
- Example is the poll syscall, which needs a struct to track poll requests
- Exclusive and shared mutex
- Uses a turnstile, so it cannot be held while sleeping
- Priority propagation for exclusive not for shared access
- May specify permission to recurse
- Same properties as reader-writer
- Priority propagation for shared
- Designed for fast access
- Read without locking, then check whether a write happened (see the sketch below)
- if so, redo the read
- The routing table is a good example of this lock
- IBM holds a patent on RCU
- It allows GPL code to use it
- FreeBSD is not GPL
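A sketch of the read-then-check pattern above (a seqcount-style reader): read without locking, then redo the read if a writer intervened. The scheme and names here are illustrative, not FreeBSD's rmlock implementation.

```c
#include <stdatomic.h>

struct rm_data {
	_Atomic unsigned seq;	/* writers make this odd while updating,
				 * and bump it again when done */
	int		 value;
};

static int
rm_read(struct rm_data *d)
{
	unsigned s1, s2;
	int v;

	do {
		s1 = atomic_load(&d->seq);
		v = d->value;			/* unlocked read */
		s2 = atomic_load(&d->seq);
		/* redo if a write was in progress (odd) or completed */
	} while (s1 != s2 || (s1 & 1) != 0);
	return (v);
}
```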
- Fastest and simplest of the locks that can sleep
- Shared and exclusive
- Permission to recurse
- May request interruptions (PCATCH)
- No priority propagation
- wrapper on traditional sleep and wakeup
- Allows waking one or all waiters
- Must hold a mutex before awakening or waiting, as the sketch below shows
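A kernel-style sketch of the condition-variable wrapper using the cv(9) interface; the work-queue structure is hypothetical. The mutex is held around both the wait and the wakeup, as required above.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

struct workq {				/* hypothetical work queue */
	struct mtx	mtx;
	struct cv	nonempty;
	int		nitems;
};

static void
workq_put(struct workq *q)
{
	mtx_lock(&q->mtx);
	q->nitems++;
	cv_signal(&q->nonempty);	/* wake one waiter; cv_broadcast()
					 * would wake all */
	mtx_unlock(&q->mtx);
}

static void
workq_take(struct workq *q)
{
	mtx_lock(&q->mtx);
	while (q->nitems == 0)
		cv_wait(&q->nonempty, &q->mtx);	/* drops and reacquires mtx */
	q->nitems--;
	mtx_unlock(&q->mtx);
}
```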
- Most full-featured of the locks that can sleep
- Downgrade (from exclusive to shared)
- Upgrade (from shared to exclusive)
- An upgrade request prevents other exclusive locks
- Can pass ownership of a lock from a thread to the kernel
- You need to clean up before leaving the kernel
- Ability to drain all accessing threads for deallocation
- No propagation
- Partial ordering requires:
- Thread acquires only one lock in a class
- Thread may acquire only a lock in a higher-numbered class
- Class 1 > Class 2 - thread with locks on class 1 can acquire class 2 locks
- Programmers can define lock classes
- Witness code observes actual lock ordering and complains when either rule is violated
- Very expensive
- Debate exists about enabling in production
- CPU Time
- What runs when
- Async events
- timers
- hardware exceptions
- external events
- I/O
- Memory
- text/data
- heap
- runtime stack
- I/O Descriptors
- files
- pipes
- socket
- stream
- Process entry
- All UNIX systems have one; Solaris and Linux process entries may differ
- Process structures dynamically allocated
- Placed in linked list
- filedesc structure
- Open file array is reallocated when it overflows
- Process limits (plimit)
- array of rlimit
- copied lazily
- Process accounting/statistics (pstats)
- rusage (Resource usage)
- virtual-time timer
- profiling
- start time
- Signal actions
- Pending signals
- sigacts per-signal actions
- signal stack info
- current state during sigpause
- kthread model (started in BSD world)
- One process structure
- Multiple threads
- Fork model (started in the Linux world)
- Each thread has its own process structure
- same model as that used by Linux
- kthread context switch is much cheaper
- rfork slower for context switch
- runnable
- sleeping
- stopped
- stopped by job control or tracing
- return to sleep or run
- minor states
- new
- zombie
- Zombies won't close socket descriptors because tcp may have data sitting around
- Init's job is to wait for orphans left by other processes
- by process ID
- hashed, linked through p_hash index
- pfind()
- by state
- run queue: td_runq
- sleep queue: td_slpq
- lock queue: td_lockq
- by type
- on one of allproc or zombproc
- the p_ prefix always refers to the proc structure
- REALTIME
- ULE
- 4BSD
- IDLE
- UNIX was designed with some security in mind
- Users have existed in UNIX since the beginning
- Windows didn't have users until NT
- Things that have to be secure
- Kernel
- Boot scripts
- Core utilities (shell, login, ifconfig)
- Libraries used by core utilities (libc)
- Solid crypto base
- process credentials (ucred structure)
- effective uid (setuid)
- group set (gid)
- real uid, gid
- saved uid, gid from set*id program
- exported credentials (xucred structure) - outside kernel
- effective uid
- group set
- seteuid
- setegid
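A userland sketch of how a set-user-ID program uses the saved UID described above: drop the effective UID for normal work and raise it only around privileged operations.

```c
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	uid_t ruid = getuid();	/* real uid: the invoking user */
	uid_t euid = geteuid();	/* effective uid: the file's owner */

	seteuid(ruid);		/* drop privilege; saved uid remembers euid */
	printf("running unprivileged as uid %d\n", (int)geteuid());

	seteuid(euid);		/* raise privilege again from the saved uid */
	/* ... privileged operation would go here ... */
	seteuid(ruid);		/* and drop it once more */
	return (0);
}
```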
-
Immutable limitation
- Immutable files can be changed only in single-user mode
- Append-only files can only be rotated in single-user mode
- Direct hardware access is restricted
- All startup activities need to be protected
- Started from chroot
- Jails were created by demand from the user base (ISPs)
- vnet
- vem0a - vem0b (master slave virtual net)
- Permitted
- running or signaling process within the jail
- change files within jail
- bind ports
- access raw, divert, or routing sockets within the jail's virtual network
- Not permitted
- get info from processes outside jail
- change kernel variables
- mount or unmount filesystem
- modify phys net interface
- reboot
- Apps access randomness via /dev/random
- Uses Yarrow, a PRNG
- Yarrow
- reuse existing crypto primitives such as hash and block encryption
- framework for entropy acquisition
- an entropy accumulator
- a reseed mechanism
- a generation mechanism (SHA-256 and AES)
- Yarrow will be replaced by Fortuna on FreeBSD 11
- File permission bits
- ACL capabilities
- Look up
- Administration permissions
- Change acl on the file
- List of users
- List of groups
- Default ACL Inheritable
- ZFS uses NFSv4 ACL model (compatible with windows ACL)
- NFSv4 uses inheritable ACLs rather than the default ACLs in POSIX
- Nearly 200 privileges
- In the kernel, checks for uid 0 now call priv_check()
- Jails use privileges to control what a process can do
- MAC is a loadable kernel module
- Used to work out what happened when things go bad
- Good auditing should include intrusion detection
- OpenBSM
- Looks like truss or strace
- auditd daemon
- starts a kernel thread that controls record distribution
- Sandbox
- More powerful than jails
- Untrusted code runs in a separate process with limited access
- cap_enter()
- No access to filesystem
- Defined at /sys/sys/capability.h
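A minimal Capsicum sketch: open a file, limit the descriptor to CAP_READ, then cap_enter(); after that, global namespaces (such as opening files by path) are no longer reachable, so only the capabilities already held can be used.

```c
#include <sys/capsicum.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	char buf[128];
	cap_rights_t rights;
	int fd = open("/etc/motd", O_RDONLY);

	if (fd < 0)
		return (1);
	cap_rights_init(&rights, CAP_READ);		/* read-only capability */
	if (cap_rights_limit(fd, &rights) < 0 || cap_enter() < 0)
		return (1);

	/* Sandboxed: only operations permitted by fd's rights work now;
	 * open("/etc/passwd", ...) here would fail with ECAPMODE. */
	ssize_t n = read(fd, buf, sizeof(buf));
	return (n >= 0 ? 0 : 1);
}
```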
- Top part of the address space used by the kernel
- malloc() memory, kernel thread stacks
- no page faults in the kernel
- load everything, really
- Bottom part for userland
- Context switch happens here
- Trampoline code for signals
- User stack pages are not executable
- First page left invalid to catch null-pointer dereferences
- Heap grows up, Stack grows down
- Shared libraries sit a couple of pages past the heap limit
- Virtual address to Physical address
- MMU
- L1 cache will probably have a page cached
- TLB - Translation Lookaside Buffer
- Page_fault - interrupt
- MMU must be informed that a new page is loaded
- Context switch is expensive because a whole new page directory needs to be loaded
- A process address space is made of regions
- A region is a collection of address space that is treated the same way
- Shared memory
- File Mapping
- Copy on write after fork
- VM design is architecture independent
- mmap()
- maps file into memory
- munmap()
- removes mapping from memory
- msync()
- sync hardware caches to main memory
- mprotect()
- control access to parts of the address space
- madvise()
- advise the kernel on how an area of memory will be used
- e.g., a garbage collector advises random access
- mlock()
- force pages to be resident, not paged out
- used in databases for transactions, to prevent pages from going away
- call msync() to sync main memory from caches
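A sketch tying the calls above together: map a (hypothetical) file shared, store through the mapping, push the change back with msync(), and unmap.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	struct stat st;
	int fd = open("data.bin", O_RDWR);	/* hypothetical file */

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return (1);
	char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);			/* map file into memory */
	if (p == MAP_FAILED)
		return (1);

	madvise(p, st.st_size, MADV_SEQUENTIAL); /* advise usage pattern */
	p[0] = 'X';				/* store through the mapping */
	msync(p, st.st_size, MS_SYNC);		/* write back to the file */
	munmap(p, st.st_size);			/* remove the mapping */
	close(fd);
	return (0);
}
```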
- when malloc fails you have probably used up all your address space
- linux does not enforce address space size
- Data Structures
- Per process: shared per address space, per thread
- Per address space: collection of objects
- most of the code is machine independent
- vmspace
- vm_map
- vm_pmap (p for physical)
- statistics (about the vm system, page faults, workset size)
- vm_map_entry
- anonymous object
- vnode / object
- start and end address
- the VM asks the object whether it has the page
- Private mapping
- Shadow objects between the vm_entry and the file object/vnode
- In low-memory scenarios, pages are taken from the vnode and put into memory
- When a process with a MAP_PRIVATE mapping forks, its child must get its own shadow object
- On 32-bit, the kernel can only have 2GB of address space
- Memory in kernel space
- submaps
- vmem arena
- slabs out of the arena
- bucket out of slabs
- buckets are per-CPU
- kernel described by map entries
- bt boundary tag
- power-of-2 free list
- BESTFIT or INSTANTFIT
- hash list for used pages, pointed to by the bt
- the slab allocator is responsible for allocating a zone of a particular thing
- all objects are the same size
- manages a particular zone
- track slab objects
- lists for full, partially allocated, and empty slabs
- request a new slab when everything is in use
- Lock per zone
- Each CPU has its own lockless list protected by a critical section
- Each CPU has its own bucket of memory
- Two buckets per CPU
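A hedged kernel-style sketch of the zone allocator interface (uma(9)) described above; struct foo and the zone name are hypothetical. Allocations are normally satisfied from the per-CPU bucket without taking the zone lock.

```c
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

struct foo {				/* hypothetical fixed-size object */
	int	state;
};

static uma_zone_t foo_zone;

static void
foo_init_zone(void)
{
	/* All objects in the zone are the same size: sizeof(struct foo). */
	foo_zone = uma_zcreate("foo", sizeof(struct foo),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static struct foo *
foo_alloc(void)
{
	/* Usually served lockless from this CPU's bucket. */
	return (uma_zalloc(foo_zone, M_WAITOK | M_ZERO));
}

static void
foo_free(struct foo *fp)
{
	uma_zfree(foo_zone, fp);
}
```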
- Memory pool characteristics
- Size is fixed
- Max size is no bigger than phys mem
- Memory mapping hardware can be directly controlled
- Known usage patterns
- Most allocations are small
- Many allocations are for a power-of-2 sized piece of memory
- Simple and familiar, similar to C malloc() and free()
- Small allocation
- use power of two
- called using malloc and free
- Large common structures
- use zone-based allocation
- more compact than power-of-two
- avoid cache line problems
- avoid initializing structures
- must reclaim unused memory from zones
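A kernel-style sketch of the malloc(9) interface just described; the M_MYBUF type is hypothetical. Unlike C malloc()/free(), each call names a malloc type used for accounting.

```c
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

MALLOC_DEFINE(M_MYBUF, "mybuf", "example buffers");	/* hypothetical type */

static void *
grab_small(void)
{
	/* A small request like this is rounded to a power of 2, as
	 * noted above. */
	return (malloc(100, M_MYBUF, M_WAITOK));
}

static void
drop_small(void *p)
{
	free(p, M_MYBUF);	/* the type is passed on free as well */
}
```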
- Wired - kernel
- Active - In use
- Inactive - valid contents, dirty
- Cache - valid contents, clean
- Free - invalid contents, ready for use
- Necessary to have higher priority page access
- free pages go on the free list
- when no longer explicitly referenced
- on process exit, anon pages go to the free list
- To add inactive pages to the cache list
- remove the oldest page from the inactive list
- if first inspection mark scanned
- if dirty, put back on the inactive list
- pages on the cache list need to be clean
- Inactive and cache are LRU
- To add to inactive list
- Remove oldest page from active list
- if in use, increment active count and put on the end of active list
- else not in use, decrement active count
- if active count greater than zero, put on end of active list
- else clear scanned flag and move to end of inactive list
- The problem is the limited size of the TLB
- if a TLB lookup takes more than a cycle it is slow
- 4k pages
- 2MB pages with PAE
- 4MB pages without PAE
- On FreeBSD the kernel decides when to use superpages
- Others allow programs to decide
- Starts at the beginning (first fault) because small pages are already in use
- Anonymous memory is always eligible for superpages
- a mapped file of at least superpage size is eligible
- Only saves TLB misses
- if you are touching all pages in your reservation
- promotion to read-only superpage
- Kernel memory is always superpages
- Kernel memory is not paged out
- Cache and free are kept in buddy lists (pair)
- Aggregate small pages until a full superpage is available
- filesystems own "buffer cache" pages
- an anonymous object owns its memory pages
- anonymous objects don't have a filesystem
- pages may be both mapped and cached (copy on write)
- management of swap space
- page cluster
- p 315 (book reference picture)
- GEOM handles the disk IO
- Can do all sorts of stuff: RAID, striping, etc.
- The CAM layer handles disk I/O in a generic way
- Individual drivers are much smaller
- Based on SCSI but extended to ATA and SATA
- File Descriptor
- An array of bytes
- Move the file offset
- Read at the offset
- Write at the offset
- Access to other hardware
- Terminals
- Tape drives
- Printers
- Disks
- Modems
- Pipes
- Reliable byte stream
- Per-process file descriptor index (integer array)
- One per open file descriptor, shared across dup()
- VNODE
- SOCKET
- PIPE
- FIFO
- KQUEUE, MQUEUE, ETC
- Use select() or poll() system call
- Useful when the I/O usually blocks
- Expensive when I/O succeeds
- Stateless
- Use non-blocking mode
- Useful when the I/O usually succeeds
- Expensive when I/O blocks
- Use signal driven I/O
- Useful for infrequent stuff
- Expensive for high-volume I/O
- Context switch for each signal
- Use kqueue to monitor kevent
- Useful for monitoring many events
- Only in BSD and MacOS
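A minimal kqueue sketch of the register-then-poll model above: register interest in read events on stdin once, then block in kevent() until data is ready.

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct kevent change, event;
	int kq = kqueue();

	if (kq < 0)
		return (1);
	/* Register once: identifier = fd 0, filter = EVFILT_READ. */
	EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
	if (kevent(kq, &change, 1, NULL, 0, NULL) < 0)
		return (1);

	/* Poll: blocks until stdin becomes readable. */
	int n = kevent(kq, NULL, 0, &event, 1, NULL);
	if (n > 0)
		printf("%d bytes ready on fd %d\n",
		    (int)event.data, (int)event.ident);
	close(kq);
	return (0);
}
```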
- Select will sit on top of the underlying bottom half to ask about states
- Pool of mutexes
- Select on async devices
- Select notification when I/O becomes possible
- selwakeup() is passed the selinfo, which is used to traverse the linked list
- The device removes the horizontal list; the thread removes the vertical
- General event notification facility
- Stateful interface
- Register events of interest
- May modify the event (add, delete, enable, disable)
- Tracked events are deleted when resource is gone
- Poll for events using kevent()
- Supported kevents on FreeBSD
- R/W as in select
- AIO Async I/O events
- VNODE files information has changed
- PROC status of processes
- SIGNAL posted to processes
- TIMER an event-based timer
- USER application defined events
- Each event has an identifier
- Each event specifies a filter flag
- An event may be another kqueue descriptor
- Hash list for free-form identifiers
- Array indexed by descriptor number
- List of pending events
- Non-block flag is stored in the file table
- IO vector
- Increment Offset
- Decrement Resid
- Segment flag tells you where the I/O is going to or coming from
- USERSPACE
- SYSSPACE
- NOCOPY
- Pointer to the thread that requested the I/O
- system call
- open/close, read/write, link/unlink
- file table
- open/close, read/write, ioctl
- virtual node (vnode)
- VOP_
- inode
- namei, iget/iput
- block I/O
- bread/bwrite
- type
- pointer to type specific control block
- flags
- reference count
- pointer to operations
- pointer to containing file system
- list of associated clean buffers
- list of associated dirty buffers
- count of pending I/O operations
- Allows but does not require state
- Steps in translating pathname
- Single system-wide vnode table
- vn_inactive() called when references return to zero
- vn_reclaim() recycles the vnode
- General set of utility routines is provided to manage vnodes for all filesystems
- Single system-wide soft-reference name cache with common routines
- cache_lookup()
- cache_enter()
- cache_purge()
- supports negative caching
- /etc/mtab info is now maintained in the kernel
- statfs(2)
- getfsstat(2) returns what is mounted
- getmntinfo(3) is a saner interface than the above
- Generic mount information can be updated at any time
- Getting buffers
- bp = bread(vnode_ptr, file_offset, size);
- Releasing buffers
- brelse(bp)
- bwrite(bp)
- bawrite(bp)
- bdwrite(bp)
- Born at UCLA
- allows arbitrary directories to be mounted anywhere else
- implemented as a filesystem layer
- basis for other filesystem implementations
- Merge filesystems together instead of hiding the underlying fs
- Remove files from lower layers
- a whiteout is written in the top layer
- Removed directory must be marked as opaque for whiteout of directories
- it basically deals with naming
- all other operations are passed down to the lower layers
- Character devices
- pseudo-terminals
- frame buffer
- printer
- Network devices
- network interfaces
- Disk devices
- disk management (GEOM)
- filesystem interface
- Historic device nodes
- /dev filesystem
- devd
- Operations supported by any character device
- Used by tty devices
- Divided into 2 parts
- One for things going down from the kernel
- One for things coming out of the device
- Programmatic interface to terminal drivers
- All state contained in one structure
- struct termios{}
- tcgetattr()
- tcsetattr()
- command flags passed to tcsetattr()
- TCSANOW - make change immediate
- TCSADRAIN - drain output, then change
- TCSAFLUSH - drain output, flush input
- TCSASOFT - only interested in soft state
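A small sketch of the tcgetattr()/tcsetattr() interface above: save the current termios state, switch the terminal to raw mode, and restore it afterward.

```c
#include <termios.h>
#include <unistd.h>

int
main(void)
{
	struct termios saved, raw;

	if (tcgetattr(STDIN_FILENO, &saved) < 0)
		return (1);
	raw = saved;
	cfmakeraw(&raw);			  /* clear canonical/echo bits */
	tcsetattr(STDIN_FILENO, TCSADRAIN, &raw); /* drain output, then change */

	/* ... read single keystrokes here ... */

	tcsetattr(STDIN_FILENO, TCSANOW, &saved); /* immediate restore */
	return (0);
}
```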
- Historically a side effect of open
- Now an explicit ioctl: TIOCSCTTY
- Only session leaders can acquire controlling terminal
- only one per session
- SIGHUP when the controlling process disappears
- revoke the vnode that references the pty
- Entry points in the ifnet structure
- if_ operations
- Control messages sent using if_ioctl
- SIO* various operations of the network interface
- D buffer owned by the device; may be filled
- K buffer owned by the kernel; may be read
- When K is done the kernel stops; when D is done the network drops packets
- GEOM layer
- disk partitioning and labelling
- disk aggregation
- collection of I/O statistics
- I/O optimization such as disk sorting
- GEOM mirror layer
- mirror underlying partition
- can be inserted to copy a filesystem
- Downwards is handled by g_down thread (single threaded)
- Upwards is handled by g_up thread
- Module cannot sleep or compute excessively
- Modules providing locking can allow direct dispatch
- When the provider goes away, error is propagated
- When provider changes, change is propagated up the stack
Organization of buffers awaiting I/O for a disk
- currently working on buffer for block 25
- next buffer to be processed is at block 40
- blocks to be done in next pass start at switch_point_list
- Acquire a queue mutex and block on CPU
- Place request on queue
- If device is idle, start it
- If synchronous request, sleep
- Release mutex
- Acquire the mutex
- Pop next request from queue
- Release the mutex
- Do required work
- If synchronous, wakeup thread(s)
- If asynchronous, free resources
- Disk can connect in several ways
- Non-disk USB and Firewire devices bypass CAM
- CAM Layer handles disk devices
- routing through I/O busses to the drive
- PCI
- Cardbus
- Firewire
- tag queueing
- serializing requests
- allocating and freeing DMA resources
- routing through I/O busses to the drive
- A DMA map works similarly to a TLB, but is much simpler
- policy
- support widest possible range of CPU types and configuration
- minimize information that must be specified in advance
- strategy
- decode CPU type
- check memory controller type
- find types and locations of possible main I/O busses
- descend I/O busses, testing for presence of controllers and devices
- attach and configure discovered controllers and devices
- device_probe
- device_identify
- device_attach
- device_detach
- device_shutdown
- device_suspend
- device_resume
- Bid process for drivers
- Highest bid will get it
- probe - bid - attach
- the pci0 device scans its bus and finds an ATA disk controller
- scanning its drivers, the "atapci" driver wins the auction
- Sample autoconfiguration hierarchy (devinfo)
- full virtualization lets guest operating system run without change
- paravirtualization requires guest operating system to use virtualized devices and CPU features
- virtio model splits the device driver
- guest operating system front end sends device commands to hypervisor
- hypervisor back-end operates physical devices
- communication is often done using shared memory rings
- FreeBSD embeds bhyve to provide virtualized devices, versus Xen, which is a standalone hypervisor that allows guest operating systems to interact with each other
- Bhyve exposes a virtual PCI bus to the guest
- guest can probe and attach back-end devices
- grant tables are shared memory used to communicate between a guest operating system and Xen or another guest
- grant table operations
- permit_access
- accept_transfer
- may limit export or import page to read-only
- grant table states
- reading
- writing
- all entries start on a 4-byte boundary
- active entries are linked together
- It is a place to put metadata about the file
- Header of each attribute has:
- 4-bytes length
- 1-byte namespace class
- 1-byte content pad length
- 1-byte name length
- name padded so that the contents start on an 8-byte boundary
- Advisory file locking
- no lock
- shared access
- exclusive access
- locks are not tied to file access mode
- locks may be ignored
- locks may change after the file is open
- each "uid" may have quotas associated
- quotas may be specified by filesystems
- quota may be imposed on both users and groups
- quota policy
- Users should stay below their "quota"
- process descriptor
- open file entry
- vnode
- inode
- disk
- inode
- data
- inode
- vnode
- open file entry
- Contains all the metadata about the file
- Symlinks stored directly in the inode
- direct blocks
- single indirect (pointers to pointers to data)
- double indirect (pointers to pointers to pointers to data)
- triple indirect (one more level of indirection; see the sketch below)
- align operations to block boundaries
- inodes map logical block numbers to disk blocks
- write to system buffer and copy back to blocks
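A sketch of how the pointer scheme above maps a logical block number to a direct or indirect level; NDADDR = 12 direct pointers matches classic FFS, and the pointers-per-indirect-block count used here is an assumption for illustration.

```c
#include <stdio.h>

#define NDADDR	12		/* direct block pointers in the inode */

static const char *
level_for(long lbn, long nptrs)
{
	if (lbn < NDADDR)
		return ("direct");
	lbn -= NDADDR;
	if (lbn < nptrs)
		return ("single indirect");
	lbn -= nptrs;
	if (lbn < nptrs * nptrs)
		return ("double indirect");
	return ("triple indirect");
}

int
main(void)
{
	/* Assume 8KB blocks with 8-byte pointers: 1024 pointers/block. */
	long nptrs = 8192 / 8;

	printf("block 10 is %s\n", level_for(10, nptrs));
	printf("block 2000 is %s\n", level_for(2000, nptrs));
	printf("block 2000000 is %s\n", level_for(2000000, nptrs));
	return (0);
}
```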
- Superblock - static parameter of the filesystem
- block size
- fragment size
- disk characteristics
- layout policy
- Cylinder group map - dynamic parameters of the cylinder group
- free blocks and fragments
- free inodes
- allocation summaries
- recent allocation history
- Inode - per file information
- type/owner
- access/mode times
- array of pointers to data blocks
- Data blocks
- Time/space tradeoff
- Time - big blocks reduce the number of reads/writes to access a file
- Space - big blocks waste space for small files
- UNIX files are predominantly small, so use a hybrid block size
- Maintain system characteristics
- time to schedule an I/O operation
- ability to chain I/O operations
- Use this information to do block layout ASSUMING sequential access.
- Laying out blocks in a rotationally optimal fashion.
- need to localize allocation of resource that are accessed together
- need to spread out unrelated allocations
- Inode layout policy
- localize
- try to place all files contained in a directory in the same cylinder group as the directory
- spread out
- Place new directories in a cylinder group with a greater-than-average number of free inodes but the fewest directories. Cluster leaf directories.
- localize
Data block layout policy
- Localize
- Place data blocks together in the same cylinder group as the inode resides
- Spread
- Redirect block allocation for a file to a new cylinder group every ten to twenty Megabytes
- Policy aim
- Keep some free blocks in every cylinder group
- Localize all references when many small files
- Size of a file is not known when the file is opened
- first 4% of the data area of each cylinder group is held for metadata
- Use requested block
- Use closest forward block within cylinder group
- Use quadratic rehash to pick a different cylinder group
- Brute force search of cylinder group
To maintain reasonable options, 5%-8% of the disk is kept free
- Metadata must be consistent
- directories
- inodes
- bitmaps
- Rules
- Never point to a structure before it is initialized
- Never reuse a resource before nullifying all previous pointers to it
- Never reset an old pointer to a live resource before the new pointer has been set
-
Keep metadata consistent 1
- Synchronous writes
- simple and effective
- create/delete intensive is slow
- Non-volatile RAM
- usually runs all operations at memory speed
- expensive hardware, unavailable on many machines
- Atomic updates (journaling and logging)
- create/remove do not slow down under heavy load
- extra I/O generated, little speed-up for light loads
-
Keep metadata consistent 2
- Copy-on-write filesystems (LFS, ZFS, WAFL, etc)
- write throughput, cheap snapshots, always consistent
- disk fragmentation, memory overhead
- Soft updates
- most operations run at memory speed, reduced system I/O, instant recovery after a crash
- complex code, background fsck, and increased memory loading
- Only need to journal operations that orphan resources
- Journal needs only 16Mbyte independent of filesystem size
- Filesystem operations that require journaling
- Increased link count
- Decreased link count
- Unlink while referenced
- Change of directory offset
- Cylinder group updates of freed blocks and inodes
- Additional soft updates tracking
- Reclaiming journal space
- Ordering constraints
- log of the location in the directory that has the name to be deleted
- the filename in the on-disk copy of the directory must be deleted
- log the blocks to be deleted and deallocate the file, zeroing out its on-disk dinode
- the blocks formerly referenced by the inode for the file must be released to the free-space bitmap
- log the successful completion of removal
- Crash recovery is done by fsck
- Recovery steps
- Scan journal
- Link count increases
- Link count decreases
- Free inodes with zero link count
- Free inodes that were unlinked but busy
- Free unallocated blocks
- Create a copy-on-write image of a filesystem image
- suspend processes initiating system calls that modify the filesystem
- allow all modifications in progress to complete
- write out all dirty buffers to disk
- create an empty snapshot file the size of the filesystem partition
- mark the blocks that are currently in use
- resume write operations on the filesystem
- on each disk write, check whether the block has been copied, making a copy if the write is for an in-use block that has not yet been copied
- Never over-write an existing block
- It is always consistent
- State atomically advances at checkpoints
- Snapshots(read-only) and clones(read-write) are cheap
- Metadata redundancy and data checksums
- Don't need fsck
- Selective data compression and deduplication
- Filesystem shared pool
- Quota
- Mirroring and multiple parity RAIDZ
- ZFS is monolithic
- GEOM used to provide access to disks
- GEOM used also to present zvol as disks
- Multiple layers
- Meta-object-set
- Object-set
- Array of inodes or dnodes
- The uberblock anchors the pool
- Meta-object-set (MOS) describes array of filesystems, clones, snapshots, and ZVOLs
- Each MOS object references an object-set that describes its objects
- Space map is at the MOS layer; if an object-set needs more space it just asks the MOS
- The uberblock describes the entire ZFS pool
- Areas are used as a circular queue
- Equivalent of inode in FFS
- Describe files and directories
- But also snapshots, clones, etc.
- A znode is embedded in the dnode when it describes a file
- Blocks range in size from 4 to 128Kbyte
- It is huge, much larger than an FFS block pointer
- Disk blocks are kept in the pool
- Huge picture page 532-535
- Clones can only be taken of a snapshot
- Can make multiple clones of the same snapshot
- Clones can be promoted to the filesystem
- Can also take snapshots and clone ZVOLs
- Checkpoint affects all filesystems, clones, snapshots and ZVOL
- Checkpoints are taken when one of:
- five seconds have passed
- 64Mb of dirty data accumulate
- administrative task (like a snapshot)
- To take a checkpoint
- finish in-progress system calls
- flush all dirty data
- complete by writing new uberblock
- Handled by ZIL
- Used to record changes between checkpoints
- Not only metadata but content (it is a log, not a journal)
- Data over 32KB is written to disk and a pointer put in the log
- Traditional RAID has fixed stripe size
- RAIDZ has variable stripe size
- rebuilt by traversing all MOS objects and rebuilding their blocks
- never need to recalculate and write parity
- Manages the space in the pool
- It is a bitmap
- Allocation proceeds as:
- Find disk with the most free space
- select the least-fragmented fixed-size map
- allocate space closest to previous allocation
- SPA also handles
- compression
- deduplication
- take a checkpoint
- allocate new dnode in the MOS
- copy the block pointer from the fs dnode to the snapshot dnode
- link the snapshot dnode into the head of filesystem's snapshot list
- move the filesystem deadlist to the snapshot
- Blocks are tracked using space maps, birth time and deadlist
- When a block is allocated, its birth time is set to the current transaction number
- Over time snapshots are taken which also reference the block
- When a file is overwritten, truncated, or deleted, its blocks are released
- For each freed block, the kernel must determine if a snapshot still references it.
- if born after most recent snapshot, it can be freed
- otherwise it is added to the filesystem's deadlist
- It is done on a pool-wide basis
- Blocks are identified as duplicates by hash (using the checksums)
- backup of pool either locally or remotely
- can do full backup
- stored as a single file
- expanded back to its original form
- can do an incremental backup since a backup of an earlier snapshot
- tracks the progress of partially completed backups
- high write throughput
- fast RAIDZ reconstruction on pools with less than 40% usage
- avoid RAID "write hole"
- blocks move between filesystem as needed
- integration eases administration (mount points, exports, etc)
- Slowly written files scattered on the disk
- That is why it uses a lot of memory
- Slow reconstruction on pools with greater than 40% usage
- Block cache must fit in the kernel's address space, thus works well only on 64-bit
- Needs under 75% utilization for good performance
- RAIDZ has high overhead for small block size such as 4kb blocks typically used by databases and ZVOLs
- blocks cached in memory are not part of the unified memory cache, so it is inefficient for files and executables using mmap() or sendfile()
- Widely available
- Divided into two halves
- fileserver exports locally attached disks to client hosts
- client hosts that import filesystems from central server
- NFS basically started open source
- Sun promoted connectathons to make sure every implementation was interoperable
- stateless protocol
- benefits
- can continue to operate when some systems fail
- fast recovery when failed components return
- drawbacks
- sync writes
- high network traffic
- stateful protocol
- state is bounded to speed recovery
- benefits
- can continue limited operations when some systems fail
- bounded recovery time when failed components return
- minimal network traffic validating clients cache
- less need for separate lock and mount protocols
- sessions (added in nfs 4.1) ensure idempotency of requests
- drawbacks
- sync writes
- server required to maintain and manage states
- file handles
- created by server to identify files when first looked up
- used by client in all NFS requests after open
- file handle contents
- filesystem identifier
- inode number
- generation number
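A hedged sketch of the file-handle contents just listed; the field names are illustrative, not FreeBSD's actual fhandle_t layout. The generation number lets the server reject stale handles for an inode number that has since been freed and reused.

```c
#include <stdint.h>

/* Illustrative NFS file handle: opaque to the client, meaningful
 * only to the server that created it. */
struct nfs_fhandle {
	uint64_t fsid;		/* which exported filesystem */
	uint64_t fileno;	/* inode number within that filesystem */
	uint32_t gen;		/* generation number: incremented when the
				 * inode is reused, so old handles go stale */
};
```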
- rpcbind (all)
- mountd (server)
- nfsd (server)
- nfsiod (client)
- lockd (all)
- statd (all)
- good performance on the internet
- strong security with negotiation built into the protocol
- good cross-platform interoperability
- designed for protocol extension
- provide protocol support for clustered-server deployment
- compound RPCs
- lock operations included in the protocol
- explicit open and close operations
- pseudo-filesystem hierarchy to support multiple roots
- extensive attribute support
- required attributes (16)
- recommended attributes (43)
- extensible attributes (key/value pairs)
- access control list
- delegation and callbacks (leases)
- rpcbind (all)
- nfsd (server)
- nfscbd (client)
- nfsuserd (client)
- after client crash
- client attempts first mount from server
- server finds previous client state (if not timed out yet) and releases it
- after server crash
- server sends BAD_SESSION replies to out-of-date client IDs
- client realizes the server has crashed and reestablishes its state
- server gives clients time (default 15 seconds) to rebuild state before handing out new state
- Domain A has P1 and P2 processes
- Domain B has P2, P3 and P4
- Protocols used for Addressing Domains
- Domain addressing is the problem of finding a process within a domain
- No centralized authority
- Domain based distributed name server
- gethostbyname() does an RPC to a name server daemon, usually run on a gateway
- BIND was made by one of his students, who got an A because it was deployable
- Came about to create a standard way for interoperability
- OSI layers came out to standardize the communication protocol layers
- The Internet in the EU was born from a guy buying modems and distributing them to friends, before Internet protocols (UUCP)
- mbuf designed to hold network data
- socket, bind, listen, write, read, etc…
- local domain
- IPv4
- IPv6
- Outline of a simple "remote write"
- sockets
- protocols
- net interface and hardware
- it is the absolute minimum you can get away with
- layer 4 transport, datagram protocol
- demultiplex on port number
- optional checksum
- layer 4, connection oriented
- provides everything that UDP does not
- flow control (windowing)
- port to port connection
- acknowledgements
- may be piggybacked or coalesced
- 3-way handshake on open, 2-way handshake on close
- Reliable multiple-stream transport protocol
- provides associations between up to 65k independent streams
- good for multi-streams applications, no need for multiple handshakes
- layer 3
- designed for packet-switched network
- both IPv4 and IPv6
- host-to-host addressing
- routing
- time to live
- only IPv4
- fragmentation and reassembly
- type of service
- options: source route, record route, timestamp
- limited to 224 bytes of data
- can have data written at the beginning and at the end of the buffer
- use the mbuf pool, allocated in a separate page map
- pages are allocated when the mbuf is allocated and returned free of last reference
- mbuf data segment was converted into other fields
- mbuf data sits outside the mbuf for larger data sizes
- external mbufs
- stream socket
- packets are linked to the next
- no boundaries without application interaction
- datagram socket
- are record based
- if you read a datagram you get the entire datagram
- 13 is the number of root servers because that is how many fit in a UDP packet
- pru*
- ifnet
- pf1_addr
- pf2_addr
- driver, one per device type
- link-layer drivers are like normal drivers but easier
- there is not much in the link layer that really had to be done
- input
- common routines, copy or map data into mbufs
- save interface id with packet
- IHL = IP Header Length
- number of 32-bit words after the header
- ID field is mostly used by fragmentation
- TTL
- Header checksum is for the IP header only
- 3 layer routing, network -> subnet -> host
- 22-bit network
- Class D reserved for multicast
- Interesting how multicast state is created on gateways along the path from the publisher
- Gateways have to have multicast support if they are doing multicast
- IGMP
- add, leave group
- programming interface: setsockopt/getsockopt
- on Ethernet group address is added to interface filter
- Multicast agent
- listens to IGMP
- forwards multicasts to remote members
- Problem
- how to map network-protocol-level addresses to a link-level address
- Solution
- broadcast an "ARP" request containing the host being sought
- all the machines on the wire get the request
- the machine that matches sends a response back to the broadcaster containing the link-level address to use
- the network keeps a cache of ARP translations (in the routing table)
- 128-bits was planned way ahead
- Autoconfiguration
- router discovery
- neighbor discovery
- Security (IPSec)
- Version
- traffic class and flow
- next-header used to describe the type of routing among a lot of other things like authentication and encapsulation
- Router discovery
- send router solicitation message to router "anycast" address
- receive router advertisement messages containing the router IP address and router link-level address
- mimicked in IPv4 by DHCP
- Neighbor discovery
- send neighbor solicitation
- receive neighbor advertisement containing the neighbor IP address and neighbor link-level address
- Mimicked in IPv4 by ARP
- part of IPv4, IPv6 has something similar
- error messages (include header of problem + 8 bytes)
- destination unreachable
- service unavailable
- time-to-live expired
- header problem
- control messages
- redirect
- source quench
- low level network functions
- echo (ping)
- net address request/response
- net mask request/response
- timestamp request/response
- An ICMP redirect causes direct modification of the kernel route, changing the gateway address
- Redirects on FreeBSD accepted only from current router
- First thing is to check if forwarding is enabled
- check destination, no loopback, no broadcast
- routing lookup
- if from the same wire, ICMP redirect message will be sent
From the 1980s:
- most machines should only need one network interface
- network topology should not be visible to users
- routing should have dynamic reconfiguration
- multi-interface machines should potentially be able to act as gateways
Separation of policy and mechanism
- kernel implements mechanism
- intelligent routing process determines policy
- simple hosts may be simple
- sophistication should be easily configurable
- routing can be arbitrarily complex
- User level process: routed
- routing daemons exchange information to determine best routes
- routing daemons add, delete and replace routes in the kernel table to reflect best routing
- kernel limited to looking up routes in its table
- all routes are in a single hierarchical table organized as a radix tree
- routes include variable-length destination, mask and host or gateway
- routing table holds/caches round-trip time, hop count, packet size
- link layer may allocate space for link-layer information/cache (ifaddr structure augmented with link-layer calls for route management)
- "cloning" routes create host-specific routes dynamically; may use external resolution agents
Example radix tree:
- circles are internal node
- bit position tested is shown in circle
- leaf nodes are rectangles containing a key and mask
- some interior nodes are associated with masks
- Picture here of radix tree routing algorithm
- Callers set destination in route structure, then rtalloc() looks up an rtentry, returning a reference
- rtalloc1() does a lookup, optionally cloning the target route if necessary
- Routing ioctl commands replaced with message-oriented routing socket
- Changes to the kernel routing table are requested by messages via routing sockets
- Responses include message ID, errno, are sent to all listeners
- Routing socket also reports other routing changes, redirects, lookup failures, interface up/down
- DHCP
- assigns IP address
- installs route to gateway
- Routed (RIP)
- dynamic local routing
- best strategy for clusters of local networks
- Gated
- OSPF
- EGP
- BGP
- dynamic local routing using variant of Xerox NS routing information protocol
- Bellman-Ford (vector-distance) algorithm
- Gateways broadcast destination and hop for known routes
- Routes may "count to infinity" when destinations become unreachable; infinity is chosen to be 16
- Monitors network links for proper function, drops routes through down links
- On host with one interface no updates are sent
- Open Shortest-Path-First routing protocol
- shortest path first, or link-state routing algorithm
- each router maintains topology database, monitor link state
- two-level hierarchy: routing areas connected by backbone
- internal routers have complete topology for their area
- backbone routers have complete information about areas and backbones
Goals:
- Authentication - knowledge of a key proves you are who you claim to be
- Confidentiality - data is encrypted across the connection
- Replay protection - packets are sequenced at the application level so that replay attempts can be detected
Key:
- AH - authentication header
- ESP - encapsulating-security payload
- SPI - security-parameter index
- Router to router connection
- Encryption happens only on both connected routers
- Host to router connection is possible, the host has to have the key
- Keys stored in kernel managed with key socket with messages:
- getspi
- update
- add
- delete
- get
- acquire
- register
- expire
- flush
- dump
- Packet processing frameworks used to:
- implement firewalls
- NAT
- debug network problems
- provide framework for network-research testbeds
- Available filter frameworks:
- BPF - identify and copy selected packets
- IP firewall (IPFW) - modify or drop selected packets
- Dummynet - traffic shaping
- Packet Filter (PF) - drop selected packets
- Netgraph - provides data-flow between network processing modules
- Netmap - direct access to network packet rings
- Generally used via tcpdump, which uses pcap to express filters
- Scans all incoming packets and copies those that match the selection criterion
- Never drops or modifies packets but does not allow packets to be injected into input stream
- pcap compiles a high-level description to simple byte codes (which a JIT turns into native machine code)
- Can avoid copying by using shared user-space/kernel buffers
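A minimal pcap sketch of that flow: compile a high-level filter expression to BPF byte codes and attach it; "em0" is a placeholder interface name.

```c
#include <pcap/pcap.h>
#include <stdio.h>

int
main(void)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;
	pcap_t *p = pcap_open_live("em0", 65535, 0, 1000, errbuf);

	if (p == NULL)
		return (1);
	/* pcap compiles this expression into BPF byte codes. */
	if (pcap_compile(p, &prog, "tcp port 80", 1,
	    PCAP_NETMASK_UNKNOWN) < 0 ||
	    pcap_setfilter(p, &prog) < 0)
		return (1);

	struct pcap_pkthdr *hdr;
	const u_char *data;
	if (pcap_next_ex(p, &hdr, &data) == 1)	/* copy of one matching packet */
		printf("captured %u bytes\n", hdr->caplen);
	pcap_close(p);
	return (0);
}
```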
- Single central dispatch receives all incoming packets and each packet can be:
- passed through unchanged
- copied to additional packet streams
- diverted to an alternate packet stream
- subject to NAT
- reassembled for further inspection
- sent to dummynet and/or netgraph
- dropped
Decisions on what to do with packets are described by rulesets loaded by the system administrator
Packet-processing framework that provides:
- traffic shaping
- packet delay emulation
- packet scheduling
Pipe emulates a communications link with a scheduler arbitrating multiple independent queues that features:
- programmable bandwidth and delay
- selectable queue priorities
- variable number of queues
- scalable to many thousands of pipes and queues
-
Dummynet scheduling policies
- FIFO - constant running time with no service guarantees
- weighted round robin - constant run time with limited service guarantees
- quick fair queueing - constant run time with bounded service guarantees
- weighted fair queueing - logarithmic run time (in number of queues) with optimal service guarantees
- FreeBSD version is PFv1, PFv2 still to be ported
- Similar functionality to IPFW
- Developed under OpenBSD and ported to FreeBSD
- Packets are normalized then run through a set of rules provided by the system administrator
- Packets that pass the first set of rules run through another ruleset specific to their protocol type (TCP, UDP, ICMP, etc)
- Packets that pass second set of rules are passed through for normal processing
Built around a data-flow model:
- Nodes provide data processing
- Edges provide packet flow between nodes
Currently 50 nodes are available, ranging from a simple node that echoes its input to a complex node that implements Bluetooth
Nodes have a separate control channel to receive configuration (as well as input and output)
Nodes can be added as loadable kernel modules
-
Netgraph example
- Ethernet bridge
- network interface packet rings directly mapped into the application address space
- system calls to request packet transmission and be notified of packet reception
- access via command to/from: /dev/netmap
- NIOCREGIF command to attach to an interface
- attaching stops the flow of packets into the kernel stack
- software ring provided to selectively copy packets into stack and accept packets from the stack
- tail and head sync with kernel/hardware -> userland
- similar to the receive ring; sync with the kernel periodically to sync head and tail on the list
- TCP example
- No other connection may have the same tuple (local address, local port, foreign address, foreign port)
- Connections are kept in a hash table keyed on that 4-tuple
- Create control block
- Bind local address and port
- Set remote address (connect, sendto)
- source is chosen according to outgoing network interface after routing lookup
- Look up control block with matching address and port
- Deallocate control block
- so_snd.sb_hiwat - Socket buffer high-water mark
- maximum number of bytes accepted from the application
- so_snd.sb_cc - Socket buffer character count
- bytes already written by the application
- snd_wnd - Send window
- flow control uses the window; when it is full, no more data may be sent
- snd_nxt - next sequence number to send; data up to it has been sent but not yet acked
- All flow is controlled by sequence numbers
- The window cannot shrink, though its right edge can move forward
- Sequence numbers are negotiated during the handshake
- source port
- destination port
- IP already has the local and external address
- sequence number
- acknowledgment number
- data offset
- Urgent pointer only used when URG flag set
- TCP protocol limits options to 40 bytes
- Window scaling can be negotiated
- a scaling factor is picked by which advertised windows are multiplied (e.g., a shift of 3 makes an advertised 8192 mean 65536 bytes)
- Checksums
- Five tcp timers
- 2MSL (twice maximum segment lifetime), 60 seconds, used while waiting for close
- KEEP (keep alive) 75 seconds, send keep alive or drop dormant connection
- PERSIST (persist) 5 to 60 seconds, force a connection to persist
- REXMT (retransmit), 30 milliseconds to 64 seconds, called when retransmission is necessary
- DELACK (delayed acknowledgement), 100 milliseconds, send a delayed ack to the peer
- Idle
- no data has been sent that has not yet been acked, send queue is empty
- Transmitting
- data have been sent and not yet acked
- REXMT timer is running
- Persisting
- Window is zero or too small to send
- PERSIST timer is running
RTT timing
- A counter (rtt) is started when the first data segment is sent from the idle state
- The counter is sampled when the timed sequence number is acked
- Smoothed round-trip time (srtt) and variance (rttvar) are computed as a moving average of rtt samples using fixed-point arithmetic (a sketch follows this list)
- Sample is discarded if timed data is retransmitted
- If timestamps are in use, round-trip time of every packet can be measured
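A sketch of the classic fixed-point smoothing described above, with srtt kept scaled by 8 and rttvar by 4; the variable names are illustrative rather than the kernel's exact fields:

    /* update the smoothed rtt and variance with a new sample */
    void
    rtt_sample(int *srtt, int *rttvar, int rtt)
    {
        int delta = rtt - (*srtt >> 3);    /* error against smoothed rtt */

        *srtt += delta;                    /* srtt += delta/8, scaled form */
        if (delta < 0)
            delta = -delta;
        *rttvar += delta - (*rttvar >> 2); /* rttvar += (|delta|-rttvar)/4 */
    }

    /* retransmit timeout: srtt + 4*rttvar, with the scaling removed */
    int
    rexmt_timeout(int srtt, int rttvar)
    {
        return ((srtt >> 3) + rttvar);
    }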
- Opening 3-way handshake
- client sends SYN to server
- server sends SYN+ACK to client
- client sends ACK to server
- Closing 2-way handshake
- send FIN
- received ACK
- FIN_WAIT_2 until the other side does its own 2-way close
- During opening handshake each side defines:
- initial sequence number
- max segment size (MSS)
- window size
- Options that both sides must offer to be usable:
- window scaling
- timestamp request with units used
- selective acknowledgement (SACK)
- Socket in accept mode
- Complete and Incomplete list
- sonewconn1() duplicates head socket
- soqinsque() adds to a queue
- soqremque() removes from queue
- so_incomp links partially set up connections
- so_head links connections ready for acceptance
- Need to cope with syn-flooding attacks
The SYN cache is a hash table holding SYN state
- lookup when SYN received
- if not found create entry
- if found, resend SYN/ACK
- never ACK any data sent in initial SYN packet
SYN cookies are used when the SYN cache is full
- maintain no state
- encode the local and foreign addresses, ports, and MSS, and place the result in the initial sequence number of the SYN/ACK
- if timestamps are supported, also encode the send and receive window scaling and SACK support and place them in the timestamp field
- the peer must respond within 16 seconds with a valid ACK (subtract 1 from the sequence number, re-encode, and check for a match)
Metrics cached about a host
- rmx_mtu - MTU for this path
- rmx_ssthresh - outbound gateway buffer limit
- rmx_rtt - estimated rtt
- rmx_rttvar - rtt variance
- rmx_bandwidth - estimated bandwidth
- rmx_sendpipe - outbound delay bandwidth product
- rmx_recvpipe - inbound delay bandwidth product
Looked up at connection establishment
Updated at connection shutdown
Cached entries kept for one hour after last use
- Silly window syndrome - offering or accepting windows less than maximum segment size when max window is large
- Receiver silly window avoidance: if the offered window would be less than 1/4 of the receive buffer and less than a max segment, offer a zero window
- Sender silly window avoidance: if window is too small for all data to be sent, and window is less than max segment, wait for window update or persist timeout.
- Delayed ack is used to piggyback ack on window updates and application-level responses
- Delayed ack also allows multiple packets to be acked at once
- Implementation:
- Delayed-ack flag is set when new data arrives
- After process takes data, if window update would move window by 2 max-size segments, window update and ack are sent.
- If data is returned, the ack is piggybacked
- The DELACK timer runs every 100 ms and forces out acknowledgments not yet sent
- Problem: how to coalesce small segments into larger one without delaying transmission too long
- Obvious solution: set a "short" timer when the first character is received and transmit when the timer expires. (The timer isn't long enough to help on slow nets and is just long enough to irritate users on local nets.)
- Nagle algorithm: send immediately when a small amount of data is presented; then hold additional data until a full packet can be sent or until the first segment is acked
- Certain clients send small segments that receive no response but expect rapid action; an option is needed to defeat the algorithm (TCP_NODELAY, sketched below)
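A minimal sketch of defeating Nagle with the TCP_NODELAY socket option; fd is assumed to be an already-connected TCP socket:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* disable the Nagle algorithm so small writes go out immediately */
    int
    set_nodelay(int fd)
    {
        int one = 1;

        return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
    }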
- Van Jacobson congestion control
- Observation from Van Jacobson: in steady state, ACKs from the receiver should tell the transmitter when to send new data
- Problem: how to reach steady state.
- Slow start doesn't solve congestion
- Solution: maintain an estimate of the window size acceptable to the network (ssthresh)
- Whenever a packet is lost, ssthresh is set to half the current window
- Slow start opens the window exponentially up to ssthresh
- Window is opened linearly after slow start, probing network capacity
- The REXMT timer is set to its initial value when data is sent
- If timer expires:
- snd_nxt is set to snd_una
- ssthresh is set to half the current window
- cwnd is set to 1 segment
- tcp_output is called, transmits one segment
- Timer is restarted, backed off at exponential rate
- After 12 retransmissions the connection is dropped; in fast recovery, one additional packet is sent for each additional duplicate ack (window arithmetic sketched below)
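A sketch of the window arithmetic described above, applied on each incoming ACK and on a retransmit timeout; the names mirror the discussion rather than exact kernel fields:

    /* per-ACK growth: exponential in slow start, ~1 segment/RTT afterward */
    void
    cc_ack(unsigned long *cwnd, unsigned long ssthresh, unsigned long maxseg)
    {
        if (*cwnd < ssthresh)
            *cwnd += maxseg;                  /* slow start */
        else
            *cwnd += maxseg * maxseg / *cwnd; /* congestion avoidance */
    }

    /* on REXMT expiry: halve ssthresh, collapse the window to one segment */
    void
    cc_timeout(unsigned long *cwnd, unsigned long *ssthresh,
        unsigned long maxseg)
    {
        *ssthresh = *cwnd / 2;
        *cwnd = maxseg;
    }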
Fast retransmission
- the receiver keeps acking the last in-order packet until the missing packet is retransmitted
- SACK tells the sender which packets it has received beyond the currently ACK'ed data
- Acked to 500
- SACK indicates that 1000-2000 and 3000-3500 have been received
- Sender will respond by sending 500-1000 and 2000-3000, then resume at 3500
- Until FreeBSD 7, the kernel had a single congestion-control algorithm (New Reno, from 4.4BSD)
- Congestion control algorithms can now be added as a loadable kernel module
- Currently five available:
- New Reno (default)
- Vegas
- CUBIC
- Hamilton Institute's delay based
- H-TCP
- Each connection gets to choose its algorithm (using TCP_CONGESTION option to setsockopt())
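A sketch of that per-connection selection via the TCP_CONGESTION socket option; it assumes the corresponding module (e.g. cc_cubic) has been loaded:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* ask for CUBIC on this connection instead of the system default */
    int
    use_cubic(int fd)
    {
        const char *name = "cubic";

        return (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
            name, strlen(name)));
    }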
- Every time an ACK is received, the chosen algorithm is invoked with this connection information
Algorithms
- New Reno
- detection based on retransmission timer
- exponential backoff
- Vegas
- On each ACK measures the difference between expected and actual rate
- if difference below threshold, increase the congestion window
- if above, decrease
- if within threshold, make no change
- CUBIC
- Designed to quickly find network capacity on long fast links (a 1 Gbit US-to-Japan link has a 110 ms RTT and takes 10 minutes to reach capacity)
- SCTP: reliable multiple-stream transport protocol
- Provides associations between hosts with up to 65k independent streams
- Each stream can be message- or stream-based
- Associations setup with 4-way handshake to avoid SYN attack
- No server state required until third handshake
- Data can start flowing on third handshake
- Streams setup within an association do not require handshake
- SCTP packets look like TCP packets
- with the addition of multi-stream handling
- transmission sequence number
- stream identifier tells which stream
- stream sequence number tells which packet within stream
- though the length field is 16 bits, messages may be made up of multiple chunks
- SCTP allows an association to have multiple addresses/routes using sctp_bindx() and sctp_connectx()
- One route selected for use
- Backup routes send heartbeat packets to confirm viability
- If the selected route fails, switch to an alternate (a connect sketch follows)
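A hedged sketch of setting up a multihomed association with sctp_connectx(); the two peer addresses and the port are illustrative, and error handling is omitted:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    connect_multihomed(void)
    {
        int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
        struct sockaddr_in addrs[2];
        sctp_assoc_t assoc;

        memset(addrs, 0, sizeof(addrs));
        addrs[0].sin_len = addrs[1].sin_len = sizeof(struct sockaddr_in);
        addrs[0].sin_family = addrs[1].sin_family = AF_INET;
        addrs[0].sin_port = addrs[1].sin_port = htons(5001);
        inet_pton(AF_INET, "192.0.2.10", &addrs[0].sin_addr);    /* primary */
        inet_pton(AF_INET, "198.51.100.10", &addrs[1].sin_addr); /* backup */
        /* hand the kernel both peer addresses for a single association */
        return (sctp_connectx(sd, (struct sockaddr *)addrs, 2, &assoc));
    }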
- Firmware initializes the hardware and loads the first-level bootstrap
- The first-level bootstrap identifies the disk label and loads the second-level bootstrap (8 KB to 64 KB)
- Second level bootstrap interprets filesystem to load full bootstrap
- Final stage loads kernel and kernel modules and starts the kernel
- Kernel runs autoconfig and starts remaining CPUs
- Process 1 (/sbin/init)
- Initialize low-level hardware
- Optionally run hardware diagnostics
- Pre-boot services such as RAID configuration and remote management
- Select boot CPU and put remaining CPUs in suspend mode
- Build a list of attached I/O infrastructure to export to the kernel using ACPI or FDT
- Identify possible boot devices and select one
- Read in the first-level bootstrap from the front of the boot device and branch to it to start it running
- Provide rudimentary I/O capabilities until kernel can operate devices by itself
- Contains either a (historic) MBR or more modern GPT label
- Use the label to identify the location of the FreeBSD bootable partition
- Load first 64K of FreeBSD partition and branch to it to start it running
- gptboot to load from UFS
- gptzfsboot to load from ZFS
Actions taken by the second-level bootstrap
- enable access to full memory
- fetch boot.config and interpret any flags
- load /boot/loader
Full featured interactive loader
- choices of kernel to load
- inspect and set kernel env vars
- preload kernel modules
- preload security keys
- preload memory-disk images
- filesystem access
- Forth script interpreter
Default at startup is to run /boot/loader.4th
- load /boot/kernel/kernel and any modules in /boot/loader.conf
- at end of countdown, boot loaded kernel
Modules can be loaded into the kernel
- when it is compiled
- when it is loaded
- when it is running (if not disabled)
Modules register an event handler
- called at startup to initialize the module
- called at shutdown to clean up the module
Has a version number to allow field upgrades. Can register new system calls that it supports. (A module skeleton follows.)
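A minimal loadable-module skeleton showing the event handler and version registration; this is the canonical FreeBSD KLD pattern, with an illustrative module name:

    #include <sys/param.h>
    #include <sys/module.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    /* called at load to initialize and at unload to clean up */
    static int
    example_handler(module_t mod, int what, void *arg)
    {
        switch (what) {
        case MOD_LOAD:
            printf("example module loaded\n");
            return (0);
        case MOD_UNLOAD:
            printf("example module unloaded\n");
            return (0);
        default:
            return (EOPNOTSUPP);
        }
    }

    static moduledata_t example_mod = {
        "example",        /* module name */
        example_handler,  /* event handler */
        NULL              /* extra data */
    };

    DECLARE_MODULE(example, example_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
    MODULE_VERSION(example, 1); /* version number enabling field upgrades */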
Stages of kernel startup
- Assembly-language setup and enabling memory management unit
- Platform-specific C-language startup of architecture specific interrupt vectors, stacks, and crafting first kernel process and thread
- Set up basic kernel services: memory allocation, scheduling, locking
- Run through kernel module initialization to initialize higher-level services (filesystem, network stack, start /sbin/init)
Process 1 is forked from process 0 (the swapper)
Register with:
- SYSINIT(name, subsystem, order, function, identifier) to start (a sketch follows this block)
- SYSUNINIT(name, subsystem, order, function, identifier) to stop
Ordering based on 2-level hierarchy:
- subsystem is first level listed in /sys/sys/kernel.h
- order is second level to differentiate modules in same subsystem
Module invoked as function(identifier)
Sorted at startup, then executed in order
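A sketch of the registration itself (kernel code); the subsystem and order values are illustrative picks from /sys/sys/kernel.h:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    hello_start(void *ident)
    {
        printf("SYSINIT ran: %s\n", (char *)ident);
    }

    static void
    hello_stop(void *ident)
    {
        printf("SYSUNINIT ran: %s\n", (char *)ident);
    }

    static char hello_id[] = "hello";

    /* sorted by (subsystem, order) at startup, then run as function(identifier) */
    SYSINIT(hello, SI_SUB_KLD, SI_ORDER_ANY, hello_start, hello_id);
    SYSUNINIT(hello, SI_SUB_KLD, SI_ORDER_ANY, hello_stop, hello_id);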
Defined in /sys/sys/kernel.h
- SI_SUB_TUNABLES
- set subsystem parameters from loader tunable
- SI_SUB_VM
- early virtual memory
- SI_SUB_KMEM
- kernel malloc() initialization
- SI_SUB_HYPERVISOR
- Xen HVM hypervisor detection
- SI_SUB_WITNESS
- configure witness lock-debugging facility
- SI_SUB_MTX_POOL_DYNAMIC
- initialize dynamically allocated mutex pools
- SI_SUB_LOCK
- initialize statically declared or allocated locks
- SI_SUB_EVENTHANDLER
- register statically declared event handlers
- SI_SUB_VNET_PRELINK
- SI_SUB_KLD
- SI_SUB_CPU
- SI_SUB_RACCT
- SI_SUB_RANDOM
- SI_SUB_KDTRACE
- SI_SUB_MAC
- SI_SUB_MAC_POLICY
- SI_SUB_MAC_LATE
- SI_SUB_INTRINSIC - hand-craft process 0
- SI_SUB_AUDIT - create audit worker thread
- SI_SUB_CREATE_INIT - fork process 1
- SI_SUB_SCHED_IDLE - create idle thread
- SI_SUB_INTR - create device-driver threads
- SI_SUB_SOFTINTR - create callout thread, netisr thread
- SI_SUB_DRIVERS - GEOM thread, random device
- SI_SUB_CLOCKS - create and start system clock threads
- SI_SUB_KTHREAD_PAGE - create vm_pageout daemon thread
- SI_SUB_KTHREAD_VM - create vm_daemon thread
- SI_SUB_KTHREAD_BUF - create buf_daemon
- SI_SUB_KTHREAD_UPDATE - create vnlru_proc daemon
- SI_SUB_KTHREAD_UPDATE - create sched_sync daemon
- SI_SUB_KTHREAD_INIT - schedule the init process to run
When the init thread starts running in the kernel, it must get the root filesystem mounted
- Mount devfs filesystem as root ("/") filesystem
- Search through "rootdevnames" for a viable root filesystem
- Mount selected root filesystem on /root
- Swap filesystems: /root => / and / => /dev
Find and load /sbin/init
Exit the kernel into the init process
Load additional kernel modules
In single-user mode, init spawns a shell
In multi-user mode, init spawns a shell to run the commands in /etc/rc
- Check filesystems as needed and then mount them
- Run rcorder to select order that scripts in /etc/rc.d and /usr/local/etc/rc.d should run based on REQUIRE and PROVIDE declarations in each script
- Run the scripts using /etc/rc.conf and /etc/defaults/rc.conf to configure the subsystems to be run (such as sshd) and the options with which they should run
Shutdown happens in 3 steps
- Shutdown services that need to store data in the filesystem
- Shutdown the filesystem
- Shutdown of the remaining services
Services register using EVENTHANDLER_REGISTER(name, function, argument, priority)
- The name identifies which shutdown step
- shutdown_pre_sync
- shutdown_post_sync
- shutdown_final
- The priority works like the order parameter of SYSINIT() (a registration sketch follows this list)
- Event is invoked as function(argument)
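A sketch of registering for one of the shutdown steps (kernel code); the handler signature matches the shutdown events, which pass the howto flags, and the registration is done from an illustrative SYSINIT:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    static eventhandler_tag example_tag;

    /* shutdown handlers receive the registered argument plus howto flags */
    static void
    example_sync(void *arg, int howto)
    {
        printf("flushing state before filesystems sync (howto=0x%x)\n", howto);
    }

    static void
    example_init(void *unused)
    {
        /* the event name picks the shutdown step; priority orders handlers */
        example_tag = EVENTHANDLER_REGISTER(shutdown_pre_sync, example_sync,
            NULL, SHUTDOWN_PRI_DEFAULT);
    }
    SYSINIT(example_shutdown, SI_SUB_KLD, SI_ORDER_ANY, example_init, NULL);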
Tuning filesystems with tunefs
- -m minimum percentage of free space
- -e maximum number of blocks (2048 = 32 megabytes)
- -a enable or disable access control lists (disabled)
- -l enable or disable MAC
- -j enable or disable journaling
- -s specify average number of files in a directory (64)
- -f specify average file size (16384)
- -o optimization: minimize system time or minimize space used on disk (time)
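An illustrative invocation (hypothetical device name), run on an unmounted filesystem: tunefs -m 5 -o space /dev/ada0p2 lowers the free-space reserve to 5% and optimizes for space rather than time.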
System monitoring with systat
- Output of "systat -vmstat"
- Load average should be less than the number of CPUs
- Output of "systat -ifstat"
Ktrace and truss (the following focuses on ktrace)
- Provide kernel tracing logging
- system calls
- pathname translation
- signal processing
- I/O
- Tracing can be flexibly applied
- selectively for running processes
- for individual commands
- for process groups
- through inheritance to all current or future children
- Trace files are generated in binary format
- Translation to readable form is done by kdump
- Useful translations are optionally added
- system-call numbers to names
- ioctl values
- system errors with standard error string
- output can display running time between and during system calls
- trace example
- most information is available through sysctl() (a retrieval sketch follows this block)
- extensible MIB-style interface
- allows selective retrieval of kernel data structures
- process table
- routing table
- vnode table
- allows selective setting of kernel variables
- networking: IP forwarding, redirect, enable, time-to-live
- debugging: filesystem cluster enabling
- limits: maxproc, maxfiles, maxvnodes
MIB = Management Information Base
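A minimal retrieval sketch using sysctlbyname(3) to read one of the limit variables listed above:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int maxproc;
        size_t len = sizeof(maxproc);

        /* look the variable up by its MIB-style name */
        if (sysctlbyname("kern.maxproc", &maxproc, &len, NULL, 0) == 0)
            printf("kern.maxproc = %d\n", maxproc);
        return (0);
    }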
- Process structures are no longer contiguous; other info is scattered.
- Solution: define kinfo_proc structure:
- useful fields from proc
- xucred
- useful fields from vmspace
- session and pgrp info (login, pgid, tty)
- wchan message
- Interface provides selective retrieval of the process table (a sketch follows this list):
- by proc id
- by proc group id
- by session
- by effective user id
- by real user id
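A sketch of the by-process-id retrieval path via the MIB interface, fetching our own kinfo_proc:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>   /* struct kinfo_proc */
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kinfo_proc kp;
        size_t len = sizeof(kp);
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, (int)getpid() };

        /* ask the kernel for the kinfo_proc of this pid */
        if (sysctl(mib, 4, &kp, &len, NULL, 0) == 0)
            printf("comm: %s  pgid: %d\n", kp.ki_comm, (int)kp.ki_pgid);
        return (0);
    }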
- Goals
- Want all programs that access kernel tables to work on a running or dead kernel
- Want to keep machine dependent information in a single place
- Provide a "kernel virtual memory" (kvm) library
- uses sysctl for running kernels
- knows how to extract tables for dead kernel
- maintains a cache of kernel namelist entries for running kernel to avoid repeated nlist() calls
- Provided that programs do not use machine-dependent fields in the tables they extract, they are portable from one architecture to another once the kvm library has been ported (a usage sketch follows)
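A sketch of the library in use on the running kernel (compile with -lkvm); passing kernel and core file paths instead of the NULLs examines a dead kernel:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        int cnt;

        /* NULLs mean "the running kernel" */
        kvm_t *kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
            fprintf(stderr, "%s\n", errbuf);
            return (1);
        }
        struct kinfo_proc *kp = kvm_getprocs(kd, KERN_PROC_PROC, 0, &cnt);
        if (kp != NULL)
            printf("%d processes\n", cnt);
        kvm_close(kd);
        return (0);
    }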
FreeBSD does kernel debugging remarkably well
- Better than Solaris
- Better than Linux (mostly because of Linus's attitude of "read the code" and figure it out)
Low-level built-in debugger runs on console
High-level kernel debugging facility provides complete symbolic debugging using kgdb(1) or lldb(1)
- symbolically examine and modify memory locations
- set breakpoint in the kernel
- single step the kernel
Usage - enable the debugger when booting, then stop:
- at first instruction
- when debugger is attached
- on kernel trap/panic
Enable crash dumps
- location specified using dumpon(8)
- usually configured for primary swap area
- default is "mini-dump" which saves only kernel memory
Recovering a crash dump
- after reboot run savecore(8) to extract
- can then run kgdb(1) or lldb(1) to peruse the crash
- general hints given in crash(8)
- wealth of information collected using crashinfo(8) and textdump(4)
netstat -m -N kernel.80 -M vmcore.80
vmstat -m -N kernel.80 -M vmcore.80
vmstat -z -N kernel.80 -M vmcore.80
- ps -axlw
- fstat | more
- netstat -N kernel.80 -M vmcore.80
DTrace provides detailed information throughout the application stack
- system calls
- libraries
- application code
many handy one-line DTrace scripts: https://wiki.freebsd.org/DTrace/One-Liners
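One such one-liner, counting system calls by program name until interrupted (a standard example from that page):

    dtrace -n 'syscall:::entry { @[execname] = count(); }'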
- pmc
- details of all activity on a CPU
- activity of a process
- Things that can be measured:
- interrupts taken
- instructions executed
- branches executed and mispredicted
- number of non-idle processor cycles
- instruction cache misses and prefetches
- data cache misses, prefetches and writebacks
Using PMC
- pmc(3) describes types of counters
- pmccontrol -L - lists names of available counters for current machine
- pmcstat -T -S event-spec - gives a top-like display of the listed counters
- Understand the operating system
- While reading about it do not turn a page without understanding it completely
- Good observability tools will help you find unnecessary work
- They help you eliminate this unnecessary work and tune parameters
Recap tools:
- systat
- pidstat
- vmstat
- iostat
- truss
- tcpdump
- Anti-pattern
- Workload Identification
- who
- what
- why
- how
- USE Method
- Utilization
- Saturation
- Errors
- Off-CPU
- 100% of benchmarks are wrong
- Use only for experimental analysis
- Types:
- micro benchmarks
- macro benchmarks
- bad benchmarks
- Great for understanding what a benchmark is doing
- pmcstat
- dtrace
- pmcstat
- Cycles per instruction
- unhalted cycle count
- instruction count
- cycles per instruction = unhalted cycles / instructions
- dtrace
- truss
- ktrace
- DTrace
- sysctl -aW
- /etc/sysctl.conf
- Treat tunable parameters as someone else's medicine cabinet