- last time:
  - capabilities: token to access = ability to access
    - example: file descriptors
    - to make useful: provide way to pass between processes
    - permissions check done when acquiring capability (hopefully)
  - virtual machines: hardware / host+hypervisor / guest OS / application in guest OS
    - hypervisor tracks similar information to a process control block:
        current registers; current memory; current I/O devices
    - hypervisor tracks extra information to emulate HW's kernel mode:
        does guest OS believe it's in kernel mode? [guest OS is always in user mode on the real hardware]
        guest OS page table pointer
        guest OS exception table pointer
        ...
  - trap-and-emulate
      hope/make sure special operations in guest OS trigger exceptions:
        I/O
        updating things that support kernel mode on the real HW
        system calls
        ...
      in hypervisor/host OS exception handler: do what hardware would do
        if guest OS thinks it's in kernel mode, maybe a privileged operation:
          emulate talking to I/O device
          update emulated page table pointer
          switch from pretend kernel to pretend user mode
        reflect exceptions back to guest OS
          system call while guest OS is running --> triggers GUEST system call handler
  - shadow page tables
      build "shadow" page table for real hardware by combining guest + hypervisor PT
      add shadow page table entries on demand
      clear shadow page table in response to TLB flushes
      separate tables for pretend kernel versus not
-----
- page cache
  - want to keep program data and file data mostly in memory
  - *mostly* --- have a place on disk for the data
    - can store on disk --- when not loaded yet
    - can store on disk --- if not enough space in memory
  - program data: everything that's in a program's memory
  - file data: everything in a regular file
  - we need some way to look up:
      given something I want to access, where is it in memory? ["forward mapping" (cache hit)]
        if it's part of a process's address space: page table
        if it's part of a file being read with system calls: some OS data structure
      given something I want to access, where is it on disk? ["forward mapping" (cache miss)]
        if it's part of a process's address space: some OS data structure attached to the PCB
        if it's part of a file being read with system calls: based on how the FS works
  - more subtly, also need a way to look up:
      given something I know is in memory, where are there pointers to it? ["reverse mapping"]
        we need some sort of OS data structure for everything kept in memory in the cache
    - needed this information because when memory is full, we need to unset pointers
      to free up more memory for something else - "page replacement"
- page replacement algorithms = a way to choose what thing to replace
  when memory is full and we need to bring in something new
    theoretical best: replace whatever is going to be accessed never again or furthest in the future
      ("Belady's MIN")
    practical ideas:
      assume things that haven't been accessed recently won't be accessed for a long time
        --> least recently used policy
      if this assumption is true, it's enough to track things which have been accessed recently
        without being precise about it
          example: second-chance policy: look for something which hasn't been accessed
            since we added it to a list of pages (see the sketch below)
          various choices about what time-spans we look for accesses in
      sometimes the assumption above is false ---> we can have heuristics ["guesses"] for those cases
        example: things we're only going to access once
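  [aside] a minimal C sketch of the second-chance ("clock") policy described above; this is an
  illustration only -- the fixed frame array, struct frame, and choose_victim() are made-up names,
  and in a real OS the referenced bit would be set by the hardware / copied out of the page table:

      #include <stdbool.h>
      #include <stddef.h>

      struct frame {
          int  page;        /* which page currently occupies this frame */
          bool referenced;  /* accessed since we last looked at it? */
      };

      #define NUM_FRAMES 8
      static struct frame frames[NUM_FRAMES];
      static size_t clock_hand = 0;   /* position in the circular list of frames */

      /* sweep forward; a referenced frame gets its bit cleared and a second chance;
         the first unreferenced frame we find is the one to replace */
      size_t choose_victim(void) {
          for (;;) {
              struct frame *f = &frames[clock_hand];
              size_t idx = clock_hand;
              clock_hand = (clock_hand + 1) % NUM_FRAMES;
              if (!f->referenced)
                  return idx;          /* not accessed since last sweep: evict this one */
              f->referenced = false;   /* second chance; check again next sweep */
          }
      }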
- device driver flow
  two components to the device driver:
    "top half": the part run from system calls/etc.
      sets up the I/O operation
    "bottom half": the part run from the interrupt handler triggered by the device, finishes the I/O operation
      fills in buffer
  general pattern:
    process makes a read/write call
    top half executes and checks if it already has the data for read/write
      [typically: buffer containing data from/to device]
    if it doesn't, it puts the process to sleep
    while the process is sleeping, other things can happen
    device triggers interrupt when read/write is finished
    bottom half completes I/O operation [typically: updates buffer, etc.]
    bottom half wakes up process doing read/write call
    (sometimes: top half might need to run multiple times to do a very big read/write)
- SSD versus HDD performance
  HDD: big performance issue is that the head needs to move to the right location on the disk
    so it's useful/important to try to keep data you want to access together, close physically on the disk
    to help OSes, etc. do this: HDDs assign sector numbers such that nearby sector numbers are close on the disk
  SSD: no moving parts, so physical location doesn't matter very much
    in fact: really doesn't matter, because the SSD doesn't place data in physical locations the OS knows about
    fast reads/writes regardless of location
    big performance problem comes from needing to clear up space
      has to be done in "erasure blocks"
      data has to be copied to make sure a whole erasure block can be erased
      happens in the device (OS doesn't generally get to know about it except for slowness)
- inode-based filesystems without block groups
  layout of the disk:
    [ header ][ array of inodes ][ array of bools: is data block free ][ data blocks ]
              ^^^^^^^^^^^^^^^^^^
              information about files (size, owner, permissions, where the first several data blocks are located)
- single/double/triple-indirect blocks
  inode directly contains the block #s (index into data blocks part of the disk) of the first few blocks of data
    "direct block pointers"
    example: inode has 10 direct block pointers and blocks are 2K,
      then the first 20K of the file can be found using those 10 pointers
  after that, more and more indirection, depending on how big the file is (lookup sketch below):
    example: inode has one indirect pointer (after 10 direct)
      then bytes 20K through (1024K+20K) = 1044K are found using that indirect pointer
      indirect pointer points to a block (2K)
      the block contains an array of block pointers -- how big?
        if block pointers are 4 bytes, then 2K / 4B = 512 elements
      each of those elements points to a block [that will contain data for the file]
      512 * 2K = 1024K in those blocks
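  [aside] a C sketch of the offset-to-block lookup just described, assuming 2K blocks, 10 direct
  pointers, and 4-byte block numbers; struct inode, read_block(), and block_for_offset() are
  made-up names for illustration, and double/triple indirection and error handling are omitted:

      #include <stdint.h>

      #define BLOCK_SIZE     2048u                             /* 2K blocks */
      #define NUM_DIRECT     10u                               /* direct pointers in the inode */
      #define PTRS_PER_BLOCK (BLOCK_SIZE / sizeof(uint32_t))   /* 512 pointers per block */

      struct inode {
          uint32_t direct[NUM_DIRECT];  /* block #s of the first 10 data blocks */
          uint32_t indirect;            /* block # of a block full of block #s */
          uint32_t double_indirect;     /* not handled in this sketch */
      };

      /* hypothetical helper: read disk block 'block_num' into 'buf' */
      void read_block(uint32_t block_num, void *buf);

      /* return the data block number holding byte 'offset' of the file */
      uint32_t block_for_offset(const struct inode *ino, uint64_t offset) {
          uint64_t block_index = offset / BLOCK_SIZE;   /* which data block of the file? */
          if (block_index < NUM_DIRECT)
              return ino->direct[block_index];          /* bytes 0 through 20K: direct pointers */

          /* bytes 20K through 1044K: one level of indirection */
          block_index -= NUM_DIRECT;
          uint32_t pointers[PTRS_PER_BLOCK];
          read_block(ino->indirect, pointers);          /* fetch the block of 512 block #s */
          return pointers[block_index];                 /* assumes block_index < 512 */
      }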
    what if 1044K isn't big enough?
    example: inode has one double-indirect pointer (after 10 direct + 1 indirect)
      then bytes 1044K through (1044K+524288K) are found using that double-indirect pointer
      double-indirect pointer points to a block, which contains 512 pointers ["indirect pointers"]
      those pointers each point to a block, which contains 512 pointers ["direct pointers"]
      and those pointers point to blocks that will contain data for the file
      512 * 512 * 2K of data for the file in those blocks --> 524288K of data
- # of blocks needed / file size calculation
  if we use the indirect block, that doesn't mean that all the pointers in the indirect block are needed
    example: if the file is 24K, we need 2 pointers in the indirect block (20-22K, 22-24K)
    example: if the file is 1048K, we need 1 pointer in the double-indirect block (covering 1044-2068K)
      and 2 pointers in the block it points to (1044-1046K, 1046-1048K)
      the first 20K use the direct pointers, the next 1024K use the indirect block and all its 512 pointers
      then we have 4K left over, but we need to use the double-indirect pointer next
        even though we only want to point to two blocks of data
      to do this: we have the double-indirect pointer point to a block with one useful indirect pointer
        then that useful indirect pointer points to a block with two useful direct pointers
  total number of blocks (see the sketch below):
    1048K / 2K = 524 blocks of data
    plus 1 block of direct pointers pointed to by the indirect pointer in the inode
      [points to where bytes 20K through 1044K of the file's data are]
    plus 1 block of indirect pointers pointed to by the double-indirect pointer in the inode
    plus 1 block of direct pointers pointed to by the first entry in that double-indirect block
      [points to where bytes 1044K through 1048K of the file's data are]
    = 527
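  [aside] a small C sketch of that block-count calculation, under the same assumptions (2K blocks,
  10 direct pointers, 512 pointers per block, no triple-indirect pointer); blocks_needed() is a
  made-up helper name:

      #include <stdio.h>

      #define BLOCK_SIZE     (2 * 1024)           /* 2K blocks */
      #define NUM_DIRECT     10                   /* direct pointers in the inode */
      #define PTRS_PER_BLOCK (BLOCK_SIZE / 4)     /* 4-byte block pointers -> 512 */

      static long div_round_up(long a, long b) { return (a + b - 1) / b; }

      /* total blocks (data blocks + pointer blocks) for a file of file_size bytes,
         not counting the inode itself */
      long blocks_needed(long file_size) {
          long data  = div_round_up(file_size, BLOCK_SIZE);
          long total = data;
          if (data > NUM_DIRECT) {
              total += 1;                               /* the indirect block itself */
              long past_indirect = data - NUM_DIRECT - PTRS_PER_BLOCK;
              if (past_indirect > 0) {
                  total += 1;                           /* the double-indirect block itself */
                  total += div_round_up(past_indirect, PTRS_PER_BLOCK);  /* indirect blocks under it */
              }
          }
          return total;
      }

      int main(void) {
          printf("%ld\n", blocks_needed(1048L * 1024));   /* prints 527, matching the example */
          return 0;
      }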
- fragments
  - for very small files, we might not want to allocate a whole block of data
  - divide blocks into smaller pieces, but only use those smaller pieces for small files
  - fragments: these smaller pieces
  - why only for small files?
    - if we used small block sizes for large files, then we'd add a lot more block pointers
- sockets
  - when set up: act like two-way pipes (with file descriptors for each end)
    can write on one "end" of the socket, read on the other "end", and vice-versa
  - setup process: we need to specify (server-side sketch below)
    [client] the address + port number ("socket address") to 'connect' to
      socket()      --> create file descriptor
      getaddrinfo() --> translate human-readable hostname (virginia.edu) + service name to socket address
      connect()     --> make file descriptor access a server using that socket address
    [server] the address + port number to 'listen' for connections on
      socket()      --> create "server socket" file descriptor
      getaddrinfo() --> translate human-readable hostname (virginia.edu) + service name to socket address
      bind()        --> set the address + port number using a socket address
      listen()      --> tell the OS to expect connections
      accept()      --> get a new file descriptor for the next client that connects
        can call accept() in a loop for multiple clients
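  [aside] a minimal C sketch of the server-side sequence above; the port "8000" and the hello
  message are made up, and error handling is omitted to keep the shape of the calls visible:

      #include <netdb.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void) {
          /* getaddrinfo(): turn "any local address" + a service name into a socket address */
          struct addrinfo hints, *server_addr;
          memset(&hints, 0, sizeof hints);
          hints.ai_family   = AF_UNSPEC;      /* IPv4 or IPv6 */
          hints.ai_socktype = SOCK_STREAM;    /* TCP */
          hints.ai_flags    = AI_PASSIVE;     /* address suitable for bind() */
          getaddrinfo(NULL, "8000", &hints, &server_addr);

          /* socket() + bind() + listen(): create the server socket and attach the address */
          int server_fd = socket(server_addr->ai_family, server_addr->ai_socktype,
                                 server_addr->ai_protocol);
          bind(server_fd, server_addr->ai_addr, server_addr->ai_addrlen);
          listen(server_fd, 10);              /* tell the OS to expect connections */
          freeaddrinfo(server_addr);

          for (;;) {
              /* accept(): new file descriptor for the next client that connects */
              int client_fd = accept(server_fd, NULL, NULL);
              const char msg[] = "hello\n";
              write(client_fd, msg, sizeof msg - 1);   /* the fd acts like one end of a pipe */
              close(client_fd);
          }
      }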
- distributed system failure cases
  - our main failure model assumed for *machines*: "fail-stop"
      either the machine works or it's *as if* powered off
      can remember things in spite of failure by writing to disk carefully
  - our main failure model assumed for *networks*: we'd have lost messages + reordering
      lost messages: you send a message and it doesn't get there
        sometimes includes: you send a message, and it does get there, but the reply message doesn't get to you
      reordered messages: you send a message A, then a message B, and message B arrives first, then message A
        includes: you send message A and wait long enough to assume it's not getting there
- guest OS versus host OS exceptions
  - when a guest OS is running, the only exception handler on the real hardware is the host OS's
    [the real hardware doesn't know it's running a virtual machine (b/c we assumed no HW support for VMs)]
  - causes of host OS exceptions when a guest OS is running:
    * the guest OS did something that would've triggered an exception if the guest OS wasn't in a VM:
      - (?) caused an I/O device to request something to happen
      - making a system call
      - accessing memory that is invalid in a page table
      - ...
      --> if it would've happened outside a VM, our VM should run the guest OS's exception handler
          from the host OS exception handler (b/c the guest OS shouldn't need to know it's in a VM)
    * the guest OS did something that would've done something other than triggering an exception
      if the guest OS wasn't in a VM:
      - talking to an I/O device via memory-mapped I/O
      - asking to change the page table base pointer
      - asking to disable interrupts
      - asking to change the exception table address
      - asking to switch from kernel to user mode
      ---> the VM should pretend to do ("emulate") the operation the guest OS would've done outside a VM
- which thread scheduler policy makes sense when
  - metrics:
      turnaround time ~~ time from when a program becomes runnable until when it stops being runnable
        if the program is a loop of "wait for input; process input", then this reflects how "interactive" the program is
        --> optimal algorithm: SRTF (shortest remaining time first)
        --> practical algorithm: approximations of SRTF that estimate the remaining time somehow
            usual estimate: based on past amounts of time used
      fairness ~~ whether multiple programs/users get to use the same amount of CPU
        or have the same ability to access the CPU
        --> round-robin: when you're runnable, same ability to access the CPU
            but we don't care about when you're not runnable
        --> Linux CFS (virtual time idea): same total amount of CPU usage over time
            with some configurable options for "over how much time?"
            [b/c we don't want to worry about what happened minutes/hours/days ago]
      throughput ~~ amount of compute time used for actual work,
        not figuring out what to do or starting/stopping programs ("overhead")
        --> FCFS: simplest scheduling decision + minimal context switches
      ...
- POSIX access control
  - two mechanisms for specifying file access control
    ~ historic mechanism where files have: --> usually stored in inodes
        an owner (user ID)
        a group (group ID)
        "mode" bits specifying:
          can owner read the file? can owner write the file? can owner execute the file?
          can group read the file? can group write the file? can group execute the file?
          can anyone else read the file? can anyone else write the file? can anyone else execute the file?
        --> can view this as an access control list, but with very limited features
    ~ less supported, much more flexible mechanism called "POSIX ACLs" ("real access control lists")
        --> probably requires special filesystem support
        have a list of any number of
          user ID: can that user ID read / write / execute?
          group ID: can that group ID read / write / execute?
          default: can non-specified users/groups read / write / execute?
          ...
    ~ for either of these:
        checked on open(), rename(), remove(), etc.
        based on the user and group ID stored in the process control block of whoever makes a file-related system call
- TOCTTOU: time of check to time of use
  a security problem that occurs when we separate the check of an access control list (or similar)
    from actually accessing an item
  example: print file program that runs as root
    timeline:
      [attacker] calls the print file program on example.txt
      [print file program] checks that the attacker is allowed to access example.txt   <--- check
      [attacker] moves (links, makes a symbolic link) secret.txt to example.txt
      [print file program] accesses and prints "example.txt", which is now a different file   <--- use
    one solution: print file program opens the file, then checks using the file descriptor
      --> a file descriptor won't change what file it refers to because of a rename/link/etc. operation
    another solution: print file program opens the file in a way which checks whether the attacker can access it
    --> both solutions: combine check and use in some way --- ensure it's always the same file
        (see the sketch below)
  can occur with non-files:
    [attacker] kill(pid 10000)
    [kernel system call implementation] can attacker kill pid 10000? --> yes
    [attacker] causes pid 10000 to exit
    [attacker] triggers important program to start as pid 10000
    [kernel system call implementation] actually kills pid 10000, but it might be different from the one checked
    solution: make sure what pid 10000 refers to doesn't change between the check and the use (the kill)
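  [aside] a C sketch of the "open first, check the file descriptor" idea; the function name and the
  specific check (only allow files owned by the requesting user) are made up for illustration -- the
  point is that fstat() on the fd examines the same file that will later be read, so a rename/link
  between check and use can't swap in a different file:

      #include <fcntl.h>
      #include <sys/stat.h>
      #include <unistd.h>

      /* open, then check the file we actually opened */
      int open_if_owned_by(const char *path, uid_t requesting_uid) {
          int fd = open(path, O_RDONLY);
          if (fd < 0)
              return -1;

          struct stat st;
          if (fstat(fd, &st) < 0 || st.st_uid != requesting_uid) {   /* check on the fd, not the path */
              close(fd);
              return -1;      /* not a file this user owns: refuse */
          }
          return fd;          /* check and use now refer to the same file */
      }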
- final logistics
  ~ I'm going to release a document of questions and an answer template document
  ~ answer template will be a text file with fill-in-the-blank areas
  ~ you'll upload that text file after filling it in to kytos
  ~ worth 15% -- same as midterm
  ~ focus on material from after the midterm
    there probably will be some coverage of topics before the midterm