- last time: virtual machines guest / physical / machine trap-and-emulate hopefully instructions that behave diff user/kernel mode cause exception exception handler emulates instruction do priv op if guest OS in pretend kernel mode forward exception if guest OS in pretend user mode shadow page tables synthesize page table from guest OS page table + hypervisor physical -> machine mapping shadow page table = virtual -> machine mapping, needed by hardware (when guest OS is running in user mode) modern VMs --- extra HW support --- - inodes - for comparison: FAT splits the data for a file across different strurctures size, modification time, etc. is in the directory entry the location of the first block of the file is in the directory entry thelocation of later blocks are in the FAT - traditional layout of disk with inodes and no block groups and no journalling: [superblock = header (approx. the Fat BPB] [free block map] [inode array (fixed # of inodes created with disk is allocated)] [actual data blocks] - traditional Unix design: most data for the file is in this "inode" one per file directory entries ONLY contain name + pointer to inode - can have multiple directory entries referring to the same inode unlike FAT, no worries about keeping them in sync link() system call sets this up - traditional way inodes specify blocks of the file w/ an array of block pointers (block pointer = # of a data blockon a the disk) some number of direct pointers --- in the inode itself (for sufficiently small files --- all we needed) K direct pointers = first K blocks o fthe file if not enough direct pointers to fit file data, then an indirect pointer to another array of blocks array of blocks was only for this file if an indirect pointer + direct pointers still wasn't enough, then an double-indirect pointer to pointer to pointers to more arrays and sometimes a triple-indirect pointer, etc. - (alternatives to block pointers we talked about might be used instead, sometimes extents (block pointer + # blocks instead of single block pointers) fragments -- sometimes replace block pointer w/ pointer to a patr of block ) - direct/indirect/... counitng data - example: 4K blocks, 8B block pointers - 10 direct blocks, 1 indirect, 1 double-indirect, 1 triple-indirect - 10 direct blocks: point to 4K each ---> 40K - 1 indirect block: point to (4KB / 8B =) 512 pointers to 4K each --> 4*512K - 1 double-indirect block point to (4KB / 8B =) 512 pointers to indirect blocks pointers --> 512 pointers to 4*512K each --> 4*(512^2)K - 1 triple-indirect block point to (4KB / 8B =) 512 pointers to double-indrirect --> 512 to 4*512^2 K --> 4*(512^3)K - redo-logging normal operation: - write our intention to log typically: intention is a set of operations that we want to happen all together example: removing a file changes direntry + inode for file + free blocks <-- failures just mean we don't do what we intended - write a "commit" to our intention to log <-- failures mean we STILL do what we intended, because we read our log and might end up doing the thing twice - do the actual thing we intended because we might do this twice on some types failures the operation MUST BE SAFE TO DO TWICE OKAY: operations like "set count to 14" NOT OKAY: operations like "increment count" - after the actual thing is done, can clear the log recovery (coming back after crash): - if commit is written the log, redo everything that was committed to "the transaction" - if commit message NOT written, then ignore that part of the log (and can delete that the part of the log) - RAID -- reading and writing data split up into sets of sectors disk 1 disk 2 disk 3 disk 4 sector A / sector B / sector C / parity = sector A XOR B XOR C can recover sector A given B, C, (A XOR B XOR C) " " " B " A, C, (A XOR B XOR C) " " " C " A, B, (A XOR B XOR C) " " " (A XOR B OR C) given A, B, C normal operation: to update sector A, we need to also update parity write to sector A becomes: read sector B read sector C compute (A XOR B XOR C) write to sector A write to parity on failure: lose sector A to read the data for sector A: read B read C read (A XOR B XOR C) compute (B XOR C) XOR (A XOR B XOR C) = A - RAID 5 versus RAID 4 RAID 4: disk 1 disk 2 disk 3 disk 4 sector A / sector B / sector C / parity ABC = sector A XOR B XOR C sector D / sector E / sector F / parity DEF write A + E --> write to disk 1 once, disk 2 once, disk 4 twice write A, then B, then C, then D, then E, then F one at a time --> write to disk 1 twice, disk 2 twice, disk 3 twice, disk 4 six times [if we were writing these all in one batch, we could do better by only writing the parity once] RAID 5: disk 1 disk 2 disk 3 disk 4 sector A / sector B / sector C / parity ABC = sector A XOR B XOR C sector D / sector E / parity DEF / sector F write A + E --> write to disk 1 once, disk 2 once, disk 3 once, disk 4 once write A, then B, then C, then D, then E, then F one at a time --> write to disk 1 twice, disk 2 twice, disk 3 four times, disk 4 four times [if we were writing these all in one batch, we could do better by only writing the parity once] - RPC and marshalling - marshalling AKA serialization is translate data into bytes - what's wrong with just copying data from memory (e.g. what you did to load structs from the FAT FS) * pointers --- want to send things over the network with pointers e.g. vectors, variable-length strings * portability --- want to transfer data between different architectures e.g. endianness e.g. different integer sizes e.g. different padding - the typical procedure for RPC libraries is that we have an "interface description language" that compiles into a class/struct definition for our data + code to convert it to/from bytes (alternative: parse existing class/structs) - RPC: high-level --- make sending messages to server look like calling a function function turns into a connect to server + send command + receive response command contains arguments + identifier for function response turns into return value - RPC can't quite look like a normal function call ~ argument/return value type need to be convertible to bytes ~ failures can happen (e.g. lost connection) ~ have some way to identify the server - what is two-phaes commit for multiple machines want to do an operation "together" and each machine has some part they must do locally e.g. update machine A's student record and machine B's course record - two-phase commit failure cases - PREPARE coordinator -> worker: "this is what I want to do" [PREPARE] worker -> coordinator: "I agree to do this" [AGREE TO COMMIT] OR "I agree to NEVER do this" [AGREE TO ABORT] - COMMIT all workers agreed to do it: doesn't matter if worker goes away after agreeing coordinator -> worker: "Everyon'e committed, do it" [COMMIT] - ABORT at least one worker agreed to NEVER do it coordinator -> worker: "It's not going to happen" [ABORT] FAILURES: - coordinator fails before recording responses from workers: resend the PREPARE message, get same response that workers might have already sent - coordinator fails after recording response from workers: coordinator's log indicates whether to COMMIT or ABORT resend COMMIT/ABORT messages - worker fails after deciding whether to AGREE TO COMMIT/ABOTR worker's log indicates decision made [gaurentees that worker doesn't change it's mind] resend AGREE-TO-COMMIT/ABORT - messages lost in network - can resend on timer if we have reason to believe that other end - coordinator should resend COMMIT/ABORT if workers resend vote - capabilities - in a true-capability-based OS, we don't have global names instead, you have some handles (like file descriptors) to access resources or get more handles for resources (e.g. directory handle gives you handles for files in directory) rather than your access being determined by user+group, its determined by what handles you hvae on this system, the login/program launching utility probably has handles that give it handles for anything, and it gives out a subset that makes sense for you hope of these sytems is that each program has only handles that make sense e.g. web browser doesn't get read all my taxes - Capiscism --- FreeBSD capability extension not a "pure" capability system because it was built on top of a Unix (couldn't ginore that underlying system still had access control lists) - last post-quiz on permissions - want to give a user write permission but only to write ASCII data - NOT A FEATURE POSIX permission support "read" "write" "execute" --> "read" and e"execute" are irrelevant "write gives too many permissoins --- can write ANY data - also not a feature file descriptors support file descrpitor is "writeable" "readable" ---> "writeable --> can write ANY data - setuid program can enforce this by doing the write on behalf of the user procedure: check which user ran me, use my set-user-ID permissions to open the file, copy the data the user sends me to my stdin to the file BUT ONLY if it's ASCII data - working set model - intuition: programs have a "working set" that they are accessing - data outside this working set doesn't need to be in memory for the program to run - consequence: can keep only ythe working sets of programs in RAM and still have our system run well - consequence: if we canNOT keep the working sets of porgrams in RAM (don't have enough space, e.g.) --- our system will run horribly ("thrashing") - consequence: least-recently used does okay for deciding what to put in RAM (assuming RAM fits all program's working sets) things in working set will usually be recently used (exception: switching between working sets) - mmap - interface: rather than using read/write, I call mmap() and get a memory region which reflects the contents of the file - if I use MAP_SHARED ---> updating this memory region updates the file and updates to file update this memory region - if I use MAP_PRIVATE --> updating this memory region gives me a private copy (and unspecified otherwise whether I get updates from the file --- compromise to make it easier to implement mmap) - implementation: OS caches data of files anyways, we just not only keep the cahce in the kernel's memory, but also add it to the user's page table some extra work: copy-on-write for MAP_PRIVATE mappings checking dirty bits to identify if file is modified making sure we update page tables if we replace file cache pages - readahead - programs often read files sequentially - would prefer not to wait for program to get to byte X of the file to read byte X from disk - solution: detect when program is reading sequentially and try to stay ahead of the program in reading the disk - typical heuristic: have a window extra stuff we've read vvvvvvvvvvv [ * ] ^^^^^ program has read have some mark *, where we say "yeah, I think the program is still reading from the file, we should read even more ahead" - compromise: reading too much ahead --- evict normal data from memory especially if program doesn't end up reading hte reset of the file reading too little ahead --- fall behhind the program is reading