Today's plan
Minix File System Implementation: in-memory data structures
- the file system process table (p. 797) includes the current
and root directories, process IDs, an array of pointers to open files,
and the state needed to suspend and later retry slow reads/writes
(e.g. on a pipe or a terminal)
- inodes on disk (p. 793) have up to
7 direct block pointers, one indirect, one double-indirect -- total, 64 bytes
- an inode block is read from disk whenever the file corresponding
to at least one of the inodes is opened
- the corresponding inode representation in memory (p. 800) has
additional references to make it easy to locate the device and the
device's superblock, whether the inode is dirty, whether the file
is special in some way (pipe, mount, etc)
- the superblock array (p. 802) stores information about each
mounted file system, particularly the sizes of the inode and zone
bitmaps. The in-memory copy of the structure also stores mount
information, including the device and the byte-order
- an open file structure (p. 799) contains a pointer to the inode,
the position being read or written, and a few other items. These
structures may be shared, e.g. by a parent and a child after a fork.
- the lock array (p. 799) contains information for each file lock
set -- this is checked every time a lock is requested
- the device array (p. 799) contains function pointers for opening,
closing, and reading/writing data, and the number of the corresponding
device task
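
The relationship between the on-disk inode and its in-memory counterpart can be sketched as follows. This is only an illustration: the field names, the fixed-width types, and the tenth (spare) zone slot are assumptions chosen so that the structure adds up to the 64 bytes mentioned above, and they are not copied from the Minix source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_DIRECT_ZONES 7    /* direct zone pointers, as in the notes */
#define NR_TOTAL_ZONES  10   /* assumed: 7 direct + indirect + double-indirect + 1 spare */

/* Hypothetical layout of a 64-byte on-disk inode. */
struct disk_inode {
    uint16_t mode;                   /* file type and permission bits */
    uint16_t nlinks;                 /* number of directory entries (links) */
    uint16_t uid;
    uint16_t gid;
    uint32_t size;                   /* file size in bytes */
    uint32_t atime;
    uint32_t mtime;
    uint32_t ctime;
    uint32_t zone[NR_TOTAL_ZONES];   /* zone[7] indirect, zone[8] double-indirect */
};

/* In-memory inode: the disk fields plus the bookkeeping listed above. */
struct mem_inode {
    struct disk_inode d;
    uint16_t dev;        /* device the inode lives on */
    uint32_t num;        /* inode number on that device */
    int count;           /* in-memory reference count */
    int dirty;           /* must be written back before reuse */
    int is_pipe;         /* special-file flags */
    int is_mount;
};

/* Number of inodes stored in one disk block of the given size. */
size_t inodes_per_block(size_t block_size) {
    assert(block_size % sizeof(struct disk_inode) == 0);
    return block_size / sizeof(struct disk_inode);
}
```

With 1 KB blocks this gives 16 inodes per block, which is why opening one file brings its neighbours' inodes into memory as well.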
Minix File System Implementation: buffer cache
- a buffer may contain (p. 798) a data page, a directory page,
an indirect block, an inode block, or a bitmap block
- most free buffers are maintained in an LRU doubly-linked list,
but some free buffers are placed at the head of the list because
they are very unlikely to be needed again (e.g. superblock buffers)
- when a new buffer is needed, it is taken from the head of the list
- when reading a block from disk (or when writing part of a block),
the LRU list is first searched to see whether the block is already
cached -- if so, there is no need to access the disk
- to make this search fast Minix uses a hash table, indexed by the low bits
of the block number
- when writing a full block to disk, the head of the LRU list
is used to store the data
- most dirty buffers on the LRU list are only written back when:
- the block reaches the head of the list, or
- another block on the same device is written back
- when writing any block to disk, Minix writes all blocks from
that device, in sorted (elevator) order
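
The combination of an LRU list and a hash table described above can be sketched in a few lines. This is a simplified model under several assumptions: there is a single device, dirty bits and write-back are left out, and the names (struct buf, disk_reads, the at_front flag) are illustrative rather than the actual Minix declarations.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUFS 4
#define NR_HASH 4              /* power of two, so the low bits index the table */
#define NO_BLOCK (-1)

struct buf {
    int blocknr;               /* block held, or NO_BLOCK */
    int count;                 /* current users; 0 means it is on the LRU list */
    struct buf *prev, *next;   /* LRU list links */
    struct buf *hash_next;     /* hash chain link */
};

static struct buf bufs[NR_BUFS];
static struct buf *front, *rear;      /* head and tail of the LRU list */
static struct buf *hash_tab[NR_HASH];
int disk_reads;                       /* counts simulated disk accesses */

static void lru_remove(struct buf *bp) {
    if (bp->prev) bp->prev->next = bp->next; else front = bp->next;
    if (bp->next) bp->next->prev = bp->prev; else rear = bp->prev;
    bp->prev = bp->next = NULL;
}

static void lru_insert(struct buf *bp, int at_front) {
    bp->prev = bp->next = NULL;
    if (front == NULL) { front = rear = bp; return; }
    if (at_front) { bp->next = front; front->prev = bp; front = bp; }
    else          { bp->prev = rear;  rear->next = bp;  rear = bp; }
}

static void hash_remove(struct buf *bp) {
    struct buf **pp;
    if (bp->blocknr == NO_BLOCK) return;          /* never held a block */
    pp = &hash_tab[bp->blocknr & (NR_HASH - 1)];
    while (*pp != bp) pp = &(*pp)->hash_next;
    *pp = bp->hash_next;
}

void cache_init(void) {
    int i;
    front = rear = NULL;
    disk_reads = 0;
    for (i = 0; i < NR_HASH; i++) hash_tab[i] = NULL;
    for (i = 0; i < NR_BUFS; i++) {
        bufs[i].blocknr = NO_BLOCK;
        bufs[i].count = 0;
        bufs[i].hash_next = NULL;
        lru_insert(&bufs[i], 0);
    }
}

/* Return the buffer for blocknr (assumed >= 0), reading it if not cached. */
struct buf *get_block(int blocknr) {
    struct buf *bp;
    int h = blocknr & (NR_HASH - 1);
    for (bp = hash_tab[h]; bp != NULL; bp = bp->hash_next) {
        if (bp->blocknr == blocknr) {             /* cache hit: no disk access */
            if (bp->count == 0) lru_remove(bp);
            bp->count++;
            return bp;
        }
    }
    bp = front;                                   /* miss: recycle the LRU head */
    assert(bp != NULL);                           /* sketch assumes a free buffer */
    lru_remove(bp);
    hash_remove(bp);
    bp->blocknr = blocknr;
    bp->hash_next = hash_tab[h];
    hash_tab[h] = bp;
    bp->count = 1;
    disk_reads++;
    return bp;
}

/* Release a buffer; at_front marks blocks unlikely to be needed again. */
void put_block(struct buf *bp, int at_front) {
    assert(bp->count > 0);
    if (--bp->count == 0) lru_insert(bp, at_front);
}
```

A block released with at_front set is the first to be recycled, while one released at the rear survives until every other free buffer has been reused.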
Minix Buffer Cache Implementation
- get_block (p. 806) searches the cache for the requested block,
returning it (line 21063) if found, and if not found,
allocates a new block (by recycling the head of the LRU, p. 812)
and, if necessary, fetches it from disk (line 21110, also p. 809,
p. 894, p. 898)
- recycling the first block on the LRU chain may require
writing back a dirty block, in which case all blocks are
written back (flushall, p. 810)
- put_block (p. 807) puts the block at the rear or at
the front of the LRU list, possibly writing it back immediately
(especially if ROBUST=1 -- see p. 798)
- alloc_zone (p. 809) and
free_zone (p. 810) manage the bitmaps for zones, reading
the bitmap from disk and saving it back to disk as appropriate
- rw_scattered (p. 810) does the actual sorting of
reads/writes and performs the I/O requests, freeing the corresponding
blocks (by calling put_block) and clearing the dirty bits
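
The "write everything for that device in sorted order" idea can be illustrated with a small sketch. It is not the code of rw_scattered: the structure, the function name flush_dev, and the use of qsort are assumptions made for the example, and the actual I/O is replaced by recording the order in which blocks would be written.

```c
#include <assert.h>
#include <stdlib.h>

struct cache_buf {
    int dev;        /* device number */
    int blocknr;    /* block number on that device */
    int dirty;      /* nonzero if the buffer must be written back */
};

static int cmp_blocknr(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Collect every dirty block of device dev, clear its dirty bit, and
 * store the block numbers in ascending (elevator) order in written[].
 * Returns the number of blocks that would be written.
 */
int flush_dev(struct cache_buf *bufs, int n, int dev,
              int *written, int max_written) {
    int i, count = 0;
    for (i = 0; i < n; i++) {
        if (bufs[i].dev == dev && bufs[i].dirty) {
            assert(count < max_written);
            written[count++] = bufs[i].blocknr;
            bufs[i].dirty = 0;
        }
    }
    qsort(written, (size_t)count, sizeof written[0], cmp_blocknr);
    return count;
}
```

Sorting by block number means the disk arm sweeps in one direction instead of seeking back and forth between unrelated blocks.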
Minix Inode and Superblock Implementation
- inodes have a link count and an in-memory reference count:
- if the reference count becomes zero, the file should be closed
- if the link count becomes zero, the file should be removed
- duplicating an inode (as in dup or dup2) simply
requires incrementing the reference count
- when creating a new inode, the block containing the inode
must be read into memory, since the other inodes on the block
might already exist (as an optimization, the bitmap could be
checked to avoid the read if all other inodes in the block are free...
but minix doesn't do this)
- inode blocks are written back immediately if ROBUST is set to
one
- to avoid calling the clock unnecessarily, the times are
updated at most once, when the inode is written back
- the Minix superblock (on disk) is written only when initializing the
RAM disk; otherwise it is read-only
- superblock management includes allocating and freeing bits
in the two bit maps
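
Allocating and freeing bits in a bitmap can be sketched as below. This is a minimal illustration, assuming a byte array as the map and a linear search from bit 0; the names alloc_bit, free_bit, and NO_BIT are used here for the example, and the signatures are not those of the Minix routines.

```c
#include <assert.h>
#include <limits.h>

#define NO_BIT (-1)

/* Find the first clear bit, set it, and return its index (or NO_BIT). */
int alloc_bit(unsigned char *map, int nbits) {
    int i;
    for (i = 0; i < nbits; i++) {
        unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));
        if (!(map[i / CHAR_BIT] & mask)) {
            map[i / CHAR_BIT] |= mask;
            return i;
        }
    }
    return NO_BIT;    /* every inode (or zone) is in use */
}

/* Clear a bit that was previously allocated. */
void free_bit(unsigned char *map, int nbits, int bit) {
    unsigned char mask;
    assert(bit >= 0 && bit < nbits);
    mask = (unsigned char)(1u << (bit % CHAR_BIT));
    assert(map[bit / CHAR_BIT] & mask);    /* freeing a free bit is a bug */
    map[bit / CHAR_BIT] &= (unsigned char)~mask;
}
```

The same pair of routines can serve both the inode bitmap and the zone bitmap, since only the map and its size differ.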
Minix Opening and Closing
- common_open (p. 835)
- may create a new file, by calling
new_node (p. 838), which allocates an inode and then
adds the name and the inode number to the directory
as a new entry
- allocates an in-memory inode and filp and initializes
them
- checks protections
- does type-specific operations for regular files, directories,
block- and character-special files, and pipes
- e.g. for regular files, truncates them (by returning all the
blocks and clearing the inode) if the open requested truncation
- mknod and mkdir do what is needed, e.g. mkdir creates the
"." and ".." links
- search_dir (p. 865) edits or searches the directory
- last_dir (p. 862) returns the inode of the last
directory in the path
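
The string-handling half of what last_dir does, separating the directory part of a path from its final component, can be sketched as follows. This is only an analogy: the real routine walks the directories and returns an inode, whereas the hypothetical split_path below works purely on strings and does not handle trailing slashes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split path into the directory that holds the final component (dir)
 * and the final component itself (name). A path without any slash is
 * taken to be relative to the current directory ".".
 */
void split_path(const char *path, char *dir, size_t dir_size,
                char *name, size_t name_size) {
    const char *slash = strrchr(path, '/');
    assert(dir_size > 0 && name_size > 0);
    if (slash == NULL) {                 /* "file" */
        snprintf(dir, dir_size, ".");
        snprintf(name, name_size, "%s", path);
    } else if (slash == path) {          /* "/file" */
        snprintf(dir, dir_size, "/");
        snprintf(name, name_size, "%s", slash + 1);
    } else {                             /* "/usr/ast/file" */
        snprintf(dir, dir_size, "%.*s", (int)(slash - path), path);
        snprintf(name, name_size, "%s", slash + 1);
    }
}
```

Once the directory is known, the final component is what would be handed to a directory search or update routine.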
Minix Reading and Writing
- reading and writing are almost the same
(read_write, p. 843):
- check protections, sizes
- find the correct block to read or write (perhaps allocating a new block)
- copy the data to or from the block
- both reading and writing only access up to the next block boundary
in a single iteration of the main loop
- rw_chunk (p. 846) does the actual (partial) block transfer,
using read_map (p. 847) to obtain the address of an existing block
- rw_chunk may have to call
new_block (p. 854) if writing to an as-yet-unwritten part of the
file
- new_block uses
write_map (p. 852) to do the hard work of allocating new zones
in an inode, including the indirect and double-indirect blocks
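
Two of the steps above lend themselves to a short sketch: splitting a transfer at block boundaries, and deciding which kind of pointer covers a given block of the file. The constants are assumptions (1 KB blocks, 256 zone numbers per indirect block, and a zone equal to one block), and split_chunks and zone_level are hypothetical helpers, not Minix functions.

```c
#include <assert.h>
#include <stddef.h>

#define BLK_SIZE       1024UL
#define NR_DIRECT      7UL
#define PTRS_PER_BLOCK 256UL   /* assumed: 1 KB block / 4-byte zone numbers */

enum level { DIRECT, SINGLE_INDIRECT, DOUBLE_INDIRECT, TOO_BIG };

/* Which pointer in the inode leads to block number file_block of a file? */
enum level zone_level(unsigned long file_block) {
    if (file_block < NR_DIRECT) return DIRECT;
    file_block -= NR_DIRECT;
    if (file_block < PTRS_PER_BLOCK) return SINGLE_INDIRECT;
    file_block -= PTRS_PER_BLOCK;
    if (file_block < PTRS_PER_BLOCK * PTRS_PER_BLOCK) return DOUBLE_INDIRECT;
    return TOO_BIG;
}

/*
 * Split a transfer of nbytes starting at file position pos into chunks
 * that never cross a block boundary, one chunk per loop iteration.
 * Chunk sizes are stored in chunks[]; the number of chunks is returned.
 */
size_t split_chunks(unsigned long pos, unsigned long nbytes,
                    unsigned long *chunks, size_t max_chunks) {
    size_t n = 0;
    while (nbytes > 0) {
        unsigned long off = pos % BLK_SIZE;       /* offset within the block */
        unsigned long chunk = BLK_SIZE - off;     /* room up to the boundary */
        if (chunk > nbytes) chunk = nbytes;
        assert(n < max_chunks);
        chunks[n++] = chunk;
        pos += chunk;
        nbytes -= chunk;
    }
    return n;
}
```

For example, a 3000-byte transfer starting at position 1000 is handled as four chunks of 24, 1024, 1024, and 928 bytes.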
Minix Pipes
- pipes are treated almost like files, except:
- the maximum size is limited to PIPE_SIZE (7 blocks, 7KB)
- when the file has been read completely, the write position is
reset to the beginning (line 23570, p. 845)
- on a read, wake up any sleeping writers (check_pipe, p. 858, line 24413)
- on a write, wake up any sleeping readers
- the "file" has an inode and a filp, but no directory entries
(usually -- a "named pipe" might have a directory entry)
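
The position handling for pipes can be sketched as below. This is a simplified model, not the Minix code: the data lives in a plain array rather than in cache blocks, a short return value from pipe_write stands in for "the writer would be suspended", and the structure and function names are illustrative.

```c
#include <assert.h>
#include <string.h>

#define PIPE_SIZE 7168   /* 7 blocks of 1 KB, as in the notes */

struct pipe {
    char data[PIPE_SIZE];
    size_t rpos;         /* next byte to read */
    size_t wpos;         /* next byte to write (the current "file size") */
};

/* Write up to n bytes; returning fewer means the writer would have to wait. */
size_t pipe_write(struct pipe *p, const char *src, size_t n) {
    size_t room = PIPE_SIZE - p->wpos;
    assert(p->rpos <= p->wpos);
    if (n > room) n = room;
    memcpy(p->data + p->wpos, src, n);
    p->wpos += n;
    return n;
}

/* Read up to n bytes; returning 0 means the reader would have to wait. */
size_t pipe_read(struct pipe *p, char *dst, size_t n) {
    size_t avail = p->wpos - p->rpos;
    if (n > avail) n = avail;
    memcpy(dst, p->data + p->rpos, n);
    p->rpos += n;
    if (p->rpos == p->wpos) {
        p->rpos = p->wpos = 0;   /* pipe drained: start over at the beginning */
    }
    return n;
}
```

Resetting both positions once the pipe is empty is what lets a file of bounded size carry an unbounded stream of data.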
Minix Linking and Unlinking
- linking adds a new link to an existing file
- unlinking removes an existing link to a file or directory
- rmdir is a slightly safer (more error checking) version of unlink
- when unlinking a directory, must remove the "." and ".." entries
and update the parent's link count
- renaming is almost like link followed by unlink, but slightly
optimized to still work even when the disk is full
- renaming also allows renaming directories, whereas linking directories
is only allowed to the superuser
- linking directories is dangerous because it introduces the risk
of having loops in the file system hierarchy -- this could lead to
infinite loops in the file system, e.g. line 25598 on p. 874
Minix FS system call retry
- any process making a "slow" system call (read or write on a pipe,
read on a terminal) may be suspended, in which case the system call
must be retried later
- suspend (p. 858) copies the system call number and parameters to
the process table, and avoids replying to the caller
- release (p. 858) calls revive for any process waiting on a given
system call (e.g. writing) on a given pipe
- revive (p. 858) either sets a flag (line 24542), or directly
replies to the suspended process (lines 24547 and 24551)
- the main loop of the file system process calls get_work (p. 829),
which checks to see if any processes are being revived, and only if
none are being revived, then calls receive
- if processes are being revived, the system call number and arguments
are taken from the process table rather than from the message received
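
The suspend/revive/get_work cycle can be modelled in a few lines. This sketch makes several simplifying assumptions: a system call is reduced to a number and one argument, revive only sets the flag (the direct-reply path is omitted), and get_work returns 0 where the real server would block in receive. The field names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PROCS 8

struct fproc {
    int suspended;   /* nonzero while a slow call is pending */
    int revived;     /* set by revive(), cleared by get_work() */
    int call_nr;     /* saved system call number */
    int arg;         /* saved argument (a single one, for brevity) */
};

static struct fproc fproc[NR_PROCS];
static int reviving;   /* number of processes waiting to be retried */

/* Remember the call; sending no reply keeps the caller blocked. */
void suspend(int proc, int call_nr, int arg) {
    assert(proc >= 0 && proc < NR_PROCS);
    fproc[proc].suspended = 1;
    fproc[proc].revived = 0;
    fproc[proc].call_nr = call_nr;
    fproc[proc].arg = arg;
}

/* Mark a suspended process so that its call is retried. */
void revive(int proc) {
    assert(proc >= 0 && proc < NR_PROCS);
    if (!fproc[proc].suspended || fproc[proc].revived) return;
    fproc[proc].revived = 1;
    reviving++;
}

/*
 * Return 1 and the saved call of a revived process if there is one;
 * return 0 when the main loop should wait for a new message instead.
 */
int get_work(int *proc, int *call_nr, int *arg) {
    int i;
    if (reviving == 0) return 0;
    for (i = 0; i < NR_PROCS; i++) {
        if (fproc[i].revived) {
            fproc[i].revived = 0;
            fproc[i].suspended = 0;
            reviving--;
            *proc = i;
            *call_nr = fproc[i].call_nr;
            *arg = fproc[i].arg;
            return 1;
        }
    }
    return 0;
}
```

Because the retried call comes from the process table rather than from a message, the rest of the server can handle it exactly like a fresh request.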
Minix FS miscellaneous calls
- call_task (p. 898) tries to send a message to a task, and is
prepared to receive a completely unrelated response
- fetch_name (p. 900) takes an argument name either from the message
(if it is short), or by copying bytes from user space (for a long argument) --
the library must of course set up the message accordingly
- conv2 and conv4 (p. 902) do byte-swapping if the byte ordering on
the disk is not the same as the byte ordering of the machine -- this may
also be needed for networking
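
Byte swapping of 2-byte and 4-byte values can be sketched as follows. The function names follow the notes, but the signatures are assumptions: here the first parameter is a flag that is nonzero when the disk and the machine disagree on byte order, which may not match the convention of the actual Minix routines.

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of w if swap is nonzero. */
uint16_t conv2(int swap, uint16_t w) {
    if (!swap) return w;
    return (uint16_t)((w << 8) | (w >> 8));
}

/* Reverse the four bytes of x if swap is nonzero. */
uint32_t conv4(int swap, uint32_t x) {
    if (!swap) return x;
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* Swapping twice must give back the original value. */
void conv_self_check(void) {
    assert(conv2(1, conv2(1, 0x1234)) == 0x1234);
    assert(conv4(1, conv4(1, 0x12345678u)) == 0x12345678u);
}
```

The swap flag would be derived once per mounted file system, from the byte order recorded in the in-memory superblock, and then applied to every multi-byte field read from or written to that disk.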