Today's plan
- SELinux
- Minix FS implementation
Project 5
- the project measured average signal latency: the time it takes for
a signal handler to be called
- this latency varies a lot depending on what else was happening
in the system:
- signals are low priority compared to interrupt handlers
- a running process will probably run until the end of its
quantum before the signal is delivered to a process that is suspended
(system-dependent, of course)
- in fact, all processes in the ready queue will probably run
until the end of their quantum, otherwise a process could artificially
increase its priority by setting alarms
- did not measure the total cost of signal handling, since
returning from a signal handler also takes time
- did not measure the accuracy of the hardware clock, since the
hardware clock itself was being used as the reference
- few OSs try hard to optimize signal handling latency
- alarm precision is at best 1 tick, so that may cause a small
amount of inaccuracy (tick on Linux 2.6 is 1ms, but as long as the
alarm is reset on the same tick, there is no loss of accuracy)
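A minimal sketch of this kind of measurement, in Python rather than whatever language the project used. It arms a one-shot interval timer and records how much later than requested the SIGALRM handler actually runs. The function name, sample count, and delay are illustrative assumptions. The numbers include Python interpreter overhead, so they overstate what a C program would see.

```python
import signal
import time


def measure_signal_latency(samples=5, delay=0.01):
    """Return (observed - requested) delays in seconds for SIGALRM delivery."""
    fired_at = []

    def handler(signum, frame):
        # Record the moment the handler actually starts running.
        fired_at.append(time.monotonic())

    old_handler = signal.signal(signal.SIGALRM, handler)
    latencies = []
    try:
        for _ in range(samples):
            fired_at.clear()
            armed_at = time.monotonic()
            signal.setitimer(signal.ITIMER_REAL, delay)  # one-shot timer
            give_up = armed_at + delay + 2.0             # never wait forever
            while not fired_at and time.monotonic() < give_up:
                time.sleep(0.0005)
            if fired_at:
                latencies.append(fired_at[0] - armed_at - delay)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)          # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)       # restore old handler
    return latencies


if __name__ == "__main__":
    results = measure_signal_latency()
    print("average latency (s):", sum(results) / len(results))
```

As in the project, this measures only the time until the handler starts, not the cost of returning from it, and it uses the system clock as its own reference.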
Security-Enhanced Linux
- homework: read article at
http://www.linuxjournal.com/article.php?sid=6837, and look at
http://acl.bestbits.at/man/man.shtml.
- developed by the NSA and released under the GPL
- similar features in TrustedBSD
- Mandatory Access Control (MAC): users cannot give access to their
files if it is in violation of the policy
- each program runs with the minimum privilege it needs
- modified programs include login, cron, logrotate, ps, ls
- domain is used almost as above (the collection of things a program
can and cannot do), for example the user_t domain (for most users),
the sysadm_t domain (for most system administration),
the init_t domain (for init), and
the passwd_t domain (for the passwd program run by a user)
- role specifies what domains can be used. Examples include
user_r and sysadm_r, and any user with
user_r is allowed to enter domain passwd_t to run the
passwd command. This rule must be specified in the policy
configuration file.
- a setuid program cannot be executed by a user unless allowed
by policy
- the SELinux user identity is not changed by the su command
(but su still changes the UID) -- roles can be changed by
newrole, but only if permitted by policy
- each object has a type, e.g. type user_home_t
for users' files in /usr/home, or type tmp_t for files in /tmp,
or user_tmp_t for files in /tmp created by users.
Types can be inherited from the parent directory.
Domains are types for processes.
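The relationship between roles, domains, and types can be sketched as two lookup tables. This is a toy model, not SELinux's policy language: the table contents, the function names, and the operation names are all assumptions made for illustration. Only the role, domain, and type names come from these notes.

```python
# Hypothetical policy tables; real SELinux policy is written in .te files.

# Which domains a role is allowed to enter.
ROLE_DOMAINS = {
    "user_r": {"user_t", "passwd_t"},
    "sysadm_r": {"sysadm_t", "passwd_t"},
}

# (domain, object type) -> permitted operations; entries are illustrative.
ALLOW = {
    ("user_t", "user_home_t"): {"read", "write"},
    ("user_t", "user_tmp_t"): {"read", "write"},
    ("passwd_t", "shadow_t"): {"read", "write"},
}


def may_enter(role, domain):
    """True if the policy lets a process with this role enter the domain."""
    return domain in ROLE_DOMAINS.get(role, set())


def may_access(domain, obj_type, operation):
    """True if a process in this domain may perform the operation on objects
    of this type. Anything not explicitly allowed is denied."""
    return operation in ALLOW.get((domain, obj_type), set())
```

In this model a user with role user_r cannot touch shadow_t directly from user_t, but can enter passwd_t, which is allowed to. That is the pattern the passwd example describes.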
SELinux Policies
- each daemon has its own policy, e.g. apache.te,
which can be edited by the system administrator and compiled
to a policy database loaded at boot time.
- spasswd runs
passwd in the correct domain and only for the Unix user ID
corresponding to your SELinux identity
- sadminpasswd allows changing other users' passwords,
but only by a process with role sysadm_r in
domain sysadm_t
- permissive mode allows testing a system, with output to the
logs -- enforcing mode then actually enforces the rules
- the security module in Linux gets the arguments to selected
system calls and is allowed to fail them for permission denied,
with detailed information going to the logs
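The difference between permissive and enforcing mode can be sketched as a single decision function. This is a simplified illustration, not the kernel's interface: the function name, its parameters, and the log message format are assumptions.

```python
import errno


def security_hook(enforcing, permitted, description, log):
    """Decide the outcome of one checked system call.

    Returns 0 to let the call proceed, or -EACCES to fail it.
    A denial is always logged; it is only enforced in enforcing mode.
    """
    if permitted:
        return 0
    log.append("denied: " + description)   # hypothetical log format
    return -errno.EACCES if enforcing else 0


if __name__ == "__main__":
    log = []
    print(security_hook(False, False, "user_t read shadow_t", log))  # permissive
    print(security_hook(True, False, "user_t read shadow_t", log))   # enforcing
    print(log)
```

Permissive mode is useful for testing because the log shows everything the policy would have blocked, without breaking the running system.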
SELinux File System Labels
- each file in an SELinux file system is labeled (perhaps logically)
with its SELinux user identity, role, and type, e.g.
system_u:object_r:shadow_t for the /etc/shadow file
- the labels may be:
- stored in the extended attributes of a file under ext2,
ext3, XFS, ReiserFS
- assigned when the volume is mounted (and only stored in memory,
not on the filesystem disk), or
- default context assigned by the security policy
- labels (contexts) are assigned automatically on file creation
(based on a specific security policy or the parent directory's context),
or manually with the chcon command or the setfilecon library call
- privileged processes may change the security context given to the
files they subsequently create by writing to /proc/self/attr/fscreate
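A label such as system_u:object_r:shadow_t is just three colon-separated fields, and the rule for labeling a new file can be modeled as "use a policy rule if one matches, otherwise inherit the parent directory's type". The sketch below is a toy model: the function names and the single transition rule are assumptions, not real policy.

```python
from typing import NamedTuple


class Context(NamedTuple):
    user: str
    role: str
    type: str


def parse_context(label):
    """Split 'user:role:type' into its fields (extra fields are ignored)."""
    parts = label.split(":")
    if len(parts) < 3:
        raise ValueError("expected user:role:type, got %r" % label)
    return Context(parts[0], parts[1], parts[2])


# Hypothetical rule: files created by user_t in a tmp_t directory get user_tmp_t.
TRANSITIONS = {("user_t", "tmp_t"): "user_tmp_t"}


def type_for_new_file(creator_domain, parent_type, transitions=TRANSITIONS):
    """Policy rule if one matches, else inherit the parent directory's type."""
    return transitions.get((creator_domain, parent_type), parent_type)


if __name__ == "__main__":
    print(parse_context("system_u:object_r:shadow_t"))
    print(type_for_new_file("user_t", "tmp_t"))
    print(type_for_new_file("user_t", "user_home_t"))
```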
Linux extended attributes
- mostly from attr(5), from
http://acl.bestbits.at/man/man.shtml.
- each file in most Linux file systems may support extended
attributes, which can be used, among other things, for security
- the regular attributes are the ones supported by stat(2)
- the extended attributes are available as namespace.attribute,
e.g. user.mime_type, trusted.md5sum,
system.posix_acl_access, or security.selinux
- user attributes can be used
to identify arbitrary additional user
information, e.g. the mime type of the file
- trusted attributes are only available to
the sysadmin (or any other user with
capability CAP_SYS_ADMIN)
- system attributes are used by the kernel,
e.g. to store access control lists
and capabilities
- in ext2/ext3, extended attributes
must fit into a single file system block (1K, 2K, or 4K) -- so
how are they implemented?
- in XFS, extended attributes have no size limit -- so how are
they implemented?
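Extended attributes can be tried out from user space. The sketch below splits a name into namespace and attribute, then attempts to set and read back user.mime_type on a temporary file using Python's os.setxattr and os.getxattr, which are available on Linux. The helper names are assumptions. The round trip returns None where the platform or the file system holding the temporary directory does not support user attributes.

```python
import os
import tempfile


def split_xattr_name(name):
    """Split 'namespace.attribute' at the first dot."""
    namespace, _, attribute = name.partition(".")
    return namespace, attribute


def try_user_xattr(value=b"text/plain"):
    """Set and read back user.mime_type on a temp file.

    Returns the stored bytes, or None if extended attributes are unavailable.
    """
    if not hasattr(os, "setxattr"):        # not a Linux build of Python
        return None
    with tempfile.NamedTemporaryFile() as f:
        try:
            os.setxattr(f.name, "user.mime_type", value)
            return os.getxattr(f.name, "user.mime_type")
        except OSError:                    # file system refuses user.* attributes
            return None


if __name__ == "__main__":
    print(split_xattr_name("system.posix_acl_access"))
    print(try_user_xattr())
```

Attributes in the trusted and security namespaces need extra privilege to write, so an unprivileged run of the same code with those names would normally fail with a permission error.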
Minix File System Implementation: in-memory data structures
- the file system process table (p. 797) includes the current
and root directories, process IDs, an array of pointers to open files,
and the state needed to suspend and later resume slow reads/writes
(e.g. on a pipe or a terminal)
- inodes on disk (p. 793) have up to
7 direct block pointers, one indirect, and one double-indirect, for a
total of 64 bytes per inode
- an inode block is read from disk whenever the file corresponding
to at least one of the inodes is opened
- the corresponding inode representation in memory (p. 800) has
additional references to make it easy to locate the device and the
device's superblock, whether the inode is dirty, whether the file
is special in some way (pipe, mount, etc)
- the superblock array (p. 802) stores information about each
mounted file system, particularly the sizes of the inode and zone
bitmaps. The in-memory copy of the structure also stores mount
information, including the device and the byte order
- an open file structure (p. 799) contains a pointer to the inode,
the position being read or written, and a few other items. These
structures may be shared, e.g. by a parent and a child after a fork.
- the lock array (p. 799) contains information for each file lock
set -- this is checked every time a lock is requested
- the device array (p. 799) contains function pointers for opening,
closing, and reading/writing data, and the number of the corresponding
device task
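The inode layout above fixes how large a file can get. The worked arithmetic below assumes 1 KB blocks, 4-byte zone numbers, and one block per zone. None of these is stated in these notes, so treat them as assumptions. Only the 7 direct pointers, the two indirect levels, and the 64-byte inode size come from the text.

```python
def max_file_blocks(direct=7, block_size=1024, zone_ptr_size=4):
    """Blocks reachable from one inode: direct + indirect + double-indirect."""
    ptrs_per_block = block_size // zone_ptr_size   # 256 under the assumptions
    return direct + ptrs_per_block + ptrs_per_block ** 2


def inodes_per_block(block_size=1024, inode_size=64):
    """How many on-disk inodes fit in one inode block."""
    return block_size // inode_size


if __name__ == "__main__":
    blocks = max_file_blocks()
    print(blocks)                 # 7 + 256 + 65536 = 65799
    print(blocks * 1024)          # 67378176 bytes, a little over 64 MB
    print(inodes_per_block())     # 16
```

The last figure explains why opening one file brings neighboring inodes into memory: a whole inode block is read, and under these assumptions it holds 16 inodes.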
Minix File System Implementation: buffer cache
- a buffer may contain (p. 798) a data page, a directory page,
an indirect block, an inode block, or a bitmap block
- most free buffers are maintained in an LRU doubly-linked list,
but some free buffers are placed at the head of the list because
they are very unlikely to be needed again (e.g. superblock buffers)
- when a new buffer is needed, it is taken from the head of the list
- when reading a block from disk (or when writing part of a block),
the cache is first searched to see whether the block is already
buffered -- if so, there is no need to access the disk
- to make this search fast, Minix uses a hash table indexed by the low bits
of the block number
- when writing a full block to disk, the head of the LRU list
is used to store the data
- most dirty buffers on the LRU list are only written back when:
- the block reaches the head of the list, or
- another block on the same device is written back
- when writing any block to disk, Minix writes all dirty blocks from
that device, in sorted (elevator) order
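The lookup and recycling behavior described above can be sketched as a small cache with hash chains and an LRU list. This is a simplified model, not the Minix code: the class and method names, the tiny sizes, and the dict standing in for a disk are assumptions. Unlike Minix, it keeps in-use buffers on the LRU list and writes back only the evicted block.

```python
from dataclasses import dataclass


@dataclass(eq=False)              # compare buffers by identity, not contents
class Buf:
    block: int
    data: bytes
    dirty: bool = False


class BufferCache:
    def __init__(self, disk, nr_bufs=4, nr_hash=8):
        self.disk = disk                      # dict: block number -> bytes
        self.nr_bufs = nr_bufs
        self.mask = nr_hash - 1               # nr_hash must be a power of two
        self.chains = [[] for _ in range(nr_hash)]
        self.lru = []                         # index 0 = head (next victim)
        self.disk_reads = 0

    def _find(self, block):
        # Hash on the low bits of the block number, then walk the chain.
        for buf in self.chains[block & self.mask]:
            if buf.block == block:
                return buf
        return None

    def get_block(self, block):
        buf = self._find(block)
        if buf is not None:                   # hit: no disk access needed
            self.lru.remove(buf)
            self.lru.append(buf)              # most recently used goes to the rear
            return buf
        if len(self.lru) >= self.nr_bufs:     # miss with a full cache
            victim = self.lru.pop(0)          # recycle the head of the list
            self.chains[victim.block & self.mask].remove(victim)
            if victim.dirty:
                self.disk[victim.block] = victim.data
        buf = Buf(block, self.disk[block])
        self.disk_reads += 1
        self.chains[block & self.mask].append(buf)
        self.lru.append(buf)
        return buf
```

The hash table makes the hit test cheap, while the LRU list decides which buffer to give up on a miss. The two structures index the same buffers for different purposes.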
Minix Buffer Cache Implementation
- get_block (p. 806) searches for the requested block,
returning its buffer (line 21063) if found, and if not found,
allocates a new buffer (by recycling the head of the LRU list, p. 812)
and, if necessary, fetches the block from disk (line 21110, also p. 809,
p. 894, p. 898)
- recycling the first block on the LRU chain may require
writing back a dirty block, in which case all dirty blocks on
the same device are written back (flushall, p. 810)
- put_block (p. 807) puts the block at the rear or at
the front of the LRU list, possibly writing it back immediately
(especially if ROBUST=1 -- see p. 798)
- alloc_zone (p. 809) and
free_zone (p. 810) manage the bitmaps for zones, reading
the bitmap from disk and saving it back to disk as appropriate
- rw_scattered (p. 810) does the actual sorting of
reads/writes and performs the I/O requests, freeing the corresponding
blocks (by calling put_block) and clearing the dirty bits
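Two of the routines above are easy to model in a few lines: returning a buffer to the front or rear of the LRU list, and flushing one device's dirty buffers in sorted order. The sketch below borrows the name put_block but is not the Minix code. The dict-based buffers, the flush_device name, and the write_block callback are assumptions.

```python
def put_block(lru, buf, needed_again_soon=True):
    """Return a buffer to the LRU list.

    Rear of the list if it is likely to be reused, front (next to be
    recycled) if it is unlikely to be needed again.
    """
    if needed_again_soon:
        lru.append(buf)
    else:
        lru.insert(0, buf)


def flush_device(buffers, dev, write_block):
    """Write every dirty buffer of one device in ascending block order.

    Clears each dirty bit after the write and returns the block numbers
    in the order they were written.
    """
    dirty = [b for b in buffers if b["dev"] == dev and b["dirty"]]
    dirty.sort(key=lambda b: b["block"])       # elevator order
    for b in dirty:
        write_block(dev, b["block"], b["data"])
        b["dirty"] = False
    return [b["block"] for b in dirty]
```

Sorting before writing means the disk arm sweeps in one direction instead of seeking back and forth, which is the point of writing a whole device's dirty blocks together.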