Reliable FS

Threats to FS Reliability

Operation interruption

  • A crash or power failure
  • A file operation often consists of many separate I/O updates to the storage device
  • An example: mv ./dir1/file1 ./dir2/file2
    • Writing the dir1 directory file to remove file1
    • (optional) Growing the dir2 directory’s file to include another block of storage to accommodate a new directory entry for ‘file2’
    • Writing the new directory entry to the directory file
    • Updating the last-modified time of the dir1 directory
    • Updating the file system’s free space bitmap
    • Updating the size and last-modified time of the dir2 directory
  • At the physical level, these updates complete one at a time, so a crash can leave any prefix of them applied.

Loss of stored data, from either physical damage or electrical failure.

Reliability vs. Availability

  • Reliability: the probability that the storage system will continue to work correctly for some specified period of time.
  • Availability: the probability that the storage system will be available at any given time.

What Does a Reliable FS Do?

  • “All or nothing”
    • Either an update completes in its entirety, or it does not happen at all
    • Must be guaranteed whenever a crash happens
    • Must be transparent to users/apps
  • Quite similar to the critical section problem in concurrency
    • Avoid anyone observing the system in an intermediate, inconsistent state
    • No control over “when it happens”

Reliability Approach 1

Careful ordering

  • Sequence operations in a specific order
    • Careful design to allow sequence to be interrupted safely
  • Post-crash recovery
    • Read data structures to see if there were any operations in progress
    • Clean up/finish as needed
  • Approach taken by
    • FAT and FFS (fsck) to protect filesystem structure/metadata
    • Many app-level recovery schemes (e.g., Word and Emacs autosaves)

Issues with approach 1

  • Complex reasoning
    • So many possible operations and failures
  • Slow updates
    • File systems are forced to insert sync operations or barriers between dependent operations
  • Extremely slow recovery
    • Need to scan all of the disk's metadata structures for inconsistencies

Transactions

Def. A transaction is an atomic sequence of actions (reads/writes) on a storage system (or database) that takes it from one consistent state to another.

  • Use transactions for atomic updates
    • Ensure that multiple related updates are performed atomically
    • i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
    • Most modern file systems use transactions internally to update filesystem structures and metadata
    • Many applications implement their own transactions
  • They extend the concept of atomic update from memory to stable storage
    • Atomically update multiple persistent data structures

Typical Structure

  1. Begin a transaction - get transaction id.
  2. Do a bunch of updates
    • If any fail along the way, roll-back
    • Or, if any conflicts with other transactions, roll-back.
  3. Commit the transaction (a minimal sketch of this pattern follows)
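
A minimal sketch of this structure in C. The txn_* interface is hypothetical, invented purely for illustration, and the block numbers and contents are placeholders; no real file system exposes exactly this API.

```c
/* Hypothetical transaction interface -- illustrative only. */
typedef struct txn txn_t;

txn_t *txn_begin(void);                                    /* step 1: get a transaction id */
int    txn_write(txn_t *t, long block, const void *data);  /* step 2: stage an update      */
int    txn_commit(txn_t *t);                               /* step 3: make updates durable */
void   txn_rollback(txn_t *t);                             /* undo everything on failure   */

/* mv ./dir1/file1 ./dir2/file2 as one atomic transaction: either all
 * three blocks are updated, or none of them is. */
int move_file_atomically(void)
{
    txn_t *t = txn_begin();
    if (txn_write(t, 100, "dir1 without file1") < 0 ||     /* placeholder block numbers */
        txn_write(t, 200, "dir2 with file2") < 0 ||
        txn_write(t, 300, "updated free-space bitmap") < 0) {
        txn_rollback(t);   /* any failure or conflict: roll back, nothing is visible */
        return -1;
    }
    return txn_commit(t);  /* after commit returns, the updates survive a crash */
}
```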

The Key Properties of Transactions

  • Atomicity: all actions in the transaction happen, or none happen
  • Consistency: transactions maintain data integrity, e.g.
    • Balance cannot be negative
    • Cannot reschedule a meeting to February 30
  • Isolation: execution of one transaction is isolated from that of all others; no problem from concurrency
  • Durability: if a transaction commits, its effects persist despite crashes.

Logging

  • Instead of modifying data structures on disk directly, write changes to a journal/log
    • Intention list: set of changes we intend to make
    • Log/journal is append-only
    • Single commit record commits transaction
  • Once changes are in log, it is safe to apply changes to data structures on disk
    • Recovery can read log to see what changes were intended
    • Can take our time making the changes
      • As long as new requests consult the log first
  • Basic assumption
    • Updates to sectors are atomic and ordered

Def. A log is an append-only file containing log records (one possible record encoding is sketched after this list)

  • <start t>: transaction t has begun
  • <t, x, v>: transaction t has updated block x and its new value is v
    • Can log block “diffs” instead of full blocks
    • Can log operations instead of data
  • <commit t>: transaction t has committed - updates will survive a crash
  • Committing involves writing the records - the home data needn’t be updated at this time
  • Logs are often kept in a separate partition
  • Once transactions are committed, logs can be cleaned up
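
One possible encoding of these three record types in C. The field sizes here are illustrative assumptions; real journals (e.g., Linux's JBD2) use different formats.

```c
#include <stdint.h>

enum rec_type { REC_START, REC_UPDATE, REC_COMMIT };

/* Illustrative log record: <start t>, <t, x, v>, or <commit t>. */
struct log_record {
    uint32_t type;            /* REC_START, REC_UPDATE, or REC_COMMIT        */
    uint32_t txn_id;          /* t: the transaction this record belongs to   */
    uint64_t block_no;        /* x: target block (meaningful for REC_UPDATE) */
    uint8_t  new_value[512];  /* v: new block contents; a diff or a logical
                                 operation could be logged here instead      */
};
```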

Implementing Transactions: Redo Logging

  1. Prepare
    • Write all changes/updates to log
    • Can happen at once, or over time
    • Wait until all updates are written in log
  2. Commit
    • Append a commit record to the log
    • Or, to abandon the transaction, write a roll-back record instead
    • Appending the commit record is an atomic operation
      • Before it, we can safely roll back
      • After it, the transaction must take effect
  3. Write-back
    • Write all of the transaction’s updates to disk
  4. Garbage collection
    • Reclaim space in log
  5. Recovery (sketched after this list)
    • Read log
    • Redo any operations for committed transactions
    • Garbage collect log
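
A sketch of the recovery pass in C, reusing the illustrative log_record layout above. read_record, rewind_log, and write_block are assumed helpers, and a real recovery pass would also validate records with checksums.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TXNS 1024   /* assumption: bounded number of concurrent transactions */

bool read_record(int log_fd, struct log_record *r);        /* false at end of log  */
void rewind_log(int log_fd);                               /* back to first record */
void write_block(uint64_t block_no, const uint8_t *data);  /* write-back to home   */

void recover(int log_fd)
{
    struct log_record r;
    bool committed[MAX_TXNS] = { false };

    /* Pass 1: find every transaction that has a <commit t> record. */
    while (read_record(log_fd, &r))
        if (r.type == REC_COMMIT)
            committed[r.txn_id] = true;

    /* Pass 2: redo only the updates of committed transactions. Repeating
     * this after another crash is safe because records are idempotent. */
    rewind_log(log_fd);
    while (read_record(log_fd, &r))
        if (r.type == REC_UPDATE && committed[r.txn_id])
            write_block(r.block_no, r.new_value);
}
```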

Implementation Details

  • Deal with concurrent transactions
    • Must identify which transaction each record belongs to
  • Repeated write-backs are OK
    • Works for idempotent updates: “write 42 to each byte of sector 74”
    • Redo log systems do not permit non-idempotent records such as “add 42 to each byte in sector 74”.
  • Restarting recovery is OK
    • If another crash occurs during recovery
  • The performance of redo logging is not as bad as it looks:
    • Log updates are sequential
    • Asynchronous write-back
      • Low latency for commit(); high throughput as updates can be batched
    • Group commit: combine a set of transaction commits into one log write
      • Amortize the cost of initiating the write (e.g., seek and rotational delays)
  • New requests (e.g., reads) need to consult the log first to ensure data consistency
    • Can be alleviated by caching
  • Ordering is essential, as we must ensure (see the fsync sketch below):
    • A transaction’s updates are on disk in the log before the commit is
    • The commit is on disk before any of the write-backs are
    • All of the write-backs are on disk before a transaction’s log records are garbage collected.
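
A minimal sketch of how these three constraints can be enforced with explicit flushes, again reusing the illustrative log_record layout. The file descriptors, the 512-byte block size, and the simple write loop are all assumptions; real journals batch and checksum these writes.

```c
#include <unistd.h>

void commit_and_write_back(int log_fd, int disk_fd,
                           const struct log_record *updates, int n,
                           const struct log_record *commit_rec)
{
    /* (1) All of the transaction's updates reach the log... */
    for (int i = 0; i < n; i++)
        write(log_fd, &updates[i], sizeof updates[i]);
    fsync(log_fd);

    /* ...before the commit record does. (2) The commit is durable... */
    write(log_fd, commit_rec, sizeof *commit_rec);
    fsync(log_fd);

    /* ...before any write-back begins. */
    for (int i = 0; i < n; i++)
        pwrite(disk_fd, updates[i].new_value, sizeof updates[i].new_value,
               (off_t)updates[i].block_no * 512);

    /* (3) All write-backs are durable before the log space is reclaimed. */
    fsync(disk_fd);
    /* safe to garbage-collect this transaction's log records now */
}
```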

Transactional File Systems

  • Two ways to use transactions in file systems: journaling and logging
    • Journaling: apply updates to the system’s metadata via transactions
      • Microsoft’s NTFS, Apple’s HFS+, and Linux’s XFS/JFS
    • (Full) Logging: apply both metadata and data in transactions
      • Linux’s ext3 and ext4 can be configured to use either journaling or logging

Journaling File Systems

  • Applies updates to system metadata (inodes, bitmaps, directories, and indirect blocks) using transactions
    • So those critical data structures are always consistent
  • Updates to non-directory files (i.e., user data) can be done in place (without logging); full data logging is optional
    • Avoids writing file contents twice
    • If a program using a journaling file system requires atomic multi-block updates, it needs to provide them itself

Copy-on-Write File System

  • To update the file system, write a new version of the file system containing the update (sketched after this list)
    • Never update in place
    • Reuse existing unchanged disk blocks
  • Optimization: batch updates
    • Transform many small, random writes into large, sequential writes
  • Approach taken in network file server appliances
    • NetApp’s Write Anywhere File Layout (WAFL)
    • ZFS (Sun/Oracle) and OpenZFS
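
A sketch of the core COW step for a simple tree of blocks. All of the helpers (read_block, write_new_block, child_slot) and the tree layout are assumptions for illustration; WAFL and ZFS are far more elaborate.

```c
#include <stdint.h>

#define PTRS_PER_BLOCK 128   /* assumption: pointers per indirect block */

void     read_block(uint64_t block_no, uint64_t ptrs[]);  /* load an indirect block        */
uint64_t write_new_block(const void *data);               /* allocate + write a new block  */
int      child_slot(uint64_t leaf_no, int depth);         /* which child leads to the leaf */

/* Never update in place: write a new copy of the changed leaf and of each
 * block on the path to the root; every unchanged block is shared as-is.
 * Installing the returned root pointer is the single atomic commit point. */
uint64_t cow_update(uint64_t old_node, int depth,
                    uint64_t leaf_no, const void *new_data)
{
    if (depth == 0)
        return write_new_block(new_data);       /* new leaf; old leaf stays intact */

    uint64_t children[PTRS_PER_BLOCK];
    read_block(old_node, children);             /* start from the old pointers */

    int slot = child_slot(leaf_no, depth);
    children[slot] = cow_update(children[slot], depth - 1, leaf_no, new_data);

    return write_new_block(children);           /* new interior node; root written last */
}
```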

RAID: Redundant Arrays of Inexpensive Disks

Storage Devices Failure

  • Sector and page failure: one or more individual sectors of a disk are lost, but the rest of the disk continues to operate correctly
  • Full disk failure: a device stops being able to service reads or writes to all sectors

RAID 1: Disk Mirroring/Shadowing

  • Each disk is fully duplicated onto its “shadow”
    • For high I/O rate, high availability environments
    • Most expensive solution: \(100\%\) capacity overhead
  • Bandwidth sacrificed on write:
    • Logical write = two physical writes
    • Highest bandwidth when disk heads and rotation fully synchronized (hard to do)
  • Reads may be optimized
    • Can have two independent reads to same data
  • Recovery:
    • Disk failure \(\Rightarrow\) replace disk and copy data to new disk
    • Hot Spare: idle disk already attached to system to be used for immediate replacement

RAID 5+: High I/O Rate Parity

  • Data striped across multiple disks
    • Successive blocks stored on successive (non-parity) disks
    • Increased bandwidth over single disk
  • Parity block constructed by XORing the data blocks in its stripe
    • \(P0 = D0 \oplus D1 \oplus D2 \oplus D3\)
    • Can destroy any one disk and still reconstruct data
    • Suppose Disk 3 fails, then can reconstruct: \(D2= D0 \oplus D1 \oplus D3 \oplus P0\)
  • Rotating parity
    • Parity blocks are written on every update to their stripe, so parity is rotated across the disks to keep any one disk from becoming a bottleneck
  • Striping data
    • Balance parallelism vs. sequential access efficiency

RAID 5 can recover the failed disk only if (i) only one disk fails and (ii) the failed disk is known.
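
The parity arithmetic is plain bytewise XOR. A small, self-contained sketch of rebuilding the lost block from the survivors (the 512-byte block size is an assumption):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512   /* assumed block size */

/* XOR together the surviving blocks of a stripe (data + parity) to rebuild
 * the failed disk's block: P0 = D0^D1^D2^D3 implies D2 = D0^D1^D3^P0. */
void raid5_reconstruct(const uint8_t *survivors[], int n_survivors,
                       uint8_t out[BLOCK_SIZE])
{
    memset(out, 0, BLOCK_SIZE);
    for (int i = 0; i < n_survivors; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
            out[j] ^= survivors[i][j];
}
```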

Distributed Systems: Motivation/Issues/Promise

  • Why do we want distributed systems?
    • Cheaper and easier to build lots of simple computers
    • Easier to add power incrementally
    • Users can have complete control over some components
    • Collaboration: much easier for users to collaborate through network resources (such as network file systems)
  • The promise of distributed systems:
    • Higher availability: one machine goes down, use another
    • Better durability: store data in multiple locations
    • More security: each piece easier to make secure

Distributed Systems: Reality

  • Reality has been disappointing
    • Worse availability: depend on every machine being up
      • Lamport: “a distributed system is one where I can’t do work because some machine I’ve never heard of isn’t working!”
    • Worse reliability: can lose data if any machine crashes
    • Worse security: anyone in world can break into system
  • Coordination is more difficult
    • Must coordinate multiple copies of shared state information (using only a network)
    • What would be easy in a centralized system becomes a lot more difficult

Distributed Systems: Goals/Requirements

  • Transparency: the ability of the system to mask its complexity behind a simple interface
  • Possible transparencies:
    • Location: Can’t tell where resources are located
    • Migration: Resources may move without the user knowing
    • Replication: Can’t tell how many copies of resource exist
    • Concurrency: Can’t tell how many users there are
    • Parallelism: System may speed up large jobs by splitting them into smaller pieces
    • Fault Tolerance: System may hide various things that go wrong
  • Transparency and collaboration require some way for different processors to communicate with one another

Example Problems

Computing the maximum file size of a sample FFS and analyzing its design

(1). The FastFile file system uses an inode array to organize the files on disk. Each inode consists of a user id (2 bytes), three time stamps (4 bytes each), protection bits (2 bytes), a reference count (2 bytes), a file type (2 bytes) and the size (4 bytes). Additionally, the inode contains 13 direct indexes, 1 index to a 1st-level index table, 1 index to a 2nd-level index table, and 1 index to a 3rd-level index table. The file system also stores the first 436 bytes of each file in the inode.

  1. Assume a disk sector is 512 bytes, and assume that any auxiliary index table takes up an entire sector. What is the maximum size of a file in this system?
    • Each inode carries \(2 + 3 \times 4 + 2 + 2 + 2 + 4 = 24\) bytes of fixed metadata, which does not count toward file contents.
    • The 4-byte size field means the theoretical maximum file size is \(2^{32} - 1 \approx 4 \text{GB}\).
    • Assuming each index is \(4 \text{bytes}\), a 512-byte index block holds \(\frac{512}{4} = 2^{7}\) indexes.
    • The actual maximum size is therefore (checked numerically below) \[ (436 + 13 \times 512 + 2^7 \times 512 + 2^{14} \times 512 + 2^{21} \times 512)\text{bytes} \approx 1.008 \text{GB} \]
  2. Is there any benefit to including the first 436 bytes of the file in the inode?
    • Clearly, any file of \(\text{size} \leq 436 \text{bytes}\) is stored entirely within its inode. This both improves performance, by avoiding extra disk sector reads, and saves space.
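
A quick numeric check of this arithmetic in C; all constants come directly from the problem statement.

```c
#include <stdio.h>

int main(void)
{
    unsigned long long sector = 512, idx = 512 / 4;   /* 4-byte indexes per sector */
    unsigned long long max =
        436ULL                        /* bytes stored directly in the inode */
        + 13 * sector                 /* 13 direct blocks                   */
        + idx * sector                /* 1st-level index table              */
        + idx * idx * sector          /* 2nd-level                          */
        + idx * idx * idx * sector;   /* 3rd-level                          */
    printf("%llu bytes (~%.3f GiB)\n",
           max, max / (1024.0 * 1024 * 1024));        /* prints ~1.008 GiB */
    return 0;
}
```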

Designing a file system to distinguish regular files from directories

(2). When a user tries to write a file, the file system needs to detect whether that file is a directory so that it can restrict writes to maintain the directory’s internal consistency. Given a file’s name, how would you design a file system to keep track of whether each file is a regular file or a directory?

  1. In FAT
    • Each directory entry contains an attribute byte.
    • Bit 4 of the attribute byte (mask 0x10) marks whether the entry is a directory.
    • If that bit is 1, the entry is a directory; otherwise it is a regular file.
  2. In FFS
    • The inode’s metadata contains a file type field (mode bits).
    • For example:
      • S_IFDIR marks a directory.
      • S_IFREG marks a regular file.
    • The file system distinguishes directories from regular files by checking the inode’s type field (a minimal POSIX check is sketched below).
  3. In NTFS
    • In an MFT (Master File Table) entry, attribute records identify the file type.
    • One bit of the file attribute flags marks whether the entry is a directory.
    • A directory is essentially a file containing index attributes (such as $INDEX_ROOT).
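
On a POSIX system, the FFS-style type field is visible through stat(2); a minimal check using the real S_ISDIR macro:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path names a directory, 0 for any other file type,
 * -1 on error. S_ISDIR tests the type bits of st_mode. */
int is_directory(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s -> %d\n", argv[i], is_directory(argv[i]));
    return 0;
}
```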

An FFS Variant

(3). Suppose a variation of FFS includes in each inode 12 direct, 1 indirect, 1 double indirect, 2 triple indirect, and 1 quadruple indirect pointers. Assume 6 KB blocks and 6-byte pointers.

  1. What is the largest file that can be accessed with direct pointers only?
    • As in problem (1), the maximum file size is \[ (12 \times 6) \text{KB} = 72 \text{KB} \]
  2. What is the largest file that can be accessed in total?
    • Each 6 KB block holds \(\frac{6144}{6} = 2^{10}\) pointers, so as in problem (1), the maximum file size is (checked numerically below) \[ (12 \times 6 + 2^{10} \times 6 + 2^{20} \times 6 + 2^{31} \times 6 + 2^{40} \times 6) \text{KB} \approx 6.01 \text{PB} \]
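
And the corresponding numeric check for this variant:

```c
#include <stdio.h>

int main(void)
{
    unsigned long long blk = 6ULL * 1024;   /* 6 KB blocks             */
    unsigned long long p   = blk / 6;       /* 1024 pointers per block */
    unsigned long long blocks =
        12                     /* direct pointers      */
        + p                    /* 1 single indirect    */
        + p * p                /* 1 double indirect    */
        + 2 * p * p * p        /* 2 triple indirect    */
        + p * p * p * p;       /* 1 quadruple indirect */
    printf("%.2f PiB\n",
           blocks * blk / (1024.0 * 1024 * 1024 * 1024 * 1024));  /* ~6.01 PiB */
    return 0;
}
```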

Elevator Algorithm Exercise

(4). Consider a disk queue holding requests to the following cylinders in the listed order: 116, 22, 3, 11, 75, 185, 100, 87. Using the elevator scheduling algorithm, what is the order that the requests are serviced, assuming the disk head is at cylinder 88 and moving upward through the cylinders?

  1. If the head may reverse direction (classic SCAN):
    • Upward: \(100 \to 116 \to 185\)
    • Downward: \(87 \to 75 \to 22 \to 11 \to 3\)
    • Service order: \(100 \to 116 \to 185 \to 87 \to 75 \to 22 \to 11 \to 3\) (implemented in the sketch below)
  2. If reversal is not allowed and the head always sweeps upward (C-SCAN):
    • Phase one: \(100 \to 116 \to 185\)
    • Phase two: \(3 \to 11 \to 22 \to 75 \to 87\)
    • Service order: \(100 \to 116 \to 185 \to 3 \to 11 \to 22 \to 75 \to 87\)
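
A compact C implementation of variant 1 (SCAN with reversal): serve all cylinders at or above the head in ascending order, then the rest in descending order.

```c
#include <stdio.h>
#include <stdlib.h>

static int asc(const void *a, const void *b)  { return *(const int *)a - *(const int *)b; }
static int desc(const void *a, const void *b) { return *(const int *)b - *(const int *)a; }

/* SCAN/elevator: from `head` moving upward, serve cylinders >= head in
 * ascending order, then reverse and serve the rest in descending order. */
void elevator(int head, const int req[], int n)
{
    int up[16], down[16], nu = 0, nd = 0;   /* assumes n <= 16 */
    for (int i = 0; i < n; i++) {
        if (req[i] >= head) up[nu++] = req[i];
        else                down[nd++] = req[i];
    }
    qsort(up, nu, sizeof(int), asc);
    qsort(down, nd, sizeof(int), desc);
    for (int i = 0; i < nu; i++) printf("%d ", up[i]);
    for (int i = 0; i < nd; i++) printf("%d ", down[i]);
    printf("\n");
}

int main(void)
{
    int req[] = { 116, 22, 3, 11, 75, 185, 100, 87 };
    elevator(88, req, 8);   /* prints: 100 116 185 87 75 22 11 3 */
    return 0;
}
```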

(5). Research how different RAID versions (at least 5) work and make a table comparing them.

Summary Comparison of RAID Levels

| RAID Level | Characteristics | Advantages | Disadvantages |
| --- | --- | --- | --- |
| RAID 0 | Striping: data is spread across multiple disks with no redundancy. | Better read/write performance; 100% space utilization. | No fault tolerance; any single disk failure loses data. |
| RAID 1 | Mirroring: every disk has an identical backup disk. | High reliability and fast recovery after a disk failure; better read performance. | Low space utilization (50%); worse write performance (every write happens twice). |
| RAID 5 | Striping plus distributed parity spread across all disks. | Tolerates one disk failure; high read performance, moderate write performance. | Parity computation adds write overhead; a second simultaneous failure loses data. |
| RAID 6 | Extension of RAID 5 with double parity. | Tolerates two simultaneous disk failures; high read performance. | Higher parity overhead; lower space utilization (at least 4 disks required). |
| RAID 10 | RAID 1 + RAID 0: mirror first, then stripe. | Combines RAID 1 reliability with RAID 0 performance; high read and write performance. | Low space utilization (50%); at least 4 disks; higher cost. |
| RAID 50 | RAID 5 + RAID 0: multiple RAID 5 groups striped together. | RAID 5 fault tolerance with RAID 0 performance; suits large storage systems. | Parity overhead on writes; higher cost and more complex configuration. |
| RAID 60 | RAID 6 + RAID 0: multiple RAID 6 groups striped together. | Double-parity fault tolerance with RAID 0 performance; suits high-reliability scenarios. | Highest parity overhead; high cost and complex configuration. |
