| commit | 02a064677aff7d3063853397252d19916c340f81 (patch) |
|---|---|
| author | mo khan <mo@mokhan.ca>, 2021-07-25 17:04:51 -0600 |
| committer | mo khan <mo@mokhan.ca>, 2021-07-25 17:04:51 -0600 |
| tree | 61e526f87cbaa99faa04e6b8a6547b43a8c78c79 |
| parent | fee20e87b88c43a56ca34bd5600ae913aaa2942b (diff) |

finish the body of the zfs paper

| -rw-r--r-- | doc/my-paper.md | 273 |
|---|---|---|

1 file changed, 273 insertions(+), 0 deletions(-)
diff --git a/doc/my-paper.md b/doc/my-paper.md
new file mode 100644
index 0000000..84fac6c
--- /dev/null
+++ b/doc/my-paper.md
@@ -0,0 +1,273 @@

# An Introduction to the Zettabyte File System (ZFS)

## 0. Abstract

This paper introduces the Zettabyte File System (ZFS), which represented the
state of the art in file system design in the early 2000s. Starting with an
overview of relevant events prior to and during this period, the paper
presents an introduction to file systems in general, the design of ZFS, and an
analysis of that design.

## 1. Introduction

To properly understand the relative importance of file systems, one has to
place them in their historical context. Relevant events begin well before ZFS
was built [3]:

* 1977: FAT: Marc McDonald designs and implements the 8-bit File Allocation Table file system. [4]
* 1980: FAT12: Tim Paterson extends FAT to 12-bit cluster addresses. [4]
* 1984: FAT16: FAT cluster addresses are increased to 16 bits. [4]
* 1985: HFS: Apple Inc. develops the proprietary Hierarchical File System. [7]
* 1993: NTFS: Microsoft develops a proprietary journaling file system. [5]
* 1993: ext2: Rémy Card designs a replacement for the extended file system (ext) for the Linux kernel. [9]
* 1994: XFS: Silicon Graphics, Inc. releases a high-performance 64-bit journaling file system named XFS. [12]
* 1996: FAT32: Microsoft designs FAT32, which uses 32-bit cluster addresses.
* 1998: HFS+: Apple Inc. develops the HFS Plus journaling file system. [6]
* 2001: ZFS: The Zettabyte File System is released as part of Sun Microsystems' Solaris operating system. [14]
* 2001: ext3: ext2 is extended to support journaling. [10]
* 2008: ext4: The fourth extended file system, a journaling file system for Linux, is developed as the successor to ext3. [11]
* 2009: btrfs: The B-tree file system is introduced into the Linux kernel. [13]
* 2017: APFS: macOS replaces HFS+ with the Apple File System (APFS).

## 2. Traditional File Systems

Traditionally, administration of file systems and disks can be difficult,
slow, and error prone. Adding more storage to an existing file system requires
unmounting block devices, which causes temporary service interruptions.

Many file systems use a one-to-one association between the file system and the
block device. Volume managers are responsible for mapping virtual addresses to
the underlying physical storage. The virtual blocks are presented to the file
system as a logical storage device. System administrators are required to
predict the maximum future size of each file system at the time of creation.

Most file systems allow the on-disk data to be inconsistent in some way for
varying periods of time. If an unexpected crash or power cycle occurs while the
on-disk state is inconsistent, the file system will require some form of repair
during the next boot.

If the file system does not validate the data returned by the device
controller, corrupted data can be passed on to applications. A file system
that can detect and automatically correct corrupted data eliminates this class
of errors.

```plaintext
 Traditional file system block diagram [1]

 ----------------------------
 |                          |
 |       System Call        |
 |                          |
 ----------------------------
 |     Vnode interface      |
 ----------------------------
 |                          |
 |       File System        |
 |                          |
 ----------------------------
 |  logical device, offset  |
 ----------------------------
 |                          |
 |      Volume Manager      |
 |                          |
 ----------------------------
 | physical device, offset  |
 ----------------------------
 |                          |
 |      Device Driver       |
 |                          |
 ----------------------------
```

## 3. Data Corruption

Disk corruption occurs when data read from disk does not have the expected
contents due to some problem in the storage stack [2].
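The mismatch just described can be made concrete with a short sketch: a
checksum is computed when a block is written and verified when it is read back.
This is illustrative Python, not ZFS code; the `write_block`/`read_block`
helpers and the in-memory "disk" dictionary are hypothetical.

```python
import hashlib

disk = {}  # hypothetical block store: block number -> (checksum, data)

def write_block(blockno: int, data: bytes) -> None:
    # Compute a checksum at write time, as a corruption-aware
    # file system would, and store it alongside the data.
    disk[blockno] = (hashlib.sha256(data).digest(), data)

def read_block(blockno: int) -> bytes:
    # Verify the stored checksum at read time; a mismatch means the
    # "device" returned contents other than what was written.
    checksum, data = disk[blockno]
    if hashlib.sha256(data).digest() != checksum:
        raise IOError(f"checksum mismatch on block {blockno}")
    return data

write_block(0, b"important data")
assert read_block(0) == b"important data"

# Simulate a bit flip in the stored data: the next read detects it.
checksum, data = disk[0]
disk[0] = (checksum, b"imp0rtant data")
try:
    read_block(0)
except IOError as e:
    print(e)  # checksum mismatch on block 0
```

Note that a checksum alone only *detects* the problem; recovering from it
requires a redundant copy, which is where the techniques below come in.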
This can occur for many reasons, such as errors in the magnetic media, power
spikes, erratic mechanical movements, physical damage, or defects in device
firmware, operating system code, or device drivers. Error correction codes
(ECC) can catch many of these corruptions, but not all of them.

Some ways to handle data corruption include using checksums to verify data
integrity; implementing redundancy by choosing data structures and algorithms
that can detect corruption and recover from it, such as B-tree file system
structures; or choosing a RAID storage setup to stripe or mirror the data
across physical devices.

## 4. ZFS File System

The Zettabyte File System (ZFS) is a file system developed at Sun Microsystems.
ZFS was originally implemented in the Solaris operating system and was intended
for use on everything from desktops to database servers. ZFS attempts to
achieve the following goals:

* strong data integrity
* simple administration
* immense capacity

It uses checksums to verify data integrity, changes the interaction between
the file system and the volume manager to simplify administration, and uses
128-bit block addresses to be able to address vast amounts of data.

```plaintext
 ZFS block diagram [1]

 --------------------------------
 |                              |
 |         System Call          |
 |                              |
 --------------------------------
 |       Vnode interface        |
 --------------------------------
 |                              |
 |    ZFS POSIX Layer (ZPL)     |
 |                              |
 --------------------------------
 |   dataset, object, offset    |
 --------------------------------
 |                              |
 |  Data Management Unit (DMU)  |
 |                              |
 --------------------------------
 |     data virtual address     |
 --------------------------------
 |                              |
 | Storage Pool Allocator (SPA) |
 |                              |
 --------------------------------
 |   physical device, offset    |
 --------------------------------
 |                              |
 |        Device Driver         |
 |                              |
 --------------------------------
```

1. The device driver exports a block device to the SPA.
1. The SPA handles:
   * block allocation and I/O
   * exporting virtual addresses
   * allocating and freeing blocks for the DMU
1. The DMU turns blocks of virtual addresses into transactional objects for the ZPL.
1. The ZPL implements a POSIX file system on top of the DMU objects and exports
   vnode operations to the system call layer.

The SPA allocates blocks from all the devices in a storage pool. It provides a
`malloc()`- and `free()`-like interface for allocating and freeing disk space.
These virtual addresses for disk blocks are called data virtual addresses
(DVAs).

System administrators no longer have to create logical devices or partition
storage; they simply tell the SPA which devices to use. The SPA uses 128-bit
block addresses, allowing it to address a massive amount of data
(340,282,366,920,938,463,463,374,607,431,768,211,456 addresses).

To protect against data corruption, each block is checksummed before it is
written to disk. A block's checksum is stored in its parent indirect block.
Separating the checksum from the data it protects ensures that the data can be
checked for integrity against a checksum located in the parent.

```plaintext
          -------
          |   |   |  uberblock (has checksum of itself)
          |___|___|
          |___|___|
          /       \
     -------       -------
     |   |   |     |   |   |
     |___|___|     |___|___|
     |___|___|     |___|___|
      /     \       /     \
   -----   -----   -----   -----
   |   |   |   |   |   |   |   |
   -----   -----   -----   -----

 [1]
```

When data is received from the block device, its checksum is compared to check
for corruption. If corruption is detected, self-healing is possible under some
conditions.

Virtual devices (vdevs) each implement a small set of routines for a
particular feature, such as mirroring or striping. The SPA allocates blocks
from the top-level vdevs in a round-robin fashion.

The DMU consumes blocks from the SPA and exports objects (flat files). Objects
live within a dataset. A dataset provides a private namespace for the objects
it contains.
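The object interface that the DMU exports can be sketched roughly as follows.
This is illustrative Python, not the actual DMU API; the `Dataset` class and
its method names are hypothetical stand-ins for the create/destroy/read/write
operations described here.

```python
class Dataset:
    """Hypothetical sketch of a DMU dataset: a private namespace of
    flat objects identified by 64-bit numbers."""

    def __init__(self):
        self._objects = {}   # object number -> object contents
        self._next_id = 0

    def create(self) -> int:
        # Hand out monotonically increasing 64-bit object numbers
        # (wrap-around is ignored in this sketch).
        objid = self._next_id
        self._next_id += 1
        self._objects[objid] = bytearray()
        return objid

    def destroy(self, objid: int) -> None:
        del self._objects[objid]

    def write(self, objid: int, offset: int, data: bytes) -> None:
        # Grow the object with zero-fill if the write extends past its end.
        obj = self._objects[objid]
        if len(obj) < offset + len(data):
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

    def read(self, objid: int, offset: int, length: int) -> bytes:
        return bytes(self._objects[objid][offset:offset + length])

ds = Dataset()
obj = ds.create()
ds.write(obj, 0, b"hello")
assert ds.read(obj, 0, 5) == b"hello"
```

Because each dataset hands out its own object numbers, two datasets can use
the same object number without conflict, which is what makes the namespace
private.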
Objects are identified by 64-bit numbers and can be created, destroyed, read,
and written.

The DMU keeps the on-disk data consistent at all times by treating all blocks
as copy-on-write. All data in the pool is part of a tree of indirect blocks,
with the data blocks as the leaves of the tree.

## 5. ZFS Observations

A 2010 analysis of ZFS by Yupu Zhang et al. [2] observed the following:

Data corruption

1. ZFS detects all corruptions due to the use of checksums.
1. ZFS gracefully recovers from single metadata block corruptions.
1. ZFS does not recover from data block corruptions.
1. In-memory copies of metadata help ZFS to recover from serious multiple-block
   corruptions.
1. ZFS cannot recover from multiple-block corruptions affecting all ditto
   blocks when no in-memory copy exists.

Memory corruption

1. ZFS does not use the checksums in the page cache along with the blocks to
   detect memory corruptions.
1. The window of vulnerability of blocks in the page cache is unbounded.
1. Since checksums are created when blocks are written to disk, any corruption
   to blocks that are dirty (or will be dirtied) is written to disk permanently
   on a flush.
1. Dirtying blocks due to updating file access time increases the possibility
   of making corruptions permanent.
1. For most metadata blocks in the page cache, checksums are not valid and thus
   useless in detecting memory corruptions.
1. When metadata is corrupted, operations fail with wrong results or give
   misleading error messages.
1. Many corruptions lead to a system crash.
1. The read() system call may return bad data.
1. There is no recovery for corrupted metadata.

> We argue that file systems should be designed with end-to-end data integrity
> as a goal. File systems should not only provide protection against disk
> corruptions, but also aim to protect data from memory corruptions.

## 6. Conclusion

The original goals stated for the ZFS project were to address concerns shared
by many file systems of that generation: data integrity, simple
administration, and immense capacity. To accomplish these goals, new
abstractions were created, such as the ZPL, DMU, and SPA. It is my opinion
that the addition of these abstractions increased the complexity of the
underlying file system while improving data integrity in specific scenarios.
The addition of new object data structures, checksums on all reads and writes,
and 128-bit block addresses increases the amount of CPU, memory, and disk
space required to accommodate this file system. This author rejects the claim
that this file system is suitable for general desktop environments but
acknowledges that certain server-side use cases could benefit from the
features that ZFS provides.

## 7. References

1. Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum - The Zettabyte File System. https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf
1. Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau - End-to-end Data Integrity for File Systems: A ZFS Case Study. https://www.usenix.org/legacy/event/fast10/tech/full_papers/fast10proceedings.pdf#page=37
1. Wikipedia authors - List of default file systems. https://en.wikipedia.org/wiki/List_of_default_file_systems
1. Wikipedia authors - File Allocation Table. https://en.wikipedia.org/wiki/File_Allocation_Table
1. Wikipedia authors - NTFS. https://en.wikipedia.org/wiki/NTFS
1. Wikipedia authors - HFS Plus. https://en.wikipedia.org/wiki/HFS_Plus
1. Wikipedia authors - Hierarchical File System. https://en.wikipedia.org/wiki/Hierarchical_File_System
1. Wikipedia authors - Unix File System. https://en.wikipedia.org/wiki/Unix_File_System
1. Wikipedia authors - ext2. https://en.wikipedia.org/wiki/Ext2
1. Wikipedia authors - ext3. https://en.wikipedia.org/wiki/Ext3
1. Wikipedia authors - ext4. https://en.wikipedia.org/wiki/Ext4
1. Wikipedia authors - XFS. https://en.wikipedia.org/wiki/XFS
1. Wikipedia authors - Btrfs. https://en.wikipedia.org/wiki/Btrfs
1. Wikipedia authors - ZFS. https://en.wikipedia.org/wiki/ZFS
