# A Brief Introduction to ZFS
Mo Khan - 3431709
## 0. Abstract
This paper gives an introduction to the Zettabyte File System, now referred
to simply as "ZFS", which was the state of the art in the early 2000s. Starting
with an overview of relevant events prior to and during this period, the paper
presents an introduction to file systems, the design of ZFS, and an analysis of
the file system.
## 1. Introduction
To properly understand the relative importance of file systems, we need to
place them in their historical context [3]:
* 1977: FAT: Marc McDonald designs and implements an 8-bit file system. [4]
* 1980: FAT12: Tim Paterson extends FAT to 12 bits. [4]
* 1984: FAT16: Cluster addresses are increased to 16 bits. [4]
* 1985: HFS: Apple Inc. develops the Hierarchical File System [7]
* 1993: NTFS: Microsoft develops a proprietary journaling file system. [5]
* 1993: ext2: Rémy Card replaces the extended file system. [9]
* 1994: XFS: Silicon Graphics releases a 64-bit journaling file system. [12]
* 1996: FAT32: Microsoft designs FAT32, which uses 32-bit cluster addresses. [4]
* 1998: HFS+: Apple Inc. develops the HFS Plus journaling file system. [6]
* 2001: ZFS: Development of the Zettabyte File System begins at Sun Microsystems. [14]
* 2001: ext3: ext2 is extended to support journaling. [10]
* 2008: ext4: fourth extended file system is a journaling file system. [11]
* 2009: btrfs: B-tree file system is introduced into the Linux kernel. [13]
* 2017: APFS: macOS replaces HFS+ with Apple File System [15]
## 2. Traditional File Systems
Traditionally, administration of file systems and disks can be difficult,
slow, and error prone. Adding more storage to an existing file system requires
unmounting block devices, which causes temporary service interruptions.
Many file systems use a one-to-one association between the file system and the
block device. Volume managers are responsible for providing virtual addresses
for the underlying physical storage; the virtual blocks are presented to the
file system as a single logical storage device. System administrators are
required to predict the maximum future size of each file system at the time of
creation.
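As a rough sketch of that translation (hypothetical layout, not any particular
volume manager), logical block addresses seen by the file system can be mapped
onto (physical device, offset) pairs by concatenating the devices:

```python
# Hypothetical volume manager translation: physical devices are
# concatenated so the file system sees one flat logical device.

DEVICES = [("disk0", 1000), ("disk1", 2000)]  # (name, size in blocks)

def logical_to_physical(lba: int) -> tuple[str, int]:
    """Translate a logical block address to a (device, offset) pair."""
    for name, size in DEVICES:
        if lba < size:
            return name, lba
        lba -= size
    raise ValueError("logical block address out of range")

print(logical_to_physical(1500))  # -> ('disk1', 500)
```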
Most file systems allow the on-disk data to be inconsistent in some way for
varying periods of time. If an unexpected crash or power cycle occurs while the
on-disk state is inconsistent, the file system will require some form of repair
during the next boot.
If the file system does not validate data returned by the device controller,
corrupted data can be passed on to applications and cause further errors. A
file system that can detect and automatically correct corrupted data reduces
the impact of such failures on the rest of the system.
```plaintext
Traditional file system block diagram [1]
----------------------------
| |
| System Call |
| |
----------------------------
|Vnode interface|
----------------------------
| |
| |
| File System |
| |
| |
----------------------------
| logical device, offset|
----------------------------
| |
| Volume Manager |
| |
----------------------------
| physical device, offset|
----------------------------
| |
| Device Driver |
| |
----------------------------
```
## 3. Data Corruption
Disk corruption occurs when a data access does not return the expected
contents due to some problem in the storage stack [2]. This can happen for many
reasons: errors in magnetic media, power spikes, erratic mechanical movement,
physical damage, or defects in device firmware, operating system code, or
device drivers. Error correction codes (ECC) can catch many of these
corruptions, but not all of them.
Some ways to handle data corruption include using checksums to verify data
integrity; implementing redundancy with data structures and algorithms that can
detect corruption and recover from it, such as B-tree file system structures;
and using a RAID storage setup to stripe or mirror the data across physical
devices.
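As a rough illustration of the first and last of these techniques combined (a
sketch, not ZFS code), a checksum can detect a corrupted copy, and an intact
mirror copy can be used to recover:

```python
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def read_with_recovery(mirrors: list[bytes], expected: bytes) -> bytes:
    """Return the first mirror copy whose checksum matches the stored one."""
    for copy in mirrors:
        if checksum(copy) == expected:
            return copy
    raise IOError("all mirror copies are corrupted")

good = b"file data"
stored_sum = checksum(good)  # computed when the block was written
# The first mirror has a flipped bit; the second, intact copy recovers it.
recovered = read_with_recovery([b"fi1e data", good], stored_sum)
assert recovered == good
```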
## 4. ZFS File System
The Zettabyte File System (ZFS) is a file system developed at Sun Microsystems.
ZFS was originally implemented in the Solaris operating system and was intended
for use on everything from desktops to database servers. ZFS attempts to
achieve the following goals:
* strong data integrity
* simple administration
* handle immense capacity
It uses checksums to verify data integrity, changes the interaction between
the file system and the volume manager to simplify administration, and uses
128-bit block addresses to address vast amounts of data.
```plaintext
ZFS Block diagram [1]
--------------------------------
| |
| System Call |
| |
--------------------------------
| Vnode interface |
--------------------------------
| |
| ZFS POSIX Layer (ZPL) |
| |
--------------------------------
| dataset, object, offset |
--------------------------------
| |
| Data Management Unit (DMU) |
| |
--------------------------------
| data virtual address |
--------------------------------
| |
| Storage Pool Allocator (SPA) |
| |
--------------------------------
| physical device, offset |
--------------------------------
| |
| Device Driver |
| |
--------------------------------
```
1. The device driver exports a block device to the SPA.
1. The SPA handles:
    * block allocation and I/O
    * exporting virtual addresses
    * allocating and freeing blocks for the DMU
1. The DMU turns the virtually addressed blocks into transactional objects for the ZPL.
1. The ZPL implements a POSIX file system on top of the DMU objects and exports
vnode operations to the system call layer.
The SPA allocates blocks from all the devices in a storage pool. It provides a
`malloc()`/`free()`-like interface for allocating and freeing disk space, and
the virtual addresses it hands out for disk blocks are called data virtual
addresses (DVAs).
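A toy model of this interface (hypothetical names, not the real SPA API):
allocations return DVAs rather than device-specific offsets, so callers never
deal with physical placement:

```python
# Hypothetical malloc()/free()-style block allocator that hands out
# data virtual addresses (DVAs) instead of raw device offsets.

class StoragePool:
    def __init__(self) -> None:
        self._next_dva = 0
        self._free: list[int] = []           # recycled DVAs
        self._blocks: dict[int, bytes] = {}  # DVA -> block contents

    def alloc(self, data: bytes) -> int:
        """Allocate a block, returning its data virtual address."""
        if self._free:
            dva = self._free.pop()
        else:
            dva = self._next_dva
            self._next_dva += 1
        self._blocks[dva] = data
        return dva

    def free(self, dva: int) -> None:
        """Free a block so its address can be reused."""
        del self._blocks[dva]
        self._free.append(dva)

pool = StoragePool()
dva = pool.alloc(b"some metadata")
pool.free(dva)
```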
System administrators no longer have to create logical devices or partition
storage; they just tell the SPA which devices to use. The SPA uses 128-bit
block addresses, allowing it to address massive amounts of data
(2^128, or 340,282,366,920,938,463,463,374,607,431,768,211,456 addresses).
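As a quick sanity check of that figure (a throwaway calculation, not part of
ZFS):

```python
# The number of distinct 128-bit block addresses is 2**128.
assert 2**128 == 340_282_366_920_938_463_463_374_607_431_768_211_456
```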
To protect against data corruption, each block is checksummed before it is
written to disk, and a block's checksum is stored in its parent indirect block.
Keeping the checksum apart from the data it describes isolates faults: a
corrupted block cannot silently validate itself, because its integrity is
checked against the checksum held by its parent.
```plaintext
-------
| | | uberblock (has checksum of itself)
|___|___|
|___|___|
/ \
------- -------
| | | | | |
|___|___| |___|___|
|___|___| |___|___|
/ \ / \
----- ----- ----- -----
| | | | | | | |
----- ----- ----- -----
[1]
```
When a block is read back from the device, its checksum is recomputed and
compared against the stored checksum to detect corruption. If corruption is
detected, self-healing is possible under some conditions, for example when an
intact redundant copy of the block exists.
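The sketch below (a simplified model, not ZFS's on-disk format) shows the idea
of the checksum tree: each parent holds the checksums of its children, so a
read can be verified top-down starting from the uberblock:

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Leaves are data blocks; each indirect block stores its children's
# checksums, and the "uberblock" stores checksums of the indirect blocks.
leaves = [b"block A", b"block B", b"block C", b"block D"]
indirect = [sha(leaves[0]) + sha(leaves[1]), sha(leaves[2]) + sha(leaves[3])]
uber = sha(indirect[0]) + sha(indirect[1])

def verify_leaf(i: int, data: bytes) -> bool:
    """Walk from the uberblock down to leaf i, checking checksums on the way."""
    node = indirect[i // 2]
    # Verify the indirect block against the checksum held by the uberblock.
    if sha(node) != uber[(i // 2) * 32:(i // 2) * 32 + 32]:
        return False
    # Verify the data block against the checksum held by its parent.
    return sha(data) == node[(i % 2) * 32:(i % 2) * 32 + 32]

assert verify_leaf(2, b"block C")      # intact block verifies
assert not verify_leaf(2, b"block X")  # corruption is detected
```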
Virtual devices (vdevs) each implement a small set of routines for a
particular feature, such as mirroring or striping. The SPA allocates blocks
from the top-level vdevs using a round-robin strategy.
The DMU consumes blocks from the SPA and exports objects (flat files). Objects
live within a dataset, which provides a private namespace for the objects it
contains. Objects are identified by 64-bit numbers and can be created,
destroyed, read, and written.
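A hypothetical sketch of that object model (names invented for illustration):
each dataset is a private namespace mapping 64-bit object numbers to flat byte
arrays:

```python
# Hypothetical model of DMU objects: a dataset maps 64-bit object
# numbers to flat byte arrays that can be created, destroyed,
# read, and written.

class Dataset:
    def __init__(self) -> None:
        self._objects: dict[int, bytearray] = {}
        self._next_id = 0  # 64-bit object numbers, handed out in order

    def create(self) -> int:
        obj_id = self._next_id
        self._next_id += 1
        self._objects[obj_id] = bytearray()
        return obj_id

    def destroy(self, obj_id: int) -> None:
        del self._objects[obj_id]

    def write(self, obj_id: int, offset: int, data: bytes) -> None:
        obj = self._objects[obj_id]
        obj[offset:offset + len(data)] = data

    def read(self, obj_id: int, offset: int, length: int) -> bytes:
        return bytes(self._objects[obj_id][offset:offset + length])

ds = Dataset()
f = ds.create()
ds.write(f, 0, b"hello")
assert ds.read(f, 0, 5) == b"hello"
```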
The DMU keeps the on-disk data consistent at all times by treating all blocks as
copy-on-write. All data in the pool is part of a tree of indirect blocks, with
the data blocks as the leaves of the tree.
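A minimal sketch of the copy-on-write idea (greatly simplified; real ZFS
batches updates into transaction groups and commits them by rewriting the
uberblock): updating a leaf writes new copies of every block on the path to the
root and never modifies an existing block in place:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    left: "Node | bytes"
    right: "Node | bytes"

def cow_update(root: Node, path: list[str], data: bytes) -> Node:
    """Return a new root with `data` at the leaf named by `path` ("L"/"R")."""
    if not path:
        raise ValueError("path must not be empty")
    side, rest = path[0], path[1:]
    child = root.left if side == "L" else root.right
    new_child = cow_update(child, rest, data) if rest else data
    # Write a new copy of this block; the old block is left untouched.
    if side == "L":
        return Node(new_child, root.right)
    return Node(root.left, new_child)

old_root = Node(Node(b"A", b"B"), Node(b"C", b"D"))
new_root = cow_update(old_root, ["L", "R"], b"B2")
assert old_root.left.right == b"B"       # old tree is unchanged
assert new_root.left.right == b"B2"      # new tree sees the update
assert new_root.right is old_root.right  # untouched subtree is shared
```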
## 5. ZFS Observations
A 2010 analysis [2] of ZFS by Yupu Zhang et al. observed the following:
**Data corruption**
1. ZFS detects all corruptions due to the use of checksums.
1. ZFS gracefully recovers from single metadata block corruptions.
1. ZFS does not recover from data block corruptions.
1. In-memory copies of metadata help ZFS to recover from serious multiple block
corruptions.
1. ZFS cannot recover from multiple block corruptions affecting all ditto blocks
when no in-memory copy exists.
**Memory corruption**
1. ZFS does not use the checksums in the page cache along with the blocks to
detect memory corruptions.
1. The window of vulnerability of blocks in the page cache is unbounded.
1. Since checksums are created when blocks are written to disk, any corruption
to blocks that are dirty (or will be dirtied) is written to disk permanently
on a flush.
1. Dirtying blocks due to updating file access time increases the possibility of
making corruptions permanent.
1. For most metadata blocks in the page cache, checksums are not valid and thus
useless in detecting memory corruptions.
1. When metadata is corrupted, operations fail with wrong results or give
misleading error messages.
1. Many corruptions lead to a system crash.
1. The read() system call may return bad data.
1. There is no recovery for corrupted metadata.
> We argue that file systems should be designed with end-to-end data integrity
> as a goal. File systems should not only provide protection against disk
> corruptions, but also aim to protect data from memory corruptions.
## 6. Conclusion
The original goals stated for the ZFS project were to address concerns shared
by many file systems of that generation: data integrity, simple
administration, and handling immense capacity. To accomplish these goals, new
abstractions were created such as the ZPL, DMU, and SPA. It is my opinion that
the addition of these abstractions increased the complexity of the underlying
file system while improving data integrity for specific scenarios. The addition
of new object data structures, checksums for all reads and writes, and the use
of 128-bit block addresses increases the amount of CPU, memory, and disk space
required to accommodate this file system. This author rejects the claim that
this file system is suitable for general desktop environments but acknowledges
that certain server-side use cases could benefit from the features that ZFS
provides.
## 7. References
1. Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum - The Zettabyte File System. https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf
1. Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau - End-to-end Data Integrity for File Systems: A ZFS Case Study. https://www.usenix.org/legacy/event/fast10/tech/full_papers/fast10proceedings.pdf#page=37
1. Wikipedia authors - List of default file systems. https://en.wikipedia.org/wiki/List_of_default_file_systems
1. Wikipedia authors - File Allocation Table. https://en.wikipedia.org/wiki/File_Allocation_Table
1. Wikipedia authors - NTFS. https://en.wikipedia.org/wiki/NTFS
1. Wikipedia authors - HFS Plus. https://en.wikipedia.org/wiki/HFS_Plus
1. Wikipedia authors - Hierarchical File System. https://en.wikipedia.org/wiki/Hierarchical_File_System
1. Wikipedia authors - Unix File System. https://en.wikipedia.org/wiki/Unix_File_System
1. Wikipedia authors - ext2. https://en.wikipedia.org/wiki/Ext2
1. Wikipedia authors - ext3. https://en.wikipedia.org/wiki/Ext3
1. Wikipedia authors - ext4. https://en.wikipedia.org/wiki/Ext4
1. Wikipedia authors - XFS. https://en.wikipedia.org/wiki/XFS
1. Wikipedia authors - Btrfs. https://en.wikipedia.org/wiki/Btrfs
1. Wikipedia authors - ZFS. https://en.wikipedia.org/wiki/ZFS
1. Wikipedia authors - Apple File System. https://en.wikipedia.org/wiki/Apple_File_System