The Filesystem, Version 1

nakst  —  8 months, 1 week ago [Edited 4 months later]
The Filesystem, Version 1

What is a filesystem?

As far as your drive is concerned, there is no such thing as a "file" or "folder".
Drives divide up their space into equally-sized sectors, 512 bytes each.

(This number, 512 bytes, originates from the sector size on floppy disks.
Modern drives have larger sectors behind the scenes.
However, drivers send commands to the drive controllers with a precision of 512-byte blocks.
As far as the filesystem driver should be considered, sectors are 512 bytes each.)

This isn't very useful for users.
They want to have hierachical folders which contain files,
where each node stores metadata, such as names, modification dates, access permissions, etc.

To achieve this, a layer needs to be inserted between programs and the drive controller drivers.
Filesystem drivers must be introduced to provide this functionality.

There are several different filesystems, such as FAT32, ext4, and NTFS.
For this operating system, I am creating a new filesystem.

Overview

Here is an overview of the filesystem's structure:

1
2
3
4
---------------------------------------------------------------------------------------------------------
| 8KB        | 8KB        | Group            | Root Directory | Directories, Files, | 8KB               |
| Bootloader | Superblock | Descriptor Table | File Entry     | and Block Bitmaps   | Superblock Backup |
---------------------------------------------------------------------------------------------------------


Superblock

At the start of a volume there is 8KiB of space reserved for any necessary boot code.
This is because in BIOS boot the very start of a partition is loaded and executed.

After this, there is 8KiB of space used for the superblock.
The superblock contains metadata about the volume and filesystem.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
struct EsFSSuperblock {
	// Version 1

/*0*/	char signature[ESFS_SIGNATURE_STRING_LENGTH];	// The filesystem signature; should be ESFS_SIGNATURE_STRING.
/*16*/	char volumeName[ESFS_MAXIMUM_VOLUME_NAME_LENGTH]; // The name of the volume.
/**/	
/*48*/	uint16_t requiredReadVersion;			// If this is greater than the driver's version, then the filesystem cannot be read.
/*50*/	uint16_t requiredWriteVersion;			// If this is greater than the driver's version, then the filesystem cannot be written.
/**/	
/*52*/	uint8_t mounted;				// Non-zero to indicate that the volume is mounted, or was not properly unmounted.
/**/	
/*56*/	uint64_t blockSize;				// The size of a block on the volume.
/*64*/	uint64_t blockCount;				// The number of blocks on the volume.
/*72*/	uint64_t blocksUsed;				// The number of blocks that are in use.
/**/	
/*80*/	uint16_t blocksPerGroup;			// The number of blocks in a group.
/*88*/	uint64_t groupCount;				// The number of groups on the volume.
/*96*/	uint64_t blocksPerGroupBlockBitmap;		// The number of blocks used to a store a group's block bitmap.
/**/	
/*104*/	EsFSLocalExtent gdt;				// The group descriptor table's location.
/*108*/	EsFSLocalExtent rootDirectoryFileEntry;		// The file entry for the root directory.
/*112*/	EsFSLocalExtent kernelFileEntry;		// The file entry for the kernel.
/**/	
/*116*/	UniqueIdentifier identifier;			// The unique identifier for the volume.
/*132*/	UniqueIdentifier osInstallation;		// The unique identifier of the Essence installation this volume was made for.
							// If this is zero, then the volume is not an Essence installation volume.
};


To fully understand each field, we need to introduce some concepts.

Volumes

I use the term "volume" to refer to a filesystem encompassing a partition on a drive.
(Partitioning is a way of breaking up space on a drive so that multiple filesystems can reside on a single drive.)

Some filesystems can have volumes that span multiple partitions on different drives,
but we currently don't do anything that fancy.
So you can assume a one-to-one correspondence between partitions and volumes.
Blocks

As described in the introduction, drives are physically split into 512-byte sectors.
For larger volumes, this is a too large granularity for efficient metadata storage in the filesystem.
Therefore we group sectors into blocks.

The blockSize field in the superblock contains the size in bytes of a block on the volume.
For example, if blockSize was 1024 then each block would encompass 2 sectors,
and the block number 4 would refer to sectors 8 and 9.

The size of each block depends on the size of the volume.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
if (driveSize < 512 * 1024 * 1024) { // 512MB
	blockSize = 512;
} else if (driveSize < 1024 * 1024 * 1024) { // 1GB
	blockSize = 1024;
} else if (driveSize < 2048l * 1024 * 1024) { // 2GB
	blockSize = 2048;
} else if (driveSize < 256l * 1024 * 1024 * 1024) { // 256GB
	blockSize = 4096;
} else if (driveSize < 256l * 1024 * 1024 * 1024 * 1024) { // 256TB
	blockSize = 8192;
} else {
	blockSize = ESFS_MAX_BLOCK_SIZE; // = 16384.
}


These block sizes are fairly arbitrary.
Most current filesystems seem to max out at 4KB.

It is a tradeoff between efficiency of metadata storage and efficiency of data storage.
Larger blocks means less metadata needs to be stored for the whole filesystem,
but smaller blocks means more space is likely to be wasted at the end of files.
Block Groups

Treating the volume as a massive array of blocks is not particularly efficient.
We can first divide the volume in block groups, which in turn contain some number of blocks.
Each block group contains a contiguous span on blocks on the drive.

Additional metadata then can be stored on a block-group basis.
This can help to speed up block allocation, as we can, for example, store how many blocks are available in each block group.
If we need to allocate a certain number of contiguous blocks, we can quickly decide which block groups would be able to service the request.
We also try to keep <data blocks of files in the same directory> in the same block group.
We'll have a look at how this works in a bit.
We try to have 4096 blocks in each group. Again, this number is fairly arbitrary. Tweaking this constant might enable better block allocation. This must however be less than 65536, since we use uint16_t's to index blocks within a block group.
1
2
3
4
5
6
superblock->blocksPerGroup = 4096;
while (true) {
	superblock->groupCount = superblock->blockCount / superblock->blocksPerGroup;
	if (superblock->groupCount) break;
	superblock->blocksPerGroup /= 2;
}


Extents

An extent is a contiguous group of blocks.
They store an block offset and a block count.

In a local extent, the offset is stored relative to the start of the block group to which the extent refers.
In a global extent, the offset is relative to the start of the volume.

In a local extent, the offset and the count are stored as uint16_t's. In a global extent, the offset and the count are stored as uint64_t's. The filesystem uses extents to track block usage in several places, because we generally try to keep things contiguous in the volume. UniqueIdentifier

1
2
3
struct UniqueIdentifier {
		uint8_t d[16];
};


A UniqueIdentifier is simply an identifier, that should be unique within some group of things.
For example, the osInstallation field in the superblock should be unique for every installation of the operating system.
This is (in theory) ensured by storing 16 bytes of random data in the field.

The possibility of a collision of 128-bit random identifiers is neglible, for the use case of a filesystem.
(I think.)
Group Descriptor Table (GDT)

After the superblock, exactly at the 16384th byte in the volume is the group descriptor table.
This contains an record for each block group in the volume.
Along with the superblock, it is initialised during formatting.

This is the structure that stores metadata about each block group.

It currently only stores 2 fields:

1
2
3
4
5
6
7
struct EsFSGroupDescriptor {
	uint64_t blockBitmap;				// The block that contains the block bitmap for the group.
							// This is usually at the start of the group.
							// If this is 0 then there is no block bitmap, and no blocks in the group are used.

	uint16_t blocksUsed;				// The number of used blocks in this group.
};


Each block group has a block bitmap which contains 1 bit per block in the group.
If the bit is set then the block is use; if it clear then it is available for use.

This only needs to be one block long, since the maximum blocks per group is 4096,
each block takes 1 bit, and the minimum block size is 512.

Block bitmaps are lazily allocated.
The first time a block is allocated from a block group, its block bitmap is initialised.
The block bitmap for any block group always is in the first block of the group, except in the first block group, where it appears after the core data (like the superblock and GDT).

Here is a diagram representing how a block group looks:

1
2
3
4
------------------------------------
| 1 block            | 4093 blocks |
| Block Usage Bitmap | Data        |
------------------------------------


The *data* contains whatever files and directories are stored there.

Attributes

To assist with backwards and forwards compatability between different versions of the filesystem,
a lot of data is stored in attribute lists.

An attribute list contains several contiguous attributes (each with a standard header),
ending with an ESFS_ATTRIBUTE_LIST_END attribute.

The standard attribute header contains 2 fields:

1
2
3
4
struct EsFSAttributeHeader {
	uint16_t type;
	uint16_t size; // in bytes.
}


There are several different attribute types.
The filesystem driver ignores attributes of types that it doesn't recognise or doesn't need.

There are 4 attributes I'll discuss in the blog post:
ESFS_ATTRIBUTE_FILE_DATA,
ESFS_ATTRIBUTE_FILE_DIRECTORY,
ESFS_ATTRIBUTE_DIRECTORY_NAME, and ESFS_ATTRIBUTE_DIRECTORY_FILE. File and Directory Entries

The filesystem makes an important distinction between file entries and directory entries.

A file entry is a record that stores information about a file, such as its size, where to find the data in the volume, its creation data etc.
A directory entry is a record that maps a filename to a file entry. Directories contain a list of directory entries.

This distinction allows hard links to exist, where multiple directory entries, each in different directories or with different names,
could point to the same file entry. (Note: The filesystem currently doesn't actually have hard links implemented.) File Entries

A file entry looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
struct EsFSFileEntry {
	// The size of a file entry must be less than the size of a block.
	// File entries may not span over a block boundary.

#define ESFS_FILE_ENTRY_SIGNATURE "FileEsFS"
/*0*/	char signature[8];				// Must be ESFS_FILE_ENTRY_SIGNATURE.

/*8*/	UniqueIdentifier identifier;

#define ESFS_FILE_TYPE_FILE (1)
#define ESFS_FILE_TYPE_DIRECTORY (2)
#define ESFS_FILE_TYPE_SYMBOLIC_LINK (3)
/*24*/	uint8_t fileType;

	.....

/*48*/	// An attribute list follows.
};


In the current version of the filesystem the attribute list must contain an ESFS_ATTRIBUTE_FILE_DATA attribute.
This attribute describes where to find the data in the file.
I'll discuss this attribute in a bit.

If the fileType is ESFS_FILE_TYPE_DIRECTORY then a ESFS_ATTRIBUTE_FILE_DIRECTORY attribute will also be present.
This stores additional information specific to directories, such as the number of items in the directory.
Directory Entries

A directory entry consists only of a signature and an attribute list.

1
2
3
4
5
6
struct EsFSDirectoryEntry {
#define ESFS_DIRECTORY_ENTRY_SIGNATURE "DirEntry"
	char signature[8];				// Must be ESFS_DIRECTORY_ENTRY_SIGNATURE.

	// An attribute list follows.
};


In the attribute list there currently is an ESFS_ATTRIBUTE_DIRECTORY_NAME attribute, which stores the filename,
and an ESFS_ATTRIBUTE_DIRECTORY_FILE attribute which stores the file entry.

In the future when links are implemented, the ESFS_ATTRIBUTE_DIRECTORY_FILE attribute will become optional.
Other attributes will be introduced to store the data needed to make the link.

Directories

If you haven't realised by now, a directory is just like a normal file, except its contents follows a specific format.
A directory is a list of directory entry.

Each block contains an array of directory entries.
If there is additional space left at the end of a block, then the first unused byte is set to 0.

The format of a directory entry is discussed above. Metadata about the directory is stored in the ESFS_ATTRIBUTE_FILE_DIRECTORY attribute of its own file entry. This is currently found within the data of the parent directory. Resizing Directories

As directories are resized one entry at a time, it makes sense to allocate multiple blocks for the directory at a time when it needs to grow.
This reduces the potential fragmentation of the directory's data - although the filesystem has other ways of avoiding fragmentation.
The number of blocks actually in use is stored in the ESFS_ATTRIBUTE_FILE_DIRECTORY attribute.

Directory Example

A simplified view of a directory looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
The directory's file entry: [ itemsInDirectory = 3, blockCount = 2 ]

Block 0: [ Directory entry 1: [ Name: "Shopping List.txt", File entry: [...] ]
	   Directory entry 2: [ Name: "Cat Picture 1.png", File entry: [...] ]
           Empty space (but not enough to fit entry 3)]

Block 1: [ Directory entry 3: [ Name: "Cat Picture 2.png", File entry: [...] ]
	   Empty space]

Block 2-7: Overallocated; unused.


EsFSAttributeFileData

This is perhaps the most important attribute in the filesystem.
It is found in file entries, and describes where the data of the file can be found in the volume.

For reference, here is the structure that defines it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
struct EsFSAttributeFileData {
/*0*/	EsFSAttributeHeader header;

	.....

#define ESFS_DATA_INDIRECT   (1)
#define ESFS_DATA_INDIRECT_2 (2) 
#define ESFS_DATA_DIRECT     (4)
/*5*/	uint8_t indirection;				// The level of indirection needed to read the data.

/*6*/	uint16_t extentCount;				// The number of extents describing the file.

/*8*/	uint64_t size;					// The size of the data in the stream in bytes.

/*16*/	union {
#define ESFS_INDIRECT_EXTENTS (4)
		EsFSGlobalExtent indirect[ESFS_INDIRECT_EXTENTS]; // Extents that contain the stream's data.
	
#define ESFS_INDIRECT_2_LISTS (8)
		uint64_t indirect2[ESFS_INDIRECT_2_LISTS];	// Lists of extents that contain the stream's data.

#define ESFS_DIRECT_BYTES (64)
		uint8_t direct[ESFS_DIRECT_BYTES];		// The actual file data, inline.
	};
};


If the file is less than or equal to ESFS_DIRECT_BYTES, then the contents of the file is stored in the data attribute.
This helps reduce fragmentation of the volume from the storage of many small files.

If the file is larger than it, then the file will be stored in an list of extents.
As discussed before, an extent is a contiguous group of blocks in the volume.
As it is not always possible to store a file in a single extent (e.g. fragmentation, block group boundaries, etc.)
a list of extents must be managed to keep track of where all the data in a file is stored in the volume. If only ESFS_INDIRECT_EXTENTS or less extents are needed to encompass the file's data, then the list of extents is stored directly in the attribute. Otherwise, the extent list must be stored within other extents. There can be up to ESFS_INDIRECT_2_LISTS of these; these are extents of fixed size (ESFS_BLOCKS_PER_EXTENT_LIST). Extent Allocation

When a file shrinks or is deleted, some of its extents are freed.
This is done simply by clearing the corresponding bits in the block bitmap of the corresponding block group.

However, when a file grows, extents need to be allocated, which is a more complicated process.

Ultimately, there is a tradeoff:
- Within a directory, files are ideally near each other on the physical drive. - But each file should have room to grow and remain contiguous. We solve this partially with block groups. New files in a given directory will first try to allocate their extents from the same block group as the directory is stored in. However, within a block group we try to keep files as far apart as possible. When growing a file, the first thing we do is check if there is room directly after the last extent in the file. If there is, we allocate it an modify the existing extent. If there isn't space, or this is the first extent in the file, then we use an algorithm to determine where in a block group we should allocate an extent. Let's think about we can keep files as far apart as possible.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
*** Block Group X ***
Index       0 1 2 3 4 5 6 7
Usage      [- - - - - - - -]
                             File 0 is placed at the start of the group.
           [0 - - - - - - -] 
                             File 1 is placed at the halfway along the group.
           [0 - - - 1 - - -] 
                             File 2 is placed at the halfway between files 1 and 2.
           [0 - 2 - 1 - - -] 
                             File 3 is placed at the halfway between files 1 and the end.
           [0 - 2 - 1 - 3 -] 
                             File 4 is placed at the halfway between files 0 and 2.
           [0 4 2 - 1 - 3 -] 
                             File 5 is placed at the halfway between files 1 and 3.
           [0 4 2 - 1 5 3 -] 
etc.


If we look at order in which we check indices for availability within a 8-block block group, we get this:

1
2
3
4
5
6
7
8
0 -> 0
1 -> 4
2 -> 2
3 -> 6
4 -> 1
5 -> 5
6 -> 3
7 -> 7


This looks like a difficult pattern to create, until we write the numbers in binary.

1
2
3
4
5
6
7
8
000 -> 000 (0)
001 -> 100 (4)
010 -> 010 (2)
011 -> 110 (6)
100 -> 001 (1)
101 -> 101 (5)
110 -> 011 (3)
111 -> 111 (7)


We just have to reverse the binary number!

There's some nice bit hacking we can do to achieve this:
(This is for uint16_t's).

1
2
3
4
index = ((index >> 1) & 0x5555) | ((index & 0x5555) << 1);
index = ((index >> 2) & 0x3333) | ((index & 0x3333) << 2);
index = ((index >> 4) & 0x0F0F) | ((index & 0x0F0F) << 4);
index = ((index >> 8) & 0x00FF) | ((index & 0x00FF) << 8);


After this, we shift the value to right so its the correct size for the block group size.

...and that's how we allocate extents.

Reading a File - From Beginning to End

Let's look at what has to happen to read a file.

First, the superblock, group descriptor table and root directory file entry are loaded from the drive.
The superblock's contents are examined to ensure the drive contains a valid filesystem.

To open an file, we need to search the directories that lead to it.
The root directory file entry specifies which blocks on the volume contain the listing of the root directory. We load these blocks into a buffer, and then go through each *directory entry* in the directory. We compare the filename we are searching for with the contents ESFS_ATTRIBUTE_DIRECTORY_NAME attribute on each directory entry. (The attributes of a directory entry are contained within it.) When we find a match, we look at the ESFS_ATTRIBUTE_DIRECTORY_FILE attribute. This contains a file entry for the file, which specifies which blocks contains the data in the file in the form of an array of extents. As before, we load these blocks into memory from the drive to read a file. Formatting

Before the volume can be used, it has to be formatted.
This creates the superblock, the block descriptor table, and the root directory.
This produces a empty, yet valid filesystem.

What do other filesystems do?

This filesystem is somewhat of a cross between ext2 and NTFS.
The ideas for block groups and directories come from ext2,
and the ideas for attribute lists come from NTFS.

While filesystems can vary, there are a lot of things that they all have to do.
All filesystems have to have some way to keep track of which blocks in the volume are used,
and all filesystems have to have some for of directory listings. But there are a lot of interesting features that some others have that we don't (yet :D). For example...
  • Directory entry name hashing - for faster file lookup
  • Journals - a log of in progress operations that can be reversed in the event of power failure
  • Filesystem scrubbing - an online (runs while the volume is mounted) checker of the volume to find and fix errors
  • Checksums - to checks that metadata is uncorrupted (and to possibly fix it)
  • Copy on write - never write to existing blocks, only ever allocate new ones and update block pointers
  • Snapshots - store a backup copy of the filesystem in its current state


That's Basically It

That's how the filesystem currently works, skipping the less interesting parts.

It's still a work in progress; things definitely will change.
As I've been developing this filesystem, there are several things I wish I'd done differently.
In the future I may change a lot of how this all works.
See the section above for features that are in other filesystems that we don't have.
However the core functionality of the filesystem is in place. With the kernel's current driver the following operations are available: - Directory enumeration - Reading/writing/shrinking/growing files - Creating/deleting/moving files and directories Thank you for reading my first blog post! Please comment any thoughts or questions you have. ******************************** Edit: I have decided to rewrite the filesystem after learning from the mistakes of this filesystem.
#14962
Simon Anciaux  —  8 months ago
Thanks for the post. Here are some questions.

Is this only a "common" way to do a filesystem ? What are the differences with other OSes / filesystems ?
Is a "volume" a physical hard drive or a partition ? What does partitioning an formatting imply for your custom filesystem ? Did you have to make tools for partitioning and formatting ?
What is the reasoning for computing the block size ? e.g. why is it only doubled when going from 2GB to 256GB ?

In "Block groups":
This can help to speed up block allocation, as we can, for example, store how many blocks are available in each section of the volume.
What is a section of the volume ? Is it a "Block group" ?
Why do you want 4096 blocks per group ?

Where are the Attribute Lists stored ? Are they stored per group that contains files ? I think I'm a bit lost, a schema of the whole system might help.

A note about the post itself: there are several place where there are new lines in the middle of a sentence which is a bit harder to read.
#14975
nakst  —  8 months ago [Edited 0 minutes later]
Hi mrmixer,
Thanks for the feedback!
I'm a bit busy this week, so I'll try to update the post with diagrams, explain terminology, answer your questions, etc. next week.
#15145
Simon Anciaux  —  7 months, 1 week ago
I just noticed you updated the article. Thanks.
Log in to comment