Created
December 8, 2013 02:29
Technical information about UMSDOS.
UMSDOS uses a fairly simple system to store metadata information in
the --LINUX-.--- files. Each full metadata block is a multiple of 64
bytes, up to 256 bytes, depending on the length of the filename.

UMSDOS uses a deterministic way to convert Linux filenames into
MS-DOS-compatible 8.3 style names, handling situations like
case-sensitivity, uniqueness when the filenames differ after the 8th
character, special filenames not allowed on MS-DOS and FAT, and so on.

It allows a fairly full set of typical POSIX functionality, only
lacking sparse file support (which would be impossible to implement
while allowing non-UMSDOS aware systems to correctly access a file's
content). Hard links are specially treated; the link names have
mirrored metadata and the files that appear on disk contain only the
path name to the actual hidden link file. The link file holds the
contents of hard-linked files and is recorded in the UMSDOS metadata,
but is not directly accessible from UMSDOS. Additionally, the
--LINUX-.--- control files do not appear under UMSDOS and there is no
way to store a file with that name (or one differing only in case) on
the system.
Fields in a --LINUX-.--- file:

    unsigned char            name_length
    unsigned char            flags
    unsigned short           number_of_links
    uid_t (unsigned short?)  uid
    gid_t (unsigned short?)  gid
    long                     atime
    long                     mtime
    long                     ctime
    unsigned char            dev_minor [1]
    unsigned char            dev_major [1]
    unsigned short           mode
    char                     spare[12] // reserved bytes, not used
    char                     name[220] [2]

[1] the device major/minor numbers are treated as a single unsigned
short in the original C sources, but effectively it's easier to treat
them as separate. These might be reversed on big-endian.

[2] This char array can be *up to* 220 bytes long, but is usually much
shorter; short enough to be only 28 bytes. It's only as long as it
needs to be for the entire metadata block to be a multiple of 64
bytes. \x00-padded, but not \x00 terminated.
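As a rough sketch of the layout above, the fixed part of an entry can
be decoded with Python's struct module. This assumes little-endian
data, 16-bit uid/gid, and 32-bit signed time stamps (the field list
itself only guesses at the uid_t/gid_t widths). The names HEADER, SLOT,
record_size, and parse_entry are made up for this illustration and do
not come from the kernel sources.

```python
import struct

# Fixed-size part of an entry: little-endian, no padding, 36 bytes.
# Assumption: uid/gid are 16-bit and the three time stamps are 32-bit.
HEADER = struct.Struct("<BBHHHiiiBBH12s")
SLOT = 64  # entries occupy a multiple of 64 bytes


def record_size(name_length):
    """Round header + name up to the next multiple of 64 bytes."""
    return -(-(HEADER.size + name_length) // SLOT) * SLOT


def parse_entry(buf, offset=0):
    """Decode the entry that starts at `offset` into a dict."""
    (name_length, flags, nlink, uid, gid, atime, mtime, ctime,
     dev_minor, dev_major, mode, _spare) = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    # The name is \x00-padded but not terminated, so slice by length.
    name = bytes(buf[start:start + name_length])
    return {
        "name": name, "flags": flags, "nlink": nlink,
        "uid": uid, "gid": gid,
        "atime": atime, "mtime": mtime, "ctime": ctime,
        "dev_minor": dev_minor, "dev_major": dev_major,
        "mode": mode, "size": record_size(name_length),
    }
```

With a 36-byte header, a name of up to 28 bytes fits in one 64-byte
slot and a 220-byte name fills the 256-byte maximum, which matches the
sizes given above.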
The flags field is a little bit special and is only used for
supporting hard links. A value of 1 means the file is hidden; that
is, it never shows up in any kind of stat, just as the --linux-.---
file itself never does. A value of 2 means that the file represents a
hard link, and like a symlink, the contents of it point to its actual
destination, which is the hidden link file. The hard linked name
itself gets a mode of 100777 set, and the number_of_links field is set
to 1 like any other regular file. The hidden link file, instead,
contains all of the metadata to be displayed for hard links; this
includes a proper number_of_links count, time stamps, permissions, and
so forth.
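A small sketch of how a tool might classify an entry from its flags
byte. The constant and function names are invented for this example,
and testing the values as bit masks is an assumption; the description
above only lists the values 1 and 2.

```python
# Hypothetical names for the two flag values described above.
FLAG_HIDDEN = 1  # entry never shows up in listings or stat
FLAG_HLINK = 2   # file content is the path of the hidden link file


def entry_kind(flags):
    """Classify an entry by its flags byte (assumed to be bit masks)."""
    if flags & FLAG_HLINK:
        return "hard link name"
    if flags & FLAG_HIDDEN:
        return "hidden"
    return "ordinary"
```

For an entry classified as a hard link name, the metadata to show the
user would come from the hidden link file's entry rather than from the
entry itself.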
None of these fields is endian-safe! A big-endian system running
Linux 2.4 and creating/using a UMSDOS filesystem will have the bytes
of every multi-byte field swapped compared to a little-endian system.
Most likely, any UMSDOS filesystems you'll see around will be
little-endian thanks to its rather niche purpose of providing POSIX
semantics on top of an MS-DOS system, but it'd be trivial to support
both little- and big-endian.
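To illustrate why supporting both byte orders is cheap, here is a
minimal sketch that builds the header layout for either order with
Python's struct module. The helper name header_struct is hypothetical,
and the field widths (16-bit uid/gid, 32-bit time stamps) are
assumptions.

```python
import struct


def header_struct(big_endian=False):
    """Return a Struct for the 36-byte header in the chosen byte order."""
    prefix = ">" if big_endian else "<"
    # Note: dev_minor/dev_major are single bytes here; since the kernel
    # treats them as one unsigned short, their order may be swapped on
    # big-endian images.
    return struct.Struct(prefix + "BBHHHiiiBBH12s")
```

A reader would pick one of the two layouts per filesystem and use it
for every entry; how to detect the right one is left open here.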
Metadata entries can be cleared by zeroing out the entire entry. This
should make it simple to support even instances where much of the
beginning of the file is just \x00s; upon reading \x00 as the first
char (the name_length field), seek forward 63 bytes and try reading
the next one, and so forth. Files that are renamed from a rather long
name to a shorter name would have no problem just zeroing out the
extra name bytes, but the kernel driver instead zeroes the old entry
and writes a new one at the end of the --linux-.--- file. The same
method is easily applied to renaming a file from a short name to a
long one: zero out the entry and make a new one at the end of
--linux-.---. This might lead to some horribly space-inefficient
metadata files over time, but that might be better handled through an
independent fsck or other clean-up utility.
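The scan described above can be sketched as a small generator. It
assumes a 36-byte fixed header in front of the name and 64-byte slots;
iter_entries, HEADER_SIZE, and SLOT are names chosen for this example
only.

```python
HEADER_SIZE = 36  # fixed fields before the name (assumed widths)
SLOT = 64         # entries start on 64-byte boundaries


def iter_entries(emd):
    """Yield (offset, name) for each live entry in a --linux-.--- image."""
    pos = 0
    while pos + SLOT <= len(emd):
        name_length = emd[pos]
        if name_length == 0:
            # Cleared slot: skip ahead to the next 64-byte boundary.
            pos += SLOT
            continue
        start = pos + HEADER_SIZE
        yield pos, bytes(emd[start:start + name_length])
        # Advance past the whole record, rounded up to a slot multiple.
        pos += -(-(HEADER_SIZE + name_length) // SLOT) * SLOT
```

A clean-up utility could use the same walk to copy only live entries
into a fresh file, though doing so would change entry offsets.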
UMSDOS has special functionality to allow certain characters and names
that are not allowed on MS-DOS and/or FAT but bear no special meaning
to Linux. POSIX systems typically only disallow two characters, \x00
and / (\x2f); beyond that, any character or string of characters may
be used. Forbidden characters in DOS/FAT names are:

* Control characters \x01 to \x1f and \x7f
  - UMSDOS still doesn't allow the storage of \x7f as a character in
    a file name. There shouldn't be any technical reason for
    disallowing it, so it's probably an oversight.
* Space character
  - Technically, not actually forbidden by DOS, but most programs and
    tools make it difficult to store and use such names. Linux's
    msdos filesystem with check=s also forbids its use, and chkdsk,
    scandisk, and dosfsck all report it as an error if a name does
    contain a space. It's best to avoid it at least.
* " * + , ; < = > ? [ \ ] | : .
  - The period can only appear once in a filename, and its use is
    solely to separate the basename from the extension, which are
    stored separately in FAT. A file cannot lack a basename, or in
    other words start with a period. Names with multiple periods are
    not valid DOS or FAT names. A file on DOS may be referenced with
    a trailing period, but this means there is no extension and it
    has the same meaning as leaving the trailing period out.
UMSDOS generates FAT file names using a rather simple method:

1. Lower-case filenames that can fit in 8+3 limits are stored as-is;
   for example, the file "dir.c" is stored simply as "dir.c".
2. Upper-case is always mangled. If "Makefile" is the only file ever
   stored in a directory, it will be stored as makefile.{__.
3. Extra periods are converted to underscores and also mangled.
   linux-2.4.37.11.tar.gz will be stored as linux-2_.{__.
4. Control characters, spaces, other special characters, and bytes
   above \x7f get converted to #s. C:\DOS\RUN gets stored as
   c##dos#r.{__.
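The four rules above can be approximated in a few lines. This is only
a sketch that reproduces the examples given, not the kernel's actual
algorithm: it assumes that every period in a mangled name becomes an
underscore, that a trailing period forces mangling, and that the
basename is simply the first 8 bytes after substitution. It does not
handle the reserved device names. fits_dos and mangled_base are
hypothetical names.

```python
# Characters forbidden in DOS/FAT names (period handled separately).
FORBIDDEN = set(b' "*+,;<=>?[\\]|:')


def fits_dos(name):
    """True if `name` (bytes) can be stored as-is under rule 1."""
    for c in name:
        if c < 0x21 or c >= 0x7f or c in FORBIDDEN:
            return False
        if 0x41 <= c <= 0x5a:  # upper-case letters are always mangled
            return False
    base, dot, ext = name.partition(b".")
    if dot and not ext:  # assumption: a trailing period is mangled
        return False
    return 1 <= len(base) <= 8 and len(ext) <= 3 and b"." not in ext


def mangled_base(name):
    """First 8 bytes of a mangled name; the extension is added later."""
    out = bytearray()
    for c in name[:8]:
        if c == 0x7f:
            raise ValueError("\\x7f cannot be stored in a UMSDOS name")
        if c == ord("."):
            out.append(ord("_"))
        elif c <= 0x20 or c > 0x7f or c in FORBIDDEN:
            out.append(ord("#"))
        else:
            out.append(c)
    return bytes(out).lower()
```

Names for which fits_dos returns False get the basename from
mangled_base plus a generated three-character extension.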
There are additional strings of characters that may not make up the
entirety of the basename. UMSDOS mangles such names so that they can
still be used on Linux. For example, on DOS, the file 'aux.sh' cannot
be stored or accessed; however, a name like '-aux.sh' is OK and can be
stored. The following strings are forbidden as the whole basename of
a file: AUX, CLOCK$, COM1, COM2, COM3, COM4, CON, LPT1, LPT2, LPT3,
LPT4, NUL, PRN. Additional ones reserved by certain TSRs but not
blocked by DOS itself are EMMXXXX0, XMSXXXX0, and SETVERXX; these are
also mangled by UMSDOS. The doschk utility additionally lists
MS$MOUSE and SMARTAAR as reserved names, but UMSDOS does not mangle
them.
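A check for these reserved basenames might look like the sketch below.
The set is copied from the list above; the function name is_reserved
is made up, and taking the part before the first period as the
basename is an assumption based on the 'aux.sh' example.

```python
# Names UMSDOS mangles when they make up the whole basename.
RESERVED = {
    b"aux", b"clock$", b"com1", b"com2", b"com3", b"com4", b"con",
    b"lpt1", b"lpt2", b"lpt3", b"lpt4", b"nul", b"prn",
    b"emmxxxx0", b"xmsxxxx0", b"setverxx",
}


def is_reserved(name):
    """True if the basename of `name` (bytes) is a reserved DOS name."""
    base = name.split(b".", 1)[0]
    return base.lower() in RESERVED
```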
Extensions for mangled names are generated deterministically, using
base 32, which gives up to 9216 (9 x 32 x 32) unique extensions per
directory. The last two characters are just base-32 digits (0-9, then
a-v), with 0 replaced by a _. The first character is one of, in
order: { } ( ) ! ` ^ & @

The extension is based on the location of the file's metadata in
--linux-.---, in multiples of 64. The entry beginning at 0x00 becomes
{__, the one at 0x40 {_1, the one at 0x12a00 (pos 1192) }58. In
base-32, the highest number possible in this scheme is 8vv (written
as @vv); it may seem odd, but it avoids any clashes with any
extensions common in the DOS (or Windows) world, such as .com, .doc,
.123, and so on, while still allowing a reasonable number of files in
a directory (most will never reach anywhere close to 9216 :).
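The extension scheme is easy to express in code. The sketch below
assumes entries start at multiples of 64 bytes and that the base-32
digits run 0-9 then a-v; the function names are invented, and raising
an error past 9216 entries is a choice made for this example rather
than documented driver behaviour.

```python
FIRST = "{}()!`^&@"                          # first extension character
DIGITS = "_123456789abcdefghijklmnopqrstuv"  # base 32, with 0 shown as _


def mangled_ext(offset):
    """Extension for the entry at byte `offset` in --linux-.---."""
    n = offset // 64
    if n >= len(FIRST) * 32 * 32:
        raise ValueError("more than 9216 entries; scheme exhausted")
    return FIRST[n >> 10] + DIGITS[(n >> 5) & 31] + DIGITS[n & 31]


def ext_to_offset(ext):
    """Inverse of mangled_ext: byte offset of the entry for `ext`."""
    n = (FIRST.index(ext[0]) << 10) \
        | (DIGITS.index(ext[1]) << 5) \
        | DIGITS.index(ext[2])
    return n * 64
```

Because the mapping is reversible, a tool can find the metadata for a
mangled file directly from its extension without scanning the whole
control file.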
A lot of these filename restrictions are not present under VFAT;
spaces, +, =, commas, and periods may be freely used. Windows
Explorer and command.com/cmd.exe do not allow creating filenames with
a leading period (normally represents a hidden file on Linux), but it
is not a restriction of the filesystem itself. This isn't
particularly relevant to UMSDOS which predates VFAT and is concerned
about 8+3 semantics of DOS before Windows 95. In some ways, UMSDOS
maintains better compatibility than VFAT does; it doesn't futz around
with filesystem structures liable to be removed or corrupted when
running scandisk or defrag from MS-DOS.
There was an experimental UVFAT in development for a short time that
shared the base filename space with regular VFAT. It allowed for more
meaningful down-conversions from Linux names, making accessibility
from Windows much more convenient. It is incompatible with UMSDOS and
I have not explored it, but it may be worth looking into at least as
inspiration for future expansions. There are some ideas I have to
make UMSDOS behave better, especially for some fringe circumstances
where you might want to store a file named --linux-.--- or where you
have a name that conforms to DOS and 8+3 limits but also looks like a
mangled name. Largely, the UMSDOS limitations cannot be repaired
without breaking compatibility, so it'd be better to start off
fresh... and the old DOS restrictions aren't quite so relevant
anymore, but VFAT is; utilizing that would be beneficial. The
posixovl project is already one such attempt at a modern filesystem
of this kind, but it too suffers from many limitations and is rather
unstable.
Well, that's all there is to it. The goal of this project at the
moment is to be compatible with UMSDOS as it is, providing both a FUSE
filesystem for it as well as some tools to manipulate/poke around it
without having to mount it. For my purposes, I'm only really
concerned with operating with a Slackware 11 UMSDOS installation,
which as far as I'm aware, is the last distribution that still
supported UMSDOS; one of the last hold-outs on the Linux 2.4 kernel
for that matter. The hope is to also have a more stable filesystem
than Linux 2.4 had; even with 2.4.37.11, the last Linux 2.4 release
ever, there are a number of ways to break UMSDOS directories entirely
from user space, and not even as the root user!