Cromfs: Compressed ROM filesystem for Linux (user-space)
This is the documentation of cromfs-1.2.4.1.
1. Purpose

Cromfs is a compressed read-only filesystem for Linux. Cromfs is intended
for permanently archiving gigabytes of big files that have lots of redundancy.
In terms of compression it is much like 7-zip archives, except that fast
random access is provided to the whole archive contents; the user does not
need to launch a program to decompress a single file, nor wait while the
system decompresses 500 files from a 1000-file archive just to get the one
file he wanted to open.
Note: The primary design goal of cromfs is compression power.
It is much slower than its peers, and uses more RAM.
If all you care about is "powerful compression"
and "random file access", then you will be happy with cromfs.
The creation of cromfs was inspired by Squashfs and Cramfs.
2. News
3. Overview
- Data, inodes, directories and block lists are stored compressed
- Duplicate inodes, files and even duplicate file portions are detected and stored only once
- Especially suitable for gigabyte-class archives of
thousands of nearly-identical megabyte-class files.
- Files are stored in solid blocks, meaning that parts of different
files are compressed together for effective compression
- Furthermore, different files utilize the same data blocks where
possible, to reduce the amount of data that needs to be compressed
- Most inode types recognized by Linux are supported (see the comparison in section 5).
- LZMA compression is used.
In the general case, LZMA compresses better than gzip and bzip2.
- As with usual filesystems, the files on a cromfs volume can be
randomly accessed in arbitrary order, by all the means one would
expect, including memory mapping.
- Works on 64-bit and 32-bit systems.
See
the documentation of the cromfs format for technical details
(also included in the source package as doc/FORMAT).
4. Limitations
- The filesystem is write-once, read-only. It is not possible to append
to a previously-created filesystem, nor is it possible to mount it read-write.
- Max filesize: 2^64 bytes (16777216 TB), but 256 TB with default settings.
- Max number of files in a directory: 2^30 (smaller if filenames are longer, but still more than 100000 in almost all cases)
- Max number of inodes (all files, dirs etc combined): 2^60, but depends on file sizes
- Max filesystem size: 2^64 bytes (16777216 TB)
- There are no "." and ".." entries in directories.
- cromfs and mkcromfs are slower than their peers.
- The cromfs-driver has a large memory footprint. It is not
suitable for very size-constrained systems.
- Maximum filename length: 4095 bytes
- Being a user-space filesystem, it might not be suitable for
root filesystems of rescue, tiny-Linux and installation disks.
(Facts needed.)
- For device inodes, a hardlink count of 1 is assumed.
(This has no effect on compression efficiency.)
Development status: Beta. The Cromfs project was created very recently
and hasn't yet been tested extensively. There is no warranty against data
loss or anything else, so use it at your own risk.
That being said, there are no known bugs.
5. Comparing to other filesystems
This is all probably very biased and hypothetical, and by no means
a scientific study, but here goes:
The comparison covers Cromfs, Cramfs, Squashfs (3.0) and Cloop.
- Compression unit:
  Cromfs: adjustable arbitrarily (2 MB default).
  Cramfs: adjustable, must be a power of 2 (4 kB default).
  Squashfs: adjustable, must be a power of 2 (64 kB max).
  Cloop: adjustable in 512-byte units (1 MB max).
- Files are compressed (up to the block size limit):
  Cromfs: together.
  Cramfs: individually.
  Squashfs: individually, except for fragments.
  Cloop: together.
- Maximum file size:
  Cromfs: 16 EB (2^44 MB) (theoretical; the actual limit depends on settings).
  Cramfs: 16 MB (2^4 MB).
  Squashfs: 16 EB (2^44 MB) (4 GB before v3.0).
  Cloop: depends on the slave filesystem.
- Maximum filesystem size:
  Cromfs: 16 EB (2^44 MB).
  Cramfs: 272 MB.
  Squashfs: 16 EB (2^44 MB) (4 GB before v3.0).
  Cloop: unknown.
- Duplicate whole-file detection:
  Cromfs: yes.
  Cramfs: no.
  Squashfs: yes.
  Cloop: no.
- Hardlinks detected and saved:
  Cromfs: yes.
  Cramfs: yes.
  Squashfs: yes, since v3.0.
  Cloop: depends on the slave filesystem.
- Near-identical file detection:
  Cromfs: yes (identical blocks).
  Cramfs: no.
  Squashfs: no.
  Cloop: no.
- Compression method:
  Cromfs: LZMA.
  Cramfs: gzip (patches exist to use LZMA).
  Squashfs: gzip (patches exist to use LZMA).
  Cloop: gzip or LZMA.
- Ownerships:
  Cromfs: uid, gid (since version 1.1.2).
  Cramfs: uid, gid (but gid truncated to 8 bits).
  Squashfs: uid, gid.
  Cloop: depends on the slave filesystem.
- Timestamps:
  Cromfs: mtime only.
  Cramfs: none.
  Squashfs: mtime only.
  Cloop: depends on the slave filesystem.
- Endianness safety:
  Cromfs: works on little-endian only.
  Cramfs: safe, but not exchangeable.
  Squashfs: safe, but not exchangeable.
  Cloop: depends on the slave filesystem.
- Kernelspace/userspace (fuse is good for security and modularity
  but cannot be used in bootdisks etc.; kernel is vice versa):
  Cromfs: user (fuse).
  Cramfs: kernel.
  Squashfs: kernel.
  Cloop: kernel.
- Appending to a previously created filesystem:
  Cromfs: no.
  Cramfs: no.
  Squashfs: yes.
  Cloop: no (the slave filesystem can be decompressed, modified and
  compressed again, but in a sense, so can every other of these).
- Mounting as read-write:
  Cromfs: no.
  Cramfs: no.
  Squashfs: no.
  Cloop: no.
- Supported inode types:
  Cromfs: all.
  Cramfs: all.
  Squashfs: all.
  Cloop: depends on the slave filesystem.
- Fragmentation (good for compression, bad for access speed):
  Cromfs: commonplace.
  Cramfs: none.
  Squashfs: file tails only.
  Cloop: depends on the slave filesystem.
- Holes (aka. sparse files); storage optimization of blocks which
  consist entirely of nul bytes:
  Cromfs: identical blocks are merged and compressed, not limited to nul-blocks.
  Cramfs: supported.
  Squashfs: not supported.
  Cloop: depends on the slave filesystem.
- Wasted space (partially filled sectors):
  Cromfs: no.
  Cramfs: unknown.
  Squashfs: mostly not.
  Cloop: depends on the slave filesystem, usually yes.
- Extended attributes:
  Cromfs: no.
  Cramfs: unknown.
  Squashfs: unknown.
  Cloop: unknown, may depend on the slave filesystem.
Note: cromfs now saves the uid and gid in the filesystem. However,
when the uid is 0 (root), the cromfs-driver returns the uid of the
user who mounted the filesystem, instead of root. Similarly for gid.
This is both for backward compatibility and for security.
If you mount as root, this behavior has no effect.
5.1. Compression tests
Note: I use the -e and -r options in all of these mkcromfs tests
to avoid unnecessary decompression+recompression steps, in order
to speed up the filesystem generation. This has no effect on the
compression ratio.
In this table, k equals 1024 bytes (2^10) and M equals 1048576 bytes (2^20).
Test sets:
- NES ROMs: 10783 NES ROMs (2523 MB)
- Mozilla: Mozilla source code from CVS (279 MB)
- LiveCD: Damn small Linux liveCD (113 MB; size taken from "du -c" output
  in the uncompressed filesystem)
Cromfs:
- NES ROMs (mkcromfs -s16384 -a… -b… -f…):
  with 16M fblocks, 2k blocks: 202,811,971 bytes
  with 16M fblocks, 1k blocks: 198,410,407 bytes
  with 16M fblocks, ¼k blocks: 194,386,381 bytes
  Even smaller sizes are achievable by using the -c option.
- Mozilla (mkcromfs): 29,525,376 bytes
- LiveCD (mkcromfs -f1048576):
  with 64k blocks (-b65536): 39,778,030 bytes
  with 16k blocks (-b16384): 39,718,882 bytes
  with 1k blocks (-b1024): 40,141,729 bytes
Cramfs:
- NES ROMs (mkcramfs -b65536): dies prematurely, "filesystem too big"
- Mozilla (mkcramfs):
  with 2M blocks (-b2097152): 58,720,256 bytes
  with 64k blocks (-b65536): 57,344,000 bytes
  with 4k blocks (-b4096): 68,435,968 bytes
- LiveCD (mkcramfs -b65536): 51,445,760 bytes
Squashfs:
- NES ROMs (mksquashfs -b65536, using an optimized sort file): 1,185,546,240 bytes
- Mozilla (mksquashfs): 43,335,680 bytes
- LiveCD (mksquashfs -b65536): 50,028,544 bytes
Cloop:
- NES ROMs (create_compressed_fs on an iso9660 image created with mkisofs -R):
  using 7zip, 1M blocks (-B1048576 -t2 -L-1): 1,136,789,006 bytes
- Mozilla (create_compressed_fs on an iso9660 image created with mkisofs -RJ):
  using 7zip, 1M blocks (-B1048576 -L-1): 41,201,014 bytes
  (1 MB is the maximum block size in cloop)
- LiveCD (create_compressed_fs on an iso9660 image):
  using 7zip, 1M blocks (-B1048576 -L-1): 48,328,580 bytes
  using zlib, 64k blocks (-B65536 -L9): 50,641,093 bytes
7-zip (p7zip) (an archive, not a filesystem):
- NES ROMs (7za -mx9 -ma=2 a):
  with 32M blocks (-md=32m): 235,037,017 bytes
  with 128M blocks (-md=128m): 222,523,590 bytes
  with 256M blocks (-md=256m): 212,533,778 bytes
- Mozilla: untested
- LiveCD (7za -mx9 -ma2 a): 37,205,238 bytes
An explanation of why mkcromfs beats 7-zip in the NES ROM packing test:
7-zip packs all the files together as one stream. The maximum dictionary
size in 32-bit mode is 256 MB.
(Note: The default for "maximum compression" is 32 MB.)
When 256 MB of data has been packed and more data comes in,
similarities between the first megabytes of data and the latest data are
not utilized. For example, Mega Man and Rockman are two
almost identical versions of the same image, but because there's more
than 400 MB of files between them when they are processed in
alphabetical order, 7-zip does not see that they are similar, and will
compress each one separately.
7-zip's chances could be improved by sorting the files so that it
processes similar images sequentially. It already attempts to accomplish
this by sorting the files by filename extension and filename, but as
shown here, that is not always optimal.
mkcromfs however keeps track of all blocks it has encoded, and will remember
similarities no matter how long ago they were added to the archive.
This is why it outperforms 7-zip in this case, even
though it used only 16 MB fblocks.
In the liveCD compression test, mkcromfs does not beat 7-zip, because this
advantage is too minor to overcome the overhead needed to provide random
access to the filesystem. It still beats cloop, squashfs and cramfs, though.
5.2. Speed tests
Speed testing hasn't been done yet. It is difficult to test the speed,
because it depends on factors such as cache (with compressed filesystems,
decompression consumes CPU power but usually only needs to be done once)
and block size (bigger blocks need more time to decompress).
However, in the general case, it is quite safe to assume that mkcromfs is
the slowest of them all. The same goes for resource usage (RAM).
cromfs-driver requires an amount of RAM proportional to a few factors.
It can be approximated with this formula:
Max_RAM_usage = FBLOCK_CACHE_MAX_SIZE × fblock_size + READDIR_CACHE_MAX_SIZE × 60k + 8 × num_blocks
Where
- fblock_size is the value of "--fblock" used when the filesystem was created
- FBLOCK_CACHE_MAX_SIZE is a constant defined in cromfs.cc (default: 10)
- READDIR_CACHE_MAX_SIZE is a constant defined in cromfs.cc (default: 3)
- 60k is an estimate of a large directory size (2000 files with average name length of 10-20 letters)
- num_blocks is the number of block structures in the filesystem
(at most ceil(total_size_of_files / block_size), but it may be smaller)
For example, for a 500 MB archive with 16 kB blocks and 1 MB fblocks,
the memory usage would be around 10.2 MB.
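For reference, the same estimate can be reproduced with plain shell arithmetic
(a sketch; the cache-size constants are the defaults quoted above):
$ fblock_size=$((1024*1024))                              # --fsize used at creation time (1 MB)
$ num_blocks=$(( (500*1024*1024 + 16384 - 1) / 16384 ))   # ceil(total_size_of_files / block_size) = 32000
$ echo $(( 10*fblock_size + 3*60*1024 + 8*num_blocks ))   # FBLOCK_CACHE_MAX_SIZE=10, READDIR_CACHE_MAX_SIZE=3
# prints 10926080, i.e. roughly 10 MB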
6. Getting started
- Install the development requirements: make, gcc-c++ and fuse
- Remember that for fuse to work, the kernel must also contain fuse support.
Do "modprobe fuse", then check that "/dev/fuse" exists and works
(a quick check is sketched after this list).
- If an attempt to read from "/dev/fuse" (as root) gives "no such device",
it does not work. If it gives "operation not permitted", it might work.
- Build "cromfs-driver", "util/mkcromfs", "util/cvcromfs" and "util/unmkcromfs", i.e. command "make":
$ make
If you get compilation problems related to hash_map or hash,
edit cromfs-defs.hh and remove the line that says #define USE_HASHMAP.
- Create a sample filesystem:
$ util/mkcromfs . sample.cromfs
- Mount the sample filesystem:
$ mkdir sample
$ ./cromfs-driver sample.cromfs sample &
- Observe the sample filesystem:
$ cd sample
$ du
$ ls -al
- Unmount the filesystem:
$ cd ..
$ fusermount -u sample
or, type "fg" and press ctrl-c to terminate the driver.
7. Tips
To improve the compression, try these tips:
- Adjust the block size (--bsize) in mkcromfs. If your files
have a lot of identical content aligned at a certain boundary,
use that boundary as the block size value. If you are uncertain,
use a small value (500-5000) rather than a bigger one (20000-400000).
Too small a value will, however, make the inodes large, so keep it sane.
Note: The value does not need to be a power of two.
- Adjust the fblock size (--fsize) in mkcromfs. Larger values
almost always give better compression.
Note: The value does not need to be a power of two.
- Adjust the --autoindexratio option (-a). A bigger value will
increase the chances of mkcromfs finding an identical block
from something it already processed (if your data has that
opportunity). Finding that two blocks are identical always
means better compression.
You can use this formula to pick an optimal
maximum value for -a:
amount_of_spare_RAM × blocksize / (32 × total_size_of_files × estimated_remaining_ratio)
where estimated_remaining_ratio is a decimal number
smaller than 1.0, indicating how much of the data you expect to remain
to be compressed after block merging (i.e. how much will still have to
be fed to LZMA).
With 800 MB of RAM, 4 GB of files, a block size (-b) of
512 bytes and an estimated_remaining_ratio of 0.90 (a 10% reduction),
this formula gives a value of about 4 for the -a option
(a worked calculation follows this list).
- Sort your files. Files which have similar or partially
identical content should be processed right after one another.
- Adjust the --bruteforcelimit option (-c). Larger values will require
mkcromfs to check more fblocks for each block it encodes (making the
encoding much slower), in the hope of improving compression.
Basically, --bruteforcelimit is a way to virtually multiply
the --fsize (thus improving compression) by an integer factor
without increasing the memory or CPU usage of cromfs-driver.
Using it is recommended, unless you want mkcromfs to be fast.
Although there is no upper limit on the recommended value of -c,
it is not meaningful to make it larger than the fblock count of the
filesystem being created.
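Here is the -a calculation from the list above worked out with a standard
tool (a sketch; the numbers are the same hypothetical ones as in the example:
800 MB of spare RAM, 4 GB of files, -b512, estimated_remaining_ratio 0.90):
$ awk 'BEGIN { ram=800*1024*1024; bsize=512; total=4*1024*1024*1024;
               ratio=0.90; print ram*bsize/(32*total*ratio) }'
# prints about 3.5; rounding gives the "about 4" figure quoted above.
A compression-oriented invocation using these tips might then look like this
(purely hypothetical option values and paths):
$ util/mkcromfs -b512 -f4194304 -a4 -c16 sourcedir/ archive.cromfs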
To improve the filesystem generation speed, try these tips:
- Use the --decompresslookups option (-e), if you have the
disk space to spare.
- Use a large value for the --randomcompressperiod option,
for example -r10000. This, together with -e, will significantly
improve the speed of mkcromfs, at the cost of temporary disk
space usage. A small value causes mkcromfs to randomly compress
one of the temporary fblocks more often. It has no effect on
the compression ratio of the resulting filesystem.
(A combined example follows this list.)
- Use the TEMP environment variable to control where the temp
files are written. Example: TEMP=~/cromfs-temp ./mkcromfs …
- Use a larger block size (--bsize). Smaller blocks mean more blocks,
which means more work; larger blocks mean less work.
- Do not use the --bruteforcelimit option (-c). The default value 0
means that the candidate fblock will be selected straightforwardly.
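Putting these speed tips together, a generation run tuned for speed might
look like this (a sketch; directory names and option values are placeholders):
$ mkdir -p ~/cromfs-temp
$ TEMP=~/cromfs-temp util/mkcromfs -e -r10000 -b65536 sourcedir/ archive.cromfs
# -e keeps lookup data decompressed on disk, -r10000 compresses the temporary
# fblocks only rarely, and the large -b reduces the number of blocks to handle;
# -c is left at its default of 0.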
To control the memory usage, use these tips:
- Adjust the fblock size (--fsize). The memory used by cromfs-driver
is directly proportional to the size of your fblocks. It keeps at
most 10 fblocks decompressed in the RAM at a time. If your fblocks
are 4 MB in size, it will use 40 MB at max.
- In mkcromfs, adjust the --autoindexratio option (-a). This does
not affect the memory usage of cromfs-driver, but it controls the
memory usage of mkcromfs. If you have lots of RAM, use a bigger
--autoindexratio (it improves the chances of getting better compression
results), and a smaller one if you have less RAM.
- Find the CACHE_MAX_SIZE settings in cromfs.cc and edit them. This will
require recompiling the source. (In future, this should be made a command
line option for cromfs-driver.)
- In mkcromfs, adjust the block size (--bsize). The RAM usage of mkcromfs
is directly proportional to the number of blocks (and the filesystem size),
so smaller blocks require more memory and larger require less.
To control the filesystem speed, use these tips:
- The speed of the underlying storage matters.
- The bigger your fblocks (--fsize), the bigger the latencies are.
cromfs-driver caches the decompressed fblocks, but opening a non-cached
fblock requires decompressing it entirely, which will block the user
process for that period of time.
- The smaller your blocks (--bsize), the bigger the latencies are, because
there will be more steps to process for handling the same amount of data.
- Use the most powerful compiler and compiler settings available
for building cromfs-driver. This helps the decompression and cache
lookups. (An example follows this list.)
- Use fast hardware…
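As an illustration of the compiler tip, one could rebuild with more aggressive
optimization flags (a sketch; whether the Makefile honors these variables from
the command line depends on your setup, and you may need to edit the Makefile
instead):
$ make clean
$ make CXX=g++ CXXFLAGS="-O3 -fomit-frame-pointer"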
8. Using cromfs in bootdisks and tiny Linux distributions
Cromfs can be used in bootdisks and tiny Linux distributions only
by starting the cromfs-driver from a ramdisk (initrd), and then
pivot_rooting into the mounted filesystem (but not before the
filesystem has been initialized; there is a delay of a few seconds).
Theoretical requirements for using cromfs in the root filesystem
(a rough initrd script sketch follows this list):
- Cromfs-driver should probably be statically linked.
- An initrd that contains the cromfs-driver program
- Fuse driver in the kernel (it may be loaded from the initrd).
- Use of pivot_root to change the root into the mounted image
- One must wait until the cromfs-driver outputs "ready"
before accessing the filesystem
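A very rough sketch of the corresponding initrd init script follows. Everything
in it is hypothetical: the image path, the mount point, the put_old directory
(which must already exist inside the read-only image, since it cannot be created
after mounting), and the crude way of waiting for the driver to become ready.
#!/bin/sh
# initrd init script (sketch): bring up a cromfs root and pivot into it
modprobe fuse                               # unless fuse is built into the kernel
mkdir -p /newroot
/bin/cromfs-driver /root.cromfs /newroot &  # statically linked cromfs-driver
sleep 5                                     # crude; really one should wait for the driver to print "ready"
cd /newroot
pivot_root . mnt/initrd                     # mnt/initrd must exist inside the cromfs image
exec chroot . /sbin/init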
9. Other applications of cromfs
The compression algorithm in cromfs can be used to determine how similar
some files are to each other.
This is example output of the following command, run on a sample filesystem:
$ unmkcromfs --simgraph fs.cromfs '*.qh' > result.xml
<?xml version="1.0" encoding="UTF-8"?>
<simgraph>
<volume>
<total_size>64016101</total_size>
<num_inodes>7</num_inodes>
<num_files>307</num_files>
</volume>
<inodes>
<inode id="5595"><file>45/qb5/ir/basewc.qh</file></inode>
<inode id="5775"><file>45/qb5/ir/edit.qh</file></inode>
<inode id="5990"><file>45/qb5/ir/help.qh</file></inode>
<inode id="6220"><file>45/qb5/ir/oemwc.qh</file></inode>
<inode id="6426"><file>45/qb5/ir/qbasic.qh</file></inode>
<inode id="18833"><file>c6ers/newcmds/toolib/doc/contents.qh</file></inode>
<inode id="19457"><file>c6ers/newcmds/toolib/doc/index.qh</file></inode>
</inodes>
<matches>
<match inode1="5595" inode2="5990"><bytes>396082</bytes><ratio>0.5565442944</ratio></match>
<match inode1="5595" inode2="6220"><bytes>456491</bytes><ratio>0.6414264256</ratio></match>
<match inode1="5990" inode2="6220"><bytes>480031</bytes><ratio>0.6732618693</ratio></match>
</matches>
</simgraph>
It reads a cromfs volume generated earlier, and outputs statistics of it.
Such statistics can be useful in refining further compression, or just
finding useful information regarding the redundancy of the data set.
It follows this DTD:
<!ENTITY % INTEGER "#PCDATA">
<!ENTITY % REAL "#PCDATA">
<!ENTITY % int "CDATA">
<!ELEMENT simgraph (volume, inodes, matches)>
<!ELEMENT volume (total_size, num_inodes, num_files)>
<!ELEMENT total_size (%INTEGER;)>
<!ELEMENT num_inodes (%INTEGER;)>
<!ELEMENT num_files (%INTEGER;)>
<!ELEMENT inodes (inode*)>
<!ELEMENT inode (file+)>
<!ATTLIST inode id %int; #REQUIRED>
<!ELEMENT file (#PCDATA)>
<!ELEMENT matches (match*)>
<!ELEMENT match (bytes, ratio)>
<!ATTLIST match inode1 %int; #REQUIRED>
<!ATTLIST match inode2 %int; #REQUIRED>
<!ELEMENT bytes (%INTEGER;)>
<!ELEMENT ratio (%REAL;)>
Once you have generated the filesystem, running the --simgraph query is
a relatively cheap operation (but still O(n^2) in the number of files);
it involves analyzing the structures created by mkcromfs, and does not
require any search of the actual file contents. However, the similarity
information it reports can only be as fine-grained as the options used
when the filesystem was generated (block size, level of compression) allow.
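As an example of consuming this XML, the pairs whose similarity ratio exceeds
some threshold can be extracted with a standard XPath tool (a sketch; xmllint
is assumed to be installed, and the 0.6 threshold is arbitrary):
$ xmllint --xpath '//match[number(ratio) > 0.6]' result.xml
# With the sample output above, this selects the two matches whose ratio
# is greater than 0.6.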
10. Copying
cromfs has been written by Joel Yliluoma, a.k.a. Bisqwit, and is
distributed under the terms of the General Public License (GPL).
The LZMA code embedded within is licensed under LGPL.
Patches and other related material can be submitted to the author by e-mail:
Joel Yliluoma <bisqwit@iki.fi>
The author would also like to hear whether you use cromfs, what you
use it for, and what you think of it.
11. Requirements
- GNU make and gcc-c++ are required to recompile the source code.
- The filesystem works under the Fuse
user-space filesystem framework. You need to install both the Fuse kernel
module and the userspace programs before mounting Cromfs volumes.
You need Fuse version 2.5.2 or newer.
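To check which Fuse version is installed, either of the following usually
works (a sketch; the output format varies between distributions):
$ fusermount -V
$ pkg-config --modversion fuse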
12. Downloading