cromfs - Copyright (C) 1992,2011 Bisqwit (http://iki.fi/bisqwit/)
License: GPL3
Homepage: http://bisqwit.iki.fi/source/cromfs.html

How mkcromfs remembers which blocks it has encoded.

--------------------------------

One of the greatest powers of mkcromfs is that it remembers all blocks
it has encoded, and if it meets an identical block again later, in the
same file or in another file, it can encode it as a reference to the
previous data instead of adding it into the stream that is compressed
by LZMA.

It does this by the means of hashing:

    typedef std::multimap<hash_t, block_info> block_index_t;

This STL container is an associative container where the key
is a 32-bit hash value and the value is a block_info struct.

block_info is a struct that contains two integers that tell
which fblock the data was written in and at which offset.

When mkcromfs wants to encode a new block, it calculates a hash of
the block, and checks for it in this map.

    block_index_t::iterator i = block_index.find(hash);

If it is found,
   it loads the data from the fblock indicated in the block_info struct,
   and checks if it's identical to the data that is being encoded.
   If it is, it will reuse that struct.
If it is not found, or the data was not identical, it will insert a
new key-value pair to the map and add the block data to the stream.

    block_index.insert(std::make_pair(hash, blk));

When adding the block data to the stream, it does a search to see
if the data already exists in the particular fblock it's appending,
and will overlap the block with the tail of the fblock if it's able
to do so.

--------------------------------

I chose to document this technique, because it could be useful in other
compression programs as well. block_info could be implemented as
containing a filename and the starting offset. This way, no lookups
within the encoded data would be necessary.

--------------------------------

Note: The actual code in mkcromfs is different
and more complex for memory efficiency reasons.