cromfs - Copyright (C) 1992,2007 Bisqwit (http://iki.fi/bisqwit/) License: GPL3 Homepage: http://bisqwit.iki.fi/source/cromfs.html How mkcromfs remembers which blocks it has encoded. -------------------------------- One of the greatest powers of mkcromfs is that it remembers all blocks it has encoded, and if it meets an identical block again later, in the same file or in another file, it can encode it as a reference to the previous data instead of adding it into the stream that is compressed by LZMA. It does this by the means of hashing: #ifdef USE_HASHMAP typedef __gnu_cxx::hash_multimap block_index_t; #else typedef std::multimap block_index_t; #endif This STL container (either one of them, the GNU version is just somewhat faster in effect) is an associative container where the key is a CRC32 value (32-bit integer) and the value is a block_info struct. It works the same way as the Array in PHP. block_info is a struct that contains two integers that tell which fblock the data was written in and at which offset. When mkcromfs wants to encode a new block, it calculates a CRC32 of the block, and checks for it in this map. block_index_t::iterator i = block_index.find(crc); If it is found, it loads the data from the fblock indicated in the block_info struct, and checks if it's identical to the data that is being encoded. If it is, it will reuse that struct. If it is not found, or the data was not identical, it will insert a new key-value pair to the map and add the block data to the stream. block_index.insert(std::make_pair(crc, blk)); As for CRC32: It does not matter which kind of hash you use. I originally used MD5, but I switched to CRC32 for better memory effeciency. Because a mere hash can never be unique, the same hash can refer to multiple different blocks, and a data verification is therefore always mandatory. When adding the block data to the stream, it does a search to see if the data already exists in the particular fblock it's appending, and will overlap the block with the tail of the fblock if it's able to do so. -------------------------------- I chose to document this technique, because it could be useful in other compression programs as well. block_info could be implemented as containing a filename and the starting offset. This way, no lookups within the encoded data would be necessary.