Freedup Download

How freedup works in principle

There are neither warranties nor guarantees that freedup works correctly. In principle freedup only knows about linking, so the maximum risk is that it links files that differ. Many precautions were taken during development, but I have to emphasize that this risk exists. Only in interactive mode can you also delete files, in a two-step process. If you detect any possible source of misbehaviour in freedup, please report it for the sake of all users.

In principle freedup always searches for files of identical size and compares them byte by byte. The only exception is the "extra styles", where the tags (for details see the next chapter) are intentionally skipped. Before files are compared byte by byte, you may apply restrictions, such as requiring the same owning user or group, the same permissions, or whatever else the options allow. Files that match in content and fulfil all required prerequisites are linked in the requested way.

Scan all directory trees recursively for all regular files.
Build a sequential list of those files and keep their names.
The argument position or pipe input sequence is kept by adding a sequence number to each file.
Use lstat() on each file to read and store its size with the filename.
Sort the files and their attached information by size using qsort().
If the comparison reports equal file sizes, additional properties are compared.
Most of those property checks are switched off by default.
If all demands are fulfilled, the files are compared block by block (4k).
If both files are identical and on the same file system, they are added to the link list.
The link list will not be processed before all comparisons are complete.
After all files are compared freedup starts processing the link list.
For each link list entry, i.e. a set of identical files, the requested order is prepared.
In interactive mode the files are now presented so that you can make a file-specific choice.
Each file that is to be linked is renamed, the hard link is created, and then the renamed file is removed.
If hard linking is not possible, soft links are tried, unless one of the paths does not start at root (but this can be forced).
Finally a short report is delivered.
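
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not freedup's own C source: the names find_duplicates, same_content and replace_with_hardlink and the temporary suffix ".freedup-tmp" are assumptions made for this example, and the optional property checks, the interactive mode and the soft link fallback are left out.

```python
import os
import stat

BLOCK = 4096  # block size named in the text (4k)


def same_content(a, b):
    """Compare two files block by block and stop at the first difference."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(BLOCK), fb.read(BLOCK)
            if ba != bb:
                return False
            if not ba:  # both files ended without a difference
                return True


def find_duplicates(roots):
    """Return lists of paths with identical content (hypothetical helper)."""
    entries = []
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)  # lstat: symlinks are not followed
                if stat.S_ISREG(st.st_mode):
                    # size first, then a sequence number to keep input order
                    entries.append((st.st_size, len(entries), st.st_dev, path))
    entries.sort()  # stands in for qsort() on the file sizes
    groups = []  # each group: [size, device, [paths]]
    for size, _seq, dev, path in entries:
        match = None
        for group in reversed(groups):
            if group[0] != size:
                break  # input is sorted, so older groups cannot match
            if group[1] == dev and same_content(group[2][0], path):
                match = group
                break
        if match is None:
            groups.append([size, dev, [path]])
        else:
            match[2].append(path)
    return [group[2] for group in groups if len(group[2]) > 1]


def replace_with_hardlink(keep, dup):
    """Rename dup, hard link keep under dup's name, remove the renamed file."""
    backup = dup + ".freedup-tmp"  # assumed temporary name
    os.rename(dup, backup)
    try:
        os.link(keep, dup)
    except OSError:
        os.rename(backup, dup)  # restore the original if linking fails
        raise
    os.remove(backup)
```

Renaming before linking means the original file is still available under its temporary name if the link cannot be created.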

For more details please have a look into the source code or ask the author.

How freedup "extra styles" work

This concept was introduced in version 1.1 because I wanted files to be linked although they differed. I am talking about mp3 files whose tags showed minor variations. At first I considered retagging all files, but I would have had to either remove all tags or complete them all (n.b. MP3v1 tags, i.e. ID3v1, are at the end and MP3v2 tags, i.e. ID3v2, are at the beginning of an mp3 file; both are optional).

An extra style should compare only the essential file content, i.e. the MPEG-encoded sound data in the case of mp3 files. Currently the following rules are established:

mp3 strips the mp3v1 and mp3v2 tags (ID3v1 and ID3v2) and compares the remaining body.
mp4 strips all sections up to the first mdat section and everything including the first non-mdat section after it. This should work for iPod files, AAC/FAAC encoded sounds, and files usually having extensions like MP4, M4A, M4V, MOV, etc.
mpc strips the APETAGEX-labeled tail from Musepack audio files.
ogg strips all information up to the sequence "vorbis.BCV", where the dot stands for an arbitrary byte. Minor trailing information (less than 128 bytes) is also cut off.
jpg tries to strip the comments at the beginning of each file. Since some comments were found after the quantization table, this table is stripped, too.
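
As an illustration of the mp3 rule, the following sketch computes which byte range of a file would remain after the tags are skipped. It is written in Python and is not taken from freedup's sources. It assumes the standard tag layouts: an ID3v2 tag starts with "ID3" and a 10-byte header whose last four bytes hold a synchsafe size, and an ID3v1 tag is the last 128 bytes of the file starting with "TAG". The function names are hypothetical.

```python
import os


def id3v2_size(head):
    """Length of a leading ID3v2 tag given the first 10 bytes, or 0 if none."""
    if len(head) < 10 or head[:3] != b"ID3":
        return 0
    size = 0
    for byte in head[6:10]:
        size = (size << 7) | (byte & 0x7F)  # synchsafe: 7 useful bits per byte
    footer = 10 if head[5] & 0x10 else 0  # optional footer flag
    return 10 + size + footer


def mp3_body_range(path):
    """Return (start, end) offsets of the data between the optional tags."""
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        start = id3v2_size(f.read(10))
        end = total
        if total - start >= 128:
            f.seek(total - 128)
            if f.read(3) == b"TAG":  # trailing ID3v1 tag
                end = total - 128
    return min(start, end), end
```

Two mp3 files would then be candidates for linking if the bytes inside their respective ranges are identical, regardless of how the tags differ.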

Since for each file type exactly one method exists (this might change in the future), an automated mode calls the respective method according to the file magic. The name of a file is not considered for type checking.
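
Selecting a style by file magic could look like the sketch below. It is written in Python and uses commonly documented signatures for these formats; the checks that freedup itself performs through its extra[] table in auto.c are not shown in this text and may differ. The function name detect_style is hypothetical.

```python
def detect_style(head):
    """Guess an extra style from the first bytes of a file, or return None."""
    if head[:3] == b"ID3":
        return "mp3"  # file starts with an ID3v2 tag
    if head[:4] == b"OggS":
        return "ogg"
    if head[:3] == b"\xff\xd8\xff":
        return "jpg"
    if head[4:8] == b"ftyp":
        return "mp4"  # MP4/M4A/M4V/MOV style container
    if head[:3] == b"MP+" or head[:4] == b"MPCK":
        return "mpc"  # Musepack stream versions 7 and 8
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame sync without a leading tag
    return None
```

The function only looks at content, so a file with an Ogg stream inside is treated as ogg even if its name ends in .mp3.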

Please note that these styles change the behaviour according to the file contents. They change the size of the compared contents, but this does not affect the options that relate to the files themselves, like ownership or file names.

If you would like to contribute, this is quite simple. There is a source file for each style. Start with a copy of my.c and my.h. Rename the functions, fill in your way of determining the irrelevant bytes at the start and at the end, as well as a way to find size and magic. Add a matching line to the extra[] table in auto.c, compile, test and submit it to me.

Why freedup does not use hash functions by default

Hash functions should speed up freedup, since they avoid comparing files again that have been scanned before (and might differ only in the last characters). But freedup is slowed down if files of the same size differ early. In that case you should switch the hash function off, which is now the default. If many files of the same size (more than just two) are likely to be identical, it probably pays to switch hash functions on. There is an internal hash function that allows some interesting speed enhancements (see below). External hash functions are kept, since they may be useful for checking the internal one for correctness.

The new algorithm (starting in version 1.3-1) records hash sums on the fly and is, in the worst case and depending on the CPU, half as fast as running without hash functions. While files are read, the hash is calculated until the comparison fails. The hash context is stored until the next comparison takes place, and if that one fails at a later block, the hash calculation is continued where it stopped earlier. Since reading and comparing files works on data blocks (predefined 4k), the hash value can sometimes be completed although the comparison fails.
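
The idea of hashing on the fly with a stored context can be sketched as follows. This is an illustration in Python, not freedup's internal hash code: the class HashedFile, the function compare and the use of SHA-1 are assumptions made for this example. Each file keeps its hash context and the number of bytes already hashed, a later comparison continues the hash from that point, and a finished hash is only used to skip a comparison, never to declare two files equal.

```python
import hashlib

BLOCK = 4096  # predefined block size from the text


class HashedFile:
    """A path plus a hash context that remembers how far it has got."""

    def __init__(self, path):
        self.path = path
        self.ctx = hashlib.sha1()  # assumed hash function
        self.hashed = 0  # number of bytes fed into ctx so far
        self.digest = None  # set once the whole file has been hashed

    def feed(self, offset, block):
        """Continue the hash only if this block is the next unhashed one."""
        if self.digest is not None or offset != self.hashed:
            return
        self.ctx.update(block)
        self.hashed += len(block)
        if len(block) < BLOCK:  # short or empty read: end of file reached
            self.digest = self.ctx.digest()


def compare(a, b):
    """Byte-by-byte comparison that updates both hash contexts as it reads."""
    if a.digest is not None and b.digest is not None and a.digest != b.digest:
        return False  # known different hashes: no need to read the files
    with open(a.path, "rb") as fa, open(b.path, "rb") as fb:
        offset = 0
        while True:
            ba, bb = fa.read(BLOCK), fb.read(BLOCK)
            a.feed(offset, ba)
            b.feed(offset, bb)
            if ba != bb:
                return False  # the hash may still have been completed above
            if not ba:
                return True
            offset += len(ba)
```

Because both blocks are hashed before they are compared, a mismatch in the final block still leaves both files with a complete hash, which matches the remark that hash values can sometimes be calculated although the comparison fails.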

`time ./freedup -x mp3 --hash ? -ni /testdir`: 7856 files; 1 match; average file size 46MB; 50% smaller than 4k; 2900 BogoMIPS; 2852 classic hash sums to avoid 3411 byte-by-byte comparisons.
Hash support	Parameter	Real Time	User Time	Sys Time
without hash support	`--hash 0`	2m04.646s	0m00.599s	0m03.455s
with classic hashsum	`--hash 1`	5m31.221s	2m21.914s	0m16.303s
with advanced hash	`--hash 2`	1m59.720s	0m06.006s	0m03.515s

`time ./freedup --hash ? -n /mp3dir`: 7919 files; 0 matches; all around average file size 4.5MB; 1400 BogoMIPS; 4502 classic hash sums to avoid 3819 byte-by-byte comparisons.
Hash support	Parameter	Real Time	User Time	Sys Time
without hash support	`--hash 0`	5m21.690s	0m15.130s	0m25.560s
with classic hashsum	`--hash 1`	45m14.048s	36m33.470s	2m29.380s
with advanced hash	`--hash 2`	10m01.311s	6m28.610s	0m28.150s

`time ./freedup --hash ? -x mp3 -n /mp3dir`: 7919 files; 456 duplicates; all around average file size 4.5MB; 1400 BogoMIPS; 4524 classic hash sums to avoid 3425 byte-by-byte comparisons.
Hash support	Parameter	Real Time	User Time	Sys Time
without hash support	`--hash 0`	6m48.276s	0m18.590s	0m28.600s
with classic hashsum	`--hash 1`	49m35.108s	37m06.450s	2m47.400s
with advanced hash	`--hash 2`	12m33.688s	6m51.530s	0m31.090s

As these results show, the advantage of hash functions is not obvious for most environments. I assume that there are situations where many files have the same size and quite similar contents. In those cases one should switch hash function usage to the advanced mode. But since I do not intend to rely on hash results without a byte-by-byte comparison, the default has been off since freedup 1.3-2.