Maybe you had a lot of files scattered across different drives, and you added them all to a single git-annex repository. Some of the files are surely duplicates of others.

While git-annex stores the file contents efficiently, it would still help to clean up this mess if you could find, and perhaps remove, the duplicate files.

Here's a command line that will show duplicate sets of files grouped together (it assumes filenames without embedded spaces, since sort and uniq count fields split on whitespace):

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
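The output lists each set of duplicates as a blank-line-separated group of filenames, with the keys already stripped off by the final sed. With made-up file names it looks something like this:

photos/2011/beach.jpg
backup/photos/beach.jpg

docs/report.pdf
old-drive/report.pdf
old-drive/report-copy.pdf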

Here's a command line that will remove one file from each set of duplicates:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
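To instead keep one copy and remove all the other duplicates, the same pipeline can be adapted. This is only a sketch, not part of the tip above; it assumes GNU awk and filenames without embedded newlines:

# keep the first file in each duplicate group, remove the rest
git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//' | \
    awk 'BEGIN { keep=1 } /^$/ { keep=1; next } keep { keep=0; next } { print }' | \
    xargs -d '\n' git rm

The awk filter drops the first file of each blank-line-separated group and passes the rest on to git rm.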

--Joey

Very nice :) Just for reference, here's my Perl implementation. As per this discussion, it would be interesting to benchmark these two approaches and see whether one is substantially more efficient than the other with respect to CPU and memory usage.
Comment by http://adamspiers.myopenid.com/ Fri Dec 23 19:16:50 2011
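For a rough comparison of CPU time and peak memory use, each approach could be saved as a script and timed with GNU time; the script names below are just placeholders:

# GNU time's -v reports CPU time and maximum resident set size
/usr/bin/time -v ./find-dups.sh > /dev/null
/usr/bin/time -v perl ./find-dups.pl > /dev/null

(The shell builtin time doesn't accept -v, hence the full path to GNU time.)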