dedup challenge

effbiae
New Contributor
interesting problem...

i have a hard drive that i know has duplicate files. you know how it goes: you back up photos from the laptop to the desktop PC, then you get a big disk, back up /both/ the laptop and the desktop, and end up with two copies of the photos on the big disk.

challenge: find the duplicate files

i started by running a `find` where my backups are stored - here's my recipe to get all the filenames and details into q:


$ sudo find /media/jack/ -exec ls -ld --full-time {} \; >list


q)flip (9#"S";" ")0:`:list
drwxr-x---+ 7  root   root  4096  2013-03-29 15:29:43.876494607 +1100 /media/jack/
drwxr-xr-x  33 root   root  4096  2012-11-09 21:07:37.364233083 +1100 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0
drwxr-xr-x  16 root   root  4096  2011-07-03 17:17:55.750995002 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var
drwxrwxrwt  2  root   root  6     2010-04-23 20:23:47.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/lock
drwxr-xr-x  7  root   root  120   2010-08-16 20:10:54.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run
drwxr-xr-x  2  root   root  6     2010-08-16 20:07:14.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/dbus
drwxr-xr-x  2  saned  root  6     2010-08-16 20:10:54.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/hplip
drwxr-xr-x  3  root   root  21    2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba
drwxr-xr-x  2  root   root  21    2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba/upgrades
-rw-r--r--  1  root   root  12416 2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba/upgrades/smb.conf
drwxrwxr-x  2  root   utmp  6     2010-08-16 20:08:56.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/screen
drwxr-xr-x  2  usbmux audio 6     2010-08-16 20:09:27.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/speech-dispatcher
..
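
one caveat with the listing above: `ls -l` output is space-separated, so a space-delimited load will cut up any path that contains spaces (photo folders often do). a possible alternative, sketched here under the assumption that GNU find is available, is to have find print only the fields needed for dedup, separated by tabs. the function name `list_files` is made up for illustration, and directories are skipped since only regular files can be duplicates.

```shell
#!/bin/sh
# list_files DIR (hypothetical helper name)
# Prints "<size in bytes><TAB><path>" for every regular file under DIR.
# Assumes GNU find (-printf is not POSIX). Paths that themselves contain
# tabs or newlines would still break the one-record-per-line format.
list_files() {
    find "$1" -type f -printf '%s\t%p\n'
}

# Example use, run as a user that can read the backup tree:
#   list_files /media/jack/ > list
```

a tab-delimited file like this should load in q with something along the lines of `("J*";"\t")0:`:list`, giving a size column and a path column, though i have not run that against this data.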

ta, jack
1 REPLY 1

bartosz_kaliszu
New Contributor
Check out 'fdupes' or the one-liner described here: http://ajayfromiiit.wordpress.com/2009/10/16/one-liner-to-find-and-remove-duplicate-files-in-linux/
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

This might give you hints on the approach.
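
To spell out what the one-liner does, here is the same idea broken into stages. It is only a sketch, not a drop-in tool: the function name `find_dupes` is made up, and it assumes GNU find, md5sum and uniq. The trick is that files of different sizes cannot be identical, so only files sharing a size with at least one other file need to be hashed.

```shell
#!/bin/sh
# find_dupes DIR (hypothetical helper name)
# Prints "<md5>  <path>" lines for files under DIR whose MD5 sums match,
# with a blank line between groups of identical files.
# Assumes GNU find, md5sum and uniq. File names containing newlines or
# backslashes are not handled (md5sum escapes them in its output).
#
# Stage 1: list the size of every non-empty regular file and keep the
#          sizes that occur more than once (uniq -d).
# Stage 2: for each such size, hash the files of exactly that size.
# Stage 3: sort by hash and keep only lines whose first 32 characters
#          (the MD5 sum) are repeated.
find_dupes() {
    dir=$1
    find "$dir" -type f -not -empty -printf '%s\n' | sort -n | uniq -d |
    while read -r size; do
        find "$dir" -type f -size "${size}c" -exec md5sum {} +
    done |
    sort | uniq -w32 --all-repeated=separate
}

# Example use:
#   find_dupes /media/jack/
```

Matching MD5 sums make it very likely that files are identical, but before deleting anything it is worth confirming a pair byte for byte, for example with `cmp`.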

Br,
B