Earlier, we looked at checking whether the strings inside a file were unique and at sorting them, but we haven't yet performed a similar operation on files themselves. Before diving in, let's define what constitutes a duplicate file for the purpose of this recipe: a duplicate file is one that may have a different name but has the same contents as another file.
One way to compare the contents of files would be to remove all whitespace and check only the strings contained within. Alternatively, we could use tools such as sha512sum and md5sum to generate a hash of each file's contents (think of a fixed-length string of gibberish that is, for practical purposes, unique to those contents). The general flow would be as follows:
- Using this hash, we can compare it against a list of hashes already computed.
- If the hash matches, we...
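
The flow above can be sketched in Bash roughly as follows. This is a minimal illustration, not a definitive implementation: the function name find_duplicates is hypothetical, it assumes Bash 4 or later (for associative arrays) and GNU sha512sum, find, and sort, and it assumes that a match is simply reported rather than acted upon.

```bash
#!/usr/bin/env bash
# Sketch: report files under a directory whose contents hash to the same value.
# find_duplicates is a hypothetical helper name, not an existing tool.

find_duplicates() {
    local dir="${1:-.}"
    local -A seen=()    # maps hash -> first file seen with that hash
    local file hash

    # Walk the files in sorted order; NUL delimiters keep odd file names intact.
    while IFS= read -r -d '' file; do
        # Reading from stdin makes sha512sum print "<hash>  -"; keep the hash only.
        hash=$(sha512sum < "$file" | cut -d ' ' -f 1)

        if [[ -n "${seen[$hash]:-}" ]]; then
            # Assumption: a match is only reported, nothing is deleted or moved.
            printf 'duplicate: %s == %s\n' "$file" "${seen[$hash]}"
        else
            # First time this hash appears: remember which file produced it.
            seen[$hash]=$file
        fi
    done < <(find "$dir" -type f -print0 | sort -z)
}
```

Calling find_duplicates with a directory name prints one line for each file whose contents match a file seen earlier. md5sum could be swapped in for sha512sum in the same way, although MD5 is known to be prone to collisions, so the longer hash is the safer choice when a false match would matter.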