The genius way to remove duplicate images from your picture folder

When you have the whole family adding photos to your digital photo frame, it is likely that the same photo will be uploaded twice by different people and possibly with different names.

So, wouldn’t it be convenient if duplicates were automatically detected and moved out of the main photo library?

I wrote an article about a possible way to achieve this two years ago, but I wanted to see if the described method was still the most convenient one.

I also wanted to find a way to include all possible subdirectories in the “photos” folder, something that didn’t quite work in my earlier script.

So, I installed the latest Raspberry Pi OS bookworm and read up on the latest methods.

Tested with a Raspberry Pi 4 and Raspberry Pi OS bookworm (Oct 2023)

The best way to compare images

The method used in the script below compares hashes of the image files, which is a fast and efficient way to identify exact duplicates. However, whether it’s the “best” way to compare images depends on your specific use case and the types of differences you want to detect.

The image hashing approach for duplicate detection has three advantages:

  • Image hashing is relatively fast, making it suitable for large collections of images.
  • This method is excellent at detecting exact duplicate images, even if they have different file names or file timestamps. The file content itself, including any embedded metadata, must be byte-for-byte identical.
  • With SHA-256, the probability that two different files produce the same hash is negligible, so false positives are practically ruled out.

However, there are also some limitations to this approach.

  • Image hashing can only detect exact duplicates. If you want to identify similar images with minor alterations, such as resizing, cropping, or small edits, you need a more advanced image comparison technique like perceptual hashing or image similarity measures.
  • The hash changes with even the smallest change to the file, including re-saving the image with different compression or editing its metadata. Therefore, it is not suitable for tasks where you want to find nearly identical images with minor variations.
  • If you need to find images based on their content, such as searching for images with specific objects or scenes, you would need other techniques like feature extraction and matching.
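
The following minimal sketch illustrates both sides of this trade-off. It uses short made-up byte strings as stand-ins for image files (an assumption for illustration; real files would be read from disk) and shows that identical bytes always give the same SHA-256 hash, while a single changed byte gives a different one.

```python
from hashlib import sha256

# Stand-in bytes for image files; real files would be read with open(path, "rb")
original = b"\xff\xd8\xff\xe0" + b"pretend JPEG payload"
same_content_other_name = bytes(original)   # a byte-for-byte copy under another name
one_byte_changed = original[:-1] + b"!"     # e.g. an edited pixel or metadata field

def digest(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return sha256(data).hexdigest()

print(digest(original) == digest(same_content_other_name))  # True: identical bytes
print(digest(original) == digest(one_byte_changed))         # False: any change alters the hash
```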

The image hashing method is suitable if you’re primarily interested in detecting exact duplicates.

If you need to handle near-duplicates or find images based on content, other techniques like the one described in my earlier blog post may be more appropriate.
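
To give a rough idea of how a perceptual hash differs from the SHA-256 approach, here is a toy "average hash" in plain Python. It is only a sketch under simplifying assumptions: the image is a hand-written 4x4 grid of grayscale values, and the helper names average_hash and hamming_distance are made up for this illustration. A real perceptual hash would first load, shrink, and convert the image to grayscale with an imaging library.

```python
def average_hash(pixels):
    """Return a bit string: 1 where a pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)

def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# A tiny 4x4 grayscale "image" (dark left half, bright right half)
image = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same image, slightly brightened: every byte differs, the structure does not
brightened = [[min(value + 5, 255) for value in row] for row in image]

distance = hamming_distance(average_hash(image), average_hash(brightened))
print(distance)  # a small distance suggests the images look alike
```

Unlike a SHA-256 comparison, the result here is a distance rather than a yes/no answer, so you would have to pick a threshold below which two images count as duplicates.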

How does the script work?

The method used in the script for comparing images is based on calculating a hash of the image content and then comparing these hashes to determine if two images are duplicates.

To compare images, it first calculates a hash for each image. I use the SHA-256 hashing algorithm in the script below to compute a hash of the image file’s binary data. This hash is effectively unique to the file content, meaning that even a one-byte difference will result in a different hash.
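
The script reads each file into memory in one go, which is fine for typical photos. If you have very large files and little RAM, the same hash can be computed piece by piece. The following is a sketch of that variant, not part of the script in this article; the function name hash_file_in_chunks and the 1 MB chunk size are my own assumptions.

```python
import os
import tempfile
from hashlib import sha256

def hash_file_in_chunks(file_path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading chunk_size bytes at a time."""
    hasher = sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Self-check: the chunked digest must equal the digest of the whole content
data = os.urandom(3 * 1024 * 1024 + 17)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    chunked_digest = hash_file_in_chunks(tmp.name)
    full_digest = sha256(data).hexdigest()
finally:
    os.remove(tmp.name)

print(chunked_digest == full_digest)
```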

As the script processes the images in the specified folder, it maintains a Python dictionary (image_hash_dict) where the image hash is the key, and the value is the path to the image file. This dictionary keeps track of which images have the same hash.

For each image encountered in the folder, the script calculates its hash. If the calculated hash already exists in the dictionary, it means that there is another image with the same content. This indicates a duplicate image.

When a duplicate is found, the script keeps the file it saw first and moves the copy it found later to a “duplicates” folder.
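
The dictionary logic described above can be tried out without touching any files. Here is a minimal sketch that uses a hypothetical find_duplicates helper and made-up in-memory bytes instead of real photos; it only reports duplicates and does not move anything.

```python
from hashlib import sha256

def find_duplicates(files):
    """files maps a path to its bytes; returns (duplicate_path, first_seen_path) pairs."""
    seen = {}        # hash -> path of the first file with that content
    duplicates = []
    for path, content in files.items():
        file_hash = sha256(content).hexdigest()
        if file_hash in seen:
            # Same hash as an earlier file: record it together with the file that is kept
            duplicates.append((path, seen[file_hash]))
        else:
            seen[file_hash] = path
    return duplicates

# Made-up sample data: two entries share the same content under different names
sample = {
    "photos/beach.jpg": b"beach bytes",
    "photos/grandma/IMG_0001.jpg": b"beach bytes",
    "photos/garden.png": b"garden bytes",
}
result = find_duplicates(sample)
print(result)
```

Because the dictionary lookup takes roughly constant time, each file is hashed and checked once, no matter how many photos have already been seen.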

Compared to my earlier article, the script will now traverse all subdirectories of the “photos” folder and check for duplicate images in them as well. The os.walk function is used to iterate through the directory tree, including all subdirectories.
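
If you want to see what os.walk does before running the full script, the following sketch builds a small temporary folder tree and lists the image files in it. The helper name list_image_files and the folder and file names are made up for this example.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def list_image_files(folder):
    """Return the sorted paths of all image files in folder and its subdirectories."""
    found = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            # Compare in lower case so that ".PNG" and ".png" are both accepted
            if name.lower().endswith(IMAGE_EXTENSIONS):
                found.append(os.path.join(root, name))
    return sorted(found)

with tempfile.TemporaryDirectory() as photos:
    os.makedirs(os.path.join(photos, "2023", "holiday"))
    test_files = (
        "a.jpg",
        os.path.join("2023", "b.PNG"),
        os.path.join("2023", "holiday", "c.jpeg"),
        os.path.join("2023", "notes.txt"),   # not an image, should be skipped
    )
    for relative in test_files:
        with open(os.path.join(photos, relative), "wb") as f:
            f.write(b"x")
    image_names = [os.path.relpath(p, photos) for p in list_image_files(photos)]

print(image_names)
```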

The Python script

So, without much further ado, here is the script:

#!/usr/bin/python3

import os
from PIL import Image
import shutil
from hashlib import sha256
import time

# Input folder where your images are located
input_folder = "photos"

# Output folder for duplicates
output_folder = "duplicates"

def calculate_image_hash(file_path):
    with open(file_path, "rb") as f:
        return sha256(f.read()).hexdigest()

def find_duplicate_images(folder):
    image_hash_dict = {}
    duplicate_count = 0  # Initialize a counter for duplicates

    start_time = time.time()  # Record start time

    for root, dirs, files in os.walk(folder):
        for file in files:
            file_path = os.path.join(root, file)
            if file.lower().endswith(('.jpg', '.jpeg', '.png')):
                image_hash = calculate_image_hash(file_path)

                if image_hash in image_hash_dict:
                    duplicate_count += 1  # Increment the duplicate count
                    duplicate_file = image_hash_dict[image_hash]
                    move_duplicate(file_path, duplicate_file)
                else:
                    image_hash_dict[image_hash] = file_path

    end_time = time.time()  # Record end time
    elapsed_time = end_time - start_time

    return duplicate_count, elapsed_time

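def move_duplicate(src, original):
    # src is the duplicate to move; original is the first file seen
    # with the same hash, which stays where it is
    os.makedirs(output_folder, exist_ok=True)
    target = os.path.join(output_folder, os.path.basename(src))
    # Avoid overwriting an earlier duplicate that has the same file name
    base, ext = os.path.splitext(target)
    counter = 1
    while os.path.exists(target):
        target = f"{base}_{counter}{ext}"
        counter += 1
    shutil.move(src, target)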

if __name__ == "__main__":
    duplicate_count, elapsed_time = find_duplicate_images(input_folder)

    if duplicate_count > 0:
        print(f"{duplicate_count} duplicate images found and moved to the 'duplicates' folder.")
    else:
        print("No duplicate images found.")

    print(f"Elapsed time: {elapsed_time:.2f} seconds")

Save as “duplicates_finder.py”.

Make sure that on your Pi, you have the directories “photos” and “duplicates”. The script uses relative folder names, so run it from the directory that contains both of them.

Put some images into your “photos” folder and duplicate some files to test it.

Start the script with

python3 duplicates_finder.py

I included a timer in the script that measures the elapsed time. The script also states how many duplicates have been identified.

With 250 test images and 33 duplicates, it took about 30 seconds to complete.

The exact time doesn’t matter much since the script will run in the background at night. It also depends somewhat on the speed of the SD card that you are using.

Depending on the Raspberry Pi OS version that you have installed, not all modules imported by this script may be available, and you may get an error message like “ModuleNotFoundError: No module named 'PIL'”. Note that the hashing itself only uses Python’s standard library; PIL is imported at the top of the script but is not called by the hashing code.

In this case, enter the following:

sudo apt install python3-pip -y && sudo apt install python3-pil -y

How to autorun the script daily

We will use systemd to run the script once daily to ensure that your image folder is regularly scanned for duplicates.

Let’s first create the service file.

sudo nano /etc/systemd/system/duplicates_finder.service

Copy & paste

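[Unit]
Description=Checks for duplicates

[Service]
Type=simple
User=pi
# The script uses relative folder names; this assumes "photos" and "duplicates" are in /home/pi
WorkingDirectory=/home/pi
ExecStart=/usr/bin/python3 /home/pi/duplicates_finder.py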

Save and close.

Create the corresponding .timer file with

sudo nano /etc/systemd/system/duplicates_finder.timer

Copy & paste

[Unit]
Description=Timer to check for duplicate images

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target

Save and close. Instead of “daily”, you can also insert “weekly” or “monthly”. It’s up to you to decide how often you want to run it.

Tell the system that you have added these files and want to enable and start the timer.

sudo systemctl daemon-reload && sudo systemctl enable duplicates_finder.timer && sudo systemctl start duplicates_finder.timer

To check if everything is ok, type

sudo systemctl status duplicates_finder.timer

You should see something like this:

● duplicates_finder.timer - Timer to check for duplicate images
     Loaded: loaded (/etc/systemd/system/duplicates_finder.timer; enabled; preset: enabled)
     Active: active (waiting) since Sun 2023-10-15 08:26:13 CEST; 15s ago
    Trigger: Mon 2023-10-16 00:00:00 CEST; 15h left
   Triggers: ● duplicates_finder.service

Oct 15 08:26:13 pi4test systemd[1]: Started duplicates_finder.timer - Timer to check for duplicate image>

If you don’t update your images often, running it once a month will also be fine, but there is no real cost associated with running it more frequently.

Conclusion

With the above script, you can keep your photo library tidy and avoid photos being shown disproportionately often because they happen to reside five times in your Pictures folder.
