How to automatically remove duplicate images from your Raspberry Pi digital frame photo library

If you allow family and friends to send pictures directly to your photo frame, either via email or Dropbox, you are likely to end up with duplicate images over time.

To make matters worse, the duplicated images may not be that easy to spot, as their file names are often different.

In this article, I will describe a script that will run every night to check if you have duplicate images in your Pictures folder.

When a duplicate image is found, it will be moved to a folder called “DuplicatePictures” in the user directory.

Tested on Raspberry Pi 3 and 4 with the December 2020 release of Raspberry Pi OS.

The Root-Mean-Square Difference (RMS)

Python provides several methods of comparing images to determine if they are the same or possibly similar. The method that is being used in this script is the Root-Mean-Square Difference (RMS).

To measure how similar two images are, one can calculate the root-mean-square value of the difference between the images. If the images are exactly identical, this value is zero.
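To make the definition concrete, here is a minimal sketch in plain Python. It assumes both images have already been flattened into equally long lists of pixel values; the function name rms_difference and the sample values are made up for illustration and are not part of the script described in this article.

```python
import math

def rms_difference(pixels_a, pixels_b):
    """Root-mean-square of the element-wise difference of two pixel lists."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("both images must have the same number of pixel values")
    squared = [(a - b) ** 2 for a, b in zip(pixels_a, pixels_b)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical 2x2 grayscale images, flattened row by row
original = [10, 20, 30, 40]
exact_copy = [10, 20, 30, 40]
brighter = [12, 22, 32, 42]  # every pixel is 2 levels brighter

print(rms_difference(original, exact_copy))  # 0.0
print(rms_difference(original, brighter))    # 2.0
```

The identical pair yields exactly zero, while a uniform brightness shift of 2 levels yields an RMS difference of 2.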

The Python script to find duplicate images

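The script computes one summary value per color channel (red, green, blue) for every image and compares these values between images. If all three values differ by less than 0.01, the images are considered the same. Note that the listing below reads Pillow's per-channel mean (ImageStat.Stat(...).mean) as that summary value, so it compares image statistics rather than computing a pixel-by-pixel RMS difference. For safety reasons, no files are deleted automatically. Duplicates are quarantined in the DuplicatePictures directory, where you can review them from time to time.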

I tested various settings and found that a value of “0.01” in the average_diff function below works best for avoiding false positives while still reliably detecting full duplicates, i.e., images that are not merely similar but exactly the same.
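As a rough illustration of how tight this tolerance is, the sketch below applies the same 0.01 limit to made-up per-channel values. The helper name within_tolerance and all numbers are assumptions for this example only; real values come from the images themselves.

```python
def within_tolerance(values_a, values_b, tolerance=0.01):
    """True if every channel differs by less than the tolerance."""
    return all(abs(a - b) < tolerance for a, b in zip(values_a, values_b))

# Hypothetical red, green and blue values of three images
photo = [120.510, 98.200, 77.030]
near_identical = [120.515, 98.204, 77.030]  # tiny deviations in two channels
similar_shot = [120.530, 98.200, 77.030]    # red channel is off by 0.02

print(within_tolerance(photo, near_identical))  # True
print(within_tolerance(photo, similar_shot))    # False
```

A single channel that is off by more than the tolerance is enough to treat the two images as different.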

Credit for the duplicate-finder script goes to Albatrossity on Stack Overflow; I only made some minor modifications.

There is currently one limitation: The script doesn’t scan the Pictures directory recursively, so all images have to be on the main level. If you are a Python master and would like to help me make it scan all subdirectories, please contact me.
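One possible way to lift this limitation is os.walk from the standard library, which visits a folder and all of its subfolders. The following is only a sketch and has not been integrated into the duplicate-finder script: the function name find_jpegs is made up, and the rest of the script would still need to be changed to work with full paths instead of bare file names.

```python
import os

def find_jpegs(root_folder):
    """Return the full paths of all JPEG files below root_folder, sorted."""
    matches = []
    for folder, _subfolders, file_names in os.walk(root_folder):
        for file_name in file_names:
            if file_name.lower().endswith(('.jpg', '.jpeg')):
                matches.append(os.path.join(folder, file_name))
    return sorted(matches)

# Assumed default location of the photo library on a Raspberry Pi;
# os.walk simply yields nothing if the folder does not exist
for path in find_jpegs('/home/pi/Pictures'):
    print(path)
```

Keep in mind that files in different subfolders can share the same name, so moving duplicates into a single quarantine folder would need a rule for name clashes.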

Create a new file and paste the following code into it.

#!/usr/bin/env python3

import os
import shutil

from PIL import Image, ImageStat

'''
The script calculates the per-channel values of every image and compares them with those of every other image. If all three channels differ by less than 0.01 (checked in the function 'average_diff'), the images are considered the same, and the duplicate is moved to the duplicates folder.
'''
image_folder = r'/home/pi/Pictures' # not yet recursive
duplicate_folder = r'/home/pi/DuplicatePictures'

# create directory for duplicates if it does not exist
os.makedirs(duplicate_folder, exist_ok=True)

# Returns True if the R, G and B values of two images each differ by less than 0.01
def average_diff(v1, v2):
    return all(abs(v1[i] - v2[i]) < 0.01 for i in range(3))

# Returns [R, G, B, file name] for one image; converting to RGB guarantees three channels,
# so grayscale or CMYK files no longer break the comparison
def image_stats(file_name):
    with Image.open(os.path.join(image_folder, file_name)) as image:
        values = ImageStat.Stat(image.convert('RGB')).mean
    values.append(file_name)
    return values

# Compares every image with all images that follow it in the list and moves the duplicates
def quick_rms(rms):
    current = 0
    while current < len(rms):
        rms_original = rms[current]
        candidate = current + 1
        while candidate < len(rms):
            rms_file = rms[candidate]
            print('Checking: ', current, 'against', candidate, 'of', len(rms), end='\r', flush=False)
            if average_diff(rms_original, rms_file):
                source = os.path.join(image_folder, rms_file[3])
                dest = os.path.join(duplicate_folder, rms_file[3])
                shutil.move(source, dest)
                # deleting the entry shifts the next one into this position, so do not advance
                del rms[candidate]
                print('MOVED to ' + duplicate_folder + ':' + rms_file[3]) # comment this out when in production environment
            else:
                candidate += 1
        current += 1

# Create a list with the values of all JPEG images in the image folder
# (sorted so that the same copy is kept on every run)
rms_pixels = []
for x in sorted(os.listdir(image_folder)):
    if x.lower().endswith(('.jpg', '.jpeg')):
        rms_pixel = image_stats(x)
        rms_pixels.append(rms_pixel)
        print(rms_pixel)

# Driver code, runs the script
quick_rms(rms_pixels)

Save it as “duplicates_finder.py” in your home directory (/home/pi) and exit the file.

To test it, put some images into your “Pictures” folder and duplicate some of them.
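If you would rather not copy files by hand, a few lines of Python can create the test duplicates for you. This is a convenience sketch only: the function name make_test_copies and the copy_of_ prefix are made up, and the folder path assumes the default Pictures location used in this article.

```python
import os
import shutil

def make_test_copies(folder, how_many=3):
    """Copy the first few JPEGs in folder under a new name and return the new paths."""
    jpegs = sorted(
        f for f in os.listdir(folder)
        if f.lower().endswith('.jpg') and not f.startswith('copy_of_')
    )
    copies = []
    for file_name in jpegs[:how_many]:
        target = os.path.join(folder, 'copy_of_' + file_name)
        shutil.copy2(os.path.join(folder, file_name), target)
        copies.append(target)
    return copies

pictures = '/home/pi/Pictures'  # assumed default location
if os.path.isdir(pictures):
    for path in make_test_copies(pictures):
        print('Created', path)
```

Because the copies have different file names but identical content, they are exactly the kind of duplicate the script is meant to catch.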

Start the script with

python3 duplicates_finder.py

I ran a few speed tests out of curiosity.

For 1,000 images with a resolution of 1920 x 1200 px, it took 5 minutes and 30 seconds on a Raspberry Pi 4 and 7 minutes 52 seconds on a Raspberry Pi 3.

My test with 10,000 images took 1 hour and 24 minutes with a Pi 4 at an average CPU load of 25%. On a Pi 3, the time was 2 hours and 49 minutes at 27% CPU load.

The time doesn’t really matter since the script will run in the background during the night. It may also depend a little on the SD card that you are using.

Autorun the script daily

We will use systemd to run the script once a day to make sure that your image folder is regularly scanned for duplicates.

Let’s first create the service file.

sudo nano /etc/systemd/system/duplicates_finder.service

Copy & paste

[Unit]
Description=Checks for duplicates

[Service]
Type=simple
User=pi
ExecStart=/usr/bin/python3 /home/pi/duplicates_finder.py

Save and close.

Create the corresponding .timer file with

sudo nano /etc/systemd/system/duplicates_finder.timer

Copy & paste

[Unit]
Description=Timer to check for duplicate images

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target

Save and close. Instead of “daily”, you can also insert “weekly” or “monthly”; it is up to you to decide how often you want to run it.

Tell the system that you have added these files and want to enable and start the timer.

sudo systemctl daemon-reload && sudo systemctl enable duplicates_finder.timer && sudo systemctl start duplicates_finder.timer

To check if everything is ok, type

sudo systemctl status duplicates_finder.timer

You should see something like this:

pi@Pi4:~ $ sudo systemctl status duplicates_finder.timer
● duplicates_finder.timer - Timer to check for duplicate images
   Loaded: loaded (/etc/systemd/system/duplicates_finder.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2020-12-17 13:11:15 GMT; 7min ago
  Trigger: Fri 2020-12-18 00:00:00 GMT; 10h left

Dec 17 13:11:15 Pi4 systemd[1]: Started Timer to check for duplicate images.

If you don’t update your images often, running it once a month will also be fine, but there is no real cost associated with running it more frequently.

Conclusion

I was surprised to learn how many different algorithms Python offers to measure the similarity of images.

The RMS method looks battle-tested. With the above script, you can keep your photo library tidy and prevent photos from being shown disproportionately often because five copies of them happen to sit in your Pictures folder.
