To make matters worse, the duplicated images may not be that easy to spot, as their file names are often different.
In this article, I will describe a script that will run every night to check if you have duplicate images in your Pictures folder.
When a duplicate image is found, it will be moved to a folder called “DuplicatePictures” in the user directory.
Tested with Raspberry Pi 3 and 4 with the Dec 2020 release of Raspberry Pi OS.
The Root-Mean-Square Difference (RMS)
Python provides several methods of comparing images to determine if they are the same or possibly similar. The method that is being used in this script is the Root-Mean-Square Difference (RMS).
To measure how similar two images are, one can calculate the root-mean-square value of the difference between the images. If the images are exactly identical, this value is zero.
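As a minimal sketch of the idea (not the script used later in this article), the root-mean-square difference of two equally sized images can be computed directly from their pixel values. The function name `rms_difference` and the flat lists of grayscale values are assumptions made for illustration; with real photos, the pixel data would come from the opened image files.

```python
import math


def rms_difference(pixels_a, pixels_b):
    """Return the root-mean-square difference of two equally long pixel sequences."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    if not pixels_a:
        raise ValueError("images must not be empty")
    squared = [(a - b) ** 2 for a, b in zip(pixels_a, pixels_b)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical 2x2 grayscale images, flattened row by row.
image_a = [10, 20, 30, 40]
image_b = [10, 20, 30, 40]
image_c = [13, 24, 30, 40]

print(rms_difference(image_a, image_b))  # identical images give 0.0
print(rms_difference(image_a, image_c))  # sqrt((9 + 16) / 4) = 2.5
```

The larger the value, the more the two images differ; only pixel-for-pixel identical images produce exactly zero.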
The Python script to find duplicate images
The script takes the RMS value of an image and compares it to another image’s RMS value. In the code, these values come from Pillow’s ImageStat module, one per color channel. If the difference between the two images stays below the threshold set in the average_diff function, the images are considered the same. However, for safety reasons, no files are deleted automatically. Duplicates are quarantined in the DuplicatePictures directory, where you can review them from time to time.
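One detail worth keeping in mind when quarantining files: moving a file onto an existing file of the same name can silently replace it on Linux. The following is a hedged sketch of a move that never overwrites; the helper name `quarantine` and the `_1`, `_2` suffix scheme are assumptions for illustration and are not part of the original script.

```python
import os
import shutil
import tempfile


def quarantine(source_path, quarantine_dir):
    """Move source_path into quarantine_dir without overwriting an existing file."""
    os.makedirs(quarantine_dir, exist_ok=True)
    name = os.path.basename(source_path)
    stem, ext = os.path.splitext(name)
    target = os.path.join(quarantine_dir, name)
    counter = 1
    # Append _1, _2, ... until the name is free in the quarantine folder.
    while os.path.exists(target):
        target = os.path.join(quarantine_dir, f"{stem}_{counter}{ext}")
        counter += 1
    shutil.move(source_path, target)
    return target


# Demonstration in a temporary directory so no real photos are touched.
with tempfile.TemporaryDirectory() as tmp:
    pictures = os.path.join(tmp, "Pictures")
    duplicates = os.path.join(tmp, "DuplicatePictures")
    os.makedirs(pictures)
    for _ in range(2):
        path = os.path.join(pictures, "holiday.jpg")
        with open(path, "wb") as handle:
            handle.write(b"not a real image")
        print(os.path.basename(quarantine(path, duplicates)))
```

In the demonstration, the second file with the same name is stored under a suffixed name instead of replacing the first one.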
I tested various settings and found that a value of “0.01” in the average_diff function below works best for avoiding false positives while still making sure that full duplicates, i.e., images that are exactly the same rather than merely similar, are discovered reliably.
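The effect of the threshold can be illustrated with a small sketch. The helper `is_duplicate` and the sample channel values are hypothetical; they only mirror the kind of per-channel comparison described here.

```python
def is_duplicate(stats_a, stats_b, tolerance=0.01):
    """Return True if every per-channel value differs by less than tolerance."""
    if len(stats_a) != len(stats_b):
        return False
    return all(abs(a - b) < tolerance for a, b in zip(stats_a, stats_b))


# Hypothetical per-channel (R, G, B) values for three images.
original = [120.50, 98.25, 77.00]
exact_copy = [120.50, 98.25, 77.00]
similar_shot = [120.50, 98.75, 77.00]

print(is_duplicate(original, exact_copy))                 # True
print(is_duplicate(original, similar_shot))               # False, green differs by 0.5
print(is_duplicate(original, similar_shot, tolerance=1))  # True
```

A tight tolerance such as 0.01 only lets exact copies through, while a looser one would also flag merely similar shots.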
Credit for the script goes to Albatrossity on Stack Overflow; I only made some minor modifications.
There is currently one limitation: the script doesn’t scan the Pictures directory recursively, so all images have to be on the main level. If you are a Python master and would like to help me in making it scan all subdirectories, please contact me.
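One possible starting point for recursive scanning is `os.walk` from the standard library. This is only a sketch under assumptions: the function name `find_jpgs` is made up, it matches the extension case-insensitively (the script below only matches lowercase `.jpg`), and it is not wired into the script.

```python
import os
import tempfile


def find_jpgs(root_folder):
    """Return the paths of all .jpg files below root_folder, relative to it."""
    found = []
    for current_dir, _dir_names, file_names in os.walk(root_folder):
        for name in file_names:
            # Assumption: match the extension case-insensitively (.jpg and .JPG).
            if name.lower().endswith('.jpg'):
                full_path = os.path.join(current_dir, name)
                found.append(os.path.relpath(full_path, root_folder))
    return sorted(found)


# Demonstration on a temporary folder tree instead of a real photo library.
with tempfile.TemporaryDirectory() as pictures:
    os.makedirs(os.path.join(pictures, 'holiday', '2020'))
    for relative in ('a.jpg', 'notes.txt',
                     os.path.join('holiday', 'b.jpg'),
                     os.path.join('holiday', '2020', 'c.JPG')):
        with open(os.path.join(pictures, relative), 'wb') as handle:
            handle.write(b'placeholder')
    for path in find_jpgs(pictures):
        print(path)
```

Because the returned paths are relative to the image folder, they could in principle take the place of the plain file names. The quarantine step would then need either matching subfolders or unique file names in the duplicates folder.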
Create a new file and paste this code into it.
#!/usr/bin/env python3
import os
import shutil

from PIL import Image, ImageStat

'''
The script takes the per-channel ImageStat values of an image (called RMS
values in this article) and compares them to another image's values. If the
difference in every channel is less than 0.01 (calculated in the function
'average_diff'), the images are considered the same, and the duplicate is
moved to the duplicates folder.
'''

image_folder = r'/home/pi/Pictures'  # not yet recursive
duplicate_folder = r'/home/pi/DuplicatePictures'
image_files = []
rms_pixels = []

# Create directory for duplicates if it does not exist
if not os.path.exists(duplicate_folder):
    os.makedirs(duplicate_folder)


# Calculates the difference between the CURRENT image file's channel values
# and the channel values calculated at start
def average_diff(v1, v2):
    duplicate = False
    calculated_rms_difference = [v1[0] - v2[0], v1[1] - v2[1], v1[2] - v2[2]]
    if all(-0.01 < difference < 0.01 for difference in calculated_rms_difference):
        duplicate = True
    return duplicate


def quick_rms(images, rms):
    image_file_count = 0
    while image_file_count < len(images):
        image_file = images[image_file_count]
        check_duplicates = image_file.endswith('.jpg')
        if check_duplicates:
            rms_file_count = 0
            # Convert to RGB so that every image yields exactly three channel values
            original_image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
            rms_original = ImageStat.Stat(original_image).mean
            while rms_file_count < len(rms):
                rms_file = rms[rms_file_count]  # [R, G, B, file name]
                duplicate_image = average_diff(rms_original, rms_file)
                print('Checking: ', image_file_count, 'against', rms_file_count, 'of', len(rms), end='\r', flush=False)
                if image_file != rms_file[3] and duplicate_image:
                    source = os.path.join(image_folder, rms_file[3])
                    dest = os.path.join(duplicate_folder, rms_file[3])
                    shutil.move(source, dest)
                    image_to_remove = rms_file[3]
                    images.remove(image_to_remove)
                    rms.remove(rms_file)
                    print('MOVED to ' + duplicate_folder + ':' + image_to_remove)  # comment this out when in production environment
                else:
                    # Only advance when nothing was removed, so that no entry is skipped
                    rms_file_count += 1
        image_file_count += 1
    return


# Creates a list of images to be checked and compared; these are the images
# stored in the image folder set at the top of the script
for x in os.listdir(image_folder):
    image_files.append(x)

# Create a list with the channel values of all images. These are compared
# to the CURRENT image file in the list
for x in image_files:
    if x.endswith('.jpg'):
        compare_image = Image.open(os.path.join(image_folder, x)).convert('RGB')
        rms_pixel = ImageStat.Stat(compare_image).mean
        rms_pixel.append(x)
        rms_pixels.append(rms_pixel)
        print(rms_pixel)

# Driver code, runs the script
quick_rms(image_files, rms_pixels)
Save as “duplicates_finder.py” and exit the file.
To test it, put some images into your “Pictures” folder and duplicate some files.
Start the script with
python3 /home/pi/duplicates_finder.py
I ran a few speed tests out of curiosity.
For 1,000 images with a resolution of 1920 x 1200 px, it took 5 minutes and 30 seconds on a Raspberry Pi 4 and 7 minutes 52 seconds on a Raspberry Pi 3.
My test with 10,000 images took 1 hour and 24 minutes with a Pi 4 at an average CPU load of 25%. On a Pi 3, the time was 2 hours and 49 minutes at 27% CPU load.
The time doesn’t really matter since the script will run in the background during the night. It may also depend a little on the SD card that you are using.
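For a rough sense of how the run time scales, the figures above can be converted into seconds per image. This is plain arithmetic on the reported timings; the dictionary labels are made up for readability.

```python
# Timings reported above, as (number of images, total seconds).
timings = {
    "Pi 4, 1,000 images": (1_000, 5 * 60 + 30),
    "Pi 3, 1,000 images": (1_000, 7 * 60 + 52),
    "Pi 4, 10,000 images": (10_000, 1 * 3600 + 24 * 60),
    "Pi 3, 10,000 images": (10_000, 2 * 3600 + 49 * 60),
}

# Average time spent per image in each test run.
per_image = {label: seconds / count for label, (count, seconds) in timings.items()}

for label, value in per_image.items():
    print(f"{label}: {value:.3f} s per image")
```

The per-image time is higher for the larger library on both devices, which would be consistent with every image being compared against all the others.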
Autorun the script daily
We will use systemd to run the script once a day to make sure that your image folder is regularly scanned for duplicates.
Let’s first create the service file.
sudo nano /etc/systemd/system/duplicates_finder.service
Copy & paste
[Unit]
Description=Checks for duplicates

[Service]
Type=simple
User=pi
ExecStart=/usr/bin/python3 /home/pi/duplicates_finder.py
Save and close.
Create the corresponding .timer file with
sudo nano /etc/systemd/system/duplicates_finder.timer
Copy & paste
[Unit]
Description=Timer to check for duplicate images

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target
Save and close. Instead of “daily”, you can also insert “weekly” or “monthly”. Up to you to decide how often you want to run it.
Tell the system that you have added this file and want to enable this service.
sudo systemctl daemon-reload && sudo systemctl enable duplicates_finder.timer && sudo systemctl start duplicates_finder.timer
To check if everything is ok, type
sudo systemctl status duplicates_finder.timer
You should see something like this:
pi@Pi4:~ $ sudo systemctl status duplicates_finder.timer
● duplicates_finder.timer - Timer to check for duplicate images
   Loaded: loaded (/etc/systemd/system/duplicates_finder.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2020-12-17 13:11:15 GMT; 7min ago
  Trigger: Fri 2020-12-18 00:00:00 GMT; 10h left

Dec 17 13:11:15 Pi4 systemd: Started Timer to check for duplicate images.
If you don’t update your images often, running it once a month will also be fine, but there is no real cost associated with running it more frequently.
I was surprised to learn how many different algorithms Python offers to measure the similarity of images.
The RMS method looks battle-tested. With the above script, you can keep your photo library tidy and prevent photos from being shown disproportionately often because they happen to reside five times in your Pictures folder.
- Check the wifi signal strength of your Raspberry Pi digital picture frame before you hang it up on the wall
- How to start realtime systemd timer services as a crontab replacement on the Raspberry Pi
- The ultimate guide on using systemd to autostart scripts on the Raspberry Pi
- The beginner’s guide to working with the Terminal on the Raspberry Pi from a Windows, macOS, or Linux computer (2020 Version)