If you give your family & friends permission to directly send pictures to your photo frame via email or Dropbox, you will likely end up with duplicate images over time.
To make matters worse, the duplicated images are not easy to spot, as their file names are often different.
In this article, I will describe a script that will run every night to check if you have duplicate images in your Pictures folder.
When a duplicate image is found, it will be moved to a folder called “DuplicatePictures” in your home directory.
Tested with Raspberry Pi 3 and 4 with the Dec 2020 release of Raspberry Pi OS. There is an updated script here.
The Root-Mean-Square Difference (RMS)
Python offers several ways to compare images and determine whether they are identical or merely similar. This script uses the Root-Mean-Square Difference (RMS).
To measure how similar two images are, one can calculate the root-mean-square value of the difference between the images. If the images are exactly identical, this value is zero.
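To make the idea concrete, here is a minimal sketch of an RMS difference between two images using Pillow. The function name rms_difference is my own, and combining the three channels into a single number is an assumption made for illustration; it is not taken from the script in this article.

```python
import math
from PIL import Image, ImageChops, ImageStat


def rms_difference(image_a, image_b):
    """Return the RMS of the pixel-wise difference, combined over R, G and B."""
    a = image_a.convert('RGB')
    b = image_b.convert('RGB')
    if a.size != b.size:
        raise ValueError('images must have the same dimensions')
    diff = ImageChops.difference(a, b)    # absolute difference per pixel
    per_band = ImageStat.Stat(diff).rms   # one RMS value per color channel
    # Combine the three channel values into one number (illustrative choice)
    return math.sqrt(sum(v * v for v in per_band) / len(per_band))


# Two small synthetic images instead of real photos
red = Image.new('RGB', (4, 4), (200, 0, 0))
slightly_different = Image.new('RGB', (4, 4), (203, 0, 0))
print(rms_difference(red, red.copy()))
print(rms_difference(red, slightly_different))
```

Identical images give a value of zero, and the value grows as the pixels drift apart.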
The Python script to find duplicate images
The script takes the RMS value of an image (in the listing below, the per-channel mean reported by Pillow's ImageStat.Stat) and compares it to the value of every other image. If the difference between two images is less than 0.01 in each of the three color channels, the images are considered the same. For safety reasons, no files are deleted automatically. Duplicates are quarantined in the DuplicatePictures directory, where you can take a look at them from time to time.
I tested various settings and found that a value of “0.01” in the average_diff function below works best for avoiding false positives, while still ensuring that full duplicates, i.e., exactly identical rather than merely similar images, are discovered reliably.
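The effect of the threshold can be tried out on small synthetic images. The following is only a sketch: the helper names channel_means and looks_identical are hypothetical, and the test images are generated in memory rather than read from the Pictures folder.

```python
from PIL import Image, ImageStat

TOLERANCE = 0.01  # same value as used in average_diff


def channel_means(image):
    """Per-channel mean pixel values, the fingerprint the script compares."""
    return ImageStat.Stat(image.convert('RGB')).mean


def looks_identical(means_a, means_b, tolerance=TOLERANCE):
    """True if every channel differs by less than the tolerance."""
    return all(abs(a - b) < tolerance for a, b in zip(means_a, means_b))


original = Image.new('RGB', (8, 8), (120, 60, 30))
exact_copy = original.copy()
brighter = Image.new('RGB', (8, 8), (121, 60, 30))  # red raised by one step

# Two different pictures with the same average: black/white and white/black
split = Image.new('RGB', (8, 8), (0, 0, 0))
split.paste((255, 255, 255), (4, 0, 8, 8))
mirrored = Image.new('RGB', (8, 8), (0, 0, 0))
mirrored.paste((255, 255, 255), (0, 0, 4, 8))

print(looks_identical(channel_means(original), channel_means(exact_copy)))
print(looks_identical(channel_means(original), channel_means(brighter)))
print(looks_identical(channel_means(split), channel_means(mirrored)))
```

The exact copy matches and the slightly brighter image does not. The last pair shows that two different pictures can still share the same channel averages, which is one more reason to quarantine matches instead of deleting them.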
Credit for the script goes to Albatrossity on Stack Overflow; I only made some minor modifications.
There is currently one limitation: the script doesn’t scan the Pictures directory recursively, so all images have to be on the main level. For an updated version of this script, head over here.
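For readers who keep photos in subfolders, collecting the files recursively could look roughly like this. It is a sketch only and not the updated script mentioned above; the function name find_jpgs is hypothetical, and unlike the script in this article it also accepts an upper-case “.JPG” extension.

```python
import os
import tempfile


def find_jpgs(root_folder):
    """Return the paths of all .jpg files below root_folder, relative to it."""
    found = []
    for current_dir, _subdirs, filenames in os.walk(root_folder):
        for name in filenames:
            if name.lower().endswith('.jpg'):
                full_path = os.path.join(current_dir, name)
                found.append(os.path.relpath(full_path, root_folder))
    return sorted(found)


# Small demonstration in a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, 'holidays', '2020'))
    for rel_path in ('a.jpg',
                     os.path.join('holidays', 'b.JPG'),
                     os.path.join('holidays', '2020', 'c.jpg'),
                     'notes.txt'):
        open(os.path.join(demo_root, rel_path), 'w').close()
    demo_result = find_jpgs(demo_root)

print(demo_result)
```

This helper is not part of the script listed in this article, which only looks at the top level of the Pictures folder.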
Create a new file and paste the following script into it.
#!/usr/bin/env python3
import os
from PIL import Image, ImageStat
import shutil
'''
The script takes the RMS value of an image and compares it to another image's RMS value.
If the difference between the two images is less than 0.01 in every color channel
(calculated in the function 'average_diff'), the images are considered the same,
and the duplicate is moved to the duplicates folder.
'''
image_folder = r'/home/pi/Pictures' # not yet recursive
duplicate_folder = r'/home/pi/DuplicatePictures'
image_files = []
rms_pixels = []
# create directory for duplicates if it does not exist
if not os.path.exists(duplicate_folder):
    os.makedirs(duplicate_folder)
# Function compares the CURRENT image file's values (v1) with the values calculated at start (v2)
def average_diff(v1, v2):
    # duplicate only if each of the three color channels differs by less than 0.01
    return all(abs(v1[i] - v2[i]) < 0.01 for i in range(3))
def quick_rms(images, rms):
    image_file_count = 0
    while image_file_count < len(images):
        image_file = images[image_file_count]
        check_duplicates = image_file.endswith('.jpg')
        if check_duplicates:
            rms_file_count = 1
            original_image = Image.open(os.path.join(image_folder, image_file))
            rms_original = ImageStat.Stat(original_image).mean
            while rms_file_count < len(rms):
                rms_file = rms[rms_file_count]
                duplicate_image = average_diff(rms_original, rms_file)
                print('Checking: ', image_file_count, 'against', rms_file_count, 'of', len(rms), end='\r', flush=False)
                # rms_file[3] holds the file name; never treat an image as its own duplicate
                if image_file != rms_file[3] and duplicate_image:
                    source = os.path.join(image_folder, rms_file[3])
                    dest = os.path.join(duplicate_folder, rms_file[3])
                    shutil.move(source, dest)
                    image_to_remove = rms_file[3]
                    images.remove(image_to_remove)
                    rms.remove(rms_file)
                    print('MOVED to ' + duplicate_folder + ':' + image_to_remove) # comment this out when in production environment
                    continue # the next entry has slid into this index, so do not advance
                rms_file_count += 1
        image_file_count += 1
    return
# Create a list of the files to be checked and compared; these are the files stored in image_folder at the top of the script
for x in os.listdir(image_folder):
    image_files.append(x)
# Create a list with the values of all images. These are compared to the CURRENT image file in the list
for x in image_files:
    if x.endswith('.jpg'):
        compare_image = Image.open(os.path.join(image_folder, x))
        rms_pixel = ImageStat.Stat(compare_image).mean
        rms_pixel.append(x)
        rms_pixels.append(rms_pixel)
        print(rms_pixel)
# Driver code, runs the script
quick_rms(image_files, rms_pixels)
Save as “duplicates_finder.py” in your home directory (/home/pi) and exit the file.
To test it, put some images into your “Pictures” folder and duplicate some files.
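If you don't want to experiment with real photos, a handful of test files can be generated instead. This is a sketch under my own assumptions: the function name make_test_set and the file names are made up, and the files are written to a temporary scratch folder rather than to /home/pi/Pictures, so you would have to pass your Pictures folder yourself to use them with the script.

```python
import os
import shutil
import tempfile
from PIL import Image


def make_test_set(folder, count=3):
    """Create `count` distinct JPGs in folder plus one renamed copy of each."""
    os.makedirs(folder, exist_ok=True)
    created = []
    for i in range(count):
        original = os.path.join(folder, f'test_{i}.jpg')
        color = ((40 * i) % 256, 100, (200 - 40 * i) % 256)
        Image.new('RGB', (64, 48), color).save(original)
        # A byte-for-byte copy under a different name, as a sender might produce
        duplicate = os.path.join(folder, f'from_email_{i}.jpg')
        shutil.copy(original, duplicate)
        created += [original, duplicate]
    return created


scratch = tempfile.mkdtemp()  # remove this folder again when you are done
files = make_test_set(scratch)
print(len(files), 'files written to', scratch)
```

Each original gets one copy with a different file name, which is the situation described at the beginning of this article.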
Start the script with
python3 duplicates_finder.py
I ran a few speed tests out of curiosity.
For 1,000 images with a resolution of 1920 x 1200 px, it took 5 minutes and 30 seconds on a Raspberry Pi 4 and 7 minutes 52 seconds on a Raspberry Pi 3.
My test with 10,000 images took 1 hour and 24 minutes with a Pi 4 at an average CPU load of 25%. On a Pi 3, the time was 2 hours and 49 minutes at 27% CPU load.
The time doesn’t really matter since the script runs in the background during the night. It may also depend a little on the SD card that you are using.
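To compare these runs, the timings above can be converted into images processed per second. The sketch below only does the arithmetic on the numbers quoted in this article; the function name images_per_second is my own.

```python
def images_per_second(image_count, hours=0, minutes=0, seconds=0):
    """Convert a measured run time into a throughput figure."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return image_count / total_seconds


# Timings as measured in the tests described above
runs = {
    'Pi 4, 1,000 images': images_per_second(1000, minutes=5, seconds=30),
    'Pi 3, 1,000 images': images_per_second(1000, minutes=7, seconds=52),
    'Pi 4, 10,000 images': images_per_second(10000, hours=1, minutes=24),
    'Pi 3, 10,000 images': images_per_second(10000, hours=2, minutes=49),
}

for label, rate in runs.items():
    print(f'{label}: {rate:.2f} images/s')
```

On the Pi 4, the rate drops from about 3 to about 2 images per second when going from 1,000 to 10,000 images. That would be consistent with the number of pairwise comparisons growing faster than the number of images, although I have not profiled the script to confirm it.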
Autorun the script daily
We will use systemd to run the script once a day to make sure that your image folder is regularly scanned for duplicates.
Let’s first create the service file.
sudo nano /etc/systemd/system/duplicates_finder.service
Copy & paste
[Unit]
Description=Checks for duplicates
[Service]
Type=simple
User=pi
ExecStart=/usr/bin/python3 /home/pi/duplicates_finder.py
Save and close.
Create the corresponding .timer file with
sudo nano /etc/systemd/system/duplicates_finder.timer
Copy & paste
[Unit]
Description=Timer to check for duplicate images
[Timer]
OnCalendar=daily
[Install]
WantedBy=timers.target
Save and close. Instead of “daily”, you can also insert “weekly” or “monthly”. Up to you to decide how often you want to run it.
Tell the system that you have added this file and want to enable this service.
sudo systemctl daemon-reload && sudo systemctl enable duplicates_finder.timer && sudo systemctl start duplicates_finder.timer
To check if everything is ok, type
sudo systemctl status duplicates_finder.timer
You should see something like this:
pi@Pi4:~ $ sudo systemctl status duplicates_finder.timer
● duplicates_finder.timer - Timer to check for duplicate images
Loaded: loaded (/etc/systemd/system/duplicates_finder.timer; enabled; vendor preset: enabled)
Active: active (waiting) since Thu 2020-12-17 13:11:15 GMT; 7min ago
Trigger: Fri 2020-12-18 00:00:00 GMT; 10h left
Dec 17 13:11:15 Pi4 systemd[1]: Started Timer to check for duplicate images.
If you don’t update your images often, running it once a month is also fine, but there is no real cost associated with running it more frequently.
Conclusion
I was surprised to learn how many different algorithms Python offers to measure the similarity of images.
The RMS method looks battle-tested. With the above script, you can keep your photo library tidy and prevent photos from being shown disproportionately often because they happen to reside five times in your Pictures folder.