Handle PIL UnidentifiedImageError exception when running cleanvision on local image folder dataset #263
Summary
This PR ensures that corrupted or unreadable images are filtered out at dataset construction time, so they never appear in dataset.index. This prevents runtime crashes in downstream code paths such as find_issues() and visualize(), which assume that every index corresponds to a readable image.
Fixes #222
Motivation
Currently, dataset indices are created purely from discovered filepaths (or integer indices for torchvision datasets), without checking whether the underlying image data is actually readable.
As a result, unreadable images can end up in dataset.index.
Since visualize() and issue managers materialize indices before accessing images, lazy handling in __getitem__ is insufficient. The correct place to handle this is before the index is finalized.
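A minimal sketch of filtering file paths before an index is built, assuming Pillow is used for the readability check. The names filter_readable and _open_with_pil are hypothetical and are not taken from this PR's diff; only Image.open() and Image.verify() are real Pillow calls.

```python
from typing import Callable, Iterable, List


def _open_with_pil(path: str) -> None:
    """Hypothetical default check: open the file and verify it with Pillow."""
    # Imported lazily so the filtering logic can be exercised without Pillow.
    from PIL import Image

    with Image.open(path) as img:
        # verify() checks file integrity without decoding the full image.
        img.verify()


def filter_readable(
    paths: Iterable[str],
    check: Callable[[str], None] = _open_with_pil,
) -> List[str]:
    """Return only the paths for which `check` does not raise."""
    readable = []
    for path in paths:
        try:
            check(path)
        except (OSError, SyntaxError):
            # PIL.UnidentifiedImageError is a subclass of OSError, so it is
            # caught here along with missing or truncated files.
            continue
        readable.append(path)
    return readable
```

An index built from the returned list contains only paths that passed the check, so code that materializes indices first never sees an unreadable entry.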
What this PR changes
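File-based datasets (FSDataset)
Unreadable files are excluded from _filepaths and dataset.index.
TorchVision datasets (TorchDataset)
Samples are checked in _set_index(); any sample that cannot be loaded as a PIL.Image is excluded, so dataset.index contains only readable samples.
What this PR does NOT do
It does not modify visualize() or issue managers.
It does not introduce None propagation or sentinel values.
All downstream code continues to rely on the existing invariant: every index in dataset.index corresponds to a readable image.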
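As an illustration of that invariant for index-based datasets, the sketch below keeps only the indices whose sample loads without raising. It is written under assumptions: readable_indices and FlakyDataset are hypothetical names, and the toy dataset stands in for a real torchvision dataset.

```python
from typing import Any, List


def readable_indices(dataset: Any) -> List[int]:
    """Return the indices whose sample can be loaded without raising."""
    kept = []
    for i in range(len(dataset)):
        try:
            dataset[i]
        except (OSError, SyntaxError):
            # Unreadable sample: leave it out of the index entirely.
            continue
        kept.append(i)
    return kept


class FlakyDataset:
    """Toy stand-in that raises OSError for samples marked as None."""

    def __init__(self, samples):
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, i):
        if self._samples[i] is None:
            raise OSError(f"cannot identify image at index {i}")
        return self._samples[i]


index = readable_indices(FlakyDataset(["img0", None, "img2"]))
print(index)  # [0, 2]
```

Note that this approach loads every sample once at construction time, which trades some startup cost for an index that downstream code can trust.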
Performance considerations
Result
dataset.index is always consistent.
This makes dataset handling more robust while keeping the rest of the codebase unchanged.