@pavansai018

Summary

This PR ensures that corrupted / unreadable images are filtered out at dataset construction time, so they never appear in dataset.index.

This prevents runtime crashes in downstream code paths such as find_issues() and visualize(), which assume that every index corresponds to a readable image.

Fixes #222

Motivation

Currently, dataset indices are created purely from discovered filepaths (or integer indices for torchvision datasets), without checking whether the underlying image data is actually readable.

As a result:

  • Corrupted images can enter dataset.index
  • Errors surface later during visualization or issue detection
  • Failures occur far away from the root cause and are hard to recover from

Because visualize() and the issue managers materialize the full list of indices before accessing any image, catching errors lazily in __getitem__ comes too late: a corrupted index has already been selected by then.

The correct place to handle this is before the index is finalized.


What this PR changes

File-based datasets (FSDataset)

  • During filepath discovery, each image is opened once to check integrity
  • Corrupted images are silently skipped
  • Only valid image paths are included in _filepaths and dataset.index
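
The file-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `is_readable_image` and `discover_valid_filepaths` are hypothetical, and the sketch assumes Pillow's `Image.open` followed by `Image.verify` is an acceptable integrity check.

```python
from typing import Callable, Iterable, List


def is_readable_image(path: str) -> bool:
    """Return True if Pillow can open and verify the file at `path`.

    Hypothetical helper for illustration; not part of the library's API.
    """
    # Imported lazily so the filter below can also run with a custom checker.
    from PIL import Image

    try:
        with Image.open(path) as img:
            # verify() checks file integrity without decoding the pixel data.
            img.verify()
        return True
    except Exception:
        # Includes PIL.UnidentifiedImageError and OSError for missing
        # or truncated files.
        return False


def discover_valid_filepaths(
    candidates: Iterable[str],
    check: Callable[[str], bool] = is_readable_image,
) -> List[str]:
    """Keep only the paths that pass `check`, in sorted order."""
    return [path for path in sorted(candidates) if check(path)]
```

Passing the checker as a parameter is only a convenience for this sketch; it lets the filter be exercised without real image files.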

TorchVision datasets (TorchDataset)

  • Dataset indices are validated once during _set_index()
  • Any sample whose image cannot be accessed or is not a valid PIL.Image is excluded
  • dataset.index contains only readable samples
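
A comparable sketch for index-based datasets is shown below. It is an assumption-laden illustration rather than the actual `_set_index()` implementation: the function `readable_indices` and its `image_key` and `is_image` parameters are hypothetical, and samples are assumed to be either an image or a tuple such as `(image, label)`.

```python
from typing import Any, Callable, List


def is_pil_image(obj: Any) -> bool:
    """Return True if `obj` is a PIL image (hypothetical helper)."""
    # Imported lazily so the validator can be used with another predicate.
    from PIL import Image

    return isinstance(obj, Image.Image)


def readable_indices(
    dataset: Any,
    image_key: int = 0,
    is_image: Callable[[Any], bool] = is_pil_image,
) -> List[int]:
    """Return the indices whose sample can be loaded and holds a valid image."""
    valid: List[int] = []
    for idx in range(len(dataset)):
        try:
            sample = dataset[idx]
        except Exception:
            # The sample cannot be accessed at all, so exclude its index.
            continue
        # Assumption: tuple or list samples store the image at `image_key`.
        if isinstance(sample, (tuple, list)):
            image = sample[image_key]
        else:
            image = sample
        if is_image(image):
            valid.append(idx)
    return valid
```

Each sample is loaded exactly once here, which matches the one-time validation cost discussed under performance considerations.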

What this PR does NOT do

  • ❌ No changes to visualize() or issue managers
  • ❌ No None propagation or sentinel values
  • ❌ No behavior changes for valid datasets

All downstream code continues to rely on the existing invariant:

Every index in dataset.index maps to a valid image


Performance considerations

  • Each image/sample is checked once at dataset construction time
  • This one-time cost is unavoidable if corrupted entries are to be excluded from the dataset
  • For valid images, there is no additional overhead during processing or visualization

Result

  • dataset.index contains only entries that map to readable images
  • Visualization never crashes due to corrupted images
  • Errors are handled at the correct architectural boundary

This makes dataset handling more robust while keeping the rest of the codebase unchanged.

CLAassistant commented Dec 15, 2025

CLA assistant check
All committers have signed the CLA.

Successfully merging this pull request may close these issues.

Handle PIL UnidentifiedImageError exception when running cleanvision on local image folder dataset