Skip to content

Conversation

@david-cortes-intel
Copy link
Contributor

Description

Sklearn allows passing X=None to all prediction-related methods of KNN estimators, but sklearnex takes that as an all-nan data, which is not the same. This PR fixes it.


Checklist:

Completeness and readability

  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

@david-cortes-intel david-cortes-intel added the bug Something isn't working label Dec 3, 2025
@david-cortes-intel
Copy link
Contributor Author

/intelci: run

@david-cortes-intel
Copy link
Contributor Author

/azp run Nightly

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
sklearnex/neighbors/knn_classification.py 50.00% 0 Missing and 3 partials ⚠️
Flag Coverage Δ
azure 80.41% <62.50%> (-0.02%) ⬇️
github 82.00% <62.50%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
onedal/neighbors/neighbors.py 80.05% <100.00%> (-0.28%) ⬇️
sklearnex/neighbors/common.py 92.52% <100.00%> (+0.04%) ⬆️
sklearnex/neighbors/knn_classification.py 94.59% <50.00%> (-4.00%) ⬇️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 158 to 161
data[0] is None
and method_name in ["predict", "predict_proba", "score"]
),
"Predictions on 'None' data are handled by internal sklearn methods.",
Copy link
Contributor

@icfaust icfaust Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this instead look for the _fit_X attribute if X is None? I'm not entirely sure I follow the solution of this PR (I do see the problem however).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that wouldn't give the same results. The sklearn docs mention:

In this case, the query point is not considered its own neighbor
https://github.com/scikit-learn/scikit-learn/blob/4a10d0ed8d85e6ed24a647bd28a65c0c64b101ef/sklearn/neighbors/_base.py#L767C13-L767C77

And one can see that forwarding the same fit data doesn't generate the same results:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(5,3))
y = rng.integers(2, size=X.shape[0])
knn = KNeighborsClassifier(n_neighbors=2).fit(X, y)
knn.predict_proba(None)
array([[0.5, 0.5],
       [1. , 0. ],
       [0. , 1. ],
       [0.5, 0.5],
       [0.5, 0.5]])
knn.predict_proba(X)
array([[1. , 0. ],
       [0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5],
       [0. , 1. ]])

Copy link
Contributor

@icfaust icfaust Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is why we have the following logic: https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/onedal/neighbors/neighbors.py#L297-L306 Just to full block it is a loss of functionality seems a bit heavy handed. At a minimum the explanation text for debugging needs to be made clearer, it needs to be discussed with someone with KNN knowledge on oneDAL side about this case @ethanglaser ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding the logic correctly, that would either way produce incorrect results in cases where there are observations with the same 'X' features but different 'y'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no, turns out that sklearn suffers from the same problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, even if this ends up offloading the operation to sklearn, since the class overrides the method kneighbors which sklearn ends up calling, the code from onedal.neighbors ends up being used instead of sklearn's.

Copy link
Contributor Author

@david-cortes-intel david-cortes-intel Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now modified it to use the sklearnex methods for predict_proba so that the log entry would be reflective of what happens, but it looks like there's no implementations of predict and score for X=None on either the onedal or sklearnex side, so we'll still need to fall back on those, unless they were to be reimplemented on the python side by calling kneighbors internally like sklearn does for theirs. Note that it still ends up calling oneDAL since it overrides the methods from the sklearn class that are called internally.

@david-cortes-intel
Copy link
Contributor Author

/intelci: run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants