API: to_datetime(ints, unit) give requested unit #63347
Conversation
jorisvandenbossche left a comment:
Looks good!
```python
# Without this as_unit cast, we would fail to overflow
# and get much-too-large dates
return to_datetime(new_data, errors="raise", unit=date_unit).dt.as_unit(
    "ns"
)
```
I don't quite follow that comment. new_data are integers here? Why does the return unit of this function need to be nanoseconds (to preserve current functionality)? And why would this otherwise give (wrong?) much-too-large dates?
This is inside a block that tries large units and, if they overflow, then tries smaller units. This PR makes the large units not overflow in cases where this piece of code expects them to. Without this edit, e.g. pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_date_unit fails with
```
left  = DatetimeIndex(['30004724859-08-03', '30007462766-08-06', '30010200673-08-08',
                       '30012938580-08-10', '300... '30106027418-11-06', '30108765325-11-08', '30111503232-11-10'],
                      dtype='datetime64[s]', freq=None)
right = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
                       '2000-01-07', '2000-01-10', '200...2000-02-08', '2000-02-09',
                       '2000-02-10', '2000-02-11'],
                      dtype='datetime64[ns]', freq=None)
```
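The retry loop being discussed can be sketched roughly as follows. This is a simplified, hypothetical stand-in for pandas' internal JSON date conversion (the function name and unit order are assumptions, not the actual implementation), assuming pandas 2.x:

```python
import pandas as pd

def try_convert_to_date(new_data, date_units=("s", "ms", "us", "ns")):
    # Hypothetical sketch: try each candidate unit in turn. If interpreting
    # the integers in that unit produces dates outside the nanosecond bounds,
    # the conversion/cast raises and the next (smaller) unit is tried instead.
    for date_unit in date_units:
        try:
            return pd.to_datetime(
                new_data, errors="raise", unit=date_unit
            ).as_unit("ns")
        except (ValueError, OverflowError):
            # OutOfBoundsDatetime is a ValueError subclass, so it is caught here
            continue
    raise ValueError(f"could not convert {new_data!r} to dates")
```

For example, an integer that actually encodes milliseconds overflows the nanosecond bounds when first interpreted as seconds, so the loop falls through to the "ms" attempt.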
Is my previous comment clear? And if so, any suggestions for how to adapt it into a clearer code comment?
> This is inside a block that tries large units and if they overflow then tries smaller units.
OK, that was the context I was missing. But then I still don't entirely get how this was currently working.
The dates you show above, like '2000-01-03', fit in the range of all units. So how would the integer value for that ever overflow?
If I put a breakpoint specifically for OverflowError on the line below and run the full test_pandas.py file, it is never hit.
Fetched the branch to play a bit with the tests: I was misled by the OverflowError, because it is actually OutOfBoundsDatetime that is being raised when trying to cast to nanoseconds.
So essentially this "infer unit" code assumes that the integer value came from a timestamp that originally had nanosecond resolution (or at least one that fits in nanosecond resolution)? Which makes sense from the time we only supported ns.
(Nowadays though .. but that is for another issue)
We could also do a manual bounds check instead of the casting (I don't know if we have an existing helper function for that)? Then we could keep the logic that infers the date_unit, but return the actual data in that unit, instead of forcing it to nanoseconds.
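To illustrate the OutOfBoundsDatetime point: a value that fits in datetime64[s] but lies outside the nanosecond range only raises when cast to "ns", not when constructed. A minimal sketch (the specific value is an arbitrary illustration, assuming pandas 2.x non-nanosecond support):

```python
import numpy as np
import pandas as pd

# 2**40 seconds since the epoch is roughly the year 36800: representable
# in datetime64[s], but far outside the ~+/-292-year datetime64[ns] range.
idx = pd.DatetimeIndex(np.array([2**40], dtype="M8[s]"))

try:
    idx.as_unit("ns")
    raised = False
except pd.errors.OutOfBoundsDatetime:
    # The cast, not the construction, is what fails
    raised = True
```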
(Also, for the case where the user specifies the unit, so we don't have to infer it, we actually don't need the force-cast to nanoseconds / bounds check, because that restriction is then not needed.)
```diff
  result = read_json(StringIO(json), typ="series")
- expected = ts.copy()
+ expected = ts.copy().dt.as_unit("ns")
```
Not for this PR, but so this is another case where we currently return ns unit but could change to use us by default?
Sure, but I'm inclined to leave that to Will to decide.
In general I think we should use the same default of microseconds whenever we infer / parse strings, and for IO formats that means whenever they don't store any type / unit information (in contrast to e.g. Parquet). We already do that for CSV, HTML, Excel, etc., so I don't think there is a reason not to do that for JSON. But I opened #63442 to track that.
```diff
  index = pd.MultiIndex(
-     levels=[[1, 2, 3], [pd.to_datetime("2000-01-01", unit="ns")]],
+     levels=[[1, 2, 3], [pd.to_datetime("2000-01-01", unit="ns").as_unit("ns")]],
```
Did this PR change that? (that this no longer returns nanoseconds)
Yes, but I didn't realize it until I just checked. I thought this PR only affected integer cases. I also didn't think that on main the unit keyword would have any effect in this case. So there are at least two things I need to look into.
OK, I think I've figured this out. By passing unit we go down a path that on main always gives "ns" (so the fact that unit="ns" here is irrelevant). Adding as_unit just restores the behavior on main.
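A quick way to see why the as_unit addition keeps the test stable: .as_unit pins the resolution regardless of what the parsing path inferred, so the expectation no longer depends on version- or path-specific inference. A hypothetical minimal example, assuming pandas 2.x:

```python
import pandas as pd

# The inferred resolution of a parsed Timestamp may vary by version/path;
# chaining .as_unit("ns") makes the expected resolution explicit.
ts = pd.to_datetime("2000-01-01").as_unit("ns")
```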
```python
for date_unit in date_units:
    try:
        return to_datetime(new_data, errors="raise", unit=date_unit)
# Without this as_unit cast, we would fail to overflow
```
Something like this:

```diff
- # Without this as_unit cast, we would fail to overflow
+ # In case of multiple possible units, infer the likely unit based on the first unit
+ # for which the parsed dates fit within the nanoseconds bounds
+ # -> do as_unit cast to ensure OutOfBounds error
```
to_datetime analogue of #63303.