Skip to content

Conversation

@HuijingHei
Copy link
Member

Move the continuous writing to service instead of systemd-run.
Fixes openshift/os#1751

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ostree.sync test to use a dedicated systemd service for continuous file writing, instead of systemd-run. This is a good improvement for robustness and clarity. The associated path changes from /var/tmp to /var/mnt are applied consistently. My review includes a suggestion to simplify the writer script by removing redundant sudo calls, as the script is already executed with root privileges by the systemd service.

@HuijingHei
Copy link
Member Author

HuijingHei commented Sep 19, 2025

Run ostree.sync manually on c10s/rhcos10, the result is passed.

[coreos-assembler]$ ./bin/kola run -p qemu -b rhcos --qemu-image /srv/data/scos-10.0.20250918-0-qemu.x86_64.qcow2 ostree.sync --multiply 5
=== RUN   ostree.sync0
=== RUN   ostree.sync1
=== RUN   ostree.sync2
=== RUN   ostree.sync3
=== RUN   ostree.sync4
--- PASS: ostree.sync0 (258.91s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync3 (258.19s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync4 (257.61s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync2 (258.52s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync1 (258.35s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
PASS, output in tmp/kola/qemu-2025-09-19-0338-7444

[coreos-assembler]$ ../coreos-assembler/bin/kola run -p qemu -b rhcos --qemu-image /srv/data/rhcos-10.1.20250911-0-qemu.x86_64.qcow2 ostree.sync --multiply 5
=== RUN   ostree.sync2
=== RUN   ostree.sync3
=== RUN   ostree.sync4
=== RUN   ostree.sync0
=== RUN   ostree.sync1
--- PASS: ostree.sync2 (258.38s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync0 (258.06s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync1 (260.50s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync4 (260.28s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
--- PASS: ostree.sync3 (263.38s)
        sync.go:221: Got NFS mount.
        sync.go:265: Found test=1 in kernel argument after rebooted.
PASS, output in tmp/kola/qemu-2025-09-19-0522-8047

@HuijingHei HuijingHei changed the title tests: fix ostree.sync test on c10s/rhcos10.1 tests: fix failed ostree.sync on c10s/rhcos10.1 Sep 19, 2025
@HuijingHei HuijingHei force-pushed the fix-sync-rhel10 branch 2 times, most recently from bd975fd to b975df9 Compare September 19, 2025 08:30
jbtrystram
jbtrystram previously approved these changes Sep 19, 2025
Copy link
Member

@jbtrystram jbtrystram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. minor suggestion

@HuijingHei
Copy link
Member Author

/hold
Do more testing and check that this is not stable.

Copy link
Member

@marmijo marmijo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall and matches what I saw during my local testing of ostree.sync: the rpm-ostree transaction completes fine, but systemd is failing to unmount the NFS client mounts under /var/mnt/data*.

One small change, I think, is to uncomment the ExecStopPost so the write unit cleans up it's own NFS mounts on stop. I tested the changes from this PR, with that line enabled, and ostree.sync test passed reliably on both rhel-10.1 and c10s (5 consecutive runs on each). The rest of the PR LGTM!

After=remote-fs.target
[Service]
ExecStart=/usr/local/bin/nfs-random-write.sh %I
#ExecStopPost=/bin/umount -lf /var/mnt/data%I || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#ExecStopPost=/bin/umount -lf /var/mnt/data%I || true
ExecStopPost=/bin/umount -lf /var/mnt/data%I || true

I think we should uncomment the ExecStopPost here so the service can clean up its own NFS mount which avoids the later var-mnt-data*.mount/var.mount shutdown unmount failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should uncomment the ExecStopPost here so the service can clean up its own NFS mount which avoids the later var-mnt-data*.mount/var.mount shutdown unmount failures.

Thank you @marmijo for your review and comments, I am trying to test to see how long the vm will start. Hopefully will find out how long we need for the client to boot, but agree this is a workaround.

Copy from comment
the client-10 hang about 806s ([23910.934397] ~ [23104.029251]) and finally success to reboot. While the client-9 hang 313s ([18009.805121] ~ [17696.012479]) which is much shorter.

Move the continuous writing to service instead of systemd-run.
Fixes openshift/os#1751
@openshift-ci
Copy link

openshift-ci bot commented Dec 18, 2025

@HuijingHei: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/rhcos 3d45308 link true /test rhcos

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[scos10] ostree.sync kola test failed

3 participants