
Conversation

@brandonpayton
Member

Motivation for the change, related issues

In order to fix native file locking in Windows, we are moving native file locking into workers.

More details coming...

Implementation details

TBD

Testing Instructions (or ideally a Blueprint)

TBD

@brandonpayton
Member Author

We can do this with a FileLockManagerForPosix and a FileLockManagerForWindows, falling back to the FileLockManagerForNode when the native locking API is not available.

I've been working on the implementations for both. The main things that require care are:

  • Treat zero-length ranges as extending to the end of the file
  • Release locks when a related file descriptor is closed
  • Release locks when the process exits (or in this case, when a PHP request is completed)
  • Implement fcntl() semantics
    • Release fcntl() locks when any file descriptor for the locked file is closed by the locking process.
    • Locked ranges can be unlocked or merged piece by piece. In contrast, Windows locking requires that an unlock range corresponds exactly to the locked range.
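As a minimal sketch of the first bullet (TypeScript, with hypothetical names, not the actual Playground code), a zero-length request normalizes to a range reaching the end of the file:

```typescript
// Hypothetical helper: ranges are [start, end) byte offsets.
interface ByteRange {
    start: number;
    end: number;
}

function normalizeLockRange(
    start: number,
    length: number,
    fileSize: number
): ByteRange {
    // Per the first bullet: a zero-length range extends to the end of the file.
    return length === 0
        ? { start, end: Math.max(fileSize, start) }
        : { start, end: start + length };
}
```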

For POSIX, we can keep fcntl() handling fairly simple by keeping track of which files a process has locked via fcntl() and then unlocking the entire range via fcntl() when locks need to be released.
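A hedged sketch of that POSIX bookkeeping (class and method names are hypothetical, not the actual Playground code):

```typescript
// Hypothetical tracker: remember which paths each process has locked via
// fcntl(), then report which whole files to unlock on release.
class PosixFcntlLockTracker {
    private lockedPathsByPid = new Map<number, Set<string>>();

    recordLock(pid: number, path: string): void {
        let paths = this.lockedPathsByPid.get(pid);
        if (!paths) {
            paths = new Set();
            this.lockedPathsByPid.set(pid, paths);
        }
        paths.add(path);
    }

    // Returns the paths whose entire range should now be unlocked, e.g.
    // via fcntl(F_SETLK, F_UNLCK) with start 0 and length 0 (to EOF).
    releaseAll(pid: number): string[] {
        const paths = [...(this.lockedPathsByPid.get(pid) ?? [])];
        this.lockedPathsByPid.delete(pid);
        return paths;
    }
}
```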

For Windows, implementing fcntl() semantics is more complicated. We'll have to maintain a collection of locked ranges per file in order to be able to unlock those ranges. If a caller wants to unlock part of a range, we'll have to unlock the entire range and then re-obtain locks on the remaining portions of the original locked range. For shared locks, we can obtain the new ranges before releasing the original range, but for exclusive locks, we'll have to release the original range before attempting to obtain locks on the remaining ranges. (Based on a quick search about whether Windows allows overlapping exclusive locks by the same process.)
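The range math behind that unlock-then-re-lock flow can be sketched as a pure function (hypothetical names; ranges are [start, end) byte offsets):

```typescript
// Hypothetical helper: given a locked range and a requested unlock range,
// compute the sub-ranges that must be re-locked after unlocking the whole
// original range on Windows.
interface LockRange {
    start: number;
    end: number;
}

function remainingLockedRanges(
    locked: LockRange,
    unlock: LockRange
): LockRange[] {
    const remaining: LockRange[] = [];
    // Portion of the locked range before the unlock range, if any.
    if (unlock.start > locked.start) {
        remaining.push({
            start: locked.start,
            end: Math.min(unlock.start, locked.end),
        });
    }
    // Portion of the locked range after the unlock range, if any.
    if (unlock.end < locked.end) {
        remaining.push({
            start: Math.max(unlock.end, locked.start),
            end: locked.end,
        });
    }
    return remaining;
}
```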

The good news is that we are already tracking locked ranges in the FileLockManagerForNode. The work for the Windows locks shouldn't be that different.

cc @adamziel

@brandonpayton force-pushed the playground-cli/move-native-locking-into-workers branch from 60a6f0b to 6023b7d on December 9, 2025 18:10
@brandonpayton
Member Author

I roughed out native FileLockManagers for POSIX and Windows, but they are as yet untested. Tomorrow, I plan to start by adapting the native locking tests and testing these new classes.

@adamziel
Collaborator

The prerequisites for this one seem to be mostly in place. The CLI spawn handler now creates a new OS process for any spawned PHP subprocess. The request handler still uses multiple PHP instances, but it can be tuned down by adding maxPhpInstances: 1 to every bootWordPress* call in the worker-thread-v*.ts files – which we could do in this PR.

In #3014, I'm exploring a CI stress test to confirm multiple workers are indeed used for handling concurrent requests.

@brandonpayton
Member Author

@adamziel, all the FileLockManager tests are passing for the Windows version. I explicitly disabled the tests that assert partial lock/unlock and overlapping locked ranges. Those things aren't supported on Windows, and I'm not sure we'll be able to support them safely there. Regardless, I don't think we'll need them to support SQLite's locking scheme.

I think we might be able to test Playground CLI with the Windows lock integration before moving to separate worker processes, so I'm planning to try that tonight. It may give us a little more info.

Either way, these Windows tests make me think it's worth trying the multi-process approach here.

@brandonpayton
Member Author

brandonpayton commented Dec 17, 2025

@adamziel I was able to get Playground CLI booting with native file locking in Windows. With a very simple test it appears to be working well.

Tomorrow, I plan to prototype moving Playground CLI from multi-worker to multi-process. This is needed if we want to rely exclusively on native OS file locking because locks are per-fd and per-process. Otherwise the native OS will see locks held by php-wasm workers as locks held by the same process.

@adamziel
Collaborator

Amazing @brandonpayton!

@adamziel
Collaborator

Btw, I thought Playground CLI was already multi-process and we just needed to cap the number of workers to 1?

The purpose of this is to eliminate worker args that
are not serializable as JSON so we can move from workers
to separate processes which communicate via JSON-serialized
messages rather than via transferred or cloned objects.
@brandonpayton
Member Author

brandonpayton commented Dec 18, 2025

Btw, I thought Playground CLI was already multi-process and we just needed to cap the number of workers to 1?

@adamziel we've just been multi-worker-thread up to now. Last week, I realized that I'd thought we were multi-process already but was incorrect (mentioned here). But I think we are close to moving to multiple processes. I see how it can work using the Node.js cluster package and am working out the details (I think we're very close). The cool thing is that we won't need to pass the response from the worker process back to the main process... the worker will get the connection's socket and can respond directly.

@brandonpayton
Member Author

brandonpayton commented Dec 18, 2025

I have what I think should be a working Playground CLI process cluster, but I still need to wire up spawning another PHP worker from a child process (for proc_open(), I think). For that case, we probably just want to child_process.fork() a process rather than cluster.fork() one within the cluster. We'll see.

@adamziel
Collaborator

adamziel commented Dec 18, 2025

You know all that, but I saw fork() and just wanted to say it out loud: fork() in Node.js !== fork() in unix. Also, it will give you a new Node.js process rather than a worker.

@brandonpayton
Member Author

You know all that, but I saw fork() and just wanted to say it out loud: fork() in Node.js !== fork() in unix. Also, it will give you a new Node.js process rather than a worker.

Thanks for stating this explicitly!

😄 I just discovered cluster.fork() is more like POSIX fork() than child_process.fork(). By default, cluster.fork() creates a child that runs the main-thread/primary script, unless you explicitly override the child script path with cluster.setupPrimary({ exec: 'path-to-child-script' }) before calling fork().

In that case, it's like you're saying "actually this other script is the 'primary' you should fork() from". But it's not actually the primary script. It's just the script you want to run in a cluster. 🙃
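For illustration, a minimal sketch of that override (the exec path is a hypothetical placeholder):

```typescript
import cluster from 'node:cluster';

if (cluster.isPrimary) {
    // Without this, cluster.fork() re-runs the primary script in the child.
    // The exec path below is a hypothetical placeholder.
    cluster.setupPrimary({ exec: 'path/to/worker-script.js' });
    // cluster.fork() would now spawn worker-script.js as the "worker".
}
```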

@brandonpayton
Member Author

brandonpayton commented Dec 18, 2025

@adamziel @mho22 I pushed some WIP changes for running Playground CLI php-wasm workers as separate processes. Some notes:

  • It uses the Node.js cluster package to create worker processes.
    • Each worker initiates listening on the Playground CLI web server port.
    • cluster allows multiple cluster workers to listen on the same port.
    • On non-Windows, Node.js attempts to load balance with a round-robin approach.
    • On Windows, the first process that accepts a request services that request. According to Node.js cluster docs, this may change once libuv is able to share IOCP (I/O Completion Ports) handles efficiently on Windows.
    • express is no longer used.
    • The main Playground CLI process no longer services HTTP requests. Requests are completely handled by worker processes.
  • It currently works for Blueprints v1 only, though supporting v2 should be straightforward. In the meantime, Blueprints v2 is broken within this PR.
  • When spawning a worker for things like proc_open(), we use child_process.fork() because those processes are not part of the web server process cluster. They run PHP but do not handle incoming HTTP requests.
  • There is an IPC error printed to the console after we kill the initial worker used for WordPress setup. We just need to release the comlink handle more gracefully.
  • You still need to pass --experimental-multi-worker to get multiple workers.
  • My subjective user experience: For an instant, it feels like the server is slow to start responding, and then everything is lightning fast.
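The shared-port pattern from the notes above can be sketched roughly like this (port and worker count are hypothetical; this is a self-contained illustration, not the Playground code):

```typescript
import cluster from 'node:cluster';
import http from 'node:http';

const PORT = 8123; // hypothetical port

if (cluster.isPrimary) {
    // Fork one worker here for brevity; Playground CLI would fork several.
    const worker = cluster.fork();
    cluster.on('listening', async () => {
        const res = await fetch(`http://127.0.0.1:${PORT}/`);
        console.log(await res.text());
        worker.kill();
    });
} else {
    // Each worker listens on the SAME port; the cluster module hands
    // incoming connections to workers (round-robin on non-Windows).
    http.createServer((_req, res) => {
        res.end(`handled by worker ${process.pid}`);
    }).listen(PORT);
}
```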

This is all that is coming to me now, but I'll likely follow up and share more soon. :)

In the meantime, I'm working on:

  • Trying this in Windows
  • Tons of cleanup and reflection
  • Running more of the Playground CLI automated tests
  • Fixing type and lint errors

@brandonpayton
Member Author

Actually, the push failed. Fixing...

@brandonpayton
Member Author

The multi-process Playground CLI changes are now pushed.

@brandonpayton
Member Author

It's kinda wild, but multi-process Playground CLI worked with no modifications on Windows. 🎉 That's great (and unusual)!

I instrumented FileLockManagerForWindows to log when locking file byte ranges, and it is being used to lock the SQLite DB.

@brandonpayton
Member Author

Currently, IPC is JSON-based, but I just learned there is an "advanced" serialization option based on the HTML structured clone algorithm. It may not matter either way, but being able to pass instances of built-in objects is a blessing.
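A hedged sketch of the difference (hypothetical round-trip; with the default JSON serialization, the Map below would arrive flattened to a plain empty object):

```typescript
import { fork } from 'node:child_process';

if (process.send) {
    // Child: with "advanced" (structured clone) serialization, built-ins
    // like Map arrive intact instead of being flattened by JSON.
    process.on('message', (msg: Map<string, number>) => {
        process.send!({ isMap: msg instanceof Map, size: msg.size });
        process.exit(0);
    });
} else {
    // Fork this same script as the child, using advanced serialization.
    const child = fork(process.argv[1], { serialization: 'advanced' });
    child.on('message', (reply) => console.log(reply)); // → { isMap: true, size: 2 }
    child.send(new Map([['a', 1], ['b', 2]]));
}
```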

@brandonpayton
Member Author

Also, IPC via comlink isn't used much after startup. After that, the workers are just servicing HTTP requests directly.

@brandonpayton
Member Author

The Playground CLI file locking tests are passing. There are still other Playground CLI test failures, and I'm looking at those.

@brandonpayton
Member Author

The main Playground run-cli tests for Blueprints v1 are passing except for:

   ❯ other run-cli behaviors (2) 2842ms
     ❯ auto-login (1) 1074ms
       × should clear old auto-login cookie 1074ms
     ❯ error handling (1) 1768ms
       × should return 500 when the request handler throws an error 1768ms

The biggest cause of failures was that runCLI() was being used with a single worker process, which now has only a single PHP instance. Some tests needed more than a single instance. Switching to explicit multi-worker-process mode fixed them.

Ultimately, to land this PR, we'll need to always run multiple php-wasm worker processes.

I'm pleased with how well things appear to be going. Tomorrow, I'll continue cleaning this up and fixing tests.

@brandonpayton
Member Author

One note:
The auto-login cookie cleanup test is failing because that cookie cleanup isn't happening at the moment. That needs to be fixed.

It's a bit more awkward because, with a cluster of workers, there is no one place that can judge which request is the first (so we know the autologin-has-happened cookie cannot be from the current Playground CLI session).

@adamziel
Collaborator

adamziel commented Dec 19, 2025

Ultimately, to land this PR, we'll need to always run multiple php-wasm worker processes.

Sounds good and makes sense. Should we have a lower bound on the number of worker processes?

It's a bit more awkward because, with a cluster of workers, there is no one place that can judge which request is the first (so we know the autologin-has-happened cookie cannot be from the current Playground CLI session).

Maybe doing everything in PHP is not useful here? What do you think about moving some of the logic to start-server.ts where we know which request comes in first? We'd still need to keep it functional in the browser version but that's okay.

@brandonpayton
Member Author

Should we have a lower bound on the number of worker processes?

Yes! In single-worker mode, we have a default maximum of 5 php-wasm instances at a time.

Let's start our lower bound at 5 instances and see how it goes.

@brandonpayton
Member Author

Maybe doing everything in PHP is not useful here? What do you think about moving some of the logic to start-server.ts where we know which request comes in first? We'd still need to keep it functional in the browser version but that's okay.

The previous logic for clearing the autologin-has-happened cookie was in start-server.ts, but now there is no single place where it runs:

Every php-wasm worker process in the cluster calls startServer() to listen on the same port. That's how the cluster works: each member listens on the same port, and the cluster module coordinates which member gets the next request.

Maybe we can do something with a lock file. Will see :)
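One possible shape of the lock-file idea, sketched under the assumption that an exclusive-create ('wx', i.e. O_CREAT | O_EXCL) is atomic across the worker processes (marker path and function name are hypothetical):

```typescript
import { writeFileSync } from 'node:fs';

// Hypothetical helper: the first worker to create the marker file wins
// and knows the current request is the first of this CLI session.
function isFirstRequest(markerPath: string): boolean {
    try {
        // 'wx' fails with EEXIST if the file already exists.
        writeFileSync(markerPath, String(process.pid), { flag: 'wx' });
        return true;
    } catch (err: any) {
        if (err.code === 'EEXIST') return false;
        throw err;
    }
}
```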
