A bit of background first. I have used the technique of having a production directory that everyone in the studio references/executes files from, and a secondary development directory for testing etc.
When I feel like the development pipeline is in a stable state, I make a copy of the development directory, and point everyone’s startup scripts to this. After a couple of days, I should be fairly certain that no one is using the old pipeline.
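To illustrate the “pointing”, the switch could be as simple as a version pointer file that the startup scripts resolve on launch, so the scripts themselves never need editing. A rough sketch, with all paths and names made up:

```python
# Rough sketch of a startup script resolving the "current" release.
# The share, the pointer file and the version names are all hypothetical.
import os
import sys

RELEASES = r"\\server\pipeline\releases"         # hypothetical share
POINTER = os.path.join(RELEASES, "current.txt")  # e.g. contains a release name

with open(POINTER) as f:
    release = f.read().strip()

pipeline_root = os.path.join(RELEASES, release)

# Prepend the chosen release so it wins over anything stale already on the path.
sys.path.insert(0, pipeline_root)
os.environ["PYTHONPATH"] = os.pathsep.join(
    [pipeline_root, os.environ.get("PYTHONPATH", "")]
)
```

Swapping what the pointer says then switches everyone the next time their startup script runs.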
The problem is that in an environment with 100+ machines, not all machines get restarted, so some are still using the old pipeline. The issue is that Pyblish-qml is run at machine start, and so somewhere a machine is still holding onto some files within pyblish-qml.
I’m finding it impossible to track down the culprit machines. This is not exclusive to pyblish-qml either; there are other files, for example from ftrack-connect, that get held onto as well.
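In principle something like this, run on a suspect machine, could list which processes still hold files open under the share. A minimal sketch, assuming psutil is available on the machines and with a made-up share path:

```python
# Minimal sketch: list local processes holding files open under the shared
# pipeline directory. Assumes psutil is installed; the path is hypothetical.
# Note that memory-mapped files (e.g. loaded DLLs) won't necessarily show up.
import psutil

PIPELINE_ROOT = r"\\server\pipeline\production"  # hypothetical share

for proc in psutil.process_iter(["pid", "name"]):
    try:
        for handle in proc.open_files():
            if handle.path.lower().startswith(PIPELINE_ROOT.lower()):
                print(proc.info["pid"], proc.info["name"], handle.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        # Some processes won't let us inspect their open files.
        continue
```

But running that by hand across 100+ machines isn’t realistic, which is why it feels impossible to chase.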
Currently the only way to overcome this problem, as far as I can see, is to have a local install of the pipeline on each machine. This, though, brings the issues of updating to a new level.
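If I did go the local-install route, updating would presumably become a sync step in the startup script itself. A sketch of what that might look like on Windows, again with made-up paths:

```python
# Sketch of a local-install update step using robocopy on Windows.
# The network share and the local destination are hypothetical.
import subprocess

SOURCE = r"\\server\pipeline\production"  # hypothetical network share
DEST = r"C:\pipeline"                     # hypothetical local install

# /MIR mirrors the tree; robocopy exit codes below 8 indicate success.
code = subprocess.call(["robocopy", SOURCE, DEST, "/MIR", "/R:1", "/W:1"])
if code >= 8:
    raise RuntimeError("Pipeline sync failed, robocopy exit code %d" % code)
```

Of course the in-use problem doesn’t disappear with this; it just moves to the local copy if the sync runs while something is still open.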
I’ve seen it done in at least a few studios of moderate and large size. The one thing I dislike about it, from an artist perspective, is that you end up with the infamous “Having trouble? Restart your computer” kind of mindset for solutions. Or worse, “It was working until I restarted!”
But as mentioned in the other thread, overwriting files is a hard problem to solve.
I have generally instructed people to try a restart of our startup script when having problems. This would be the same as restarting without rebooting the machine.
That might be enough, but I’m not too sure, because the OS can hold on to files even after an application is done with them. And it’s the OS that needs to release its handle before you can be sure nothing can go wrong.
I’d expect it to be fine in most cases, but when updating something like multiple git repositories of potentially thousands of files each that may all be accessed in one way or another, leakage could happen. The kind of leakage that gives way to some very subtle bugs.
I was hoping that because I terminate the ftrack-connect and pyblish-qml (pyblish-tray) processes with the startup script, I would be able to “just” overwrite files. Definitely something to look out for.
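For anyone curious, a terminate-before-overwrite step can look roughly like this; the process names here are placeholders and psutil is just one way to do it:

```python
# Rough sketch of terminating long-running pipeline processes before an
# overwrite. The process names are placeholders, not verified.
import psutil

TARGETS = {"ftrack-connect.exe", "pyblish-tray.exe"}  # placeholder names

for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() in TARGETS:
        try:
            proc.terminate()
            proc.wait(timeout=10)  # give the process a chance to exit cleanly
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired):
            pass
```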
These long-running processes are always an issue, but the way we try to mitigate some of these issues is by using both metrics and exception handling, which give us more insight into the problems artists are facing.
We use Sentry to give us detailed tracebacks on any issues in our pipeline, including the versions of the packages we are running, so we can catch people using old software.
By adding tags for users, projects, sys.path, machines and package versions, we can get a clear picture of issues.
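Roughly speaking, the tagging boils down to something like this; the DSN and values are placeholders, just to show the shape:

```python
# Shape of the Sentry tagging; the DSN and the environment variables are
# placeholders, not a real configuration.
import getpass
import os
import socket
import sys

import sentry_sdk

sentry_sdk.init(
    dsn="https://publickey@o0.ingest.sentry.io/0",          # placeholder DSN
    release=os.environ.get("PIPELINE_VERSION", "unknown"),  # catches old installs
)

sentry_sdk.set_user({"username": getpass.getuser()})
sentry_sdk.set_tag("machine", socket.gethostname())
sentry_sdk.set_tag("project", os.environ.get("PROJECT", "unknown"))

# Tags have a short length limit, so bulky values like sys.path go in as
# context on the event rather than as a tag.
sentry_sdk.set_context("environment", {"sys_path": sys.path})
```

Once an old version shows up in a traceback’s tags, it’s obvious which machine never got the update.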
But bigger than that is having an engineering culture of releasing software with messages to the team telling them to update.
Not really, but we don’t use QML here and our only long-running process is an internal version of ftrack-connect.
The problem when issues arise is getting enough usable metrics out of the system to pinpoint the issue quickly. For that, any type of monitoring or exception handling is worth its weight in gold.
We do fall into the trap of restarting sometimes. One thing we do is ask everyone to log out at the end of the day.
This helps with the render farm (where logged-out machines get picked up automatically) and also with problems like this.
I know this isn’t what you meant, but for completeness I would highlight that this isn’t necessarily a problem unique to pyblish-qml. Any file in use during overwriting is at risk of trouble. It’s true that a long-running process is at greater risk, but in a shared environment any file is at some risk, and that risk is multiplied by the number of users.
In theory, and this is just me spit-balling, if a file is located on a server, then it’s possible that server could be configured such that no client could “lock” any file located on it, and would instead have its connection broken during an upgrade. This way the currently connected client(s) would suffer the blow, but that blow might not be an issue other than “oh, it randomly crashed this one time this one month”.
Random crashes aren’t great, but I’m skeptical that mysterious restarts as solutions to mysterious problems are any better. Done this way, an upgrade could never fail, and a user would be less likely to require a restart to stay up to date.
Maybe our IT guys don’t know about it, but here they couldn’t sever the connection. This means at least for our sake, I would need to investigate the syncing method.
I’ll keep poking at our IT guys, and see if they can come up with something.