Pipeline updating and locked files

tokejepsen · July 7, 2016, 7:28am

A bit of background first. I have used the technique of having a production directory that everyone in the studio references/executes files from, and a secondary development directory that is for testing etc.

When I feel like the development pipeline is in a stable state, I make a copy of the development directory, and point everyone’s startup scripts to this. After a couple of days, I should be fairly certain that no one is using the old pipeline.
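A minimal sketch of that copy-and-repoint step, assuming the startup scripts read the active pipeline location from a small pointer file. The function name `publish_snapshot`, the `pipeline_<timestamp>` directory naming and the `current.txt` pointer are all hypothetical; nothing here is taken from an actual studio setup.

```python
import shutil
import time
from pathlib import Path


def publish_snapshot(dev_dir, prod_root, pointer_name="current.txt"):
    """Copy dev_dir into a new timestamped directory under prod_root.

    The path of the new snapshot is written to a pointer file that
    startup scripts could read. Existing snapshots are never modified,
    so machines still running from them are left alone.
    """
    prod_root = Path(prod_root)
    prod_root.mkdir(parents=True, exist_ok=True)

    # A fresh directory per release; copytree fails if it already exists.
    snapshot = prod_root / time.strftime("pipeline_%Y%m%d_%H%M%S")
    shutil.copytree(str(dev_dir), str(snapshot))

    # Write the pointer to a temporary file first, then swap it in,
    # so readers never see a half-written pointer.
    pointer = prod_root / pointer_name
    tmp = prod_root / (pointer_name + ".tmp")
    tmp.write_text(str(snapshot))
    tmp.replace(pointer)
    return snapshot
```

Because nothing is ever overwritten, an old snapshot only becomes a problem when it is time to delete it, which is where the locked files show up.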

The problem is that in an environment with 100+ machines, not all machines get restarted, and thus some are still using the old pipeline. The issue is that pyblish-qml is run at machine start, and so somewhere a machine is still holding onto some files within pyblish-qml.

I’m finding it impossible to track down the culprit machines. This is not exclusive to pyblish-qml, as there are other files, for example from ftrack-connect, that get held onto.
Currently the only way to overcome this problem, as far as I can see, is to have a local install of the pipeline on each machine. This though brings issues of updating to a new level.

marcus · July 7, 2016, 8:51am

CC’ing @Lars_van_der_Bijl, he manages these things at the scale you are looking for.

tokejepsen · July 12, 2016, 8:36am

So the best solution I can think of, is to sync the files to the local machine on startup/login.

This would ensure all machines are using local files.
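As a rough sketch of such a login-time sync, assuming a plain copy from the network share to a local folder is enough: the function names are hypothetical, and files deleted from the source are deliberately left in place locally, so a real sync would need to handle removals too.

```python
import os
import shutil


def _needs_copy(src, dst):
    """Return True if dst is missing or differs in size or modification time."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_tree(source, target):
    """Mirror source into target, copying only new or changed files.

    Returns the relative paths that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(source):
        rel_root = os.path.relpath(root, source)
        dest_root = os.path.normpath(os.path.join(target, rel_root))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_root, name)
            if _needs_copy(src, dst):
                # copy2 keeps the modification time, so an unchanged
                # file is skipped on the next run.
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return copied
```

Calling `sync_tree(network_pipeline_dir, local_pipeline_dir)` from the startup script would then leave each machine running from its own copy.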

marcus · July 12, 2016, 8:40am

I think that’s reasonable.

I’ve seen it done in at least a few studios of moderate and large size. The one thing I dislike about it, from an artist perspective, is that you end up with the infamous “Having trouble? Restart your computer” kind of mindset for solutions. Or worse, “It was working until I restarted!”

But as mentioned in the other thread, overwriting files is a hard problem to solve.

tokejepsen · July 12, 2016, 8:56am

I have generally instructed people to try a restart of our startup script when having problems. This would be the same as restarting without rebooting the machine.

marcus · July 12, 2016, 9:28am

That might be enough, but I’m not too sure because the OS can hold on to files even after an application is done with them. And it’s the OS that needs to release its handle before you can be sure nothing can go wrong.

I’d expect it to be fine in most cases, but when updating something like multiple git repositories of potentially thousands of files each that may all be accessed in one way or another, leakage could happen. The kind of leakage that gives way to some very subtle bugs.

tokejepsen · July 12, 2016, 9:31am

Good point!

I was hoping that because I terminate the ftrack-connect and pyblish-qml (pyblish-tray) processes with the startup script, I would be able to “just” overwrite files. Something to look out for definitely.
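One way to at least notice when “just” overwriting does not work is to collect the files that refuse to be replaced instead of stopping at the first error. This is only a sketch with a hypothetical helper name; it assumes a held file shows up as an `OSError`, which is typical on Windows (`PermissionError`) but not on Linux, where the copy usually succeeds even while the file is open.

```python
import shutil


def overwrite_files(pairs):
    """Copy each (src, dst) pair and return the ones that failed.

    Each failure is reported as (dst, error message) so the update
    can be retried or the stale file investigated later.
    """
    failed = []
    for src, dst in pairs:
        try:
            shutil.copy2(src, dst)
        except OSError as exc:
            # Covers PermissionError (file in use on Windows) as well
            # as missing sources and similar filesystem errors.
            failed.append((dst, str(exc)))
    return failed
```

An empty result only means the copies went through; it does not prove that no process is still running code loaded from the old files.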

Lars_van_der_Bijl · July 12, 2016, 3:27pm

sorry for the late response guys.

these long running processes are always an issue. but the way we try and mitigate some of these issues is using both metrics and exception handling that give us more insight into the problems artists are facing.

we use sentry to give us detailed tracebacks on any issues in our pipeline, including versions of the packages we are running, so we can catch people using old software.
by adding tags on users, projects, sys path, machines, package versions, we can get a clean picture of issues.
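A sketch of gathering that kind of tag set with only the standard library. How the tags are attached to an event depends on the Sentry client in use, which is not specified here, so that part is left out. The function name, the `PROJECT` environment variable and the `version.<name>` key format are assumptions for illustration.

```python
import getpass
import os
import socket
import sys


def collect_tags(project=None, packages=()):
    """Return a flat dict of context tags for an error report."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some environments have no resolvable user name.
        user = "unknown"

    tags = {
        "user": user,
        "machine": socket.gethostname(),
        # PROJECT is a hypothetical environment variable.
        "project": project or os.environ.get("PROJECT", "unknown"),
        "sys_path": os.pathsep.join(sys.path),
    }

    # Report the version of each already imported package, if it has one.
    for name in packages:
        module = sys.modules.get(name)
        tags["version." + name] = str(getattr(module, "__version__", "unknown"))
    return tags
```

With the sys path and package versions on every report, a machine still running from an old pipeline directory shows up the first time it raises an error.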

but bigger than that is having an engineering culture of releasing software with messages to the team telling them to update.

tokejepsen · July 12, 2016, 3:31pm

@Lars_van_der_Bijl, how do you distribute your code?

Do people have access to a common network drive, or does each machine have the necessary software/code on it?

Lars_van_der_Bijl · July 12, 2016, 3:33pm

we use a shared network drive distributed among our 2 offices.

our code is released using a CI server that tags releases and ensures the code gets released everywhere

tokejepsen · July 12, 2016, 3:35pm

I guess if you have any problems with locked files, you can easily pin-point which machine is causing problems?

Have you encountered this problem before?

Lars_van_der_Bijl · July 12, 2016, 3:42pm

not really, but we don’t use QML here and our only long running process is an internal version of ftrack-connect.

the problem when issues arise is getting enough usable metrics out of the system to pinpoint the issue quickly. for that any type of monitoring or exception handling is worth its weight in gold.

We do fall into the trap of restarting sometimes. one thing we do is ask everyone to log out at the end of the day.
this helps with the renderfarm (where logged out machines get automatic) and also with problems like this.

marcus · July 12, 2016, 8:54pm

I know this isn’t what you meant, but for completeness I would highlight that this isn’t necessarily a problem unique to pyblish-qml. Any file in use during overwriting is at risk of trouble. It’s true a long running process is at greater risk, but in a shared environment any file is at some risk, and that risk is multiplied by the number of users.

In theory, and this is just me spit-balling, if a file is located on a server, then it’s possible that server could be configured such that no client could “lock” any file located on it, and instead break the connection during an upgrade. This way the currently connected client(s) would suffer the blow, but that blow might not be an issue other than “oh, it randomly crashed this one time this one month”.

Random crashes aren’t great, but I’m skeptical mysterious restarts as solutions to mysterious problems are any better, and in this way an upgrade could never fail and a user would be less likely to require a restart to stay up to date.

tokejepsen · July 12, 2016, 11:33pm

Maybe our IT guys don’t know about it, but here they couldn’t sever the connection. This means at least for our sake, I would need to investigate the syncing method.

I’ll keep poking at our IT guys, and see if they can come up with something.