Nuke crash [wontfix]

tokejepsen · July 29, 2015, 4:22pm

Getting this crash in Nuke sometimes. It seems to happen when Nuke has been open for a long time, and restarting Nuke solves the problem.

tokejepsen · July 31, 2015, 8:41am

So the problem is here: http://docs.thefoundry.co.uk/nuke/80/pythonreference/nuke.executeInMain-pysrc.html

RunInMainThread is a built-in class, so we can’t look into it: http://docs.thefoundry.co.uk/nuke/80/pythonreference/nuke.RunInMainThread-class.html

I think the problem is that for some reason (haven’t been able to make a repro case :( ), RunInMainThread is not giving Python any exception set (or class?). I don’t even think it’s possible to reproduce this with Python alone, as this might be a C problem with RunInMainThread.

marcus · July 31, 2015, 6:51pm

Mm, this does look like we’re giving Nuke an itch somehow.

We might be able to create a reproducible case by spamming Nuke with empty calls out and back in from TCP, from a separate thread. As it only happens after a while, it might be a memory leak of sorts, and we might be triggering it. But at the end of the day, it sounds like it will either have to be solved by The Foundry, or worked around by us.

I’ll have a look at mocking something up next week, but if you feel the urge, you can have a look at xmlrpclib. It’s the standard Python module used for IPC. We’ll need a server running in a separate thread, and then we’ll spam it with a for x in xrange(10000) or similar until it spits out the same error.

Here’s something to get started.

# Server
import threading
from SimpleXMLRPCServer import SimpleXMLRPCServer

def add(x, y):
    return x + y

server = SimpleXMLRPCServer(("127.0.0.1", 8000))
server.register_function(add)

t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

From the same Nuke session, we’ll do something like this.

import xmlrpclib

p = xmlrpclib.ServerProxy("http://127.0.0.1:8000")

for x in xrange(1000):
    p.add(x, 1)

Then start increasing the number; it shouldn’t take long. I’d imagine the number of requests made when publishing for a few hours to be under 100,000.

marcus · August 3, 2015, 9:47am

Here’s a test you can try.

It takes around 8 minutes on my machine, and will lock up your Nuke during that time, max out one or two logical cores and consume about 1 GB of memory (which, interestingly, stays allocated once the test is over).

If you know any way of disabling the visual refresh in Nuke whilst it’s running, it might go faster and still be valid. I’m getting a flicker of nodes being created here, but shouldn’t be anything to worry about.

import nuke
import xmlrpclib
import threading
from SimpleXMLRPCServer import SimpleXMLRPCServer

def compute():
    """Perform something Nuke-ly"""
    node = nuke.executeInMainThreadWithResult(nuke.createNode, "Blur")
    nuke.executeInMainThreadWithResult(nuke.delete, node)
    return True

def runner():
    """Call Nuke over IPC a number of times"""
    p = xmlrpclib.ServerProxy("http://127.0.0.1:8000")
    for x in xrange(10000):
        p.compute()
    print "Done"

def serve():
    """`runner` calls `compute` through this server"""
    server = SimpleXMLRPCServer(("127.0.0.1", 8000), allow_none=True)
    server.register_function(compute)
    server.serve_forever()


for thread in (serve, runner):
    t = threading.Thread(target=thread)
    t.daemon = True
    t.start()

If nothing goes wrong, try upping the number in xrange to 100,000. For that to work, we’ll need to do something other than creating nodes, as that causes Nuke to accumulate memory, possibly 10x as much as with an xrange of 10,000.

The best thing you can do is to replace the contents of compute with something you use in plug-ins you think may have something to do with the crash, like communicating with ftrack.

Let me know how it goes.

tokejepsen · August 3, 2015, 10:11am

I saw it error, but the log calls aren’t stored so can’t post the result. It errored for a few calls, then went back to working.

marcus · August 3, 2015, 10:14am

Haha, that’s great, but I’m unsure of how to stop on error, or how to store the standard output… Can you think of anything?

I’ll get back to you if I think of something. Note that you can’t re-run the test as-is, as it will launch a new server and a new runner, so you will need to restart Nuke each time. It would need some modification to be made more interactive, should we need it.

Did you have time to tell whether it was the same error?

tokejepsen · August 3, 2015, 10:27am

Yeah, it was the same error.

marcus · August 3, 2015, 10:30am

Ok, that means the problem is either in Nuke, or your machine.

It works on my machine, on Nuke 8.0. The safest thing to do would be to have the test stop on error, so that we could run it on more than a single machine and get reliable output. The first step after that is to test on another machine. If the problem persists, then it’s time to submit a bug ticket to The Foundry.
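
As a rough sketch of what a stop-on-error runner could look like (run_until_error and the log file name are made up for illustration, they are not part of pyblish-nuke): it calls a function repeatedly, and on the first exception it appends the traceback to a file and stops. It assumes the failure reaches the caller as a Python exception; with xmlrpclib, an exception raised inside compute on the server should come back to the client as an xmlrpclib.Fault.

```python
import traceback


def run_until_error(call, count, log_path):
    """Call `call()` up to `count` times, stopping at the first exception.

    The traceback of the failing call is appended to `log_path`, so it
    survives even if the console output is lost.
    Returns the number of calls that succeeded.
    """
    for index in range(count):
        try:
            call()
        except Exception:
            with open(log_path, "a") as f:
                f.write("call %d failed\n" % index)
                f.write(traceback.format_exc())
            return index
    return count
```

In the test above, the loop in runner would then become something like run_until_error(p.compute, 10000, "nuke_ipc_test.log"), and the log file can be collected from each machine afterwards.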

The fact that you’re getting the error during the test means it isn’t related to Pyblish, so I’ll have to leave you to solving it from here.

tokejepsen · August 3, 2015, 10:36am

That’s cool, thanks for the code:)

tokejepsen · August 3, 2015, 10:52am

Would we not be able to add some extra fail-safes to pyblish-nuke to prevent this?

marcus · August 3, 2015, 10:54am

Possibly, but we first need to know what the problem is. At this point, it’s either in Nuke or on your local machine.

tokejepsen · August 3, 2015, 10:56am

ok, I’ll get a consistent test together and test on other machines

marcus · September 3, 2015, 11:01am

@tokejepsen has this happened to you since we solved the other issue with the occupied port number?

tokejepsen · September 3, 2015, 11:18am

No it hasn’t, everything has been quite stable lately.

We also switched all the Windows 8 machines to Windows 7. Don’t know what was wrong, but I couldn’t troubleshoot it either as I was working on a Windows 7 machine. Probably something with network login setup.

marcus · September 3, 2015, 11:20am

Ok, thanks for letting me know. A friend got in touch about having a similar issue. I’m forwarding this to him and will keep this thread updated if any other solutions pop up.

mkolar · February 26, 2016, 1:29pm

Seems like I have a reproducible case for this.

OS: Windows 10

Open an empty Nuke with Pyblish loaded.
Drop in this node (random tile) with 1000 tiles.
Duplicate it twice.
Get our favourite error whether you try to run Pyblish or the test that Marcus posted.

It seems to me that the number of nodes might be the problem, considering that under the hood random tile creates thousands of them. I don’t know for sure. What I do know is that this is a major bug in Nuke. It doesn’t have much to do with Pyblish, but it might be another good reason to actively start working on a workaround. From what I found, it seems that The Foundry knows about it and doesn’t really give a crap, as it’s not really affecting the software itself that much.

Right now I really don’t know how we’re going to deal with this. Maybe it’s a good time to start actively working on a PySide UI that doesn’t require RPC.

marcus · March 2, 2016, 8:02am

Sorry I completely missed this.

The problem is likely a threading issue. They are always a pain to debug. But what you can do, is try and force Nuke to go single-threaded with respect to the Pyblish thread. It isn’t a simple toggle, but it is possible.

Currently, the thread responds to requests immediately, but if you can find a way to double check which other threads are currently active, and delay running until all is clear (i.e. during idle) then I wouldn’t be surprised if this problem went away.

You will likely take a performance hit doing this, but it should be minimal; ~5% slower would be my guesstimate.

To try this out:

Do some StackOverflowing on how to inspect thread usage in Python
Do some Googling on how to wait for other/all threads (except main)
See _dispatch(), this is the function responsible for running the process() method in your plug-in after having received a request to do so from the GUI. This is the one you want to take a chill pill and await its turn.
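
A minimal sketch of the first two steps, assuming only Python-level threads matter: threading.enumerate() lists the live Python threads, and a polling loop waits until none are left besides the main thread and the caller. other_threads and wait_until_quiet are hypothetical names, not part of Pyblish or Nuke. Nuke’s own native threads won’t show up in threading.enumerate(), so this may well miss whatever is actually colliding.

```python
import threading
import time


def other_threads(ignore=()):
    """List live Python threads, skipping main, the caller and `ignore`.

    Assumption: the main thread still has its default name, "MainThread".
    """
    current = threading.current_thread()
    return [
        t for t in threading.enumerate()
        if t is not current
        and t.name != "MainThread"
        and t not in ignore
    ]


def wait_until_quiet(timeout=5.0, poll=0.01, ignore=()):
    """Block until no other Python thread is alive.

    Returns True once things are quiet, or False if `timeout` seconds
    pass first, so the caller can decide whether to run anyway.
    """
    deadline = time.time() + timeout
    while other_threads(ignore):
        if time.time() >= deadline:
            return False
        time.sleep(poll)
    return True
```

The idea would be to call wait_until_quiet() at the top of _dispatch(), before process() is run, passing any long-lived helper threads via ignore. Whether that actually makes the error go away is untested.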

Let me know if you need more help!

mkolar · March 2, 2016, 11:27am

Thanks! I’ll have a look at it when I find time.

For now we’ve stopped using Tray, which fixed the issue right away, and so far that doesn’t have any disadvantages compared to running Tray.

marcus · March 2, 2016, 3:09pm

Aw, no one is having any luck with Tray.