Responsibilities of Extractors

When it comes to the extracted data, is it the extractor’s responsibility to state what it extracted and where?

Or is it the integrator’s responsibility to find the expected output?

We’re making sure the Extractor identifies what it extracted and where, to make it easy for an Integrator to pick it up. For example, storing an “extracted” folder path on the instance, e.g. instance.data['extractDir'] = path.
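In code, a minimal sketch of that, assuming pyblish’s Extractor base class; the ExtractModel name and the export step are placeholders, and the extractDir assignment is the point:

import tempfile

import pyblish.api

class ExtractModel(pyblish.api.Extractor):
    def process(self, instance):
        # Any deterministic location works; a fresh temp dir keeps it simple
        dirname = tempfile.mkdtemp()

        # ... perform the actual export into `dirname` via the host's API ...

        # Record what was extracted and where, for the Integrator to pick up
        instance.data["extractDir"] = dirname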

I’m assuming this is the most straightforward way, since “searching” for extracted data is quite a broad and undefined task. You’d still need to confine the search to a certain location for it to be of adequate use (and speed, for that matter).

How were you thinking you’d use an Integrator without a clue about where data is extracted?
Like, how did you see an alternative being used?

Well, this is where I was getting confused and blurring the lines between extractors and integrators. Currently I’m relying on all extracted data being located next to the scene file, which means I can make assumptions during integration about where the data is.

I think I’ll put the responsibility on the extractor to tell the integrators where the data is. :)

All our Extractors inherit from a specific base class. I think it’s even the Magenta extractor class. This way they all share the same behavior.

Basically, in this folder we’d save the content for each instance, and the integrator would find it there.

Actually, looking at that class now, it’s somewhat confusing: it appears to take a context, whereas we’re passing it an instance so the data gets stored where it belongs, i.e. the folder for the instance. :)

I decided at the very start to leave this to the extractor. I came up with an internal guideline to keep it consistent to a certain degree.

The extractor always sets instance.data['outputPath'] and the integrator uses that. Different extractors use a variation of that key (outputPath_ma, outputPath_abc, outputPath_obj) to differentiate between them.

The integrator then goes through these and cleans up accordingly: https://github.com/mkolar/pyblish-kredenc/blob/master/pyblish_kredenc/plugins/maya/integrate_asset.py#L16
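For illustration, a rough sketch of what an integrator following that convention might look like (this is not the linked implementation; the publishDir key is hypothetical):

import os
import shutil

import pyblish.api

class IntegrateAsset(pyblish.api.Integrator):
    def process(self, instance):
        # Hypothetical destination, assumed to be set earlier in the pipeline
        destination = instance.data["publishDir"]

        # Pick up outputPath, outputPath_ma, outputPath_abc, outputPath_obj, ...
        for key, path in instance.data.items():
            if key.startswith("outputPath"):
                shutil.copy(path, os.path.join(destination, os.path.basename(path)))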

I was thinking of maybe doing something similar, but having instance.data["outputPaths"] be a dict mapping the sequence’s master path to the files extracted:

{"path/to/file.%04d.png": ["path/to/file.0001.png", "path/to/file.0002.png"],
"path/to/other_file.png": ["path/to/other_file.png"]}

As of now, we imagine having the extractors populate a list (instance.data['ftrack_components']) with dictionaries containing the path and the information necessary for the integration step.
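For illustration, such a list could look like this; the component names and paths are made up:

instance.data["ftrack_components"] = [
    {"name": "alembic",                     # component name in ftrack
     "path": "/tmp/extract_xyz/geo.abc"},   # where the extractor wrote it
    {"name": "preview",
     "path": "/tmp/extract_xyz/preview.mov"},
]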

Depending on how things are done, and if the path the extractor yields is deterministic, I guess the integrator should be able to regenerate the same path from a given instance + context?

Yeah, I guess the best option is to have integrators that are as “stupid”/simple as possible, so each can concentrate on one task.

I’ll give you the idealistic view of what an Extractor should know, keeping in mind that in practice some of this may be bent.

This is an expansion of the CVEI definition from the Wiki, in particular the note on “out-of-band” access.

If we think of CVEI as a self-contained unit within which input enters through Collection and output exits through Integration, then Extraction should know nothing about the surrounding pipeline.

        _______________________________
       |  ___     ___     ___     ___  |
       | |   |   |   |   |   |   |   | |
in --->|-| C |-->| V |-->| E |-->| I |-|---> out
       | |___|   |___|   |___|   |___| |
       |_______________________________|

CVEI dataflow

Idealistically, that means even Validation should not be allowed to grasp for information outside of what already exists within the Context and Instances.

For example:

my_collector.py

# OK
import external_library_a
import external_library_b

class MyCollector(...):
  ...

my_extractor.py

# NOT OK
import external_library_a

my_integrator.py

# OK
import external_library_a

Collection and Integration reach out-of-band for information, and that’s fine. But Extraction cannot. It should already have everything it needs, and if it doesn’t, then there’s more to gather during Collection.

The idea is to isolate external requests to as few points as possible and keep the internals as simple and “dumb” as possible, so as to facilitate reuse in other plug-in stacks and/or pipelines. If each plug-in reaches out and grabs information ad-hoc (i.e. globally), you’ve got a recipe for tight coupling and poor reusability.

With the above in mind, the extractor’s only responsibility is to extract. To serialise. To take something only it understands (e.g. the Maya scene format) and produce something others can potentially understand (e.g. an .obj, .json or .abc).

It’s unfortunate it must produce physical files, as it forces us to make a decision up-front about where to put them. But I would make a conscious effort to make this location as un-global and simple as possible, such as a temporary directory.

Think of it like an in-memory location that, by the limitations of Maya and others, must have a filesystem address. Something being built during extraction, but not complete until every extractor finishes, and not available to others until Integration has formatted and integrated it into the pipeline.

As a practical answer to your question: I would define a temporary directory during Collection, potentially based on an out-of-band mechanism (e.g. you have a pipelined temporary directory, for dedicated network access or whatever), and create/fill this during Extraction.

If there are special rules related to the formatting of names that you can establish during Collection, such as a global filename pattern, then I would make note of this too during Collection and use it during Extraction. Other things, such as versioning, may only be available during Integration, in which case it’s fine to rename the files as you move/copy them into the public space.

import tempfile

class MyCollector(...):
    def process(self, context):
        context.data["tempdir"] = tempfile.mkdtemp()
        context.data["alembicTemplate"] = "{project}_{asset}_alembic_{version}.abc"

Thank you @marcus for the breakdown. :)

Was wondering whether you guys are using a temporary directory?

I picked up the workflow of extracting data next to the scene file (context.data["currentFile"]) from some of @mkolar’s plugins, which I thought was a good approach since this data is “standard” in all integrations.
It also makes it very visible if something went wrong with the extraction or integration, since you can easily see the files in the work area.
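That workflow boils down to deriving the extraction directory from the collected scene path; a minimal sketch, assuming a hypothetical “extracted” subfolder convention:

import os

def extract_dir(context):
    # Directory next to the current scene file, e.g. <workdir>/extracted
    scene = context.data["currentFile"]
    return os.path.join(os.path.dirname(scene), "extracted")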

We’re definitely writing to a local temporary folder on our end. This has sped up our extractions for caches quite a lot, more than I imagined at first. It has also relieved some stress on the server, by batch copying instead of writing snippets to the server over time. So far it has never held us back.

For file caches that are already produced by an artist in their work directory (e.g. we have Yeti fur caches), we have a special integrator that transfers those files directly from there to the correct destination folder. We’re avoiding the intermediate “copy to local” -> “copy to server” step and doing it fully on the server; only when a new file is extracted is it written locally and then transferred to the server.

So when you transfer to the server destination, do you move the entire folder first and then rename/move the files around to where you want them? Or do you just move the entire folder to the server and keep the “messy” files inside the folder as-is?

Our integrator only cares about the “top level files” in the extracted directory and copies them to the destination path with cleaned names. So yes, they’re moved and renamed in the same step. Basically, our extractors don’t care about any naming convention; the Integrator takes over.

We do, however, use a convention for our Extractors where they identify what they are by means of their extension, allowing different files with the same extension to be extracted within one instance + family. For example, a gpuCache gets extracted as <content>.gpuCache.abc instead of <content>.abc to avoid conflicts with a regular Alembic export (both usually have the .abc extension).
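A small sketch of how that double extension can be taken apart again during integration (the identify helper is a hypothetical name):

import os

def identify(filename):
    # "hero.gpuCache.abc" -> ("hero", "gpuCache", "abc")
    # "hero.abc"          -> ("hero", "", "abc")
    base, ext = os.path.splitext(filename)
    content, kind = os.path.splitext(base)
    return content, kind.lstrip("."), ext.lstrip(".")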

I’ve felt the system was somewhat too simple/generic from the beginning, but I’ve never needed more in the projects so far. It has served us well, and I’ll likely not make it more complex unless I need to.

TL;DR: Yes, the integrator does rename files, but that happens directly during the transfer, since it copies to a “renamed destination”. So it’s a one-step process instead of a two-step process.

We’re using a combination of both work folder and temp folder; however, we’re slowly switching to the temp folder only. There are only a few extractors that don’t use it yet.

As @BigRoy said, it’s faster, and with our genuinely shitty network and server, any load I can take off it is a great help.