Pipeline Development Environment

@BigRoy and I have recently been looking into setting up a shared pipeline environment suitable for internal development and version control. We ran into issues along the way, so below is a summary of how far we got.

Intro

Work in progress

From a 10,000-meter perspective, we’re asking ourselves:

  1. How do others do it?
  2. Is Git a good candidate, or is something like Subversion better suited?
  3. Should we version control external libraries?
  4. What does the general workflow look like when deploying new features to production, in a safe manner?

1. Setting the bar

For this to work, we figure it would be best to have a “testing ground” type of area into which new features or bug fixes are put in preparation for deployment into a live production environment.

We’ll call these two areas “development” and “production” environments, where there can be many development environments, but only one for production.

Once up and running, the goal would be to provide an environment within which software could be launched in either “production” or “development” mode.

2. Organisation

Similar to how Pyblish is being developed and hosted on GitHub, there would be one “master” repository used for production, and various forks for development.

  • studio/Production (master)
    • studio/Development (fork)
    • marcus/Development (fork)
    • roy/Development (fork)

Production and Development are always available internally and assumed to exist as far as pipeline tools are concerned. The production environment contains code that has first been tested in the Development environment, whereas the other forks represent an individual’s “sandbox” environment, some of which may or may not eventually get merged into Development.

On disk, it may look something along these lines.

$ cd z:\pipeline
$ tree
development
└── python
    ├── core
    │   ├── __init__.py
    │   └── model.py
    ├── maya
    └── tools
production
└── python
    ├── core
    ├── maya
    └── tools

Through the power of Git, the state of each environment may be “tagged” with a version, each version encapsulating a particular set of requirements for the next release, until it is eventually merged with the Production repository where the changes become live.

In practice, this means that whenever a new release is made, the previous one is still there and can be reverted back to. It also means there is room for scheduled releases with a fixed set of expected changes for inclusion in each release.
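
To make this slightly more concrete, here is a minimal Python sketch of what tagging a tested development state and promoting it to production could look like, assuming both environments are plain Git clones under z:\pipeline as above. The script and function names are hypothetical.

# deploy.py -- hypothetical sketch; assumes development and production are
# ordinary git clones, and that production only ever fast-forwards to
# states that were first tagged in development.
import subprocess

DEVELOPMENT = r"z:\pipeline\development"
PRODUCTION = r"z:\pipeline\production"

def run(args, cwd):
    subprocess.check_call(args, cwd=cwd)

def release(version):
    # Mark the tested state in the development clone.
    run(["git", "tag", version], cwd=DEVELOPMENT)
    # Bring that exact state into production and fast-forward to it.
    run(["git", "fetch", DEVELOPMENT, version], cwd=PRODUCTION)
    run(["git", "merge", "--ff-only", "FETCH_HEAD"], cwd=PRODUCTION)

if __name__ == "__main__":
    release("v1.0.0")

Reverting production is then a matter of checking out a previous tag.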

3. External software

We’ll also want to utilise external libraries and frameworks, such as Pyblish. This is where things got difficult for us, so I’ll describe the long way around here, in the hope of optimising it once a greater understanding has been reached.

As we can’t expect too much from externals, such as whether they utilise submodules with sane remotes, we’ll need to keep them as far away as possible from internals.

On disk, it may look something along these lines.

$ cd z:\pipeline
$ tree
development
production
externals
├── production
└── development
    └── python
        ├── core
        │   ├── __init__.py
        │   └── model.py
        ├── maya
        └── tools

Here the externals are kept outside of the version-controlled internal pipeline, but retain a similar layout on disk; in development and production directories respectively.

During production, external libraries are taken from the /production directory whereas the equivalent development directory is used during testing of new releases, similar to the development environment used internally.

The difference here is how libraries are updated.

The internal libraries are cloned, pulled and merged as usual, whereas external libraries are merely deleted and copied over as-is; their history is left to their official development environments, such as GitHub.
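
As a rough sketch of what that “delete and copy” could look like in practice (the paths and function name are assumptions based on the trees above):

# update_external.py -- hypothetical sketch of refreshing an external
# library by plain copy, deliberately leaving its .git history behind.
import os
import shutil

EXTERNALS = r"z:\pipeline\externals"

def update_external(name, source, environment="development"):
    """Replace externals/<environment>/python/<name> with a copy of `source`."""
    dest = os.path.join(EXTERNALS, environment, "python", name)
    if os.path.isdir(dest):
        shutil.rmtree(dest)  # delete the old copy
    shutil.copytree(source, dest,
                    ignore=shutil.ignore_patterns(".git"))  # copy as-is, minus history

Promoting an external into production would then be the same copy, pointed at the production directory instead.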

Any internal development of an external library then takes place either completely outside of the internal pipeline or within the development area where changes may get pushed to an external source such as GitHub and later merged.

External production libraries can then either pull individually from the official source, or from the development area.

4. Practical examples

Let’s consider a few practical scenarios from an artist’s perspective.

4.1 Running Software

For example, running Autodesk Maya under Windows.

set_production_environment.bat
maya_2016.bat

Under the hood, a series of environment variables are set, most importantly the PYTHONPATH from which Python packages are loaded at the launch of any tool within Maya.

For example, the userSetup.py of Maya might look something along these lines.

from pipeline.maya.tools import instance_creator
instance_creator.install()

In this case, the instance_creator of the production environment is automatically picked up due to it being the first item within the PYTHONPATH.
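
For illustration, here is a rough Python equivalent of what set_production_environment.bat might be doing; the variable names and exact paths are assumptions, not the actual contents of the script.

# Hypothetical sketch of the environment the .bat files could be setting up.
import os

ROOT = r"z:\pipeline"

def set_environment(mode="production"):
    """Put the chosen environment ("production" or "development") first on PYTHONPATH."""
    paths = [
        os.path.join(ROOT, mode, "python"),               # internal packages first
        os.path.join(ROOT, "externals", mode, "python"),  # then externals
    ]
    existing = os.environ.get("PYTHONPATH", "")
    os.environ["PYTHONPATH"] = os.pathsep.join(paths + ([existing] if existing else []))

Because the chosen environment sits first on the path, the same import in userSetup.py resolves to either production or development depending on which script was run.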

Running the development environment could look like this.

$ set_development_environment.bat
$ maya_2016.bat

4.2 Feature requests, bug fixes

There comes a time when an artist requests a new feature or a fix for a recurring problem. The ideal workflow might be something like this.

  1. A ticket/an issue is submitted
  2. Developer clones production and implements solution
  3. Developer merges potential solution into the development area
  4. The artist tries solution via the development area
  5. If the solution is accepted, it is merged with production
  6. Otherwise, repeat 2-4 until accepted.

This workflow implies a tight connection between development and production; the two should generally be equal, apart from temporary situations like the one above.

The development area is also where beta-testing could occur, possibly in dedicated branches of development such that an immediate fix could be merged with production as soon as possible.

Perhaps this is where something like Git Flow comes in?

4.3 Personal sandbox

In addition to running within either Production or Development, it may sometimes be useful to run within your own personal sandbox; an environment only you have access to, which you can break without affecting others, and in which you can maintain code only relevant to you, such as personal scripts or software configurations.

  1. Clone production
  2. Implement personal solutions
  3. Run software with your environment
$ set_custom_environment.bat c:\users\marcus\git\pipeline
Validating custom pipeline..
Pipeline valid
$ maya_2016.bat

This setup assumes that your custom pipeline is based on (“subclassed” from?) the production pipeline such that the surrounding infrastructure can make assumptions about it.
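
The “Validating custom pipeline..” step above could be as simple as checking that the expected layout exists. A minimal sketch, assuming the directory structure from section 2 (the required paths are an assumption):

# validate_pipeline.py -- hypothetical sketch of the validation step.
import os
import sys

REQUIRED = [
    os.path.join("python", "core"),
    os.path.join("python", "maya"),
    os.path.join("python", "tools"),
]

def validate(root):
    """Fail loudly if the custom pipeline is missing an expected directory."""
    missing = [path for path in REQUIRED
               if not os.path.isdir(os.path.join(root, path))]
    if missing:
        sys.exit("Invalid pipeline, missing: %s" % ", ".join(missing))
    print("Pipeline valid")

if __name__ == "__main__":
    validate(sys.argv[1])  # e.g. c:\users\marcus\git\pipeline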

5. Hosting

Ideally, none of this should ever leak outside of the internals of a company. But certain outside vendors provide services that can be hard to reproduce internally, such as GitHub, and sometimes having remote access is a good thing.

For the above workflow to function, there needs to be a server into which code may be pushed. In fact there need to be several servers, one per environment, but primarily two: production and development.

There needs to be a concept of “forking” a repository along with “cloning”. Both are native to Git, but less intuitive from the command line when compared with GitHub, where these things are visual.

We’ll need “issue tracking” and “tagging”. Issues are where artists submit a detailed report of either a feature or a bug such that it may at one point be implemented, whereas tagging is how known-to-work editions of a repository are marked.

5.1 Options

  • Manual
  • GitHub
  • BitBucket
  • GitLab

Feedback from Sebastian Thiel, former head of pipeline at Trixter, Germany.

How do I prevent users from having to pull massive amounts of data from Github on the initial clone?

The only thing that comes to my mind is to specify the --reference flag when cloning. That way you can keep a cache of objects on a server in the intranet, which will be queried for objects if they are needed. Such a reference can also be set up later; it’s all in the docs of git clone.
The disadvantage might be that, from that point on, they will need access to that intranet server to do git operations on pyblish; it’s definitely worth testing how it behaves if the alternate object repository is not available.
With such a setup, objects which are not in the alternate object repository will still be fetched from the remote location, github in that case, and it would be up to you to keep the cache up-to-date enough to prevent this data from getting too large.

How do I deal with private repositories on e.g. GitHub and Bitbucket?

That could work with deploy-keys, which can be set per repository, on github at least. It’s a secret that allows RO or RW access to the associated repo. Everyone with that key will have access to the repo, which can easily get out of hand. At least it’s easily revokable though.

Did you solve issues like that in the past?

My main strategy was to keep remotes pointing to the internet on the server-side. Clients/Users would access bare files via a RO network share. Devs would have a local checkout, but again push and pull only to the repository clones on the intranet server. Those who need to push elsewhere can still configure it accordingly via git-remote.
Generally, you would keep various versions of a resource checked out on the server in parallel to allow various groups of users to use different versions of it at the same time.

Think this is the right thread for this…

Vendors

Inevitably there are other tools you make use of in a pipeline that aren’t intended for Pyblish integration, like Python, FFMPEG, ImageMagick etc. But Pyblish is still the glue that holds the tools and pipeline together. As we are aiming for pipeline solutions, the question is how to handle these external vendors most easily.

I can see two ways of dealing with vendors; guidelines or inclusion.

Guidelines

You can document how your pipeline needs to be set up to work. This keeps the package (extension?) very lightweight, but very prone to errors from the end user’s installation.

Inclusion

This is where we include all necessary tools for the pipeline in the package, so the end user has nothing to worry about, but it becomes a bigger repo.

How do you guys see this happening?

To follow up on this, I’ve collected further insights from pipeline developers at various studios.

What I wanted was a working style similar to what we have going for Pyblish; where a GitHub Organisation represents public work and forks are used for development. In a pipeline, the organisation would be the production data whereas forks represent development environment(s).

By the looks of things, even the largest of studios are quite primitive in this regard. It would seem that even these studios resort to merely copying files the old-fashioned way for “deployment”; i.e. getting a working copy of some script or package into production.

Framestore, for example, uses Git during development and copies the current state of those files into production once ready, discarding the notion of Git (i.e. by not copying the inner .git folder), possibly via some ad-hoc system, e.g. “only deploy this and that file”, such that work-in-progress material can still be kept under wraps.

At ILM, they version a separate text file, such as YAML or JSON, containing the names and versions of packages. This text file is then used to “build” the production pipeline at a given time. For example:

[
    {
        "name": "pyblish",
        "version": "v1.2.1",
        "repository": "https://github.com/pyblish/pyblish.git"
    },
    {
        "name": "mypackage",
        "version": "v2.0.0",
        "repository": "git://path/to/remote/repository/mypackage.git"
    },
    {
        "name": "internalpackage",
        "version": "v0.4.0",
        "repository": "ssh://127.0.0.1:/git/internalpackage.git"
    }
]

This file is then versioned in isolation from their internal software, which is actually also included in this list, only via a repository path pointing to an internal git address.
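
As a sketch of how such a version file could be turned into an environment; the file name, target directory and use of git tags are assumptions, not ILM’s actual tooling.

# build_pipeline.py -- hypothetical sketch of "building" an environment
# from a list of {name, version, repository} entries like the one above.
import json
import os
import subprocess

def build(manifest="packages.json", target=r"z:\pipeline\production\packages"):
    with open(manifest) as f:
        packages = json.load(f)
    for package in packages:
        dest = os.path.join(target, package["name"])
        if not os.path.isdir(dest):
            subprocess.check_call(["git", "clone", package["repository"], dest])
        subprocess.check_call(["git", "-C", dest, "fetch", "--tags"])
        # Check out the exact version recorded in the manifest.
        subprocess.check_call(["git", "-C", dest, "checkout", package["version"]])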

In regards to vendors, I think Python has a system going that is easy to explain, which makes it a good starting point for anything more complex. There, dependencies are, as you say, either indicated or included.

When indicated, a package can either rely on the end user manually installing required libraries during the installation of the package, or use a requirements.txt file, which is a formatted file of package/version pairs, where the version is optional and defaults to “latest”.
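
For example, a requirements.txt could be as simple as this (package names and versions are only illustrative):

pyblish==1.2.1
PyYAML>=3.11
requests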

Using this file might look something like this.

$ git clone https://github.com/me/myrepo.git
$ cd myrepo
$ pip install -r requirements.txt
$ python setup.py install

Each package in the requirements.txt file is then installed prior to the installation of the repository itself, which is also a setuptools-based Python package with the traditional setup.py alongside it.

This way, requirements are “included” with a package but aren’t forcefully installed; rather, they are optionally installed. In some cases this is ideal, as you can choose not to install them, for example when you have access to them elsewhere, outside of pip’s knowledge at the time, such as on your server.

Personally, I’m of the Windows school of thought in that I think everything should be bundled together into one big blob. External dependencies grant flexibility in that you can independently update each repository, but they also run the danger of two packages requiring different versions of the same dependency, and they leave a lot behind once uninstalled.

On a Linux system, software is typically kept as minimal as possible and later built dynamically at install time out of the disparate pieces that it requires. I think these two extremes provide good insight into the longevity and maintainability of both systems and in my experience I’ve had far more problems installing and maintaining software on Linux than on Windows. I think the same could potentially apply to Python packages and other software libraries too.

I’m looking into how to encapsulate an entire pipeline for distribution, and thinking a hybrid approach of submodules and including external vendors might be the best option.

Basically it should be a one-stop-shop, so the end user can download it and keep updating that repo. So any tools that aren’t provided online as submodules will be included as a vendor.
Is this how other people work?

My guess would be “no”. But it does sound like an ideal to strive for; I only wonder how much maintenance is added and what a studio would gain in return for putting up with it.

For example, if a repository has a submodule of any kind, then you will have a hard time redirecting those to internal development directories. This was the problem Roy and I were having; we had a parent with children, each child pointing to a fixed location. We wanted a local copy of the parent, such that we could perform work on it in isolation and away from production, but we also wanted local copies of the children. But because the children were inside the parent, making the parent local did not change where the children were pointing; which in this case was to the master pyblish/pyblish-qml et al.

Simply put, this might mean that for a studio to fully encapsulate their pipeline, they cannot use submodules. Giving up submodules would make it more feasible to distribute their pipeline, but it comes at a cost. And the question then is, is the cost worth the payoff?

In practice, even if this particular problem can be worked around, I’d imagine there being many tradeoffs such as this, and for a studio not seeing any direct gain from supplying their internal pipeline, it might not be worth it.

I think the reason we haven’t seen this happen (successfully) before is because of some of these tradeoffs; once you stop paying for them, you end up with a tightly coupled pipeline that only works for one particular studio in one particular circumstance: yours.

Not to throw a wrench in your line of thought; it is a goal (perhaps a holy grail, even?) worth pursuing, and I find the GitHub workflow of forks and pull requests to be a potential mindset for how something like that could work (minus submodules).

I think with any pipeline, tools might come in from anywhere. You’ll end up including tools that are not your own but whose version state you’d still want to manage for a production. Such development that you don’t have control over might just be a non-git folder, or even contain binaries that are much larger in size. So I can definitely see the need to “mix things up a bit” if you want to keep a tracked, versioned history of your entire pipeline.

This is also likely why ILM versions their own “version” file that tracks the combination of packages with their own wrapper, making it possible to mix different version control systems, or even to version something like full applications in a customized way. It really sounds like something that’s less a programmer’s area of expertise and more essential knowledge for any big IT management.

Anyway, it’s good to see the discussion raised here with all the input from everyone.
I’m still struggling to find a way to manage both packaging things together into a versioned “pipeline” and at the same time allow a developer to hop right in with a version, start some development on a part of it and test it locally. Basically getting it down to something along the lines of “getting started in about 5 minutes”.

How about this for a structure?

The main pipeline repo consists of some git-related tools and all external vendors. This repo has an update.bat that pulls all the required git repos into its parent folder. It would look like this:

  • Pipeline repo
    • vendors
      • ffmpeg
      • python
    • update.bat
    • packages.json
  • Packages
    • pyblish-qml
    • pyblish
    • etc.

update.bat
Along with pulling any new repos into the Packages directory, it also updates the existing ones. This would also update the Pipeline repo itself, which might not be possible?

packages.json
This would be similar to what @marcus described for ILM’s workflow:

[
    {
        "name": "pyblish",
        "version": "v1.2.1",
        "repository": "https://github.com/pyblish/pyblish.git"
    },
    {
        "name": "mypackage",
        "version": "v2.0.0",
        "repository": "git://path/to/remote/repository/mypackage.git"
    },
    {
        "name": "internalpackage",
        "version": "v0.4.0",
        "repository": "ssh://127.0.0.1:/git/internalpackage.git"
    }
]

This is of course inspired by pyblish-win, but it should eliminate the need for submodules while being able to easily update to a “new” release of the pipeline.

I think the problem, and perhaps the reason for doing this in the first place, lies in enabling others, in the extreme case a freelancer, to pull the relevant repositories they are interested in building upon, and then to push their changes back.

In doing that, there needs to be a repo used in sharp production and one for development, where both users and the freelancer can switch between the two while working and testing.

With the suggested structure, it looks like we need two Pipeline repos, production and development, along with two Packages repos, again for production and development. But most problematic, I think, is that no one can pull only the relevant projects and work on just a small portion at a time, but must instead pull everything. And how do we synchronize between update.bat and the target Packages directory? That is, where on the hard disk are they located, and can they be moved?

Yes, you are right. That wasn’t clear in the description. I was intending there to be a production and development directory, which I currently utilize here as well, but it’s not mandatory.

I don’t know if this would ever be possible. The complexity of a pipeline most often relies on every component being present. What you can do is look at the packages the pipeline is using and pull only the ones relevant to pyblish. That way you could work with a smaller portion of the pipeline, while still being sure it’s the same versions of the repos.

Didn’t think of having the packages in other places than the parent directory of the Pipeline repo, but it’s an interesting idea. You could maybe have a configurable element where you can specify where the Packages directory resides.

One important element that I’m assuming here is that there is a common entry point into the pipeline in the Pipeline repo. Here we are using a simple bat script that gets run when users log into their machines. We then launch ftrack-connect and pyblish-tray from there, which sets the user up for launching from their browser and ftrack.
This means that you can have the Packages directory anywhere, as long as you have a reference to it from the Pipeline repo.

You are probably right…

So I had some further thoughts about encapsulating a pipeline, and have been looking at Docker.

I still haven’t done any practical explorations into this, but this is how I’d imagined it working.

You would make sure that all code operations are executed on the local machine and outside of the Docker image, especially IO operations, so you don’t have to deal with mounting volumes. Basically the Docker image would act as a server telling the local machine what to do.

@marcus, did you look at Docker at some point for Pyblish?

I’m using Docker for a lot of things and have been considering how we could leverage its potential with Pyblish, but not in this area.

Not sure how far you’ve gotten into your exploration, but each Docker image is like a VMWare Workstation or VirtualBox image; a fully encapsulated operating system where you install things like you would in any operating system, completely independent of the host.

In some ways, it’s the opposite of integration. That’s more or less the point. To encapsulate, not integrate.

Maybe if you broke it down into the steps you’d imagine this integration to work, or a user story of sorts? Because I’m not sure I’m seeing what you’re seeing here.


I’ve dropped Docker for now, as it seems overkill for my needs.

I started a repo for trying to encapsulate the pipeline by pulling repositories referenced in a json file: https://github.com/Bumpybox/pipeline

Any feedback is welcome :smile:

Think my next steps are to try to isolate the pipeline from python and git by including them in the vendors folder.

I started a repo for trying to encapsulate the pipeline by pulling repositories referenced in a json file;

Interesting start. I wonder when it really becomes a bottleneck, or when the extra “json-reference” layer really starts shining.

Any feedback is welcome.

  • I’m assuming you’ll be adding git to vendors already based on your comment, but was going to mention that.
  • I think the repositories.json should hold a specific version number or “range” that it requires of a git repository for the pipeline to function. I imagine this repo to be a lightweight, version-controlled “pipeline” that gets tagged with a specific version to switch to; basically allowing you to “go back in time to that specific state for that particular show you had back then”. So it would need to know what particular tag/commit should be pulled.
    • Similarly, updating the pipeline (git pull + update) wouldn’t necessarily pull the latest of all external git repos, since that could potentially break the pipeline (e.g. when external tools are updated in a way that is backwards incompatible with your tools)
    • EDIT: Actually just now noticed you already had an issue for that: https://github.com/Bumpybox/pipeline/issues/2
  • How are you thinking to serve multiple OS platforms?
  • Should the pipeline also come with its own launchers and environment variables it should set? e.g. does your entire pipeline work if you only have the git repositories loaded? Or are there additional things that influence it?

I wonder when it really becomes a bottleneck, or when the extra “json-reference” layer really starts shining.

True, I guess you could just as easily include all the repos you need for the pipeline, though git submodules aren’t very nice to work with when you want to develop.

I think the repositories.json should hold a specific version number or “range” that it requires of a git repository for the pipeline to function. I imagine this repo to be a lightweight, version-controlled “pipeline” that gets tagged with a specific version to switch to; basically allowing you to “go back in time to that specific state for that particular show you had back then”. So it would need to know what particular tag/commit should be pulled.

Yeah, my thoughts are to have an optional commit in the json file, where you can specify which commit to pull to. On that note, I’m thinking I could add a snapshot method that updates the json file with whatever commit the repos are currently at.
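
A minimal sketch of what such a snapshot method might look like, assuming a packages.json like the one above sitting next to a Packages directory; the “commit” key is the optional field discussed here, everything else follows the proposed structure.

# snapshot.py -- hypothetical sketch: record the current commit of each
# package into packages.json so the whole pipeline state can be restored later.
import json
import os
import subprocess

def snapshot(manifest="packages.json", packages_dir="../Packages"):
    with open(manifest) as f:
        packages = json.load(f)
    for package in packages:
        repo = os.path.join(packages_dir, package["name"])
        commit = subprocess.check_output(
            ["git", "-C", repo, "rev-parse", "HEAD"]).strip().decode("utf-8")
        package["commit"] = commit
    with open(manifest, "w") as f:
        json.dump(packages, f, indent=4)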

How are you thinking to serve multiple OS platforms?

Thinking of having a shell script for other OSs, with the accompanying executables as vendors.

Should the pipeline also come with its own launchers and environment variables it should set? e.g. does your entire pipeline work if you only have the git repositories loaded? Or are there additional things that influence it?

You could definitely do that, or have a separate repo for this.

Actually, I’m currently toying with making the repo more usable for other pipelines as well. The repo could be the “pipeline-manager” that just updates repositories. If it finds a packages.json in other repos, it’ll pull those repos too. This way you could point the “pipeline-manager” to any pipeline repo that has its own vendors and dependent repos.

I would make it a tag, with an overriding, optional commit for brute force versioning.

Tags are made whenever a release is made, so a tag for pyblish-base for example could be 1.4.0. It’d make it clearer in the JSON what’s going on, but also encourage stable releases of projects. Sometimes though you might still need a particular commit.
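
So an entry might look something like this, where the commit is optional and, when present, overrides the tag (the exact keys are of course up for discussion):

{
    "name": "pyblish-base",
    "repository": "https://github.com/pyblish/pyblish-base.git",
    "tag": "1.4.0",
    "commit": null
}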


This all is starting to sound a lot like what Conda (http://conda.pydata.org/docs/) was built to do.

Thanks @jedfrechette! Will look into that. Initially it seems a lot closer to what I’m looking for than other options I’ve explored.

Out of curiosity are you using it?