Storing publish results within a database

Hi there,

I’m looking for the best way to store the results of the Publishing process in a database. I’d like to have a discussion about the best way of getting this accomplished and talk about the different pros and cons that this approach might have. At the moment, I’m aiming to get it up and running using SQLite3.

Information to be stored

For the time being, this is the information that I want to store in the database:

  • Name of the asset being published
  • Name of the artist Publishing the asset
  • Date of the Publish
  • Project that the asset belongs to
  • Information of the results of each plugin

In regard to that last point: it would be fairly simple to store the results as success / failure but ideally, I want to find a nice way to store the results of each plugin. One solution to this is to store the message of the exception raised if the plugin failed, but I’m wondering if we can get fancier than that.

Let’s say that we have a validator that checks for a series of things and it is able to raise 3 different exceptions. I’m wondering if we could have a table in the database with those 3 exceptions and fill in their values depending on which one gets raised when the plugin fails. Another instance in which this might be useful is if I want to store a list of items that are causing the issue, because then I could use that list to look for matches within the database.
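
For illustration, a failure record along those lines might look something like this (the field names are just placeholders I made up, not anything Pyblish provides):

{
  "plugin": "ValidateNamespaces",
  "instance": "InstanceA",
  "success": false,
  "exception": "NamespaceNotAllowed",
  "message": "Found nodes outside of an allowed namespace",
  "offendingItems": ["|group1|pSphere1", "|group1|pSphere2"]
}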

Starting Point

I think that a good starting point is the code that MarcusO posted on Gitter a while back (as some of the information written in the JSON file is relevant to what we want to put in the database):

import json
import pyblish.api


@pyblish.api.log
class PostValidator(pyblish.api.Validator):
    order = pyblish.api.Validator.order + 0.5
    families = ['*']
    hosts = ['maya']

    def process_context(self, context):
        results = context.data("results")

        self.log.info("Were there any errors?")

        for result in results:
            if result.get("error"):
                path = r"Z:\results.json"

                self.log.info("Yes there were, writing to: %s" % path)

                with open(path, "w") as f:
                    return json.dump(results, f, indent=4, sort_keys=True)

        self.log.info("No errors")

Database experience

I have to admit that I’m kind of new to the world of databases and I’m looking for advice. Hopefully, some of you have had experience with them. :smile:

I hope that we can have a good and healthy discussion about the best way to get this done, though.

First things first

I guess that before I get cracking with code, it is important to define the following things:

  • Structure of the database (Schema).
  • The best way to retrieve information from plugins in preparation for the database.

Any advice that you might have is more than welcome.
Cheers,

Dave

Looks good, David, but don’t go for SQL, go for a JSON-based database, like MongoDB (local) or Firebase (cloud).

SQL vs. JSON

I’ve re-written this post a few times trying to explain why, but it’s better you Google the details. In a nutshell, SQL is restrictive whereas JSON is loose. Practically, JSON will make your journey towards understanding what data you actually need and how you intend on using it a lot smoother than anything involving SQL.

For example, a JSON-based database can be exported/imported as plain JSON without loss of data. You can look at it, share it in forums, and understand what you are looking at. It’s also good for embedding directly in Instances and the Context, as it’s just a dictionary in Python-land. Tables offer nothing like that.

With that out of the way

Name of the asset being published
Name of the artist Publishing the asset
Date of the Publish
Project that the asset belongs to
Information of the results of each plugin

That is all great information to have.

Don’t forget to take into account the manner in which Pyblish processes your instances. It works in the form of pairs.

A pair is simply an Instance/Plug-in combination. After a completed publish, the processing history looks something like this.

SelectInstances:           Context;
ValidateNamespaces:        InstanceA;
ValidateNamespaces:        InstanceB;
ValidateSomethingAboutA:   InstanceA;
ExtractInstances:          InstanceA;
ExtractInstances:          InstanceB;
...

So the results will be in relation to which pair was processed.

In your case, this means that, in addition to gathering all the information about a particular plug-in, you’ll also need to take into account each instance it processed. The result could look something like:

[
  {
    "plugin": "ValidateSometing",
    "instance": "InstanceA",
    "success": true
  },
  {
    "plugin": "ValidateSometingElse",
    "instance": "InstanceA",
    "success": false
  } 
]
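
To make that concrete, here is a rough sketch of flattening the processing history into one record per pair. The exact keys of each result depend on your Pyblish version; I’m assuming "plugin", "instance" and "error" here, as in the plug-in quoted above.

def pair_records(context):
    """Flatten the processing history into one record per pair."""
    records = []
    for result in context.data("results"):
        records.append({
            # Plug-ins are classes; fall back to str() if a name was stored instead
            "plugin": getattr(result["plugin"], "__name__", str(result["plugin"])),
            "instance": str(result["instance"]),  # "None" for Context-wide plug-ins
            "success": not result.get("error"),
        })
    return records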

Getting crackin’

Structure of the database (Schema).

If you go for JSON, you can work this out as you go, so I wouldn’t spend too much time thinking about it until you’re more familiar with the data you need. You can start throwing data into your database, and make use of it. As soon as you find something missing, you add it.

The best way to retrieve information from plugins in preparation for the database.

All log messages and exceptions will be available to you via the context.data("results") member. In addition to that, you could append information in each plug-in where it makes sense.

# ValidateNamespaces
if error:
    data = context.data("toDatabase", list())
    data.append({
        "plugin": type(self).__name__,
        "message": "Custom message, just for the database, not for the user or GUI"
    })
    context.set_data("toDatabase", data)

In this case, a particular plug-in attaches information that is specifically meant for a database, something that won’t be logged or raised.

Every plug-in could have this, and each plug-in will have intimate knowledge about what it’s doing, which makes it a good spot to generate this data from.

Notes

A JSON-based database is identical to working with a JSON file on disk, so I would start by writing to a JSON file as in the plug-in you posted, and transition to the database when you need higher performance. A JSON file on disk should be very capable of storing thousands if not millions of records without any noticeable drop in performance, but querying is another question, unless you know what you’re looking for.
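
As a rough sketch of that intermediate step, and following the same structure as the plug-in you posted, a late-running plug-in could flush whatever was collected under "toDatabase" to a JSON file. The path and the "toDatabase" key are placeholders, not something Pyblish defines.

import json
import pyblish.api


class WriteDatabaseRecords(pyblish.api.Validator):
    # Run after the other validators, once all records have been collected
    order = pyblish.api.Validator.order + 0.6
    families = ['*']

    def process_context(self, context):
        records = context.data("toDatabase") or []
        if not records:
            return

        path = r"Z:\database_records.json"
        with open(path, "w") as f:
            json.dump(records, f, indent=4)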

Thanks for your reply, Marcus

Sounds like a JSON-based database is the way to go. I will do a few experiments with MongoDB to get familiar with it.

Pairs of Instance/Plug-in

Don’t forget to take into account the manner in which Pyblish processes your instances. It works in the form of pairs. A pair is simply an Instance/Plug-in combination

I see what you mean in terms of how Pyblish processes the instances. I guess I will need to determine the best way to organise the information within the database.

What about giving each publish a unique ID that I can attach to the information stored? By doing that, I will be able to determine what instances were Published and any errors associated with those (or the context).
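
For illustration, that could be as simple as a small selector plug-in that stamps the Context once per publish; "publishId" is just a made-up key here, not something Pyblish defines:

import uuid
import pyblish.api


class CollectPublishId(pyblish.api.Selector):
    """Tag the current publish with a unique ID for the database."""

    def process_context(self, context):
        context.set_data("publishId", str(uuid.uuid4()))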

Let’s get started

If you go for JSON, you can work this out as you go, so I wouldn’t spend too much time thinking about it until you’re more familiar with the data you need. You can start throwing data into your database, and make use of it. As soon as you find something missing, you add it.

I like the sound of that!

As a test scenario, I will be creating a database and I will store the following information:

  • Date of Publishing
  • User Publishing
  • Unique ID

I will be posting my progress as I go along.

Ok, here is a first implementation:

from datetime import datetime
import getpass
import pymongo

connection = pymongo.MongoClient()
database = connection.test_db
users = database.users
publishes = database.publishes

# Data to be stored
current_date = str(datetime.now().date())
current_time = str(datetime.now().time())
current_user = getpass.getuser()

# Insert the current user into the users collection if it does not exist
users.update_one({'name': current_user},
                 {'$set': {'name': current_user}},
                 upsert=True)

current_user_id = users.find_one({'name': current_user}, {'_id': True})['_id']

# Insert a new publishing document
publishes.insert_one({'date': current_date,
                      'time': current_time,
                      'user_id': current_user_id})

As you can see, a database is created with two collections: users and publishes.

The first one will use unique values for the names of the users and it will not create additional entries regardless of the number of times the user publishes. The second one will create a new document every time the code is evaluated, and it will keep a reference to the user that was publishing.

Any thoughts?

Looking good, David.

Before I comment on the implementation, I’d like to ask how you are looking to use the data. E.g. is it for tools, for graphing, for shoving into an archive somewhere?

Thanks Marcus,

For starters, I will be using the information to create some statistics. Having said that, it is quite likely that the database will grow and I will start using it for tools as well.

Then before getting too much data in there, I think it’s best to have a look at how you would start using the data for statistics. It’s quite likely that the method you choose to visualise things will have an impact on how the data is best stored.

Do you have any particular tool in mind for visualisation?

I’ve been looking into using this one, which is based on Elasticsearch (also JSON).

Here’s some more reference about storing and visualising logs.

Also part of Elasticsearch, apparently known as an “ELK stack”, for Elasticsearch, Logstash and Kibana.

Hi Marcus,

Thanks a lot for your recommendations.

Elasticsearch & Kibana

A combination of Elasticsearch & Kibana looks like a really flexible and neat way to analyse data. I don’t think Logstash would be required, as we don’t need to do any conversion of the data that we want to use in Elasticsearch. We are already working with JSON data.

I did a few experiments and I managed to add some data to Elasticsearch using curl. That means we should be able to add information with Python using the requests library (thanks for the recommendation!)
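
For reference, a rough Python equivalent of that curl experiment might look like this; the host, index and type names ("publishes", "result") are just assumptions for the example:

import requests

doc = {
    "plugin": "ValidateNamespaces",
    "instance": "InstanceA",
    "success": False,
}

# POST to /index/type lets Elasticsearch generate the document ID
response = requests.post("http://localhost:9200/publishes/result", json=doc)
print(response.json())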

It’s definitely something to keep in mind for the future. Having said that, I’m looking for something simpler for now.

Another option: pygal

Today I found a Python package called pygal and it looks really promising! With it, you can create different kinds of charts with just a few lines of code. And the best thing is that it exports to SVG!
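
For example, a bar chart of publish counts (the numbers here are made up) is only a few lines:

import pygal

chart = pygal.Bar(title="Publishes per project")
chart.add("Project A", [12])
chart.add("Project B", [7])
chart.render_to_file("publishes.svg")  # writes the chart as an SVG file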

What’s the plan then?

So I think that for the time being I’m going to stay with MongoDB, and I will set up a few queries that extract the information that I’m looking for to generate weekly stats. Once I have that, I will create the charts using pygal.
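
As a sketch of what such a query could look like against the collections from my earlier snippet, here is one that counts publishes per user for a given week (the dates are placeholders, and 'date' is the "YYYY-MM-DD" string stored earlier):

# Count publishes per user within a date range
pipeline = [
    {'$match': {'date': {'$gte': '2016-02-01', '$lt': '2016-02-08'}}},
    {'$group': {'_id': '$user_id', 'publishCount': {'$sum': 1}}},
]

for row in publishes.aggregate(pipeline):
    user = users.find_one({'_id': row['_id']})
    print(user['name'], row['publishCount'])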

I will keep you guys updated with my progress.

Thanks again

Sounds like a solid plan, David. Glad you found what you were looking for.

Once you have the data and are able to visualise it in any library, like pygal, it should be no problem getting it visualised in other ways as well, like in Kibana. The basics are all there and should mostly be a matter of mapping one to the other, along with some potential performance differences.

If you’re going for visualisation through a web browser, there is also D3.js, which is more or less the standard for general visualisation and has a few examples and ready-made templates that you fill in with your data. It assumes some basic familiarity with JavaScript.

Have a glance at the enormous examples page.

Otherwise, PyGal looks very promising, and generating SVG natively opens up a few additional doors. As mentioned in our private chat, SVG is immensely flexible; I think of it as the 2D equivalent of Alembic. It’s an ASCII-based format, like Maya’s .ma, with both a hierarchy and arbitrary properties of shapes. You can access and modify these properties and shapes through JavaScript to make animations, alterations or to add interactivity.

The only disadvantage of PyGal over D3 that I can think of would be that the results are essentially baked, and that probably comes with limitations of its own.

Hi Marcus,

I would like to find out if something like DynamoDB (AWS) can be used in place of Firebase (Google Cloud).

Currently, I only have access to AWS resources.

Cheers

Hi @nicholas_yue, I think you could use any database; Pyblish doesn’t hold any knowledge about it. Do you have any particular use-case in mind?

I would like to build a really basic proof-of-concept pipeline (I am a pipeline TD), hopefully covering most departments, so that I have a better understanding of PyBlish before I present it to my company for consideration. However, the people I can collaborate with are not the same as at my workplace, and there are too many security lock-downs.

Hence I am considering building the proof-of-concept pipeline in the cloud.

If there is some material (cloud-based DB + PyBlish) I can dive into, that would be awesome.

I have watched a couple of the intro videos of PyBlish and have downloaded and installed it with Maya 2016.

Next is to dive deeper and actually publish something to understand what is happening.

Thanks for making PyBlish available.

Cheers

You are most welcome!

As for a proof of concept, Pyblish is entirely standalone and runs on the local machine, so there’s no requirement for any cloud solution or even database.

For example:

from pyblish import util
context = util.publish()

Now the context variable contains all of what happened during the publish (sorry if this was already covered above, I haven’t really read it through carefully). More info on publish reports here.
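
As a rough sketch of inspecting that report, assuming a recent Pyblish where context.data is a plain dictionary (the exact keys depend on your version), each entry describes one plug-in/instance pair:

for result in context.data.get("results", []):
    print(result["plugin"].__name__,
          result["instance"],
          "ok" if result["success"] else "failed")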

If you wanted to use a database, this is also the kind of data you’d likely want to put there; to e.g. share, archive or visualise.

Not that I can think of, but if you write something up yourself you’re welcome to share it with us!