Product Development: Acquia ContentHub 2.x

So my blog's been offline for a while now. I think there was a security issue sometime around 8.2.2, because I just upgraded the public version of the site from 8.2.1 directly to 8.8.1. I actually did an upgrade to 8.8.0 alpha-something that I used to bootstrap the upgrade to 8.8.1, but the public-facing site made a pretty big jump. Some of this is due to laziness on my part, but a pretty significant portion of my disappearance has been due to a new (to this blog) position within Acquia. I've been at Acquia for nearly 6 years now (6 in April of 2020), and for almost the last 2 years, I've been helping Acquia redevelop the ContentHub product.

ContentHub is a really interesting problem space. I tend to get sucked into these very nuanced, support-everything-imaginable situations with Drupal module and core development, and ContentHub is no different in that respect. Ultimately, it's intended to be a syndication engine, taking content from one site and materializing it on another. However, doing this between Drupal sites involves a lot of nuance. To solve this, we approach content syndication as a 5-step process, with 2 steps happening on the originator, 2 steps happening on the receiver, and 1 step happening within our service.

  1. Dependency Calculation (Originator)
    Any given piece of data within Drupal might depend on dozens of other pieces of data. The common "article node" that Drupal 8's standard profile ships with depends on around 40 other entities at the very least. These include the node type, field storage and config, author, tags, image(s), and any supporting entity bundle/field/view mode/form mode data each of those entities requires. Even simple entities are quite complicated in terms of what data they require in order to operate. We can't really "depend" on any of these things existing on the receiving end of our syndication pipeline, so we have to package it all up, send it en masse, and let the receiver figure out the details. (The first sketch after this list roughs out the recursion involved.)

  2. Data Serialization (Originator)
    Ideally, this is a "solved" problem in Drupal 8. The 8.x-1.x version of ContentHub tried to use the serialization engine that ships with Drupal core to do much of this work, but this approach tended to come up short when dealing with multilingual data. That may have been a flaw exclusive to ContentHub 8.x-1.x, but ultimately, when looking at this problem space, it seemed easier to have each field declare how it was to be serialized, deal with all language-level data on a per-field basis, and have a fallback mechanism for "best guess" handling of unrecognized field types. (The second sketch after this list shows the shape of that dispatch.)

  3. Communication (Originator->Service->Receiver)
    Once we've found all our dependencies and serialized them, we send that data to an Acquia-specific service whose job is to filter data by customer-defined criteria and send the appropriate data to any receiving sites the customer might have, on a per-filter basis. (The third sketch after this list illustrates the matching conceptually.)

  4. Data Collection (Receiver)
    Since only filtered data is sent to a receiver, it may not actually get all the data required to install a given piece of content; it just gets the content itself. Each piece of data is constructed with a reference to all of its dependencies so that a complete list of requirements can be built before attempting to import anything.

  5. Data Import/Site Configuration/Dependency Handling (Receiver)
    OK, this is a bunch of stuff to have all in one step, but when you have a giant list of completely different types of data with different module and configuration dependencies, you have to be flexible. In this step we start by identifying all the required modules for the entire import. Once that's complete, we check whether those modules exist on our receiver and bail out if they don't. If they DO exist, we enable them and start parsing through the incoming data. This is tricky too, though, because we can't create a new node without first creating the node type and adding all of its fields. If one of those fields references taxonomy terms from a particular vocabulary, we have to make sure that vocabulary exists, and so on. To this end, we loop over our data set, progressively creating dependencies based on what we have locally available and what's required to support our incoming data. Eventually, we'll have processed the entire incoming data stream, finishing with the original entity that was requested for syndication. (The final sketch after this list roughs out this loop.)
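
To make step 1 concrete, here's a minimal sketch of recursive dependency collection in the spirit of what our depcalc module does. The function name is illustrative rather than ContentHub's actual API, and the real calculation covers far more (config dependencies, field storage, view/form modes, and so on):

```php
<?php

use Drupal\Core\Entity\EntityInterface;

/**
 * Recursively collects every entity a given entity depends on, keyed by UUID.
 *
 * Illustrative only; the real calculation lives in the depcalc module.
 */
function collect_dependencies(EntityInterface $entity, array &$collected = []): array {
  if (isset($collected[$entity->uuid()])) {
    // Already visited; this guard also breaks circular references.
    return $collected;
  }
  $collected[$entity->uuid()] = $entity;

  // referencedEntities() covers entity reference fields on content entities
  // and declared dependencies on config entities.
  foreach ($entity->referencedEntities() as $referenced) {
    collect_dependencies($referenced, $collected);
  }

  return $collected;
}
```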
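
For step 2, the per-field dispatch might look roughly like this. The handler map and function name are hypothetical stand-ins; the actual module lets field types declare their own serialization (and handles every translation on a per-field basis):

```php
<?php

use Drupal\Core\Field\FieldItemListInterface;

/**
 * Serializes a single field via a per-type handler, with a fallback.
 *
 * A hypothetical sketch, not ContentHub's real serializer.
 */
function serialize_field(FieldItemListInterface $field): array {
  $handlers = [
    // Illustrative per-type handler: reduce entity references to UUIDs.
    'entity_reference' => function (FieldItemListInterface $f) {
      $uuids = [];
      foreach ($f as $item) {
        $uuids[] = $item->entity ? $item->entity->uuid() : NULL;
      }
      return $uuids;
    },
  ];

  $type = $field->getFieldDefinition()->getType();
  if (isset($handlers[$type])) {
    return $handlers[$type]($field);
  }

  // "Best guess" fallback for unrecognized field types: raw property values.
  return $field->getValue();
}
```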
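
Step 3 happens inside Acquia's hosted service, so the following is entirely hypothetical, but conceptually the filtering is a matching problem: each customer-defined filter pairs some criteria with the sites subscribed to it.

```php
<?php

/**
 * Entirely hypothetical: routes a serialized entity to the receiving sites
 * whose filters it matches. The real logic lives in Acquia's service.
 */
function route_to_receivers(array $entity_payload, array $filters): array {
  $receivers = [];
  foreach ($filters as $filter) {
    // A filter here is a set of attribute => expected value criteria plus
    // the list of sites subscribed to that filter.
    $matches = TRUE;
    foreach ($filter['criteria'] as $attribute => $expected) {
      if (($entity_payload[$attribute] ?? NULL) !== $expected) {
        $matches = FALSE;
        break;
      }
    }
    if ($matches) {
      $receivers = array_merge($receivers, $filter['sites']);
    }
  }
  return array_unique($receivers);
}
```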
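
Steps 4 and 5 boil down to: build the full requirements list, verify and enable modules, then progressively import whatever becomes satisfiable on each pass. A rough sketch, where exists_locally() and import_item() are hypothetical helpers and module handling is assumed to have already happened:

```php
<?php

/**
 * Hypothetical sketch of the receiver's progressive import loop. Each
 * incoming item is assumed to carry a 'uuid' plus a 'dependencies' list
 * of UUIDs (the requirement data gathered in step 4).
 */
function progressive_import(array $items): void {
  $pending = [];
  foreach ($items as $item) {
    $pending[$item['uuid']] = $item;
  }

  $imported = [];
  while ($pending) {
    $progress = FALSE;
    foreach ($pending as $uuid => $item) {
      // An item is importable once every dependency is either already
      // imported from this stream or already present locally.
      $unmet = array_filter($item['dependencies'], function ($dep) use ($imported) {
        return !isset($imported[$dep]) && !exists_locally($dep);
      });
      if (!$unmet) {
        import_item($item); // E.g. create the node type before any node.
        $imported[$uuid] = TRUE;
        unset($pending[$uuid]);
        $progress = TRUE;
      }
    }
    if (!$progress) {
      // Nothing became satisfiable this pass; bail rather than loop forever.
      throw new \RuntimeException('Unresolvable dependencies in import stream.');
    }
  }
}
```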

Having outlined this basic flow, I want to contrast this with Migrate for a moment. Both Drupal 8 and 7 migrations happen on a per-entity-type basis. All the incoming node types are created before any nodes are created. All the users are created before nodes. All the terms, etc., before most nodes (depending on what dependencies any migration might have). This is really sensible, since we tend to use Migrate for an entire site's worth of data at a time. ContentHub doesn't have this luxury, since as little as a single item might be syndicated to a site, or as many as several thousand. Whatever the case, ContentHub has to be capable of identifying how to handle any incoming entity type, creating that entity type, and moving on to another based upon how the dependencies are stacked in a given Drupal setup.

This approach is really powerful because it ends up acting very similarly to the Features module. Since ContentHub can syndicate any sort of entity (so long as it supports UUIDs... don't get me started), it can package up things like content types, views, etc., and send them to another site, complete with automated module dependency calculation. The team's common demo example has been installing Umami as an originator of content, then using Standard or Minimal as a receiver, and watching all the Umami configuration and content stream in to fill out a blank site. Acquia's demo team has actually been using it for setting up demo content in a limited capacity, and I'd love to push that further by figuring out how to use our data format to uninstall data from a site (by removing things in reverse dependency order).

Ultimately, I'm really pleased with the efforts of the last almost-2 years. I think ContentHub 2.x is a really cool tool that a lot of people could use, to varying degrees, to do different compelling things. We included a couple of Drush commands with it so that people can play with it without subscribing to our service. They import and export data via manifest files, and they're really useful for demoing the product and for doing development in a completely controlled manner. If you get a chance to play with ContentHub 2.x, I'd be very interested to hear from you, and if you want to ask questions, please let me know!

3 January 2020

Thank you for this detailed post.

It's great to see ContentHub moving forward!

The first version of ContentHub was sometimes challenging to implement - especially with File entities. Is this solved by the new "5 steps" process?

3 January 2020

File entities have their own sets of problems. Today, 2.x only supports public-scheme files, so if that's all you need, then ContentHub 2.x has you covered. However, file entities support more schemes than just public, and we're making plans to support s3 and discussing how best to support private. We knew this was a problem space we'd need to support in the long term, so a custom plugin type for handling files was baked into the module from day 1. Ultimately, we might not be able to support all schemes, but if we can hit these 3, that should be well into the 90% range of use cases, and there's an API to allow others to attempt their own custom support if necessary.
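
For a sense of what such a plugin type looks like, here's an illustrative interface; the name and methods are hypothetical, not the module's actual API:

```php
<?php

/**
 * Illustrative only: the rough shape of a per-scheme file handler plugin.
 */
interface FileSchemeHandlerInterface {

  /**
   * The URI scheme this handler supports ('public', 'private', 's3', ...).
   */
  public function scheme(): string;

  /**
   * Adds whatever a receiver needs to fetch the file (a public URL, a
   * signed URL, bucket coordinates, etc.) to the serialized output.
   */
  public function addFileData(array &$serialized, \Drupal\file\FileInterface $file): void;

}
```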

larowlan

3 January 2020

Congratulations Kris on two years of work, a major achievement!

FWIW Entity Pilot (drupal.org/project/entity_pilot) has had most of these features for the best part of 4 years, for a fraction of the cost - the only thing it doesn't have is multilingual support, unless core serializers support that now - not sure - it just defers to those.

Also, as cool as it sounds, I don't think this is the right approach:

"and then watching all the Umami configurations and content stream in to fill out a blank site"

For the same reasons that I don't think the Block Place module was the right approach, and that I'm still yet to be sold on 'layout builder everywhere' - mainly because it breaks the basic promise of the CMI initiative: that configuration is tracked in code. So whilst pulling in configuration from another site sounds great, unless the content editor doing the pulling knows to do a subsequent config export and somehow check that into git, you're breaking the CMI workflow. Block Place was the wrong idea (in my opinion) for the same reason, and thankfully it's since been deprecated and will be removed in Drupal 9.

With this problem space, Entity Pilot takes a different approach - allowing site builders to pre-define configuration mappings (themselves stored as config entities, checked into CMI, etc.) so that you can translate 'an article on site A is a blog post on site B' and 'field_image on site A is field_media on site B' - see https://entitypilot.com/content/new-beta6-share-and-move-content-betwee…

But observations aside, this is a great milestone - I hope that in the future we could work together on the common aspects of entity pilot and content hub so we can come up with the best common solutions!

3 January 2020

I'd be really interested in working together on these problem spaces. I'm not sure how best to orchestrate that effort, but I would really value doing it.

Responding to a couple of your points. ContentHub 2.x is really multilingual-first, because so many of our customers needed that; it was table stakes to even have the conversation. Second, with regard to CMI, I've never been of the opinion that we (the Drupal community) nailed down a truly workable CMI workflow. In the case of ContentHub customers, it's not uncommon for them to want/need fleet management and not want to even log into the subscribing sites. To facilitate this, we built a system that can syndicate config while maintaining its UUIDs (when possible) and allow config to be centrally controlled just as content is.

That being said, your point about CMI is still relevant for the many people trying to control their config through that workflow. To that end, I've worked with developers at Palantir, as well as a couple of different customers, to come up with ways both to prevent the import of config under certain circumstances and to prevent the syndication of config altogether. Both of those efforts are fairly "new" in comparison to the rest of the 2.x effort, but I think having robust CMI workflow support is something we have to nail. Both of these separate efforts exist as patches in the acquia_contenthub issue queue and a gist for depcalc.

Thanks a ton for your specific input. Given that we've trod a lot of the same ground here, I really value it. I think a syndication handler for Drupal core would be a really interesting feature to work on collectively. Let me know if that's something that might interest you, and we'll see what we can do. :-D
