Using Drush to Simulate ContentHub 2.x Syndication

In my last blog, I talked a bunch about some of the basics of our development efforts around Acquia ContentHub 2.x. There's a lot more I'd like to discuss about that module and the efforts our team has put into it, but one of the comments I got on twitter specifically asked about the Drush commands we packaged with 2.x, how they're used, and what you can do with them, so in an effort to both continue discussing the good work of the team, and the capabilities of the product, I'm going to dedicate this blog to that topic.

A Word about Data Models

I could certainly just document how to use the commands, but as with anything, I think understanding the theory involved is helpful, both for greater understanding of the product itself, as well as principles of good Drupal development. In the previous blog, we talked about the 5 basic steps of data flow, but I didn't get into the specifics much. So I want to talk first and foremost about how ContentHub finds and serializes data to be syndicated (since this will apply directly to one of our drush commands).

ContentHub 2.x relies heavily on a new contrib module called depcalc. Depcalc's got a rather simple API that allows a DependentEntityWrapper instance (a super lightweight pointer to load a particular entity) to be parsed for other dependencies. This API then recursively calls itself, finding all dependencies of all entities it encounters along the way, until it runs out of stuff to look at. Now, many data models are going to be processed pretty quickly, but there are lots of other data models which process slowly. Deeply nested dependency trees will take some time to calculate. I've seen trees of 1200-1400 take 6-8 minutes to process, but I've also seen trees of 300-400 process in just a few seconds. I've also seen data models that didn't come back with results after over 30 minutes of processing. The difference is HOW they're structured, so it's really critically important to understand your data model. If you don't you might not get the results you want or expect. ContentHub has a number of APIs dedicated to helping to streamline this functionality and I intend on discussing them at length in other blogs posts, but for the sake of this blog, it is just important that we establish a baseline understanding. Different data models have different processing characteristics. YMMV.

Calculating Dependencies

In order to understand the characteristics of your data model, you need to understand the basics of how depcalc is going to find dependencies for your entity(ies). First, when an entity is handed to depcalc, it dispatches a dependency calculation event with the DependentEntityWrapper. Event subscribers tend to focus on one classification of dependency, so for instance, one might check to see if the entity is a content entity with entity reference fields, and then dig through all those reference fields finding subsequent entities for processing. Another subscriber might execute if the entity is a config entity and process through Drupal core's config dependencies. Yet another subscriber might look exclusively at text areas finding the input filter that was used to process the data in the field. A handful of these sorts of events exist within depcalc specifically, and since it's an event subscriber pattern, if you have a custom relationship that won't be calculated through our existing subscribers, you can always add your own. As I previously mentioned, all entities found this way are recursively calculated until we find no new entities. We don't attempt to deal with simple configuration, and non-entity data is yet to be used in our syndication pattern.

Identifying Problematic Data Models

Now that you understand the basics of HOW we calculate dependencies, it's up to you to look at your own data models and make determinations about their compatibility with the process we're about to attempt. There are a few obvious guidelines to follow however.

If an entity of a particular bundle has references to other entities of that same bundle, it's possible you will end up having a dependency tree that includes ALL entities of that bundle.
One of our early clients asked us why they were getting ALL news articles when they exported any one news article. We looked at the data model and noticed that they had "previous article/next article" entity reference fields, and suddenly, this was a very easy question to answer. Similarly, I've had customers with less clear cut relationships between entities of "like" bundle. Organizations related to other organization, and this can lead to situations where calculation on one organization node might happen rather quickly while others simply seem to never finish. In future blogs we'll talk about how to handle these situations, but you need to identify up front if you have them.
Paragraphs
We support paragraphs, but if you've used it for page layout, it can be a real bear to calculate depending on how deeply nested it is and how many paragraphs are used on an average page. Also, we don't move twig templates around, so the receiving site likely won't have the templates to interpret the incoming paragraph data and it will be displayed oddly.
Lots of entity references
If you have a single entity bundle with many entity references, this can also be indicative of problematic data modeling, and can make it difficult to predict how long entities of a given bundle might take to process.
Custom Field Types
This isn't really "problematic" so much as you should be aware that ContentHub is going to make a "best guess" at field types it doesn't understand. If you have custom field types, or even contrib field types we've not yet written support for, some of your data may be incorrect or missing. If you find this to be true for any contrib field, feel free to file a ticket and we can look at what it would take to get it supported.

Ultimately, there's no harm in trying, but you might be surprised by how many entities are actually related to each other in these circumstances. In a future blog I'll detail how we break down these entities and make them processable even when they might be problematic. Also, keep in mind that if an entity you want to syndicate references an entity with the characteristics we've described above, all the same problems can apply.

Exporting Via Drush to Flat File

With all my caveats out of the way, let's get to the meat of this blog and talk about using Drush to export our data. We're going to use a file in the local file system to store our data for output. In order to do this though we'll actually need to files. ContentHub's Drush export command works on the idea of a manifest file to define the specific entities we want to see exported, so we must first create the manifest file. The file can be named anything you like, so you could actually have a series of manifests for different use cases. The manifest can have 1 entity, or however many you need. Start small and work your way up to whatever you might need. In the normal operation of ContentHub with Acquia's service, we seldom need to move many top level entities at once. While a lot of care was taken to make the import and export processes as lean as possible, Drupal core still has static caching of entities baked deeply into entity storage, which can exhaust memory if you have lots of entities loaded over the course of a single bootstrap.

The manifest file should be in yaml format. We support referencing entities in "type:id" or "type:uuid" formats. Your manifest could look as simple as this:

entities:
  - "node:1"

A more complicated manifest file might look thus:

entities:
  - "node:1"
  - "node:91644a22-8ec8-413e-91fb-b928dba88fd7"
  - "node:315a0239-57d2-4dcb-89bd-f9a76851b74c"

Our first example exports node 1 and its dependencies. Our second example exports node 1, and the two other nodes by their uuids, along with all the dependencies across all 3 entities. If all of these nodes were of the same type, the supporting config/content entities common to them all would only have one entry in the resulting exported output, so this can be a fairly efficient way to group entities together by top level bundle. Also, since we can export config entities, a manifest file can also reference config entities. If you wanted to export a view or some other entity with these 3 nodes, you could absolutely do that, you just need to use the Drupal entity type id, and the id or uuid of the entity you want to export.

Let's assume our manifest file is named "manifest.yml". We can execute our drush command from inside the Drupal directory like so:

drush ach-elc manifest.yml

Once the dependency calculation and serialization processes are complete, this will output what we call "CDF" directly into your terminal. CDF is a custom json data format we use for communicating about our entities. In future blogs, I'll break down CDF into its various components so that it's easy to understand and dissect. If you want to capture this CDF to file, we can do so with typical CLI notation:

drush ach-elc manifest.yml > manifest.json

A Quick Word About File Handling

CDF Doesn't attempt to make the binary representation of files portable. There are obvious reasons for and against doing this, but currently ContentHub depends on sites and their files being publicly accessible. We currently only support the public file scheme (though we want to support S3 and Private files in the long term). If the site you performed your export on is not accessible to the site you will import into, then your files will be missing once the import is complete.

Importing Data from CDF File

Assuming we have successfully exported CDF data to a local file, we can attempt an import. Let's discuss the basic requirements of the receiving site:

Code base must be the same
All the same modules must be available within the code base. They don't have to be enabled or configured (ContentHub will do that), but they do have to be present.
A blank canvas is always best
While not a strict requirement, a blank canvas in terms of content and configuration is always going to demo best. I'd suggest using the "Minimal" installation profile for your first attempt. Keep in mind, ContentHub attempts to unify your configuration settings, so if both the Originator (we call these sites Publishers) and the Receiver (we call these sites Subscribers) have the same entity bundles, ContentHub is going to bring the receiver's configuration inline with the originator. That's fine most of the time, but if your setup is more complicated and includes any sort of configuration conflicts, we'll need to solve those separately. While this CAN be done, you probably don't want to attempt it for your first try using ContentHub, which is why I'm suggesting the Minimal profile.

With those guidelines in place, we are now ready to attempt an import. Be sure to point the drush command at the json file you created, NOT our original yml file.

drush ach-ilc manifest.json

This should result in terminal output that tells you how many items were imported. Something like:

Imported 73 from manifest.json.

Conclusion

Congrats! You've just successfully moved content and configuration from one Drupal site to another via Acquia ContentHub! As I've mentioned before, ContentHub is actually backed by a service for doing this at scale, but the benefits of having a drush command for debugging and testing purposes is really invaluable. It works at a small scale, and makes it possible to trial ContentHub's features. We actually do this programmatically in our test coverage a lot to prove that ContentHub is working as expected, and covers the various use cases we want to see.

In future blog posts I'm going to dissect CDF and show what it's doing, and how it does it. I'll also be posting about manipulating your data models, controlling what field data is and isn't syndicated and calculated, and probably a general discussion of the different events ContentHub & Depcalc dispatch, what they're used for, and how they can customize and streamline your data flows. As always, I'm super interested in any feedback people have, and would love to hear about your experience.