How to build a robust environment?
This is a big subject and, as you know, there are many ways to set things up to be robust. That said, some practices are better than others. I can relate at least what we do and what we've seen customers do.
First, I'd recommend thinking of Cloud CMS as a black box application that runs on top of MongoDB, Elastic Search (both of which can be thought of as databases) and a binary storage provider. Cloud CMS is a stateless application whose setup is actually quite simple. It doesn't maintain any state of its own. Instead, it relies on the underlying databases and storage.
To that end, backing up Cloud CMS is simply a matter of backing up MongoDB, Elastic Search and potentially the binary store. If you're running in AWS, then MongoDB and Elastic Search will both mount on top of EBS volumes. For the binary store, we recommend S3. This keeps all of the heavy data out of your DB (and out of EBS volumes, a good thing since S3 is very inexpensive).
For S3 "backup", one good approach is a cold backup using S3 replication between data centers. That way, when content is written to one S3 bucket, Amazon automatically copies it to another bucket somewhere else. If S3 goes down in one data center, you can then swap over to the other by changing the bucket location. There is a "hot" way of doing this (a DNS record swap, for example) or you can just bounce servers with a configuration pointing to the alternate bucket.
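As a rough illustration of the replication setup described above, here is a minimal sketch that builds the kind of replication configuration S3 expects. The bucket names, the IAM role ARN and both helper functions are made up for the example. The resulting dictionary is shaped for boto3's put_bucket_replication call, and both buckets need versioning enabled before S3 will accept it.

```python
def build_replication_config(role_arn, destination_bucket, rule_id="replicate-all"):
    """Build a replication configuration that copies every new object
    to destination_bucket. All names passed in here are hypothetical."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to perform the copy
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # an empty filter matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::" + destination_bucket},
            }
        ],
    }


def active_bucket(primary, alternate, primary_healthy):
    """Choose which bucket the servers should be configured with.
    This models the simple 'bounce servers with the alternate bucket' path."""
    return primary if primary_healthy else alternate


config = build_replication_config(
    "arn:aws:iam::123456789012:role/example-replication-role",  # placeholder ARN
    "example-content-backup",
)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-content-primary",
#       ReplicationConfiguration=config,
#   )
```

The second helper is only there to show that the "cold" swap amounts to a configuration choice; a DNS-based swap would replace it with a record change instead.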
For both MongoDB and Elastic Search, I would mention up front that we only provide our Docker Compose samples as a means for getting started. Anyone running Cloud CMS in a high-scale or fault-tolerant way would split out the tiers and have MongoDB and Elastic Search running separately. They might still use Docker but it'd be a different environment that can be independently calibrated. So definitely aim for that in any infrastructure that is intended to be high performance.
For MongoDB, we recommend looking at MongoDB's documentation. They provide a very solid architecture that includes the concept of replica sets, whereby multiple MongoDB servers work together as one. Writes go to a primary (master) server and are replicated to the other members, so that if the primary fails, one of the replicas can take over. They also have a notion of shards, whereby data is partitioned (not replicated) across multiple MongoDB servers, each of which can itself be replicated.
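To make the replica set idea a little more concrete, here is a small sketch of how a client typically addresses one: it lists several members and names the set, and the driver works out which member is currently the primary. The host names, the set name "rs0" and the database name are placeholders, and how Cloud CMS itself is pointed at MongoDB is not covered here.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, database=""):
    """Build a standard MongoDB connection string that lists several
    replica set members, so a driver can discover the current primary."""
    if not hosts:
        raise ValueError("at least one host is required")
    options = urlencode({"replicaSet": replica_set})
    return "mongodb://" + ",".join(hosts) + "/" + database + "?" + options


# Hypothetical three-member replica set.
uri = replica_set_uri(
    [
        "mongo1.example.internal:27017",
        "mongo2.example.internal:27017",
        "mongo3.example.internal:27017",
    ],
    "rs0",
    "exampledb",
)
print(uri)
```

Listing more than one member means the client can still connect if the server that used to be primary is the one that failed.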
For Elastic Search, we further recommend looking at Elastic's documentation. Their architecture is a bit more like our own in the sense that you can dynamically add or remove Elastic Search servers on the fly and the servers all find each other and rebalance (i.e. "elastic").
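If you add or remove servers this way, it helps to know when the cluster has finished rebalancing. The sketch below interprets the JSON returned by the cluster health endpoint (GET /_cluster/health). The sample response is invented for illustration, and the function name is my own.

```python
import json


def cluster_is_settled(health):
    """Return True when a cluster health response shows a green cluster
    with no shards still moving, initializing or waiting for a home."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


# Made-up response, as it might look shortly after a node was added.
sample = json.loads(
    '{"status": "yellow", "number_of_nodes": 3, "relocating_shards": 2,'
    ' "initializing_shards": 0, "unassigned_shards": 1}'
)
print(cluster_is_settled(sample))
```

Polling this until it returns True is one simple way to gate the next step of a rollout.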
For both MongoDB and Elastic Search, data is stored on EBS volumes. We recommend using EBS snapshots, which are incremental. You can have these occur every hour. There can be quite a number of volumes depending on how complex you want to get (and thus, many more snapshots). From a recovery perspective, it's a matter of restoring volumes from snapshots. You can either do this at recovery time (which can potentially be expensive) or keep warm backups at the ready.
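The hourly snapshot step could look something like the sketch below. It is not our actual tooling, just an illustration: the function takes any client object that offers boto3's EC2 create_snapshot call, and the volume IDs and the stand-in client are made up so the example runs without touching AWS.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2, volume_ids, now=None):
    """Request one snapshot per EBS volume and return the snapshot IDs.
    ec2 is expected to behave like boto3.client("ec2")."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%MZ")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description="hourly backup of " + volume_id + " at " + stamp,
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids


class FakeEC2:
    """Stand-in for a real EC2 client, for dry runs only."""

    def create_snapshot(self, VolumeId, Description):
        return {"SnapshotId": "snap-" + VolumeId, "Description": Description}


# Hypothetical volume IDs for the MongoDB and Elastic Search data disks.
print(snapshot_volumes(FakeEC2(), ["vol-mongo-data", "vol-es-data"]))
```

In a real setup you would pass boto3.client("ec2") instead of the stand-in and run the function from whatever scheduler you already use.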
We don't have any recommendation on the best tooling to achieve all of this. It's a complex problem and the DevOps challenges are significant for any application. Docker makes instantiating containers and orchestrating them easier but it doesn't handle the automation of the many steps involved.
At Cloud CMS, we've automated things quite a bit for our own SaaS offering. Much of it is custom code written for Docker and using the AWS drivers. We manually drive the backup, restoration and data migration processes using the AWS API. I'm sure there are better ways to do things but it's proven to be very effective for us.
In terms of dev environments, our customers generally go with something really simple -- even just launching Docker Compose on a single host or developer laptop. In those cases, they don't seek to reproduce the infrastructure but rather seek to get a runtime going for building custom things. For staging and QA, it depends -- but yes, some customers do provide a replica of their production stack. These replicas generally aren't transient in nature (they're usually fixed boxes that stick around for a while).