Message from Derek Palma to brooklyn-dev mailing list about Brooklyn management context and resilience

This is my first time on this list. First, my compliments to the Brooklyn team: this project solves some complex problems with some very concisely written Java. However, I'd like to get an authoritative answer from the source regarding some questions that have come up for us.

I'm able to partially answer some of your points, since these are features I am interested in working on further in the next release of Brooklyn. Hopefully others can chime in and correct me...

Say I instantiate an app deployed across multiple locations and I lose the JVM with the ManagementContext (I am assuming this is the root access point to get access to the rest of the components). Is there a way to start another ManagementContext on another node? I.e. is this a supported feature in the existing code/design or not?

This is called 'rebind' and is a feature we have currently, and are working on improving for the next release of Brooklyn. The intent is to support exactly the circumstances you describe. (I think Aled worked on the current implementation, so he is probably better placed to describe what the current state of affairs is...)

The intended final goal is to persist the state (current and historical) of all applications and their entities in a distributed data store, such as an eventually consistent key-value store or cache. Losing part of the management plane would then not impact the rest of the management components. Losing all Brooklyn management services would only require new manager(s) to be started and connected to this store, at which point they could begin controlling the managed applications again.
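To make the idea concrete, here is a minimal sketch of the persist-then-rebind flow described above. It is not the Brooklyn rebind implementation: `StateStore`, `ManagerSketch` and `RebindDemo` are hypothetical names, entity state is reduced to a map of strings, and an in-memory map stands in for the distributed store.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the shared key-value store; not a Brooklyn API. */
final class StateStore {
    private final Map<String, Map<String, String>> records = new ConcurrentHashMap<>();

    /** Stores a copy of the entity's state under its id. */
    void save(String entityId, Map<String, String> state) {
        records.put(entityId, new LinkedHashMap<>(state));
    }

    /** Returns a copy of every persisted record. */
    Map<String, Map<String, String>> loadAll() {
        Map<String, Map<String, String>> copy = new LinkedHashMap<>();
        records.forEach((id, state) -> copy.put(id, new LinkedHashMap<>(state)));
        return copy;
    }
}

/** Hypothetical management node that writes every change through to the store. */
final class ManagerSketch {
    private final StateStore store;
    private final Map<String, Map<String, String>> managed = new LinkedHashMap<>();

    ManagerSketch(StateStore store) {
        this.store = store;
    }

    /** Starts managing an entity and persists its state immediately. */
    void manage(String entityId, Map<String, String> state) {
        managed.put(entityId, new LinkedHashMap<>(state));
        store.save(entityId, state);
    }

    /** Rebuilds the in-memory view from the store; returns the number of entities recovered. */
    int rebind() {
        managed.clear();
        managed.putAll(store.loadAll());
        return managed.size();
    }

    Map<String, String> stateOf(String entityId) {
        return managed.get(entityId);
    }
}

final class RebindDemo {
    public static void main(String[] args) {
        StateStore store = new StateStore();

        ManagerSketch first = new ManagerSketch(store);
        first.manage("web-1", Map.of("location", "region-a", "status", "RUNNING"));
        first.manage("db-1", Map.of("location", "region-b", "status", "RUNNING"));
        first = null; // simulate losing the JVM that held the management context

        // A replacement manager only needs a connection to the same store.
        ManagerSketch replacement = new ManagerSketch(store);
        System.out.println("Recovered entities: " + replacement.rebind());
        System.out.println("web-1 location: " + replacement.stateOf("web-1").get("location"));
    }
}
```

The point of the write-through in `manage` is that the store, not the manager's memory, is the source of truth, so a replacement manager can start with nothing but a connection to the store.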

The final architecture for this has not been fully determined, but I think it would probably involve something like this:

A cluster of Brooklyn management nodes, one per region or availability zone, as required.
A distributed co-ordination service, such as ZooKeeper, to manage access to the management nodes, and guard against split-brain behaviour during network partitions.
A clustered key-value or document store, such as Cassandra or MongoDB, to contain the management context and record the state of all applications, entities, and sensors.
Various front-end components, such as management and visualisation UIs and services like a CAMP API provider.

This would give us a fully resilient, high availability, distributed Brooklyn control and data plane. This should also be boot-strappable from a single management node, with Brooklyn itself managing the constituent services and nodes.
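As a rough illustration of the co-ordination role in this proposal, the sketch below shows only the "at most one active manager" rule. `CoordinatorSketch` is a hypothetical name, and an in-process atomic reference stands in for a service like ZooKeeper. A real service would also release mastership automatically when a node's session expires, which is not modelled here.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-process stand-in for a co-ordination service; not a ZooKeeper or Brooklyn API. */
final class CoordinatorSketch {
    private final AtomicReference<String> master = new AtomicReference<>();

    /** Returns true if nodeId holds mastership after the call. */
    boolean tryAcquire(String nodeId) {
        return master.compareAndSet(null, nodeId) || nodeId.equals(master.get());
    }

    /** Releases mastership only if nodeId currently holds it. */
    boolean release(String nodeId) {
        String current = master.get();
        // compareAndSet compares references, so pass back the stored instance.
        return nodeId.equals(current) && master.compareAndSet(current, null);
    }

    String currentMaster() {
        return master.get();
    }
}

final class CoordinatorDemo {
    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch();

        System.out.println("node-a acquires: " + coordinator.tryAcquire("node-a"));
        System.out.println("node-b acquires: " + coordinator.tryAcquire("node-b"));

        // Simulate node-a failing; here the release is explicit.
        coordinator.release("node-a");
        System.out.println("node-b acquires after failover: " + coordinator.tryAcquire("node-b"));
        System.out.println("current master: " + coordinator.currentMaster());
    }
}
```

Combined with a shared state store, the node that wins mastership would be the one to rebind and resume control, while the others stay on standby.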

Obviously there are stages between a single server and this proposed setup, such as a single server with a clustered data store for the management context data, or a cluster of Brooklyn management nodes that share management state between themselves. The data store could be as simple as a Berkeley DB service running alongside the management node, or as elaborate as a sharded MongoDB cluster.

There have also been experiments with Hazelcast, which were promising, and Peter can give more details on this.

Say I start an instance of an app that Brooklyn understands (let's assume that this app was deployed by Brooklyn but somehow I was able to turn off all the Brooklyn components and leave the app running). Can Brooklyn create new instances of all the Brooklyn components it would have created if it had deployed the app?

If I understand you correctly, this is a little harder. I assume you're talking about Brooklyn assuming control of a pre-existing application? In this case, the management context would never have existed in the first place, so there would need to be some sort of discovery mechanism to enumerate the components and their configurations, which is not something that is currently being looked at.

Brooklyn needs at least the application description entity code, which it uses to deploy everything initially. If this existed then it should be possible to rebind the entities, if their locations and other properties are known. What is harder is re-constructing the state of relationships between the entities if this information was never present in the first place.
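A small sketch of what that matching step might look like, purely as an illustration: `AdoptionSketch` and its `match` method are hypothetical, the application description is reduced to a list of entity names, and the "known properties" to a map from entity name to location. Entities without a known location are reported as unresolved rather than guessed at, and relationships between entities are not modelled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for adopting a running application; not a Brooklyn API. */
final class AdoptionSketch {

    /** Outcome of matching an application description against what is known about the running app. */
    static final class Result {
        final List<String> rebound = new ArrayList<>();
        final List<String> unresolved = new ArrayList<>();
    }

    /** Pairs each described entity with its known location, or records it as unresolved. */
    static Result match(List<String> describedEntities, Map<String, String> knownLocations) {
        Result result = new Result();
        for (String entity : describedEntities) {
            String location = knownLocations.get(entity);
            if (location != null) {
                result.rebound.add(entity + "@" + location);
            } else {
                result.unresolved.add(entity);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = match(
                List.of("load-balancer", "web", "db"),
                Map.of("web", "host-a", "db", "host-b"));
        System.out.println("Rebound: " + result.rebound);
        System.out.println("Needs discovery: " + result.unresolved);
    }
}
```

Anything that lands in the unresolved list is where a discovery mechanism of the kind mentioned earlier would have to take over.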

Can you give more details about the sort of application or deployment you would want to manage where this sort of circumstance would occur?

I hope this was helpful. I would also appreciate any comments on my proposed architecture description.

grkvlt/rebind.md