Chef's current data bag implementation has several problems. Chef provides no help when dealing with encrypted data, and it can be confusing when items are missing or don't work correctly. Some cookbooks assume a particular encryption mechanism, while others assume none.
Altogether, data bags are confusing and hard to use and maintain.
This document points out these problems, tries to understand them, and proposes some solutions. It isn't meant to be a complete solution and shouldn't be taken as such.
The data bag model is an efficient way to store large amounts of variable data. Putting data bags into attributes would be horribly slow.
Data bags do offer some ability to keep information secret, e.g. passwords, private signing keys, etc.
Currently, the main reason to use encryption is access control.
The nodes (and thus an administrator on the node) can see anything the encrypted_data_bag_secret can access. So the node gains nothing by having the data encrypted.
In addition, the data is usually decrypted and used on the server, which means that even if you can't access the encrypted_data_bag_secret, you can usually still see the resulting data on the file system.
The network is already encrypted. Chef uses SSL to transmit the data and guarantees the server and client are paired correctly (if SSL validation is turned on). Encrypting the data again doesn't give us anything.
The Chef server is king. The only reason to encrypt data here is to prevent the Chef server from knowing about it. This is pointless, as the Chef server can change the state of nodes and have them report secrets back to the server or some other system.
Because of this, I think encryption should be ditched in favor of something else, such as access controls.
One thing people do want is to control who and what can see the raw data.
I propose that we build an access control mechanism. A simple first version can just be a list of clients. But ideally, we want something more dynamic, such as the ability to specify only admin users or nodes with certain roles or recipes.
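As a rough illustration of that idea, here is a minimal sketch of an access check that combines a static client list with role-based rules and an admin flag. Everything in it (`Requester`, `DataStoreAcl`, `allow?`, the host names) is hypothetical and does not exist in Chef; it only shows the shape such a rule could take.

```ruby
# Hypothetical sketch: none of these names exist in Chef.
# A requester is whoever asks for the data: a node's client or a user.
Requester = Struct.new(:name, :admin, :roles, keyword_init: true)

class DataStoreAcl
  # clients: explicit client names allowed to read
  # roles:   any requester holding one of these roles may read
  # admins:  whether admin users may always read
  def initialize(clients: [], roles: [], admins: true)
    @clients = clients
    @roles = roles
    @admins = admins
  end

  def allow?(requester)
    return true if @admins && requester.admin
    return true if @clients.include?(requester.name)

    (requester.roles & @roles).any?
  end
end

acl = DataStoreAcl.new(clients: ['web01.example.com'], roles: ['load_balancer'])

web = Requester.new(name: 'web01.example.com', admin: false, roles: ['web'])
lb  = Requester.new(name: 'lb01.example.com', admin: false, roles: ['load_balancer'])
db  = Requester.new(name: 'db01.example.com', admin: false, roles: ['database'])

acl.allow?(web) # allowed by the static client list
acl.allow?(lb)  # allowed by role
acl.allow?(db)  # denied
```

The static list corresponds to the "simple first version"; the role check is one possible form of the more dynamic rules.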
The data bag mental model makes understanding data bags a bit difficult: Bag name, Item name, then a key/value structure...
I suggest using a simpler system instead.
An example of a stateless filesystem based model:
data_bag_item('certificates', 'wildcard')['cert']
# would become
new_data_store('/certificates/wildcard')['cert']
# and would allow
new_data_store('/arbitrary/long/or/short/paths')['cert']
# or
new_data_store.ls('/certificates') # => ['wildcard']
# or
new_data_store.exist?('/certificates/wildcard') # => true
# In a recipe
file '/etc/mysslcert' do
  owner 'root'
  group 'root'
  mode '0644'
  content new_data_store('/certificates/wildcard')['cert']
end
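To make the stateless model concrete, here is a minimal in-memory sketch of what could sit behind `new_data_store`. The `PathDataStore` class and the sample data are assumptions for illustration only; a real version would talk to the Chef server rather than a local hash.

```ruby
# Hypothetical sketch: an in-memory stand-in for the proposed store.
class PathDataStore
  # items maps absolute paths to key/value hashes.
  def initialize(items)
    @items = items
  end

  # Returns the item stored at path; raises KeyError if it is missing.
  def [](path)
    @items.fetch(path)
  end

  def exist?(path)
    @items.key?(path)
  end

  # Lists the immediate children of a directory-like path.
  def ls(dir)
    prefix = dir.end_with?('/') ? dir : "#{dir}/"
    @items.keys
          .select { |path| path.start_with?(prefix) }
          .map { |path| path.delete_prefix(prefix).split('/').first }
          .uniq
  end
end

STORE = PathDataStore.new(
  '/certificates/wildcard' => { 'cert' => 'PEM data' }
)

# With a path, return the item; without one, return the store itself.
def new_data_store(path = nil)
  path ? STORE[path] : STORE
end

new_data_store('/certificates/wildcard')['cert'] # => "PEM data"
new_data_store.ls('/certificates')               # => ["wildcard"]
new_data_store.exist?('/certificates/wildcard')  # => true
```

Because every call carries the full path, no state is shared between recipes.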
An alternative stateful filesystem model:
data_bag_item('certificates', 'wildcard')['cert']
# would become
data_store.cd '/certificates/wildcard'
file '/etc/mysslcert' do
  owner 'root'
  group 'root'
  mode '0644'
  content data_store.get('cert')
end
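A minimal sketch of the stateful variant, again backed by an in-memory hash. `StatefulDataStore` is a hypothetical name, and the decision to raise on an unknown path is an assumption rather than part of the proposal.

```ruby
# Hypothetical sketch: a store that remembers a current path.
class StatefulDataStore
  # items maps absolute paths to key/value hashes.
  def initialize(items)
    @items = items
    @cwd = '/'
  end

  # Changes the current path; later get calls read from this item.
  def cd(path)
    raise KeyError, "no such path: #{path}" unless @items.key?(path)

    @cwd = path
    self
  end

  # Reads a key from the item at the current path.
  def get(key)
    @items.fetch(@cwd).fetch(key)
  end
end

data_store = StatefulDataStore.new(
  '/certificates/wildcard' => { 'cert' => 'PEM data' }
)

data_store.cd '/certificates/wildcard'
data_store.get('cert') # => "PEM data"
```

One trade-off to keep in mind: with a current path, the result of `get` depends on whichever `cd` ran last, so the order of calls starts to matter.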
An example of a hash-based model:
data_bag_item('certificates', 'wildcard')['cert']
# would become
new_data_store['certificates']['wildcard']['cert']
An example of a method based model:
data_bag_item('certificates', 'wildcard')['cert']
# would become
new_data_store.certificates.wildcard.cert
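For illustration, the method-based model could be sketched with Ruby's `method_missing` over nested hashes. `MethodDataStore` is a hypothetical name and the sample data is made up.

```ruby
# Hypothetical sketch: wraps nested hashes so keys read as method calls.
class MethodDataStore
  def initialize(data)
    @data = data
  end

  def method_missing(name, *args)
    key = name.to_s
    return super unless args.empty? && @data.key?(key)

    value = @data[key]
    # Wrap nested hashes so that lookups can be chained.
    value.is_a?(Hash) ? MethodDataStore.new(value) : value
  end

  def respond_to_missing?(name, include_private = false)
    @data.key?(name.to_s) || super
  end
end

new_data_store = MethodDataStore.new(
  'certificates' => { 'wildcard' => { 'cert' => 'PEM data' } }
)

new_data_store.certificates.wildcard.cert # => "PEM data"
```

A caveat of this style: a key that matches an existing Ruby method name (for example `class` or `hash`) never reaches `method_missing`, so such items would need another way to be addressed.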
An alternative fetch based model:
data_bag_item('certificates', 'wildcard')['cert']
# would become
fetch_data_store(category: 'certificates', name: 'wildcard')['cert']
While the current format of data bags (JSON) is easy for Chef to work with, it is a bit painful for humans to work with.
It would be nice if we just used the filesystem directly, mapping filenames to keys and file contents to values.
This is because, while some data bags hold small keys and values, many of them hold large files.
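A minimal sketch of that mapping, assuming one directory per item and one file per key. The helper name `load_item_from_directory` and the directory layout are hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical sketch: reads one directory as one item.
# Each regular file becomes a key; its contents become the value.
# Subdirectories are skipped.
def load_item_from_directory(dir)
  Dir.children(dir).sort.each_with_object({}) do |name, item|
    path = File.join(dir, name)
    item[name] = File.read(path) if File.file?(path)
  end
end

# Usage: build a throwaway layout and read it back.
Dir.mktmpdir do |root|
  item_dir = File.join(root, 'certificates', 'wildcard')
  FileUtils.mkdir_p(item_dir)
  File.write(File.join(item_dir, 'cert'), "PEM data\n")

  item = load_item_from_directory(item_dir)
  puts item['cert']
end
```

With this layout a large certificate or zone file can be edited as a normal file instead of being escaped into a JSON string.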
A completely alternative mechanism would be to extend attributes to allow access controls and large (lazily loaded) data.
The main use for encrypted content in data bags is passwords and private keys.
Almost all other data I have seen (note: I'm a limited audience, so please add more use cases if they exist) has been non-secret data, such as already-encrypted-with-a-key SSL certificates and large files such as DNS records or username lists.
It might be good to extend the attributes to have access controls. This would allow putting passwords into attributes and keeping them safe (and unfortunately, unsearchable).
If we do that, then maybe attributes could also be extended to handle large files (lazily loading them if needed). I think that explicitly marking certain attributes as "large" and not including them in the ohai data (maybe unless accessed) would be acceptable.
It would also have the advantage of exposing the large data that is used by a node.
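The lazy-loading idea can be sketched as a wrapper that fetches its value only on first access and records that it was used. `LazyAttribute` is a hypothetical name; how the loader would actually fetch data from the Chef server is left open.

```ruby
# Hypothetical sketch: an attribute whose value is fetched on first read.
class LazyAttribute
  # The block is the loader; it runs at most once.
  def initialize(&loader)
    @loader = loader
    @loaded = false
  end

  # True once the value has been read, i.e. the node actually used it.
  def loaded?
    @loaded
  end

  def value
    unless @loaded
      @value = @loader.call
      @loaded = true
    end
    @value
  end
end

zone = LazyAttribute.new { 'large zone file contents' }

zone.loaded? # => false, nothing has been fetched yet
zone.value   # => "large zone file contents"
zone.loaded? # => true
```

A flag like `loaded?` is one way the large data a node really used could be reported back.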
This isn't complete and I plan on adding more to it as I think of it. Please feel free to comment below or fork it or whatever.