Core parts:
The token should include links to both the low-resolution and high-resolution models.
The 3D model should include a low-resolution .vox stand-in that can be used when rendering the NFT from far away. A gallery or store may contain dozens of NFTs, so we need a way to render a representation of each NFT at high speed with a single draw call. .vox models can be loaded freely in most engines and are rendered with simple vertex shading rather than textures, which makes them a good fit for a stand-in. I recommend adopting a maximum size of 32x32x32 for this .vox model.
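A minimal sketch of how a marketplace or game could check the 32x32x32 recommendation. It assumes the stand-in uses the MagicaVoxel-style .vox chunk layout (a `VOX ` magic, a version integer, then chunks with a 4-byte id and two length fields). The function names `vox_sizes` and `within_limit` are made up for this example.

```python
import struct

MAX_DIM = 32  # the 32x32x32 limit recommended above


def vox_sizes(data: bytes) -> list:
    """Return the (x, y, z) dimensions of every SIZE chunk in the file."""
    if data[:4] != b"VOX ":
        raise ValueError("missing 'VOX ' magic, not a .vox file")
    sizes = []
    pos = 8  # skip the 4-byte magic and the 4-byte version number
    while pos + 12 <= len(data):
        # Chunk header: id, content byte count, children byte count
        chunk_id, content_len, _children_len = struct.unpack_from("<4sii", data, pos)
        pos += 12
        if chunk_id == b"SIZE":
            sizes.append(struct.unpack_from("<iii", data, pos))
        # Skip only the content; child chunks follow directly, so this
        # walks every chunk in the file in order.
        pos += content_len
    return sizes


def within_limit(data: bytes, max_dim: int = MAX_DIM) -> bool:
    """True if the file has at least one model and none exceeds max_dim."""
    sizes = vox_sizes(data)
    return bool(sizes) and all(d <= max_dim for size in sizes for d in size)
```

Usage would be something like `within_limit(open("standin.vox", "rb").read())`, where the file name is only a placeholder.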
The NFT may have a high-definition model in glTF. This model should be displayable on a mid-spec Android phone. I'm not sure what the limits should be, maybe around 10k faces and 50 draw calls (so a limited number of materials). It won't have any custom shaders, just PBR materials. A game displaying the model should switch from the low-res .vox model to the HD glTF model when the user keeps their mouse focused on the model for a few seconds.
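The hover-to-upgrade behaviour could be sketched as a small per-model state machine. The 2-second dwell time is an assumption standing in for "a few seconds", and dropping back to the .vox stand-in when focus is lost is also an assumption, since the text only describes the upgrade direction. `LodSelector` is a made-up name.

```python
from dataclasses import dataclass


@dataclass
class LodSelector:
    dwell_seconds: float = 2.0  # assumed value for "a few seconds"
    hover_time: float = 0.0
    use_hd: bool = False

    def update(self, hovered: bool, dt: float) -> str:
        """Advance by dt seconds and return which model to draw."""
        if hovered:
            self.hover_time += dt
            if self.hover_time >= self.dwell_seconds:
                self.use_hd = True
        else:
            # Assumption: losing focus resets the timer and the model.
            self.hover_time = 0.0
            self.use_hd = False
        return "gltf" if self.use_hd else "vox"
```

A game would call `update` once per frame for each visible NFT and only start loading the glTF file once it first returns `"gltf"`.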
When a user interacts with the model, games should display a summary of the information in the token JSON. This includes rarity, description, information about the collection the model is in, etc. Scripting and animation of the models are beyond the scope of this gist.
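As a rough illustration of the summary step, here is a sketch that turns a token JSON string into display text. The field names (`name`, `rarity`, `description`, `collection`) are hypothetical; this gist does not define a schema, so treat them as placeholders.

```python
import json


def summarize_token(token_json: str) -> str:
    """Build a short, human-readable summary from a token JSON string."""
    token = json.loads(token_json)
    # All keys below are assumed for illustration, not part of any spec.
    collection = token.get("collection") or {}
    lines = [
        f"Name: {token.get('name', 'unknown')}",
        f"Rarity: {token.get('rarity', 'unknown')}",
        f"Description: {token.get('description', '')}",
        f"Collection: {collection.get('name', 'unknown')}",
    ]
    return "\n".join(lines)
```

Missing fields fall back to placeholder text, so a game can still show something for tokens with sparse metadata.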
Can we start with goals/requirements first?
If the goal is actual interoperability in a standard, and to convince developers like ourselves to do significant new implementation work to be compatible, then we need to cater to the elephants in the room: existing game developers who are locked into an existing engine that looks different from the ones we have.
In that context I think having two types of models is immediately a tall barrier for anyone trying to interop. Limiting to .vox also doesn't fit with the aesthetics of a lot of games and engines, and forcing them into it is another design, and possibly technical, barrier. Something like glTF is a superset anyway. Finally, I don't think we can reasonably force rules like "no textures", because there are rendering styles that outright don't make sense once we start enforcing such things. See something SDF-based like Dreams or No Man's Sky. From my experience trying to solve this problem over the last few years and trying to get people on board, I would err on the side of prescribing as little as possible on the model side.
(That is, if the goal is to interoperate. If the goal of the standard is something else, like locking more people into CryptoVoxels, then something like .vox requirements makes sense.)