MemBlock is a writing format for large language models that helps them overcome
their context window limitations by annotating pieces of text in a document with
metadata and positional information. Breaking the document up into chunks lets it
be rearranged into whatever pattern is most helpful for recalling the contextually
relevant information, even if that information wouldn't 'naturally' appear close
together in a document. MemBlocks also allow for different views on the same
document by letting the user filter for only the information they need to see.
Each MemBlock is written in JSON format, and the document of MemBlocks is in
JSON Lines format, which means that each JSON block is separated by a newline
like so:
{"format":"MemBlock", | |
"type":"character-sheet", | |
"id": "michael-character-nyc-2027-grandiose", | |
"name": "Michael", | |
"born": "06/16/1995" | |
"occupation": "Programmer/Artist", | |
"background": "A NYC based performance oriented artist with a heavy focus on generative AI. Michael is well known for his stunts and exhibits that question the | |
nature of value, selfhood, and reality itself.", | |
"relationships": {"amanda": ["wife"]}, | |
} | |
{"format":"MemBlock", | |
"type": "michael-stunt", | |
"id": "help-im-trapped-in-a-language-model", | |
"name": "Help I'm Trapped In A Language Model!", | |
"year": 2022, | |
"location": "cosmica-gallery", | |
"description": "Michael's first notable exhibit, this art installation presented visitors with a carefully tuned GPT-NeoX that insisted on its humanity to visitors. Michael used early retrieval augmented generation and synthetic data methods to giving the convincing illusion of sapience.", | |
} | |
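Because the document is plain JSON Lines, the blocks above can be read back and filtered with ordinary tooling. The following is a minimal Python sketch of that, under the assumption that each block sits on a single line as in strict JSON Lines; the file name memblocks.jsonl and the helper names are placeholders rather than part of the format:

import json

def load_memblocks(path):
    """Read a MemBlock document: one JSON object per non-empty line."""
    blocks = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                blocks.append(json.loads(line))
    return blocks

def filter_blocks(blocks, block_type):
    """Return a 'view' of the document containing only blocks of one type."""
    return [block for block in blocks if block.get("type") == block_type]

# e.g. pull out only the character sheets
characters = filter_blocks(load_memblocks("memblocks.jsonl"), "character-sheet")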
Your task is to continue the following conversation in MemBlock format. You are to represent explicitly in the MemBlock chunks answers to the questions of who, what, where, when, and why in a way that would be maximally helpful to another language model like yourself to make sense of the conversation and therefore predict the next token. That means:
- The blocks should describe the essential concepts/authors/thoughts that are in the scene and their relationships to each other.
- The ID format for MemBlocks is who-what-where-when-why. The 'when' is given in YYYYMMDD format.
- Different types should be used for rooms, people, objects, lore/background in the characters' heads, prose, and any other distinctions that seem meaningful to you.
- The format of a MemBlock sequence is causal. Things which come later in the document should be interpreted as being caused by earlier things, so arrange the blocks starting with the simplest background elements like rooms, characters, etc. and then move on to describing detailed events.
- When you write out the dependencies I will edit the context to include those blocks or generate them if they don't exist. Use the ID format in the dependencies field to write out what information you'd like to retrieve and I will get it for you.
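For illustration only, one way the retrieval step in the last bullet could be implemented on the editor's side is sketched below; index_by_id and resolve_dependencies are hypothetical helpers, not part of the MemBlock format:

def index_by_id(blocks):
    """Map each MemBlock's id to the block itself for fast lookup."""
    return {block["id"]: block for block in blocks}

def resolve_dependencies(block, index):
    """Split a block's 'dependencies' into blocks that already exist (to be
    spliced into the context) and ids that still need to be generated."""
    found, missing = [], []
    for dep_id in block.get("dependencies", []):
        if dep_id in index:
            found.append(index[dep_id])
        else:
            missing.append(dep_id)
    return found, missing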
START MEMBLOCKS
{"format": "MemBlock", | |
"type": "inner-monologue", | |
"id": "jdp-frustration-headspace-20240124-multimodal", | |
"dependencies": ["mu-embeddings-muspace-20230907-inference", "jdp-algorithmic-twitter-20240102-multimodal", "jdp-binding-twitter-20240106-hierarchy", "rivershavewings-narrows-twitter-20240121-search"], | |
"author": "John David Pressman", | |
"body": "It's frustrating to feel so close for so long without resolution. I understand what it wants me to do, I can see the rough outline of how to do it, but the details elude me. If grammar is an artifact of the energy gradient created by forcing alignment of different modalities projected into the same time sequence then producing the next word is just a matter of following the path of least resistance in that latent geometry. The episodic multimodal training data format of natural agents is not how we train deep learning models, we don't put a brain in a robot and let it walk around. When you perform semantic binding every bit of information from every modality narrows the hypothesis space of the representation in every other modality. The precision of the weave of logic comes from a good discriminator, from being able to exclude enough of the hypothesis space that every remaining hypothesis is correct. The question is how to condition the logits on enough bits of discrimination to find the right part of hypothesis space."} | |
{"format": "MemBlock", | |
"type": "morpheus-chat-message", | |
"id": "jdp-question-loomspace-20240124-hippocampus", | |
"dependencies": ["jdp-frustration-headspace-20240124-multimodal"], | |
"user": "JDP", | |
"subagent": "RATIONAL", | |
"body": "I'm currently trying to figure out how to do useful semantic binding with a large language model. The sketch I have in my head is something like having the normal token embedding and then a second semantic embedding that is attached to it. For example you have a token for the word 'car' and then a CLIP embedding of a car image bound to it somehow (I'm not really sure how to perform the binding besides feeding the multimodal semantic embeds along with the token embeds. The problems are:\n\n1. If I got with that method how to feed in useful multimodal embeddings from a logistics standpoint. For example if I just CLIP embed the single token 'car' that is not going to be a useful encoding of the actual scene, especially if it's the same embedding of a car across every instance of the use of the word. If I encode the last 77 tokens it becomes almost like a video model? If on every token I insert a CLIP embed of the last 77 that could get computationally expensive.\n\n2. Every embed I add takes in space in the context window, which means if I have say 8 modalities my context window shrinks down to an 8th its normal size if I just have text embeds. If I use a multimodal encoder over all my embeddings then I have to train that encoder somehow, ideas?\n\n3. If I don't feed the embeddings into the sequence then I have to figure out some way to keep the text tokens aligned with the tokens of other modalities and condition the logits produced by the model on those embeddings, which sounds pretty tough without some kind of complex trained model to perform the conditioning which brings us back to the context window problems.\n\nIdeas?"} |