I was explaining to a non-technical stakeholder, in the context of Artificial Intelligence, how the process of dividing a large text into smaller texts (chunks) works. We used a library called LangChain to do this, so the explanation was about the internals of how LangChain creates the chunks from the input text.
This investigation started because we wanted to understand (and, if possible, control) how the library was breaking down the text to make the chunks.
PS: this conversation happened in a Slack channel on October 26th, 2023, and the content here is simply a copy-and-paste of that conversation.
PS²: Pinecone, mentioned at the end of the conversation, is a vector database. We use it to store the text chunks, so that we can query it with a piece of text and get back the similar chunks it has stored.
---- the conversation starts here ----