I was explaining to a non-technical stakeholder, in the context of Artificial Intelligence, how the process of dividing a large text into smaller texts (chunks) works. We used a library called LangChain for this process, so the explanation is about the internals of how LangChain creates the chunks from the input text.
This investigation started because we wanted to understand (and, if possible, to control) how this library was breaking down the text to make the chunks.
PS: this conversation happened in a Slack channel on October 26th, 2023, and the content here is simply a copy-and-paste of the conversation.
PS²: Pinecone, mentioned at the end of the conversation, is a vector database. We use it to store the text chunks, so we can query it with some piece of text and it will return similar chunks that it has stored.
---- the conversation starts here ----
Stenio Wagner [1:20 AM]:
Ok, I cracked how the chunks are made.
-
Currently, the functionality indicated by the LLM that we’re using to break down generic texts into chunks is the RecursiveCharacterTextSplitter.
It works using a strategy called “recursion”, where we basically keep splitting the same piece of text until we reach a certain condition.
-
This condition basically checks whether the chunk we currently generated has a certain number of characters (let’s call it chunk_size). If we already reached that number, we can consider that we generated a valid chunk. If not, we’ll keep taking the next characters in the text until we reach a chunk with the required length.
-
The algorithm will split the text using the following characters (in the order presented). Let’s call them divider_characters: \n\n, \n, ' ' (empty space), '' (the empty character)
-
Each \n represents a line break in the text.
-
The \n\n is only used once, in order to group the paragraphs.
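For reference, this is roughly how the splitter is created in code. A minimal sketch, assuming a 2023-era LangChain import path; I’m passing the separators explicitly just to make the order visible:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,      # max characters per chunk (toy value used in the example below)
    chunk_overlap=0,    # no shared characters between consecutive chunks (more on this later)
    separators=["\n\n", "\n", " ", ""],  # the divider_characters, in order
)
```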
This is a high-level explanation of the execution of this algorithm (a toy code sketch follows the steps):
1 - Split the text using the \n\n.
2 - For each group created after the split, do:
    1 - Split up the group using the next character in divider_characters.
    2 - For each group created after this split, analyze if the group has fewer characters than the number specified in chunk_size.
    3 - If this group has fewer characters than chunk_size, we concatenate it with the next group (if the sum of their lengths is less than or equal to chunk_size).
    4 - If not, and we still have some divider_character to analyze against the chunk, repeat this loop.
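To make this loop concrete, here is a toy sketch of the same strategy in Python. This is my own simplified reconstruction, not LangChain’s actual source code, but it follows the same split-then-merge idea:

```python
def recursive_split(text, divider_characters, chunk_size):
    # Toy reconstruction of the strategy described above;
    # not LangChain's real implementation.
    divider = divider_characters[0]
    remaining = divider_characters[1:]

    # '' (the empty character) means "split into individual characters"
    pieces = list(text) if divider == "" else text.split(divider)

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size and remaining:
            # The piece is still too big: flush what we merged so far
            # and split this piece again with the next divider_character.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        else:
            # Merge small pieces (re-inserting the divider) as long as
            # the result stays within chunk_size.
            candidate = current + divider + piece if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("Beloved Brazil", ["\n\n", "\n", " ", ""], 10))
# ['Beloved', 'Brazil']
```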
As an example, consider:
divider_characters = \n\n, \n, ' ' (empty space), ''
chunk_size = 10
The text-content (note the blank line, a \n\n, after the first line):
"Beloved Brazil

Roses are red
Violets are blue
Brazil is beautiful
How are you ?"
Result:
Group 1: Beloved Brazil
Group 2: Roses are red\nViolets are blue\nBrazil is beautiful\nHow are you ?
The length of Group 1 is greater than chunk_size, so we have to break it again using the next character in divider_characters (\n). But since this group doesn’t contain a \n, the split won’t do anything, and we move on to the next divider_character that is valid for this group: ' ' (empty space).
Group 1.1: Beloved
Group 1.2: Brazil
We can’t merge both groups, since it would generate a chunk with more characters than chunk_size, so we grab the next divider_character: the empty character. In this case, each group already has fewer characters than chunk_size, so splitting further wouldn’t change anything. And since we’re out of divider_characters, we can take Group 1.1 and Group 1.2 as valid chunks.
Chunks so far: [Beloved, Brazil]
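To ground those first steps with plain Python (str.split is doing all the work here; illustrative only):

```python
text = ("Beloved Brazil\n\n"
        "Roses are red\nViolets are blue\nBrazil is beautiful\nHow are you ?")

groups = text.split("\n\n")
# ['Beloved Brazil',
#  'Roses are red\nViolets are blue\nBrazil is beautiful\nHow are you ?']

# Group 1 has 14 characters (> chunk_size = 10) and contains no "\n",
# so the first split that actually changes it is the one on ' ':
print(groups[0].split(" "))   # ['Beloved', 'Brazil']
```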
The length of Group 2 is larger than chunk_size, so we can’t consider it a valid chunk either. Then, we split it with the next divider_character: \n
Group 1: Roses are red
Group 2: Violets are blue
Group 3: Brazil is beautiful
Group 4: How are you ?
Here, as we can see, all groups have more than chunk_size characters. So, we can split them up again using our next divider_character: empty space
Group 1.1: Roses
Group 1.2: are
Group 1.3: red
Now, we can try to concatenate Group 1.1 and Group 1.2. It works, since their combined length is less than or equal to chunk_size. But if we try to merge Group 1.3 as well, the length of the group will be greater than chunk_size. So, we can consider “Roses are” and “red” as valid chunks.
Chunks so far: [Beloved, Brazil, Roses are, red]
I don’t want to make it too long, but I think that you already get it. We just need to apply the same idea again to the other groups, and our final result will be:
Chunks:
Beloved, Brazil, Roses are, red, Violets, are blue, Brazil is, beautiful, How are, you ?
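As a sanity check, the real splitter can be run over the same text. A sketch, assuming a 2023-era LangChain; per the walkthrough above I’d expect the output below, though exact whitespace handling can vary a bit between versions:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Beloved Brazil\n\n"
        "Roses are red\nViolets are blue\nBrazil is beautiful\nHow are you ?")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
)
print(splitter.split_text(text))
# Expected, per the walkthrough above:
# ['Beloved', 'Brazil', 'Roses are', 'red', 'Violets',
#  'are blue', 'Brazil is', 'beautiful', 'How are', 'you ?']
```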
With this in mind, we can generate chunks with different sizes. The default value for the chunk_size used by the RecursiveCharacterTextSplitter is 1000.
It’s also worth mentioning that we’re limited by GPT and by the limit of tokens used in our question + its answer, so we can’t have, for example, a huge chunk (e.g. the entire text).
Also, we have a parameter called chunk_overlap, which basically overlaps the first n characters of the next chunk with the last n characters of the current chunk, where n is the chunk_overlap. With this, we can carry some of the context of the current chunk into the next chunk.
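A hypothetical helper to illustrate the effect (this is not how LangChain applies the overlap internally, where it happens while the splits are merged, but the end result is similar):

```python
def with_overlap(chunks, n):
    # Hypothetical illustration: prepend the last n characters of each
    # chunk to the chunk that follows it.
    result = [chunks[0]]
    for previous, current in zip(chunks, chunks[1:]):
        result.append(previous[-n:] + current)
    return result

print(with_overlap(["abcdefgh", "ijklmnop"], 3))
# ['abcdefgh', 'fghijklmnop']  ('fgh' now appears in both chunks)
```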
Anyway, I’m still afraid of losing some context in the generated chunks. If a chunk is too small, it can definitely lose context. But we can’t predict how big a chunk should be, since every text is unique.
Given that, I propose that we start trying to make chunks the size of one page. This way, a chunk won’t be too long, because we’re not considering the whole document, and not too small, since we’ll have at least one page of content to analyze. My idea is to send this page-chunk to Pinecone and ask it to generate the learning objectives from that page in isolation, instead of having to analyze all chunks.
I think that we can do that in Pinecone, but I didn’t try it yet, though. If it doesn’t work, I’ll elaborate something else.
Thoughts?
The Stakeholder [6:12 AM]:
Nice work!
Yes, agreed. Let’s do one topic per page, then distil those topics down to [user defined number], then write a learning objective for each topic
and have a little overlap between chunks to allow for context creep