Hi @sridhardev07!
The simplest trick I could suggest is to convert all vectors from float32 to float16; this will reduce memory requirements by half without significantly affecting the quality.
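For example, a minimal sketch (the array name, shape, and contents here are placeholders, assuming the embeddings sit in a NumPy float32 array):

```python
import numpy as np

# Placeholder embeddings: shape and values are illustrative only.
embeddings = np.random.rand(10_000, 768).astype(np.float32)

# Casting to float16 halves the memory used by the stored vectors.
embeddings_fp16 = embeddings.astype(np.float16)

print(embeddings.nbytes / 1e6, "MB ->", embeddings_fp16.nbytes / 1e6, "MB")
```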
If this does not suffice, you could look at https://github.com/facebookresearch/faiss - a library for fast vector similarity search that allegedly can work with very large sets. Specifically, they implement product quantization for lossy compression of the vectors. If you choose to use Faiss, you should rewrite my solution: unite `process_sentences` and `build_search_index` into a single method that processes the sentences incrementally and adds their vectors to a `faiss.IndexIVFPQ` instead of a KDTree, roughly along the lines of the sketch below.
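A rough sketch of the idea (the index parameters and the random data are placeholders, not tuned values):

```python
import faiss
import numpy as np

d = 768  # embedding dimension

# IVFPQ compresses vectors with product quantization; nlist / m / nbits
# below are placeholder values that need tuning for real data.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 100, 16, 8)

# The index must be trained before vectors can be added;
# a representative subsample of embeddings is enough for training.
training_sample = np.random.rand(20_000, d).astype(np.float32)
index.train(training_sample)

# After training, embeddings can be added incrementally, batch by batch.
for _ in range(10):
    batch = np.random.rand(1_000, d).astype(np.float32)
    index.add(batch)

faiss.write_index(index, "example.index")
```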
Hi @avidale! Thanks for the answer!
I tried converting the vectors to float16; it does help reduce the size, but not by enough, since I am working with a large dataset.
I tried the second approach with Faiss. It worked well when I tried the Flat index, since I can add to the index incrementally, but saving it to disk takes a lot of storage: approximately 1 GB for 15K sentences. Here is what I did:
```python
def __init__(self, sentences, model):
    self.sentences = sentences
    self.model = model
    # Flat index: stores the raw 768-dimensional float32 vectors without compression
    self.index = faiss.IndexFlatL2(768)

def process_sentences(self):
    result = self.model(self.sentences)
    self.sentence_ids = []
    self.token_ids = []
    self.all_tokens = []
    for i, (toks, embs) in enumerate(tqdm(result)):
        # initialize all_embeddings for every new sentence
        all_embeddings = []
        for j, (tok, emb) in enumerate(zip(toks, embs)):
            self.sentence_ids.append(i)
            self.token_ids.append(j)
            self.all_tokens.append(tok)
            all_embeddings.append(emb)
        all_embeddings = np.stack(all_embeddings)  # add embeddings after every sentence
        self.index.add(all_embeddings)
    faiss.write_index(self.index, "faiss_Model")  # persist the index to disk
```
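That size is roughly what a flat index should take, since it stores every token vector in full float32 precision. A back-of-the-envelope check (the tokens-per-sentence figure is just an assumption):

```python
# Rough size estimate for an uncompressed IndexFlatL2 over token embeddings.
dim = 768
bytes_per_vector = dim * 4            # float32 = 4 bytes per dimension
tokens_per_sentence = 20              # assumption; depends on the actual data
num_sentences = 15_000

approx_bytes = num_sentences * tokens_per_sentence * bytes_per_vector
print(f"~{approx_bytes / 1e9:.2f} GB")  # roughly 0.9 GB, close to the ~1 GB observed
```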
Then I tried `faiss.IndexIVFPQ`. It works well, but it does not work for incremental indexing, since it needs training data too, so I have to compute all the embeddings first and only then train and add. Again the resulting index is small, but building it takes too much RAM, which causes issues when working with large data. Here is what I did:
```python
def __init__(self, sentences, model):
    self.sentences = sentences
    self.model = model
    self.quantizer = faiss.IndexFlatL2(768)
    # IVFPQ: 1000 inverted lists, 16 sub-quantizers of 8 bits each
    self.index = faiss.IndexIVFPQ(self.quantizer, 768, 1000, 16, 8)

def process_sentences(self):
    result = self.model(self.sentences)
    self.sentence_ids = []
    self.token_ids = []
    self.all_tokens = []
    all_embeddings = []
    for i, (toks, embs) in enumerate(tqdm(result)):
        for j, (tok, emb) in enumerate(zip(toks, embs)):
            self.sentence_ids.append(i)
            self.token_ids.append(j)
            self.all_tokens.append(tok)
            all_embeddings.append(emb)
    all_embeddings = np.stack(all_embeddings)  # all token embeddings held in RAM at once
    self.index.train(all_embeddings)           # train
    self.index.add(all_embeddings)             # add to index
    faiss.write_index(self.index, "faiss_Model_mini")
```
Hi @avidale, I tried this with a bigger dataset to test the accuracy: about 37,126 sentences. It is giving me a memory error: `numpy.core._exceptions.MemoryError: Unable to allocate 2.35 GiB for an array with shape (819827, 768) and data type float32`.
I have 16 GB of RAM. Can you suggest an alternative approach that uses less RAM, or that retrieves the data from disk?
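For reference, the number in the error matches stacking all 819,827 token embeddings into one float32 array at once (a quick check using only the figures from the traceback):

```python
# The failing allocation is the single np.stack over all token embeddings.
num_vectors = 819_827
dim = 768
bytes_needed = num_vectors * dim * 4  # float32 = 4 bytes per value
print(bytes_needed / 2**30)           # ≈ 2.35 GiB, matching the error
```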