avidale/bert_knn.ipynb

Last active February 11, 2024 16:08


AsmaZbt commented Sep 3, 2020

The length of the date list is 8804, but you are using an index of 10759; that is why you get an index-out-of-range error.
I need to think about it.

AsmaZbt commented Sep 3, 2020 • edited

Do you have two files, a .txt and a .csv? If so, the list of sentences may not be the same length as your CSV file.

I assumed your sentences came from the same CSV file, so that the date and para lists would have the same length. I'm sorry, but I can't help further right now.

You need some way to link the sentences to the dates (the appropriate date for each sentence); then you can solve the problem.
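One way to keep that link is to read the sentence and its date from the same CSV row, so that position i always refers to the same record. This is only a minimal sketch using the standard csv module; the column names content and date are taken from the code later in this thread, and the helper name load_sentences_and_dates and the sample rows are made up for illustration.

```python
import csv
import io

def load_sentences_and_dates(csv_file, text_col="content", date_col="date"):
    """Read both columns from the same rows so index i means the same record."""
    sentences, dates = [], []
    for row in csv.DictReader(csv_file):
        # Collapse embedded newlines so one record stays exactly one sentence
        sentences.append(" ".join(row[text_col].split()))
        dates.append(row[date_col])
    assert len(sentences) == len(dates)
    return sentences, dates

# Hypothetical sample: the first record contains a line break inside the text
sample = io.StringIO(
    'content,date\n"first line\nsecond line",2020-09-01\nanother text,2020-09-02\n'
)
sentences, dates = load_sentences_and_dates(sample)
print(len(sentences), len(dates))
```

With the two lists built together, a hit at sentence index i can be mapped to dates[i] without going through an intermediate text file.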

Shafi2016 commented Sep 3, 2020

It is the same CSV file. I first convert the CSV to a list and then to text, as in the original example.
import pandas as pd
from tqdm import tqdm

df = pd.read_csv("/content/df3.csv", parse_dates=True)
df = df.set_index("content")
df.head(1)
text_dict = df.to_dict()
len_text = len(text_dict["date"])
df = df["date"].to_dict()
df_sentences_list = list(df.keys())
len(df_sentences_list)
df_sentences_list = [str(d) for d in tqdm(df_sentences_list)]
file_content = "\n".join(df_sentences_list)
with open("input_text.txt", "w") as f:
    f.write(file_content)

with open("/content/input_text.txt", "r") as f:
    lines1 = f.readlines()
lines1[0]
all_sentences = [l.split('\t')[0] for l in lines1]

For the dates, we use the same CSV file again; I only disable this line: df = df.set_index("content")
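Two things in a round trip like this could change the number of sentences, and either may or may not apply to this data: using the text as dictionary keys collapses duplicate texts, and a text that contains a line break becomes several lines once it is written to a file and read back. The toy example below uses plain Python and invented rows to show both effects; it does not use the real CSV.

```python
# Invented rows: one duplicate text and one text with embedded line breaks
rows = ["same text", "two\nlines\nhere", "same text"]

# Using texts as dict keys (as a dict built from an indexed column does) drops duplicates
as_keys = list(dict.fromkeys(rows))

# Joining with "\n" and reading line by line splits multi-line texts apart
file_content = "\n".join(as_keys)
lines = file_content.splitlines()

print(len(rows), len(as_keys), len(lines))
```

If the three counts differ on the real data, the sentence list and the date list can no longer be matched by position.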

Shafi2016 commented Sep 3, 2020

If you give me your email address, I will send the cleaned-up code with a small sample of the data.

AsmaZbt commented Sep 3, 2020 via email

Send it to me at [email protected]


Shafi2016 commented Sep 3, 2020

Thanks a lot!!!

sridhardev07 commented Nov 1, 2021 • edited

@avidale Hi, thanks for the amazing work. I want to implement this, but I need a few points clarified:

Can it handle a large dataset, or do I need to implement something else on top of it?
After computing the embeddings, can I save them so that next time I just load the data and run a query?

Author

avidale commented Nov 1, 2021

@sridhardev07 yes and yes

sridhardev07 commented Nov 1, 2021

> @sridhardev07 yes and yes

Can you tell me how? That would be really helpful!
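One possible way to do the save-and-reload part is sketched below. It is not the notebook's own code: it stores the embedding matrix with np.save, stores the tokens as JSON, and answers a query with a brute-force distance scan instead of a KDTree. The function names, file names and the tiny test vectors are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_embeddings(folder, embeddings, tokens):
    """Write the vectors to embeddings.npy and the matching tokens to tokens.json."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    np.save(folder / "embeddings.npy", np.asarray(embeddings, dtype=np.float32))
    (folder / "tokens.json").write_text(json.dumps(list(tokens)), encoding="utf-8")

def load_embeddings(folder, mmap=True):
    """Load the vectors; with mmap=True they stay on disk and are paged in on demand."""
    folder = Path(folder)
    embeddings = np.load(folder / "embeddings.npy", mmap_mode="r" if mmap else None)
    tokens = json.loads((folder / "tokens.json").read_text(encoding="utf-8"))
    return embeddings, tokens

def nearest(embeddings, query, k=3):
    """Brute-force L2 search: indices and distances of the k closest rows."""
    distances = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Toy data: four 4-dimensional vectors standing in for real embeddings
with tempfile.TemporaryDirectory() as tmp:
    vectors = np.eye(4, dtype=np.float32)
    save_embeddings(tmp, vectors, ["a", "b", "c", "d"])
    loaded, tokens = load_embeddings(tmp, mmap=False)
    ids, dists = nearest(loaded, vectors[2], k=1)
    print(tokens[int(ids[0])], float(dists[0]))
```

The brute-force scan is only meant to show that a query works after reloading; for a large collection a proper index would replace the nearest function.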

sridhardev07 commented Nov 9, 2021

Hi @avidale, I tried this with a bigger dataset to test the accuracy. The dataset has about 37126 sentences, and I get a memory error: numpy.core._exceptions.MemoryError: Unable to allocate 2.35 GiB for an array with shape (819827, 768) and data type float32

I have 16 GB of RAM. Can you suggest an alternative that uses less RAM or retrieves the data from disk?

Author

avidale commented Nov 10, 2021

Hi @sridhardev07!

The simplest trick I could suggest is to convert all vectors from float32 to float16; this will reduce memory requirements by half without significantly affecting the quality.
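To put numbers on the float16 suggestion, the sketch below recomputes the size of the array from the error message for both data types and measures the rounding error on random data. The random sample is only a stand-in for real embeddings, so the error figure is indicative at best.

```python
import numpy as np

def gib(n_rows, n_dims, dtype):
    """Size in GiB of a dense array with the given shape and dtype."""
    return n_rows * n_dims * np.dtype(dtype).itemsize / 2**30

# Shape taken from the MemoryError message in this thread
full = gib(819_827, 768, np.float32)
half = gib(819_827, 768, np.float16)
print(f"float32: {full:.2f} GiB, float16: {half:.2f} GiB")

# Stand-in data (assumption): normally distributed values instead of BERT vectors
rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 768)).astype(np.float32)
sample16 = sample.astype(np.float16)
max_abs_error = float(np.abs(sample - sample16.astype(np.float32)).max())
print(f"largest rounding error on the sample: {max_abs_error:.5f}")
```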

If this does not suffice, you could look at https://github.com/facebookresearch/faiss - a library for fast vector similarity search that allegedly can work with very large sets. Specifically, they implement product quantization for lossy compression of the vectors. If you choose to use Faiss, you should rewrite my solution: unite process_sentences and build_search_index into a single function that processes the sentences incrementally and adds their vectors to a faiss.IndexIVFPQ instead of a KDTree.

sridhardev07 commented Nov 11, 2021

Hi @avidale ! Thanks for the answer!

I tried converting the vectors to float16. It does reduce the size, but not by enough, as I am working with a large dataset.

I tried the second approach with Faiss. It worked well with a Flat index, since I can add to the index incrementally. But saving it to disk takes a lot of storage: approximately 1 GB for 15K sentences. Here is what I did:

    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model
        self.index = faiss.IndexFlatL2(768)

    def process_sentences(self):
        result = self.model(self.sentences)
        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            # initialize all_embeddings for every new sentence
            all_embeddings = []
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)

            all_embeddings = np.stack(all_embeddings)  # Add embeddings after every sentence
            self.index.add(all_embeddings)

        faiss.write_index(self.index, "faiss_Model")
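A rough size estimate may explain the 1 GB figure. A Flat index keeps every token vector uncompressed, so its size is about the number of vectors times 768 times 4 bytes. The sketch below assumes that the 15K-sentence set has the same number of tokens per sentence as the 37126-sentence dataset from the earlier error message, which is a guess. The product-quantization figure counts only the compressed codes (16 sub-codes of 8 bits, as in a 16 x 8 configuration) and leaves out ids, coarse centroids and codebooks.

```python
DIM = 768  # embedding size used in the snippet above

def flat_bytes(n_vectors, dim=DIM, bytes_per_float=4):
    """Raw float32 storage of a Flat index: one full vector per token."""
    return n_vectors * dim * bytes_per_float

def pq_code_bytes(n_vectors, m=16, nbits=8):
    """Size of the product-quantization codes alone: m sub-codes of nbits each."""
    return n_vectors * m * nbits // 8

# Assumption: tokens per sentence as in the 819827-vector / 37126-sentence dataset
tokens_per_sentence = 819_827 / 37_126
n_vectors = round(15_000 * tokens_per_sentence)

print(f"vectors:  {n_vectors}")
print(f"Flat:     {flat_bytes(n_vectors) / 2**30:.2f} GiB")
print(f"PQ codes: {pq_code_bytes(n_vectors) / 2**20:.1f} MiB")
```

Under these assumptions the Flat estimate lands close to the reported size, which suggests the storage cost comes from the uncompressed vectors rather than from a problem in the code.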

Then I tried faiss.IndexIVFPQ. It works well, but not for incremental indexing, because the index has to be trained first. So I need to compute all the embeddings, then train, then add. The index on disk is small, but the process uses too much RAM, which is a problem with large data. Here is what I did:

    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model
        self.quantizer = faiss.IndexFlatL2(768)
        self.index = faiss.IndexIVFPQ(self.quantizer, 768, 1000, 16, 8)

    def process_sentences(self):
        result = self.model(self.sentences)
        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        all_embeddings = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)

        all_embeddings = np.stack(all_embeddings)
        self.index.train(all_embeddings)  # Train
        self.index.add(all_embeddings)  # Add to index
        faiss.write_index(self.index, "faiss_Model_mini")
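A possible way around holding every embedding in RAM is to train the index once on an initial sample and then add the remaining vectors in batches, since a trained index accepts further add calls. The helper below is a sketch under several assumptions: add_in_batches, train_size and batch_size are invented names, the index object only needs train, add and is_trained (which Faiss indexes provide), and the model must yield its (tokens, embeddings) pairs lazily, otherwise everything is in memory anyway.

```python
import numpy as np

def add_in_batches(index, embedded_sentences, train_size=50_000, batch_size=10_000):
    """Train `index` on the first `train_size` vectors, then add everything in batches.

    `index` is assumed to expose train(x), add(x) and is_trained.
    `embedded_sentences` yields (tokens, embeddings) pairs, one per sentence.
    Returns (sentence_ids, token_ids, all_tokens), aligned with the index rows.
    """
    sentence_ids, token_ids, all_tokens = [], [], []
    buffer = []

    def flush():
        if not buffer:
            return
        batch = np.stack(buffer).astype(np.float32)  # Faiss expects float32 input
        if not index.is_trained:
            index.train(batch)  # the first flush doubles as the training sample
        index.add(batch)
        buffer.clear()

    for i, (toks, embs) in enumerate(embedded_sentences):
        for j, (tok, emb) in enumerate(zip(toks, embs)):
            sentence_ids.append(i)
            token_ids.append(j)
            all_tokens.append(tok)
            buffer.append(emb)
        # Collect a larger buffer before training, smaller ones afterwards
        limit = batch_size if index.is_trained else train_size
        if len(buffer) >= limit:
            flush()
    flush()  # whatever is left at the end
    return sentence_ids, token_ids, all_tokens
```

With Faiss, the index built in __init__ above could be passed in as index and written out afterwards with faiss.write_index. The training sample should be large relative to the number of clusters (1000 in the snippet above); the default of 50,000 here is a guess, not a tuned value. Because the sample is simply the first vectors seen, it may not represent the whole dataset well, so shuffling the sentences first may help.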