Created
June 25, 2020 09:14
-
-
Save jeremy-rutman/6c919a596ccc94cd4e3706775bd2ee74 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Some notes on torchtext | |
| you can read a csv and generate vocab like this | |
| tokenize = lambda x: str(x).split() # see if this fixes float vs. string error | |
| TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, include_lengths=True, batch_first=True, | |
| fix_length=200) | |
| LABEL = data.LabelField() # LABEL = data.LabelField(tensor_type=torch.FloatTensor) | |
| fields = {'textcol': ('text', TEXT), 'real': ('label',LABEL)} | |
| train_data, test_data, validate_data = data.TabularDataset.splits( | |
| path=path, | |
| train=trainfile, | |
| test=testfile, | |
| validation=validationfile, | |
| format='csv', #csv_reader_params= | |
| fields=fields | |
| ) | |
| TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=300)) | |
| You can check that out-of-vocab words have been mapped to 0 vectors by getting the index then lookng at the vector | |
| TEXT.vocab.stoi['hello'] | |
| TEXT.vectors[22] | |
| TEXT.vocab.stoi['shmoogywoogies'] | |
| TEXT.vectors[5202] |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment