@Felflare
Felflare / sentence_similarity_mult.ipynb
Created May 26, 2020 20:38
This snippet demonstrates a cross-language sentence-embedding system for similarity search and matching that outperforms LASER embeddings, based on [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) by Nils Reimers and Iryna Gurevych.
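Similarity search over sentence embeddings usually comes down to comparing vectors. The following is a minimal sketch, assuming the embeddings have already been computed by some multilingual encoder; the vectors are made-up placeholders rather than real model output, and `cosine_similarity` and `best_match` are hypothetical helper names, not part of the gist.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Placeholder embeddings (assumption): in practice these would come from the
# sentence encoder, one vector per sentence, regardless of language.
query_vec = [0.9, 0.1, 0.0]
candidate_vecs = [
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.0, 0.9],
]

index, score = best_match(query_vec, candidate_vecs)
print(f"Closest candidate: {index} (cosine similarity {score:.3f})")
```

Because the comparison only looks at vectors, the query and the candidates can be in different languages as long as the encoder maps them into a shared space.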
@Felflare
Felflare / Bert Abstractive summarization
Last active January 31, 2021 07:16
This snippet is based on [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf) by Yang Liu and Mirella Lapata.
# Pull and install Huggingface Transformers Repo
git clone https://github.com/huggingface/transformers && cd transformers
pip install .
pip install nltk py-rouge
cd examples/summarization
#------------------------------
# Download the original summarization datasets. These commands fetch them from Google Drive on Linux.
wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O cnn_stories.tgz
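The first `wget` call pipes the response through `sed` to pull out the `confirm=` token, which then has to be pasted into the second command. The same extraction can be sketched in Python; this is an illustration only, the sample response string is invented, and `extract_confirm_token` is a hypothetical helper name.

```python
import re

def extract_confirm_token(page_text):
    """Return the value following 'confirm=' or None if it is absent.

    Uses the same character class as the sed expression: [0-9A-Za-z_]+
    Note: re.search returns the first occurrence, whereas the greedy sed
    pattern keeps the last occurrence on a line.
    """
    match = re.search(r"confirm=([0-9A-Za-z_]+)", page_text)
    return match.group(1) if match else None

# Invented sample response (assumption), shaped like a download warning page.
sample = '<a href="/uc?export=download&amp;confirm=AbC1&amp;id=FILE_ID">Download anyway</a>'

token = extract_confirm_token(sample)
print(f"Code: {token}")
```

The extracted value is what replaces `<CONFIRMATION CODE HERE>` in the second command.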
@Felflare
Felflare / XLNet_span_selection_squad
Created February 10, 2020 03:03
XLNet model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`). Simple demo of loss and logits.
from transformers import XLNetTokenizer, XLNetForQuestionAnsweringSimple
import torch
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
print(f'Encoded sequence ids -- {input_ids.tolist()[0]}')
# Encoded sequence ids -- [17, 11368, 19, 94, 2288, 27, 10920, 4, 3]
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
@Felflare
Felflare / XLNet_sequence_classification
Created February 10, 2020 02:46
Demonstration of XLNet with Classification head on top, implementation of XLNet follows huggingface's pytorch build.
from transformers import XLNetTokenizer, XLNetForSequenceClassification
import torch
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
print(f'Current Loss at -- {loss.tolist()}')
# Current Loss at -- 1.1906177997589111
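For a classification head with more than one label, the reported loss is typically the cross-entropy between the logits and the target label. A minimal pure-Python sketch of that computation for a single example follows; the logits are placeholders, not the model's actual output, and `cross_entropy` is a hypothetical helper.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits).

    Subtracting the maximum logit first keeps exp() numerically stable.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Placeholder logits (assumption) for a two-class head and target label 1.
example_logits = [0.5, -0.3]
example_loss = cross_entropy(example_logits, 1)
print(f"Loss for label 1: {example_loss:.4f}")
```

A loss above `ln 2` (about 0.693) for two classes, as in the printed value above, means the model currently assigns the target label less than 50% probability, which is expected from a classification head that has not been fine-tuned.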
@Felflare
Felflare / XLNet_generate_text
Created February 10, 2020 02:37
Sample function to generate text from XLNet model implemented by huggingface.
from transformers import XLNetTokenizer, XLNetLMHeadModel
import torch
import torch.nn.functional as F
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
# We show how to set up inputs to predict the next token using bidirectional context.
encoded_text = tokenizer.encode("Quick brown fox jumped over the lazy <mask>.", add_special_tokens=True)
input_ids = torch.tensor(encoded_text).unsqueeze(0) # We will predict the masked token
print(f'Input sequence -- {encoded_text}')
# Asynchronously processes text in a document stored in an S3 bucket. For setup information, see https://docs.aws.amazon.com/textract/latest/dg/async.html
import boto3
import json
import sys
import time
class ProcessType:
    DETECTION = 1
    ANALYSIS = 2