{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# utils, evaluate and text_manipulation are modules from the text-segmentation repository\n",
"import utils\n",
"import gensim\n",
"import evaluate\n",
"from text_manipulation import split_sentences"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load the pretrained GoogleNews word2vec vectors (binary format)\n",
"word2vec = gensim.models.KeyedVectors.load_word2vec_format('/datadrive2/sid/GoogleNews-vectors-negative300.bin', binary=True)"
]
},
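{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check on the loaded vectors (illustrative only): confirm the expected dimensionality and that nearest-neighbour queries work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on the loaded KeyedVectors\n",
"print(word2vec.vector_size)  # 300 for the GoogleNews vectors\n",
"print(word2vec.most_similar('document', topn=3))"
]
},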
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Load the trained segmentation model checkpoint on the CPU (is_cuda=False)\n",
"model = evaluate.load_model(model_path='checkpoints/model000.t7', is_cuda=False)"
]
},
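{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell filters `raw_text`, but the cell defining it is not part of this gist (execution counts 3, 5 and 6 are missing). A minimal sketch of how it might be loaded is shown below; the file name `article.txt` is purely illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical: read the document to be segmented from a plain-text file\n",
"with open('article.txt') as f:\n",
"    raw_text = f.read()"
]
},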
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Replace every non-ASCII character in raw_text with a space\n",
"raw_text = ''.join([i if ord(i) < 128 else ' ' for i in raw_text])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2018-05-22 08:15:04,100 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
"Introduction\n",
"Often, a longer document contains many themes, and each theme is located in a particular segment of the document.\n",
"2018-05-22 08:15:04,104 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Preliminaries\n",
"In the discussion below, a document means a sequence of words, and the goal is to break documents into coherent segments.\n",
"2018-05-22 08:15:04,107 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Motivation\n",
"Imagine your document as a random walk in the space of word embeddings.\n",
"2018-05-22 08:15:04,111 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: Formalisation\n",
"Suppose V is a segment given by the word sequence (W_1, ... W_n) and let w_i be the word vector of W_i and v := \\sum_i w_i the segment vector of V.\n",
"\n",
"The remarks in the previous section suggest that the segment vector length \\|v\\| corresponds to the amount of information in segment V. We can interpret \\|v\\| as a weighted sum of cosine similarities:\n",
"2018-05-22 08:15:04,113 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: \n",
" \\[\\|v\\| = \\left\\langle v, \\frac{v}{\\|v\\|} \\right\\rangle = \\sum_i \\left\\langle w_i, \\frac{v}{\\|v\\|} \\right\\rangle = \\sum_i \\|w_i\\| \\left\\langle \\frac{w_i}{\\|w_i\\|}, \\frac{v}{\\|v\\|} \\right\\rangle.\\]\n",
"\n",
"As we usually compare word embeddings with cosine similarity, the last scalar product \\left\\langle \\frac{w_i}{\\|w_i\\|}, \\frac{v}{\\|v\\|} \\right\\rangle is just the similarity of a word w_i to the segment vector v. The weighting coefficients \\| w_i \\| suppress frequent noise words, which are typically of smaller norm.\n",
"2018-05-22 08:15:04,115 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
" \\[T := (t_0, ..., t_n) \\quad \\text{such that} \\quad 0 = t_0 < t_1 < ... < t_n = L\\]\n",
"\n",
"A natural first attempt is to ask for T maximising the sum of segment vector lengths.\n",
"2018-05-22 08:15:04,117 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: \n",
" \\[ J(T) := \\sum_{i=0}^{n-1} \\left\\| v_i \\right\\|, \\quad \\text{where} \\; v_i := w_{t_i} + \\ldots + w_{t_{i+1} - 1}.\\]\n",
"\n",
"However, without further constraints, the optimal solution to J is the partition splitting the document completely, so that each segment is a single word.\n",
"2018-05-22 08:15:04,120 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
" \\[ \\left\\| \\sum_{i=0}^{L-1} w_i \\right\\| \\le \\sum_{i=0}^{L-1} \\| w_i \\|.\n",
"2018-05-22 08:15:04,124 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \\]\n",
"\n",
"Therefore, we must impose some limit on the granularity of the segmentation to get useful results.\n",
"2018-05-22 08:15:04,127 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
" \\[ J(T) := \\sum_{i=0}^{n-1} \\left( \\left\\| v_i \\right\\| - \\pi \\right).\\]\n",
"\n",
"Algorithms\n",
"We developed two algorithms to tackle the problem.\n",
"2018-05-22 08:15:04,130 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Greedy\n",
"The greedy approach tries to maximise J(T) by choosing split positions one at a time.\n",
"2018-05-22 08:15:04,132 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
" \\[\\textnormal{gain}_b^e(t) := \\left\\| \\sum_{i=b}^{t-1} w_i \\right\\| + \\left\\| \\sum_{i=t}^{e-1} w_i \\right\\| - \\left\\| \\sum_{i=b}^{e-1} w_i \\right\\|\\]\n",
"\n",
"The score of a segmentation is the sum of the gains of its split positions.\n",
"2018-05-22 08:15:04,134 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: \\[\\textnormal{score}(T) := \\sum_{i=1}^{n-1} \\textnormal{gain}_{t_{i-1}}^{t_{i+1}}(t_i)\\]\n",
"\n",
"The greedy algorithm works as follows: Split the text iteratively at the position where the score of the resulting segmentation is highest until the gain of the latest split drops below the given penalty threshold.\n",
"2018-05-22 08:15:04,136 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Dynamic Programming\n",
"This approach exploits the fact that the optimal segmentations of all prefixes of a document up to a certain length can be extended to an optimal segmentation of the whole.\n",
"2018-05-22 08:15:04,139 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
"Let T := (t_0, t_1, ..., t_n) be the optimal segmentation of the whole document.\n",
"2018-05-22 08:15:04,142 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: If this were not so, then the optimal segmentation for the document prefix (t_0, t_1, t_2, ..., t_k) would extend to a segmentation T'' for the whole document, using t_{k+1}, ..., t_n, with J(T'') > J(T), contradicting optimality of T. This gives us a constructive induction: Given optimal segmentations\n",
"\n",
" \\[\\{T^i \\;|\\; 0 < i < k, \\;T^i \\;\\text{optimal for first} \\;i\\; \\text{words}\\},\\]\n",
"\n",
"we can construct the optimal segmentation T^k up to word k, by trying to extend any of the segmentations T^i, \\;0 < i < k by the segment (W_i, \\ldots W_k), then choosing i to maximise the objective.\n",
"2018-05-22 08:15:04,147 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Hyperparameter Choice\n",
"Both algorithms depend on the penalty hyperparameter \\pi, which controls segmentation granularity: The smaller it is, the more segments are created.\n",
"2018-05-22 08:15:04,151 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: Experiments\n",
"As word embeddings we used word2vec cbow hierarchical softmax models of dimension 400 and sample parameter 0.00001 trained on our preprocessed English Wikipedia articles.\n",
"2018-05-22 08:15:04,153 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: P_k Metric\n",
"Following the paper Segmentation based on Semantic Word Embeddings, we evaluate the two approaches outlined above on documents composed of randomly concatenated document chunks, to see if the synthetic borders are detected.\n",
"2018-05-22 08:15:04,156 - INFO - Sentence with backslash was splitted. Doc Id: 123 Sentence: Given any positive integer k, define p_k to be the probability that a text slice S of length k, chosen uniformly at random from the test document, occurs both in the i^{\\textnormal{th}} segment of the reference segmentation and in the i^{\\textnormal{th}} segment of the segmentation created by the algorithm, for some i, and set\n",
"\n",
" \\[P_k := 1 - p_k.\\]\n",
"\n",
"For a successful segmentation algorithm, the randomly chosen slice S will often occur in the same ordinal segment of the reference and computed segmentation.\n",
"2018-05-22 08:15:04,160 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: \n",
"def reduce(vecs, n_comp):\n",
"2018-05-22 08:15:04,164 - INFO - No split for sentence with backslash n. Doc Id: 123 Sentence: u, s, v = np.linalg.svd(vecs, full_matrices=False)\n",
" return np.multiply(u[:, :n_comp], s[:n_comp])\n",
"The graphic shows the P_k metric (as mean) on the Y-axis.\n"
]
}
],
"source": [
"sentences = split_sentences(raw_text, 123)"
]
},
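{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log above echoes the gain formula gain_b^e(t) behind the greedy splitter. The cell below is a minimal numpy sketch of that formula, added for illustration; it is not the implementation used by `evaluate.predict_cutoffs`, and `word_vecs` is assumed to be a 2-D array with one word vector per row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def gain(word_vecs, b, e, t):\n",
"    # gain_b^e(t): increase in summed segment-vector norms when the segment [b, e)\n",
"    # is split at position t, with b < t < e\n",
"    left = np.linalg.norm(word_vecs[b:t].sum(axis=0))\n",
"    right = np.linalg.norm(word_vecs[t:e].sum(axis=0))\n",
"    whole = np.linalg.norm(word_vecs[b:e].sum(axis=0))\n",
"    return left + right - whole"
]
},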
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[' Introduction Often, a longer document contains many themes, and each theme is located in a particular segment of the document.',\n",
" 'For example, each paragraph in a political essay may present a different argument in support of the thesis.',\n",
" 'Some arguments may be sociological, some economic, some historical, and so on.',\n",
" 'Thus, themes vary across the document, but remain constant within each segment.',\n",
" 'We will call such segments coherent .',\n",
" 'This post describes a simple principle to split documents into coherent segments, using word embeddings.',\n",
" 'Then we present two implementations of it.',\n",
" 'Firstly, we describe a greedy algorithm, which has linear complexity and runtime in the order of typical preprocessing steps (like sentence splitting, count vectorising).',\n",
" 'Secondly, we present an algorithm that computes the optimal solution to the objective given by the principle, but is of quadratic complexity in the document lengths.',\n",
" 'However, this optimal algorithm can be restricted in generality, such that processing time becomes linear.',\n",
" 'The approach presented here is quite similar to the one developed by Alemi and Ginsparg in Segmentation based on Semantic Word Embeddings.',\n",
" 'The implementation is available as a module on GitHub.',\n",
" 'Preliminaries In the discussion below, a document means a sequence of words, and the goal is to break documents into coherent segments.',\n",
" 'Our words are represented by word vectors.',\n",
" 'We regard segments themselves as sequences of words, and the vectorisation of a segment is formed by composing the vectorisations of the words.',\n",
" 'The techniques we describe are agnostic to the choice of composition, but we use summation here both for simplicity, and because it gives good results.',\n",
" 'Our techniques are also agnostic as to the choice of the units constituting a document: they do not need to be words as described here.',\n",
" 'Given a sentence splitter, one could also consider a sentence as a unit.',\n",
" 'Motivation Imagine your document as a random walk in the space of word embeddings.',\n",
" 'Each step in the walk represents the transition from one word to the next, and is modelled by the difference in the corresponding word embeddings.',\n",
" 'In a coherent chunk of text, the potential step directions are not equally likely, because word embeddings capture semantics, and the chunk covers only a small number of topics.',\n",
" 'Since only some step directions are likely, the length of the accumulated step vector grows more quickly than for a uniformly random walking direction.',\n",
" 'Remark: Word embeddings usually have a certain preferred direction.',\n",
" 'One survey about this and a recipe to investigate your word embeddings can be found here.',\n",
" 'Formalisation',\n",
" 'Suppose V is a segment given by the word sequence (W_1, ... W_n) and let w_i be the word vector of W_i and v := \\\\sum_i w_i the segment vector of V.',\n",
" '',\n",
" 'The remarks in the previous section suggest that the segment vector length \\\\|v\\\\| corresponds to the amount of information in segment V. We can interpret \\\\|v\\\\| as a weighted sum of cosine similarities:',\n",
" '',\n",
" ' \\\\[\\\\|v\\\\| = \\\\left\\\\langle v, \\\\frac{v}{\\\\|v\\\\|} \\\\right\\\\rangle = \\\\sum_i \\\\left\\\\langle w_i, \\\\frac{v}{\\\\|v\\\\|} \\\\right\\\\rangle = \\\\sum_i \\\\|w_i\\\\| \\\\left\\\\langle \\\\frac{w_i}{\\\\|w_i\\\\|}, \\\\frac{v}{\\\\|v\\\\|} \\\\right\\\\rangle.\\\\]',\n",
" '',\n",
" 'As we usually compare word embeddings with cosine similarity, the last scalar product \\\\left\\\\langle \\\\frac{w_i}{\\\\|w_i\\\\|}, \\\\frac{v}{\\\\|v\\\\|} \\\\right\\\\rangle is just the similarity of a word w_i to the segment vector v. The weighting coefficients \\\\| w_i \\\\| suppress frequent noise words, which are typically of smaller norm.',\n",
" 'So \\\\|v\\\\| can be described as accumulated weighted cosine similarity of the word vectors of a segment to the segment vector.',\n",
" 'In other words: the more similar the word vectors are to the segment vector, the more coherent the segment is.',\n",
" 'How can we use the above notion of coherence to break a document of length L into coherent segments, say with word boundaries given by the segmentation:',\n",
" ' \\\\[T := (t_0, ..., t_n) \\\\quad \\\\text{such that} \\\\quad 0 = t_0 < t_1 < ... < t_n = L\\\\] A natural first attempt is to ask for T maximising the sum of segment vector lengths.',\n",
" 'That is, we ask for T maximising:',\n",
" '',\n",
" ' \\\\[ J(T) := \\\\sum_{i=0}^{n-1}\\\\left\\\\| v_i \\\\right\\\\|, \\\\quad \\\\text{where} \\\\; v_i := w_{t_i} + \\\\ldots + w_{t_{i+1} - 1}.\\\\]',\n",
" '',\n",
" 'However, without further constraints, the optimal solution to J is the partition splitting the document completely, so that each segment is a single word.',\n",
" 'Indeed, by the triangle inequality, for any document (W_0, ..., W_{L-1}), we have:',\n",
" ' \\\\[ \\\\left\\\\| \\\\sum_{i=0}^{L-1} w_i \\\\right\\\\| \\\\le \\\\sum_{i=0}^{L-1} \\\\| w_i \\\\|.',\n",
" '\\\\] Therefore, we must impose some limit on the granularity of the segmentation to get useful results.',\n",
" 'To achieve this, we impose a penalty for every split made, by subtracting a fixed positive number \\\\pi for each segment.',\n",
" 'The error function is now:',\n",
" ' \\\\[ J(T) := \\\\sum_{i=0}^{n-1} \\\\left( \\\\left\\\\| v_i \\\\right\\\\| - \\\\pi \\\\right).\\\\] Algorithms We developed two algorithms to tackle the problem.',\n",
" 'Both depend on a hyperparameter, \\\\pi, that defines the granularity of the segmentation.',\n",
" 'The first one is greedy and therefore only a heuristic, intended to be quick.',\n",
" 'The second one finds an optimal segmentation for the objective J, given split penalty \\\\pi.',\n",
" 'Greedy The greedy approach tries to maximise J(T) by choosing split positions one at a time.',\n",
" 'To define the algorithm, we first define the notions of the gain of a split, and the score of a segmentation.',\n",
" 'Given a segment V = (W_b, \\\\ldots, W_{e-1}) of words and a split position t with b<t<e, the gain of splitting V at position t into (W_b, \\\\ldots, W_{t-1}) and (W_t, \\\\ldots, W_{e-1}) is the sum of norms of segment vectors to the left and right of t, minus the norm of the segment vector v:',\n",
" ' \\\\[\\\\textnormal{gain}_b^e(t) := \\\\left\\\\| \\\\sum_{i=b}^{t-1} w_i \\\\right\\\\| +\\\\left\\\\| \\\\sum_{i=t}^{e-1} w_i \\\\right\\\\| -\\\\left\\\\| \\\\sum_{i=b}^{e-1} w_i \\\\right\\\\|\\\\] The score of a segmentation is the sum of the gains of its split positions.',\n",
" '\\\\[\\\\textnormal{score}(T) := \\\\sum_{i=1}^{n-1} \\\\textnormal{gain}_{t_{i-1}}^{t_{i+1}}(t_i)\\\\]',\n",
" '',\n",
" 'The greedy algorithm works as follows: Split the text iteratively at the position where the score of the resulting segmentation is highest until the gain of the latest split drops below the given penalty threshold.',\n",
" 'Note that the gains of the splits resulting from this greedy approach may be less than the penalty \\\\pi, implying the segmentation is sub-optimal.',\n",
" 'Nonetheless, our empirical results are remarkably close to the global maximum of J that is guaranteed to be achieved by the dynamic programming approach discussed below.',\n",
" 'Dynamic Programming This approach exploits the fact that the optimal segmentations of all prefixes of a document up to a certain length can be extended to an optimal segmentation of the whole.',\n",
" 'The idea of dynamic programming is that one uses intermediate results to complete a partial solution.',\n",
" \"Let's have a look at our case:\",\n",
" ' Let T := (t_0, t_1, ..., t_n) be the optimal segmentation of the whole document.',\n",
" \"We claim that T' := (t_0, t_1, ..., t_k) is optimal for the document prefix up to word W_k.\",\n",
" \"If this were not so, then the optimal segmentation for the document prefix (t_0, t_1, t_2, ..., t_k) would extend to a segmentation T'' for the whole document, using t_{k+1}, ..., t_n, with J(T'') > J(T), contradicting optimality of T. This gives us a constructive induction: Given optimal segmentations\",\n",
" '',\n",
" ' \\\\[\\\\{T^i \\\\;|\\\\; 0 < i < k, \\\\;T^i \\\\;\\\\text{optimal for first} \\\\;i\\\\; \\\\text{words}\\\\},\\\\]',\n",
" '',\n",
" 'we can construct the optimal segmentation T^k up to word k, by trying to extend any of the segmentations T^i, \\\\;0 < i < k by the segment (W_i, \\\\ldots W_k), then choosing i to maximise the objective.',\n",
" 'The reason it is possible to divide the maximisation task into parts is the additive composition of the objective and the fact that the norm obeys the triangle inequality.',\n",
" 'The runtime of this approach is quadratic in the input length L, which is a problem if you have long texts.',\n",
" 'However, by introducing a constant that specifies the maximal segment length, we can reduce the complexity to merely linear.',\n",
" 'Hyperparameter Choice Both algorithms depend on the penalty hyperparameter \\\\pi, which controls segmentation granularity: The smaller it is, the more segments are created.',\n",
" 'A simple way of finding an appropriate penalty is as follows.',\n",
" 'Choose a desired average segment length m. Given a sample of documents, record the lowest gains returned when splitting each document iteratively into as many segments as expected on average due to m, according to the greedy method.',\n",
" 'Take the mean of these records as \\\\pi.',\n",
" 'Our implementation of the greedy algorithm can be used to require a specific number of splits and retrieve the gains.',\n",
" 'The repository comes with a get_penalty function that implements the procedure as described.',\n",
" 'Experiments As word embeddings we used word2vec cbow hierarchical softmax models of dimension 400 and sample parameter 0.00001 trained on our preprocessed English Wikipedia articles.',\n",
" 'P_k Metric Following the paper Segmentation based on Semantic Word Embeddings, we evaluate the two approaches outlined above on documents composed of randomly concatenated document chunks, to see if the synthetic borders are detected.',\n",
" 'To measure the accuracy of a segmentation algorithm, we use the P_k metric as follows.',\n",
" 'Given any positive integer k, define p_k to be the probability that a text slice S of length k, chosen uniformly at random from the test document, occurs both in the i^{\\\\textnormal{th}} segment of the reference segmentation and in the i^{\\\\textnormal{th}} segment of the segmentation created by the algorithm, for some i, and set',\n",
" '',\n",
" ' \\\\[P_k := 1 - p_k.\\\\]',\n",
" '',\n",
" 'For a successful segmentation algorithm, the randomly chosen slice S will often occur in the same ordinal segment of the reference and computed segmentation.',\n",
" 'In this case, the value of p_k will be high, hence the value of P_k low.',\n",
" 'Thus, P_k is an error metric.',\n",
" 'In our case, we choose k to be one half the length of the reference segment.',\n",
" 'We refer to the paper for more details on the P_k metric.',\n",
" 'Test documents were composed of fixed length chunks of random Wikipedia articles that had between 500 and 2500 words.',\n",
" 'Chunks were taken from the beginning of articles with an offset of 10 words to avoid the influence of the title.',\n",
" 'We achieved P_k values of about 0.05.',\n",
" 'We varied the length of the synthetic segments over values 50, 100 or 200 (the vertical axis in the displayed grid).',\n",
" 'We also varied the number of segments over values 3, 5 and 9 (the horizontal axis).',\n",
" 'Our base word vectorisation model was of dimension 400.',\n",
" 'To study how the performance of our segmentation algorithm varied with shrinking dimension, we compressed this model, using the SVD, into 16, 32, 64 and 128-dimensional models, and computed the P_k metrics for each.',\n",
" 'To do this, we used the python function:',\n",
" ' def reduce(vecs, n_comp):',\n",
" ' u, s, v = np.linalg.svd(vecs, full_matrices=False) return np.multiply(u[:, :n_comp], s[:n_comp]) The graphic shows the P_k metric (as mean) on the Y-axis.',\n",
" 'The penalty hyperparameter \\\\pi was chosen identically for both approaches and adjusted to approximately retrieve the actual number of segments.']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentences"
]
},
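{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text above also describes a quadratic dynamic-programming recursion that maximises J(T) for a given split penalty. The cell below is a minimal numpy sketch of that recursion, added for illustration; it is not the repository's implementation, `word_vecs` is assumed to be a 2-D array of word vectors, and the returned list contains segment end positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def dp_segment(word_vecs, penalty):\n",
"    # Maximise sum_i (||v_i|| - penalty) over all segmentations of the word sequence\n",
"    n = len(word_vecs)\n",
"    prefix = np.vstack([np.zeros(word_vecs.shape[1]), np.cumsum(word_vecs, axis=0)])\n",
"    best = np.full(n + 1, -np.inf)  # best[k] = best objective for the first k words\n",
"    best[0] = 0.0\n",
"    back = np.zeros(n + 1, dtype=int)\n",
"    for k in range(1, n + 1):\n",
"        for j in range(k):\n",
"            # extend the optimal segmentation of the first j words by the segment [j, k)\n",
"            score = best[j] + np.linalg.norm(prefix[k] - prefix[j]) - penalty\n",
"            if score > best[k]:\n",
"                best[k], back[k] = score, j\n",
"    splits, k = [], n\n",
"    while k > 0:  # recover the split positions from the backpointers\n",
"        splits.append(k)\n",
"        k = back[k]\n",
"    return sorted(splits)"
]
},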
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "need at least one array to concatenate",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-13-78897f9821a1>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcutoffs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mevaluate\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict_cutoffs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword2vec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/home/sid/text-segmentation/evaluate.pyc\u001b[0m in \u001b[0;36mpredict_cutoffs\u001b[0;34m(sentences, model, word2vec)\u001b[0m\n\u001b[1;32m 40\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mpredict_cutoffs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword2vec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 41\u001b[0m \u001b[0mword2vec_sentences\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtext_to_word2vec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msentences\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword2vec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 42\u001b[0;31m \u001b[0mtensored_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mprepare_tensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword2vec_sentences\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 43\u001b[0m \u001b[0mbatched_tensored_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 44\u001b[0m \u001b[0mbatched_tensored_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtensored_data\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/home/sid/text-segmentation/evaluate.pyc\u001b[0m in \u001b[0;36mprepare_tensor\u001b[0;34m(sentences)\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0mtensored_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0msentence\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msentences\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0mtensored_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmaybe_cuda\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mFloatTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconcatenate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msentence\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mtensored_data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: need at least one array to concatenate"
]
}
],
"source": [
"cutoffs = evaluate.predict_cutoffs(sentences, model, word2vec)"
]
},
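{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ValueError is raised when np.concatenate is called on an empty list inside prepare_tensor, which suggests that at least one entry of sentences (for example the empty '' strings above) yields no word vectors. A possible workaround, sketched under the assumption that a simple whitespace tokenisation roughly matches the repository's word lookup, is to drop such sentences before calling predict_cutoffs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Possible workaround (assumption): keep only sentences with at least one word\n",
"# in the word2vec vocabulary, so prepare_tensor never receives an empty list\n",
"filtered = [s for s in sentences if any(w in word2vec.vocab for w in s.split())]\n",
"cutoffs = evaluate.predict_cutoffs(filtered, model, word2vec)"
]
},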
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |