Created January 11, 2017 16:06
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Impact of Ideas / Results: 4
Meaningful Comparison: 4
Substance: 4
Replicability: 4
Recommendation: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper conducts sentence-level abstractive summarization/rewriting with a
neural network approach. The method can in principle handle not just deletion
but also rewriting (using words not in the original sentences), while relying
only minimally on linguistic analysis. The paper is easy to follow and clearly
written. The advantages of the proposed models are supported by the
experimental results. I recommend publication.
I have a couple of suggestions. First (perhaps I missed something), would a
good neural-network-based deletion model be enough to capture the benefit seen
in this paper? The proposed approach can generate summaries with words not
seen in the original sentences, but how much does that help? (The COMPRESS
model discussed in Section 7.2 uses a very different approach and still shows
some gaps.) From a summarization-evaluation viewpoint, ROUGE may not reward
"unseen" words as much as other metrics such as Pyramid do, so the benefit of
the proposed model may not be fully shown. While Section 5 tries to compromise
toward ROUGE, some discussion might help readers think about the issue in
another way.
The paper uses a trick (Section 5) to tune the model toward ROUGE. A little
more discussion would help readers understand why directly optimizing an
objective such as ROUGE is not feasible here. Is it because ROUGE may not be
accurate on a small data set (like BLEU on individual sentences), because of
computational concerns, or for other reasons?
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Impact of Ideas / Results: 4
Meaningful Comparison: 4
Substance: 4
Replicability: 3
Recommendation: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper addresses abstractive sentence summarization, specifically headline
generation using a neural language model. The work was evaluated on the DUC
2004 dataset. The paper is well written, and the work is interesting and
carefully evaluated. However, while the quantitative evaluation seems
reasonable, the actual summaries (from the examples) seem to have major
grammatical and repetition issues and do not look as good as the true
headlines. That said, the idea is promising, but it needs more work on the
soundness of the generated sentences.
A few questions/comments:
Why was the model trained using only the first line of the text? What is the
intuition for this? Could the last line, which often summarizes the text, be
used as well? It would be nice if this were discussed in the paper.
In Section 7.2 the authors mention a capped ROUGE score but do not explain how
it is computed. Was this used in the DUC 2004 task? If so, please state so;
if not, please provide the exact formula for reproducibility.
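One plausible reading of "capped ROUGE", assuming the DUC 2004 convention of truncating system output to 75 bytes before scoring (an assumption here, not something the review confirms), can be sketched as:

```python
def capped_rouge1_recall(candidate, reference, cap_bytes=75):
    """Unigram recall after truncating the candidate to cap_bytes.

    The 75-byte cap mirrors the DUC 2004 headline-length limit; this is an
    illustrative guess at "capped ROUGE", not a formula from the paper.
    """
    # Truncate at the byte level, dropping any partially cut character.
    truncated = candidate.encode("utf-8")[:cap_bytes].decode("utf-8",
                                                             errors="ignore")
    cand = set(truncated.lower().split())
    ref = reference.lower().split()
    return sum(1 for w in ref if w in cand) / len(ref) if ref else 0.0
```

Under this reading, any candidate words beyond the byte cap simply cannot earn recall credit, which penalizes overly long outputs.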
It seems the authors tried to fit more content than the page limit allows, as
the bottom margin is completely off. Please fix this and make the writing more
concise.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: 4
Clarity: 3
Originality: 4
Soundness / Correctness: 4
Impact of Ideas / Results: 4
Meaningful Comparison: 3
Substance: 4
Replicability: 3
Recommendation: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper uses neural language models to generate sentence summaries word by
word, going beyond previous sentence-based extractive methods and phrase-based
abstractive approaches to sentence summarization. More specifically, their
Attention-Based Summarization (ABS) approach is built on an attention-based
encoder and a beam-search decoder with extractive features, which can be seen
as a tradeoff between abstractive and extractive methods.
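For readers unfamiliar with the decoder side, a generic beam-search sketch (illustrative only; the authors' actual decoder also scores extractive features alongside the language-model probabilities) looks like:

```python
import heapq

def beam_search(step_scores, beam_size=2, length=3):
    """Generic beam search over token sequences.

    step_scores(prefix) -> list of (token, log_prob) continuations.
    At each step, expand every prefix in the beam and keep only the
    beam_size highest-scoring prefixes.
    """
    beam = [(0.0, [])]                      # (cumulative log-prob, prefix)
    for _ in range(length):
        candidates = []
        for score, prefix in beam:
            for tok, lp in step_scores(prefix):
                candidates.append((score + lp, prefix + [tok]))
        beam = heapq.nlargest(beam_size, candidates)
    return max(beam)                        # best (score, sequence) found

# Toy scorer: token 0 always has log-prob -0.1, token 1 has -1.0,
# so the best length-3 sequence is [0, 0, 0] with score -0.3.
best = beam_search(lambda prefix: [(0, -0.1), (1, -1.0)])
```

With beam_size=1 this degenerates to greedy decoding; larger beams trade computation for a better approximation of the highest-probability sequence.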
For the encoder, they present four models step by step, of which two consider
only the input word information while the other two also incorporate an
embedding of the current context. The latter two encoders, which
simultaneously learn embeddings for the input together with distributions
based on the current context, are thus able to show an interpretable alignment
between the summary and the input sentence.
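The context-dependent encoding described above can be sketched roughly as follows: the input representation is an attention-weighted average of input embeddings, with weights conditioned on the current summary context. The matrix names (F, G, P), dimensions, and the uniform context average are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, M, C = 100, 8, 6, 3     # vocab size, embed dim, input length, context length

F = rng.normal(size=(V, D))   # input-word embedding matrix (assumed name)
G = rng.normal(size=(V, D))   # context-word embedding matrix (assumed name)
P = rng.normal(size=(D, D))   # learned alignment matrix (assumed name)

x = rng.integers(0, V, size=M)    # input sentence as word ids
y_c = rng.integers(0, V, size=C)  # current summary context as word ids

x_emb = F[x]                      # (M, D) input embeddings
ybar = G[y_c].mean(axis=0)        # (D,) averaged context embedding

scores = x_emb @ P @ ybar         # (M,) alignment scores
p = np.exp(scores - scores.max())
p /= p.sum()                      # softmax: attention over input positions

enc = p @ x_emb                   # (D,) context-dependent input encoding
```

The distribution `p` is exactly the interpretable alignment the reviewer mentions: for each summary step it shows which input words the encoder is attending to.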
The authors conducted extensive experiments against several strong and
well-known baseline models, achieving promising results. In particular, their
tuned model ABS+, which leverages the fluency advantage of extractive
features, achieves the best scores on the tasks by a significant margin. While
they describe how to tune the weight vector alpha, they do not report the
actual values of alpha for the final best-performing model. Those values would
be helpful for examining the importance of the extractive features, so I am
somewhat hesitant about the analysis of how much the attention-based neural
models contribute.
I am just wondering how grammaticality can be ensured with the proposed
approach.