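# HomeController: fetches the page named by params["url"] and keeps only its substantial text blocks.
require 'open-uri'
require 'nokogiri'

class HomeController < ApplicationController
  def index
    if (@target_url = params["url"]) && !@target_url.blank?
      # Prepend a scheme only when the user left it out (http:// and https:// are both accepted).
      @target_url = @target_url =~ %r{\Ahttps?://} ? @target_url : "http://#{@target_url}"
      # Text blocks with fewer word characters than this are treated as noise.
      filter_if_length_less_than = 40
      # open-uri's Kernel#open; on Ruby 2.5 and later use URI.open instead.
      @page = open(@target_url).read
      doc = Nokogiri::HTML.parse(@page)
      # Candidate content: headings, paragraphs, and comment blocks.
      content = doc.search("h1,p,.comment")
      content = content.reject { |node| node.text.gsub(/\W/, '').length < filter_if_length_less_than }
      # Skip nodes nested inside <noscript> or <li> elements.
      content = content.reject { |node| (%w[noscript li] & node.ancestors.map { |e| e.name }).any? }
      # One <p> per non-empty run of text.
      @contents = content.map { |e| e.text }.join("\n").split(/\n+/).map { |e| "<p>#{e.strip}</p>" }.join
    end
  rescue Exception => exception
    log_error(exception) if logger
    erase_results if performed?
    flash.now[:notice] = "Bad bad things happened without cause"
  end
end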
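The scheme check in the controller can be pulled out into a small helper that is easy to test on its own. A minimal sketch, assuming a hypothetical `normalize_url` method (not part of the original code) that accepts both `http://` and `https://` URLs and returns `nil` for blank input:

```ruby
# Hypothetical helper, not part of the original controller.
# Returns the URL with a scheme, or nil when the input is blank.
def normalize_url(raw)
  url = raw.to_s.strip
  return nil if url.empty?

  # \A anchors at the start of the string; ^ would also match after a newline.
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

puts normalize_url("www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html")
# prints "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
```

Keeping this outside the action means the scheme handling can be checked without fetching any page.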
-# Home page view: URL form plus the extracted content.
%h1 Extract real content from any page
- if flash[:notice]
  .flash
    = flash[:notice]
%p
  Using url:
  = @target_url
%form{:action => '/', :method => 'GET'}
  %fieldset
    %ol
      %li
        %label{:for => 'url'} URL
        %input{:id => 'url', :name => 'url', :style => 'width: 30em'}
    %input{:type => 'submit', :value => 'Strip'}
.contents
  = @contents
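The view prints `@contents` as built by the controller, so whether markup characters in the scraped text reach the browser unescaped depends on the Haml escaping settings in use. A minimal sketch of building the paragraphs with explicit escaping, assuming a hypothetical `paragraphs_html` helper (not part of the original code) and the standard library's `ERB::Util.html_escape`. Unlike the original one-liner, it also drops empty lines:

```ruby
require "erb"

# Hypothetical helper, not part of the original code.
# texts: array of plain-text strings, e.g. the text of each extracted node.
# min_length: minimum number of word characters a block needs to be kept.
def paragraphs_html(texts, min_length = 40)
  texts
    .reject { |text| text.gsub(/\W/, "").length < min_length } # drop short blocks
    .flat_map { |text| text.split(/\n+/) }                     # one entry per line
    .map(&:strip)
    .reject(&:empty?)
    .map { |line| "<p>#{ERB::Util.html_escape(line)}</p>" }    # escape, then wrap
    .join
end

puts paragraphs_html(["Menu", "A" * 40 + " <b>bold</b>"])
```

If the escaping is done here, the view has to output the string without escaping it a second time.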
# Cucumber feature covering the home page form.
Feature: Strip content
  In order to send real content to KIM
  As an imindi user
  I want to pull out real content from any web page

  Scenario: Parse http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"

  Scenario: Parse www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"