@paveljurca
Created February 23, 2016 19:43
WordPress to text
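use strict;
use warnings;
use LWP::Simple;          # https:// URLs also need LWP::Protocol::https
use Mojo::DOM;
use open qw/:std :utf8/;  # UTF-8 on STDIN, STDOUT and STDERR
use utf8;                 # this source file contains UTF-8 literals

URL:
for my $url (<DATA>) {
    chomp $url;

    my $html = get($url);
    # The pattern is Czech error-page text (apparently the site's
    # "page not found" notice); it is matched verbatim.
    if (not $html or
        $html =~ /Stránka nebyla nenalezena/) {
        print STDERR "ERR $url\n";
        next URL;
    }

    # Main content container; skip the page if it is missing
    my $dom = Mojo::DOM->new($html)->at('div#wysiwyg');
    if (not $dom) {
        print STDERR "ERR $url\n";
        next URL;
    }

    # Remove the breadcrumb navigation, the links to subpages and
    # the footer; skip any element the page does not have
    for my $selector (qw/div.breadcrumb div#subpages div#content-footer/) {
        my $node = $dom->at($selector);
        $node->remove if $node;
    }

    print $dom->all_text, "\n";

    # When counting standard pages, subtract the line breaks,
    # i.e. `wc -m dump.txt` minus `wc -l dump.txt`
}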
__DATA__
http://vc.vse.cz/sluzby/helpdesk/
http://vc.vse.cz/sluzby/id-karty/
https://osi.vse.cz/iptv/informace/
...
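
The comment in the script about counting standard pages (characters minus line breaks) can be sketched in Perl as below. This is only a sketch under stated assumptions: the helper names `char_count` and `standard_pages` are hypothetical and not part of the original script, and the figure of 1800 characters per standard page is an assumed convention that you may need to adjust.

```perl
use strict;
use warnings;

# Assumption: one standard page is 1800 characters, spaces included.
my $CHARS_PER_PAGE = 1800;

# Hypothetical helper: number of characters excluding line breaks,
# i.e. the same figure as `wc -m` minus `wc -l` for "\n" line endings.
# The text must already be decoded to characters, not raw bytes.
sub char_count {
    my ($text) = @_;
    my $newlines = ($text =~ tr/\n//);   # tr/// in count mode
    return length($text) - $newlines;
}

# Hypothetical helper: number of standard pages, rounded up.
sub standard_pages {
    my ($text) = @_;
    my $chars = char_count($text);
    return int(($chars + $CHARS_PER_PAGE - 1) / $CHARS_PER_PAGE);
}
```

To use it on the script's output, read `dump.txt` through a `<:encoding(UTF-8)` layer so that `length` counts characters rather than bytes, then pass the whole text to `standard_pages`.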