#!/usr/bin/perl
#
# Get a list of second level domains, by screen scraping wikipedia
#
# pete (at) killallhumans (dot) ca
use strict;
use warnings;

# First, get a list of all TLDs from:
# http://data.iana.org/TLD/tlds-alpha-by-domain.txt
my @tlds = `wget -q -O - http://data.iana.org/TLD/tlds-alpha-by-domain.txt`;
die "Could not fetch the TLD list from IANA\n" unless @tlds;
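for (@tlds) {
    chomp;
    next if /^#/ or /^\s*$/;    # skip the header comment and blank lines
    my $tld  = lc $_;
    my $file = "/tmp/tld/$tld";

    # Get the wikipedia entry for this domain, unless it is already cached
    mkdir '/tmp/tld' unless -d '/tmp/tld';
    unless (-f $file) {
        print "Getting $tld from wikipedia\n";
        # The list form of system() keeps $tld away from the shell
        if (system('wget', '-q', "http://en.wikipedia.org/wiki/.$tld", '-O', $file) != 0) {
            warn "Could not fetch the wikipedia page for .$tld\n";
            unlink $file;    # do not cache an empty or partial download
            next;
        }
    }

    open(my $fh, '<', $file)
        or do { warn "Cannot open $file: $!\n"; next };
    my $txt = do { local $/; <$fh> };    # slurp the whole page
    close $fh;

    # Match bold text ending in ".$tld", such as <b>co.uk</b> for "uk";
    # \Q...\E stops characters in $tld from acting as regex metacharacters
    my @sld = $txt =~ /<b>([^\s<>]+\.\Q$tld\E)<\/b>/g;
    print "Second Level Domains: " . join(",", @sld) . "\n"
        if @sld;
}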