@ThinhPhan
Created March 8, 2024 03:56
Extract all URLs from a webpage
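// How to use:
// Paste this script into the browser's Developer Console on the page you want to scan, then run it.
// It logs one tab-separated row per link: URL, anchor text, and whether the link is external.
// Ref: https://www.datablist.com/learn/scraping/extract-urls-from-webpage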
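const results = [
  ['Url', 'Anchor Text', 'External']
];
const urls = document.getElementsByTagName('a');
// Use for...of: for...in on an HTMLCollection also visits "length", "item" and "namedItem"
for (const url of urls) {
  // A link is external when its host differs from the current page's host
  const externalLink = url.host !== window.location.host
  // Keep only "scheme://" URLs; this skips empty hrefs, mailto: and javascript: links
  if (url.href && url.href.includes('://')) results.push([url.href, url.text, externalLink]) // url.rel
}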
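// Despite the name, the output is tab-separated so it pastes cleanly into a spreadsheet
const csvContent = results.map((line) => {
  return line.map((cell) => {
    if (typeof cell === 'boolean') return cell ? 'TRUE' : 'FALSE'
    if (!cell) return ''
    // Collapse blank lines and runs of tabs/spaces, then trim the ends
    let value = cell.replace(/[\f\n\v]*\n\s*/g, "\n").replace(/[\t\f ]+/g, ' ').trim();
    // Double any embedded quotes so the quoted cell stays well-formed
    value = value.replace(/"/g, '""');
    return `"${value}"`
  }).join('\t')
}).join("\n");
console.log(csvContent)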
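
The external-link check in the script above can also be tried outside the browser. The following is a minimal sketch, not part of the original gist: `classifyLink` is a hypothetical helper name, and it assumes the standard `URL` class is available (browsers and Node.js both provide it).

```javascript
// Hypothetical helper (not in the original gist): applies the same
// "keep" and "external" rules as the script above, without needing a DOM.
function classifyLink(href, pageUrl) {
  // new URL(href, base) resolves relative hrefs against the page URL,
  // which mirrors what the browser exposes as anchor.href
  const link = new URL(href, pageUrl);
  const page = new URL(pageUrl);
  return {
    url: link.href,
    // Only "scheme://" URLs are kept, so mailto: and javascript: are skipped
    keep: link.href.includes('://'),
    // External when the link's host differs from the page's host
    external: link.host !== page.host,
  };
}
```

In the console this could be called as `classifyLink(a.getAttribute('href'), location.href)` for any anchor element `a`.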