@jahands
Created February 2, 2016 23:36
Basic web scraper.
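# Matches absolute http(s) URLs only; relative links are skipped.
$urlRegex = "https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)"
# URLs already seen, so that no page is fetched twice.
$links = New-Object 'System.Collections.Generic.HashSet[string]'
function Scrape([string]$Url) {
    # Invoke-WebRequest is the cmdlet behind the wget alias in Windows PowerShell; the alias is absent in PowerShell 6+.
    try { $hrefs = (Invoke-WebRequest -Uri $Url -UseBasicParsing).Links.href }
    catch { Write-Warning "Skipping ${Url}: $_"; return }
    foreach ($href in $hrefs) {
        # HashSet.Add returns $false when the URL was already recorded.
        if ($href -match $urlRegex -and $script:links.Add($href)) {
            Write-Output "$($script:links.Count) : $href"
            Scrape -Url $href   # Recurses with no depth limit; stop with Ctrl+C.
        }
    }
}
$startUrl = "https://technet.microsoft.com/en-us/magazine/2007.11.powershell.aspx"
[void]$links.Add($startUrl)   # Record the start page so it is not fetched again.
Scrape -Url $startUrl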