Last active
October 4, 2023 07:18
A PowerShell script to bulk convert webpages to PDF using headless Chrome. Saves PDFs with numeric names or names based on the webpage title.
$sourceFile = " " # path to the text file containing the URLs you want to convert
$destFolder = " " # converted PDFs will be saved here; the folder must already exist
$num = 1
foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    $outfile = $num.ToString() + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3
    $num++
}
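For reference, the source file is just a plain text file with one URL per line; the script reads it line by line. The URLs below are placeholders, substitute your own:

```
https://example.com/page-one
https://example.com/page-two
https://example.org/another-article
```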
this is awesome! Huge time saver for a project I'm working on! However, is it possible to tweak the script to name the output files according to the page title instead of a nondescript number?
You can try this snippet. The raw webpage title can't be used verbatim as a file name, because titles may contain characters that are illegal in Windows file names (such as ":"), so the script strips those out.
This modified snippet depends on PowerHTML, so install it first:
Install-Module -Name PowerHTML
and then use this script:
$sourceFile = "" # path to the text file containing the URLs you want to convert
$destFolder = "" # converted PDFs will be saved here; the folder must already exist
foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    # Fetch the page and read its <title> element (requires the PowerHTML module)
    $title = (ConvertFrom-Html -URI $link).SelectSingleNode('/html/head/title').InnerText.Trim()
    # Build a regex character class of every character that is invalid in a file name
    $pattern = "[{0}]" -f [RegEx]::Escape([System.IO.Path]::GetInvalidFileNameChars() -join '')
    $cleanedTitle = $title -replace $pattern, ''
    Write-Output $cleanedTitle
    $outfile = $cleanedTitle + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3
}
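One caveat with title-based names that the snippet above does not handle: two pages can share the same cleaned title, and the second PDF would silently overwrite the first. A rough sketch of one way to avoid that, appending a counter when the name is already taken (this assumes the same `$cleanedTitle` and `$destFolder` variables as above, and would replace the two lines that build `$outputPath` inside the loop):

```
# If "Title.pdf" already exists, fall back to "Title (2).pdf", "Title (3).pdf", ...
$outfile = $cleanedTitle + '.pdf'
$outputPath = Join-Path -Path $destFolder -ChildPath $outfile
$suffix = 2
while (Test-Path $outputPath) {
    $outfile = "{0} ({1}).pdf" -f $cleanedTitle, $suffix
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    $suffix++
}
```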
I couldn't get this script to work, so I used AI to fix it. Before, every iteration overwrote the previous file, and I was left with only one PDF. If anyone else runs into this, here's an updated script that works for me:
$sourceFile = " " # path to the text file containing the URLs you want to convert
$destFolder = " " # converted PDFs will be saved here; the folder must already exist
$num = 1
foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    $outfile = $num.ToString() + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3
    $num++
}
P.S. Thanks for the article/script, much appreciated!