This helper function converts a PDF file blob to text, with options to save the original PDF, the intermediate Google Doc, and/or the final plain-text file. The language used for Optical Character Recognition (OCR) can also be specified; it defaults to 'en' (English).
Note: Updated 12 May 2015 due to the deprecation of DocsList. Thanks to Bruce McPherson for the getDriveFolderFromPath() utility.
// Start with a Blob object
var blob = gmailAttachment.getAs(MimeType.PDF);
// fileId will be the ID of a saved text file (default behavior):
var fileId = pdfToText( blob );
// filetext will contain text from pdf file, no residual files are saved:
var filetext = pdfToText( blob, {keepTextfile: false} );
// we can save other converted file types, too:
var options = {
  keepPdf : true,            // Keep a copy of the original PDF file.
  keepGdoc : true,           // Keep a copy of the OCR Google Doc file.
  keepTextfile : true,       // Keep a copy of the text file. (default)
  path : "attachments/today" // Folder path to store file(s) in.
};
filetext = pdfToText( blob, options );
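To make the default behaviour described above concrete, here is a minimal sketch of how the options might be merged with defaults. This is not the helper's actual code: `resolveOptions`, `PDF_TO_TEXT_DEFAULTS`, the `ocrLanguage` option name, and the empty default `path` are my assumptions for illustration. Only the `keepTextfile` default and the 'en' language default come from the description.

```javascript
// Hypothetical defaults. keepTextfile (true) and the 'en' OCR language are
// described in the post; the other values and the ocrLanguage name are assumed.
var PDF_TO_TEXT_DEFAULTS = {
  keepPdf: false,      // assumed: original PDF is not kept unless requested
  keepGdoc: false,     // assumed: intermediate Google Doc is not kept unless requested
  keepTextfile: true,  // from the post: the text file is saved by default
  path: "",            // assumed: no folder path by default
  ocrLanguage: "en"    // 'en' default is from the post; the option name is assumed
};

// Return a new object containing the defaults overridden by caller options.
function resolveOptions(options) {
  var resolved = {};
  var key;
  for (key in PDF_TO_TEXT_DEFAULTS) {
    if (Object.prototype.hasOwnProperty.call(PDF_TO_TEXT_DEFAULTS, key)) {
      resolved[key] = PDF_TO_TEXT_DEFAULTS[key];
    }
  }
  options = options || {};
  for (key in options) {
    if (Object.prototype.hasOwnProperty.call(options, key)) {
      resolved[key] = options[key];
    }
  }
  return resolved;
}
```

With this sketch, `resolveOptions({keepTextfile: false})` leaves every other setting at its default, which matches the "no residual files are saved" call above.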
Generally this works really well for my purposes, but I keep running into an issue with some PDFs where the returned text does not cover the entire document. In the most recent instance, the PDF has 19 pages and only the text from the first 10 pages is returned. The same thing happens even if I randomly rearrange the pages in the document. Is this a known issue or limitation?
EDIT: I found this immediately after posting: https://stackoverflow.com/questions/27303488/using-gas-to-overcome-the-max-file-size-limit-for-google-drive-api-drive-files-i. It seems we're stuck with this.
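One possible workaround, if the cut-off really is a fixed 10 pages per document as it appears to be in my case, would be to split the PDF into chunks of at most 10 pages and run each chunk through pdfToText() separately. I have not tried this. The sketch below only computes the page ranges for the chunks; `pageRanges` is a hypothetical helper, and actually splitting the PDF would need a separate tool or library that is not shown here.

```javascript
// Hypothetical helper: compute 1-based, inclusive [first, last] page ranges
// so that no chunk exceeds maxPagesPerChunk pages.
function pageRanges(totalPages, maxPagesPerChunk) {
  if (!(maxPagesPerChunk >= 1)) {
    throw new Error("maxPagesPerChunk must be at least 1");
  }
  var ranges = [];
  for (var first = 1; first <= totalPages; first += maxPagesPerChunk) {
    var last = Math.min(first + maxPagesPerChunk - 1, totalPages);
    ranges.push([first, last]);
  }
  return ranges;
}

// For the 19-page PDF mentioned above, assuming a 10-page cap:
var ranges = pageRanges(19, 10); // [[1, 10], [11, 19]]
```

Each range would then correspond to one smaller PDF blob, and the text from the chunks could be concatenated in order.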