@mogsdad
Last active August 8, 2024 15:21
For http://stackoverflow.com/questions/26613809, a question about getting pdf attachments in gmail as text. I got a little carried away - this does much more than asked.

Google Apps Script pdfToText Utility

This helper function converts a given PDF file blob into text, and offers options to save the original PDF, the intermediate Google Doc, and/or the final plain text file. The language used for Optical Character Recognition (OCR) may also be specified; it defaults to 'en' (English).

Note: Updated 12 May 2015 due to deprecation of DocsList. Thanks to Bruce McPherson for the getDriveFolderFromPath() utility.

    // Start with a Blob object
    var blob = gmailAttachment.getAs(MimeType.PDF);

    // fileId will be the ID of a saved text file (default behavior):
    var fileId = pdfToText( blob );

    // filetext will contain text from pdf file, no residual files are saved:
    var filetext = pdfToText( blob, {keepTextfile: false} );

    // we can save other converted file types, too:
    var options = {
       keepPdf : true,            // Keep a copy of the original PDF file.
       keepGdoc : true,           // Keep a copy of the OCR Google Doc file.
       keepTextfile : true,       // Keep a copy of the text file. (default)
       textResult : true,         // Return the text instead of the text file's ID.
       path : "attachments/today" // Folder path to store file(s) in.
    };
    filetext = pdfToText( blob, options );
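
What pdfToText() hands back depends on two of the options. As a rough illustration, the rule described in the function's doc comment can be written as a small standalone function. The name describeReturnValue is hypothetical and used only for this sketch; it is not part of the utility and touches no Drive APIs.

```javascript
// Hypothetical helper, not part of the gist: reports which kind of value
// pdfToText() would return for a given options object.
function describeReturnValue(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, "keepTextfile")
    ? options.keepTextfile
    : true;
  // Text is returned when no text file is kept, or when textResult asks for it.
  if (!keepTextfile || options.textResult) {
    return "text";
  }
  return "fileId";
}

console.log(describeReturnValue());                        // fileId
console.log(describeReturnValue({ keepTextfile: false })); // text
console.log(describeReturnValue({ textResult: true }));    // text
```

In other words, the file ID comes back only when a text file is kept and textResult is not set.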
/**
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf      (boolean, default false) Keep a copy of the original PDF file.
 *   keepGdoc     (boolean, default false) Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true)  Keep a copy of the text file.
 *   path         (string, default blank)  Folder path to store file(s) in.
 *   ocrLanguage  (ISO 639-1 code)         Default 'en'.
 *   textResult   (boolean, default false) If true and keepTextfile true, return
 *                                         string of text content. If keepTextfile
 *                                         is false, text content is returned without
 *                                         regard to this option. Otherwise, return
 *                                         id of textfile.
 *
 * @param {blob}   pdfFile   Blob containing pdf file
 * @param {object} options   (Optional) Object specifying handling details
 *
 * @returns {string}         id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    parents.push( getDriveFolderFromPath (options.path) );
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;
    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}

// Helper utility from http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else {
      return current ? null : prev;
    }
  },DriveApp.getRootFolder());
}
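
getDriveFolderFromPath() walks the path one segment at a time, starting from the root folder, and returns null as soon as a segment is not found. The sketch below reproduces that reduce logic against a hypothetical in-memory folder tree, so the behaviour can be checked outside Apps Script. makeFolder and folderFromPath are illustrative names only; they are not part of the gist or of DriveApp.

```javascript
// Hypothetical in-memory stand-in for a Drive folder. Only the methods the
// helper relies on are modelled.
function makeFolder(name, children) {
  children = children || [];
  return {
    getName: function () { return name; },
    getFoldersByName: function (wanted) {
      var matches = children.filter(function (c) { return c.getName() === wanted; });
      var i = 0;
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

// Same reduce logic as getDriveFolderFromPath(), with the root folder passed
// in rather than fetched from DriveApp.getRootFolder().
function folderFromPath(root, path) {
  return (path || "/").split("/").reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    // Empty segments (leading or trailing "/") keep the current folder;
    // a named segment after a failed lookup stays null.
    return current ? null : prev;
  }, root);
}

var today = makeFolder("today");
var root = makeFolder("root", [makeFolder("attachments", [today])]);

console.log(folderFromPath(root, "attachments/today") === today);   // true
console.log(folderFromPath(root, "/attachments/today/") === today); // true
console.log(folderFromPath(root, "attachments/missing"));           // null
```

Because a missing folder yields null, a caller would do well to check the result before using it as a parent.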
@bobsquito

I've been using this for a few months but yesterday it just stopped working completely, is anyone else having this problem?

It fails on line
var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

Just says internal error. I've also noticed that I can't right click a pdf in Drive and go to "open with > google docs" as that just errors too. I hope this gets fixed as I use this a lot.

@woetwoet

Same issue here.... Also internal error on the same line.

@woetwoet

I logged it as an issue with Google: https://code.google.com/p/google-apps-script-issues/issues/detail?id=6201. As far as I can see, it is a bug with the Google dudes.

@dan23njguy

Hi mogsdad,

I currently use a script to download emailed pdf docs and save them to drive. Would you be interested in combining the two scripts or adding on to either? I am interested in engaging someone to code various scripts.

Thanks!
Dan

@arturoliveira

@dan23njguy I also currently use another script to save attachments from Gmail to Drive. I was looking for a way to search text inside PDFs and found this. Would you be interested in working together?

@2wice2

2wice2 commented Mar 19, 2018

Is it possible to rename PDF from the content of PDF?

@Armsp0

Armsp0 commented May 24, 2018

I try to debug/run this code and I get the error "TypeError: Cannot call method "getName" of undefined etc"
What is wrong here? obviously I am a novice....
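
One likely cause, offered as a guess since the calling code is not shown: pdfToText was run directly from the script editor, so it was called with no arguments and pdfFile was undefined. A small guard such as the hypothetical requirePdfBlob below (not part of the gist) would turn that into a clearer message. Either way, pdfToText needs to be called from another function that passes it a blob.

```javascript
// Hypothetical guard, not part of the gist: fail early with a readable
// message when the first argument is missing or is not blob-like.
function requirePdfBlob(pdfFile) {
  if (!pdfFile || typeof pdfFile.getName !== "function") {
    throw new Error(
      "pdfToText() expects a PDF blob as its first argument. " +
      "Call it from another function that passes one in."
    );
  }
  return pdfFile;
}

// Stand-in object with the one method the guard checks for.
var fakeBlob = { getName: function () { return "invoice.pdf"; } };
console.log(requirePdfBlob(fakeBlob).getName()); // invoice.pdf
```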

@vladox

vladox commented Apr 10, 2021

I'd recommend using pdftotext from the poppler package.

@ddavidio

Hi David, by any chance, is there a license for this gist? I would like to use it in one of my projects, but I want to make sure that I am not infringing on your copyright.

@muhammadammarafzal

Hi, my files are being saved in "My Drive" rather than in my desired folder, even though I put the right address in path : "Attachments/Test", which exists. Can anyone help me solve this issue?

@appscriptexpert

Hi, it is actually referring to the main Drive (Drive.Files); you need to replace it with "DriveApp.getFolderById('string_id_of_my_folder');"

You may visit us for more help at https://appsscriptexpert.com/

@muhammadammarafzal

It's giving an error: "TypeError: DriveApp.getFolderById(...).insert is not a function".
https://script.google.com/d/17hfC56JdTkwL-XoXh_88u3bYysJuqd9ihKLtoO_rgEh-7lkCXPiHe9o6/edit?usp=sharing

@thokoe

thokoe commented Oct 20, 2022

Hi, is it possible to run the script without having to save the Google Doc to Drive and then delete it?

@appscriptexpert

appscriptexpert commented Oct 20, 2022 via email

@dahse89

dahse89 commented Apr 13, 2023

The script wasn't working for me, but I found this: https://www.labnol.org/extract-text-from-pdf-220422

/*
 * Convert PDF file to text
 * @param {string} fileId - The Google Drive ID of the PDF
 * @param {string} language - The language of the PDF text to use for OCR
 * return {string} - The extracted text of the PDF file
 */

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
    {
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
    },
    pdfDocument.getBlob(),
    {
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',
    }
  );

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed
  DriveApp.getFileById(id).setTrashed(true);

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
};

@ariessetiyawan

How about getting images? When I add this line to the script:

const ImgContent = DocumentApp.openById(id).getBody().getImages();

I cannot get all of the images. The PDF file contains 3 images, but only 2 are detected.

@fernandomora

The Drive.Files.insert call in the gist is outdated: the parents field now needs a ParentReference, so as written the request always uploads to the root folder.

The if (options.path) block must be replaced by:

  if (options.path) {
    const folder = getDriveFolderFromPath (options.path);
    if (folder) {
      const parentReference = Drive.newParentReference();
      parentReference.id = folder.getId();
      parents.push(parentReference);
    }
  }

@lidia-rbr

I'm using this function that works well:

/**
 * EXTRACT TEXT CONTENT FROM PDF 
 * 
 * @param {string} fileId
 * @param {string} parentFolderId
 * @returns {string} pdfContent
 */
function extractTextFromPDF(fileId, parentFolderId) {

  const destFolder = Drive.Files.get(parentFolderId, { "supportsAllDrives": true });
  const newFile = {
    "fileId": fileId,
    "parents": [
      destFolder
    ]
  };
  const args = {
    "resource": {
      "parents": [
        destFolder
      ],
      "name": "temp",
      "mimeType": "application/vnd.google-apps.document",
    },
    "supportsAllDrives": true
  };

  const newTargetDoc = Drive.Files.copy(newFile, fileId, args);
  const newTargetFile = DocumentApp.openById(newTargetDoc.getId());
  const pdfContent = newTargetFile.getBody().getText();

  return pdfContent;
}

@yusufnadiruzun

I got an "insert is not a function" error and fixed it with this code. It can also read multiple PDF files.

const convertPDFToText = (pdfDocument) => {
  try {

    // Use OCR to convert PDF to a temporary Google Document
    const fileMetadata = {
      name: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: 'application/vnd.google-apps.document' // Ensuring the target MIME type is Google Docs
    };

    const media = pdfDocument.getBlob();

    const options = {
      ocr: true,
      ocrLanguage: "en",
      fields: 'id, name'
    };

    const response = Drive.Files.create(fileMetadata, media, options);
    const { id, name } = response;

    // Add a delay to ensure the document is fully processed
    Utilities.sleep(10000); // 10 seconds

    // Verify the document exists and is accessible
    const tempFile = DriveApp.getFileById(id);
   
    // Check if the file is a Google Document
    const mimeType = tempFile.getMimeType();
    
    // Ensure the file is a Google Document
    if (mimeType !== MimeType.GOOGLE_DOCS) {
      throw new Error(`Unexpected MIME type: ${mimeType}`);
    }

    // Use the Document API to extract text from the Google Document
    const doc = DocumentApp.openById(id);
    const body = doc.getBody();

    // Check if the document body is empty
    if (!body || !body.getText()) {
      throw new Error('Document body is empty or not accessible');
    }

    const textContent = body.getText();
    
    // Delete the temporary Google Document since it is no longer needed
    DriveApp.getFileById(id).setTrashed(true);
    return textContent;
  } catch (error) {
    Logger.log(`Error: ${error.message}`);
    throw error;
  }
};


const convertPDFsInFolderToText = (folderId) => {
  var folder = DriveApp.getFolderById(folderId);
  var files = folder.getFiles();
  var allTextContent = "";

  while (files.hasNext()) {
    var pdfFile = files.next();
    try {
      var textContent = convertPDFToText(pdfFile);
      allTextContent += textContent;
    } catch (error) {
      Logger.log(`Failed to process file ${pdfFile.getName()}: ${error.message}`);
    }
  }
  return allTextContent;
};

function Run() {
  const folderId = "";
  console.log(convertPDFsInFolderToText(folderId));
}
