@mogsdad
Last active September 3, 2025 23:32
For http://stackoverflow.com/questions/26613809, a question about getting PDF attachments in Gmail as text. I got a little carried away - this does much more than asked.

Google Apps Script pdfToText Utility

This helper function converts a given PDF file blob into text, with options to save the original PDF, the intermediate Google Doc, and/or the final plain text file. The language used for Optical Character Recognition (OCR) may also be specified; it defaults to 'en' (English).

Note: Updated 12 May 2015 due to deprecation of DocsList. Thanks to Bruce McPherson for the getDriveFolderFromPath() utility.

    // Start with a Blob object
    var blob = gmailAttachment.getAs(MimeType.PDF);
    
    // fileId will be the ID of a saved text file (default behavior):
    var fileId = pdfToText( blob );

    // filetext will contain text from pdf file, no residual files are saved:
    var filetext = pdfToText( blob, {keepTextfile: false} );

    // we can save other converted file types, too:
    var options = {
       keepPdf : true,            // Keep a copy of the original PDF file.
       keepGdoc : true,           // Keep a copy of the OCR Google Doc file.
       keepTextfile : true,       // Keep a copy of the text file. (default)
       path : "attachments/today" // Folder path to store file(s) in.
    };
    // keepTextfile is true and textResult is not set, so this returns a file ID:
    fileId = pdfToText( blob, options );
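
Whether pdfToText() returns a file ID or the extracted text depends on two options, keepTextfile and textResult. The rule can be sketched as a small pure function. `describeResult` is a hypothetical helper, not part of the gist; it only mirrors the default handling and the final if/else of pdfToText().

```javascript
// Hypothetical helper, not part of the gist: predicts what pdfToText()
// returns for a given options object, mirroring its final if/else.
function describeResult(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, 'keepTextfile')
    ? options.keepTextfile
    : true;
  // Text is returned when no text file is kept, or when textResult is set.
  return (!keepTextfile || options.textResult) ? 'text' : 'fileId';
}
```

For instance, the last options object in the example above keeps the text file and leaves textResult unset, so describeResult(options) yields 'fileId'.
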
/**
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf      (boolean, default false) Keep a copy of the original PDF file.
 *   keepGdoc     (boolean, default false) Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true)  Keep a copy of the text file.
 *   path         (string, default blank)  Folder path to store file(s) in.
 *   ocrLanguage  (ISO 639-1 code)         Default 'en'.
 *   textResult   (boolean, default false) If true and keepTextfile true, return
 *                                         string of text content. If keepTextfile
 *                                         is false, text content is returned without
 *                                         regard to this option. Otherwise, return
 *                                         id of textfile.
 *
 * @param {blob}    pdfFile    Blob containing pdf file
 * @param {object}  options    (Optional) Object specifying handling details
 *
 * @returns {string}           id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    parents.push( getDriveFolderFromPath (options.path) );
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;
    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}

// Helper utility from http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else {
      return current ? null : prev;
    }
  },DriveApp.getRootFolder());
}
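
To see how getDriveFolderFromPath() treats leading slashes and missing folders, the same reduce walk can be run against a plain-object tree instead of DriveApp folders. This is an illustration only: `walkPath` and `tree` are hypothetical stand-ins, and property lookup replaces getFoldersByName().

```javascript
// Illustration only: the same reduce walk as getDriveFolderFromPath(),
// run against a plain-object tree instead of DriveApp folders.
function walkPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      // Descend one level; a missing child ends the walk with null.
      return Object.prototype.hasOwnProperty.call(prev, current) ? prev[current] : null;
    }
    // Empty segments (leading or doubled slashes) keep the current folder.
    return current ? null : prev;
  }, root);
}

var tree = { attachments: { today: {} } };
walkPath(tree, 'attachments/today');   // the inner "today" object
walkPath(tree, '/attachments/today');  // same result; the leading slash is ignored
walkPath(tree, 'attachments/missing'); // null
```

Because the walk yields null for a path that does not exist, a caller would want to check the result before using it as a parent folder.
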
@muhammadammarafzal

Hi, my files are being saved in "My Drive" rather than in my desired folder, even though I set path : "Attachments/Test", which exists. Can anyone help me solve this issue?

Hi, it is actually referring to the main Drive (Drive.Files); you need to replace it with "DriveApp.getFolderById('string_id_of_my_folder');"

You may visit us for more help at https://appsscriptexpert.com/

It's giving an error: "TypeError: DriveApp.getFolderById(...).insert is not a function".
https://script.google.com/d/17hfC56JdTkwL-XoXh_88u3bYysJuqd9ihKLtoO_rgEh-7lkCXPiHe9o6/edit?usp=sharing
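
The error above fits how the two services differ: `insert` is a method of the advanced Drive service (`Drive.Files`), whereas a folder returned by `DriveApp.getFolderById()` offers `createFile(blob)` instead. A minimal sketch of saving a blob into a specific folder that way follows; `saveBlobToFolder` is a hypothetical helper, not part of the gist, and it only stores the blob as-is (no OCR conversion).

```javascript
// Sketch: save a blob into a specific folder with the built-in DriveApp service.
// `folder` is expected to be a DriveApp Folder, e.g. DriveApp.getFolderById(id).
function saveBlobToFolder(folder, blob) {
  // Folder objects expose createFile(blob); they have no insert() method.
  var file = folder.createFile(blob);
  return file.getId();
}
```

In a script this might be called as saveBlobToFolder(DriveApp.getFolderById('string_id_of_my_folder'), textBlob).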

@thokoe

thokoe commented Oct 20, 2022

Hi, is it possible to run the script without having to save the Google Doc to Drive and then delete it?

@appscriptexpert

appscriptexpert commented Oct 20, 2022 via email

@dahse89

dahse89 commented Apr 13, 2023

The script wasn't working for me, but I found this: https://www.labnol.org/extract-text-from-pdf-220422

/**
 * Convert PDF file to text
 * @param {string} fileId - The Google Drive ID of the PDF
 * @param {string} language - The language of the PDF text to use for OCR
 * @returns {string} - The extracted text of the PDF file
 */

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
    {
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
    },
    pdfDocument.getBlob(),
    {
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',
    }
  );

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed
  DriveApp.getFileById(id).setTrashed(true);

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
};

@ariessetiyawan


How about getting images? When I add this line to the script,

const ImgContent = DocumentApp.openById(id).getBody().getImages();

I cannot get all of the images: the PDF file contains 3 images, but only 2 are detected.

@fernandomora

The way this script calls the Drive.Files.insert API is outdated: the parents field now needs a ParentReference, and as written the request always uploads to the root folder.

The if (options.path) block must be replaced by:

  if (options.path) {
    const folder = getDriveFolderFromPath (options.path);
    if (folder) {
      const parentReference = Drive.newParentReference();
      parentReference.id = folder.getId();
      parents.push(parentReference);
    }
  }
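
The shape of the `parents` field differs between Drive API versions, which matters when mixing the snippets in this thread. In v2 (`Drive.Files.insert`) each parent is a ParentReference object with an `id`; in v3 (`Drive.Files.create`) each parent is a plain folder ID string. A minimal sketch using object literals follows; `buildParents` and its `apiVersion` argument are hypothetical names, not part of the gist.

```javascript
// Sketch: build the `parents` value for a Drive file resource.
// Drive API v2 (Drive.Files.insert) takes ParentReference objects: [{ id: ... }].
// Drive API v3 (Drive.Files.create) takes plain folder IDs: ['...'].
function buildParents(folderId, apiVersion) {
  if (!folderId) {
    // No folder resolved: leave parents empty, as pdfToText() does without a path.
    return [];
  }
  return apiVersion === 3 ? [folderId] : [{ id: folderId }];
}
```

With the gist's v2 calls this would be used as resource.parents = buildParents(folder.getId(), 2), after checking that the folder lookup did not return null.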

@lidia-rbr

I'm using this function, which works well:

/**
 * EXTRACT TEXT CONTENT FROM PDF 
 * 
 * @param {string} fileId
 * @param {string} parentFolderId
 * @returns {string} pdfContent
 */
function extractTextFromPDF(fileId, parentFolderId) {

  const destFolder = Drive.Files.get(parentFolderId, { "supportsAllDrives": true });
  const newFile = {
    "fileId": fileId,
    "parents": [
      destFolder
    ]
  };
  const args = {
    "resource": {
      "parents": [
        destFolder
      ],
      "name": "temp",
      "mimeType": "application/vnd.google-apps.document",
    },
    "supportsAllDrives": true
  };

  const newTargetDoc = Drive.Files.copy(newFile, fileId, args);
  const newTargetFile = DocumentApp.openById(newTargetDoc.getId());
  const pdfContent = newTargetFile.getBody().getText();

  return pdfContent;
}

@yusufnadiruzun

I got an "insert is not a function" error and fixed it with this code. It can also read multiple PDF files.

const convertPDFToText = (pdfDocument) => {
  try {

    // Use OCR to convert PDF to a temporary Google Document
    const fileMetadata = {
      name: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: 'application/vnd.google-apps.document' // Ensuring the target MIME type is Google Docs
    };

    const media = pdfDocument.getBlob();

    const options = {
      ocr: true,
      ocrLanguage: "en",
      fields: 'id, name'
    };

    const response = Drive.Files.create(fileMetadata, media, options);
    const { id, name } = response;

    // Add a delay to ensure the document is fully processed
    Utilities.sleep(10000); // 10 seconds

    // Verify the document exists and is accessible
    const tempFile = DriveApp.getFileById(id);
   
    // Check if the file is a Google Document
    const mimeType = tempFile.getMimeType();
    
    // Ensure the file is a Google Document
    if (mimeType !== MimeType.GOOGLE_DOCS) {
      throw new Error(`Unexpected MIME type: ${mimeType}`);
    }

    // Use the Document API to extract text from the Google Document
    const doc = DocumentApp.openById(id);
    const body = doc.getBody();

    // Check if the document body is empty
    if (!body || !body.getText()) {
      throw new Error('Document body is empty or not accessible');
    }

    const textContent = body.getText();
    
    // Delete the temporary Google Document since it is no longer needed
    DriveApp.getFileById(id).setTrashed(true);
    return textContent;
  } catch (error) {
    Logger.log(`Error: ${error.message}`);
    throw error;
  }
};


const convertPDFsInFolderToText = (folderId) => {
  var folder = DriveApp.getFolderById(folderId);
  var files = folder.getFiles();
  var allTextContent = "";

  while (files.hasNext()) {
    var pdfFile = files.next();
    try {
      var textContent = convertPDFToText(pdfFile);
      allTextContent += textContent;
    } catch (error) {
      Logger.log(`Failed to process file ${pdfFile.getName()}: ${error.message}`);
    }
  }
  return allTextContent;
};

function Run() {
  const folderId = "";
  console.log(convertPDFsInFolderToText(folderId));
}
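
The fixed 10-second sleep above always waits the full time, even when the document is ready sooner. One alternative, swapped in here and not part of the original code, is to retry the read with a growing delay. A minimal sketch follows; `retryWithBackoff` and all of its parameters are hypothetical names, and the sleep function is passed in so the helper does not depend on Apps Script globals.

```javascript
// Sketch: retry a function with a growing delay instead of one fixed sleep.
// `sleepFn` is injected; in Apps Script, pass
//   function (ms) { Utilities.sleep(ms); }
function retryWithBackoff(fn, maxAttempts, baseDelayMs, sleepFn) {
  var lastError;
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return fn(); // success: stop retrying
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleepFn(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError || new Error('maxAttempts must be at least 1');
}
```

In convertPDFToText, the text read could then be wrapped as retryWithBackoff(function () { return DocumentApp.openById(id).getBody().getText(); }, 4, 1000, function (ms) { Utilities.sleep(ms); }). Whether retries are needed at all depends on how quickly Drive finishes the conversion, which this sketch does not establish.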
