Skip to content

Instantly share code, notes, and snippets.

@tanaikech
Last active July 18, 2024 06:01
Show Gist options
  • Save tanaikech/e85193a89d041fed6122583739309786 to your computer and use it in GitHub Desktop.
Save tanaikech/e85193a89d041fed6122583739309786 to your computer and use it in GitHub Desktop.
Parsing HTML using Google Apps Script

Parsing HTML using Google Apps Script

This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to Google Document, the HTML data can be parsed and be converted to Google Document. In this case, the paragraphs, lists and tables are included. From this situation, I thought that this situation can be used for parsing HTML using Google Apps Script. So I could came up with this method.

In the Sheet API, the HTML data can be put to the Spreadsheet with the PasteDataRequest. But unfortunately, in this case, I couldn't distinguish between the body and tables.

The flow of this method is as follows. In this sample script, the tables from HTML are retrieved.

Flow

  1. Retrieve HTML data using UrlFetchApp.fetch().
  2. Create new Google Document by converting HTML data to Google Document using Drive API.
    • This is a temporal file.
  3. Retrieve all tables using Document service of Google Apps Script.
  4. Delete the temporal file.

Sample script

Before you run this script, please enable Drive API at Advanced Google Services.

function parseTablesFromHTML(url) {
  var html = UrlFetchApp.fetch(url);
  var docId = Drive.Files.insert(
    { title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
    html.getBlob()
  ).id;
  var tables = DocumentApp.openById(docId)
    .getBody()
    .getTables();
  var res = tables.map(function(table) {
    var values = [];
    for (var row = 0; row < table.getNumRows(); row++) {
      var temp = [];
      var cols = table.getRow(row);
      for (var col = 0; col < cols.getNumCells(); col++) {
        temp.push(cols.getCell(col).getText());
      }
      values.push(temp);
    }
    return values;
  });
  Drive.Files.remove(docId);
  return res;
}

// Please run this function.
function run() {
  var url = "###"; // <--- Please set URL that you want to retrieve table.
  var res = parseTablesFromHTML(url);
  Logger.log(res);
}

Result

As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb to url and run the script, the following result can be retrieved.

[
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ],
  [
    ["head2_1", "head2_2", "head2_3\n"],
    ["value2_a1", "value2_b1", "value2_c1"],
    ["value2_a2", "value2_b2", "value2_c2"]
  ]
]

Note

  • Using this method, all paragraphs and lists can be also retrieved.
  • This method can be also used with other languages.

References

@MostHated
Copy link

You will have to forgive me, as JS is not my normal language, but I have a script that I am using with Sheets that is using UrlFetchApp.fetch() and Parser().data().from().to() in a manner similar to what you are doing here. But those seem to be unique to App Script. Is anyone familiar with a library that lets you have this same workflow/functionality, just without the need to be running in Google App Script? Some sort of standalone equivalent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment