@justinmeiners
Last active June 16, 2019 04:04

Page processing should be designed as a series of filters. Each filter takes a DOM tree as input and returns a modified DOM tree. Filters can be chained together to produce the final result, and different filters can be chosen conditionally at each stage.
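A minimal sketch of the filter chain, assuming the DOM is held as an `xml.etree.ElementTree` tree built from well-formed markup. The names `chain`, `build_pipeline`, `drop_class_attributes`, and `drop_id_attributes` are hypothetical and exist only for illustration.

```python
from functools import reduce
from typing import Callable, Iterable
import xml.etree.ElementTree as ET

# A filter takes a tree and returns a (possibly modified) tree.
Filter = Callable[[ET.Element], ET.Element]


def chain(filters: Iterable[Filter]) -> Filter:
    """Compose filters left to right into a single filter."""
    steps = tuple(filters)

    def run(root: ET.Element) -> ET.Element:
        return reduce(lambda tree, step: step(tree), steps, root)

    return run


def drop_class_attributes(root: ET.Element) -> ET.Element:
    # Toy filter: remove every class attribute.
    for elem in root.iter():
        elem.attrib.pop("class", None)
    return root


def drop_id_attributes(root: ET.Element) -> ET.Element:
    # Toy filter: remove every id attribute.
    for elem in root.iter():
        elem.attrib.pop("id", None)
    return root


def build_pipeline(keep_ids: bool) -> Filter:
    # Filters are chosen conditionally when the pipeline is assembled.
    steps = [drop_class_attributes]
    if not keep_ids:
        steps.append(drop_id_attributes)
    return chain(steps)


root = ET.fromstring('<div class="content" id="main"><p class="lead">Hello</p></div>')
result = build_pipeline(keep_ids=True)(root)
print(ET.tostring(result, encoding="unicode"))
# prints: <div id="main"><p>Hello</p></div>
```

Because every filter has the same signature, stages can be reordered, skipped, or swapped without touching the others.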

1. DOM Stripping

This phase removes elements that are not relevant to the text. Certain tags can be blacklisted outright. These include:

  • <button>
  • <input>
  • <script>
  • <canvas>
  • <audio>
  • <object>
  • <video>
  • <track>
  • <iframe>

Others can be removed based on classes, attributes, or content.
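The tag blacklist could be applied roughly as follows. This is a sketch under the assumption that the page has already been parsed into an `xml.etree.ElementTree` tree; `strip_blacklisted` and `drop` are hypothetical helper names. ElementTree stores the text that follows an element in that element's `tail`, so the sketch moves the tail to the preceding node before removing the element.

```python
import xml.etree.ElementTree as ET

BLACKLIST = {
    "button", "input", "script", "canvas", "audio",
    "object", "video", "track", "iframe",
}


def drop(parent: ET.Element, child: ET.Element) -> None:
    """Remove child (and its subtree) but keep the text that follows it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def strip_blacklisted(elem: ET.Element) -> ET.Element:
    """Filter: delete every blacklisted element below elem."""
    for child in list(elem):  # copy, because the child list is mutated
        if child.tag in BLACKLIST:
            drop(elem, child)
        else:
            strip_blacklisted(child)
    return elem


root = ET.fromstring(
    "<div><p>Hello <script>x()</script>world</p>"
    "<iframe src='a'/><button>Go</button></div>"
)
print(ET.tostring(strip_blacklisted(root), encoding="unicode"))
# prints: <div><p>Hello world</p></div>
```

Removal by class, attribute, or content would follow the same shape, with the `child.tag in BLACKLIST` test replaced by a different predicate.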

2. Branch of Interest

The DOM can be thought of as a tree. The goal of branch-of-interest detection is to find the subtree that contains the content of interest and to eliminate the other branches. For example, suppose we can determine that a vertex has all of the content among its descendants. Then we can use this vertex as the new root and ignore the rest of the tree.

One possible algorithm is to find all elements that contain something of interest, for example all <p> or <div> tags with interesting text content, and then find the lowest common ancestor of these elements.

Signals that an element is of interest include:

  • significant text
  • content elements
  • CSS classes such as "content"
  • text which matches the meta content or description
  • text which matches the page title
  • heading elements
  • main images

<html>
<head>
</head>
<body>
  <div class="sidebar">
  ...
  </div>
  <div class="content">
     <!-- point of interest -->
  </div>
</body>
</html>

EXTRACTION =>

<div class="content">
   <!-- point of interest -->
</div>
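
The common-ancestor search described in this section can be sketched as below. The only signal used here is "significant text", approximated by a character count; the `MIN_CHARS` threshold and the names `is_interesting`, `path_from_root`, and `branch_of_interest` are assumptions made for illustration. A fuller version would combine several of the signals listed above.

```python
import xml.etree.ElementTree as ET

MIN_CHARS = 20  # hypothetical threshold for "significant text"


def is_interesting(elem: ET.Element) -> bool:
    # Only looks at the text directly at the start of a <p> or <div>.
    text = (elem.text or "").strip()
    return elem.tag in {"p", "div"} and len(text) >= MIN_CHARS


def path_from_root(elem, parents):
    """Return the list of elements from the root down to elem."""
    path = [elem]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]


def branch_of_interest(root: ET.Element) -> ET.Element:
    """Return the lowest common ancestor of all interesting elements."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    paths = [path_from_root(e, parents) for e in root.iter() if is_interesting(e)]
    if not paths:
        return root  # nothing matched, keep the whole tree
    common = root
    # Walk all paths in lockstep and stop where they first diverge.
    for level in zip(*paths):
        if any(node is not level[0] for node in level):
            break
        common = level[0]
    return common


page = ET.fromstring("""
<html><head></head><body>
  <div class="sidebar"><p>Home</p></div>
  <div class="content">
    <p>First paragraph with enough text to count.</p>
    <p>Second paragraph with enough text to count.</p>
  </div>
</body></html>
""")
branch = branch_of_interest(page)
print(branch.tag, branch.get("class"))
# prints: div content
```

If only one element is interesting, the result is that element itself, so a real detector would probably want a minimum number of matches before trusting the result.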

3. DOM Flattening

This phase removes tags that are not relevant while preserving the content inside them. For example, divs and spans that exist only for formatting can be unwrapped.
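
A sketch of flattening, assuming an `xml.etree.ElementTree` tree and treating every `div` and `span` as a formatting wrapper. That rule is deliberately crude, and `WRAPPERS`, `unwrap`, and `flatten` are hypothetical names. Unwrapping splices the wrapper's text and children into its parent at the same position.

```python
import xml.etree.ElementTree as ET

WRAPPERS = {"div", "span"}  # assumed set of formatting-only tags


def unwrap(parent: ET.Element, child: ET.Element) -> None:
    """Replace child with its own contents inside parent."""
    index = list(parent).index(child)
    kids = list(child)

    def append_before(text):
        # Attach text to whatever precedes child inside parent.
        if not text:
            return
        if index > 0:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + text
        else:
            parent.text = (parent.text or "") + text

    if kids:
        append_before(child.text)
        kids[-1].tail = (kids[-1].tail or "") + (child.tail or "")
    else:
        append_before((child.text or "") + (child.tail or ""))

    parent.remove(child)
    for offset, kid in enumerate(kids):
        parent.insert(index + offset, kid)


def flatten(elem: ET.Element) -> ET.Element:
    """Filter: unwrap every wrapper below elem (elem itself is kept)."""
    for child in list(elem):
        flatten(child)  # handle descendants first
        if child.tag in WRAPPERS:
            unwrap(elem, child)
    return elem


root = ET.fromstring(
    "<div><div><p>Hello <span>big</span> world</p></div><span>tail</span></div>"
)
print(ET.tostring(flatten(root), encoding="unicode"))
# prints: <div><p>Hello big world</p>tail</div>
```

The root is never unwrapped because it has no parent to receive its contents.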

4. Attribute Simplification

Classes and extra attributes are stripped, leaving simple HTML tags and simple attributes.
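
One way to sketch this is a per-tag allowlist. The note does not say which attributes count as "simple", so the `KEEP` table below is an assumption, as is the name `simplify_attributes`. The tree is again assumed to be an `xml.etree.ElementTree` tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist: attributes worth keeping, per tag.
KEEP = {
    "a": {"href"},
    "img": {"src", "alt"},
}


def simplify_attributes(root: ET.Element) -> ET.Element:
    """Filter: delete every attribute that is not allowlisted for its tag."""
    for elem in root.iter():
        allowed = KEEP.get(elem.tag, set())
        for name in list(elem.attrib):  # copy, because attrib is mutated
            if name not in allowed:
                del elem.attrib[name]
    return root


root = ET.fromstring(
    '<div class="content" id="main">'
    '<a href="/next" class="btn" onclick="go()">Next</a>'
    '<img src="a.png" alt="A" width="10"/>'
    "</div>"
)
simplify_attributes(root)
print(ET.tostring(root, encoding="unicode"))
```

An allowlist is used rather than a blocklist so that unknown attributes are dropped by default.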
