html-page-stripper.md

The page stripper should be designed as a series of filters. Each filter takes a DOM tree as input and returns a modified DOM tree. Filters can be chained together to produce the final result, and different filters can be chosen conditionally at each stage.
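A minimal sketch of the filter idea in Python. The `Node` class is a hypothetical stand-in for a real DOM element, and `chain`, `drop_tag`, `build_pipeline`, and the `aggressive` flag are names invented for illustration only.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; children are Nodes or text strings."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# A filter takes a tree and returns a (possibly new) tree.
Filter = Callable[[Node], Node]


def chain(*filters: Filter) -> Filter:
    """Compose filters left to right into a single filter."""
    def run(tree: Node) -> Node:
        return reduce(lambda acc, f: f(acc), filters, tree)
    return run


def drop_tag(name: str) -> Filter:
    """Build a filter that removes every descendant element with the given tag."""
    def apply(node: Node) -> Node:
        kept = [apply(c) if isinstance(c, Node) else c
                for c in node.children
                if not (isinstance(c, Node) and c.tag == name)]
        return Node(node.tag, dict(node.attrs), kept)
    return apply


def build_pipeline(aggressive: bool) -> Filter:
    """Choose stages conditionally; the 'aggressive' switch is an assumed example."""
    stages = [drop_tag("script")]
    if aggressive:
        stages.append(drop_tag("iframe"))
    return chain(*stages)


page = Node("body", children=[Node("script"), Node("iframe"),
                              Node("p", children=["hi"])])
cleaned = build_pipeline(aggressive=True)(page)
```

Each filter here returns a fresh tree rather than mutating its input, which keeps the stages independent of one another. Mutating in place would also work and is a reasonable choice for large pages.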

1. DOM Stripping

This phase removes elements which are not relevant to text. Certain tags can be completely blacklisted. These include:

<button>
<input>
<script>
<canvas>
<audio>
<object>
<video>
<track>
<iframe>

Others can be removed based on classes, attributes, or content.
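The stripping phase might look roughly like the sketch below. The tag blacklist is taken from the list above; the `Node` class is a hypothetical stand-in for a DOM element, and the class names in `NOISE_CLASSES` are assumed examples, not a recommended set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; children are Nodes or text strings."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Tags that never carry readable text (from the list above).
BLACKLIST = {"button", "input", "script", "canvas", "audio",
             "object", "video", "track", "iframe"}

# Assumed examples of class names that mark non-content elements.
NOISE_CLASSES = {"sidebar", "advert"}


def is_noise(node: Node) -> bool:
    """True if the element should be removed together with its subtree."""
    if node.tag in BLACKLIST:
        return True
    classes = set(node.attrs.get("class", "").split())
    return bool(classes & NOISE_CLASSES)


def strip_dom(node: Node) -> Node:
    """Return a copy of the tree without noise elements or their descendants."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if is_noise(child):
                continue  # drop the whole branch
            child = strip_dom(child)
        kept.append(child)
    return Node(node.tag, dict(node.attrs), kept)


page = Node("body", children=[
    Node("div", {"class": "sidebar wide"}, [Node("p", children=["links"])]),
    Node("div", {"class": "content"}, [
        Node("script", children=["track()"]),
        Node("p", children=["Article text"]),
    ]),
])
stripped = strip_dom(page)
```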

2. Branch of Interest

The DOM can be thought of as a tree. The goal of branch-of-interest detection is to find the subtree which contains the content of interest, and then to eliminate the other branches. For example, suppose we can determine that a vertex has all of the content among its descendants. Then we can use this vertex as the new root and ignore all other portions of the tree.

One possible algorithm is to find all elements that contain something of interest, for example all <p> or <div> tags with interesting text content, and then find their lowest common ancestor: the deepest element that contains all of them.
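The common-ancestor step can be sketched as follows: record the root-to-element path of every interesting element, then keep the longest prefix shared by all paths. The `Node` class is a hypothetical stand-in for a DOM element, and the "interesting" test (a `p` or `div` with at least `MIN_TEXT_LENGTH` characters of its own text) is an assumed placeholder for whatever signals are actually used.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; children are Nodes or text strings."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


INTERESTING_TAGS = {"p", "div"}
MIN_TEXT_LENGTH = 20  # assumed threshold for "significant text"


def own_text(node: Node) -> str:
    """Text held directly by this element, ignoring descendants."""
    return " ".join(c.strip() for c in node.children if isinstance(c, str)).strip()


def is_interesting(node: Node) -> bool:
    return node.tag in INTERESTING_TAGS and len(own_text(node)) >= MIN_TEXT_LENGTH


def interesting_paths(node: Node, path: tuple = ()) -> list:
    """Collect the root-to-element path of every interesting element."""
    path = path + (node,)
    found = [path] if is_interesting(node) else []
    for child in node.children:
        if isinstance(child, Node):
            found.extend(interesting_paths(child, path))
    return found


def branch_of_interest(root: Node) -> Node:
    """Deepest element containing every interesting element (root if none found)."""
    ancestor = root
    for level in zip(*interesting_paths(root)):
        # Compare by identity: two distinct elements may have equal content.
        if any(n is not level[0] for n in level):
            break
        ancestor = level[0]
    return ancestor


content = Node("div", {"class": "content"}, [
    Node("h1", children=["Title"]),
    Node("p", children=["First paragraph of the article body."]),
    Node("p", children=["Second paragraph with more article text."]),
])
sidebar = Node("div", {"class": "sidebar"}, [Node("a", children=["Home"])])
page = Node("html", children=[Node("body", children=[sidebar, content])])
found = branch_of_interest(page)
```

If a long block of text also appears in an unrelated branch, such as a footer, the common ancestor climbs back toward the root, so in practice the interest test probably needs to be fairly strict.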

Signals include:

significant text
content elements
CSS classes such as "content"
text which matches the meta content or description
text which matches the page title
heading elements
main images

<html>
<head>
</head>
<body>
  <div class="sidebar">
  ...
  </div>
  <div class="content">
     <!-- point of interest -->
  </div>
</body>
</html>

EXTRACTION => 

<div class="content">
   <!-- point of interest -->
</div>

3. DOM Flattening

This phase removes tags which are not relevant while preserving the content inside them. For example, divs and spans which exist only for formatting are replaced by their contents.
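A minimal sketch of flattening, in which each wrapper element is replaced by its own children. The `Node` class is a hypothetical stand-in for a DOM element, and treating every `div` and `span` as a wrapper is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; children are Nodes or text strings."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Assumed set of tags that only provide formatting or grouping.
WRAPPER_TAGS = {"div", "span"}


def flatten(node: Node) -> Node:
    """Return a copy in which wrapper descendants are replaced by their contents."""
    flat = []
    for child in node.children:
        if isinstance(child, Node):
            child = flatten(child)  # flatten bottom-up
            if child.tag in WRAPPER_TAGS:
                flat.extend(child.children)  # hoist contents, drop the wrapper
                continue
        flat.append(child)
    return Node(node.tag, dict(node.attrs), flat)


article = Node("article", children=[
    Node("div", children=[
        Node("span", children=["Hello "]),
        Node("em", children=["world"]),
    ]),
])
flattened = flatten(article)
```

The root passed to `flatten` is never unwrapped, only its descendants. Unwrapping a `div` also discards the block-level break it implied, so a fuller version might convert some wrappers to `p` instead of removing them.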

4. Attribute Simplification

Classes and other extra attributes are stripped, leaving plain HTML tags with only simple attributes.
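One way to sketch this is a per-tag allowlist: every attribute not listed for a tag is dropped. The `Node` class is a hypothetical stand-in for a DOM element, and the allowlist below (`href` on links, `src` and `alt` on images) is an assumed example of what counts as a "simple" attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; children are Nodes or text strings."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Assumed allowlist: attributes that carry content rather than presentation.
ALLOWED_ATTRS = {
    "a": {"href"},
    "img": {"src", "alt"},
}


def simplify_attrs(node: Node) -> Node:
    """Return a copy of the tree keeping only allowlisted attributes per tag."""
    allowed = ALLOWED_ATTRS.get(node.tag, set())
    attrs = {k: v for k, v in node.attrs.items() if k in allowed}
    children = [simplify_attrs(c) if isinstance(c, Node) else c
                for c in node.children]
    return Node(node.tag, attrs, children)


tree = Node("p", {"class": "lead", "style": "color: red"}, [
    "See ",
    Node("a", {"href": "/about", "class": "btn", "data-id": "7"}, ["about"]),
])
simple = simplify_attrs(tree)
```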

justinmeiners/html-page-stripper.md
