Page processing should be designed as a series of filters. Each filter takes a DOM tree as input and returns a modified DOM tree. Filters can be chained together to produce the final result, and different filters can be chosen conditionally at each stage.
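A minimal sketch of the filter chain is shown here, assuming the page has already been parsed into an `xml.etree.ElementTree` element (real-world HTML would need a tolerant HTML parser first). The names `Filter`, `drop_scripts`, `select_main`, `choose_filters`, and `run_pipeline` are illustrative and not part of any existing design.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# A filter takes the root of a DOM tree and returns the root of a modified tree.
Filter = Callable[[ET.Element], ET.Element]


def drop_scripts(root: ET.Element) -> ET.Element:
    """Hypothetical filter: remove every <script> element."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "script":
                parent.remove(child)
    return root


def select_main(root: ET.Element) -> ET.Element:
    """Hypothetical filter: use the first <main> element as the new root."""
    main = root.find(".//main")
    return main if main is not None else root


def choose_filters(root: ET.Element) -> List[Filter]:
    """Pick filters conditionally, based on what the page contains."""
    filters: List[Filter] = [drop_scripts]
    if root.find(".//main") is not None:
        filters.append(select_main)
    return filters


def run_pipeline(root: ET.Element, filters: List[Filter]) -> ET.Element:
    """Chain the filters: the output of one is the input of the next."""
    for apply_filter in filters:
        root = apply_filter(root)
    return root


page = ET.fromstring(
    "<html><body><script>x()</script><main><p>Hello</p></main></body></html>"
)
print(ET.tostring(run_pipeline(page, choose_filters(page)), encoding="unicode"))
```

Because every filter has the same signature, stages can be reordered, skipped, or swapped without touching the others.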
This phase removes elements which are not relevant to text. Certain tags can be completely blacklisted. These include:
<button>
<input>
<script>
<canvas>
<audio>
<object>
<video>
<track>
<iframe>
Others can be removed based on classes, attributes, or content.
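The removal phase could be sketched as a single function driven by a predicate, so the same code handles both the tag blacklist and removal by class. This assumes an `xml.etree.ElementTree` tree; `remove_elements`, `is_blacklisted`, and `has_class` are hypothetical names, and the "sidebar" class in the usage note is only an example.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Tags listed above as safe to remove entirely.
BLACKLIST = {
    "button", "input", "script", "canvas", "audio",
    "object", "video", "track", "iframe",
}


def is_blacklisted(element: ET.Element) -> bool:
    return element.tag in BLACKLIST


def has_class(name: str) -> Callable[[ET.Element], bool]:
    """Build a predicate matching elements whose class attribute contains name."""
    def predicate(element: ET.Element) -> bool:
        return name in element.get("class", "").split()
    return predicate


def remove_elements(root: ET.Element,
                    should_remove: Callable[[ET.Element], bool]) -> ET.Element:
    """Remove every descendant matching should_remove, with its whole subtree.

    Text that follows a removed element (its "tail" in ElementTree terms)
    belongs to the parent, so it is kept. The root itself is never removed.
    """
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if should_remove(child):
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root
```

Calling `remove_elements(tree, is_blacklisted)` followed by `remove_elements(tree, has_class("sidebar"))` would apply the tag blacklist and then a class-based rule.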
The DOM can be thought of as a tree. The goal of point-of-interest detection is to find a subtree which contains the content of interest, and then to eliminate the other branches. For example, suppose we can determine that a vertex has all of the content as descendants. Then we can use this vertex as the root and ignore all other portions of the tree.
One possible algorithm for doing this is to find all elements that contain something of interest, for example all <p> or <div> tags with interesting text content, and then find the common ancestor of all these elements.
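The common-ancestor step could look roughly like the sketch below, again assuming an `xml.etree.ElementTree` tree. `is_interesting` is a hypothetical stand-in for "interesting text content" (here, a <p> or <div> whose own text has at least `min_chars` characters, an arbitrary threshold), and `common_ancestor` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def is_interesting(element: ET.Element, min_chars: int = 20) -> bool:
    """Assumed heuristic: a <p> or <div> with enough text of its own."""
    own_text = (element.text or "").strip()
    return element.tag in {"p", "div"} and len(own_text) >= min_chars


def common_ancestor(root: ET.Element,
                    elements: List[ET.Element]) -> Optional[ET.Element]:
    """Return the deepest element that contains every element in the list."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    def path_from_root(element: ET.Element) -> List[ET.Element]:
        chain = [element]
        while chain[-1] in parents:
            chain.append(parents[chain[-1]])
        return chain[::-1]

    paths = [path_from_root(element) for element in elements]
    ancestor = None
    # Walk all paths level by level until they diverge.
    for level in zip(*paths):
        if all(node is level[0] for node in level):
            ancestor = level[0]
        else:
            break
    return ancestor
```

With a single interesting element the result is that element itself, and with none the function returns `None`, so callers may want to fall back to the whole document in that case.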
Signals include:
- significant text
- content elements
- CSS classes such as "content"
- text which matches the meta content or description
- text which matches the page title
- heading elements
- main images
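One way to combine several of these signals is a simple score per element, sketched below. The equal weights, the 40-character threshold, and the name `signal_score` are assumptions made for illustration; meta-description matching and main-image detection are left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def signal_score(element: ET.Element, page_title: str = "") -> int:
    """Count how many of the listed signals an element shows (equal weights)."""
    text = " ".join("".join(element.itertext()).split())
    score = 0
    if len(text) >= 40:                                   # significant text
        score += 1
    if any(node.tag == "p" for node in element.iter()):   # content elements
        score += 1
    if "content" in element.get("class", "").split():     # CSS class "content"
        score += 1
    if page_title and page_title.lower() in text.lower(): # matches page title
        score += 1
    if any(node.tag in HEADINGS for node in element.iter()):  # heading elements
        score += 1
    return score
```

Elements with the highest scores would then be the candidates fed into the common-ancestor search.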
<html>
<head>
</head>
<body>
<div class="sidebar">
...
</div>
<div class="content">
<!-- point of interest -->
</div>
</body>
</html>
EXTRACTION =>
<div class="content">
<!-- point of interest -->
</div>
This phase removes tags which are not relevant while preserving the content inside them, for example divs and spans which only provide formatting.
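Unwrapping a tag while keeping its content might be sketched as follows for an `xml.etree.ElementTree` tree. ElementTree stores text before the first child in `.text` and text after an element in `.tail`, so both have to be re-attached by hand. `unwrap` and `unwrap_tags` are hypothetical names, and the default tag set simply mirrors the divs and spans mentioned above.

```python
import xml.etree.ElementTree as ET
from typing import Collection


def _append_text(parent: ET.Element, index: int, text) -> None:
    """Append text immediately before position index inside parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap(parent: ET.Element, child: ET.Element) -> None:
    """Replace child with its own text and children, in place."""
    index = list(parent).index(child)
    parent.remove(child)
    _append_text(parent, index, child.text)
    for grandchild in list(child):
        parent.insert(index, grandchild)
        index += 1
    _append_text(parent, index, child.tail)


def unwrap_tags(element: ET.Element,
                tags: Collection[str] = ("div", "span")) -> ET.Element:
    """Unwrap every descendant whose tag is in tags; the root is left alone."""
    for child in list(element):
        unwrap_tags(child, tags)  # handle nested wrappers first
        if child.tag in tags:
            unwrap(element, child)
    return element
```

Processing children before their parent means nested wrappers such as a span inside a div are flattened in a single pass.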
Classes and extra attributes are stripped, leaving only simple HTML tags and simple attributes.
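Attribute stripping could be done with an allowlist, as in this sketch. Which attributes count as "simple" is not specified here, so the `KEEP` set (`href`, `src`, `alt`) is purely an assumption, as is the name `strip_attributes`.

```python
import xml.etree.ElementTree as ET
from typing import Collection

# Assumed allowlist; adjust to whatever the pipeline treats as "simple".
KEEP = {"href", "src", "alt"}


def strip_attributes(root: ET.Element,
                     keep: Collection[str] = KEEP) -> ET.Element:
    """Delete every attribute that is not in the allowlist."""
    for element in root.iter():
        for name in list(element.attrib):
            if name not in keep:
                del element.attrib[name]
    return root
```

An allowlist is used rather than a blocklist so that unknown attributes (inline event handlers, data attributes, and so on) are dropped by default.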