Page Pruners


Sometimes most of a web page is appropriate, but parts of it should be removed: sidebars, image sections, ad headers, and so on. Between downloading a page and scanning its content for phrases, the Filter Engine can perform “content pruning”: it scans the parsed HTML tree for elements matching certain criteria and deletes those elements along with their children.

Page Pruners are an advanced feature, although the rules themselves are relatively straightforward. The network administrator should be at least somewhat familiar with CSS selectors.


CSS Selectors

The pages that require pruning are matched by domain names or regular expressions. Any domain name or regular expression can have multiple CSS Selectors, each on its own rule line.

Elements inside the page are matched by CSS Selectors. The real-world examples below are from craigslist.org, bing.com, and google.com.

/craigslist/d div#ppp
bing.com div.sb_adsWv2
bing.com div.ansC
google.com/search div.c
google.com/search div#rhs_block

In the examples above, the div tags with the matching ID or class will be pruned. Regex and domain matching are case insensitive.
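The page-matching half of a rule can be sketched in a few lines of Python. This is only an illustration, not the Filter Engine's code: the function name `matcher_applies` is hypothetical, and the sketch assumes that the `d` suffix on a `/regex/` matcher means "test the domain only", that a regex without that suffix is tested against host plus path, and that a plain domain also covers its subdomains. None of these details are stated on this page.

```python
import re
from urllib.parse import urlsplit


def matcher_applies(matcher: str, url: str) -> bool:
    """Return True if a rule's page matcher applies to the URL (case insensitive)."""
    parts = urlsplit(url.lower())
    host = parts.hostname or ""

    # Regex form: /pattern/ followed by an optional suffix such as "d".
    if matcher.startswith("/") and matcher.rfind("/") > 0:
        end = matcher.rfind("/")
        pattern, suffix = matcher[1:end], matcher[end + 1:]
        # Assumption: "d" restricts the match to the domain.
        target = host if suffix == "d" else host + parts.path
        return re.search(pattern, target, re.IGNORECASE) is not None

    # Plain form: a domain, optionally followed by a path prefix.
    m_host, _, m_path = matcher.lower().partition("/")
    if not (host == m_host or host.endswith("." + m_host)):
        return False
    return not m_path or parts.path.startswith("/" + m_path)


print(matcher_applies("/craigslist/d", "https://Toronto.Craigslist.org/search/sss"))
print(matcher_applies("google.com/search", "https://www.google.com/maps"))
```

Under these assumptions the first call matches and the second does not, because the path prefix `/search` is missing.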


jQuery Selectors

jQuery-type selectors are also supported for conditionally pruning page elements: an element is pruned only if it contains a given piece of unwanted text.

/ebay/d li.sresult:contains("Adult only")
/ebay/d li.sresult:contains("adult only")

These selectors are case sensitive, which is why both capitalizations are listed above.
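The effect of case sensitivity can be shown with a small Python sketch. It models the text test as a plain case-sensitive substring check, which is an assumption about how the match behaves; the `contains` helper and the sample listing texts are made up for illustration.

```python
def contains(text: str, needle: str) -> bool:
    """Case-sensitive substring test, standing in for :contains("...")."""
    return needle in text


# Hypothetical listing texts.
listings = [
    "Adult only - sealed box",
    "For adult only collectors",
    "ADULT ONLY poster",
]
needles = ["Adult only", "adult only"]

pruned = [t for t in listings if any(contains(t, n) for n in needles)]
kept = [t for t in listings if t not in pruned]
print(pruned)
print(kept)
```

With only the two capitalizations listed, the all-caps listing is not matched, so each spelling that appears on the site needs its own rule.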


Phrase Filtering

But what if elements in the page aren't always objectionable? Add a threshold score to the Pruner Selector to prune the page element only if its content reaches that score. This is helpful on sites with ranges of items, such as online stores and image sites.

If a threshold is specified, the element and its children are deleted after the page is phrase-scanned, provided the score from the phrases found in a blocked category is at least the threshold.

Applying a phrase-filtered threshold means that page elements containing objectionable content will be pruned, while other instances of the same page element whose score stays below the threshold remain visible.

/ebay/d 30 img
/ebay/d 70 li.sresult

In the examples above, images on ebay.com / ebay.ca will be removed if their phrase score is at least 30. Listings will be removed if their score is at least 70.
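The rule lines above can be read as a page matcher, an optional numeric threshold, and a selector. The following Python sketch parses that layout and applies the "at least the threshold" test. The names `PrunerRule`, `parse_rule`, and `should_prune` are hypothetical, and the sketch assumes that fields are separated by whitespace and that a rule without a threshold always prunes.

```python
from typing import NamedTuple, Optional


class PrunerRule(NamedTuple):
    matcher: str              # domain or /regex/ that selects the page
    threshold: Optional[int]  # None means "always prune"
    selector: str             # CSS or jQuery-type selector


def parse_rule(line: str) -> PrunerRule:
    """Split a rule line into matcher, optional threshold, and selector."""
    matcher, rest = line.split(None, 1)
    fields = rest.split(None, 1)
    # A leading all-digit field followed by more text is taken as the threshold.
    if len(fields) == 2 and fields[0].isdigit():
        return PrunerRule(matcher, int(fields[0]), fields[1].strip())
    return PrunerRule(matcher, None, rest.strip())


def should_prune(rule: PrunerRule, score: int) -> bool:
    """Prune unconditionally, or when the phrase score is at least the threshold."""
    return rule.threshold is None or score >= rule.threshold


rule = parse_rule("/ebay/d 70 li.sresult")
print(rule)
print(should_prune(rule, 69), should_prune(rule, 70))
```

Here a listing scoring 69 stays visible and one scoring 70 is removed, matching the "at least" wording used for thresholds.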