Category Patterns


A Category is a group of patterns that are matched against the Request or Response of an HTTP transaction. These patterns include Domains, URL Paths, and Regular Expressions, which match against URLs, and Phrases, which match against the Response body.

Each pattern is assigned a numerical score that reflects its significance and specificity. The more general the pattern, the lower its score should be.


Phrases

The most intensive part of Redwood's classification process happens on the Response returned by the web server. This Response is the code that the browser renders into the web page that the viewer sees and works with; you can see the raw text by pressing "Ctrl+U" in most browsers, or by right-clicking in the page and selecting "View Source".

Write each phrase by enclosing it in angle brackets and placing its score after the closing bracket. Here are some example phrases that would be useful in a computer category.

<programming docs> 25

<speaker cable> 15
<tape copier> 25
<cassette duplicator> 25
<expansion duplicator> 25

<really simple syndication> 20
<rss module> 20
<rss spec> 20
<rss format> 20
<rss reader> 20

Any pattern can also carry a negative score, to offset a more specific match that should not count toward the category. For example, the first phrase above might also match on TV sites, so we can correct for that with:

<tv programming docs> -25

Phrase matching is case insensitive, so use lower case for all phrases.
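
The way positive and negative phrase scores offset each other can be sketched in Go. This is only an illustration, not Redwood's actual code: the scoreText function and the phraseScores map are hypothetical names, and the sketch assumes that the page text is lower-cased and that the score of every occurrence is simply added up.

```go
package main

import (
	"fmt"
	"strings"
)

// phraseScores mirrors two of the example phrases above. The map is
// illustrative only and is not Redwood's internal representation.
var phraseScores = map[string]int{
	"programming docs":    25,
	"tv programming docs": -25,
}

// scoreText lower-cases the text and adds a phrase's score once for
// every occurrence of that phrase.
// Assumption: occurrences are simply summed.
func scoreText(text string, scores map[string]int) int {
	text = strings.ToLower(text)
	total := 0
	for phrase, score := range scores {
		total += strings.Count(text, phrase) * score
	}
	return total
}

func main() {
	// Only the positive phrase matches.
	fmt.Println(scoreText("Download the Programming Docs here", phraseScores))
	// Both phrases match, so the negative score cancels the positive one.
	fmt.Println(scoreText("Tonight's TV programming docs and listings", phraseScores))
}
```

Under these assumptions the first page scores 25 and the second scores 0, because the longer phrase contains the shorter one and its negative score offsets it.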

Punctuation characters are stripped out during the phrase matching process, so use phrases like:

<cd rom drive> 20 rather than <cd-rom drive> 20
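
To see why the unpunctuated form is the one to write, here is a rough Go sketch of that kind of normalization. It is not Redwood's implementation: the normalize function is a hypothetical helper, and it assumes that each punctuation character is replaced by a space and that runs of spaces are then collapsed.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize lower-cases the text and replaces every character that is
// not a letter or digit with a space, then collapses runs of spaces.
// Assumption: this approximates the stripping described above.
func normalize(s string) string {
	mapped := strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return unicode.ToLower(r)
		}
		return ' '
	}, s)
	return strings.Join(strings.Fields(mapped), " ")
}

func main() {
	// Page text containing "CD-ROM Drive" ends up as "cd rom drive",
	// which is the form the phrase should be written in.
	fmt.Println(normalize("CD-ROM Drive"))
}
```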


Regular Expressions

Regular expressions (regexes) are an extremely powerful tool for building categories. They are matched against the URL of the request.

The simplest pattern is a string of characters that represents a word, enclosed in slashes "/". The regex "/microsoft/", for example, matches any URL that contains the word "microsoft". Suppose you are creating a whitelist for your employees but want computer updates to keep working: the "/microsoft/" regex will help, but it is probably too broad. For updates to work, we probably only want to match domain names that contain "microsoft".
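
The difference between matching anywhere in the URL and matching only the host can be sketched with Go's standard regexp and net/url packages. The function names and the example URLs are made up for illustration, and this is not how Redwood itself is wired up.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// re is the pattern from the text, without the enclosing slashes.
var re = regexp.MustCompile(`microsoft`)

// matchesAnywhere reports whether the pattern occurs anywhere in the URL.
func matchesAnywhere(rawURL string) bool {
	return re.MatchString(rawURL)
}

// matchesHost reports whether the pattern occurs in the hostname only.
func matchesHost(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return re.MatchString(u.Hostname())
}

func main() {
	// Hypothetical URLs chosen for illustration.
	for _, raw := range []string{
		"https://update.microsoft.com/v6/",
		"https://example.com/blog/microsoft-review",
	} {
		fmt.Println(raw, matchesAnywhere(raw), matchesHost(raw))
	}
}
```

The second URL matches only because "microsoft" appears in its path, which is the kind of match a whitelist usually should not allow.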

Regular Expression syntax can be used to target specific parts of the URL. The Log Cabin Console pattern builder enables regexes to match the following sections of the URL string:

  1. Default: matches anywhere in the URL
  2. Domain Name: limits matches to the domain name only (dell.com)
  3. Hostname: limits matches to the hostname only (support.dell.com)
  4. URL Path: limits matches to the path
  5. URL Query: limits matches to the query params

The Domain Name and Hostname settings will likely be most useful to you. See the diagram below for the relevant components of a URL or read more here.

                         URL
  ________________________|________________________
 /                                                 \
 scheme   host   domain
 name     name   name        path             query
 __|___  __|__  ___|____   ____|_____   _________|_________
/      \/     \/        \ /          \ /                   \
https://search.google.com/webhp/search?q=ford+rollback+truck
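
If you want to check which part of a URL a regex would see, the components can be pulled apart with Go's net/url package. This is a sketch only: the urlParts type and the split and naiveDomain functions are hypothetical, and the domain name is approximated by keeping the last two labels of the hostname, which is wrong for suffixes such as co.uk.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// urlParts holds the URL sections listed above.
type urlParts struct {
	Scheme, Hostname, Domain, Path, Query string
}

// naiveDomain keeps only the last two labels of a hostname.
// Assumption: good enough for names like google.com, but not for
// multi-label suffixes such as co.uk.
func naiveDomain(host string) string {
	labels := strings.Split(host, ".")
	if len(labels) <= 2 {
		return host
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// split parses a raw URL into the sections a pattern can target.
func split(rawURL string) (urlParts, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return urlParts{}, err
	}
	return urlParts{
		Scheme:   u.Scheme,
		Hostname: u.Hostname(),
		Domain:   naiveDomain(u.Hostname()),
		Path:     u.Path,
		Query:    u.RawQuery,
	}, nil
}

func main() {
	p, err := split("https://search.google.com/webhp/search?q=ford+rollback+truck")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```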

This barely scratches the surface of what regular expressions can do. If you're so inclined, learn more here.

For a quick syntax reference, check the Go docs.


Domains and URLs

The most basic feature of HTTP filters today is domain matching. Redwood's domain engine accepts entries that match:

Domains: jcwhitney.com 250

Subdomains: support.dell.com 250

URL Path: support.dell.com/wiki/monitors 250
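
One way to picture how such entries could be looked up is sketched below in Go. This is not Redwood's domain engine: the rules map and the lookup function are hypothetical, and the sketch assumes that an entry also covers deeper paths and subdomains, which the text above does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// rules holds the three example entries above with their scores.
var rules = map[string]int{
	"jcwhitney.com":                  250,
	"support.dell.com":               250,
	"support.dell.com/wiki/monitors": 250,
}

// lookup returns the most specific entry that covers host+path.
// Assumption: an entry also covers deeper paths and subdomains.
func lookup(host, path string) (string, int, bool) {
	path = strings.TrimSuffix(path, "/")
	for h := host; ; {
		// Try this host with progressively shorter paths, ending
		// with the bare host.
		for p := path; ; {
			if score, ok := rules[h+p]; ok {
				return h + p, score, true
			}
			if p == "" {
				break
			}
			if i := strings.LastIndex(p, "/"); i >= 0 {
				p = p[:i]
			} else {
				p = ""
			}
		}
		// Drop the leftmost label and try the parent domain.
		i := strings.Index(h, ".")
		if i < 0 {
			return "", 0, false
		}
		h = h[i+1:]
	}
}

func main() {
	// Hypothetical requests chosen for illustration.
	fmt.Println(lookup("support.dell.com", "/wiki/monitors/setup"))
	fmt.Println(lookup("www.jcwhitney.com", "/parts"))
	fmt.Println(lookup("dell.com", "/"))
}
```

Under these assumptions the first request is covered by the URL Path entry, the second by the jcwhitney.com Domain entry, and the third by nothing, since only the support subdomain is listed.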