How do I define patterns for custom dictionaries?


Defining patterns for custom dictionaries is one of the tasks you must complete when adding custom dictionaries. See How do I add a custom DLP Dictionary? for a full list of tasks. 

About Patterns

You can use alphanumeric patterns to configure custom dictionaries that match a wide variety of data types. For example, you can define patterns to detect data like US phone numbers, driver's license numbers, or credit card numbers for specific issuers (a number of sample patterns are provided below). General guidelines for patterns include the following:

  • A dictionary can contain up to eight patterns.
  • Each pattern can have a maximum of 128 characters.
  • Pattern-matching is case-sensitive by default.
  • Only unique matches are counted; a given text that matches the pattern is counted only once, regardless of how many times it appears in the content.
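
The guidelines above can be illustrated with a short Python sketch. It is not Zscaler's implementation: the helper names (check_limits, count_unique_matches) are hypothetical, and Python's re module is used only as a stand-in for the custom dictionary's matcher. Like the dictionary, Python's re is case-sensitive by default.

```python
import re

MAX_PATTERNS = 8          # "up to eight patterns" per dictionary
MAX_PATTERN_LENGTH = 128  # "maximum of 128 characters" per pattern

def check_limits(patterns):
    """Return a list of guideline violations for a proposed set of patterns."""
    problems = []
    if len(patterns) > MAX_PATTERNS:
        problems.append(f"{len(patterns)} patterns exceeds the limit of {MAX_PATTERNS}")
    for p in patterns:
        if len(p) > MAX_PATTERN_LENGTH:
            problems.append(f"pattern is {len(p)} characters long: {p[:20]}...")
    return problems

def count_unique_matches(pattern, content):
    """Count distinct matched strings; repeats of the same text count once."""
    return len({m.group(0) for m in re.finditer(pattern, content)})

phone = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call 415-555-1212 or 415-555-1212, or try 408-555-3434."

print(check_limits([phone]))              # []
print(count_unique_matches(phone, text))  # 2
```

The first phone number appears twice in the text but contributes only one match, so the count is 2 rather than 3.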

Syntax Requirements for Patterns

The custom dictionary accepts a subset of the POSIX ERE (Extended Regular Expression) syntax. The supported syntax, its restrictions, and sample patterns are described later in this article.

If the pattern you enter is not valid, Zscaler returns an error message with information about the error and instructions on how to correct it. The possible error messages are listed, with explanations, at the end of this article.

Adding Patterns

To add pattern(s), follow the instructions below. 

  1. Go to Administration > DLP Dictionaries & Engines.
  2. In the DLP Dictionaries tab, click Add DLP Dictionary OR edit an existing dictionary. 
  3. Enter pattern(s) you want the dictionary to match. See guidelines and syntax requirements for patterns above.
  4. For the pattern, specify the Action the dictionary takes upon detecting a valid match. Select one of the following options from the dropdown menu:
    • Ignore: The dictionary ignores matches of the pattern. The Ignore action is for testing purposes; no action is taken if the pattern is matched, but each match is recorded in the DLP logs for your analysis.
    • Count: The dictionary counts each unique match of the pattern toward the Number of Violations threshold. (For example, consider a custom dictionary for which a pattern for US phone numbers has been defined, and Count as the specified action for the pattern. If the content this dictionary scans contains three instances of the same exact US phone number, all three instances would count as just one match.)
    • Trigger: The dictionary immediately triggers upon a match of the pattern.
  5. To add another pattern, select Add Pattern.
  6. Click Save and activate the change.
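
The three actions in step 4 can be sketched in Python as follows. This is an illustrative model only: the evaluate function is hypothetical, Python's re stands in for the dictionary's matcher, and the sketch assumes that the dictionary triggers once the number of unique counted matches reaches the Number of Violations threshold.

```python
import re

def evaluate(patterns, content, threshold):
    """Decide whether a dictionary triggers on the given content.

    patterns:  list of (regex, action) pairs; action is "ignore", "count", or "trigger"
    threshold: assumed meaning: trigger once this many unique counted matches are seen
    """
    unique = {"ignore": set(), "count": set(), "trigger": set()}
    for regex, action in patterns:
        for m in re.finditer(regex, content):
            unique[action].add(m.group(0))  # identical text is stored only once
    triggered = bool(unique["trigger"]) or len(unique["count"]) >= threshold
    return triggered, unique

patterns = [
    (r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "count"),  # US phone number
    (r"[A-Z][0-9]{7}", "ignore"),              # recorded only, never acted on
]
content = "415-555-1212, 415-555-1212, 415-555-1212, A1234567"

triggered, unique = evaluate(patterns, content, threshold=2)
print(triggered)             # False
print(len(unique["count"]))  # 1
```

As in the Count example in step 4, three instances of the same phone number count as one match, so a threshold of 2 is not reached.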

Screenshot of Zscaler’s Add DLP Dictionary fields (DLP, Patterns, Phrase)

Chart of metacharacters supported by Zscaler’s custom dictionary

A repeat of an expression that itself contains a repeat (*, +, ?, or {m,n}) is not supported. In each Bad example below, a repeat is applied to a parenthesized expression that already contains a repeat.

Examples:

Bad: [0-9]-([A-Z]{2})?X - Matches “5-X” or “5-GAX”

Good: [0-9]-([A-Z][A-Z])?X

Bad: ([0-9]{3}-){1,2}[0-9]{4} - Matches “555-555-5555” or “555-5555”

Bad: [0-9]{3}-([0-9]{3}-)?[0-9]{4}

Good: [0-9]{3}-([0-9][0-9][0-9]-)?[0-9]{4}
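
One way to arrive at a Good form is to expand the fixed-count repeat inside the parentheses into explicit copies. The Python sketch below shows the idea; the helper expand_fixed_repeats is hypothetical, handles only the simple {n} form applied to a bracketed class or a single alphanumeric literal, and does not handle {m,n} ranges or bracket expressions that contain "]".

```python
import re

# A bracketed class or a single alphanumeric literal, followed by {n}.
_FIXED_REPEAT = re.compile(r"(\[[^\]]+\]|[A-Za-z0-9])\{([0-9]+)\}")

def expand_fixed_repeats(expr):
    """Rewrite 'X{n}' as n consecutive copies of X."""
    return _FIXED_REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), expr)

inner = "[0-9]{3}-"  # the part inside the optional group
good = "[0-9]{3}-(" + expand_fixed_repeats(inner) + ")?[0-9]{4}"

print(good)  # [0-9]{3}-([0-9][0-9][0-9]-)?[0-9]{4}

# The rewritten pattern still accepts both phone number forms.
print(bool(re.fullmatch(good, "555-5555")))      # True
print(bool(re.fullmatch(good, "555-555-5555")))  # True
```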

Alphanumeric patterns, like phrases, only match at a word boundary. For example, the phrase "round number" does not match "around number". Likewise, the pattern "A[0-9]{5}" does not match "AA12345". For this reason, your pattern must begin with a simple sequence of alphabetic and numeric characters of a known length, called the "base token."

The first part of your pattern may only include:

  • literal alphabetic characters (e.g., "A")
  • literal numeric characters (e.g., "5")
  • classes of only alphabetic and/or numeric characters (e.g., "[A-N]", "[0-9]", "[0-9A-Z]")
  • bounded repeats (e.g., "{3}", "{1,5}")

The base token must end with an expression that is non-alphanumeric (e.g., ";"), or with the end of the pattern.
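
The word-boundary behavior described above can be approximated in Python with a lookbehind that rejects any match preceded by an alphabetic or numeric character. This is a sketch under assumptions, not Zscaler's matcher: the helper find_at_token_start is hypothetical, and it models only the start-of-token rule (how the end of a match is treated is not modeled here).

```python
import re

def find_at_token_start(pattern, content):
    """Return matches that do not begin in the middle of an alphanumeric run."""
    guarded = r"(?<![A-Za-z0-9])(?:" + pattern + r")"
    return [m.group(0) for m in re.finditer(guarded, content)]

print(find_at_token_start(r"A[0-9]{5}", "code A12345"))   # ['A12345']
print(find_at_token_start(r"A[0-9]{5}", "code AA12345"))  # []
print(find_at_token_start(r"[A-Za-z][0-9]{7}", "-B7654321-"))  # ['B7654321']
```

In the second call, the only place where "A[0-9]{5}" could match starts at the second "A", which is preceded by another letter, so nothing is returned.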

Below are examples of valid and invalid base tokens.

1.  The base token must be non-zero length.

Example of a pattern matching US phone numbers

  • Bad: (1-)?[0-9]{3}-[0-9]{3}-[0-9]{4} - Starts with an optional expression.
  • Good: Use multiple patterns.
    • 1-[0-9]{3}-[0-9]{3}-[0-9]{4} - Starts with a single-number token.
    • [0-9]{3}-[0-9]{3}-[0-9]{4} - Starts with a three-number token.

2. The initial sequence must be all alphabetic and/or numeric characters.

Example: IPv4 Address in dotted-quad notation.

  • Bad: [0-9.]{7,15} - Ambiguous mix of numeric characters and punctuation.
  • Good: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3} - Starts with 1-3 numeric characters.

3.  Each expression after the initial one may either match only alphanumeric characters (i.e., becoming part of the base token), or only non-alphanumeric characters (i.e., marking the end of the base token).

Example: US Phone Number.

  • Bad: [0-9]{3}-?[0-9]{3}-?[0-9]{4} - Second expression is optional.
  • Good: Use multiple patterns.
    • [0-9]{10} - Ten numeric characters, no delimiter.
    • [0-9]{3}-[0-9]{3}-[0-9]{4} - Delimited with ‘-’.

4.  The entire pattern may match just the base token:

Example: California Driver’s License Number.

  • [A-Za-z][0-9]{7} - An alphabetic character followed by seven numeric characters.

5.  Patterns only match at the beginning of a token, i.e., at the first alphabetic or numeric character after a non-alphabetic, non-numeric character.

Example: California Driver’s License Number.

  • [A-Za-z][0-9]{7} - An alphabetic character followed by seven numeric characters.
  • Matches: “ A1234567 ”, “-B7654321-”
  • Doesn’t Match: “ AA1234567 ”

Sample Patterns

[2-9][0-9]{2}[-. ][0-9]{4}

matches:
555-1212
555.1212
555 1212

[2-9][0-9]{2}\)?[-. ][2-9][0-9]{2}[-. ][0-9]{4}

matches:
(415) 555-1212 *
415-555-1212
415 555 1212
415.555.1212

(* matches a substring)

4147[- ]?18[0-9]{2}[ -]?[0-9]{4}[ -]?[0-9]{4}

matches:
4147180011112222
4147-1800-1111-2222
4147 1800 1111 2222

[A-Z]{1,2}[0-9]{6}[0-9A]

matches:
A1234567
BF1234567
N123456A
ZX123456A

[A-Z][0-9]{7}

matches:
A1234567
X7654321

[0-9]{2} [0-9]{3} [0-9]{3} [0-9]{3}

matches:
12 333 444 555

[0-9]{4} [0-9]{5} [0-9]/[0-9]

matches:
1234 12345 3/3

[0123678][0-9]{8}

matches:
011112222

[A-Z]{6}[A-Z0-9]{2,5}

matches:
CALCUS6L
BCLFUS66
BIMIUS33
BBVAUS33GCI

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(/([0-9]|[1-2][0-9]|3[0-2]))?

matches:
192.168.100.0
192.168.100.0/24
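
Before entering patterns like these, it can be useful to try them against sample text locally. The Python sketch below checks a few of the patterns above against their listed matches. It is only an approximation: Python's re is not the dictionary's matcher, and the helper matches_at_token_start is hypothetical.

```python
import re

# Patterns and sample texts copied from the lists above.
SAMPLES = {
    r"[2-9][0-9]{2}[-. ][0-9]{4}": ["555-1212", "555.1212", "555 1212"],
    r"[2-9][0-9]{2}\)?[-. ][2-9][0-9]{2}[-. ][0-9]{4}": [
        "(415) 555-1212", "415-555-1212", "415 555 1212", "415.555.1212"],
    r"4147[- ]?18[0-9]{2}[ -]?[0-9]{4}[ -]?[0-9]{4}": [
        "4147180011112222", "4147-1800-1111-2222", "4147 1800 1111 2222"],
    r"[A-Z]{1,2}[0-9]{6}[0-9A]": ["A1234567", "BF1234567", "N123456A", "ZX123456A"],
    r"[A-Z]{6}[A-Z0-9]{2,5}": ["CALCUS6L", "BCLFUS66", "BIMIUS33", "BBVAUS33GCI"],
    r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(/([0-9]|[1-2][0-9]|3[0-2]))?": [
        "192.168.100.0", "192.168.100.0/24"],
}

def matches_at_token_start(pattern, text):
    """True if the pattern matches somewhere in text, starting at a token start."""
    return re.search(r"(?<![A-Za-z0-9])(?:" + pattern + r")", text) is not None

failures = [(p, s) for p, samples in SAMPLES.items() for s in samples
            if not matches_at_token_start(p, s)]
print(failures)  # []
```

The check only asks whether a match exists. Python tries alternatives in order rather than preferring the longest match as POSIX specifies, so the exact span it reports for a pattern with alternation (such as the "/24" suffix) may be shorter than what a POSIX engine reports.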

If the pattern you enter is not valid, Zscaler returns an error message. See below for a list of the error messages you might receive, along with an explanation of what the error messages mean and how you can resolve the error.

Error Message: Explanation & Resolution Tip

  • '?', '*', '{}', or '+' operand invalid: You may only repeat a literal character, a bracketed expression, or a parenthesized expression.
  • \\ applied to unescapable character: Only the following characters may be escaped: "?*+()[]{}.\".
  • Anchor flags ^ and $ are not supported: Your expression begins with "^" or ends with "$".
  • Backreferences are not supported: Your expression contains a backslash followed by a numeric character (e.g., "\1").
  • Braces '{ }' not balanced: You either have a '{' without a '}' in your pattern, or vice versa.
  • Brackets '[ ]' not balanced: You either have a '[' without a ']' in your pattern, or vice versa.
  • Cannot identify base token: See the Help section on Defining Base Tokens.
  • Empty (sub)expression: You have a parenthesized expression with no content (e.g., "()").
  • Invalid character class: You attempted to use a POSIX expression like "[[:x:]]", but "x" is not the name of a valid character class.
  • Invalid character range in '[ ]': Within a bracketed expression, you have a character range in the wrong order (e.g., "[b-a]", "[2-1]").
  • Invalid collating element: You attempted to use a POSIX expression like "[[.x.]]" or "[[=x=]]", but "x" is not the name of a valid collating element.
  • Invalid regular expression: Your pattern contains one of several errors that don't have a specific message associated with them. Try compiling only a small sub-part of your expression; once that is accepted, add the rest back piece by piece until you can identify where the problem is.
  • Invalid repetition count(s) in '{ }': Only numeric characters are allowed in repetition counts. If you have two numbers (e.g., "{M,N}"), M must be less than N. The expression may not be empty (e.g., "{}").
  • Nested repeats are not supported: You may not repeat a parenthesized expression that contains any kind of repeat (e.g., "(a{1,2}b)*", "(business(es)?)?").
  • Parentheses '( )' not balanced: You either have a '(' without a ')' in your pattern, or vice versa.
  • Parenthesis nesting is too deep: Your pattern contains parentheses nested three deep or more (e.g., "((secret|confidential) (info|data(base)?))").
  • Pattern string too long: Your pattern must be less than 128 characters long.
  • Ran out of memory: Your expression is too large and complex. You must reduce or simplify it.
  • Repetition count(s) too high in '{ }': Repetition counts must be less than 100.
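
A few of these errors can be caught before you enter a pattern. The Python sketch below is a rough pre-check for a handful of the messages above. It is not Zscaler's validator: the function lint_pattern is hypothetical, the checks are simple text scans that can misjudge escaped characters or braces inside bracket expressions, and it assumes a 128-character limit and treats open-ended counts such as "{2,}" as invalid.

```python
import re

def lint_pattern(pattern):
    """Return the error messages (from the list above) that this pattern would likely hit."""
    problems = []
    if len(pattern) > 128:  # assumed limit; see the pattern guidelines
        problems.append("Pattern string too long")
    if pattern.startswith("^") or pattern.endswith("$"):
        problems.append("Anchor flags ^ and $ are not supported")
    if re.search(r"\\[0-9]", pattern):  # backslash followed by a digit
        problems.append("Backreferences are not supported")
    if "()" in pattern:
        problems.append("Empty (sub)expression")
    for m in re.finditer(r"\{([^}]*)\}", pattern):
        body = m.group(1)
        if not re.fullmatch(r"[0-9]+(,[0-9]+)?", body):
            # Empty, non-numeric, or open-ended counts (assumption for "{2,}").
            problems.append("Invalid repetition count(s) in '{ }'")
            continue
        counts = [int(n) for n in body.split(",")]
        if any(n >= 100 for n in counts):
            problems.append("Repetition count(s) too high in '{ }'")
        if len(counts) == 2 and counts[0] >= counts[1]:
            problems.append("Invalid repetition count(s) in '{ }'")
    return problems

print(lint_pattern("[0-9]{3}-[0-9]{4}"))  # []
print(lint_pattern("^[0-9]{3}"))          # ['Anchor flags ^ and $ are not supported']
print(lint_pattern("[0-9]{100}"))         # ["Repetition count(s) too high in '{ }'"]
```

An empty result only means that none of these particular checks fired; the pattern can still be rejected for other reasons, such as an unidentifiable base token or nested repeats.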