Wednesday, July 14, 2010

How Gmail Filter Email-Matching Works

I was trying to create some complicated Gmail filters. However, there doesn't seem to be any documentation of how the to and from fields work exactly. So I tried figuring it out myself...

General Matching Guidelines

The matching criteria are similar to Google's search. There is no word stemming, so you must enter full words (e.g. joh will not match john.smith@gmail.com). Even the plural stemming that Google search supports is missing (e.g. app will not match apps@example.com).

Word order does not matter, unless the words are enclosed in quotes (e.g. "smith john" will not match john.smith@gmail.com). Generally, symbols are ignored (for more information see the next section).

Words are split on everything except letters, numbers, and underscores. The most common symbols that split words are +, ., and @. This means that foo will not match foo_bar@example.com but will match foo+bar@example.com. The @ character itself is not considered a word and can be skipped over (e.g. "smith gmail" will match john.smith@gmail.com).
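The word-splitting rules above can be approximated in a few lines of Python. This is only a sketch of the behavior I observed, not Gmail's actual implementation: the function names are made up for illustration, and lowercasing and the ASCII-only character class are my own assumptions.

```python
import re

def words(text):
    """Split on every character that is not a letter, digit, or underscore.

    Lowercasing is an assumption of this sketch, not something tested above.
    """
    return [w for w in re.split(r"[^A-Za-z0-9_]+", text.lower()) if w]

def matches_all(query, address):
    """Unquoted query: every query word must appear as a whole word, in any order."""
    present = set(words(address))
    return all(w in present for w in words(query))

def matches_phrase(phrase, address):
    """Quoted query: the words must appear consecutively and in the given order."""
    q, a = words(phrase), words(address)
    return any(a[i:i + len(q)] == q for i in range(len(a) - len(q) + 1))

print(words("john.smith@gmail.com"))                          # ['john', 'smith', 'gmail', 'com']
print(matches_all("foo", "foo+bar@example.com"))              # True
print(matches_all("foo", "foo_bar@example.com"))              # False
print(matches_phrase("smith gmail", "john.smith@gmail.com"))  # True
```

Because @ only separates words and is never a word itself, the phrase "smith gmail" lines up with two adjacent words in the address.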

You can use the OR operator in addition to grouping () for some complex conditions.

Symbol Behavior

Symbols entered in the filter box are not all treated the same way. They fall into a few groups:

  • Symbols that act as x y: ~#$%^*+;",<>? and the grave character. For example, smith~john becomes smith john, which matches john.smith@gmail.com.
  • Symbols that act as "x y": -=\:'./ -- For example, john-smith becomes "john smith", which matches john.smith@gmail.com.
  • Symbols that are treated literally: &_ -- For example, john_smith will match john_smith@gmail.com, but not john.smith@gmail.com.
  • Special symbols: !@()[]{}|
    • !: john!smith becomes john -smith, which matches john.foo@gmail.com but not john.smith@gmail.com.
    • @:
      • @ is stripped out at the end of a word. For example, john@ becomes john, which matches john.smith@gmail.com.
      • @ is stripped out at the start of a word. For example, @foo.com will become foo.com, which matches john+foo.com@gmail.com.
      • @ in the middle of a word will generally require the full address for a successful match. For example, john.smith@gmail will not match john.smith@gmail.com. Additionally, symbols will be taken literally. For example, to match john.smith@gmail.com you must use exactly john.smith@gmail.com. Both john-smith@gmail.com and john~smith@gmail.com will no longer work.
      • @ in a different location in the middle of a word has strange behavior. For example, when trying to match john.smith@gmail.com:
        • john@smith@gmail@com does not match
        • gmail@com does not match
        • @gmail@com does not match
        • smith@gmail@com does match
        • smith@gmail.com does not match
        • "john smith@gmail.com" does not match
        • "john.smith@gmail com" does not match
    • | acts as the OR operator.
    • Parentheses act as grouping for OR and AND filters.
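The simpler rewrite rules from the list above can be written down as a small function. This is a hedged sketch: rewrite_term is a hypothetical helper, it only covers the "x y", "x y" in quotes, !, and leading/trailing @ cases, and it does not attempt the strange mid-word @ behavior. Terms that mix symbols from several groups were not tested above, so how the sketch handles them is arbitrary.

```python
import re

# Symbol groups as observed in the list above (not an official specification).
AND_SYMBOLS = '~#$%^*+;",<>?`'   # x~y behaves like: x y
PHRASE_SYMBOLS = "-=\\:'./"      # x-y behaves like: "x y"

def _split(term, symbols):
    """Split a term on any run of the given symbols, dropping empty pieces."""
    return [p for p in re.split("[" + re.escape(symbols) + "]+", term) if p]

def rewrite_term(term):
    """Return the query that a single filter term appears to be rewritten to."""
    term = term.strip("@")            # @ at the start or end of a word is dropped
    if "@" in term:
        return term                   # mid-word @: left alone, not modeled here
    if "!" in term:
        head, _, tail = term.partition("!")
        return f"{rewrite_term(head)} -{rewrite_term(tail)}"
    if any(ch in PHRASE_SYMBOLS for ch in term):
        return '"' + " ".join(_split(term, PHRASE_SYMBOLS)) + '"'
    if any(ch in AND_SYMBOLS for ch in term):
        return " ".join(_split(term, AND_SYMBOLS))
    return term                       # includes & and _, which are kept literally

print(rewrite_term("smith~john"))   # smith john
print(rewrite_term("john-smith"))   # "john smith"
print(rewrite_term("john!smith"))   # john -smith
print(rewrite_term("john_smith"))   # john_smith
```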

Other Matching Behaviors

The default account you use (e.g. john.smith@gmail.com) will match all variations of your address. This includes dot notation, plus addressing, and using the googlemail.com domain.

Here's a brief explanation of each:

  • Using dot notation: You can enter as many non-consecutive dots in your email as you want. For example, if your email is john.smith@gmail.com, mail sent to j.o.h.n.s.mith@gmail.com will still arrive at your account.
  • Using plus addressing: After your account name, you can enter the + sign and whatever text you want afterwards followed by the Gmail domain. For example, mail sent to john.smith+foo@gmail.com will arrive at john.smith@gmail.com.
  • Using googlemail.com domain: Any mail sent to <your-gmail-account>@googlemail.com will arrive at your @gmail.com address. For example, mail sent to john.smith@googlemail.com will arrive at john.smith@gmail.com.

Any of the above can be combined (e.g. j.o.h.n.s.m.i.t.h+foo.bar@googlemail.com will still go to john.smith@gmail.com).
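To make the three variations concrete, here is a minimal sketch that reduces any variation to one canonical form. The function name canonical_gmail is made up for this example, and lowercasing the address is my own assumption. Note that the canonical form has no dots at all, so john.smith@gmail.com itself reduces to johnsmith@gmail.com.

```python
def canonical_gmail(address):
    """Reduce a Gmail address variation to a single canonical form."""
    local, _, domain = address.lower().rpartition("@")
    if domain == "googlemail.com":
        domain = "gmail.com"                  # googlemail.com is an alias
    if domain == "gmail.com":
        local = local.split("+", 1)[0]        # drop the plus-address suffix
        local = local.replace(".", "")        # dots in the account name are ignored
    return f"{local}@{domain}"

print(canonical_gmail("j.o.h.n.s.m.i.t.h+foo.bar@googlemail.com"))  # johnsmith@gmail.com
print(canonical_gmail("john.smith@gmail.com"))                      # johnsmith@gmail.com
```

Two addresses deliver to the same account when their canonical forms are equal.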

Interesting Consequences

  • Can't match all dot versions of your Gmail address easily: If you're in the habit of giving out a dotted version of your email address to prevent spam (e.g. j.ohn.smith@gmail.com), you cannot easily create a filter for all dot versions of your address, since each one is split up into separate words (e.g. j ohn smith). When you only use one variation, it's easy to create a filter and, for example, send it to spam. However, if you start using different variations (e.g. jo.h.n.smi.th@gmail.com), each one produces different words (e.g. jo h n smi th), forcing you to create a distinct condition for each variation you use.
  • The + symbol is worse than the "" operator when matching plus addresses: If you're trying to create a filter for a plus address, your best bet is to include the full address (e.g. john.smith+foo@gmail.com). If for some reason you aren't using the full address, the + operator is actually worse than the "" operator. For example, john+foo is worse than "john foo", since the former will match foo@john.com. Keep in mind that the latter is not bulletproof either: it will still match foo@john.foo.com. It just guarantees that the order is correct. For clarity, you could use "john+foo", but realize that it's the same as "john foo".
  • You must use negation to match all email sent to plus addresses: To filter on all plus addresses (e.g. to send them to spam), you should use the query john.smith@gmail.com -"john smith gmail com". The first part of the query will match your address, including any plus addresses you have. The second part then removes every message whose address contains those four words consecutively and in that order, which is the case for the plain address. For example, john.smith+foo@gmail.com stays in the results because the word foo sits between the other words, so the negated phrase does not apply to it. Note that there is one weird, and very unlikely, case where this won't work: john.smith+john.smith.gmail.com@gmail.com, since it does have the words in the specified order.
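The last two consequences can be checked against a simple word model. This is a sketch under assumptions: the helper names are made up, words are split on everything except letters, digits, and underscores, and the first half of the plus-address query (john.smith@gmail.com) is taken as already matched, so only the negated phrase is modeled.

```python
import re

def words(text):
    """Split on everything except letters, digits, and underscores."""
    return [w for w in re.split(r"[^A-Za-z0-9_]+", text.lower()) if w]

def has_all(query, address):
    """Unquoted query: all words present, order ignored."""
    present = set(words(address))
    return all(w in present for w in words(query))

def has_phrase(phrase, address):
    """Quoted query: words present consecutively and in order."""
    q, a = words(phrase), words(address)
    return any(a[i:i + len(q)] == q for i in range(len(a) - len(q) + 1))

def kept_by_plus_filter(address):
    """Model only the negated part: -"john smith gmail com"."""
    return not has_phrase("john smith gmail com", address)

# john+foo behaves like the unquoted words john foo, so order is not checked.
print(has_all("john foo", "foo@john.com"))            # True  (unwanted match)
print(has_phrase("john foo", "foo@john.com"))         # False
print(has_phrase("john foo", "foo@john.foo.com"))     # True  (still not bulletproof)

print(kept_by_plus_filter("john.smith+foo@gmail.com"))                    # True
print(kept_by_plus_filter("john.smith@gmail.com"))                        # False
print(kept_by_plus_filter("john.smith+john.smith.gmail.com@gmail.com"))   # False (the weird case)
```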

3 comments:

  1. All important tools are already provided by Google. However, I'll be more glad if they will add an email encryption tool directly inside the gmail.

  2. Haha, never realized how retarded gmail filtering really is... just want to match a [tag] prepended to subject field, fat chance ):

  3. Thanks for your article.
    Nevertheless, I don't find how to isolate an entire domain. For example, I would like to find all emails from @orange.com (xyz@orange.com and not xyz@[something]orange.com). How can I do that?
