Wednesday 11 November 2015

Dispatcher Configuration in Adobe AEM

Dispatcher is one of the best things I have seen in Adobe CQ / AEM since we started working with Day / Adobe CQ / AEM.
It is a small Apache module that is tremendously powerful when caching comes into the picture.
The module lets the CQ publish servers take some rest whenever the same cacheable page or URL is hit multiple times.

Dispatcher and its anatomy

We will discuss some terminology and tricks in the dispatcher configuration file of a production Apache server, which usually sits in front of the publish instances.
I will explain the parts of the dispatcher.any file one by one and then collate all the configurations.
The whole dispatcher.any file consists of properties and their values (which can be multivalued).
All property names start with a forward slash “/”.
Multivalued properties enclose their values in a pair of braces “{}”.
The “name” property defines a human-readable name for the dispatcher.
/name "any name you can provide"
“farms” holds a set of farms, each with its own properties and values.
Often in complex applications, it's tough to put all the dispatcher configuration in a single farm. In that case, you might want to split the logic for different sets of URLs/websites into multiple farms and include them in the parent /farms section.
The /farms property can contain one farm (if you want to handle all websites/URLs in the same manner) or multiple farms (when you want to define different handling for different sets of websites/URLs). Don’t worry if this sounds a bit confusing at this moment.
Inside the /farms property you can define a farm, include a farm defined somewhere else, or do both.
/farms
{

    /fadfish
    {
    ### This is the first farm which has been defined in the dispatcher.any itself .
    }

    $include "fadfish2.any" # This is the second farm, which is defined in the fadfish2.any file.
}
Remember, the farms are evaluated from bottom to top.
Each farm in turn contains the properties you will find most useful.
Most of the time, “/clientheaders” is the first property in a farm configuration. Each HTTP request carries a set of request headers, which your application needs in order to determine some of the most crucial attributes of the incoming request.
To know more about HTTP request headers, I recommend reading HTTP RFC 2616 (since superseded by RFC 7230 to 7235).
Your dispatcher can filter out specific headers coming from the client before they reach your CQ publish instance. Great, isn’t it? You can either allow all the request headers coming from the client by specifying a “*” in the “/clientheaders” section, or allow only selected headers by providing a list of them (custom CQ-specific + regular) in the “/clientheaders” section.
/clientheaders
     {
      "referer"
      "user-agent"
      "authorization"
      "cq-action"
      "cq-handle"
      "handle"
       ....
       ....
       ....
       ....
     }
The /virtualhosts property defines a list of all hostname/URI combinations that Dispatcher accepts for this farm.
/virtualhosts
    {

      # "*" causes all requests to be handled by this farm.
      "*"
    }
You may have two farms to handle separate kinds of requests. Suppose the first farm handles all the requests coming for the /products directory and the other farm handles the rest of the requests coming to your website.
The /renders property defines which URL the dispatcher sends requests to. In 99% of use cases, this is the IP of the publish CQ instance. You can have multiple renders for a farm, which distributes the load across all the AEM instances mentioned in the /renders property.
/timeout is the connect timeout in milliseconds, that is, how long the dispatcher waits to establish a connection to the AEM instance.
In the following example, rend01 and rend02 are two renders that share all the traffic coming to this dispatcher between localhost:4503 and localhost:4505.
/renders
  {
  /rend01
    {
    # Hostname or IP of the render

    /hostname "localhost"

    # Port of the render
    /port "4503"

    # Connect timeout in milliseconds, 0 to wait indefinitely
     /timeout "10000"
    }

    /rend02
    {
    # Hostname or IP of the render

    /hostname "localhost"

    # Port of the render
    /port "4505"

    # Connect timeout in milliseconds, 0 to wait indefinitely
     /timeout "10000"
    }

    }
Access Control over content
The /filter property helps define access control over content. It defines the HTTP requests that the dispatcher accepts. Other requests are sent back to the Apache webserver with a 404 status code.
If you don’t define a /filter property, all HTTP requests are accepted by the dispatcher.
Each item in the /filter section includes a glob (used to match the HTTP request line) and a type (allow/deny).
As a good security measure, it is always advisable to block all requests first and then specifically allow only the types of requests you want to serve.
/filter
  {
    # Deny everything first and then allow specific entries
    /0001 { /type "deny"  /glob "*" }
    /0002 { /type "allow" /glob "* /content/* *" }
  }
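To make the deny-first pattern above concrete, here is a minimal Python sketch that models how such a rule list could be evaluated. It is only an illustration, not the dispatcher's implementation: it assumes that the last matching rule decides the outcome, it uses Python's fnmatch as a stand-in for the dispatcher's glob matching, and the names FILTER_RULES and is_request_allowed are made up for this example.

```python
from fnmatch import fnmatchcase

# Rules copied from the /filter example above, in file order.
FILTER_RULES = [
    ("deny", "*"),
    ("allow", "* /content/* *"),
]

def is_request_allowed(request_line, rules=FILTER_RULES):
    """Model of /filter evaluation on a request line such as
    'GET /index.html HTTP/1.1'."""
    if not rules:
        # No /filter section: all requests are accepted.
        return True
    decision = None
    for rule_type, glob in rules:
        if fnmatchcase(request_line, glob):
            # Assumption: a later matching rule overrides an earlier one.
            decision = rule_type
    # Assumption of this sketch: a request matching no rule is rejected.
    return decision == "allow"

if __name__ == "__main__":
    for line in ("GET /content/site/en.html HTTP/1.1",
                 "GET /crx/de/index.jsp HTTP/1.1"):
        print(line, "->", "allow" if is_request_allowed(line) else "deny")
```

Because later rules override earlier ones in this model, the order of the entries matters: the broad deny has to come first and the specific allows after it.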
/propagateSyndPost is the property that decides whether syndication requests are handled by the dispatcher alone or are also passed on to the renders (AEM instances).
If /propagateSyndPost is set to 1, the syndication requests are forwarded to the renders. Keep it at 0 unless you are sure what you are doing.
/propagateSyndPost "0"
One of the most important parts of the dispatcher is the cache section.
The /cache property and its values define how your dispatcher caches documents/pages. It has multiple sub-properties and values.
I will cover a few important properties that are a must to know.
/cache
    {

        /docroot ## Defines where the cached files are stored. The value should be the same path as the document root of the webserver.
        /statfileslevel  ## Sets the directory level up to which files named ".stat" are created under the document root of the webserver.
        /serveStaleOnError
        /allowAuthorized  ## Setting the value to 1 allows the dispatcher to cache authenticated documents.
        /rules  ## Defines which documents should be cached.
        /invalidate
        /invalidateHandler
        /allowedClients
        /ignoreUrlParams

    }
/serveStaleOnError is another important property that is often ignored. Whenever content is activated, it gets deleted from the dispatcher cache. If /serveStaleOnError is set to 1, however, the cached HTML is not deleted immediately upon activation; only the .stat file is touched. The next time a request comes in for the same page, the dispatcher first goes to the AEM instance to fetch the latest content. If it gets a successful response, it replaces the old cached content with the new HTML. If it gets a 500/503/5XX response from the server, it simply serves the old cached content with an HTTP status code of 111.
Again, I would suggest using it only if you have a good reason. Otherwise, there is no need to specify the property at all.
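The flow described above can be sketched as a small Python function. This is a simplified model of the behaviour as described in this post, not dispatcher code; the function name, its parameters and the pass-through branch at the end are assumptions made for illustration.

```python
def respond_to_invalidated_page(render_status, render_body, stale_body,
                                serve_stale_on_error):
    """Return (status, body, cache_body) for a request to an invalidated page.

    cache_body is what is left in the cache afterwards (None means nothing).
    """
    if render_status == 200:
        # Fresh content replaces the old cached copy.
        return 200, render_body, render_body
    if render_status >= 500 and serve_stale_on_error and stale_body is not None:
        # The stale copy is kept and served with status 111, as described above.
        return 111, stale_body, stale_body
    # Assumption: otherwise the render's response is passed through uncached.
    return render_status, render_body, None

if __name__ == "__main__":
    print(respond_to_invalidated_page(503, "error page", "old html", True))
    print(respond_to_invalidated_page(503, "error page", "old html", False))
```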
/rules is another useful property that you must know when you are working with the dispatcher. You may get confused between /filter and /rules.
/filter comes into the picture when you want to decide whether a request should be handled by your dispatcher at all. /rules comes into action when you want to decide whether the response should be cached or the request should always be delegated to the AEM instance. Hence, /rules deals purely with caching.
By default, the dispatcher does not cache any request 1) whose HTTP method is not GET, 2) whose URI contains a question mark, or 3) whose file extension is missing.
Generally, we cache everything apart from the cases mentioned above. To define /rules we also use a /glob pattern.
/rules
    {
    /0000
      {
       /glob "*"  /type "allow"
      }
    }
allow defines what is cached on the webserver. deny defines what is always fetched from the AEM instances and not cached in the dispatcher.
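The three default conditions listed above (non-GET method, question mark in the URI, missing file extension) can be expressed as a short Python check. This is only a sketch of those three conditions as stated in this post; the function name is hypothetical, and the real dispatcher applies further checks (such as the /rules section itself) that are not modelled here.

```python
def is_cacheable_by_default(method, uri):
    """Apply the three default conditions from the text to one request."""
    if method.upper() != "GET":
        return False  # 1) HTTP method is not GET
    if "?" in uri:
        return False  # 2) URI contains a question mark
    last_segment = uri.rsplit("/", 1)[-1]
    # 3) a file extension must be present in the last path segment
    return "." in last_segment

if __name__ == "__main__":
    for method, uri in [("GET", "/content/site/en.html"),
                        ("POST", "/content/site/en.html"),
                        ("GET", "/content/site/en.html?q=1"),
                        ("GET", "/content/site/en")]:
        print(method, uri, "->", is_cacheable_by_default(method, uri))
```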
The /invalidate property defines which cached contents are invalidated whenever there is a content activation.
/invalidate
    {
        /0000
          {
          /glob "*"
          /type "deny"
          }
        /0001
          {
          # Consider all HTML files stale after an activation.
          /glob "*.html"
          /type "allow"
          }
        /0002
          {
          /glob "*/content/fadfish/products*"
          /type "deny"
          }

    }
The /allowedClients section restricts the client IP addresses that are allowed to issue activation requests. This is one of the most important security measures you should take when configuring a dispatcher.
Ideally, you should only allow your Author AEM instance to flush the dispatcher cache for a page. Otherwise, any intruder can send a POST request to your dispatcher and flush your cache.
/allowedClients
    {

        /0000
          {
          /glob "*"
          /type "deny"
          }
        /0001
          {
          /glob "AEM_Author_IP"
          /type "allow"
          }
    }
/ignoreUrlParams is the handiest property when your requests carry query parameters and you want to control which of them are allowed to reach your AEM instance.
By default, if a request comes to the dispatcher with a query parameter, the dispatcher just delegates the request to the AEM instance (the renders section). So you bear a high risk: somebody could intentionally write a script that adds a dynamic query parameter to each request and sends it to your dispatcher, which can cause a huge load on your AEM instances and eventually bring them down. This is where /ignoreUrlParams comes into action.
Note: to use this property, your dispatcher module version should be 4.1.2 or higher.
/ignoreUrlParams
      {
        /0001 { /glob "*" /type "allow"  }
        /0002 { /glob "idPrefix" /type "deny"  }

      }
So, in the /ignoreUrlParams section, we first allow every parameter to be ignored, so that such requests are served from the cache and do not go to the AEM instance. Then we specifically list the parameters that should go to the AEM instance. In the above example, if the request carries a query parameter named idPrefix, the request is delegated to the AEM instance and not served from the cache.
allow here means that the parameter is ignored and the request is served from the cache. deny means that the request goes to the AEM instance and is not served from the cache.
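As a rough illustration of the example above, the following Python sketch decides whether a URL with query parameters could still be served from the cache. It is a model under stated assumptions, not the dispatcher's code: globs are matched against parameter names with Python's fnmatch, the last matching rule is assumed to win, and the names IGNORE_RULES, is_ignored and can_serve_from_cache are invented for this sketch.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, parse_qsl

# Rules copied from the /ignoreUrlParams example above, in file order.
IGNORE_RULES = [
    ("*", "allow"),
    ("idPrefix", "deny"),
]

def is_ignored(param_name, rules=IGNORE_RULES):
    """True if the parameter is ignored, i.e. it does not block caching."""
    decision = "deny"  # Assumption: parameters matching no rule are not ignored.
    for glob, rule_type in rules:
        if fnmatchcase(param_name, glob):
            decision = rule_type  # Assumption: the last matching rule wins.
    return decision == "allow"

def can_serve_from_cache(url, rules=IGNORE_RULES):
    """A request stays cache-eligible only if every query parameter is ignored."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return all(is_ignored(name, rules) for name in names)

if __name__ == "__main__":
    for url in ("/content/page.html?utm_source=mail",
                "/content/page.html?idPrefix=3"):
        print(url, "->", can_serve_from_cache(url))
```

In this model a single non-ignored parameter is enough to send the whole request to the AEM instance, which is why the deny list should stay as short as possible.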
Following is a sample dispatcher.any file. Feel free to modify it according to your project needs.
# name of the dispatcher
/name "fadfish-server"
# Each farm configures a set of load balanced renders (i.e. remote servers)
/farms
  {
  # First farm entry
  /website
    {
    # Request headers that should be forwarded to the remote server.
    /clientheaders
      {
      "referer"
      "user-agent"
      "authorization"
      "from"
      "content-type"
      "content-length"
      "accept-charset"
      "accept-encoding"
      "accept-language"
      "accept"
      "host"
      "if-match"
      "if-none-match"
      "if-range"
      "if-unmodified-since"
      "max-forwards"
      "proxy-authorization"
      "proxy-connection"
      "range"
      "cookie"
      "cq-action"
      "cq-handle"
      "handle"
      "action"
      "cqstats"
      "depth"
      "translate"
      "expires"
      "date"
      "dav"
      "ms-author-via"
      "if"
      "lock-token"
      "x-expected-entity-length"
      "destination"

      }

    /virtualhosts
      {

      "*"
      }

    # The load will be balanced among these render instances
    /renders
      {
      /rend01
        {
        # Hostname or IP of the render

        /hostname "localhost"

        # Port of the render
        /port "4503"

        # Connect timeout in milliseconds, 0 to wait indefinitely
         /timeout "10000"
        }
     }

    # The filter section defines the requests that should be handled by the dispatcher.
    # The globs will be compared against the request line, e.g. "GET /index.html HTTP/1.1".
    /filter
      {
      # Deny everything first and then allow specific entries
      /0001 { /type "deny"  /glob "*" }
      /0002 { /type "allow" /glob "* /content/* *" }  # disable this rule to allow mapped content only
      /0025 { /type "deny"  /glob "* /common* *" } # if you enable /libs close access to proxy
      /0029 { /type "allow" /glob "* /content* *" }
      /0030 { /type "allow" /glob "* /content/dam/* *"   }
      # Enable specific mime types in non-public content directories
      /0041 { /type "allow" /glob "* *.css *"   }  # enable css
      /0042 { /type "allow" /glob "* *.gif *"   }  # enable gifs
      /0043 { /type "allow" /glob "* *.ico *"   }  # enable icos
      /0045 { /type "allow" /glob "* *.png *"   }  # enable png
      /0046 { /type "allow" /glob "* *.swf *"   }  # enable flash
      /0047 { /type "deny" /glob "* *.pdf *"   }  # deny PDFs by default
      /0048 { /type "allow" /glob "* /content/*.pdf *"   }  # enable PDFs under /content
      /0049 { /type "deny" /glob "* *.txt *"   }
      /0050 { /type "deny" /glob "* /common/script/* *" }
      /0059 { /type "deny" /glob "* /styles/* *"   }
      # Enable features
      /0061 { /type "allow" /glob "POST /content/[.]*.form.html" }  # allow POSTs to form selectors under content
      /0062 { /type "allow" /glob "* /libs/cq/personalization/*"  }  # enable personalization
      /0081 { /type "allow"  /glob "GET *.infinity.json*" }
      /0082 { /type "allow"  /glob "GET *.tidy.json*"     }
      /0083 { /type "allow"  /glob "GET *.sysview.xml*"   }
      /0084 { /type "allow"  /glob "GET *.docview.json*"  }
      /0085 { /type "allow"  /glob "GET *.docview.xml*"  }
      /0086 { /type "allow"  /glob "GET *.*[0-9].json*" }

      # Deny query
      /0090 { /type "deny"  /glob "* *.query.json*" }
      /0091 { /type "deny" /glob "* /crx[./]*" }
      /0092 { /type "deny" /glob "* /admin[./]*" }
      /0093 { /type "deny" /glob "* /var[./]*" }
      /0094 { /type "deny" /glob "* /tmp[./]*" }
      /0095 { /type "deny" /glob "* /bin/login[./]*" }
      /0096 { /type "deny" /glob "* /system[./]*" }
      /0097 { /type "deny" /glob "* /etc/packages*" }
      /0098 { /type "deny" /glob "* /libs/cq/core*" }
      /0099 { /type "deny" /glob "* /etc/replication*" }
      }
    /propagateSyndPost "0"
    # The cache section regulates what responses will be cached and where.
    /cache
      {
      /docroot  "/mnt/fadfish/apache/www"

      /statfileslevel "5"

      /allowAuthorized "1"

      /rules
        {
        /0000
          {
          /glob "*"
          /type "allow"
          }
        }
      /invalidate
        {
        /0000
          {
          /glob "*"
          /type "deny"
          }
        /0001
          {
          /glob "*.html"
          /type "allow"
          }
        /0002
          {
          /glob "*/content/fadfish*"
          /type "deny"
          }

        }

      /allowedClients
        {

        /0000
          {
          /glob "*"
          /type "deny"
          }
        /0001
          {
          /glob "127.0.0.1"
          /type "allow"
          }
        }

      /ignoreUrlParams
        {
        /0001 { /glob "*" /type "allow"  }
        /0002 { /glob "blog_entries_start" /type "deny"  }
        /0003 { /glob "idPrefix" /type "deny"  }
        /0004 { /glob "query" /type "deny"  }

        }
    }

    /statistics
      {
      /categories
        {
        /html
          {
          /glob "*.html"
          }
        /others
          {
          /glob "*"
          }
        }
      }
    }
  }
Cheers !!

1 comment :

  1. Hi Kishore,
    sometimes we are getting an error while hit one of the website URL.

    Your browser sent a request that this server could not understand.
    Size of a request header field exceeds server limit

    could you help us what was the recommend value for LimitRequestFieldSize parameter in aem?

    Thanks,
    Satish
