The main role of this feature is to allow users to have in one “indicator” the aggregation of other states. This indicator can provide a unique view for users focused on different roles.
Typical roles:
- Service delivery Management
- Business Management
- Engineering
- IT support
Let’s take a simple example of a service delivery role for an ERP application. It mainly consists of the following IT components:
- 2 databases, in high availability, so with one database active, the service is considered up
- 2 web servers, in load sharing, so with one web server active, the service is considered up
- 2 load balancers, again in high availability
These IT components (Hosts in this example) will be the basis for the ERP service.
With business rules, you can have an “indicator” representing the “aggregated service” state for the ERP service! Shinken already checks all of the IT components one by one including processing for root cause analysis from a host and service perspective.
It’s a simple service (or a host) with a “special” check_command named bp_rule. :)
This makes it compatible with all your current habits and UIs. As the service aggregation is considered as any other state from a host or service, you can get notifications, actions and escalations. This means you can have contacts that will receive only the relevant notifications based on their role.
Warning
You do not have to define “bp_rule” command, it’s purely internal. You should NOT define it in you checkcommands.cfg file, or the configuration will be invalid due to duplicate commands!
Here is a configuration for the ERP service example, attached to a dummy host named “servicedelivery”.
define service{
use standard-service
host_name servicedelivery
service_description ERP
check_command bp_rule!(h1,database1 | h2,database2) & (h3,Http1 | h4,Http4) & (h5,IPVS1 | h6,IPVS2)
}
That’s all!
Note
A complete service delivery view should include an aggregated view of the end user availability perspective states, end user performance perspective states, IT component states, application error states, application performance states. This aggregated state can then be used as a metric for Service Management (basis for defining an SLA).
In some cases, you know that in a cluster of N elements, you need at least X of them to run OK. This is easily defined, you just need to use the “X of:” operator.
Here is an example of the same ERP but with 3 http web servers, and you need at least 2 of them (to handle the load):
define service{
use standard-service
host_name servicedelivery
service_description ERP
check_command bp_rule!(h1,database1 | h2,database2) & (2 of: h3,Http1 & h4,Http4 & h5,Http5) & (h6,IPVS1 | h7,IPVS2)
}
It’s done :)
The X of:
expression may be configured different values depending on the needs. The supported expressions are described below:
- A positive integer, which means “at least X host/servicices should be up“
- A positive percentage, which means “at least X percents of hosts/services should be up”. This percentage expression may be combined with Groupping expression expansion to build expressions such as “95 percents of the web front ends shoud be up”. This way, adding hosts in the web frontend hostgroup is sufficient, and the QoS remains the same.
- A negative integer, which means “at most X host/servicices may be down“
- A negative percentage, which means “at most X percents of hosts/services should may be down”. This percentage expression may be combined with Groupping expression expansion to build expressions such as “5 percents of the web front ends may be down”. This way, adding hosts in the web frontend hostgroup is sufficient, and the QoS remains the same.
Example:
define service{
use standard-service
host_name servicedelivery
service_description ERP
check_command bp_rule!(h1,database1 | h2,database2) & (h6,IPVS1 | h7,IPVS2) & 95% of: g:frontend,Http
}
You can define a not state rule. It can be useful for active/passive setups for example. You just need to add a ! before your element name.
Example:
define service{
use generic-service
host_name servicedelivery
service_description Cluster_state
check_command bp_rule!(h1,database1 & !h2,database2)
}
Aggregated state will be ok if database1 is ok and database2 is warning or critical (stopped).
In the Xof:
way the only case where you got a “warning” (=”degraded but not dead”) it’s when all your elements are in warning. But you should want to be in warning if 1 or your 3 http server is critical: the service is still running, but in a degraded state.
Here are some example for business rules about 5 services A, B, C, D and E. Like 5,1,1of:A|B|C|D|E
A | B | C | D | E |
Warn | Ok | Ok | Ok | Ok |
Rules and overall states:
- 4of: –> Ok
- 5,1,1of: –> Warning
- 5,2,1of: –> Ok
A | B | C | D | E |
Warn | Warn | Ok | Ok | Ok |
Rules and overall states:
- 4of: –> Warning
- 3of: –> Ok
- 4,1,1of: –> Warning
A | B | C | D | E |
Crit | Crit | Ok | Ok | Ok |
Rules and overall states:
- 4of: –> Critical
- 3of: –> Ok
- 4,1,1of: –> Critical
A | B | C | D | E |
Warn | Crit | Ok | Ok | Ok |
Rules and overall states:
- 4of: –> Critical
- 4,1,1of: –> Critical
A | B | C | D | E |
Warn | Crit | Crit | Ok | Ok |
Rules and overall states:
- 2of: –> Ok
- 2,4,4of: –> Ok
- 4,1,1of: –> Critical
- 4,1,2of: –> Critical
- 4,1,3of: –> Warning
Let’s look at some classic setups, for MAX elements.
- ON/OFF setup: MAXof: <=> MAX,MAX,MAXof:
- Warning as soon as problem, and critical if all criticals: MAX,1,MAXof:
- Worse state: MAX,1,1
Sometimes, you do not want to specify explicitly the hosts/services contained in a business rule, but prefer use a grouping expression such as hosts from the hostgroup xxx, services holding lablel yyy or hosts which name matches regex zzz.
To do so, it is possible to use a grouping expression which is expanded into hosts or services. The supported expressions use the following syntax:
flag:expression
The flag is a single character qualifying the expansion type. The supported types (and associated flags) are described in the table below.
F | Expansion | Example | Equivalent to |
g | Content of the hostgroup | g:webs | web-srv1 & web-srv2 & ... |
l | Hosts which are holding label | l:front | web-srv1 & db-srv1 & ... |
r | Hosts which name matches regex | r:^web | web-srv1 & web-srv2 & ... |
t | Hosts which are holding tag | t:http | web-srv1 & web-srv2 & ... |
F | Expansion | Example | Equivalent to |
g | Content of the servicegroup | g:web | web-srv1,HTTP & web-srv2,HTTP & ... |
l | Services which are holding label | l:front | web-srv1,HTTP & db-srv1,MySQL & ... |
r | Services which description matches regex | r:^HTTPS? | web-srv1,HTTP & db-srv2,HTTPS & ... |
t | Services which are holding tag | t:http | web-srv1,HTTP & db-srv2,HTTPS & ... |
- Labels are arbitrary names which may be set on any host or service using the
label
directive.- Tags are the template names inherited by hosts or services, generally coming from packs.
It is possible to combine both host and service expansion expression to build complex business rules.
Note
A business rule expression always has to be made of a host expression (selector if you prefer)
AND a service expression (still selector) separated by a coma when looking at service status.
If not so, there is no mean to distinguish a host status from a service status in the expression.
In servicegroup flag case, as you do not want to apply any filter on the host (you want ALL services which are member of the XXX service group, whichever host they are bound to),
you may use the * host selector expression. The correct expression syntax should be:
bp_rule!*,g:my-servicegroup
The same rule applies to other service selectors (l, r, t, and so on).
You want to build a business rule including all web servers composing the application frontend.
l:front,r:HTTPS?
which is equivalent to:
web-srv1,HTTP & web-srv3,HTTPS
You may obviously combine expression expansion with standard expressions.
l:front,h:HTTPS? & db-srv1,MySQL
which is equivalent to:
(web-srv1,HTTP & web-srv3,HTTPS) & db-srv1,MySQL
As of any host or service check, a business rule having its state in a non OK
state may send notifications depending on its notification_options
directive. But what if the underlying problems are known, and may be acknowledged ? The default behaviour is to continue sending notifications.
This may be what you need, but what if you want the business rule to stop sending notifications ?
Imagine your business rule is composed of all your site’s web front ends. If a host fails, you want to know it, but once someone starts to fix the issue, you don’t want to be notified anymore. A possible solution is to acknowledge the business rule itself. But if you do so, any other failing host won’t get notified. Another solution is to enable smart notification on the business rule check.
Smart notifications is a way to disable notifications on a business rule having all its problems acknowledged. If a new problem occurs, notifications will be enabled back while it has not been acknowledged.
To enable smart notifications, simply set the business_rule_smart_notifications
to 1
.
Downtimes are a bit more tricky to handle. While acknowledgement are necessarily set by humans, downtimes may be set automatically (for instance, by maintenance periods). You may still want to be notified during downtime periods. As a consequence, downtimes are not taken into account by smart notification processing, unless explicitly told to do so.
To enable downtimes in smart notifications processing, simply set the business_rule_downtime_as_ack
to 1
.
Another useful usage of business rules is consolidated services. Imagine you have a large web cluster, composed of hundreds of nodes. If a small portion of the nodes fail, you may receive a large number of notifications, which is not convenient. To prevent this, you may use a business rule looking like bp_rule!g:web,...
. If you disable notifications by setting notification_options
to n
on the underlying hosts or services, you would receive a single notification with all the failing nodes in one time, which may be clearer.
To avoid having to manually set notification_options
on each node, you may use two convenient directives on the business rule side: business_rule_host_notification_options
which enforces notification options of underlying hosts, and business_rule_service_notification_options
which does the same for services.
This feature, combined with the convenience of packs and Smart notifications allows to build large consolidated services very easily.
Example:
define host {
use http
host_name web-01
hostgroups web
...
}
define host {
use http
host_name web-02
hostgroups web
...
}
define host {
host_name meta
...
}
define service {
host_name meta
service_description Web cluster
check_command bp_rule!g:web,g:HTTPS?
business_rule_service_notification_options n
...
}
In the previous example, HTTP/HTTPS services come from the http
pack. If one or more http servers fail, a single notification would be sent, rather than one per failing service.
Warning
It would be very tempting in this situation to acknowledge the consolidated service if a notification is sent. Never do so, as any, as any new failure would not be reported. You still have to acknowledge each independent failure. Take care to explain this to people in charge of the operations.
It is possible in a business rule expression to include macros, as you would do for normal check command definition. You may for instance define a custom macro on the host or service holding the business rule, and use it in the expression.
Combined with macro modulation, this allows to define consolidated services with variable fault tolerance thresholds depending on the timeperiod.
Imagine your web frontend cluster composed of dozens servers serving the web site. If one is failing, this would not impact the service so much. During the day, when the complete team is at work, a single failure should be notified and fixed immediately. But during the night, you may consider that losing let’s say up to 5% of the cluster has no impact on the QoS: thus waking up the on-call guy is not useful.
You may handle that with a consolidated service using macro modulation combined with an Xof:
expression.
Example:
define macromodulation{
macromodulation_name web-xof
modulation_period night
_XOF_WEB -5% of:
}
define host {
use http
host_name web-01
hostgroups web
...
}
define host {
use http
host_name web-02
hostgroups web
...
}
define host {
host_name meta
macromodulations web-xof
...
}
define service {
host_name meta
service_description Web cluster
check_command bp_rule!$_HOSTXOF_WEB$ g:web,g:HTTPS?
business_rule_service_notification_options n
...
}
In the previous example, during the day, we’re outside the modulation period. The _XOF_WEB
is not defined, so the resulting business rule is g:web,g:HTTPS?
. During the night, the macro is set a value, then the resulting business rule is -5% of: g:web,g:HTTPS?
, allowing to lose 5% of the cluster silently.
By default, business rules checks have no output as there’s no real script or binary behind. But it is still possible to control their output using a templating system.
To do so, you may set the business_rule_output_template
option on the host or service holding the business rule. This attribute may contain any macro. Macro expansion works as follows:
- All macros outside the
$(
and)$
sequences are expanded using attributes set on the host or service holding the business rule.- All macros between the
$(
and)$
sequences are expanded for each underlying problem using its attributes.
All macros defined on hosts or services composing or holding the business rule may be used in the outer or inner part of the template respectively.
To ease writing output template for business rules made of both hosts and services, 3 convinience macros having the same meaning for each type may be used: STATUS
, SHORTSTATUS
, and FULLNAME
, which expand respectively to the host or service status, its status abreviated form and its full name (host_name
for hosts, or host_name/service_description
for services).
Example:
Imagine you want to build a consolidated service which notifications contain links to the underlying problems in the WebUI, allowing to acknowledge them without having to search. You may use a template looking like:
define service {
host_name meta
service_description Web cluster
check_command bp_rule!$_HOSTXOF_WEB$ g:web,g:HTTPS?
business_rule_output_template Down web services: $(<a href='http://webui.url/service/$HOSTNAME$/$SERVICEDESC$'>($SHORTSTATUS$) $HOSTNAME$</a> )$
...
}
The resulting output would look like Down web services: link1 link2 link3 ...
where linkN
are urls leading to the problem in the WebUI.