Monday, May 20, 2013

Open-source monitoring rapid testing: the 2013 capabilities update

It has been a busy week so far: I've been re-examining the state of the different monitoring solutions based on open-source software, and since Monday I have deployed Nagios, Icinga, Ganglia, Cacti, OpenNMS and Zabbix; I'm installing Sensu right now.

Basically, OpenNMS is what worked best out of the box: the first complete installation plus configuration took only a couple of hours (and that first attempt was also the test run). Then, with a few clicks, its auto-discovery nicely swept a range of my test network and detected the devices. Setting thresholds and notifications was a bit more work, another hour or so, reading somewhat confusing documentation and mailing-list threads full of confused requests for help and vague answers. Sure, in the end the solution was quite simple and intuitive... after having operated it the first time. OpenNMS works quite well but is a bit unstable in the sense that quickly deploying additional plugins (via apt, for example) will not always leave you with a completely stable OpenNMS: a service can run perfectly for hours after installing plugins, and then, on the first system reboot, something triggers a boot failure. The error messages are basically the output dump of the Java VM, and rarely contain information useful for recovery (the "profile" of the responses on the forums/lists suggests that many advanced OpenNMS users are very familiar with identifying which "parts" of the software's configuration to change or fix by reading the JVM dump directly, rather than searching for and finding that information in some documentation).
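For the record, that auto-discovery sweep is driven by the `discovery-configuration.xml` file in the OpenNMS `etc/` directory; below is a minimal sketch. The IP range and timing values are illustrative assumptions, and attribute details may vary across OpenNMS versions, so check it against the file shipped with your install:

```xml
<discovery-configuration threads="1" packets-per-second="1"
        initial-sleep-time="30000" restart-sleep-time="86400000"
        retries="1" timeout="2000">
    <!-- Illustrative range; adjust to your test network -->
    <include-range retries="1" timeout="2000">
        <begin>192.168.0.1</begin>
        <end>192.168.0.254</end>
    </include-range>
</discovery-configuration>
```

After editing the file, restarting OpenNMS should make it pick up the range and start sweeping it.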

For example, installing the DHCP monitoring plugin, configuring it, and then uninstalling it left the software unable to boot for lack of the binaries needed to start the service. In this case I was lucky: the error message clearly indicated that the failure was due to the DHCP monitoring service failing to start, and the solution was simply to comment the service out again in the corresponding configuration file (thereby disabling the attempt to start it when OpenNMS boots).

Cacti was very easy to install. Using Cacti is so simple that almost nobody has bothered to write tutorials on how to add a device and then generate (and arrange) the graphs, but "simple" does not mean fully intuitive, and I spent a good half hour playing with the GUI to understand the workflow for adding devices and generating graphed information, which is basically the reason to deploy Cacti in the first place. (Apparently OpenNMS generates "exactly" the same information anyway, but of course you must navigate several menus to find it, while Cacti shows it right after login.)


Ganglia is always my first choice for gathering server performance and utilization information, mainly because it installs quickly: it requires no more than installing the server software, the client software, and "hooking" them up in the configuration (you have to tell the client agent which Ganglia server it should report to, and tell the server which hosts to accept communications from). After installing Ganglia and leaving it collecting data, I began to review the other options, and by the time I was halfway to being fairly determined to implement OpenNMS and Cacti, Ganglia had already built graph profiles of my test systems.
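As a sketch of that "hook": the client-side agent (gmond) mainly needs to know where to send metrics. Assuming the Ganglia collector runs at `ganglia.example.com` (a hypothetical name), the relevant fragments of `gmond.conf` look roughly like this:

```
cluster {
  name = "testlab"               # hypothetical cluster name
}

udp_send_channel {
  host = ganglia.example.com     # the collector host -- assumption
  port = 8649
}

udp_recv_channel {
  port = 8649
}
```

On the server side, gmetad's `data_source` line then points at one or more of these agents for the cluster.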

I installed Zabbix in minutes (and a couple of agents as well), and the GUI is very attractive, although not as intuitive as OpenNMS's (which is not too intuitive either). Anyway, I triggered its rapid auto-discovery capability, which failed to capture a single device in the same network range I had loaded two hours earlier in OpenNMS (where the latter had perfectly detected my test servers and devices, SNMP data included). So I went looking in the GUI documentation for some further explanation of the supposedly "intuitive" procedure for adding devices, then through forums and mailing lists, coming back with nothing; I guess it is so easy that nobody bothers to write a step-by-step guide, so for now I left Zabbix running unconfigured (until I find out how to add devices). Likewise, with every Google search I keep finding claims that Zabbix is "very easy"; I guess they refer to the installation, but I will have to devote more than the 20 minutes I spent in order to conclude anything about the software (and be able to load at least a couple of test devices). If it does end up working, it could prove even more useful than OpenNMS.


Nagios (and Icinga; in my first contact with that software I applied my Nagios expertise and could configure and manage it without any issue, so I can confirm the portability of skills) is what I left to try last. It is tempting to favor the software that is easiest and fastest to deploy, but that does not always mean the software is reasonably easy to manage day to day (well, in the case of Nagios it IS easy to manage), and/or that it scales well even in the medium term.

Nagios does not scale at all well in dynamic environments where production servers come up and go down constantly, and the clearest demonstration of this is implementing Nagios to monitor cloud environments. If you implement Nagios in a virtualized environment, you quickly see how only your stable, long-lived production servers are constantly monitored, while the other servers that are spun up and torn down dynamically, even though they are in production, are slowly left behind, with the Nagios configuration dedicated only to the servers that run continuously, without dynamic downtimes.
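To make the scaling problem concrete, here is a minimal sketch (illustrative only, not part of any Nagios distribution) of the kind of glue you end up writing: a script that regenerates static Nagios host definitions from a live VM inventory. `get_running_vms()` is a hypothetical stand-in for whatever API your platform exposes; the point is that this has to be re-run and Nagios reloaded every time the inventory changes.

```python
# Sketch: regenerate Nagios host definitions from a dynamic VM inventory.

HOST_TEMPLATE = """define host {{
    use         generic-host
    host_name   {name}
    address     {address}
}}
"""

def render_hosts(vms):
    """Build the contents of a hosts.cfg file from an inventory list."""
    return "\n".join(
        HOST_TEMPLATE.format(name=vm["name"], address=vm["address"])
        for vm in vms
    )

def get_running_vms():
    # Hypothetical inventory source; replace with your platform's API
    # (vSphere SDK, a cloud provider API, etc.).
    return [
        {"name": "web-01", "address": "192.168.0.10"},
        {"name": "db-01", "address": "192.168.0.11"},
    ]

if __name__ == "__main__":
    # In real use you would write this to /etc/nagios/conf.d/ and reload
    # Nagios -- and schedule this script to run regularly, which is exactly
    # the continuous manual adaptation discussed above.
    print(render_hosts(get_running_vms()))
```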

Then there is the temptation to integrate Nagios + Cacti, Nagios + RRDs, Nagios + whatever: a combination that will quickly stop correctly reflecting the true overall performance profile of virtualized environments. Unless, of course, you choose to arrange the architecture so that your "fixed" production servers always run on certain hypervisors, while the others, the production and testing servers that are dynamically spun up and torn down (created and deleted regularly), are confined to other hypervisors.

Mmm, but there is a problem: precisely the possibility of using idle hypervisor capacity is, in principle, a main reason to virtualize, so "limiting" the layout of the virtualized infrastructure because one (1) piece of software lacks the capabilities to "follow" the dynamism of the virtual infra means giving up spare-capacity benefits of the virtualization solution. Consider, for example, how dramatic the limitation becomes when the virtualized infra runs complex configurations (ones that define hierarchies for dynamically powering hypervisor VMs on and off under certain performance profiles, for example).

Sure, Nagios can be "adapted" to dynamic scenarios, but those settings will be static (basically you could "play" with scheduled downtimes that match the estimated times the VMs will be powered off by the virtualized infra), with the result that on one hand we set up the virtual infra to adjust itself automatically, while on the other hand we must deal with (re)adapting the monitoring software's configuration for those servers manually and continuously.
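That "play with scheduled downtimes" trick can be sketched with Nagios's external command interface, which really does accept a `SCHEDULE_HOST_DOWNTIME` command written to its command file. The helper below only builds the command line; the command-file path and the estimated power-off windows are assumptions you would have to supply:

```python
import time

def schedule_host_downtime(host, start, end, author="automation",
                           comment="VM powered off by infra"):
    """Build a SCHEDULE_HOST_DOWNTIME external command line for Nagios.

    fixed=1 (fixed window), trigger_id=0, duration=0 (ignored when fixed).
    Writing the line to the command file (a path like
    /usr/local/nagios/var/rw/nagios.cmd, which varies per install)
    is left to the caller.
    """
    now = int(time.time())
    return ("[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            "1;0;0;{author};{comment}").format(
                now=now, host=host, start=start, end=end,
                author=author, comment=comment)

if __name__ == "__main__":
    start = int(time.time())
    print(schedule_host_downtime("web-01", start, start + 3600))
```

The static nature of the approach is visible right here: the start/end times are estimates fed in from outside, not anything Nagios learns from the infrastructure.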

Almost none of these conclusions is new (see the monitoring-sucks links), nor is commercial software the solution (in general it has the same limitations adapting to dynamic infrastructures), not to mention that the same thing happens with the rest of the software I tried: OpenNMS, Cacti, Ganglia, etc.

I still have to test GroundWork and Hyperic HQ (similar to Zabbix: commercial, but at least open source or freeware) and see how they behave. I find it funny how the pages of all the monitoring software vendors claim they are the best, or something like this :-D > "The World's Largest Monitoring Web Applications"

Tuesday, May 7, 2013

Complete IT solutions, and an example with a vSphere virtual infrastructure

This article tells how, let's say, "complete" solutions are not, and how you have to complete them so they really fulfill the purpose for which they were designed. There are also comments on the areas of responsibility of third-party IT providers and suppliers versus the internal IT areas of their main client (the organization).


Good solutions, but partial
In infrastructure it is common to see that when an IT solution is bought, the vendor agrees to perform certain work, comes to the company/organization, does the work, and then leaves; leaving outstanding only guarantees, for a time, under certain conditions, etc.

For example, when installing a vSphere infrastructure: the hypervisors are installed, the vCenter server is mounted, the hypervisors are added to vCenter, some virtual machines are deployed (or perhaps not), and done; up to there goes the work agreed with the IT service supplier in this example. The customer then takes the baton from there, managing the whole (now virtual) infrastructure: installing, migrating operating systems from physical to virtual, etc., etc.

The fixed price of the infrastructure work and its limits are essential, but the supplier will not take care indefinitely of every question related to what was initially installed/configured.


Complete IT Solutions
Now, the case of the internal systems areas in the organization is rather different. Each internal IT area of the organization is required to sustain the continuity of the infrastructure over time, long-term.

This is very different from the commercial IT supplier's obligation; and yet it is common for internal IT solutions to be implemented in an organization "one-time" and then left "as is", without taking maintenance and continuous improvement into account (which, let's say, is by the way a job requirement for the employees of the internal IT area).

Following the example of the vSphere infrastructure, some steps after the "simple" installation and configuration of the vSphere virtual infra could be (more or less in order of strategic-technical importance):

1) Implement automated backup of the vCenter configuration (and its backend DB),

2) Implement automated backup of the ESXi configuration,

3) Deploy (buy, let's say) a virtual backup solution (Veeam, etc.) for the virtual machines themselves,

4) Implement automated configuration tracking (dump all the vSphere settings into Git or the like, then keep doing it regularly, to have an accurate central record of each configuration change), AKA "configuration management",

5) Implement virtual infrastructure monitoring (several ways),

6) Deploy vSphere Update Manager (to keep all hypervisors updated/patched),

7) Implement high availability for vCenter (i.e., mount another vCenter server, in any of the several possible ways),

8) Implement the required maintenance automation for vCenter (tip: the DB backend needs attention at times),

9) Define how to proceed, and what to do at the purely technical level, to recover from the failure/crash/outage of any component of the vSphere virtual infrastructure (including having the tools installed and configured, having plans for every possible recovery, and having done drills and field tests to know that all the policies/procedures/tools actually work as they should).
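As an illustration of steps 2 and 4 combined, here is a minimal sketch around `vicfg-cfgbackup`, the ESXi configuration backup tool from VMware's vSphere CLI. The hostnames and paths are invented, credential handling is omitted, and you should verify the tool's options against your vSphere CLI version:

```python
import datetime

# Hypothetical inventory; in practice this could come from vCenter.
ESXI_HOSTS = ["esxi01.example.com", "esxi02.example.com"]

def backup_command(host, dest_dir="/var/backups/esxi"):
    """Build a vicfg-cfgbackup invocation (vSphere CLI) for one host.

    vicfg-cfgbackup -s saves the host's firmware configuration to a file;
    credentials (session files, environment variables) are omitted here.
    """
    stamp = datetime.date.today().isoformat()
    dest = "{}/{}-{}.tgz".format(dest_dir, host, stamp)
    return ["vicfg-cfgbackup", "--server", host, "-s", dest]

if __name__ == "__main__":
    for h in ESXI_HOSTS:
        # subprocess.run(backup_command(h)) in real use; committing the
        # resulting files to Git then covers the "configuration
        # management" step as well.
        print(" ".join(backup_command(h)))
```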

If you notice, extrapolating the general idea of the example, basically any infrastructure needs (in addition to installation, configuration and initial production start-up):

- Backups,
- Configuration management,
- Monitoring and optimization/maintenance/continuous improvement,
- Added redundancy/resilience (as part of the continuous improvement),
- An action plan for disaster recovery.

Without all these details (and several others not mentioned), the solution can "crash" very easily and stop working properly, and with some bad luck, unexpectedly too (e.g., New Year's morning, 3 am, the company owner calls the IT staff, and by 3:10 the personnel using the system warn that it simply does not work. "Use cases": an emergency clinic, an on-call pharmacy, a security company, the police, etc.).

* This is a matter of opinion, but to complete the picture, beyond the TCO of the solution you could add a forecast/estimate of the future costs of lifecycle management, for example by providing for a platform migration.

Following the example: foresee a possible/eventual migration path from VMware vSphere 5.1 (+ ESXi) to Microsoft Hyper-V 2012 + System Center 2012 Virtual Machine Manager.

For example, having to buy a SAN "now":
- increases the TCO of the vSphere solution, but
- lowers the TCO of the (possible, future) Hyper-V 2012 solution, and
- let's say, lowers the TCO of the "virtual infrastructure" solution
(which is what actually matters to the organization), and therefore generates an acceptable "migration path", and you conclude that buying the SAN "will be good" :-)
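The TCO reasoning above can be written down as simple arithmetic; all the figures below are invented purely for illustration:

```python
# Hypothetical figures (currency units are arbitrary).
tco_vsphere_without_san = 100_000
tco_vsphere_with_san = 130_000        # the SAN raises the vSphere TCO...

tco_hyperv_later_without_san = 90_000
tco_hyperv_later_with_san = 40_000    # ...but the SAN is reused, so the
                                      # future Hyper-V migration gets cheaper.

# TCO of the "virtual infrastructure" across its whole lifecycle:
total_without_san = tco_vsphere_without_san + tco_hyperv_later_without_san
total_with_san = tco_vsphere_with_san + tco_hyperv_later_with_san

print(total_without_san)  # 190000
print(total_with_san)     # 170000 -> buying the SAN "will be good"
```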

Limited areas and limited time-frames
Internal IT areas have a sphere of involvement and obligations toward the IT infrastructure far greater than almost any "turnkey" solution a third party can provide, since even with the best budget available, the scope of involvement of an outsourced IT provider always (but always) is limited to certain tasks and obligations, and to a contracted range of time during which it will respond to the client. After which, it will no longer have any obligation to respond to the client.

The internal IT area, on the other hand, has no such limits on its obligations to the organization: it must respond as an organizational commitment (i.e., regardless of who happens to be staffing the area as employees/managers), continuously, and it is responsible for completing and correcting whatever limitations exist in the infrastructure.

Following the example: suppose the "turnkey" solution did not provide a backup mechanism for the ESXi hypervisors. If the provider does not supply it, it is the duty of the internal IT area to complete the solution.


The IT provider's contractual obligation always has a practical limit: the maximum time hired and how much work can be done during that time. Even though what usually gets hired are:
- "solutions",
- "turnkey solutions",
- "complete solutions",

and other fine IT-vendor jargon, no matter what is "promised", the solutions provided by a third party will never manage to be fully complete; only what is contracted (a task list contained in the contract) will be delivered, and any additional work, paid or not, is at the discretion and goodwill of the third-party provider.

Put directly... unless they are permanently contracted to do the work of the internal IT area... oops, but that contract also has a maximum, so no, you cannot sustain unlimited outsourcing: there will always be more to pay, or additional services to outsource, to get something "unlimited" (which is why it is very good business indeed).