High Availability - Main Technologies

This post describes the main high-availability technologies. A high-availability system is a computer system that is resistant to software and power failures, and whose goal is to keep its services available for as long as possible.

It is increasingly necessary to guarantee the availability of a service, but since many components of today's information systems contain mechanical parts, their reliability is relatively insufficient when the service is critical. To guarantee the absence of service interruptions it is often necessary to have redundant hardware that comes into operation automatically when one of the components in use fails.
The more redundancy there is, the fewer single points of failure (SPOF) there will be, and the lower the probability of service interruptions. Until a few years ago such systems were very expensive, and the search for alternative solutions has been intensifying. This gave rise to systems built from affordable commodity hardware (clusters), which are highly scalable and of minimal cost. Figure 1 illustrates the typical configuration of a dual-node high-availability system:


Figure 1 - Classic architecture of a dual-node high-availability system

As can be seen, there is no single point in this architecture whose failure would make any other point unavailable (a single point of failure, SPOF). The fact that both servers are running and connected to the network does not mean, however, that they perform the same tasks. That is a decision made by the administrator, and it is known as load balancing.

One of the benchmarks generally used when evaluating solutions:
- HA: availability levels measured by downtime (the time during which the service is unavailable).

Generally, the higher the availability, the greater the redundancy and the cost of the solution: it all depends on the type of service to be provided. For example, a telecommunications operator will certainly want the highest level, in order to guarantee a high degree of availability and not lose customers should the system fail constantly. A company with normal office hours, however, may consider 90% availability to be sufficient. Note that a monthly availability level is not the same as an annual one: the roughly 22 hours of downtime allowed by a 97% monthly level correspond to an annual level of approximately 99.75%.
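To make that comparison concrete, the short Python sketch below converts an availability percentage into the downtime it allows; the 730-hour month and 8760-hour year are the usual rough approximations.

```python
# Convert an availability percentage into the downtime it allows over a period.
# 730 h/month and 8760 h/year are the usual rough approximations.

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    return (1 - availability_pct / 100.0) * period_hours

print(downtime_hours(97.0, 730))    # ~21.9 hours of downtime per month
print(downtime_hours(99.75, 8760))  # ~21.9 hours of downtime per year
```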

Fault tolerance basically consists of having redundant hardware that comes into operation automatically when a failure of the primary hardware is detected. Regardless of the solution adopted, there are always two parameters that allow the degree of fault tolerance to be measured: the MTBF (Mean Time Between Failures), the average time between failures, and the MTTR (Mean Time To Repair), the average time that elapses between the occurrence of a failure and the full recovery of the system to its operational state. The availability of a system can be calculated with the formula:

Availability = MTBF / (MTBF + MTTR)
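As a quick illustrative calculation (the figures are made up, not measurements): a component that fails on average every 5000 hours and takes 4 hours to repair has an availability of about 99.92%.

```python
# Availability computed from MTBF and MTTR, using the formula above.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative values only: MTBF = 5000 h, MTTR = 4 h.
print(f"{availability(5000, 4):.4%}")  # ~99.9201%
```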

DAS (Direct Attached Storage)
- Disks or other devices attached directly to servers;
- USB 2.0, FireWire.
- Small businesses (up to 50 users)



NAS (Network Attached Storage)
- Dedicated servers
- IDE and SCSI interfaces (IDE now known as PATA)
- Medium-sized businesses (50 to 500 users)



SAN (Storage Area Network)
- High-speed networks (servers and devices)
- Gigabit Ethernet and Fibre Channel
- Large enterprises (more than 500 users)
- Higher performance and scalability



#################################################

High-availability clusters (also known as HA Clusters or Failover Clusters) are computer clusters that are implemented primarily for the purpose of improving the availability of services which the cluster provides. They operate by having redundant computers or nodes which are then used to provide service when system components fail. Normally, if a server with a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as Failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
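As a rough illustration of that sequence, here is a minimal Python sketch of a failover routine; the device, mount point, address, interface, and service name are hypothetical placeholders, and real clustering software handles ordering, retries, and fencing far more carefully.

```python
import subprocess

# Hypothetical failover routine mirroring the steps described above:
# prepare shared storage and networking, then start the application.
# Every value passed in is a placeholder used only for illustration.

def failover(service: str, device: str, mountpoint: str, vip: str, iface: str) -> None:
    # 1. Mount the shared filesystem the application needs.
    subprocess.run(["mount", device, mountpoint], check=True)
    # 2. Bring the service's network address up on this node.
    subprocess.run(["ip", "addr", "add", vip, "dev", iface], check=True)
    # 3. Start the application (and, in practice, any supporting services first).
    subprocess.run(["systemctl", "start", service], check=True)

# Example call with made-up values:
# failover("postgresql", "/dev/sdb1", "/var/lib/pgsql", "192.168.1.50/24", "eth0")
```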

HA clusters are often used for key databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.

HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is multiply connected via Storage area networks.

HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster. One subtle, but serious condition every clustering software must be able to handle is split-brain. Split-brain occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
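A minimal sketch of the kind of quorum test a node might apply when it loses its heartbeat links, assuming a simple majority rule; real cluster managers combine this with fencing (STONITH) to break ties.

```python
# Majority-quorum check: a node that cannot see most of the cluster must assume
# it is the isolated partition and must NOT start services on its own, because
# the other nodes may still be running them (the split-brain scenario above).

def have_quorum(reachable_peers: int, total_nodes: int) -> bool:
    return (reachable_peers + 1) > total_nodes / 2  # count this node itself

def on_heartbeat_loss(reachable_peers: int, total_nodes: int) -> str:
    if have_quorum(reachable_peers, total_nodes):
        return "take over services from the unreachable nodes"
    return "stand down; the other partition may still be serving clients"

# In a two-node cluster a dead link never yields a majority, which is why such
# clusters rely on fencing or a third tie-breaking quorum device.
print(on_heartbeat_loss(reachable_peers=0, total_nodes=2))
```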


Node configurations
The most common size for an HA cluster is two nodes, since that's the minimum required to provide redundancy, but many clusters consist of many more, sometimes dozens of nodes. Such configurations can sometimes be categorized into one of the following models:

* Active/Active — Traffic intended for the failed node is either passed onto an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.
* Active/Passive — Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.
* N+1 — Provides a single extra node that is brought online to take over the role of the node that has failed. In the case of heterogeneous software configuration on each primary node, the extra node must be universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally refers to clusters which have multiple services running simultaneously; in the single service case, this degenerates to Active/Passive.
* N+M — In cases where a single cluster is managing many services, having only one dedicated failover node may not offer sufficient redundancy. In such cases, more than one (M) standby servers are included and available. The number of standby servers is a tradeoff between cost and reliability requirements.
* N-to-1 — Allows the failover standby node to become the active one temporarily, until the original node can be restored or brought back online, at which point the services or instances must be failed-back to it in order to restore High Availability.
* N-to-N — A combination of Active/Active and N+M clusters, N to N clusters redistribute the services or instances from the failed node among the remaining active nodes, thus eliminating (as with Active/Active) the need for a 'standby' node, but introducing a need for extra capacity on all active nodes.
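For example, the N+M case above can be pictured as a pool of spare nodes that absorbs failures one at a time; the toy Python model below uses made-up node and service names purely for illustration.

```python
# Toy model of an N+M cluster: N primaries run one service each, and M standby
# nodes pick up the service of whichever primary fails, until the pool runs out.

primaries = {"node1": "db", "node2": "web", "node3": "mail"}  # N = 3
standbys = ["standby1", "standby2"]                           # M = 2

def fail_over(failed_node: str) -> str:
    service = primaries.pop(failed_node)
    if not standbys:
        raise RuntimeError("no standby capacity left")
    target = standbys.pop(0)
    primaries[target] = service
    return f"{service} restarted on {target}"

print(fail_over("node2"))  # web restarted on standby1
print(fail_over("node1"))  # db restarted on standby2
```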

The term Logical host or Cluster logical host is used to describe the network address which is used to access services provided by the cluster. This logical host identity is not tied to a single cluster node. It is actually a network address/hostname that is linked with the service(s) provided by the cluster. If a cluster node with a running database goes down, the database will be restarted on another cluster node, and the network address that the users use to access the database will be brought up on the new node as well so that users can access the database again.

Application Design Requirements

Not every application can run in a high-availability cluster environment, and the necessary design decisions need to be made early in the software design phase. In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:

* There must be a relatively easy way to start, stop, force-stop, and check the status of the application. In practical terms, this means the application must have a command line interface or scripts to control the application, including support for multiple instances of the application.
* The application must be able to use shared storage (NAS/SAN).
* Most importantly, the application must store as much of its state on non-volatile shared storage as possible. Equally important is the ability to restart on another node at the last state before failure using the saved state from the shared storage.
* The application must not corrupt data when it crashes, or when it restarts from the saved state.

The last two criteria are critical to reliable functionality in a cluster, and are the most difficult to satisfy fully. Finally, licensing compliance must be observed.
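As an illustration of the first requirement in the list above, a cluster-friendly application typically exposes a single control interface. The sketch below assumes the application is managed as a systemd unit named myapp, which is purely a placeholder; a real resource agent also handles multiple instances and richer exit codes.

```python
import subprocess
import sys

# Minimal start/stop/force-stop/status wrapper of the kind cluster software
# expects to call. "myapp" is a hypothetical systemd unit used only to
# illustrate the interface.

SERVICE = "myapp"

ACTIONS = {
    "start": ["systemctl", "start", SERVICE],
    "stop": ["systemctl", "stop", SERVICE],
    "force-stop": ["systemctl", "kill", SERVICE],
    "status": ["systemctl", "is-active", SERVICE],
}

def control(action: str) -> int:
    cmd = ACTIONS.get(action)
    if cmd is None:
        print(f"usage: {sys.argv[0]} {'|'.join(ACTIONS)}")
        return 2
    return subprocess.run(cmd).returncode  # 0 means success / service active

if __name__ == "__main__":
    sys.exit(control(sys.argv[1] if len(sys.argv) > 1 else ""))
```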

Node reliability
HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:

* Disk mirroring so that failure of internal disks does not result in system crashes
* Redundant network connections so that single cable, switch, or network interface failures do not result in network outages
* Redundant Storage area network or SAN data connections so that single cable, switch, or interface failures do not lead to loss of connectivity to the storage
* Redundant electrical power inputs on different circuits, usually both or all protected by Uninterruptible power supply units, and redundant power supply units, so that single power feed, cable, UPS, or power supply failures do not lead to loss of power to the system.
These features help minimize the chances that the clustering failover between systems will be required. In such a failover, the service provided is unavailable for at least a little while, so measures to avoid failover are preferred.
