Cloud Architecture Mistakes: Organizations Need Shorter Mean Time to Recovery

A common complaint about cloud computing is that operating in the cloud can get very expensive. In this five-part series, I’ve examined the costly cloud architecture mistakes organizations often make that contribute to those costs, and how an independent cloud platform can solve those problems. Part one explained how organizations can quickly lose visibility and control over their data processing. Part two looked at how a DIY approach can go wrong. Part three examined how easily costs can mount when organizations don’t have a cloud networking platform that enables intelligent costing and billing. And part four explained why an on-premise attitude toward security in the cloud weakens an enterprise’s defenses while contributing to mounting expenses.

Now, in part five, I’ll show how failing to take control of security and recovery – and relying too much on what a CSP provides – can greatly lengthen an organization’s mean time to recovery after an outage or failure.

There is no such thing as a perfect CSP or a perfect network. It is naïve to believe that a CSP will never go down or have any issues. Every CSP is susceptible to outages. A legitimate concern is that a higher mean time to recovery (MTTR) after an outage or failure incurs additional costs in lost business, lost productivity, and slower time-to-market.
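To make the cost argument concrete, MTTR can be computed as total downtime divided by the number of incidents. The sketch below is only an illustration: the incident timestamps and the hourly cost figure are made-up assumptions, not data from any real outage or CSP.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, service restored).
# These values are invented for illustration.
INCIDENTS = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),    # 90 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 45)),  # 45 minutes
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 21, 0, 0)),   # 105 minutes
]


def total_downtime(incidents):
    """Sum the duration of every incident."""
    return sum((end - start for start, end in incidents), timedelta())


def mean_time_to_recovery(incidents):
    """MTTR = total downtime / number of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return total_downtime(incidents) / len(incidents)


def downtime_cost(incidents, cost_per_hour):
    """Rough cost of downtime, assuming a flat hourly loss (an assumption)."""
    hours = total_downtime(incidents).total_seconds() / 3600
    return hours * cost_per_hour


mttr = mean_time_to_recovery(INCIDENTS)
cost = downtime_cost(INCIDENTS, cost_per_hour=10_000)  # assumed hourly loss

print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 80 minutes
print(f"Estimated downtime cost: ${cost:,.0f}")          # $40,000
```

Under these assumed numbers, every minute shaved off the average recovery time translates directly into lower downtime cost, which is why troubleshooting speed matters so much.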

Each CSP provides a different, inconsistent set of tools for troubleshooting and recovering from an outage or failure. The challenge is that CSP native services are offered on shared infrastructure, so they cannot offer advanced troubleshooting tools that provide the deep visibility enterprises demand. The black-box nature of these CSP constructs causes frustration and extends the time needed to troubleshoot and recover from a failure.

For instance, simple and familiar tools like ping, traceroute, and packet capture are typically unavailable within the CSPs’ native services. Even where something similar is available, it lacks the consistency and functionality needed to troubleshoot quickly.

A higher MTTR is extremely costly because an outage or security breach can keep a business down until the cause is determined and a solution is found. When your business runs in the cloud, any downtime is unacceptable.

Recommendations

Organizations can take several steps to avoid a lengthy MTTR:

Your DR strategy must include multiple regions and multiple clouds.
Adopt a cloud networking approach or platform that gives you advanced troubleshooting tools, where you own the network and security architecture with data in your control.
Network observability and visualization can help stop failures before they occur. Demand products that provide an application-level dashboard with a rich set of NetFlow data. CSP flow logs are costly and lack that richness.
Make operational visibility part of every requirement and request for proposal (RFP), including tools such as packet capture and application-level connectivity verification.
Ensure that your cloud networking vendor provides a cloud operation model that can extend to the edges and branches. This strategy reduces the toolset by letting you control and govern the landscape directly from the cloud, even for on-prem hybrid resources.
Demand intelligent cloud networking with embedded telemetry as part of the data plane.
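
As a rough idea of what application-level connectivity verification can look like, here is a minimal sketch that checks whether a TCP port is reachable and how long the connection takes. It uses only Python's standard library; the `ProbeResult` type and the `probe_tcp` and `probe_all` helpers are hypothetical names introduced for this example, not part of any CSP or vendor tool.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one connectivity check (hypothetical helper type)."""
    host: str
    port: int
    reachable: bool
    latency_ms: Optional[float]
    error: Optional[str]


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> ProbeResult:
    """Try to open a TCP connection and record how long it took."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.monotonic() - start) * 1000.0
            return ProbeResult(host, port, True, latency_ms, None)
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return ProbeResult(host, port, False, None, str(exc))


def probe_all(endpoints, timeout: float = 2.0):
    """Probe a list of (host, port) pairs, e.g. one per region or cloud."""
    return [probe_tcp(host, port, timeout) for host, port in endpoints]
```

Calling `probe_all` with a list of `(host, port)` pairs for the same application in each region or cloud gives a quick, consistent view of which paths are up, independent of any one CSP's tooling. A real platform would go much further, for example with packet capture and flow telemetry, but the principle is the same.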

Conclusion

Cloud adoption is necessary for any business that wants to remain relevant; maintaining an on-premise model is no longer viable in today’s economic environment. But if not done properly, the costs of operating a cloud architecture can quickly spiral out of control.

In this five-part series of articles, I’ve detailed the five key mistakes organizations make in building their cloud architectures, and how they can avoid those problems. A flexible, scalable cloud networking platform that supports all major CSPs can:

Ensure that you maintain visibility into and control over your data processing, building a cloud architecture that is consistent across single or multiple clouds.
Provide a centralized controller and distributed data plane, avoiding the chaotic mess that can result from a DIY approach. A networking platform will allow for features such as machine learning (ML) based network behavior analytics and anomaly detection, self-healing capabilities, and unified support for business-critical apps.
Provide deeper cost visibility across all lines of business (LOBs), laying a foundation for an intelligent costing and billing solution that covers shared and nonshared resources. When showback and chargeback options are missing, each team and LOB builds and orders services independently, substantially driving up costs.
Embed security into the data plane, allowing an organization to create and enforce intent-based security policies across multi-cloud environments. Taking an on-premise approach to security in the cloud is costly and inefficient, and also contributes to vulnerabilities.
Shorten an organization’s mean time to recovery (MTTR) by providing visibility across the environment, along with advanced troubleshooting tools, an application-level dashboard, and other tools that can proactively prevent failures. A long MTTR is expensive: revenue is lost for as long as the business is down, until the cause of a failure is detected and remediated.

I hope that these considerations help your organization achieve the visibility, flexibility, and scalability it needs to make the most of what cloud computing offers – without unduly driving up costs.
