Consulting

Technical reviews and consulting for organizations, including those designing for 99.999% uptime, multi-Region deployments, disaster recovery, or active-active systems.

Overview

Nathan Dye is available on an hourly basis for consulting on a wide range of topics, including distributed systems design, monitoring, resilience, and continuous deployment processes. Your organization can also choose from the example offerings below.

Resilience Risk Analysis and Design Review

Nathan’s methodology starts with collecting the customer application’s service level objectives, including uptime, data durability, recovery point objectives (RPO), and recovery time objectives (RTO). Nathan then assesses the application’s design and recovery mechanisms against five categories of failure that cover nearly every type of failure a distributed system may encounter. Nathan makes recommendations on monitoring and alerting, fault isolation, blast radius reduction, fault recovery mechanisms, and designs that prevent faults from turning into service impacts. Nathan uses insider knowledge of AWS system design to recommend which AWS service functions to use, and how to use them, when taking critical service dependencies. Finally, Nathan makes recommendations about when a multi-AZ, multi-Region, multi-cloud, or hybrid solution may be required to meet the resilience and performance requirements of a customer application.
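As a rough illustration of what an uptime objective implies, the sketch below converts an availability target into an allowed-downtime budget. The function name and the 365-day year are assumptions made for this example; it is not part of Nathan’s methodology.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime, in minutes, that an availability target allows over a period.

    Hypothetical helper for illustration; assumes a 365-day year by default.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Under these assumptions, a 99.999% target leaves a little over five minutes of downtime per year, which is why detection and recovery at that level usually need to be automated.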

Application Monitoring and Alarming Review

Nathan will assess a collection of metrics and alarms against an existing application’s stated recovery time objectives (RTOs). Nathan will identify risk areas and improvement opportunities to ensure metrics and alarms are properly configured for automated or operator-driven failure recovery as appropriate. Alternatively, Nathan can recommend a set of metrics and alarms during the design phase of an application to guide upcoming development and operations ahead of a service launch.
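One simple check of this kind can be sketched as follows: does the time an alarm takes to fire, plus the expected recovery time, fit inside the RTO? The `AlarmConfig` type, its fields, and the additive model are assumptions made for this example; real alarms also incur metric publishing and notification delays that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class AlarmConfig:
    """Hypothetical alarm settings: it fires after N consecutive breaching periods."""
    period_seconds: int
    evaluation_periods: int


def detection_seconds(alarm: AlarmConfig) -> int:
    """Approximate time for the alarm to fire once a failure starts (simplified model)."""
    return alarm.period_seconds * alarm.evaluation_periods


def fits_rto(alarm: AlarmConfig, recovery_seconds: int, rto_seconds: int) -> bool:
    """Return True if detection time plus recovery time is within the RTO."""
    return detection_seconds(alarm) + recovery_seconds <= rto_seconds


if __name__ == "__main__":
    alarm = AlarmConfig(period_seconds=60, evaluation_periods=3)
    # Three 60-second periods to detect, 90 seconds to recover, 5-minute RTO.
    print(fits_rto(alarm, recovery_seconds=90, rto_seconds=300))
```

An alarm that needs three one-minute periods to fire has already used 180 seconds of a 300-second RTO before any recovery begins.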

Vendor or Technology Assessment

Most organizations must depend on services run by others, such as DNS or cloud providers, on third-party software for databases or container management, or on other organizations. Increasingly, organizations are migrating critical workloads to the cloud. Each cloud vendor offers hundreds of services, each complex in its own right. How do you know if your dependencies are prepared to meet your unique resilience requirements?

Nathan’s 10 years of experience working at AWS have given him insider knowledge of cloud-scale services. Nathan can help you create the right set of questions to ask your cloud or other SaaS vendors to determine whether a technology or service provider is prepared to meet your unique resilience requirements. Nathan can also guide you in designing mitigations to compensate for third-party solutions that don’t meet those requirements.

Review of Root Cause Analysis and Corrective Actions

Failures in distributed systems are inevitable. Each failure is an opportunity to learn more about your system, both when it behaved as expected and especially when it didn’t. Nathan has 20 years of experience in root cause analysis and in developing corrective actions for distributed systems failures. Nathan can help your organization establish root cause analysis processes or provide a deep-dive technical review of a specific failure.