Disaster Recovery Management Company 
Workshop White Paper

A DRMco White Paper

 

Disaster Recovery Workshop 

By Gregory P. Smolla

Disaster Recovery Manager

 


Contents

 

Disaster Recovery Primer                   

Resiliency Laws Requirements            

Business Impact Analysis                 

Gap Analysis                                     

Disaster Recovery Management        

Disaster Recovery Workshop             

 


Disaster Recovery Primer

Professionals in almost any field will attest that 90% of a project’s success is determined in the planning phase.

 

Because many organizations can simply no longer re-create computer-driven transactions on paper, their information technology infrastructure has become the lifeblood of their resilience in the face of disaster.

 

Since September 11, many regulatory boards have published recommendations for sound disaster recovery practices, some of the more prominent of which are discussed in this white paper.

 

Federal regulations have removed the corporate shield that C-suite executives and board members once enjoyed, not only with respect to accurate bookkeeping but also with respect to forensically genuine, time-stamped digital correspondence, most notably historical email. Violations have sometimes resulted in (intentionally well-publicized) jail time for officers of publicly held companies.

 

Legalities aside, regulations and recommendations exist to guide you through diligently planned contingency operations, so that your business and employees have the least possible exposure to crises such as natural disasters, terrorist acts, pandemics, and a host of resource failures (utility failures, inaccessibility, cyber attack, sabotage, fire).

 

As the Disaster Recovery (DR) industry has matured, new return on investment (ROI) leverage points have emerged that simultaneously enable current-era “green” initiatives, asset repurposing, and geographical dispersion via de-duplicated data and WAN acceleration.

Resiliency Laws Requirements

Outage tolerance: what are the legal consequences of business downtime? Laws may limit allowable downtime in certain verticals, such as financial markets, because of systemic risk.

 

Largely as a result of September 11, an interagency paper (Federal Reserve System Docket No. R-1128) was published in 2003 specifying sound practices to strengthen the resilience of the U.S. financial system and authorizing the OCC to take action against banks that fail to comply with its DR requirements. A key requirement states outage tolerances of 2 hours or less for key interrelated financial clearing systems that pose systemic risk. Out-of-region vs. in-region: geographical dispersion of disaster recovery capability is spelled out, with considerations for power grid and network resources, so as to lessen the likelihood that a regional outage significantly affects the financial system.

 

Liability: the Federal Financial Institutions Examination Council (FFIEC) Handbook, 2003-2004 (Chapter 10), specifies that directors and managers are accountable for organization-wide contingency planning and for "timely resumption of operations in the event of a disaster."

 

Sarbanes-Oxley (SOX) establishes audit retention policies, data integrity standards, and unstructured data liability.

 

The Expedited Funds Availability (EFA) Act of 1989 requires federally chartered financial institutions to have a demonstrable business continuity program (BCP) to ensure prompt availability of funds.

 

All government entities that operate utilities are required by Governmental Accounting Standards Board (GASB) Statement No. 34 to ensure that agency missions continue in time of crisis.

 

The North American Electric Reliability Council (NERC 1300) requires definition and documentation of recovery plans for critical cyber assets under its critical infrastructure protection standard (CIP-009-1).

 

The Federal Energy Regulatory Commission (FERC) RM01-12-00 (Appendix G) mandates detailed and auditable recovery plans.

 

The Federal Information Security Management Act (FISMA) of 2002 (Public Law 107-347, 17 December 2002) is a federal statute on information security that addresses business resumption and data security.

 

Federal Preparedness Circular 69, covering continuity of operations (COOP) and continuity of government (COG), establishes minimum planning considerations for federal government operations. It states that a business continuity program (BCP) must be maintained at a high level of readiness, must be capable of implementation with or without warning, must be operational within 12 hours of activation, and must be capable of sustained operations for up to 30 days.
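
The timing limits quoted above lend themselves to a simple automated check. The following is a minimal sketch in Python, not an official compliance tool: the `CoopPlan` record, the `coop_shortfalls` function, and the sample figures are all hypothetical and exist only for illustration. Only the 12-hour and 30-day limits come from the text.

```python
from dataclasses import dataclass

# Limits quoted in the paragraph above.
MAX_HOURS_TO_OPERATIONAL = 12
MIN_DAYS_SUSTAINED = 30


@dataclass
class CoopPlan:
    """Hypothetical summary of what a plan demonstrated in its last test."""
    name: str
    hours_to_operational: float
    days_sustainable: float
    tested_without_warning: bool


def coop_shortfalls(plan: CoopPlan) -> list[str]:
    """Return the planning considerations the plan does not yet meet."""
    shortfalls = []
    if plan.hours_to_operational > MAX_HOURS_TO_OPERATIONAL:
        shortfalls.append("not operational within 12 hours of activation")
    if plan.days_sustainable < MIN_DAYS_SUSTAINED:
        shortfalls.append("cannot sustain operations for 30 days")
    if not plan.tested_without_warning:
        shortfalls.append("not proven for activation without warning")
    return shortfalls


# Illustrative figures only (assumed, not taken from any real plan).
example = CoopPlan("payroll", hours_to_operational=18,
                   days_sustainable=45, tested_without_warning=True)
print(coop_shortfalls(example))
```

An empty list means no shortfall was found against these three criteria; it does not by itself demonstrate a "high level of readiness," which remains a matter of judgment and audit.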

 

The National Institute of Standards and Technology (NIST) Special Publication (SP) 800-34, Contingency Planning Guide for Information Technology Systems, joins the NIST SP 800 series (parts 3, 4, 12, 14, 26, and 18) in detailing contingency, DR, and COOP plans. NIST SP 800-53A provides assessment guidelines for SP 800-53, which gives specific requirements for contingency planning policy and procedures, contingency plan testing, plan training, and plan updates. NIST SP 800-60 is a guide for mapping types of information and systems to security categories.

 

The Health Insurance Portability and Accountability Act (HIPAA) of 1996 requires a data backup plan and an emergency mode operation plan.

 

The Food and Drug Administration (FDA) Code of Federal Regulations (CFR) Title 21, 1999, establishes the requirements for electronic records and electronic signatures, often forcing the update of BC plans to ensure availability of information.

 

The International Organization for Standardization (ISO) 17799 guidance on Business Continuity Management (BCM) can be used as a guide to satisfying most federal and state mandated BCP requirements.

 

Because a host of regulatory bodies have a bearing on Disaster Recovery, it is imperative to understand which ones pertain to your particular business vertical. Relevant domestic frameworks, publications, and agencies include COBIT, DRJ, SOX, and FEMA. International guidance includes the AS/NZS 4360 Risk Management Guide, the BCI Good Practice Guidelines, and BS 25999.

 

While satisfying laws and regulations will help prevent legal liability, they should be treated as minimum guidelines; in many cases they will not guarantee the resilience of your business in the wake of a disaster. A business impact analysis (BIA) that identifies the information technology (IT) infrastructure and business operations processes required to sustain business during and after crises (while incorporating the laws applicable to your business sector) is paramount to ensuring the survival of your organization.

 

Business Impact Analysis

Revenue loss during a business outage is a key factor in justifying an organization’s resiliency policy. At a basic level, single points of failure are the natural starting point. Beyond local data center infrastructure components such as servers and firewalls, the prudent DR plan looks holistically at business operations, expanding from single points of failure to broader failure types: entire data centers, workplace inaccessibility (pandemic or terrorist attack), and regional outages (power grid failure or sabotage), along with a study of systemic exposures from supply chain and e-commerce clearing components.
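
As a first approximation, the revenue exposure of an outage can be estimated with simple arithmetic. The sketch below is a deliberately crude model, assuming revenue is earned evenly across the year; the function name `outage_revenue_loss`, its parameters, and the `dependency` factor (the share of revenue that actually stops during the outage) are hypothetical conveniences, not a prescribed BIA method.

```python
def outage_revenue_loss(annual_revenue: float,
                        outage_hours: float,
                        revenue_hours_per_year: float = 8760.0,
                        dependency: float = 1.0) -> float:
    """Estimate revenue lost during an outage.

    Assumes revenue accrues evenly over `revenue_hours_per_year`
    (8760 = 24 hours x 365 days) and that `dependency` (0.0 to 1.0)
    is the fraction of revenue that halts while the system is down.
    """
    if revenue_hours_per_year <= 0:
        raise ValueError("revenue_hours_per_year must be positive")
    if not 0.0 <= dependency <= 1.0:
        raise ValueError("dependency must be between 0 and 1")
    hourly_revenue = annual_revenue / revenue_hours_per_year
    return hourly_revenue * outage_hours * dependency


# Illustrative figures only: 8.76M annual revenue, 4-hour outage.
print(outage_revenue_loss(8_760_000, outage_hours=4))
```

A real BIA would refine this with seasonality, SLA penalties, and the harder-to-measure market share effects discussed next.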

 

Market share and competitive service level agreement (SLA) differentiators are not as easily measured but certainly factor into a BIA.  In many organizations, the awareness of vulnerability to disaster at a business executive level is far different from the reality the technical staff understands.  The Disaster Recovery Workshop (DRW) uncovers and closes this gap.

 

Gap Analysis

With business process outage tolerances defined by the BIA results, business processes are mapped to technology resources. These resources, including data center elements (e.g. storage, servers, routers, appliances), entire data centers, the network, the workplace, and the supply chain, are examined for resiliency capability. The gap between current capability and BIA-mandated requirements determines resource adjustments and the disaster recovery design. Decision points that factor into these adjustments include ledger (asset obsolescence: does it cost more to maintain older equipment?), latest performance technology (is N-1 technology good enough, and at a significant price break?), production leverage (asset repurposing, e.g. creating a Citrix remote access grid out of a test-dev server farm for employees’ remote access during inaccessibility disasters), technical support (staff re-training, trouble tickets), and recovery capability location. A financial understanding of assets will identify whether to lease, purchase, host, re-purpose, time-share (a commercial hot site whose common resource pool is shared at time of disaster), drop-ship, or build internal fail-over capability.
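
The core comparison in a gap analysis, current recovery capability against the BIA-mandated outage tolerance, can be sketched in a few lines. This is an illustrative sketch only: the `Resource` record, the `recovery_gaps` function, and the hour figures are hypothetical, and a real analysis would weigh the financial decision points above rather than hours alone.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical technology resource mapped to a business process."""
    name: str
    tolerance_hours: float  # outage tolerance mandated by the BIA
    recovery_hours: float   # current measured or estimated recovery time


def recovery_gaps(resources: list[Resource]) -> dict[str, float]:
    """Return resources whose recovery time exceeds tolerance.

    Maps resource name to the shortfall in hours, largest gap first.
    """
    gaps = {
        r.name: r.recovery_hours - r.tolerance_hours
        for r in resources
        if r.recovery_hours > r.tolerance_hours
    }
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


# Illustrative figures only.
inventory = [
    Resource("storage", tolerance_hours=2, recovery_hours=8),
    Resource("network", tolerance_hours=4, recovery_hours=1),
    Resource("workplace", tolerance_hours=24, recovery_hours=48),
]
print(recovery_gaps(inventory))
```

Sorting by the size of the shortfall gives a rough first ordering of where resource adjustments are most urgent.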

 

The dot-com era, while initially failing from a business model perspective, gave us mature and affordable network clustering and load balancing technologies while driving down storage, server, and network circuit pricing. Combining these with current data replication schemes and on-demand automated resource management helps to close the gap between required and actual disaster recovery capability.

 

Disaster Recovery Management

If left unattended, almost everything in the universe is eventually overtaken by the forces of nature. Consider the anatomy of a business application, which typically comprises software code, an object model, a repository, an audit trail, a membership model, a security model, and a database, with multiple interdependent applications served up via a portal that integrates them. Multiple applications mean multiple release cycles, membership models, security models, databases, object models, and audit trails, along with isolated repositories, isolated platform certifications, and multiple synchronization points with fragile integration and fragile architecture. Multiple synchronization points across all these systems increase the risk of system failure and complicate backup and recovery. These systems are also released with different language packs.

 

This is why you have database administrators, platform specialists, storage, server, application, network and security staff, with constant management meetings between them to guide the IT infrastructure.  Each element represents a systemic effect on the others, none of which can be left unattended.  In the event of a disaster, interdependent elements must be recovered with appropriate synchronization.
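
One way to reason about recovering interdependent elements in a workable order is to record what each element depends on and derive a sequence in which every dependency comes back before the things that rely on it. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the `recovery_order` function and the element names in the example are hypothetical, and real recovery also involves data synchronization that ordering alone does not address.

```python
from graphlib import CycleError, TopologicalSorter


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which each element follows its dependencies.

    `depends_on` maps an element to the set of elements that must be
    recovered before it. Raises graphlib.CycleError if the dependencies
    are circular, which itself signals a design problem.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Illustrative dependency map (assumed, not from any real environment).
dependencies = {
    "portal": {"application"},
    "application": {"database", "network"},
    "database": {"storage"},
    "storage": set(),
    "network": set(),
}
print(recovery_order(dependencies))
```

Elements with no dependency between them (here, storage and network) may appear in either order, which also indicates where recovery work can proceed in parallel.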

 

Managing a disaster recovery plan means keeping IT infrastructure elements and recovery/fail-over processes and procedures continuously refreshed. Strategy considerations include ledger (asset refresh, replace, and leverage), fixed recurring costs, compliance, business vertical, geographical dispersion, and outage tolerance.

 

During a disaster, disaster recovery management provides continuity and coordination across internal elements and outside resources such as alternate hosted sites, equipment allocation, user community access to business applications, key employee notification, and participant roles and responsibilities.

 

Regular testing to measure actual vs. target resiliency capability provides audit and compliance satisfaction.

 

Documented disaster recovery plans identify recovery architecture, process, people, alert mechanism, and back-up personnel. Continuous updates prevent obsolescence and ensure your business remains protected.

 

Disaster recovery management participates in the infrastructure design of new applications to assure compliance with corporate disaster recovery strategy. A continuous disaster recovery management program is as dynamic as the IT infrastructure itself, both locally and systemically: it tracks the supply chain, e-commerce, refreshes to third-party contracted capabilities, and emerging technology efficiencies.

 

The right combination of recovery options will be unique to each organization; one size does not fit all. Hosted applications may represent an ideal strategy for some business requirements, such as email, where DR is the responsibility of the hosting entity, while other processing applications specific to a business vertical may not be as portable.

 

A “hybrid” design might include redundant local high availability (HA) components such as RAID, clusters (network, storage, servers) and multiple egress points, combined with regionally dispersed hot-site, hosted and on-demand capacity.

 

Disaster Recovery Workshop (DRW)

The disaster recovery workshop is a series of sessions that identify and parse out the requirements for sustaining business in the event of a disaster. Multiple scenarios are defined, from localized element failure to regional catastrophes. Business impact outage tolerances are mapped to resiliency designs. Disaster recovery policies are defined and published. Workplace personnel are identified and their roles and responsibilities defined. The disaster activation process is defined, and the key personnel authorized to declare a disaster are identified. A disaster recovery theory of operations is developed in concert with IT staff, management, and executives.

 

Intelligent use of the array of IT disaster recovery tools and capabilities available today uncovers innate production efficiency leverage points, accelerating ROI while protecting your investment and delivering the right disaster recovery solution at the right price. For example, any-to-any data replication appliances keep you from being locked into a specific platform, while virtualized asset repurposing with database clones allows parallel processing, time-of-day processing, and priority repurposing of alternate site server resources during a disaster. Understanding which vendor technology adheres to your industry-specific regulations allows you to satisfy compliance while protecting your business as resiliency nirvana begins to surface. Success, after all, lies in the diligence of the planning.


 