ROYL BLOG

It’s truly remarkable how much of a failure this was from an ops perspective. Bypassing client rules for staging and force pushing directly to production without adequate test coverage or protections in place to ensure a smooth deployment? Wild.

Lessons to be learned from this?

Test
Test again
Try rolling releases? I dunno, anything but forcing a risky update to all of your clients
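
For what "rolling releases" could look like in practice, here is a rough sketch. It is not how CrowdStrike or any particular vendor does it. The stage percentages, the failure threshold, and the names `bucket`, `rollout`, and `deploy` are all made up for illustration. The idea: hash each client into a stable bucket, push the update to a small slice first, and stop if too many of those deployments fail.

```python
import hashlib

# Hypothetical values: percentage of clients covered at each stage,
# and the failure rate at which we stop rolling out.
STAGES = [1, 10, 50, 100]
MAX_FAILURE_RATE = 0.02


def bucket(client_id: str) -> int:
    """Map a client ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def rollout(clients, deploy, stages=STAGES, max_failure_rate=MAX_FAILURE_RATE):
    """Deploy stage by stage and halt if a stage fails too often.

    `deploy(client_id)` is supplied by the caller and returns True on success.
    Returns (completed, updated_clients).
    """
    updated = set()
    for percent in stages:
        # Clients in this stage that have not been updated yet.
        batch = [c for c in clients if c not in updated and bucket(c) < percent]
        failures = 0
        for client in batch:
            if deploy(client):
                updated.add(client)
            else:
                failures += 1
        if batch and failures / len(batch) > max_failure_rate:
            # Stop here; later stages never receive the update.
            return False, updated
    return True, updated
```

With a bad update, only the clients in the first non-empty stage ever see it, which is a much smaller blast radius than everyone at once.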

Hindsight being what it is, it may be easy to say “Oh, you just need to have protections in place so this sort of thing cannot happen,” but it sounds like CrowdStrike did and deliberately chose to ignore them. Or maybe they just had a gap in those protections and unfortunately found out the hard way. Either way, a catastrophic failure on CrowdStrike’s part. Rebuilding trust is hard.

It’s a good goal to aim for resiliency in your stack and protection from your third-party vendors. Being at the mercy of another company for your mission-critical operations feels real bad, so try to mitigate the impact a failure on their part can have. Easier said than done, for sure, but still something to aim for.

Have backups and a rollback strategy (for when that’s possible).

#hugops for everyone still dealing with this.