Three Calls, Three Ways Stores Break in Silence

Three calls in one day, three different stores, and by the third one I was hearing the same story with different nouns. None of these businesses had been hacked. Nothing had crashed. No alarm had gone off anywhere. Each one had simply been broken for months, quietly, while everyone assumed the machine was running.

Company A had broken links across their site for months, and traffic had dropped by half. The cause was an automated translation tool that changed the URLs for a few categories. The tool did its job, the URLs changed, the old ones broke, and nobody connected the traffic decline to the rollout.

Company B’s sitemap on the live domain had not worked for about eight months. The cause was a cron job that was never transferred when they moved to a new server. The migration succeeded, every visible thing worked, and one invisible scheduled task stayed behind on a machine that no longer existed.

Company C’s caching had been off for a long time. The cause was an expired licence. The site did not go down. It just got slower and more expensive, one uncached request at a time.

[Infographic: “Three Failures, One Pattern” (MageCloud Silent Failure Note). Company A: broken links, traffic cut in half. Company B: sitemap dead for 8 months. Company C: caching off, licence expired. Paul Ryazanov, MageCloud.]

Why Nothing Alerted Anyone

The common thread in all three stories is that every failure happened in the gap between “down” and “fine.” Monitoring, where it existed at all, was watching for outages: is the site up, does the homepage load. All three sites were up the entire time. The failures lived one layer deeper, in the category of things that degrade rather than crash, and that layer is invisible unless something is deliberately looking at it.

Each failure also had a perfectly innocent trigger. A tool update, a server migration, a billing lapse. No one made an obvious mistake. Company B’s migration team moved everything anyone could see; the cron job was doing its work on a schedule precisely so that no human would have to think about it, which meant no human did. This is how real stores actually break. Not dramatically, but silently, through the boring plumbing, while the dashboards stay green.

The Checks That Would Have Caught All Three

What makes these stories painful is how cheap the detection would have been. A scheduled crawl of the site catches the broken category URLs within days of the translation tool creating them, instead of months later via the revenue chart. A weekly glance at Google Search Console (GSC) catches the dead sitemap almost immediately, because GSC surfaces sitemap fetch errors on its own; someone only has to look. A response-time check, or even a calendar reminder tied to the licence renewal, catches the silent cache failure before it costs a whole quarter’s worth of slow pages.
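To make “cheap” concrete, here is a minimal sketch of those three checks in Python, using only the standard library. It is an illustration under assumptions, not the tooling we run for clients: the function names, the two-second slowness threshold, and the idea of feeding it a saved list of URLs that used to work (last month's sitemap, say, or your top landing pages) are all mine. It fetches a sitemap and confirms it parses and lists URLs, then requests each known URL and flags anything that fails or answers slowly.

```python
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, NamedTuple


class Response(NamedTuple):
    status: int     # HTTP status code, or 0 if the request never completed
    body: bytes
    seconds: float  # wall-clock time for the whole request


def http_fetch(url: str, timeout: float = 10.0) -> Response:
    """Fetch a URL and time it. Redirects are followed, so a working
    redirect from an old URL to a new one still counts as status 200."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:  # 4xx and 5xx responses
        status, body = err.code, b""
    except OSError:  # DNS failure, refused connection, timeout
        status, body = 0, b""
    return Response(status, body, time.monotonic() - started)


Fetch = Callable[[str], Response]


def check_sitemap(sitemap_url: str, fetch: Fetch = http_fetch) -> list[str]:
    """Return a list of problems with the sitemap (empty list means healthy)."""
    resp = fetch(sitemap_url)
    if resp.status != 200:
        return [f"sitemap: {sitemap_url} returned status {resp.status}"]
    try:
        root = ET.fromstring(resp.body)
    except ET.ParseError:
        return [f"sitemap: {sitemap_url} is not valid XML"]
    # Tags carry an XML namespace prefix, so compare only the local name.
    locs = [el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "loc"]
    if not locs:
        return [f"sitemap: {sitemap_url} lists no URLs"]
    return []


def check_pages(urls: Iterable[str], fetch: Fetch = http_fetch,
                slow_after: float = 2.0) -> list[str]:
    """Flag URLs that no longer answer 200, or that answer slowly."""
    problems = []
    for url in urls:
        resp = fetch(url)
        if resp.status != 200:
            problems.append(f"broken: {url} returned status {resp.status}")
        elif resp.seconds > slow_after:
            problems.append(f"slow: {url} took {resp.seconds:.1f}s")
    return problems
```

Run on a schedule, with the output mailed to a named person, something this small would plausibly have flagged all three stores. The fetcher is passed in as a parameter so the checks can be exercised against canned responses before they are pointed at a live site.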

None of this requires an enterprise observability stack. It requires a short list of recurring checks, each with an owner’s name attached: crawl monthly, read Search Console weekly, verify scheduled jobs after every migration, inventory the licences and their renewal dates. It is the same discipline I apply to my own properties whenever I get a spare half hour in an airport, pointed at the layer where these three stores broke.
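The licence inventory can be as plain as a list of names, owners, and dates. The sketch below is one hypothetical shape for it in Python; the `RecurringItem` fields and the 30-day warning window are my assumptions for illustration, not a standard. It reports anything already expired or renewing soon, earliest first.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RecurringItem:
    name: str        # what renews: a cache licence, a certificate, a domain
    owner: str       # the person who gets asked when it lapses
    renews_on: date


def due_soon(items: list[RecurringItem], today: date,
             warn_days: int = 30) -> list[RecurringItem]:
    """Items already expired or renewing within warn_days, earliest first."""
    cutoff = today + timedelta(days=warn_days)
    flagged = [item for item in items if item.renews_on <= cutoff]
    return sorted(flagged, key=lambda item: item.renews_on)


def describe(item: RecurringItem, today: date) -> str:
    """One human-readable line per flagged item."""
    days = (item.renews_on - today).days
    if days < 0:
        return f"{item.name}: EXPIRED {-days} days ago (owner: {item.owner})"
    return f"{item.name}: renews in {days} days (owner: {item.owner})"
```

The owner field is the part that matters most. A date with nobody attached to it is how Company C's licence lapsed without anyone noticing.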

After Every Change Is When Things Break

A second lesson hides inside Company B specifically. Failures cluster around changes. The migration was the moment the cron job died, the tool rollout was the moment the URLs broke, and in both cases the project was declared successful because the checklist covered what the project touched, not what the project might have orphaned.

The fix is a habit I push on every team we work with: every meaningful change ends with a verification pass of the invisible layer. After a migration, confirm the scheduled jobs run on the new machine. After a tool rollout that touches content, crawl the affected sections. After a replatform, watch indexation daily, because go-live day is exactly when years of traffic can evaporate. The visible site lies to you. It looks finished the moment the styling loads. The invisible site is where the next eight-month failure is starting right now.
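For the migration case, the verification pass can start as a simple diff of scheduled jobs between the old machine and the new one. The sketch below is a minimal Python version under assumptions: it takes the text of each crontab (for example, the output of `crontab -l` saved from both servers before the old one is switched off), and the function names are hypothetical. It only shows which entries failed to make the move. Confirming that a job actually runs on the new machine still needs a look at its output or logs.

```python
def cron_entries(crontab_text: str) -> set[str]:
    """Normalise a crontab into a set of entries, ignoring comments,
    blank lines, and differences in spacing."""
    entries = set()
    for raw in crontab_text.splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def missing_after_migration(old_crontab: str, new_crontab: str) -> list[str]:
    """Entries present on the old server that do not exist on the new one."""
    return sorted(cron_entries(old_crontab) - cron_entries(new_crontab))
```

An empty result means every entry from the old crontab exists on the new server. Anything listed is a candidate for the next eight-month failure. System-wide jobs, such as files under /etc/cron.d, live outside a user's crontab and would need the same comparison.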

It Takes 24 Seconds to Find Out

The original version of this post ended with an offer, and the offer stands. It takes about 24 seconds to fill out our site audit form, and the audit is exactly the look that would have caught all three of these failures early: the crawl, the Search Console read, the infrastructure sanity check, a fresh pair of eyes on the layer you have stopped seeing.

I can keep counting these stories, and every month adds more. The only question is whether your store becomes one of them before or after somebody looks. If you would rather it was before, get in touch and we will run the checks this week.

Related reading: Why Hosting Support Should Open the Ticket First. The six-hour outage on our own site that makes the monitoring case from the vendor side.

Related reading: How to Fix Shopify’s Collections Sitemap in GSC. The two-minute fix for the newest member of the silent sitemap failure family.