Where I worked we had a very important, time-sensitive project. The server had to do a lot of calculations on a terrain dataset that covered the entire planet.
The server had a huge amount of RAM and each calculation block took about a week. The results could not be saved until the end of the calculation, and only that server had the RAM to do the work. So if it went down we could lose almost a week's work.
The project was due in 6 months and calculation time was estimated to be about 5 1/2 months. So we couldn't afford any interruptions.
We had bought a huge UPS meant for a whole server rack, just for this one server. It could keep the server up for three days. That way, even if we lost power over the weekend, it would keep going and we would have time to buy a generator.
One Friday afternoon the building loses power and I go check on the server room. Sure enough, the big UPS with a sign saying "only for project xyz" has a bunch of other servers plugged into it.
I quickly unplug all but ours. I tell my boss and we go home at 5. Later that day the power comes back on.
On Monday there are a ton of departments bitching that they came in and their servers were unplugged. Lots of people wanted me fired. My boss backed me and nothing happened, but it was stressful.
I'd be super gluing those plastic toddler plug covers all over that thing.
fuck those other departments.
You’d be surprised what comes with inheriting tech debt. Quite often there’s no documentation, and the last person to log in to the system is an admin who quit 3 years ago. That doesn’t tell you much, though, because it only covers direct console logins, which normal users don’t do when accessing the application. With the tribal knowledge gone and no documentation, it's only when you pull the network for a bit that you discover there was this one random script running on it that was responsible for loading all the needed data into the current system, even though 9 times out of 10 those scripts are no longer needed.
In a perfect world you’d have documentation, architecture and data flow diagrams for everything, but “ain’t nobody got time for that” and it doesn’t happen.
Had that the other way around recently. A Docker container failed to come back up after I had updated the host OS.
Was about ready to restore the snapshot when I looked further back in the logs on a hunch.
Turns out that container hadn't worked before the update either. The software's developer is long gone, and no one could tell me what it was supposed to be doing.
Company A gets bought by Company B. Company B fires 50% of Company A.
Even a scream test won't get you answers, because nobody is left who could complain or who knows where the docs are.
You'd be surprised. I had some security devices that I was actively using get shut down simply because some paperwork didn't get filled out properly and the data center team claimed they had no documentation on them.