What's the worst way you ever broke production?

RacerX@lemm.ee to Ask Lemmy@lemmy.world – 168 points –

Fess up. You know it was you.


One time I was deleting a user from our MySQL-backed RADIUS database.

DELETE * FROM PASSWORDS;

And yeah, if you don't have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.

That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.

Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.

Always skeptical of people that don't own up to mistakes. Would much rather they own it and speak to what they learned.

This is what I was told when I started work. If you make a mistake, just admit to it. They most likely won't punish you for it if it wasn't out of pure negligence

It's difficult because you have a 50/50 of having a manager that doesn't respect mistakes and will immediately get you fired for it (to the best of their abilities), versus one that considers such a mistake to be very expensive training.

I simply can't blame people for self-defense. I interned at a 'non-profit' where there had apparently been a revolving door of employees being fired for making entirely reasonable mistakes and looking back at it a dozen years later, it's no surprise that nobody was getting anything done in that environment.

Incredibly short-sighted, especially for a nonprofit. You just spent some huge amount of time and money training a person to never make that mistake again, why would you throw that investment away?

I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the middle east. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn't figure out.

Normally, we got our issues in tickets forwarded to us from the individual base's Communications Squadron (the IT squadron at a base). But one day, we got a call from the commander of a base's Comm Sq. Apparently, every user account on the base had disappeared and he needed our help restoring accounts!

The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?

We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base's tree, not an individual office or squadron.

As his rank implies, he's supposed to be the technical expert in his field. But this guy was an idiot who shouldn't have been touching user accounts in the first place. Managing user accounts is an Airman job; a simple job given to our lowest-ranking members as they're learning how to be sysadmins. And he couldn't even do that.

It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.

I worked for a company where the testing database was also the only backup.

I always put the WHERE clause first since a fuck up in my early 20s lost a loans company £40k of business.

My trick is writing it as a SELECT statement first, making sure it's returning the right number of records, and then switching out the SELECT for DELETE. Hasn't steered me wrong yet.
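For anyone who wants that habit enforced rather than remembered, here is a minimal sketch in Python using the standard-library sqlite3 module. SQLite stands in for MySQL, and the `passwords` table, its columns, and the `delete_checked` helper are all made up for illustration.

```python
import sqlite3


def delete_checked(conn, table, where_sql, params, expected_rows):
    """Count matching rows first; delete only if the count is what we expect."""
    # Hypothetical helper. Table name and WHERE fragment are interpolated, so
    # they must come from trusted code; values go through placeholders.
    count_sql = f"SELECT COUNT(*) FROM {table} WHERE {where_sql}"
    (matched,) = conn.execute(count_sql, params).fetchone()
    if matched != expected_rows:
        raise ValueError(f"expected {expected_rows} row(s), WHERE matches {matched}")
    with conn:  # commits on success, rolls back if the DELETE raises
        conn.execute(f"DELETE FROM {table} WHERE {where_sql}", params)
    return matched


# Demo data (assumed schema, not the original RADIUS database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passwords (username TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO passwords VALUES (?, ?)",
    [("alice", "a"), ("bob", "b"), ("carol", "c")],
)
conn.commit()

deleted = delete_checked(conn, "passwords", "username = ?", ("bob",), expected_rows=1)
remaining = conn.execute("SELECT COUNT(*) FROM passwords").fetchone()[0]
```

The point of passing `expected_rows` is that a WHERE clause matching everything fails the count check before any DELETE is issued.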

Accidentally deleted an entire column in a police department's evidence database early in my career 😬

Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets. Spent two days rebuilding that.


Did you know that "Terminate" is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn't.

Explain more?

Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.

Crappy company, running production infrastructure in AWS without giving proper training and securing a suitable backup process.

"Stop" is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you're done testing or developing something for the day but you'll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it's just like power cycling a computer.

"Terminate" is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It's the "nuke it from orbit" option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you're very, very lucky, you restore from backups (or an AMI).

Apparently Terminate means stop and destroy. Definitely something to use with care.

Maybe there should be some warning message... Maybe a question requiring you to manually type "yes I want it" or something.

Maybe an entire feature that disables it so you can't do it accidentally, call it "termination protection" or something

It doesn't help that the webui used to hide stop. I think it still does.
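For what it's worth, termination protection can also be switched on from code. A minimal sketch, assuming boto3 and configured AWS credentials; `protect_and_stop` is a hypothetical helper and the instance ID in the usage comment is a placeholder. The client is passed in so the function itself has no hard dependency on boto3.

```python
def protect_and_stop(ec2, instance_id):
    """Enable termination protection, then stop (not terminate) the instance."""
    # DisableApiTermination=True makes terminate requests fail until the
    # attribute is switched off again.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
    # Stop keeps the instance configuration and its EBS volumes.
    return ec2.stop_instances(InstanceIds=[instance_id])


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   protect_and_stop(boto3.client("ec2"), "i-0123456789abcdef0")  # placeholder ID
```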

I didn't call out a specific dimension on a machined part; instead I left it to the machinist to understand and figure out what needed to be done without explicitly making it clear.

That part was a 2 ton forging with two layers of explosion-bonded cladding on one side. The machinist faced all the way through a cladding layer before realizing something was off.

The replacement had a 6 month lead time.

That's hilarious. Pretty recently I "caused" a line stop because a marker feature (for visuals at assembly, so a pretty meaningless dimension overall) was very much over-dimensioned (we're talking depth, rad, width, location from step). To top it off, instead of a spot drill just doing a .01 plunge, they interpolated it! (Why, I have zero clue.) So it had been leaving dwell marks for at least the past 10 months, and because it was over-dimensioned, all of them had to be put on hold because DOD demands perfection (aircraft engine parts).

It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.

I was on my first oncall rotation. Got my first call from helpdesk, exchange was down, it's 3AM, and the oncall backup and Exchange SMEs weren't responding to pages.

Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.

I reviewed the docs several times, we had Exchange server 1 named something thoughtful like exh-001 and server 2 named exh-002 or something.

Well, I'd reviewed the docs and helpdesk and stakeholders were desperate to move forward, so I initiated a failover from clustered mode with 001 as the primary, instead to unclustered mode pointing directly to server 10.x.x.xx2

What's that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that's how the servers were registered in the cluster manager. Nothing to worry about.

Well... Anyone want to guess which DNS name 10.x.x.xx2 was registered to?

Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.

So anyway that's how I made a 15 minute outage into a 5 hour one.

On the plus side, I learned a lot and didn't get fired.

I once "biased for action" and removed some "unused" NS records to "fix" a flakey DNS resolution issue without telling anyone on a Friday afternoon before going out to dinner with family.

Turns out my fix did not work and those DNS records were actually important. Checked on the website halfway into the meal and freaked the fuck out once I realized the site went from resolving 90% of the time to not resolving at all. The worst part was when I finally got the guts to report I messed up on the group channel, DNS was somehow still resolving for both our internal monitoring and for everyone else who tried manually. My issue got shoo-shoo'd away, and I was left there not even sure of what to do next.

I spent the rest of my time on my phone, refreshing the website and resolving domain names in an online Dig tool over and over again, anxiety growing, knowing I couldn't do anything to fix my "fix" while I was outside.

Once I came home I ended up reversing everything I did which seemed to bring it back to the original flakey state. Learned the value of SOPs and taking things slow after that (and also to not screw with DNS).

If this story has a happy ending, it's that we did eventually fix the flakey DNS issue later, going through a more rigorous review this time. On the other hand, how and why I, a junior at the time, became the de facto owner of an entire product's DNS infra remains a big mystery to me.

Hopefully you learned a rule I try to live by despite not listing it: "no significant changes on Friday, no changes at all on Friday afternoon".

"Man who deployed Friday, works Saturday. "

I spent over 20 years in the military in IT. I took down the network at every base I was ever at, each time finding a new way to do it. Sometimes, but rarely, intentionally.

took out a node center by applying the patches gd recommended.... took an entire weekend to restore all the shots and my ass got fed 3/4ths into the woodchipper before it came out that the vendor was at fault for this debacle.

Worked for an MSP, we had a large storage array which was our cloud backup repository for all of our clients. It locked up and was doing this semi-regularly, so we decided to run an "OS reinstall". Basically these things install the OS across all of the disks, on a separate partition to where the data lives. "OS Reinstall" clones the OS from the flash drive plugged into the mainboard back to all the disks and retains all configuration and data. "Factory default", however, does not.

This array was particularly... special... In that you booted it up, held a paperclip into the reset pin, and the LEDs would flash a pattern to let you know you're in the boot menu. You click the pin to move through the boot menu options, each time you click it the lights flash a different pattern to tell you which option is selected. First option was normal boot, second or third was OS reinstall, the very next option was factory default.

I head into the data centre. I had the manual, I watched those lights like a hawk and verified the "OS reinstall" LED flash pattern matched up, then I held the pin in for a few seconds to select the option.

All the disks lit up, away we go. 10 minutes pass. Nothing. Not responding on its interface. 15 minutes. 20 minutes, I start sweating. I plug directly into the NIC and head to the default IP filled with dread. It loads. I enter the default password, it works.

There staring back at me: "0B of 45TB used".

Fuck.

This was in the days where 50M fibre was rare and most clients had 1-20M ADSL. Yes, asymmetric. We had to send guys out as far as 3 hour trips with portable hard disks to re-seed the backups over a painful 30ish days of re-ingesting them into the NAS.

The worst part? Years later I discovered that, completely undocumented, you can plug a VGA cable in and you get a text menu on the screen that shows you which option you have selected.

I (somehow) did not get fired.

You still remember so. That means you learned and probably won't do it again.

Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.

That's a common one I have seen on r/sysadmin.

I think APC is the company with the stupid issue.

You don't have two unrelated power inputs? (UPS and regular power)

This was 2001 at a shoestring dialup ISP that also did consulting and had a couple small software products. So no.

Updated WordPress...

Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.

Fuck WordPress for static sites...

UPDATE without a WHERE.

Yes in prod.

Yes it can still happen today (not my monkey).

Yes I wrap everything in a rollback now.
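A minimal sketch of what "wrap everything in a rollback" can look like, in Python with the standard-library sqlite3 module standing in for whatever database is actually in prod. The `articles` table and the `update_or_rollback` helper are hypothetical.

```python
import sqlite3


def update_or_rollback(conn, sql, params, expected_rows):
    """Run one UPDATE; keep it only if it touched the expected number of rows."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the statement instead of committing it
        return False
    conn.commit()
    return True


# Demo data (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO articles (status) VALUES (?)", [(1,), (1,), (1,)])
conn.commit()

# Forgot the WHERE clause: all 3 rows are hit, so the change is rolled back.
kept_bad = update_or_rollback(conn, "UPDATE articles SET status = 0", (), 1)
# Intended statement: exactly one row is hit, so the change is committed.
kept_good = update_or_rollback(
    conn, "UPDATE articles SET status = 0 WHERE id = ?", (2,), 1
)
statuses = [row[0] for row in conn.execute("SELECT status FROM articles ORDER BY id")]
```

The same idea works interactively: open a transaction, run the statement, look at the affected-row count, and only then decide between COMMIT and ROLLBACK.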

I did something similar. It was a list box with a hidden first row representing the id. Somehow the header row got selected and an update where id=id got run.

I did this once. But only once. The panic I felt in that moment is something I will never forget. I was able to restore the data from a recent backup before it became a problem, though.

It wasn't "worst" in terms of how much time it wasted, but the worst in terms of how tricky it was to figure out. I submitted a change list that worked on my machine as well as 90% of the build farm and most other dev and QA machines, but threw a baffling linker error on the remaining 10%. It turned out that the change worked fine on any machine that used to have a particular old version of Visual Studio installed on it, even though we no longer used that version and had phased it out for a newer one. The code I had written depended on a library that was no longer in current VS installs but got left behind when uninstalling the old one. So only very new computers were hitting that, mostly belonging to newer hires who were least equipped to figure out what was going on.

That reminds me of when some of my former colleagues and I were on a training about programming industrial camera system that judges the quality of produced parts. I'm not really a programmer, just a guy who can troubleshoot and google stuff and occasionally hack together a simple code with heavy help from Google too.

The guy was a German (we are Czech and we communicated in English) programmer who coded the whole thing in Omron software but he also wrote his own plugin for it. All was well when he was showing us on the big screen, but when he sent us the program file so we could experiment on it (changing parameters, adding steps to the flow...) the app would crash. I finally delved into the app logs and with the help of Google I found it was because he compiled his plugin with debug flags and it worked for him because he had the VS debug DLLs installed but we didn't.

  1. Create a database,
  2. Have the organisation manually populate it with lots of records using a web app,
  3. accidentally delete database.

All in between the backup window.

I fixed a bug and gave everyone administrator access once. I didn't know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, "admin".

Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.

"acknowledge all" used to behave a bit different in Cisco UCS manager. Well at least the notifications of pending actions all went away... because they were no longer pending.

It wasn't me personally but I was working as a temp at one of the world's biggest shoe distribution centers when a guy accidentally made all of the size 10 shoes start coming out onto the conveyor belts. Apparently it wasn't a simple thing to stop it and for three days we basically just stood around while engineers were flown in from China and the Netherlands to try and sort it out. The guy who made the fuckup happen looked totally destroyed. On the last day I remember a group of guys in suits coming down and walking over to him in the warehouse and then he didn't work there any more. It must have cost them an absolute fortune.

How can a guy accidentally order all size 10 shoes to come out, without there being any way to stop it

No idea. It was a new facility, so maybe it was a bug in their new system preventing them stopping it! I was 18 at the time and found it hilarious. They kept us there the whole time because they thought it would be quick to sort out. We shot each other down roller conveyors, rode the pallet trucks around like scooters and smoked cigarettes inside big cardboard boxes while we were waiting. Good times.

Broke teller machines at a bank by accidentally renaming the server all the machines were pointed to. Took an hour to bring back up.

I took down an ISP for a couple hours because I forgot the 'add' keyword at the end of a Cisco configuration line

That's a rite of passage for anyone working on Cisco's shit TUI. At least it's gotten better with some of the newer stuff. IOS-XR supported commits and diffing.

I accidentally destroyed the production system completely through an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.

So I spun up the new server from scratch, restored the database with some slightly outdated dump, installed the code (which was thankfully managed thru git), and configured everything to run all in an hour or two.

The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.

Was wondering if anybody here had made the news.

My first time shutting down a factory at the end of second shift for the weekend. I shut down the compressors first, and that hard stopped a bunch of other equipment that relied on the air pressure. Lessons learned. I spent another hour restarting then properly shutting down everything. Never did that again.

Light switch is right next to the main power breaker.

And they looked the same, no cover or anything??!!

Early in my career as a cloud sysadmin, shut down the production database server of a public website for a couple of minutes accidentally. Not that bad and most users probably just got a little annoyed, but it didn't go unnoticed by management 😬 had to come up with a BS excuse that it was a false alarm.

Because of the legacy OS image of the server, simply changing the disk size in the cloud management portal wasn't enough and it was necessary to make changes to the partition table via command line. I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod. Everything went smoothly except on the moment I had to shut down and delete the newly created VM, I instead shut down the original prod VM because they had similar names.

Put everything back in place, and eventually resized the original prod VM, but not without almost suffering a heart attack. At least I didn't go as far as deleting the actual database server :D

I tried to change ONE record in the production db but I forgot the WHERE clause, ended up changing over 2 MILLION records instead. Three hour production shutdown. Fun times.

I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod

Went through a similar process when I was resizing some partitions on my media server. On the test run I forgot to specify G on the new size, so it defaulted to MB when I resized it, resulting in a 450 GB partition going down to 400 MB. I was real glad I tested that out first.
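One way to take the "forgot the G" failure mode off the table is to refuse unit-less sizes outright. A rough sketch in Python; `parse_size`, `check_shrink`, the binary unit multipliers, and the 50% shrink threshold are all assumptions for illustration, not how any particular partitioning tool behaves.

```python
import re

# Binary multipliers, chosen for the sketch; real tools differ on this.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert '400G' to bytes; reject bare numbers so no unit is ever implied."""
    match = re.fullmatch(r"(\d+)([KMGT])", text.strip().upper())
    if match is None:
        raise ValueError(f"size {text!r} needs an explicit K, M, G or T suffix")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def check_shrink(current, target, max_shrink_ratio=0.5):
    """Refuse a resize that removes more than max_shrink_ratio of the partition."""
    cur_bytes, new_bytes = parse_size(current), parse_size(target)
    if new_bytes < cur_bytes * (1 - max_shrink_ratio):
        raise ValueError(f"{current} -> {target} shrinks the partition too far")
    return new_bytes
```

With these checks, `check_shrink("450G", "400G")` passes, while both `"400"` (no unit) and `"400M"` (wrong unit) raise before anything touches the disk.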

UPDATE articles SET status = 0 WHERE body LIKE '%...%';

On master production server, running myisam, against a text column, millions of rows.

This causes queries to stack up because of table locks.

Rather than waiting for the query to finish, a slave was promoted to master.

Lesson: don't trust mysqladmin to not do something bad.

Table locks can be a real pain. You know you need to make the change, but the system is constantly running queries against the table. Nowadays it's a bit easier with algorithm=inplace and lock=none, but in the good old days you were on your own. Your only friend was luck. Large migrations like that still give me shivers.

Pretty run of the mill for me, so not that bad: Pushed a long-running migration during peak load hours that locked an important table for an extended period of time, effectively taking our site offline.

Also consider !ask_experienced_devs@programming.dev :)

There was a nasty bug with some storage system software that I had the bad fortune to find, which resulted in me deleting 6.4TB of live VMs. All just gone in a flash. It took months to restore everything.

This is nowhere near the worst on a technical level, but it was my first big fuck up. Some 12+ years ago, I was pretty junior at a very big company that you've all heard of. We had a feature coming out that I had entirely developed almost by myself, from conception to prototype to production, and it was getting coverage in some relatively well-known trade magazine or blog or something (I don't remember) that was coming out the next Monday. But that week, I introduced a bug in the data pipeline code such that, while I don't remember the details, instead of adding the day's data, it removed some small amount of data. No one noticed that the feature was losing all its data all week because it still worked (mostly) fine, but by Monday, when the article came out, it looked like it would work, but when you pressed the thing, nothing happened. It was thankfully pretty easy to fix but I went from being congratulated to yelled at so fast.

I was still a wee IT technician, I was supposed to remove some cables from a patch panel. I pulled at least two cables that were used as ISCSI from the hypervisors to the storage bays. During production hours. Not my proudest memory.

I removed the proxy settings from every user in the company. Over 80k people without Internet for the day.

Two things pop up

  • I once left an alert() asking "what the fuck?". That was mostly laughed upon, so no worry.
  • I accidentally dropped the production database and replaced it by the staging one. That was not laughed upon.

I once dropped a table in the production database. I did not replace it with the same table from staging.

On the bright side, we discovered our vendor wasn't doing daily backups.

Advertised an OS deployment to the 'All Workstations' collection by mistake. I only realized after 30 minutes when people's workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.

Extracted a sizeable archive to a pretty small root/OS volume

Crashed an important server because it didn't have room for the update I was trying to install. Love old Windows servers.

Was doing two deployments at the same time. On the first one, I got to the point where I had to clear the cache. I was typing out the command to remove the temp folder, looked down at the other deployment instructions I had in front of me, typed the folder for the prod deployments and hit enter, deleting all of the currently installed code. It was a clustered machine, and the other machine removed its files within milliseconds.

When I realized what I had done, I just jumped up from my desk and said out loud "I'm fired!!" over and over. Once I calmed down, I had to get back on the call and ask everyone to check their apps. Sure enough, they were all failing. I told them what I had done, and we immediately went to the clustered machine and the files were gone there too.

It took about 8 hours for the backup team to restore everything. They kept having to go find tapes to put in the machine, and it took way longer than anyone expected. Once we got the files restored, we determined that we were all back to the previous day, and everyone's work from that night was gone, so we had to start the night's deployments over.

I got grilled about it, and had to write a script to clear the cache from that point on. No more manually removing files. The other good thing that came out of this was no more doing two deployments at the same time. I told them exactly what happened and that when you push people like this, mistakes get made.
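A cache-clearing script like that mostly earns its keep by refusing to run against the wrong folder. A minimal sketch in Python, assuming the cache always lives under one known temp root; `clear_cache` and the directory layout in the demo are hypothetical, not the script from the story.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir, allowed_root):
    """Delete the contents of cache_dir, but only if it lives under allowed_root."""
    cache = Path(cache_dir).resolve()
    root = Path(allowed_root).resolve()
    if cache == root or root not in cache.parents:
        raise ValueError(f"refusing to delete {cache}: not inside {root}")
    removed = 0
    for entry in cache.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


# Demo in a throwaway directory: the cache is cleared, the prod folder is refused.
with tempfile.TemporaryDirectory() as base:
    temp_root = Path(base) / "temp"
    prod_dir = Path(base) / "prod"
    (temp_root / "cache").mkdir(parents=True)
    prod_dir.mkdir()
    (temp_root / "cache" / "page.tmp").write_text("stale")
    (prod_dir / "app.py").write_text("print('live')")

    removed = clear_cache(temp_root / "cache", temp_root)
    try:
        clear_cache(prod_dir, temp_root)
        refused = False
    except ValueError:
        refused = True
    prod_survived = (prod_dir / "app.py").exists()
```

Hard-coding the allowed root in the script, rather than passing it on the command line, is what stops a mistyped path from ever reaching the delete.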

Well first off, in a properly managed environment/team there's never a single point of failure... *ahem*... that being said..

The worst I ever did was lose a whole bunch of irreplaceable data because of... things. I can't go into detail on that one. I did have a backup plan for this kind of thing, but it was never implemented because my teammates thought it was a waste of time to cover for such a minuscule chance of a screw-up. I guess they didn't know me too well back then :)

"properly managed" is carrying a whole lotta weight in that first sentence.

I was purely talking in hypotheticals, I've never seen such a thing with my own eyes :)

A little different:

I was a live FOH sound tech during a concert and hit the wrong button on a playback device (it was a tracking song). Thought I was queuing up the next track for further in the concert but I was on the live side. The director did a great job of pivoting but boy was I red faced.

Plugged a server in after it had been repaired but the person whose responsibility it was insisted it would be fine - they didn't release the FSMO roles from it, the time was an hour out, it changed the time EVERYWHERE and broke ALL THE THINGS. Not technically my fault, but I should have pushed harder for them to have demoted it before I turned it back on.

A then-colleague upgraded glibc by copying it in via scp. Then we couldn't ssh in anymore. :) Not sure how important that server was. I think it was reinstalled soon-ish.

Flushed the entire AD not realizing I somehow got back into prod

Forgot to turn the commercial power back on after testing the battery backups... oopsie.

Found out the hard way to triple check your work when adding a new line to the proxy policy. Or, more accurately 2 lines when you only planned one, and that second one defaulted to a 'deny all' and resulted in dropping all web traffic out for the company...

That made for a REAL tense meeting the next day after it got deployed and people started asking WTF happened...

Two exhibitors, both alike in dignity naming. One needed a critical sw update on their Doremi to fix an issue. The other was running The Force Awakens to a packed auditorium.

Was troubleshooting a failed drive in a raid array on a small business DC/File Serv/Print/Everything else box. Replaced drive still showed failed. Moved to another bay thinking it was the slot not the drive. Accidentally hit yes when asked to initialize the array. Blew the whole thing away. It was an OLD server the customer was working on replacing, so I told them it finally gave up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.

Accidentally announced a /12 of IPv6 on a bad copy-paste of a /127.

Started appending a verification line after interface configs to make sure I never missed a trailing character again.

Took 3 months for anyone to notice (circa 2015).

Not software, but I once powered off an entire network node by accident. The power distribution was 48 V DC, and the breaker panel in the rectifier had a retainer bar to hold in the breakers that was above the toggles. The toggles did not resist being turned off particularly well, and after unscrewing one side of the bar, the whole thing pivoted down, cleanly shutting off every single breaker in the row.

Installed a flatpak app (can't remember which one but it wasn't obscure or shady) and smh it broke the file system on one of my main machines :) (at least I think that's what happened because the machine started lagging, any app refused to launch and after a reboot I got an fsck error or something like that)

Skipping test to patch ERP Prod because, you know, what could go wrong?

The vendor was......unsympathetic.

Set off cascading event bus loops that ran out of control. Friends don't let friends allow events to spawn more events.

@RacerX@lemm.ee In 1995 I worked at a company with several active web sites. Early days of the web, very important to the company. I was hired to take care of the hardware and software running the existing web sites and help in developing new ones.

One day I walked into my office, which had the production web server in it, carrying a Diet Coke (I was young and inexperienced). I opened the Diet Coke and it spewed an epic fountain right onto the production server. It was as if that server had a gravitational pull that drew all liquid towards it. I panicked and started unplugging every cable in sight, thinking this was better than risking a hardware-destroying short.

Needless to say, the web sites were down for a while. I believe I managed to save the hardware from myself though.

I seriously never had a major gaffe.

My buddy Donny, however, repartitioned and overwrote the wrong hard drive... Destroying video that took in the neighborhood of about 9,000 hours to render.

This was in 1996 or 1997, so you can only imagine how devastating that was when our rendering farm was 10 machines with Pentium III's.

Seems trivial now when we have so much computing power at our fingertips, but 10 computers as a dedicated rendering farm was considered insane at that time.

Forgive me, but that's a figure of speech I've never heard before. What does it mean?

By breaking production, I'm referring to a situation where someone, most likely in a technical job, broke a system that was responsible for the operation of some kind of service. Most of the responses here, which have been great to read, are about messing up things like software, databases, servers and other hardware.

Stuff happens and we all make mistakes. It's what you take away from the experience that matters.