Jul 28

Confessions of a SysAdmin

Last week, we celebrated the annual SysAdmin Day by feeding our own SysAdmins cake and running a confessions competition. SysAdmins everywhere had the chance to win a Samsung 840 EVO 1TB SSD in return for sharing the biggest FacePalm moment they had experienced on the job.

We received dozens of entries for the competition, with SysAdmins from Brazil to the UK sharing their most cringe-worthy on-the-job moments with us. Some confessions are too risky to even publish on our blog.

On Friday afternoon, the winner was drawn. The SysAdmin who penned it prefers to remain anonymous, but here’s the confession that earned the person their new Samsung 840 EVO 1TB SSD.

“This is a bad one, but it is absolutely true. I still, to this day, quiver when I think about what I did. My “Doh!” moment came a few years ago, working for a fairly large US-based outsourcing firm. We had home-grown script automation in place across all of our servers, for all of our clients, so we could easily push out scripts to be run automatically on each host. This came in handy quite frequently. However, one fateful afternoon, I edited a script and pushed it out to perform a function across all of our servers, but unbeknownst to me, it contained a typo. This typo caused the script to respawn over and over until all server resources were used up and the servers could do nothing else. With one script push I took down over a thousand servers across some very large, and well known, enterprises. To answer the question at least some of you are probably asking: no, I was not let go for that mistake. I stayed there for a couple more years before choosing to move on to a new opportunity.”
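One common guard against this kind of fleet-wide failure is a staged rollout: push to a single canary host first, check that it is still healthy, and only then widen the batch. The sketch below is illustrative only. The confession does not describe how the firm's automation worked, so `push_script` and `is_healthy` are hypothetical callables standing in for whatever a real system provides, and the batch sizes are arbitrary.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    hosts: Iterable[str],
    push_script: Callable[[str], None],  # hypothetical: deploys the script to one host
    is_healthy: Callable[[str], bool],   # hypothetical: health check run after the push
    batch_sizes: Iterable[int] = (1, 10, 100),
) -> List[str]:
    """Push to growing batches and halt at the first unhealthy batch.

    Returns the hosts that received the script. Raises RuntimeError when
    a batch contains an unhealthy host; hosts in later batches are never touched.
    """
    sizes = list(batch_sizes)
    if not sizes or any(size < 1 for size in sizes):
        raise ValueError("batch_sizes must contain positive integers")

    remaining = list(hosts)
    done: List[str] = []
    step = 0
    while remaining:
        # Once the configured sizes run out, keep reusing the last one.
        size = sizes[min(step, len(sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            push_script(host)
            done.append(host)
        unhealthy = [host for host in batch if not is_healthy(host)]
        if unhealthy:
            raise RuntimeError(f"halting rollout; unhealthy hosts: {unhealthy}")
        step += 1
    return done
```

With the default batch sizes, a script that breaks every host it lands on stops after the first host rather than after the thousandth.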

There’s a moral to our winner’s story: don’t type blind. The other entries that came in offer a few more lessons.

Back up, back up and back up again.

“So I was moving up in my career and had just landed a job at a major US bank as a server admin. A few weeks into the new gig we got a report of a failed drive. My co-worker and I headed up to the raised floor and over to the server in question. I proceeded to remove the “failed” drive from the server when I realized I had the WRONG DRIVE. I had just pulled a live drive from a RAID array!! As you may know, a RAID 5 can tolerate one drive failure. Two drives? YOU’RE TOAST!

My coworker noticed this too but it was too late. The array was now offline.

“What server is this?” I asked.

“It’s the email server,” he replied, but then he paused.

“…for the executives.”

I learned a lot that day: some things not to do, and the value of backups. Perhaps the best lesson, though, was the value of a good co-worker. I still appreciate what you did for me that day, Reinhart. Thanks man, wherever you are.”

And…

“This is a little story on how NOT to upgrade a running production server. I had been asking management for a new server for several months, but with no luck, so I thought: how difficult could it be to upgrade FreeBSD 7.2 to 9.1? I needed some of the changes in the newer version.

I did some research and figured out that the best way to do it was to upgrade 7.2 to 7.3 and so forth until the target was reached. I upgraded one step at a time and got as far as 8.3, and everything had been a breeze with only minor system changes. Easy peasy, I thought, and started the next upgrade to 8.4.

Everything crashed and I couldn’t even boot failsafe anymore. I started sweating and my heart was racing for a few hours, while I frantically tried every trick I could think of, but with no luck.

After several hours of pure terror, with every attempt having failed, I finally came to a solution: I did a clean 8.3 install and then copied that entire fresh system onto the corrupted one.

Success! I could boot and see the files of all the users, but my services weren’t running. I found some old configuration files for the services and installed those. After a couple of hours of reconfiguration, everything was running like it used to.

My cold sweat and worst case scenarios were gone. Finally!

After this episode of complete idiocy I swore never to do a production upgrade without a complete backup. If you don’t learn from your mistakes, you’re bound to make them again.”
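A simple way to hold yourself to that promise is a pre-flight check that refuses to start an upgrade unless a recent backup file exists. This is only a sketch under assumptions: the 24-hour threshold is arbitrary, the backup is assumed to be a single file on disk, and a fresh timestamp says nothing about whether the backup can actually be restored.

```python
import os
import time
from typing import Optional


def backup_is_fresh(
    path: str,
    max_age_hours: float = 24.0,   # arbitrary threshold chosen for illustration
    now: Optional[float] = None,   # injectable clock, mainly useful for testing
) -> bool:
    """Return True if `path` exists, is non-empty and was modified recently."""
    try:
        info = os.stat(path)
    except OSError:
        # A missing or unreadable backup counts as no backup at all.
        return False
    if info.st_size == 0:
        return False
    current = time.time() if now is None else now
    return (current - info.st_mtime) <= max_age_hours * 3600
```

An upgrade script could call `backup_is_fresh(...)` first and exit if it returns False. A test restore remains the only real proof that a backup works.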

In-jokes can easily become outgoing jokes

I was preparing to migrate a web system to a new server so I copied across the source and a database dump and populated the test server. Just to make testing easier, I decided to wipe the tables in the database and start re-populating it as I tested sections.

For whatever reason, my go-to “filler” name is butts. If I am filling in test address details, it’s normally as follows:

Mr Butt McButts

24 Buttington Road

Buttolk

BU7 7SS

0800 BUTTS

butts@buttsbuttsbutts.co.uk

This customer has asked to not be contacted. Reason: Butts

This customer first got in contact with us regarding: Butts

By the end of this first contact we were still discussing: Butts

So there I was merrily inputting my ‘butts’ when I got a phone call from a client. They said…

“Hi this is serious, the record I was accessing has disappeared and I think we’ve been hacked, I can only see one record and it’s called…”

*Oh god oh god oh god oh god*

“… Mr Butt McButts?” the client continued.

“Errrr ok… err, right, OK, let me have a look and I’ll get back to you,” I said.

*quickly restore database*

“Hi, can you check now, it should be OK”

“Ahh good, yes it is, were we hacked?”

“… no, I’m just an idiot.”

Butts.
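The story does not say how the test session ended up touching live data, but a cheap safeguard against this sort of mix-up is to make destructive test helpers refuse to run unless the target database host is on an explicit allow-list. A minimal sketch, with entirely hypothetical host names:

```python
from typing import AbstractSet

# Hypothetical allow-list; a real one would name your own test hosts.
ALLOWED_TEST_HOSTS = frozenset({"localhost", "127.0.0.1", "db-test.example.internal"})


def assert_safe_to_wipe(
    db_host: str,
    allowed: AbstractSet[str] = ALLOWED_TEST_HOSTS,
) -> None:
    """Raise PermissionError unless db_host is a known test host."""
    host = db_host.strip().lower()
    if host not in allowed:
        raise PermissionError(
            f"refusing to wipe tables on {db_host!r}: not a known test host"
        )
```

Calling `assert_safe_to_wipe(host)` at the top of any script that empties tables means a mistyped or inherited connection setting fails loudly before any data is removed.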

Sometimes you have to grin and bear it

Several years ago I worked for a big financial institution. They installed several 2U servers in a closed cabin but forgot one detail: the cabin was closed, with neither air conditioning nor ventilation. So the servers started to “cook”. Inside the cabin, the air temperature rose to 90° on a normal day.

The facepalm was how they solved it. Instead of installing proper air conditioning, they just decided to remove the cabin doors and hire security staff to make sure nobody touched the servers. This was supposed to be temporary, but a friend of mine is still working there, and it’s still the working practice!

*face palm*

This article was brought to you by VPS.net. For dedicated server hosting, cloud servers and 24/7 support, visit our site at vps.net.
