Friday, November 18, 2022

On Twitter, DevOps, and My Work Experience

Given how everyone is discussing when Twitter is going to die, I figured I'd talk a little bit about my work experiences - mostly from my previous job.

I'm going to try to make this as non-technical as I can. Since I came to tech late in life, I remember well what it was like before I understood pipelines, configs, scripting, etc... so I'll mostly stick to less technical terms. (Sure, I know what a VM, host, machine, etc. is, but in this post I'm just going to call them all 'computers'.)

I think the first thing to understand is scripting. Or perhaps the command line interface (CLI). Most people use Windows if they use a computer at all... and so we're used to menus where we click to copy a file, or right-click and choose to delete one. That's a graphical user interface (GUI), and it lets us work without memorizing the commands to do all of that from the CLI.

You can do pretty much all of the same stuff from a CLI, so long as you know the words. On Windows you can search for 'cmd' to get a simple screen where you can type commands, or you can use PowerShell, their newer CLI.
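Just to make that concrete, here's what a couple of those commands look like on a Linux-style computer (the file names are made up; Windows uses different words, but the idea is the same):

    # copy a file into a backup folder
    cp report.txt /home/me/backups/

    # delete a file
    rm old-report.txt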

The reason I bring that up is that if you know the commands, you can also jot them down in a file. (Use Notepad, preferably, since Word tends to add invisible characters that can screw things up. In Windows you also have to give the file the proper extension, like .bat, in order to run it, and on other systems you have to mark the file as executable.)

So if you regularly do the exact same thing, every day, you write a script with those commands. If you want to create a backup of your important files, you can manually go and copy them over every day, but that gets tedious. And some days you might forget to do it.

Instead you save the commands (in plain English, you might say 'check this folder for any file created today, and copy it over to this other location').

You can then run that script whenever you want to take a backup. 

Even better, you can then schedule it so that the computer runs it for you. That way you don't even have to think about it. You don't have to remember to make a copy, you don't have to run the script... it's all automated.
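As a rough sketch (the folder names are invented, and this is the Linux flavor rather than Windows), that backup script might be nothing more than this:

    #!/bin/bash
    # backup.sh - copy anything changed in the last day from my documents
    # folder over to a backup folder
    find /home/me/documents -type f -mtime -1 -exec cp {} /mnt/backup/ \;

And the scheduling part is often just one line in the computer's scheduler (cron, on those systems) telling it when to run the script:

    # run the backup script every night at 2:00 am
    0 2 * * * /home/me/scripts/backup.sh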

In my last job, the team had spent quite a bit of time automating these sorts of repetitive tasks. Some of the scripts hadn't been changed for over a decade... and as long as the process stays the same, they work great.

Makes my job easier... 

The problem is that it doesn't always work. Or things change, and the scripts have to be modified. I joked that my job was generally about dealing with things when they go wrong.

Perhaps the most dramatic example (though thankfully not common) came about when a computer became obsolete.

We all generally buy new computers on a regular basis, so we don't have to deal with that as much... but a large company has so many computers that it gets hard to keep track of them. Hard to know what they do, and who is responsible.

And unfortunately you can't just let them sit and run their thing. One of the biggest reasons is security - as malware evolves and adapts, so too does software, and older computers sometimes stop being supported and updated, and end up with vulnerabilities that you just can't fix. So the solution is to upgrade to something newer.

So we'd had a couple of computers hosting some internal web pages for our work. We were supporting testing in non-production, and every time the developers made a change to the code, that code had to be pushed to the test environments. We called that a build, and so naturally pushing it was called a build push.

We also sometimes needed to reboot the application, sort of like how turning your computer off and on can fix weird issues. Except applications are a bit more complex than your computer... they may actually be running on two or three computers, or more. You might have one that handles the application logic, and another that handles the website itself. They also generally talk to a database, since that's where user information and other things are stored. So to reboot the application you have to 'bounce' a couple of different things, often in a specific order.
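To give a feel for it, here's an invented sketch of what a 'bounce' can look like underneath - the computer and service names are placeholders, not our real setup:

    #!/bin/bash
    # bounce the application: stop the pieces, then start them back up in order

    # stop the website layer first, so users aren't hitting the app mid-restart
    ssh web-server-1 'sudo systemctl stop example-web'

    # then stop the application logic layer
    ssh app-server-1 'sudo systemctl stop example-app'

    # bring them back up in reverse: application first, then the website
    ssh app-server-1 'sudo systemctl start example-app'
    ssh web-server-1 'sudo systemctl start example-web'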

So we had an internal page where the testers could push a build and bounce our applications (of which we had many... customer service, payment processing, billing, etc).

Most of this was automated, again due to the hard work of people a good decade ago, and so most of my time was spent dealing with situations where the automation didn't work.

Maybe something interrupted the computer while it was copying some files, and the files were incomplete. Maybe someone deleted a key file. Maybe the operating system needed to be upgraded, and we had to change our configurations to reflect that. Maybe a firewall was preventing one computer from talking to another. Maybe something was missing in a file and the computer didn't know how to reach another computer. We also used what's called 'third party software', like the databases and web hosting, which sometimes needed to be upgraded as well (especially if there's a vulnerability - that log4j issue about a year back caused a lot of work in that regard, and it wasn't enough to make sure our own applications were updated. We had to patch or upgrade some of that third party software, too.)
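For the firewall kind of problem, by the way, a common first step is just checking whether one computer can reach another on the port it needs - something like this, with a made-up computer name and port:

    # can we reach the database computer on port 5432? give up after 5 seconds
    nc -zv -w 5 db-server-1 5432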

The issues were plentiful, though they weren't necessarily consistent. One day could be very quiet, another day might have five or six different issues in as many different environments and/or applications.

Everything is highly interconnected, and every change may have an impact on other things. Applications...

Before I studied computers, I mostly thought of them as a single file. You know, you download an app from the app store, or a *.exe file for some software you want on your computer... then you run it and hey, presto! You have the app.

But really, at least in a business like ours, they're not so simple. And I'm not just talking about microservices. (Microservices make an application more flexible and modular, because you can update a small portion of it without worrying that it will impact other parts of the application. Basically, if you go to your cell phone provider's site, you can do a variety of things: create a new account and purchase a plan, pay a bill, check your usage, etc. You can break all of those things down into separate parts, like 'add a new subscriber', and have a developer focus on just that part of the application.)

If the developers and/or testers want to make changes, they don't necessarily want to have to update the code everywhere... so a lot of times what they do is they make a configuration (config) file. The code will query the config file to get the value it needs, and you can update that config file without having to change the code.
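As a made-up illustration, a config file is often just a little text file of settings:

    # settings.conf - an invented example
    PAYMENT_URL=https://test-payments.example.com
    USE_SIMULATOR=true
    RETRY_COUNT=3

and the code reads those values instead of having them written in directly, so changing a setting is just an edit to that file:

    #!/bin/bash
    # load the settings and use them, instead of hard-coding the values
    source settings.conf
    echo "Sending requests to $PAYMENT_URL (simulator: $USE_SIMULATOR)"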

This was also something I spent a lot of time on... sometimes for simple reasons, like they needed to change the URL for something, or point it at a simulator.

Sometimes it was because new variables were needed for the changes that were done, and we'd have a call with the developer where they'd have us try updating the config file in different ways until they figured out the right settings to make the application work.

One time there was an issue I helped resolve... the application needed to copy some files from one computer to another, and it kept asking for a password whenever it connected. We were asked to configure it so that it could connect without a password (generally you can connect from one computer to another with ssh, and it normally will ask for your username and password when you do. However, there are ssh keys you can store so that it verifies the connection without a password.)
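The setup for that usually looks roughly like this - a generic sketch, not our exact commands, and 'me@computerB' is a placeholder:

    # on computer A: create an ssh key pair, if there isn't one already
    ssh-keygen -t ed25519

    # copy the public half of the key over to computer B
    ssh-copy-id me@computerB

    # after that, this should connect without asking for a password
    ssh me@computerB 'echo it worked'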

We configured that, and computer A was able to talk to computer B, but they were still having the same problem.

I had to dig into the process in order to figure out that there was actually a third computer involved. So computer A could talk to computer B, and it could also talk to computer C... but computer C was asking for a password every time it talked to computer B. The issue wasn't with the two computers they told us about, but with an entirely different one that they didn't even mention.

I think I posted some things before about the struggle we had teaching new people this job, because while with experience you can learn how to deal with the most common issues, there are plenty of times where we'd get asked to fix something... and have no real idea what was wrong or where to go. So you had to learn how to figure it out.

To get back to scripting... most of those scripts were written a decade ago, and the people who wrote them have moved on to other positions. Luckily, when I started we had a wealth of institutional knowledge... most of my co-workers had been there a decade or more. That meant I had people I could go to when I needed to understand how something worked.

But there were plenty of times where I basically had to teach myself what I needed to know. By which I mean, read through the script that manages the build push... and it may be calling other scripts, or querying a database, so then I'd have to figure out what that other script was doing. Or (if it was our database, since we had one to store a lot of our configuration details) go to the database and figure out what was configured incorrectly or missing.

It is a lot harder to read code than it is to write it... (although scripts aren't really considered code, I think this statement applies to them as well)

When you write it, you know exactly what the commands do. When you read it, you may constantly have to look up the commands yourself... unless you're already quite familiar with them. And if it's a really long script, you may have to spend quite a bit of time working through each line to figure that out.

Tbh, given time constraints, I was more likely to skim the code and/or search for key terms in order to figure out just the portion related to the issue at hand.
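In practice that mostly meant searching for whatever name or step showed up in the error, something like this (the script name and search term are invented):

    # show every line of the build push script that mentions the failing step,
    # with a few lines of context around each match
    grep -n -C 3 'copy_build' build_push.sh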

I emphasize scripting because, to me at least, scripts seem to be the workhorses of what we do. All those fancy pipelines that run a complicated process? Like a build push?

A lot of times they're just calling scripts in a set order. (With redundancies built in, like retrying a script if it fails the first time. Or picking up where it left off if something went wrong, so you don't have to do everything all over again.)
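Stripped of the fancy tooling, the heart of that can be surprisingly plain. Here's a toy sketch, with invented script names:

    #!/bin/bash
    # run the steps of a build push in order, retrying each one once if it fails

    run_with_retry() {
        # try a step; if it fails, wait a bit and try it one more time
        "$@" || { sleep 30; "$@"; }
    }

    run_with_retry ./stop_services.sh
    run_with_retry ./copy_build.sh
    run_with_retry ./update_configs.sh
    run_with_retry ./start_services.sh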

The tools used in DevOps are useful, and automation is nice... but someone, somewhere, has to understand what it's doing. 

Because sometimes you need to change them, or fix something that went wrong. When it breaks, you have to have someone who understands how to fix it.

And that is most definitely a 'when'. Maybe it'll be a good decade before it breaks, maybe it'll be tomorrow. 

If the person who designed it has moved on, then the people who got hired afterwards have to understand it... and if that information isn't passed down, then they'll have to teach themselves.

Anyways, to bring this back to Twitter...

They just lost a LOT of institutional knowledge.

A lot of people seem to expect it to break at any moment, and I'm not sure it will. Businesses put a lot of effort into making sure the code in production is reliable (hence all those test environments in non-production that I worked on).

Sure, we've all seen things go wrong when software gets buggy. Most of the time that's because certain issues don't show themselves until you're dealing with real world volumes. (The code may work fine when only a fraction of the users are using it). But businesses know that they lose money when things don't work. Customers get frustrated and switch to a competitor, people planning to buy your product suddenly decide they don't really need it... buggy code loses money.

The existing production code is probably as reliable as they can make it... the problems generally start when something new is introduced. (Or when the hardware wears out. That's also a thing, of course. Most people don't wear out their hard drive before they're ready to buy a new computer, but the wear and tear of a major enterprise is a whole other story.)

Anyways, that loss of institutional knowledge, to me, means that every time someone is trying to troubleshoot something they're going to have a heckuva time finding someone who can let them know where to look, or what is supposed to be happening. 

They're probably going to have to teach themselves everything, see if they can figure it out by poring through the code. And scripts.

Oh, I almost forgot - in addition to software updates, computer upgrades, and config changes, another common task is dealing with certificates. That is, many applications have certs that help ensure secure connections... and they expire and need to be renewed.

Figuring out the process for renewing certs (who provides them? what commands help you get the start and end dates when the certs are stored in encoded formats you can't just read? where are they kept?) can be a bit of a pain, too.
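For what it's worth, checking the dates is one of the easier parts once you know the command. For a cert in the common PEM format it's roughly this (the file and site names are made up):

    # print the start and end dates of a certificate file
    openssl x509 -in mycert.pem -noout -startdate -enddate

    # or check the cert that a running website is actually serving
    echo | openssl s_client -connect internal-site.example.com:443 2>/dev/null | openssl x509 -noout -enddate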

Anyways, the poor people left behind are now going to have to figure everything out themselves. If they're lucky, maybe they're still friends with their former colleagues and can call someone up. If they're unlucky, they may spend days troubleshooting something that the people who knew the system could have fixed in minutes.

I don't know when or how it will impact production; it depends on what goes wrong, and when. I'm fairly confident, though, that fixes will take longer than they would have, that issues may pile up since they'll still be troubleshooting one thing when another pops up, and that it will be extremely stressful.
