Computer Whodunit – a Computer Troubleshooting Detective Story

This story is a great example of characterizing a problem, getting closer and closer to a solution with each step, and why the process is so important.  The story flows like a detective novel, with Greg the gumshoe uncovering new clues with each new step, all leading to a surprising conclusion that generates more unexpected questions for subsequent episodes.

Opening scene

Like most detective stories, the day started innocently enough.

My friend and customer, Lynn, called with a common problem.  Her email was broken.   Many of my problem calls start with broken email because pretty much everyone uses email.  But sometimes problems are not what they seem and the path to a solution can take many twists and turns.  This was one of those times.

I built the IT network in Lynn’s office and I know its characteristics the same way Scotty knew the original Starship Enterprise.   I knew Lynn used Microsoft Outlook on her desktop, the server was named ehcserver1, and the server ran Microsoft Exchange.  The server is in the basement of the building and everyone connects over a series of Ethernet switches.   Time for a good problem description.

Greg: “What happens when you launch your Outlook program?”

Lynn: “It just sits there for a while and then gives me an error message, something about the server.”

Greg: “When did it break?”

Lynn:  “It worked fine when I shut down yesterday, but when I came in this morning and turned on my computer, now it doesn’t work.  I promise, I didn’t change anything.”

I could push Lynn harder for more details, but this told me enough.  Her Outlook program was not able to find the Exchange Server.   And I know Lynn well enough to believe her when she tells me she did not change anything.  This suggested something out of her control must have changed.

The next logical step in characterizing the problem was to find out if the problem was specific to Lynn or more widespread.  Quickly polling a few people near Lynn, we discovered Bruce had the problem, but not Ayrica, Joe, or Mike.  Since at least one other user had the problem, this suggested the problem was not specific to any workstation setting.  The problem was something common to Bruce and Lynn, but nobody else.

Start Unraveling the Mystery

Experience suggests most email problems are really symptoms of a more general network or server issue.  Everyone reports email problems because email is the application they use most often.  But email depends on the overall network.  If the overall network is broken, email will also be broken.

To find out if the problem is specific to email or something deeper, try a different application and see how it behaves.

One rule about working with end users.  Always start with an easy test and then dig deeper as necessary.  People seem to appreciate it more that way.

Greg:  Let’s see if you can see other stuff on the network.  Click Start…Computer, try to open one of your network drive mappings and let’s see what happens.  What happens when you open, say, the V drive?

A network drive mapping is really a directory on the server.  The idea is, the desktop computer “thinks” it’s another hard drive, thus the drive letter, but really it’s a directory on the server.   This is far and away the most common use for servers in an office.

All IT support companies have their own style and I set up many of my customers with a “V” drive, accessible to everyone.  It’s a convenient place to test.   Why V?  Because V stands for eVeryone.   Why not use “E”?  Because some computers use “E” for a locally connected CD or DVD or USB card reader.  It’s generally easier to use high letters in the alphabet for network drive letter mappings and leave low letters for locally attached devices.

Here is a picture similar to what Lynn saw.  (The picture will open in a different tab on your browser.)  The red X on the network drive mappings does not necessarily mean they are offline.  The only test that generates anything meaningful – just double-click on the drive letter and observe what happens.  Either the contents or an error message will show up in a window.

When Lynn double-clicked on the V drive, she saw an error message.  So did Bruce.  Since another application depending on the server and network was broken, the problem was not specific to email, but instead something common to both email and viewing drive letter mappings on the server.  But only common to Lynn and Bruce.  Mike, Joe, and Ayrica were fine.

Whodunit?

Computer troubleshooting is often compared to a good mystery movie.  Uncover clues and follow them where they lead.  This one was starting to feel like a Hollywood whodunit.  Time for some more in-depth tests.

I asked Lynn to launch an old-fashioned DOS command window and try a few commands.  In Windows 7, Click Start…All Programs…Accessories…Command Prompt.  In Windows 8, click the upper right corner of the display to launch the Start screen, click the Start icon, right-click anywhere, click apps in the lower right corner of the system tray on the bottom of the screen, find the Command Prompt, and double-click on it.  (How much money did Microsoft spend on this new, “improved” interface?)

I knew the server was named ehcserver1.  So in that Command Prompt window, I asked Lynn to type “ping ehcserver1”, press the Enter key, and tell me what it said.  Here is a picture similar to what Lynn found.  Here is a picture similar to what Lynn should have found.
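For readers who like to script this kind of check, here is a minimal Python sketch of the first thing ping does: translating a name into an address.  It uses whatever name lookup the operating system provides, so it is only an approximation of what ping does on Windows.  The function names resolve and describe are my own inventions for illustration.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address for hostname, or None if the name cannot be translated."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the lookup fails.
        return None


def describe(hostname):
    """Build a one-line report about whether the name could be translated."""
    address = resolve(hostname)
    if address is None:
        return f"Could not find host {hostname}"
    return f"{hostname} resolves to {address}"
```

Calling print(describe("ehcserver1")) from a computer on Lynn's network would show whether that computer can translate the server name at all, before worrying about whether the server answers.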

How was it possible that Lynn could not translate the name of her server?  Clearly, something was fundamentally wrong with the network.  But it only affected a few users.  The next step is a deeper diagnostic.  In that DOS command window, type

ipconfig /all

Here is a PDF file with a sample report and some annotations taken from a Windows 7 computer in the Infrasupport network.

The computers in Lynn’s network should all have IPv4 addresses that look like 192.168.10.nnn, where nnn is a number between 1 and 254.  The gateway should be 192.168.10.1, DNS Server 192.168.10.20.  I built this network; I know what these values should be.

Surprise plot twist

But in a surprise plot twist worthy of the best Hollywood has to offer, both Lynn and Bruce’s computers showed IPv4 Address, Gateway, DHCP Server, and DNS Server Addresses of 192.168.2.nnn.  Note the 2.nnn instead of 10.nnn.

No wonder Lynn and Bruce’s computers were broken.   They both had bogus IP Addresses that did not belong to this network.  This was stunning!
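The comparison I did in my head can be sketched in a few lines of Python.  This is only an illustration under some assumptions: the /24 subnet mask is my reading of "192.168.10.nnn", the function name check_settings is made up, and the specific 192.168.2 addresses in the usage note are invented stand-ins for what Lynn's computer reported.

```python
import ipaddress

# Values this network is supposed to hand out (the /24 mask is an assumption).
EXPECTED_NETWORK = ipaddress.ip_network("192.168.10.0/24")
EXPECTED_GATEWAY = "192.168.10.1"
EXPECTED_DNS = "192.168.10.20"


def check_settings(ip, gateway, dns):
    """Return a list of problems found; an empty list means the settings look right."""
    problems = []
    if ipaddress.ip_address(ip) not in EXPECTED_NETWORK:
        problems.append(f"IP address {ip} is outside {EXPECTED_NETWORK}")
    if gateway != EXPECTED_GATEWAY:
        problems.append(f"Gateway is {gateway}, expected {EXPECTED_GATEWAY}")
    if dns != EXPECTED_DNS:
        problems.append(f"DNS server is {dns}, expected {EXPECTED_DNS}")
    return problems
```

A healthy computer, say check_settings("192.168.10.57", "192.168.10.1", "192.168.10.20"), comes back with an empty list.  A computer reporting something like check_settings("192.168.2.15", "192.168.2.1", "192.168.2.1") comes back with three complaints.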

The only possible explanation:  Somebody introduced a rogue DHCP server into this network and it was competing with my real DHCP Server.

DHCP servers lease IP Addresses and other network parameters to computers in an office.  Although there are carefully crafted special cases, typically an office should have exactly one and only one DHCP Server.  If an office has multiple DHCP servers, it is not possible to predict which DHCP server will lease a computer its network parameters.  This means computers may appear to suddenly fail at random times, and for random lengths of time, as their old leases expire and a rogue DHCP server assigns them bogus new network parameters.

This was exactly the case here.  The rogue DHCP Server serviced both Lynn and Bruce’s computers, while the correct DHCP Server took care of Ayrica, Joe, and Mike.
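If you collect the "DHCP Server" line from each computer's ipconfig report, sorting the computers by which server answered them makes the split obvious.  Here is a rough Python sketch.  The addresses are assumptions for illustration only: I never stated the real DHCP server's address above, so 192.168.10.20 and 192.168.2.1 are placeholders, and the function names are made up.

```python
from collections import defaultdict

# Placeholder address for the legitimate DHCP server (an assumption).
EXPECTED_DHCP_SERVER = "192.168.10.20"


def group_by_dhcp_server(reports):
    """Map each DHCP server address to the list of users it leased an address to."""
    groups = defaultdict(list)
    for user, server in reports.items():
        groups[server].append(user)
    return dict(groups)


def rogue_servers(reports, expected=EXPECTED_DHCP_SERVER):
    """Return only the groups served by something other than the expected server."""
    return {
        server: users
        for server, users in group_by_dhcp_server(reports).items()
        if server != expected
    }


# Hypothetical readings, one per computer.
reports = {
    "Lynn": "192.168.2.1",
    "Bruce": "192.168.2.1",
    "Ayrica": "192.168.10.20",
    "Joe": "192.168.10.20",
    "Mike": "192.168.10.20",
}
print(rogue_servers(reports))
```

Any server that shows up in the result is one that should not be handing out leases, and the names next to it are the people with broken email.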

The suspicious character with the shifty eyes did it – or did he?

Wonderful.  Problem identified.  Now, what to do about it?  See  part 2 for the exciting conclusion to the story.   And, as always, contact us if you need help with a computer  troubleshooting situation.

Computer Troubleshooting 101 – Characterize the Problem

Just like most IT professionals, I get computer troubleshooting questions all the time from customers, friends, and family. A few are, uhmm, well, memorable. For example, the one about email a while ago.  The conversation started out something like this:

Friend:  My email doesn’t work.

Greg:  (Trying to be helpful)  OK, what email program do you use?

Friend:  Huh?

Greg:  Well, you run a program on your computer to get to your email, right?

Friend:  No, I just click on “email”.  But now it doesn’t work. What’s wrong with it?

I don’t think we ever solved that problem.  And most IT people reading this, after they finish laughing at an all too familiar story, know why.  I didn’t have enough information to begin solving the problem, and my friend was unable or unwilling to provide it.

All IT people read articles with advice about communicating with “normal” people.   The articles usually scold us for speaking a language most people don’t understand.  Fair enough and guilty as charged.  But we have our “IT words” for a good reason, as do all other professions.  I’m not sure why we get picked on so mercilessly.  For you finance people – why is it OK to say “EBIT-DA”, but not OK for IT people to say, “DHCP server”?

This blog entry is a little different.  I’m an IT guy and I’m asking so-called “normal” people who do not speak IT as a natural language to stretch just a little bit.  If you can say non-IT words like “EBIT-DA”, you can say some IT words too.  It won’t hurt, I promise.

Meet us in the middle for your own benefit.  We IT people are pretty good at solving problems – that’s why we’re IT people – but we need more than “it doesn’t work”.  If you want your problem solved,  we need more from you.  I’ve learned at the feet of some of the best in the business, and what follows are some great troubleshooting tips.

First, before solving the problem, we have to identify it.  We call this characterizing the problem.  The process is part science, part art form.

Here are some things you can give me to help you get back up and running again:

What exactly happens when it breaks?  What do you do and how does the computer respond?  Give me a sequence of events leading up to the problem.  Give me exact error messages, codes, and pictures of screen shots if possible.  Details are important because at least one of those details may be a significant clue.

Has the system ever worked as expected or has it always been broken?  If it worked earlier and is broken now, when did it break?  What changed between when it worked earlier and now when it’s broken?

“Nothing changed” is always the wrong answer.  If nothing changed, then the system would still behave the same as it did earlier.  My friend Bruce had a cell phone email problem a while ago.  He insisted nothing changed and his email just stopped working for no reason.  We talked about it and ended up removing and adding the email account to his smartphone.  Email behaved properly after that, and then Bruce said, “Oh yeah – a big update for my phone came out a few days ago and my email broke right after that!”  My other friend Bob was also in the room, and Bob said, “Wow – that’s probably why my cell phone email stopped working too!”

That’s the power of characterizing the problem – sometimes it helps solve multiple problems.

If the system worked before and is broken now, something broke it.  That something may be subtle and difficult to find, and that’s why details are important.  So think back to everything that happened with your broken system around the time the problem started.  Put together a detailed sequence of events.  Write it all down if this helps.  If I had known about that cell phone software update with Bruce and Bob, we could have saved time and jumped immediately to the solution.

Is the problem reproducible at will, or does it only happen sometimes?  If reproducible at will, what are the steps to reproduce it?  And if only sometimes, what is different about when it works versus when it breaks?  One time, I had a Dell laptop that sometimes refused to connect to the office wireless network.  After hours of trial and error, we finally found a pattern – the problem happened when the laptop was running on battery power, but not on AC power.  This turned out to be a (questionable) feature and not a bug – somebody at Dell thought it was a good idea to conserve power by turning off the wireless adapter by default when running on battery power.  The cure – press a function key to turn it on.

The solutions to many problems seem obvious, but generally only after going through the exercise to find them.

Perhaps most important – compare and contrast how the system should behave versus how it actually behaves.  It’s your job to explain this clearly and in detail to an expert who cannot be as familiar with the history of the problem as you.

Answer these and similar questions and now we have a well-defined problem.

Next comes finding a solution.  The process is also part science, part art form.  For the science part, we form a possible solution based on the problem definition, come up with a way to test it, then evaluate the results.  The process is usually iterative, sometimes tedious, and always slower than anyone wants.  For the art part, sometimes inspiration strikes and sometimes it’s right.  Check out this article for a great example of a troubleshooting scenario.  And watch this space for more articles about interesting troubleshooting scenarios as they come up.

How to spot a “phishy” email


This Wikipedia article provides as good a definition as any for phishing:

Phishing is the act of attempting to acquire information such as usernames, passwords, and credit card details (and sometimes, indirectly, money) by masquerading as a trustworthy entity in an electronic communication.

The challenge is, how do you tell a phishing email that claims to come from your friend, your bank, or other trusted source, from a real email from your friend, bank, or other trusted source?  Using an example phishing email that hit my inbox yesterday, this blog post will provide some helpful and easy to use tips to spot phishing emails that get past your spam filter.

Yesterday’s email claimed to come from a friend, with subject, “Confidential document”.  I happen to know my friend is away from work, so the subject already raises an alarm.  Here is a screenshot with a picture of the offending mail message.   I blacked out the sender name and other identifying information in the text of the email.

Take a look at the little popup near the “click here” link.

And that leads to the first clue on whether that email is what it claims to be.  Most phishing emails come with embedded links you can click on – but where do those links really take you?  Here is how to find out.  Position your mouse cursor over the top of those links – don’t click anything, just position your mouse cursor there.  A little popup should appear with the URL of the website where this link really points.
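The same hover test can be approximated in code.  The following Python sketch pulls every link out of an HTML email body and pairs the text you see with the host name the link really points to.  The class and function names are my own, and the sample HTML uses a made-up example.net address rather than the real link from my email.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_targets(html):
    """Return (visible text, real host name) for each link in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [(text, urlparse(href).hostname) for href, text in collector.links]


# Made-up sample: the text mentions Google, the link goes somewhere else.
sample = '<p>View it on <a href="http://docs.example.net/view">Google Docs - click here</a></p>'
print(link_targets(sample))
```

When the visible text names one organization and the host name belongs to another, treat the email with suspicion.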

In my example, the link points to a suspicious website named Altervista, even though the text of the email suggests the link should point somewhere inside Google.  But look closely – Altervista?  One of the original Internet search engines, before Google, was named Altavista (no “r” in the middle).

This is another favorite phishing trick.  Register domain names that look similar to legitimate or familiar domain names and use fake websites to fool people into giving up sensitive information.  See a few sentences below for a quick discussion about domain names.
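One rough way to flag look-alike names is to measure how similar a name is to names you already know.  This Python sketch uses the standard library's difflib for that.  The 0.8 threshold and the short list of familiar names are arbitrary choices for illustration, not a tested rule.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookalikes(name, familiar_names, threshold=0.8):
    """Return the familiar names that are close to, but not the same as, name."""
    return [
        known
        for known in familiar_names
        if known.lower() != name.lower() and similarity(name, known) >= threshold
    ]


# An arbitrary list of names a reader might recognize.
familiar = ["altavista", "google", "gmail"]
print(lookalikes("altervista", familiar))
```

A name that scores high against a familiar name without matching it exactly deserves a second look.  A high score is only a hint and proves nothing by itself.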

I don’t need to dig any deeper.  With less than 5 seconds of analysis, I can confidently conclude this email is no more legitimate than a Confederate $3 bill.

But we can do better.  I owe it to my friend and this blog entry to chase this one down a little more.

Digging Deeper

On the Internet, everyone who is anyone has a domain name.  Think of a domain name as kind of a trademark name on the Internet, managed by various registrars.  For now, there are a few top level domain names, such as .com, .org, .edu, .net, and others.   Thousands more are on the way and nobody knows how popular they will be.  But, at least for now, the real action is in the second level domain names.  Names such as google.com, whitehouse.gov, infrasupport.com, and millions of others make up today’s Internet.  Most organizations today operate a website, typically named www.  They may also operate an email server, typically named “mail”.  Some offer additional services with different names.  Google, for example, offers another popular website named maps.google.com.

Here is where things become interesting.  In one of the more famous cases of name hijacking, a creative porn operator registered the name “whitehouse.com”.  The idea was, the United States Federal Government operates a website named www.whitehouse.gov.  This website has all the attributes we would expect from the Executive Branch of the United States Federal Government.  But www.whitehouse.com was a porn site – and not even the United States Federal Government had power to stop it, even though its name was similar to the website of the real White House.

Back to our suspicious email.  Domain registrars offer tools to find the current holder of any given domain name.   Some owners pay extra money for privacy, others identify themselves, although not always accurately.  So who is behind altervista.org?

The easiest way to find out – go here and do a whois lookup.  Type “altervista.org” in the search box, and here is the result.  Apparently, this domain name belongs to somebody in Italy.  The name was first registered in 2000 and expires in 2015.  The odds are pretty good the current domain name holders will renew it before it expires.

What can we do about this?  Realistically, not much.   Other than a few high profile cases in the headlines, law enforcement is generally not willing to work these cases because they are labor intensive.  But now, knowing the domain name is registered in Italy, we find yet another nail in this phishing email’s credibility coffin.  Stay far away from the website in that link.

Will the real sender please stand up?

Next, where did this email really come from?  In one of the most regrettable engineering design oversights of the Internet, the SMTP email protocol has no real security and anyone can impersonate anyone else in an email message.   This is a particularly nasty problem because, to date, nobody has come up with anything foolproof to address the problem.  This means, if I want to compose an email and claim I am, say, the vice-president of your bank, I can make the body of the email look like it really came from that sender.  I can even grab a copy of your bank’s letterhead and make the email look like it’s on bank stationery.  If I do a good job of editing, then when you receive the offending email, you will not have any inkling it’s a forgery.

Unless you look at the header.

Here is a picture of the header for the phishing email I received, with my friend’s name blacked out.  Email headers provide valuable diagnostic clues, including routing information and where the message really originated.  We can compare this with where it claims to come from.  Most phishing emails claiming to come from your bank or credit card company in fact usually originate in China, Russia, or another country.
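If you save a message as raw text, Python's standard email module can pull out the same clues.  The sketch below is a minimal illustration.  The sample message is entirely made up, using reserved example domains and documentation-only IP addresses, and the function name summarize_headers is my own.

```python
from email import message_from_string


def summarize_headers(raw_message):
    """Return the claimed sender plus the relay hops, earliest hop first."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    return {
        "from": msg.get("From"),
        "return_path": msg.get("Return-Path"),
        # Each mail server adds its Received line at the top,
        # so reversing the list puts the earliest hop first.
        "hops": list(reversed(received)),
    }


# Made-up sample message for illustration only.
SAMPLE = (
    "Received: from mx.example.org (mx.example.org [203.0.113.7])"
    " by inbox.example.org; Tue, 1 Jan 2013 10:00:05 -0600\n"
    "Received: from sender-pc (unknown [198.51.100.23])"
    " by mx.example.org; Tue, 1 Jan 2013 10:00:01 -0600\n"
    "Return-Path: <friend@example.com>\n"
    "From: Friend <friend@example.com>\n"
    "Subject: Confidential document\n"
    "\n"
    "Click here.\n"
)

summary = summarize_headers(SAMPLE)
print(summary["from"])
for hop in summary["hops"]:
    print(hop)
```

The first hop printed is the closest thing the header offers to a point of origin.  Compare it with the address in the From line.  Keep in mind a sender can forge the earliest Received lines too, so the hops added by your own mail provider are the ones to trust most.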

How do you look at the header?  Every version of every email program is different.  In Outlook 2010 and 2013, click File…Properties.  In Outlook 2007, click the little checkbox in the “Options” menu ribbon graphic.  In Outlook 2003 and earlier, click View…Options.

Notice my sender claims to come from gmail.com.  Gmail is Google’s free email service and my friend does, in fact, have a Gmail account.  Looking at the header, the evidence strongly suggests this message really came from my friend’s mailbox.

But my friend did not send it.  Somebody compromised my friend’s email account and is now trying to pursue my friend’s contacts, including me.  No doubt, that altervista website will try to extract personal information such as credit card numbers or passwords and use them illegally.  One day, I might use a throwaway computer to see what that website does, but not today.

I warned my friend and hopefully by now, that email account and any other accounts my friend operates have new passwords.

I want to thank the people who are reading this blog post and leaving comments.  If you don’t mind, I would appreciate it if you would fill out the Contact Us form and let me know how you found it.  And, of course, if you want some help eliminating “phishy” emails, or you suspect you have a malware problem, or just need IT help in general, please Contact Us too.

And now this blog is finally visible to the world


I seem to encounter more than my share of tech problems that nobody else has ever seen before.  I don’t know why, they just seem to find me.  The good news is, I like to think this makes me stronger.  If they don’t kill me first.

The saga getting this blog page up and running is a typical example.

Apparently, it all started about a year ago when somebody decided it would be better to set up the new systemd to start mysqld on Fedora using a private tmp directory instead of the system wide tmp directory.

Understanding the sentence above needs some background.  Briefly – I am hosting this website on a Fedora 18 virtual machine.  Fedora is a free, open source offering from a great company named Red Hat.  Because Fedora is free, lots of tech enthusiasts use it and help debug it and provide feedback to Red Hat.  Red Hat incorporates the feedback and periodically releases another product with paid support subscriptions named Red Hat Enterprise Linux.   The model works for everyone.  I get a free platform, Red Hat gets a more solid paid offering.  And I’m a Red Hat partner, so it’s good to use the products I help resell and support.

Mysqld (pronounced, “My-S-Q-L-D”) is part of the well known open source mysql database package.  And systemd (pronounced “System-D”) is a new, sophisticated set of software to start Linux systems.  Systemd is an improvement over the old way to do it.  It’s rapidly maturing and will soon become part of Red Hat Enterprise Linux.

The takeaway from all this is, it’s bleeding edge packaging and I am essentially a tester for this packaging.  Sometimes that testing produces unexpected results.

Why is this important to me?  Because I chose a package named WordPress to develop my new website and I’ve spent a significant portion of the past month of my life learning how to use it.  The website and blog you’re reading right now is the fruit of my labor.  WordPress depends on the mysql database hosted on my Fedora system, which, in turn, uses systemd to start itself.

Mysqld apparently writes temporary data to a temporary directory to perform its work.  This could be a potential security issue if others have access to that same temporary directory.  So about a year ago, somebody decided it would be a good idea to use systemd to “fool” mysqld into using a private temporary directory only available to mysqld.  Take a look here for details.  Unfortunately, apparently systemd removes these private temporary directories periodically and this breaks mysqld, but only after it has been successfully running for several days.   Problems like this are maddening to identify and troubleshoot because the system passes all tests and then suddenly fails for no apparent reason.
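For the curious, here is a small Python sketch that checks whether a systemd service file asks for a private temporary directory.  I am assuming the feature described above is systemd's PrivateTmp= setting, and the sample unit text is a simplified stand-in, not the actual Fedora mysqld.service file.  The function name uses_private_tmp is made up.

```python
import configparser


def uses_private_tmp(unit_text):
    """Return True if the [Service] section of a unit file turns on PrivateTmp."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive, as systemd does
    parser.read_string(unit_text)
    value = parser.get("Service", "PrivateTmp", fallback="false")
    # systemd accepts several spellings for a true boolean value.
    return value.strip().lower() in {"1", "yes", "true", "on"}


# Simplified, hypothetical unit file for illustration only.
SAMPLE_UNIT = """\
[Unit]
Description=Example database server

[Service]
ExecStart=/usr/bin/mysqld_safe --basedir=/usr
PrivateTmp=true
"""

print(uses_private_tmp(SAMPLE_UNIT))
```

Python's configparser handles simple unit files like this sample, but it is not a full systemd parser, so treat the result as a quick hint and not the final word.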

Around the time I set up my first blog post, systemd apparently decided to clean out its temporary directories.  This broke mysqld, which, in turn, broke WordPress, which, in turn, broke my blog entries and website menu construction.  This triggered a flood of electronic correspondence in various support forums to find an answer.  The problem was magnified because the WordPress theme I chose, named “Responsive” had an upgrade around the same time and the upgrade had some bugs.

Here and here are a couple of links with details.  Here is another one.  The one sentence summary – I haven’t slept much the past few days and I really need a shower.

I am deeply grateful and indebted to the people in this discussion thread who found the mysqld/systemd problem and to the support staff at Cyberchimps.com who seem to work the same weird late night hours as I do.

And now, this blog should finally be visible to the world.

Welcome to my blog


Welcome to the new and improved Infrasupport website.  This blog is where I’ll post articles or essays I think are of interest.  Over time, as the content accumulates, I’ll set up categories to make entries of interest easy to find.  Enjoy.