This is part of a series on diagnosing your website outage issues. This is part three; links to the other parts are here.
In Part 1 of this series we covered the overview of what could have broken to cause your website to go down. In Part 2, we started working through those possible issues by diagnosing DNS issues. Now that we know your domain’s DNS is good, we’re going to start looking at the routing. In other words, now that we know where your domain is (DNS), are there any roads to get there (routing)?
Many people misunderstand how the web works. I blame the media and well-meaning explainers for propagating the notion that we “log on” to or “visit” websites. The problem with these terms is that they call to mind the idea of “going” to a website, which is the exact opposite of how web surfing works. Your browser is not a vehicle that travels to websites; it is a piece of software that sits on your computer and issues demands for websites to come to it. Websites come to you; you do not go to them.
When you click a link or type in a domain name, your browser sends a request across the Internet to that domain’s web server that says “give me that page”. Well-behaved web hosts comply: the content your browser requested is downloaded to your computer, and your browser renders it so you can see and interact with it. For this to happen, there has to be a pathway on the Internet that can carry your browser’s request to the web server and carry the response (the content) back to your browser. Those pathways are called routes, and a big part of website troubleshooting involves diagnosing routing issues.
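To make the request-and-response idea concrete, here is a minimal Python sketch. It is only an illustration: the page content, the `PageHandler` class, and the `fetch_once` helper are made up for this example, and the “web server” is a throwaway one running on your own machine rather than anything on the Internet.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content, standing in for a real website.
PAGE = b"<h1>Hello from the web server</h1>"


class PageHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start a local server, request one page from it, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The "browser" side: send a request and wait for the content to come back.
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        conn.request("GET", "/")
        response = conn.getresponse()
        status, body = response.status, response.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_once()` plays both roles in one process. On the real Internet the request and the response each have to cross a route between the two machines, which is the part the rest of this post tests.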
As you might surmise, there are tools for testing routes, and the granddaddy of them all is called traceroute because it…traces…routes. For unexplained reasons, Windows and Linux use different command names to invoke a traceroute, but the basic functionality is the same. On Windows the command is tracert; on Linux and Mac it is traceroute.
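If you script your checks, the difference in command names is easy to paper over. The following is a small sketch, not a finished tool; the `traceroute_command` helper is a name invented for this example.

```python
import platform


def traceroute_command(host, system=None):
    """Build the traceroute command line for the given (or current) OS.

    `system` uses the names returned by platform.system(), such as
    "Windows", "Linux", or "Darwin" (Mac).
    """
    system = system or platform.system()
    program = "tracert" if system == "Windows" else "traceroute"
    return [program, host]
```

The returned list can be handed to `subprocess.run` as-is, assuming the tool is installed and on your path.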
Routes are made up of hops. Your request for a website is broken down into many packets, and each packet is forwarded to the destination web server via a series of hops between the many routers that sit between you and the web server. A traceroute sends packets toward your web server, but it limits each packet to go only one hop further than the last one. It does this by raising the packet’s time-to-live (TTL) value by one each round; the router where the TTL runs out discards the packet and reports back. When a reply arrives (or the wait times out), traceroute sends another packet that is allowed to go one hop further. Repeating this process maps the route all the way to the destination web server, and it looks something like this:
$ traceroute disney.com
traceroute to disney.com (126.96.36.199), 30 hops max, 60 byte packets
1 172.20.0.1 (172.20.0.1) 0.434 ms 0.413 ms 0.394 ms
2 * * *
3 * * *
4 188.8.131.52 (184.108.40.206) 2.128 ms te0-0-1-0.rcr12.yyz01.atlas.cogentco.com (220.127.116.11) 2.391 ms 18.104.22.168 (22.214.171.124) 2.501 ms
5 be2429.ccr22.yyz02.atlas.cogentco.com (126.96.36.199) 2.273 ms 4.126 ms be2428.ccr21.yyz02.atlas.cogentco.com (188.8.131.52) 2.233 ms
6 be2994.ccr22.cle04.atlas.cogentco.com (184.108.40.206) 9.055 ms 8.835 ms 8.534 ms
7 be2718.ccr42.ord01.atlas.cogentco.com (220.127.116.11) 16.671 ms 16.889 ms 16.627 ms
8 be2831.ccr21.mci01.atlas.cogentco.com (18.104.22.168) 28.287 ms 27.963 ms 27.973 ms
9 be2432.ccr21.dfw01.atlas.cogentco.com (22.214.171.124) 38.698 ms 38.713 ms be2433.ccr22.dfw01.atlas.cogentco.com (126.96.36.199) 37.953 ms
10 be2443.ccr22.iah01.atlas.cogentco.com (188.8.131.52) 42.778 ms be2441.ccr21.iah01.atlas.cogentco.com (184.108.40.206) 42.348 ms 42.187 ms
11 be2928.ccr21.elp01.atlas.cogentco.com (220.127.116.11) 58.732 ms 58.757 ms be2927.ccr21.elp01.atlas.cogentco.com (18.104.22.168) 58.840 ms
12 be2929.ccr21.phx02.atlas.cogentco.com (22.214.171.124) 66.853 ms 67.350 ms be2930.ccr22.phx02.atlas.cogentco.com (126.96.36.199) 66.011 ms
13 be2931.ccr21.lax01.atlas.cogentco.com (188.8.131.52) 78.713 ms be2932.ccr22.lax01.atlas.cogentco.com (184.108.40.206) 77.966 ms be2931.ccr21.lax01.atlas.cogentco.com (220.127.116.11) 78.338 ms
14 be2543.rcr21.las02.atlas.cogentco.com (18.104.22.168) 85.752 ms 85.210 ms be2536.rcr21.las02.atlas.cogentco.com (22.214.171.124) 85.770 ms
15 be2205.nr21.b023583-0.las02.atlas.cogentco.com (126.96.36.199) 85.914 ms 85.364 ms 85.359 ms
16 188.8.131.52 (184.108.40.206) 85.850 ms 85.483 ms 220.127.116.11 (18.104.22.168) 85.262 ms
17 be775.las-score7-3.sinap-tix.com (22.214.171.124) 85.882 ms 86.036 ms 86.036 ms
18 cust-126.96.36.199.sinap-tix.com (188.8.131.52) 85.310 ms 84.853 ms 85.127 ms
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
This particular traceroute does not complete. The * * * lines mean that the device at that hop did not respond. Traceroute kept probing until it reached its 30-hop limit and gave up, because it has no way of knowing how many hops might be left after the last responding hop, cust-184.108.40.206.sinap-tix.com. This doesn’t necessarily indicate a routing problem, however. Note how I am able to bring up the Disney website despite the traceroute showing me there is no route to the site. The important lesson in this output is that you cannot trust incomplete traceroute information.
There are different reasons why traceroutes sometimes do not complete. The Windows implementation of traceroute uses ICMP packets by default, which are more likely to get a response at each hop because of how most routers handle ICMP. The Linux and Mac implementations use UDP packets by default, which are more likely to be dropped and result in incomplete traceroutes. There are also situations where pesky traffic like pings and traceroutes is throttled by some devices, so probes are dropped one day but not the next. It’s not an exact science, and a traceroute can only really be viewed as one tool, not THE tool, for diagnosing routing issues.
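A toy simulation shows why silent hops produce rows of stars and why traceroute keeps going to its hop limit. This is a sketch of the idea only: it sends no packets, and the `simulate_traceroute` function and the hop names in `EXAMPLE_PATH` are invented for illustration. It assumes the usual mechanism, where each probe is allowed one more hop than the previous one.

```python
# Hypothetical route: (hop name, whether that device answers probes).
# The last entry is the destination web server.
EXAMPLE_PATH = [
    ("gateway", True),
    ("isp-router", False),   # forwards traffic but stays silent
    ("backbone", True),
    ("web-server", False),   # serves web pages but ignores probes
]


def simulate_traceroute(path, max_hops=30):
    """Return one report line per probe round for a non-empty path."""
    report = []
    for ttl in range(1, max_hops + 1):
        # A probe allowed `ttl` hops stops at hop number `ttl`, or at the
        # destination if the route is shorter than that.
        index = min(ttl, len(path)) - 1
        name, answers = path[index]
        report.append(f"{ttl} {name}" if answers else f"{ttl} * * *")
        if answers and index == len(path) - 1:
            break  # the destination replied, so the trace is complete
    return report


if __name__ == "__main__":
    print("\n".join(simulate_traceroute(EXAMPLE_PATH, max_hops=6)))
```

In this made-up route the destination is reachable the whole time. The trace still ends in stars, because a silent destination looks exactly like a missing one from the prober’s side.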
So what good is a traceroute? Well, it can tell you when the route is good, and it can tell you if there is congestion along the way. The numbers like “85.310 ms 84.853 ms 85.127 ms” are the round-trip times of the three probes sent to that hop. Those numbers will steadily increase as the packets get farther away, but a sudden jump in time could indicate congestion at a particular hop. However, an incomplete traceroute cannot tell you with certainty that there are routing issues.
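Spotting a sudden jump by eye gets tedious on long traces, so here is a rough Python sketch that does it for you. It assumes the Linux-style output format shown above; the `hop_rtts` and `find_jumps` names and the 20 ms default threshold are arbitrary choices for this example, not anything traceroute defines. The sample lines are copied from the trace above.

```python
import re
from statistics import median

# A number followed by " ms", not starting in the middle of another number.
RTT = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?) ms")

SAMPLE = """\
6 be2994.ccr22.cle04.atlas.cogentco.com (184.108.40.206) 9.055 ms 8.835 ms 8.534 ms
7 be2718.ccr42.ord01.atlas.cogentco.com (220.127.116.11) 16.671 ms 16.889 ms 16.627 ms
8 be2831.ccr21.mci01.atlas.cogentco.com (18.104.22.168) 28.287 ms 27.963 ms 27.973 ms
19 * * *
"""


def hop_rtts(line):
    """Return (hop_number, [rtt_ms, ...]) for a hop line, or None otherwise."""
    fields = line.split(maxsplit=1)
    if not fields or not fields[0].isdigit():
        return None  # header line or blank line
    rest = fields[1] if len(fields) > 1 else ""
    return int(fields[0]), [float(t) for t in RTT.findall(rest)]


def find_jumps(text, threshold_ms=20.0):
    """List (hop, previous_ms, current_ms) where the median RTT jumps sharply."""
    jumps = []
    previous = None
    for line in text.splitlines():
        parsed = hop_rtts(line)
        if parsed is None or not parsed[1]:
            continue  # skip non-hop lines and hops that never answered
        hop, rtts = parsed
        current = median(rtts)
        if previous is not None and current - previous > threshold_ms:
            jumps.append((hop, previous, current))
        previous = current
    return jumps
```

Using the median of the three probes keeps one slow reply from triggering a false alarm. A flagged hop is only a hint: a long physical link between two cities produces a jump too, so pick the threshold with the geography of the route in mind.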
Another, less verbose, method of testing a route is to simply ping the site. A ping works much like traceroute except that it just tries to hit the destination web server without bothering to tell you about all the hops along the way and it looks like this:
$ ping disney.com
PING disney.com (220.127.116.11) 56(84) bytes of data.
64 bytes from 18.104.22.168: icmp_seq=1 ttl=237 time=85.0 ms
64 bytes from 22.214.171.124: icmp_seq=2 ttl=237 time=84.8 ms
64 bytes from 126.96.36.199: icmp_seq=3 ttl=237 time=85.2 ms
64 bytes from 188.8.131.52: icmp_seq=4 ttl=237 time=84.8 ms
As part of interpreting this ping, it’s interesting to note that the round-trip time is about 85 ms, which is very close to the round-trip time on the last line that got a response in the traceroute above. This suggests that the traceroute probably stopped on the very last hop before the web server.
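The same comparison can be done in a few lines of Python. Treat this as a sketch under assumptions: it expects the Linux-style output shown above, the helper names are invented here, and the 2 ms tolerance is simply a guess at what counts as “very close”. The sample text is copied from the ping and traceroute output above.

```python
import re

PING_TIME = re.compile(r"time=(\d+(?:\.\d+)?) ms")
TRACE_TIME = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?) ms")

PING_OUTPUT = """\
64 bytes from 18.104.22.168: icmp_seq=1 ttl=237 time=85.0 ms
64 bytes from 22.214.171.124: icmp_seq=2 ttl=237 time=84.8 ms
64 bytes from 126.96.36.199: icmp_seq=3 ttl=237 time=85.2 ms
64 bytes from 188.8.131.52: icmp_seq=4 ttl=237 time=84.8 ms
"""

LAST_HOP = (
    "18 cust-126.96.36.199.sinap-tix.com (188.8.131.52) "
    "85.310 ms 84.853 ms 85.127 ms"
)


def mean_rtt(text, pattern):
    """Average every round-trip time that `pattern` finds in `text`.

    Assumes at least one time is present.
    """
    times = [float(t) for t in pattern.findall(text)]
    return sum(times) / len(times)


def roughly_equal(a_ms, b_ms, tolerance_ms=2.0):
    """True when two round-trip times differ by no more than the tolerance."""
    return abs(a_ms - b_ms) <= tolerance_ms


ping_ms = mean_rtt(PING_OUTPUT, PING_TIME)
last_hop_ms = mean_rtt(LAST_HOP, TRACE_TIME)
same_neighbourhood = roughly_equal(ping_ms, last_hop_ms)
```

Here `same_neighbourhood` comes out true, which matches the reading above: the last hop that answered is probably right next to the web server.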
It’s important to note that these route tests only prove that the networking route is good. They do nothing to prove that the web server is functioning properly or that your website loads. Still, the network route is one of the basic building blocks of troubleshooting outages.
Now that the route is proven, it’s time to look at the layers that we discussed in Part 1. That’s the subject of the next post and I will link to it here once it is complete.
Image credit to Stanford University.