Error messages are hard
Error messages are hard. When a program logs an error message, sometimes both of these things can be true at once:
- The error message can be 100% accurate.
- The error message can be completely unhelpful, if not outright misleading.
Let’s look at some examples:
Python JSONDecodeError
Consider this small program, python_json.py:
import json
import sys
with open(sys.argv[1]) as f:
    contents = f.read()
print(json.loads(contents))
Here is what happens if we run it against an invalid file:
$ python3 python_json.py test_one.txt
Traceback (most recent call last):
  File "python_json.py", line 5, in <module>
    print(json.loads(contents))
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What is the root cause here? The file doesn’t contain JSON; it is empty.
Go and JS do a little better in this case; they both say:
unexpected end of JSON input
This is true, but it could still be improved by checking for empty input explicitly and saying so.
You may be thinking at this point that while the message is bad, you can assume that specific error means that the file was empty. You would be wrong.
Let’s run this same program again against another invalid file, and see what we get:
$ python3 python_json.py test_two.txt
...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
That file was not empty; it was an HTML file:
$ cat test_two.txt
<!doctype html>
...
Go and JS do a bit better here:
invalid character '<' looking for beginning of value
Uncaught SyntaxError: Unexpected token '<', "<!doctype html>" is not valid JSON
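You can reproduce both Python failures without any files. The following is a minimal sketch using only the standard library. decode_error is a hypothetical helper that returns the exception instead of raising it, so that the two messages can be compared directly. JSONDecodeError also carries the offending document and offset in its doc and pos attributes, which is the information a friendlier message would need.

```python
import json


def decode_error(text: str) -> json.JSONDecodeError:
    """Return the JSONDecodeError that json.loads raises for text (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc
    raise AssertionError("input unexpectedly parsed as valid JSON")


empty = decode_error("")
html = decode_error("<!doctype html>\n<html></html>\n")

# Both inputs fail at offset 0, so the generated messages are identical.
print(empty)
print(html)
print(str(empty) == str(html))
```

Both inputs fail at the very first character. Python builds its message only from the reason and the position, so the text comes out the same for both, even though the underlying problems are completely different.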
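Why might you run into this error? These errors are common in Python programs that use the requests library to fetch a JSON file from a web server but don’t first call raise_for_status() or check the status_code directly. If the request fails, the server often returns an HTML error page instead of JSON, and that page is what gets parsed: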
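Bad
import requests
r = requests.get("https://example.com/file.json")
data = r.json()  # parses whatever came back, even an HTML error page
Good
import requests
r = requests.get("https://example.com/file.json")
r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
data = r.json()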
Without the call to raise_for_status(), that program will raise the same json.decoder.JSONDecodeError. With the call to raise_for_status(), it will raise:
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://example.com/file.json
How could this be improved?
The error messages from Go and JS show that there are some easy wins here: quoting the unexpected character or a snippet of the input makes the HTML case obvious. Multiple languages could also benefit from special-casing empty input.
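To illustrate those easy wins, here is a minimal sketch of a wrapper around json.loads. It special-cases empty input and quotes the text at the failure position. loads_with_context is a hypothetical name and is not part of the json module. The 20-character snippet length is an arbitrary choice.

```python
import json
from typing import Any


def loads_with_context(text: str) -> Any:
    """Like json.loads, but with a more descriptive error message (hypothetical helper)."""
    if not text.strip():
        # Special-case empty input instead of reporting "Expecting value".
        raise ValueError("cannot parse JSON: the input is empty")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Quote the input starting at the failure position, so a leading '<'
        # from an HTML page is visible in the message.
        snippet = exc.doc[exc.pos:exc.pos + 20]
        raise ValueError(
            f"cannot parse JSON: {exc.msg} at line {exc.lineno} column {exc.colno}, "
            f"found {snippet!r}"
        ) from exc


results = []
for sample in ["", "<!doctype html>\n<html></html>\n", '{"ok": true}']:
    try:
        results.append(loads_with_context(sample))
    except ValueError as err:
        results.append(str(err))

for result in results:
    print(result)
```

Raising ValueError keeps the wrapper compatible with callers that already catch ValueError, because json.JSONDecodeError is itself a subclass of ValueError.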
OpenSSL wrong version number
Let’s run curl and see if we can understand what the problem is from the error message:
$ url=https://...
$ curl $url
curl: (35) OpenSSL/3.0.8: error:0A00010B:SSL routines::wrong version number
This error message may lead you to believe that there is an issue with the TLS version being used: for example, a client that wants to use TLS 1.3 connecting to an old server that only supports TLS 1.2.
If we run curl with --trace -, we can see what the real issue is:
...
<= Recv SSL data, 5 bytes (0x5)
0000: 48 54 54 50 2f HTTP/
== Info: OpenSSL/3.0.8: error:0A00010B:SSL routines::wrong version number
The problem is that the server is speaking plain-text HTTP, not TLS, on that port. Where curl expected to receive a TLS handshake, it received:
HTTP/1.1 400 Bad Request
Server: nginx
...
Normally, curl would receive something like this:
<= Recv SSL data, 5 bytes (0x5)
0000: 16 03 03 00 7a ....z
Since HTTP/ in ASCII does not contain a valid protocol version, OpenSSL complains. The error message is accurate, but not helpful.
The assumption the user might make is that there is some issue with the supported TLS version, and not that the server is not even speaking TLS in the first place.
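Every TLS record begins with a 5-byte header: a 1-byte content type, a 2-byte protocol version, and a 2-byte length. That layout explains why those five bytes trip a version check. The sketch below decodes both the expected reply and what curl actually received. It is a simplified illustration, not OpenSSL’s record-layer code, and OpenSSL’s exact checks and their order may differ.

```python
import struct

# TLS record content types (RFC 5246, RFC 8446).
CONTENT_TYPES = {20: "change_cipher_spec", 21: "alert", 22: "handshake", 23: "application_data"}


def describe_record_header(header: bytes) -> str:
    """Decode a 5-byte TLS record header: type (1 byte), version (2), length (2)."""
    if len(header) != 5:
        raise ValueError("a TLS record header is exactly 5 bytes")
    # "!BHH" = network byte order, one unsigned byte, two unsigned 16-bit ints.
    content_type, version, length = struct.unpack("!BHH", header)
    name = CONTENT_TYPES.get(content_type, "unknown")
    return f"type={content_type} ({name}) version=0x{version:04x} length={length}"


normal = describe_record_header(bytes.fromhex("160303007a"))  # the expected reply
actual = describe_record_header(b"HTTP/")                     # what curl got instead
print(normal)
print(actual)
```

The normal record is a handshake record with version 0x0303, which is the value TLS 1.2 uses and which TLS 1.3 keeps in this field for compatibility. In the HTTP case, the “version” is 0x5454 (the ASCII letters TT), which is not any SSL or TLS version.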
How could this be improved?
OpenSSL could include the value in the error message, or make it available to the caller somehow.
This would help, but logging 485454502f without converting it to ASCII first isn’t as helpful as it could be. A web search for 485454502f turns up results full of people who have that value in an error message but don’t know why.
If the value were available to the caller, libcurl could check for this case. Alternatively, OpenSSL could check whether the invalid record header is 485454502f and, if so, handle this specific case. Since OpenSSL is used for more than just HTTP, this might not be appropriate.
Connection reset by peer
Connection reset by peer, AKA ECONNRESET, AKA errno 104 (on Linux).
In theory, this error is straightforward. It is generated when an RST packet is received from the remote side of a TCP connection. If the client is logging this error message, the log on the server side should indicate what the problem is.
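To see where this error comes from, you can make a local peer send an RST on purpose. This is a minimal sketch that assumes Linux or a similar POSIX system. Setting SO_LINGER with a zero timeout before close() makes the kernel abort the connection with an RST instead of the normal FIN exchange. The client’s next recv() then fails with ECONNRESET, which Python surfaces as ConnectionResetError. observe_reset is a hypothetical name used only for this illustration.

```python
import errno
import os
import socket
import struct
import threading


def observe_reset() -> int:
    """Make a local peer abort a TCP connection and return the errno the client sees."""
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]

        def accept_and_abort() -> None:
            conn, _ = server.accept()
            # SO_LINGER with l_onoff=1 and l_linger=0 makes close() abort the
            # connection with an RST instead of the usual FIN exchange.
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
            conn.close()

        peer = threading.Thread(target=accept_and_abort, daemon=True)
        peer.start()
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            peer.join()  # once close() has returned, the RST has been sent
            try:
                client.recv(1024)
            except ConnectionResetError as exc:
                return exc.errno
    return 0  # no reset observed (recv returned normally)


code = observe_reset()
print(code, os.strerror(code))
```

Nothing in the RST tells the client who actually sent it. The client only knows that a reset arrived for this connection.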
In the real world, you may find that both sides of a connection are inexplicably logging ECONNRESET at the same time. The root cause for this is a middlebox somewhere between them resetting the connection. This device can be an “inline IPS” or a “Next Gen Firewall”.
To further complicate things, admins of such devices are often unaware that the underlying mechanism used to block connections may involve injecting spoofed TCP resets. I was once troubleshooting this exact issue on a customer network and asked, “Do you have an IPS or a Next Gen Firewall such as a Palo Alto?” They responded “yes”, and we got their Palo admin on the call. Their Palo admin proceeded to tell me that the Palo couldn’t be the cause of this problem, because “The Palo doesn’t do that”. We then did a search in the Palo UI for our IP address, and the result was a screen full of entries whose action was not something like allow, but reset-both. The action reset-both corresponds to sending spoofed TCP resets to both the client and the server at the same time.
How could this be improved?
This is a hard one. I suppose a more accurate error message would be “Connection reset by peer or (because RST packets are not authenticated in any way) an RST packet was spoofed by a next gen firewall”.
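An application can’t tell a genuine reset from a spoofed one, but it can at least log a message that mentions both possibilities. This is a minimal sketch, and describe_os_error is a hypothetical helper. It only rewrites the message for ECONNRESET and passes every other OSError through unchanged.

```python
import errno
import os


def describe_os_error(exc: OSError) -> str:
    """Return a log message for exc, adding context for ECONNRESET (hypothetical helper)."""
    if exc.errno == errno.ECONNRESET:
        return (
            f"{exc.strerror}: an RST was received. It may have come from the peer, "
            "or it may have been injected by a middlebox such as an IPS or next gen "
            "firewall (RST packets are not authenticated)"
        )
    return str(exc)


reset_msg = describe_os_error(
    ConnectionResetError(errno.ECONNRESET, os.strerror(errno.ECONNRESET))
)
print(reset_msg)
```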