-
-
Save sethhall/727ac36a630a642ca941661db68b87f4 to your computer and use it in GitHub Desktop.
redef record HTTP::Info += { | |
potential_fname: string &optional; | |
}; | |
event http_request(c: connection, method: string, original_URI: string, | |
unescaped_URI: string, version: string) &priority=5 | |
{ | |
# Get rid of uri arguments | |
local path = split_string(c$http$uri, /\?/)[0]; | |
local out = split_string(path, /\//); | |
# Take the last component in the uri path | |
c$http$potential_fname = out[|out|-1]; | |
} | |
event http_header(c: connection, is_orig: bool, name: string, value: string) &priority=3 | |
{ | |
if ( is_orig ) | |
return; | |
if ( name == "ETAG" && /\"/ in value ) | |
{ | |
if ( c$http?$potential_fname && c$http$potential_fname != "" ) | |
c$http$current_entity$filename = c$http$potential_fname; | |
} | |
} | |
c$http$uri is already the unescaped_URI value
(I'm looking for the jaw-dropped emoji, but this prose will suffice.) I had raised that comment to set into motion (I wasn't exactly sure what the right approach would be) the consideration of implementing defensive code for the value space. Even though "c$http$uri is already the unescaped_URI value" which surprises me, so I was misguided to suggest piggybacking it on there, nonetheless somewhere along the chain from packet content to this script's potential_fname there is appropriately some filtering or validation for what will be allowed into the content that ends up in this field that consumers will treat as a filename.
I actually regularly suggest that people never use a given file name from traffic as a filename they store on disk. It's specifically why the way file extraction works in Zeek by default is to not use any direct attribute of the file as any part of the filename. We avoid the entire class of potential problem that you suggest.
If someone wants to use anything else for a filename, they're certainly more than welcome to do it (we won't ever do that by default in Zeek however).
Did I understand correctly that your concern lies in someone writing out file names to disk that were under the direction of a potential attacker?
My concern initially was that what things are called (variable names and types) in this area were leading programmers to a dangerous potential misconception When as a programmer I read: "filename", that has some intellectual baggage immediately attached, like: cannot contain the characters * : / \ or "
In the above code that pulls straight from c$http$uri into local path = split_string... that sets off alarm bells for me. It isn't safe to start holding it and processing it as a thing that is "filename-like" until it is sanitized. You captured the right spirit in the comment where you say: "...concern lies in someone writing out file names to disk that were under the direction of a potential attacker", but my concern is more generally. For instance passing it through a regex and then in whatever comes out the other side starting to immediately treat that as sanitized. Unless there is a meticulous and rigorous sanitization step, it might very well be exploitable, and worthy of concerns about taking direction from a potential attacker.
That things like Regular Expression Denial of Service in websocket-extensions as CVE-2020-7662 are still lurking until recently in widely-used code, is a wakeup call.
I fully concur with your suggestion that people never use a given file name from traffic as a filename. But I go further and suggest that programmers never use a given input string as a file name from traffic, until they go over it treating it like a byte string and doing substitution of all of its potentially dangerous (and thus usually not-relevant to legit use-cases) characters by an innocuous character, before ever starting to treat it as--or even move it into a routine where they call it--a filename.
I've been pondered (and reading many of the relevant RFC's) ever since I read "c$http$uri is already the unescaped_URI value". I can see the tremendous upside of thus enabling code to read, like in this example taken from https://docs.zeek.org/en/current/examples/httpmonitor/
1 2 3 4 5 | event http_reply(c: connection, version: string, code: count, reason: string)
{
if ( /^[hH][tT][tT][pP]:/ in c$http$uri && c$http$status_code == 200 )
print fmt("A local server is acting as an open proxy: %s", c$id$resp_h);
}
instead of needing to be:
if ( /^([hH]|%48|%68)([tT]|%54|%74)([tT]|%54|%74)([pP]|%50|%70):/ in c$http$uri && . . .
But I can't get the sick-feeling out of my stomach that the unescaped_URI value is like having converted it to nitroglycerin. How many IDS tools will malfunction if \x1A (the DOS end-of-file character) is put into uri field in a log?
Is your concern really just revolving around the fact that the zeek http log uses the url decoded version? I thought your concern was something related this gist?
If your concern is really just about the decoding of the url, there might be a larger concern. Anything consuming logs should be developed fairly defensively. Are you certain that all web servers wouldn't log if I sent a raw \x1A byte in a url? If the web server is just looking for a newline character to end the request, it's certainly possible that some HTTP server implementation could just shovel that raw byte into its log. Now you've gotten me curious and I may have to poke around some. ;)
D'oh. I missed your previous reply.... reading that now.
(replying to your previous comment about this script now...)
I think our documentation for the filename field indicates more information about the field...
A filename for the file if one is available from the source for the file. These will frequently come from “Content-Disposition” headers in network protocols.
It's not a very adept description, but I think that it does pretty clearly point out that the filename is merely an indicator of the name of the file collected from its source. Even the assumption that you made about characteristics of the filename isn't exactly correct because different systems will have different possible filenames and already from network traffic we're getting files names from HTTP, SMTP, SMB, FTP, and possible others that I'm not thinking about off the top of my head. Those different carrier protocols have different notions of valid filenames already.
We could change the name of the field and it wouldn't really matter, if someone wants to shoot themselves in the foot, they'll see a filename collected from the network and write it to disk with that name which is why I've been regularly pointing out for years not to do that. 😄
My only hope is to educate enough people that eventually enough people get the message that there will always be someone there to inform a user that hasn't been fully informed yet (and to also never implement any code publicly that does this because it's such a blindingly bad habit). Of course, as you can probably guess, I've definitely implemented it before just because it was fun.
This thread has taken on a life of its own, and really--I only started out thinking "I'll suggest some validation" and then "hey, the code can also as a convenience do the benign instances of unescaping while it validates", which engendered the first post. I am glad to have had the dive-deeper into what distinguishes an id vs a uri vs a filename in the zeek realms. I came into this discussion with a profound respect for the uri escape specification in https://tools.ietf.org/html/rfc3986. I think they correctly captured: pay with human-reader-inconvenience and do substitution of all potentially dangerous characters (a surprisingly fair-price since those usually are not-relevant to legit use-cases). https://tools.ietf.org/html/rfc8089 for filenames specifically builds upon that, and I think we will avoid trouble if we stay close to those principles.
I've been enjoying this discussion quite a bit!
The 8089 RFC actually points out...
Treating a non-local file URI as local, or otherwise attempting to
perform local operations on a non-local URI, can result in security
problems.
I would take it to mean that the 8089 RFC is actually not terribly relevant to anything we're doing with Zeek because basically everything we have comes from the network and would be classified as non-local.
My general perspective on Zeek logs is that they aren't necessarily "words" anyway. Internally in Zeek they're all just arbitrary sequences of bytes which means they can technically haves nulls or anything else in them and the moment we start to try protect something like the filename field as safe to use for arbitrary purposes is the day that we start getting ourselves into trouble. Basically none of the data coming out of Zeek is safe to do anything with if a user decodes escaped fields (since the tab separated data and json both escape non-printable stuff) they have to adopt the notion that any data could end up in any field.
Have you viewed Zeek logs as sequences of bytes where anything could end up in them or had you taken a different perspective on the logs? I'm sort of wondering if we just have a disconnect on the way we think about the logs.
I also think this discussion is still helping us each articulate and see some subtle but important things. Thanks for hanging in there, if I ever express myself too obtusely.
You identified something there at the last: "we have a disconnect on the way we think about the logs". I am relatively new to zeek (just a few months) and one of the top-of-mind thoughts I had was "I have got to code myself a customized LESS, that columnarizes and word-wraps when log viewing, at least as well as HTML tables and/or RTF does it." I think the human reader of logs is a vital audience, activating the penchant for pattern detection that the human brain is for-better-or-for-worse so prone to.
LESS has saved my bacon innumerable times. It can still let oneself get in a jam (for instance don't jump to end of file if the files ends with thousands of \x00, if you ever want to get your keyboard to respond again) but I had expected and I guess I am trying to here encourage zeek to regard "verbatim representation" as not as much of a boon to the zeek programmers and users, as a boon to the attackers. "unambiguous representation" I absolutely concur is something that the zeek programmers and users must have. But it can be expressed with a dialect that is constrained to only be comprised of benign characters.
Oh and a small clarification, so that we don't digress over a canard. I realize Zeek logs aren't sequences of bytes where anything could end up in them, because the tab separated data and json both escape non-printable stuff. But internally in Zeek I worry if in every datatype they're all just arbitrary sequences of bytes which means they can technically haves nulls or anything else in them. I would blanche if hash results could haves nulls or anything such in them. The point I am raising in this discussion is that programmers carry some semantic baggage as they read variable and type names. I blanche if a "filename" can contain a * or / or \. It needs to be termed a filepath if it is the '/' delimited hierarchy. It needs to be a fullpath if it is the filepath and filename concatenated. It needs to be a pattern if it can contain * or ?.
If you are pulling characters straight out of c$http$uri, suggest to pass them through a translation to revert the %25 back to % etc.