Should an email address be able to begin with these characters? #61
Comments
Those are all completely valid characters to be used in email addresses, per RFC 5322. The only requirement for the local part of the domain with respect to first character is that it MUST not be a That said, I am not really aware of any providers which permit this as an address or alias, so.. perhaps it's OK to drop them? |
I might be inclined to include both stripped and non-stripped. They are valid characters, but most email address validation will fail them. |
I believe only the . is restricted from being the first or last character of the local-part, so I think you've gotta take the weirdos. "Without quotes, local-parts may consist of any combination of alphabetic characters, digits, or any of the special characters ! # $ % & ' * + - / = ? ^ _ ` . { | } ~ period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear." |
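As a rough illustration of the rule quoted above, here is a minimal Python sketch that checks an unquoted local-part: letters, digits and the listed special characters, with dots allowed only singly between other characters. The function name `is_unquoted_local_part` is made up for this example, and quoted local-parts and length limits are deliberately ignored.

```python
import re

# Special characters the quoted passage allows besides letters and digits.
SPECIALS = "!#$%&'*+-/=?^_`{|}~"
ATOM = "[A-Za-z0-9" + re.escape(SPECIALS) + "]+"

# One or more atoms joined by single dots, so a leading dot, a trailing dot
# and consecutive dots are all rejected.
DOT_ATOM = re.compile(ATOM + r"(?:\." + ATOM + ")*")


def is_unquoted_local_part(local: str) -> bool:
    """Return True if `local` is a valid unquoted local-part under the rule above."""
    return DOT_ATOM.fullmatch(local) is not None
```

Under this rule `{a` is accepted while `.a`, `a.` and `a..b` are rejected, which is why strict RFC adherence would keep the odd-looking addresses.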
Is there a pattern in the domain portion of the address? If it's all some set of custom domains, maybe they are valid. If it's gmail/yahoo/hotmail/outlook check the service's sign up rules, where those names are likely invalid. |
Perhaps, but I suspect that in this case those chars aren't a valid part of the address and are instead delimiters that have been inadvertently included. That leaves us with trading off between either strictly adhering to the RFC and excluding those addresses, or parsing them out and including them despite the strict RFC definition. Which is better? |
That gets very messy as it doubles up on the addresses which then affects all sorts of counts and stats. I'm really reticent to do that and would prefer to pick one approach over the other. |
No, lots of different domains, I'm highly confident we're looking at parsing issues, not deliberately funky addresses. |
I just sent myself an email starting with I'd also advocate for adhering to RFC 3696 wherever possible. I've seen issues when systems reject perfectly legitimate but rarely seen email addresses. One of my emails has the super-short format of |
For good measure, I also tested with Exchange (MS365) and Roundcube. No issues with either one. It seems like a definite edge case, but still worthwhile to allow all valid email addresses. |
If my email address is |
Even though these are valid characters, I feel that more often than not they are probably going to be delimiters. If including them means I won't be alerted on a@b.c when you hit {a@b.c, it'd be best to strip them. I do think the question again becomes: should you handle this in this tool or somewhere else in the HIBP infrastructure? |
👆This. I think we have about five nines certainty that these are other people's failure to parse lists properly. |
Do the email providers for these addresses support creating an address with the leading character? |
Timely, I've just been asking similar questions on email around "do I respect the RFCs or do I act sensibly / force others to act sensibly". Before doing research I was of the opinion of "force them to act sensibly".
Do the email providers for these addresses support creating an address with the leading character?
Using https://mailchimp.com/resources/most-used-email-service-providers/ as a reference.
Google: No
Yahoo: No
Outlook: No
AOL: No
Proton: No
Apple: No
Mail.com: No
GMX: No
Zoho: No
Personal Conclusion
tldr: strip them
Given that none of the above major email providers let you sign up using these characters, I think it is a fair assumption that these are failed delimiters. In the (probably rare) case that it is not a failed delimiter then you should have been sensible and unfortunately you will miss out on this HIBP notification. That said, I think there might be cases where it legitimately appears later in the address (i.e. where email providers allow the
Aside: this ends up being a rabbit hole if you try to normalise it as some providers let you use
The sending part is important in this context because (I assume) it will be used for notifications. Contrast that with searching on the web UI where you could have an index of normalised email addresses to help users who have taken advantage of things like
Questions
If there is a concern about users not receiving notifications could we (and by we I mean @troyhunt):
|
That was a really useful set of screen caps @nhairs, thank you! I think the only reasonable conclusion here is that due to the lack of support for these chars by mainstream email providers, the unlikelihood of anyone legitimately using them and the real world impact of an email address being missed, the chars should be stripped and the RFC be damned 🙂 That now makes this issue a feature request. I'll add some tests for this later, I need to go back and analyse precisely how those strings were formatted originally. |
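To make the stripping approach concrete, here is a minimal Python sketch, not the project's actual implementation. The `LEADING_JUNK` set and the function name `strip_leading_junk` are hypothetical; the real list of characters is whatever the maintainers settle on.

```python
# Hypothetical set of leading characters treated as parsing debris.
# The actual list discussed in this issue is not reproduced here.
LEADING_JUNK = "{'!#$%&*/=?^|~"


def strip_leading_junk(address: str) -> str:
    """Drop any run of junk characters from the start of the address only."""
    return address.lstrip(LEADING_JUNK)


# Stripping before de-duplication means "{a@b.c" and "a@b.c" count once,
# so the owner of a@b.c is still matched and the stats are not doubled up.
found = ["{a@b.c", "a@b.c", "'joe@example.com"]
unique = sorted({strip_leading_junk(a) for a in found})
```

Only the start of the string is touched, so a character that appears later in the alias is left for the stricter alias rules to deal with.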
You might consider implementing a feature switch that looks like |
I've just added a failing test for this (EmailAddressesInSingleQuotesWithTrailingSpaceIsExtracted) based on the following in the most recent breach I loaded: This subsequently extracted an address that began with a single quote and ended with the correct .br TLD. Let's add single quotes to the list of chars in the issue above and get this done. |
Related: You may want to check for emails in the form of These are valid emails according to the RFCs (though I'm not super clear on where they can still be used). For more info I'd check out the RFCs in the comments for |
I've just updated the readme with a "practical considerations" section that addresses this: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/main/README.md |
I've just committed cafb503 which is a kludgy fix for some common problems I was seeing. I've then added e73dd18 which caused a heap of dramas processing the (still alleged) AT&T breach to the point where I had to defer to the old script. This all looks related to problems with the delimiters and my view on it now is that we just take a strict approach. Would be awesome if someone took a good shot at fixing this. |
Why not use a regex? Besides that, a string.Replace replaces all the characters in the given text, not sure if you would want that. |
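The difference is easy to see in a small sketch. It is written in Python rather than the project's own language, and the sample string is invented; `str.replace` behaves like the `string.Replace` mentioned above in that it removes every occurrence, while an anchored regex only touches the start.

```python
import re

# Invented sample: a junk brace at the start and another inside the alias.
raw = "{jo{e@example.com"

# Plain replacement removes the character everywhere it occurs.
everywhere = raw.replace("{", "")

# A pattern anchored with ^ only removes the leading run.
leading_only = re.sub(r"^\{+", "", raw)
```

Here `everywhere` becomes `joe@example.com`, silently turning the input into a different address, whereas `leading_only` keeps the inner brace as `jo{e@example.com` so a later validation step can still reject it.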
I've gotta draw a line under this work and implement changes to make this app usable. So, screw the RFC, let's boil it down to the absolute basics and define a list of characters that can appear anywhere in the alias:
And additional alias rules:
Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above. By the time you take out Gmail, Outlook / Hotmail, Yahoo, you're not left with much. By example, the MediaWorks breach loaded a few days ago had 163k breached addresses and 120k were on those mail providers alone. In that data set was a grand total of 4 aliases that were exceptions to the above criteria:
3 of those were on the mail providers mentioned above that explicitly disallow those strings in the address and the 4th was on xtra.co.nz (I don't know if they permit those characters or not). Even if every single one of those was actually legitimate and incorrectly got rejected, we're looking at a 0.002% false-positive rate. Can anyone identify any practical exceptions to these rules? By "practical" I mean characters that are commonly used and broadly acceptable by both email providers and websites. I want to reiterate that the sole purpose of this project is to extract email addresses from strings in a data breach; it's far more likely that a valid address is surrounded by junk than it is that an obscure RFC-compliant character is part of a legitimate address. |
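For anyone wanting to check the numbers, the arithmetic can be reproduced with the rounded figures quoted in the comment (163k addresses, 120k on the major providers, 4 exceptions). The variable names are just for this sketch.

```python
# Rounded figures quoted above for the MediaWorks breach.
breached_total = 163_000
on_major_providers = 120_000
exceptions = 4

# Share of addresses hosted on the major providers, in percent.
major_share_pct = on_major_providers / breached_total * 100

# Worst case: every exception was legitimate and got wrongly rejected.
exception_rate_pct = exceptions / breached_total * 100

print(f"{major_share_pct:.1f}% on major providers")
print(f"{exception_rate_pct:.4f}% worst-case rejection rate")
```

The worst-case rate works out to roughly 0.0025%, in line with the rounded 0.002% figure quoted above.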
@troyhunt for subaddressing, maybe |
Probably a small corner case, but you can make an alias that starts with |
Ugh, markdown somehow cannibalised my original points 5 and 6, the dash character is already included. Doesn't matter whether it's used for subaddressing or not, it's allowed. |
Yuck! Will keep it in mind, let's see how the rest of the feedback goes but yeah, I'm feeling that's far more likely to be junk than a legitimate address. |
So, you can also start an alias with Haven't tested sending to or receiving from outside Google. |
I did very similar to @nhairs above, I used to keep a service provider set of rules. When processing I'd look at the domain and apply the standardization rules for that domain. Unrecognized domains were cause for an alert (after a while you've seen all the main ones). On the list, but never completed, was to do an MX lookup for vanity domains. Since we were in marketing, the decision was made that if we overscrubbed an address from some really minor provider, oh well. I am strongly in favor of the RFC being reviewed, IIRC technically upper and lower case should be treated differently. |
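A per-domain rule set like the one described could look roughly like the Python sketch below. Everything in it is illustrative: the `RULES` table, `gmail_rule` and `normalise` are hypothetical names, and the single Gmail rule (dots and anything after `+` ignored, case-insensitive) is an assumption about that provider's behaviour rather than something taken from this thread.

```python
from typing import Callable, Dict, Tuple


def gmail_rule(local: str) -> str:
    # Assumption: Gmail ignores case, dots, and anything after "+".
    return local.lower().split("+", 1)[0].replace(".", "")


# Hypothetical rule table keyed by domain; a real one would hold many entries.
RULES: Dict[str, Callable[[str], str]] = {"gmail.com": gmail_rule}


def normalise(address: str) -> Tuple[str, bool]:
    """Return (normalised address, whether the domain had a known rule)."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    rule = RULES.get(domain)
    if rule is not None:
        local = rule(local)
    # Unrecognised domains keep their local part untouched, including case.
    return f"{local}@{domain}", rule is not None
```

The boolean in the result is there so that unrecognised domains can be flagged for review, mirroring the alerting described above; the local part is left case-sensitive for them because only the provider knows whether case matters.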
https://www.jochentopf.com/email/chars.html has some considerations around this, coming to pretty much the same conclusion as above. I've seen |
That's a great reference @The-Compiler and if you just take the "OK" characters, it aligns perfectly with my conclusions. |
@troyhunt to confirm, it's just those characters at the beginning? Not checking for end or throughout the email address? |
That's a good question actually. Since originally creating this issue a year and a half ago, I think we've concluded that we can be much stricter in our definition with an acceptable false positive rate. @stebet, have we documented anywhere exactly what our rules are now? We should have that front and centre on this repo and also align it with the functions we have in SQL. Definitely just need one strict, canonical definition. |
This is more of an open question than something I think we should immediately do:
I've just run this across a breach I'm loading now that has about 2.2M addresses found. Over 1k of them at the end of the file begin with one of the following characters:
My gut feel is that these chars should be stripped and that there are no valid use cases where they should legitimately exist at the beginning of the address (or probably anywhere in the address). Just by way of example:
I'm more inclined to strip these than include them, what do we all think?