Saturday, January 13, 2024

[FIXED] Extract email sub-strings from large document


Issue

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<[email protected]>...

What is the best way to have Python cycle through the entire .txt file, look for all instances of a certain @domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of the different addresses.

Solution

This code extracts the first email address found in a string. Use it while reading the file line by line.

>>> import re
>>> line = "should we use regex more often? let me know at  [email protected]"
>>> match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'[email protected]'

If a line contains several email addresses, use findall:

>>> line = "should we use regex more often? let me know at  [email protected] or [email protected]"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['[email protected]', '[email protected]']
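
For the exact case in the question (addresses wrapped in <...> and filtered by one domain), the line-by-line approach could be sketched as follows. This is only a sketch under assumptions: the function name extract_addresses, the sample lines, and the example.com domain are made up for illustration and do not come from the original post.

```python
import re
from typing import Iterable, List


def extract_addresses(lines: Iterable[str], domain: str) -> List[str]:
    """Collect every <local@domain> address whose domain equals `domain`."""
    # [^<>@\s]+ matches the variable-length local part; re.escape keeps
    # the dots in the domain literal instead of treating them as wildcards.
    pattern = re.compile(r"<([^<>@\s]+@" + re.escape(domain) + r")>", re.IGNORECASE)
    found = []
    for line in lines:  # a file object yields one line at a time
        # With a single capture group, findall returns the group contents,
        # i.e. the address without the angle brackets.
        found.extend(pattern.findall(line))
    return found


# Hypothetical sample data standing in for the real file.
sample = [
    "From: Ann <ann@example.com> to <bob.smith@example.org>\n",
    "cc <carol+lists@example.com>, <dave@other.example.com>\n",
]
addresses = extract_addresses(sample, "example.com")
print(addresses)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so the whole file never has to be held in memory at once.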

The regex above probably finds most common, non-fake email addresses. If you want to be fully aligned with RFC 5322, you should check which email addresses follow the specification. Consult it to avoid bugs in finding email addresses correctly.

Edit: as suggested in a comment by @kostek: in the string Contact us at [email protected]. my regex returns [email protected]. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture [email protected] as well.

Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad@ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
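
The properties described in Edit III can be checked with a short script. This is a rough sketch: the sample sentence and every address in it are invented for illustration. It also shows that this pattern can still pick up a trailing sentence dot, and that stripping it afterwards is one simple workaround (an assumption on my part, not something the answer prescribes).

```python
import re

# Pattern from the answer above.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical sample text; none of these addresses come from the original post.
text = (
    "Write to first.last+tag@mail.example.co.uk, "
    "not to bad@ss, or to info@example.com."
)

raw = EMAIL_RE.findall(text)
# The final character class includes '.', so sentence punctuation directly
# after an address is captured too; remove trailing dots afterwards.
cleaned = [match.rstrip(".") for match in raw]

print(raw)
print(cleaned)
```

Here bad@ss is skipped because its domain has no period, while the multi-segment domain is matched in full.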

Update 2023: it seems stackabuse has compiled a post based on the popular SO answer mentioned above.

import re

regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")

def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")

isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")

Update 2024 (with GPT-4 hints and improvements):

import re

# Compiling the regex pattern for email validation
regex = re.compile(
    r"(?i)"  # Case-insensitive matching
    r"(?:[A-Z0-9!#$%&'*+/=?^_`{|}~-]+"  # Unquoted local part
    r"(?:\.[A-Z0-9!#$%&'*+/=?^_`{|}~-]+)*"  # Dot-separated atoms in local part
    r"|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]"  # Quoted strings
    r"|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")"  # Escaped characters in local part
    r"@"  # Separator
    r"[A-Z0-9](?:[A-Z0-9-]*[A-Z0-9])?"  # First domain label
    r"(?:\.[A-Z0-9](?:[A-Z0-9-]*[A-Z0-9])?)+"  # One or more further dot-separated labels (subdomains and top-level domain)
)

def isValid(email):
    """Check if the given email address is valid."""
    return "Valid email" if re.fullmatch(regex, email) else "Invalid email"

# Example Usage
print(isValid("[email protected]"))
print(isValid("[email protected]"))
print(isValid("[email protected]"))
print(isValid("[email protected]"))

Answered By - 0x90

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
