Issue
I have data that looks like this:
- www.r-computer.com
- www.rscompass.com
- www.italy.it
and so on.
I have written a script which looks like this:
data['website']=data['Website address'].str.split('www.').str[1]
data['website']=data['website'].str.split('.com').str[0]
The first line removes the "www." prefix, and the second line was intended to remove ".com" from the string. For the 1st and 2nd data points, the result should be:
- r-computer
- rscompass
But instead I am getting "r". So I think Python is not interpreting "." as a literal dot, but as any character before "com".
I would like to know how to remove suffixes such as ".ru", ".com", ".it", etc. Kindly help.
Solution
import re

def get_domain(s):
    # Strip the leading "www." and the final ".suffix", keeping what is in between.
    return re.sub(r"^www\.(.+)\.[^.]+$", r"\1", s)

print(get_domain("www.r-computer.com"))  # r-computer
(untested)
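As a rough sketch of how this could be applied to the sample values from the question, the function can be mapped over each address. A plain list stands in for the DataFrame column here; with pandas, something like data['Website address'].map(get_domain) should work the same way, assuming data is the DataFrame from the question.

```python
import re

def get_domain(s):
    # Strip the leading "www." and the final ".suffix".
    return re.sub(r"^www\.(.+)\.[^.]+$", r"\1", s)

# Sample values taken from the question.
addresses = ["www.r-computer.com", "www.rscompass.com", "www.italy.it"]
websites = [get_domain(a) for a in addresses]
print(websites)  # ['r-computer', 'rscompass', 'italy']

# A string that does not match the pattern is returned unchanged.
unchanged = get_domain("example.org")
print(unchanged)  # example.org
```

Note that re.sub leaves non-matching input as it is, so an address without "www." passes through untouched.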
To return both the site name and the suffix (com, org, etc.), or (None, None) if there is no match:
import re

def get_domain(s):
    # findall returns a list of (site name, suffix) tuples, one per match.
    ret = re.findall(r"^www\.(.+)\.([^.]+)$", s)
    return ret[0] if ret else (None, None)

# example
a, b = get_domain("www.italy.it")
if a and b:
    print(a)  # italy
    print(b)  # it
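On why the original split returned "r": by default, pandas' str.split treats a pattern longer than one character as a regular expression, so the "." in ".com" matches any character and "-com" inside "r-computer" is found first. The sketch below reproduces this with the standard re module rather than pandas, which is assumed to behave the same way here.

```python
import re

name = "r-computer.com"  # what is left after stripping "www."

# As a regex, ".com" means "any character followed by com",
# so the first match is "-com" inside "r-computer".
as_regex = re.split(".com", name)

# Escaping the pattern makes the dot match only a literal ".".
as_literal = re.split(re.escape(".com"), name)

print(as_regex[0])    # r
print(as_literal[0])  # r-computer
```

In recent pandas versions, passing regex=False to str.split should have the same effect as escaping the pattern.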
Answered By - andole