Issue
I have a problem with parsing a server response: I found that some of the links are wrong. I tried to remove all of the links that end with .txt from this list:
out1 = ['https://www.itu.int./htmldoc.asp?doc=t\\rec\\q\\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt',
'https://www.itu.int/dms_pubrec/itu-t/rec/q/T-REC-Q.1248.1-200107-I!!SUM-HTM-E.htm',
'https://www.itu.int./htmldoc.asp?doc=t\\rec\\q\\T-REC-Q.1238.4-200006-I!!SUM-TXT-E.txt',
'https://www.itu.int./htmldoc.asp?doc=t\\rec\\x\\T-REC-X.42-200003-S!!SUM-TXT-E.txt',
'https://www.itu.int/rec/recommendation.asp?lang=en&parent=T-REC-X.51-198811-I',]
This is the output I get:
a = ['https://www.itu.int/dms_pubrec/itu-t/rec/q/T-REC-Q.1248.1-200107-I!!SUM-HTM-E.htm',
'https://www.itu.int./htmldoc.asp?doc=t\\rec\\x\\T-REC-X.42-200003-S!!SUM-TXT-E.txt']
My code:
for ii in out1:
    if ii.find('.txt'):
        out1.remove(ii)
print(out1)
How can I delete the wrong links that end with .txt? Thank you.
Update: I have since written this:
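import re

r_list = []
for ii in out1:
    # Replace any link that runs from 'http' through 'txt' with an empty string
    d = re.sub(r'http\S+txt', '', ii)
    r_list.append(d)
# Drop the empty strings left behind by the substitution
res = list(filter(lambda x: x, r_list))
print(res)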
Solution
As mentioned, regex is not necessary, but if you would like to use it, search for the ending:
import re
[l for l in out1 if not re.search(r'\.txt$',l)]
Without regex, simply using endswith() will do the same job:
[l for l in out1 if not l.endswith('.txt')]
Both will give you a cleaned list:
['https://www.itu.int/dms_pubrec/itu-t/rec/q/T-REC-Q.1248.1-200107-I!!SUM-HTM-E.htm','https://www.itu.int/rec/recommendation.asp?lang=en&parent=T-REC-X.51-198811-I']
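It may also help to see why the loop in the question misbehaves. The sketch below is only an illustration: the example.org links and the variable names are made up for brevity and are not the original data. It shows two likely causes: str.find() returns an index (or -1), both of which are truthy here, and removing items from a list while iterating over it makes the loop skip elements.

```python
# Hypothetical sample data, shortened stand-ins for the links in the question.
links = [
    "https://example.org/a.txt",
    "https://example.org/b.htm",
    "https://example.org/c.txt",
    "https://example.org/d.txt",
    "https://example.org/e.asp?lang=en",
]

# 1) str.find() returns the match index, or -1 when nothing is found.
#    Only index 0 is falsy, so `if s.find('.txt'):` passes for every link here.
find_results = [s.find(".txt") for s in links]
always_true = all(bool(r) for r in find_results)

# 2) Removing from a list while looping over it skips elements, because
#    the remaining items shift left underneath the loop's position.
buggy = list(links)
for s in buggy:
    if s.find(".txt"):
        buggy.remove(s)

# Fix: test the suffix explicitly and build a new list.
cleaned = [s for s in links if not s.endswith(".txt")]

# If the same list object has to be updated in place, assign to a slice.
in_place = list(links)
in_place[:] = [s for s in in_place if not s.endswith(".txt")]

print(find_results)
print(buggy)
print(cleaned)
```

With this sample data, buggy ends up holding the .htm link and one leftover .txt link, the same pattern as the output in the question, while cleaned and in_place keep only the links that do not end with .txt.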
Answered By - HedgeHog