Issue
When I use BeautifulSoup, the href attribute comes back as the following string:
"/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
What is the easiest way to extract only the "http://....pdf" part so that I can download the file?
import re

for link in soup.findAll('a'):
    try:
        href = link['href']
        # print only links whose href contains ".pdf"
        if re.search(r'\.pdf', href):
            print(href)
    except KeyError:
        # anchor tag without an href attribute
        pass
Solution
How consistent is the format of these hrefs? If they always look like the one above, this works without a regex:
href.split('q=')[1].split('&')[0]
This should also do it:
href[7:href.index('&')]  # 7 == len('/url?q='); the slice end is exclusive, so no +1
They both seem to work in my interactive terminal:
>>> s = "/url?q=http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA"
>>>
>>> s[7:s.index('&')]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>
>>> s.split('q=')[1].split('&')[0]
'http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf'
>>>
You can also get there with this regex:
>>> import re
>>>
>>> re.findall(r'http://.*?\.pdf', s)
['http://druid8.sit.aau.dk/acc_papers/kdln4ccpef78ielqg01fuabr81s1.pdf']
>>>
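If the hrefs are not always this regular (for example, the target URL is percent-encoded or q is not the first parameter), the slicing and splitting above can return the wrong piece. A minimal sketch of a more defensive variant, using the standard library's URL parser rather than anything from the answer above, might look like this. The function name extract_target is hypothetical, and the sketch assumes the target always sits in the q parameter:

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2


def extract_target(href):
    """Return the value of the 'q' query parameter, or None if it is absent.

    extract_target is a hypothetical helper name; it assumes the real URL
    is carried in the 'q' parameter, as in the href shown in the question.
    """
    query = urlparse(href).query      # everything after the '?'
    values = parse_qs(query).get('q')  # parse_qs maps each name to a list of values
    return values[0] if values else None


s = ("/url?q=http://druid8.sit.aau.dk/acc_papers/"
     "kdln4ccpef78ielqg01fuabr81s1.pdf&sa=U&ei=HkNsUauqN_GQiAf5p4CwDg"
     "&ved=0CDkQFjAJ&usg=AFQjCNGk0DTzu2K2ieIKS-SXAeS5-VYTgA")
print(extract_target(s))
```

Unlike the string operations, parse_qs also decodes percent-escapes, and hrefs without a q parameter yield None instead of raising an IndexError or ValueError.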
Answered By - g.d.d.c