Issue
I have this file, let's call it bs4_scraper.py. To give some context:
- The scrap function is an async function that makes asynchronous requests to the website.
- The get_pids_from_file_generator function reads a .txt file, adds each line (a pid) to a generator, and returns it.
async def bs4_scraper():
    limit = Semaphore(8)
    tasks = []
    pids = get_pids_from_file_generator()
    for pid in pids:
        task = create_task(scrap(pid, fake_header(), limit))
        tasks.append(task)
    result = await gather(*tasks)
    return result
if __name__ == "__main__":
    try:
        run(bs4_scraper())
    except Exception as e:
        logger.error(e)
When I run this file in the terminal using python bs4_scraper.py, the function runs and exits gracefully when all requests are done. No problem up to this point (I think).
Now I have this separate file, which is a Scrapy pipeline that runs at the end of the scraping process:
class WritePidErrorsPipeline:
    def close_spider(self, spider):
        pid_errors_file = generate_pid_errors_file()
        pg = PostgresDB()
        non_inserted_ids = pg.select_non_inserted_ids(pid_errors_file)
        if non_inserted_ids:
            self.insertion_errors_file(non_inserted_ids)
            bs4_file = os.path.abspath("bs4/bs4_scraper.py")
            exec(open(bs4_file).read())  # THE PROBLEM IS RIGHT HERE
        else:
            logger.info("[SUCCESS]: There are no items missing")

    def insertion_errors_file(
        self,
        non_inserted_ids: List[Tuple[str]],
        output_file: str = "insertion_errors.log",
    ) -> str:
        with open(output_file, "w", encoding="utf-8") as f:
            for non_inserted_id in non_inserted_ids:
                f.write(f"{non_inserted_id[0]}\n")
        return output_file
The problem occurs at the line exec(open(bs4_file).read()). The file is called and the function runs properly, but when it is done, it does not exit and keeps running after the last successful request. It looks like a zombie process, and I have no idea why this happens.
How do I improve this code so it runs as expected?
PS: Sorry for any English mistakes.
Solution
Are you sure the file actually runs and hangs after it finishes? Because an obvious problem there is the guard if __name__ == "__main__": at the end of your file: this is a code pattern meant to ensure the guarded part will only run when that file - the one containing the line if __name__ == "__main__": - is the main file called by Python.
When running Scrapy, IIRC, the main file is one of Scrapy's own scripts, which in turn imports the file containing your pipeline: at that point, the variable __name__ no longer contains "__main__" - rather, it is equal to the module's filename, sans .py.
The outer __name__ variable simply propagates into the exec body if you don't provide a custom globals dict as the second parameter - so, just by looking at your code, what can be said is that the bs4_scraper function will never be called.
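To make that concrete, here is a minimal sketch (the module name is made up) of how __name__ leaks into exec'd code when no globals dict is passed:
# Suppose this code lives in a module that Scrapy imports, e.g. "pipelines",
# so its __name__ is "pipelines" rather than "__main__".
snippet = """
print(__name__)               # prints the caller's __name__, e.g. "pipelines"
if __name__ == "__main__":
    print("guard fired")
"""

exec(snippet)                            # reuses the caller's globals, so the guard is skipped
exec(snippet, {"__name__": "__main__"})  # custom globals with a forced __name__, so the guard fires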
The fact that you truncated your files, throwing away the import statements, makes it HARD to give you a definite answer - I suppose that in the pipeline file (or in the script) you have something like from asyncio import run. Please - these are not optional; they are necessary for anyone reviewing your code to know what is going on.
Either way, you must have such an import, or the code would not work in the circumstances you described - so, if the problem is what I had to guess here, you could fix it by setting the __name__ variable to "__main__" inside the exec statement. But then we get to the other side: WHY this exec approach at all? You are running a Python program, reading a Python file, and issuing a statement to compile it from text so that the code can be run - when you could just import the file and call a function.
So, you can fix your code just by making it behave like a program, not forcing one file to be read as "text" and exec-ed:
import sys
from pathlib import Path
import asyncio

class WritePidErrorsPipeline:
    def close_spider(self, spider):
        ...
        if non_inserted_ids:
            self.insertion_errors_file(non_inserted_ids)
            bs4_dir = Path("bs4").absolute()
            # make sure the bs4 directory is importable
            if str(bs4_dir) not in sys.path:
                sys.path.insert(0, str(bs4_dir))
            import bs4_scraper
            result = asyncio.run(bs4_scraper.bs4_scraper())
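This way bs4_scraper is imported as a regular module, its if __name__ == "__main__": block is simply skipped, and the coroutine is started explicitly with asyncio.run - no reading source as text, no exec.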
Answered By - jsbueno