Issue
I'm using BigQuery as the database behind a Scrapy spider.
Below are two pipelines that store data in the database: one uses inserts, the other UPDATE queries. The update method is about 70 times slower than the insert (barely 20 updated records per minute): an update takes about 3.56 seconds while an insert takes only about 0.05 seconds. What am I doing wrong, and how can I speed up the update method?
P.S. The table currently holds around 20k records and could grow to 500,000. Records need to be updated daily.
Update method
# Define the update query
query = f"""
UPDATE `{self.dataset_id}.{self.table_id}`
SET `Sold Status` = '{data['Sold Status']}',
    `Amount of Views` = '{data['Amount of Views']}',
    `Amount of Likes` = '{data['Amount of Likes']}',
    `Sold Date & Time` = '{data['Sold Date & Time']}'
WHERE `Item number` = '{data['Item number']}'
"""
start_time = time.time()
# Run the update query
job = self.client.query(query)
# Wait for the job to complete
job.result()
# Check if the query was successful
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
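One correctness issue in the update snippet is that values are interpolated directly into the SQL string, which breaks on quotes in the data and is open to SQL injection. A minimal sketch using BigQuery query parameters instead (the parameter names and the `build_update_query`/`run_update` helpers are my own, not part of the original pipeline):

```python
def build_update_query(dataset_id, table_id):
    # Named parameters (@sold_status, ...) instead of f-string interpolation
    return f"""
    UPDATE `{dataset_id}.{table_id}`
    SET `Sold Status` = @sold_status,
        `Amount of Views` = @views,
        `Amount of Likes` = @likes,
        `Sold Date & Time` = @sold_at
    WHERE `Item number` = @item_number
    """

def run_update(client, dataset_id, table_id, data):
    # Imported here so this sketch loads even without google-cloud-bigquery installed
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("sold_status", "STRING", data["Sold Status"]),
            bigquery.ScalarQueryParameter("views", "STRING", data["Amount of Views"]),
            bigquery.ScalarQueryParameter("likes", "STRING", data["Amount of Likes"]),
            bigquery.ScalarQueryParameter("sold_at", "STRING", data["Sold Date & Time"]),
            bigquery.ScalarQueryParameter("item_number", "STRING", data["Item number"]),
        ]
    )
    job = client.query(build_update_query(dataset_id, table_id), job_config=job_config)
    job.result()  # raises on failure
```

This doesn't address the speed problem (each UPDATE is still a full query job), but it keeps the queries safe regardless of what the scraper returns.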
Insert method
start_time = time.time()
data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)
    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)
    # Define the rows to be inserted
    rows = [data]
    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)
    if errors == []:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors)
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
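The insert pipeline passes a one-element list to `insert_rows_json`, but that method accepts many rows per call, so buffering items and flushing in batches cuts API round-trips. A minimal sketch (the `BatchInserter` class and the batch size of 500 are my own assumptions, not from the original pipeline):

```python
class BatchInserter:
    """Buffer rows and stream them to BigQuery in batches."""

    def __init__(self, client, table_ref, batch_size=500):
        self.client = client
        self.table_ref = table_ref
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row):
        # Queue a row; flush automatically once the batch is full
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send all buffered rows in a single insert_rows_json call
        if not self.buffer:
            return []
        errors = self.client.insert_rows_json(self.table_ref, self.buffer)
        self.buffer = []
        return errors
```

A Scrapy pipeline could call `add()` from `process_item()` and `flush()` from `close_spider()` so the last partial batch is not lost.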
Solution
OK, so thanks to ChatGPT I found a workaround for this issue. What used to take my code 9 hours now takes under 15 minutes, a roughly 36-fold improvement.
Basically, what I do is:
- Create a new temporary (staging) table.
- Batch-append all scraped data as JSON to this table.
- Run an update from the staging table into my master table, which takes merely 5 seconds overall instead of ~5 seconds per update query. How come?
- Truncate the staging table so it's ready for the next run in a couple of hours. (FYI, you can't use a truncated table immediately; somewhere between 2 and 15 minutes must pass before the table is ready for inserts again.)
So far I'm pretty happy with the results and will stick with this solution for a while.
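The steps above can be sketched roughly as follows. The column names come from the question; the helper names, the staging-table name, and the use of a `MERGE` statement (one way to express "update master from staging" in a single job) are my assumptions. One hedged note: rows streamed in via `insert_rows_json` sit in a streaming buffer that DML can't modify for a while, which may explain the 2-15 minute wait after truncation; `load_table_from_json` runs a load job and avoids the streaming buffer.

```python
def build_merge_query(dataset_id, staging_table, master_table):
    # One MERGE updates every matched row in a single query job,
    # instead of one UPDATE job per record
    return f"""
    MERGE `{dataset_id}.{master_table}` AS m
    USING `{dataset_id}.{staging_table}` AS s
    ON m.`Item number` = s.`Item number`
    WHEN MATCHED THEN UPDATE SET
        m.`Sold Status` = s.`Sold Status`,
        m.`Amount of Views` = s.`Amount of Views`,
        m.`Amount of Likes` = s.`Amount of Likes`,
        m.`Sold Date & Time` = s.`Sold Date & Time`
    """

def daily_update(client, dataset_id, staging_table, master_table, rows):
    staging_ref = f"{dataset_id}.{staging_table}"
    # 1) Batch-load all scraped rows into the staging table (a load job, not streaming)
    client.load_table_from_json(rows, staging_ref).result()
    # 2) Apply all updates to the master table in one statement
    client.query(build_merge_query(dataset_id, staging_table, master_table)).result()
    # 3) Clear the staging table for the next run
    client.query(f"TRUNCATE TABLE `{staging_ref}`").result()
```

The per-job overhead (scheduling, scanning) is paid once per MERGE rather than once per record, which is why thousands of updates collapse into a few seconds.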
Answered By - JBJ