I have a collection that is going to receive approximately 30K documents. The collection starts empty and is populated with inserts. When I run my script a second time, the collection doesn't need to be updated entirely; only the products that don't already exist need to be inserted. What is the best way to do this?
I'm using the following code, based on this answer:
# Insert the product only if no document with the same Sku exists yet
if db.Catalog.count_documents({'Sku': prodCatalog["Sku"]}, limit=1) == 0:
    db.Catalog.insert_one(prodCatalog)
#db.Catalog.create_index([("Sku", pymongo.DESCENDING)])
The commented-out line is an attempt to speed up the insert process. Testing the code above gave the following results:
Without INDEXING = 15m7s
TEXTUAL INDEXING "Sku" = 14m40s
ASCENDING INDEXING "Sku" = 13m45s
DESCENDING INDEXING "Sku" = 14m
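For reference, the check-then-insert pattern being timed can be sketched as a self-contained function. This is only a sketch under assumptions: `insert_missing` and `products` are hypothetical names, `collection` stands for `db.Catalog`, and the index is assumed to be created once before the loop rather than once per document (the post does not say where the call sits).

```python
def insert_missing(collection, products):
    """Insert every product whose Sku is not yet stored; return how many were added.

    `collection` is assumed to behave like a pymongo Collection (e.g. db.Catalog);
    `products` is a hypothetical iterable of product dicts that each have a "Sku" key.
    """
    # Create the index once, up front, so each lookup below can use it.
    # 1 is the value of pymongo.ASCENDING; creating an existing index is a no-op.
    collection.create_index([("Sku", 1)])

    inserted = 0
    for prod in products:
        # limit=1 lets the server stop counting at the first match.
        if collection.count_documents({"Sku": prod["Sku"]}, limit=1) == 0:
            collection.insert_one(prod)
            inserted += 1
    return inserted
```

Note that this version still makes one or two round trips to the server per product, so the total time may be dominated by network latency rather than by the index lookup itself.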
Sku is a string identifier (digits only) for a product. It is unique for each product.
I've also tried using an upsert, but the results were not satisfactory.
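The exact upsert code that was tried is not shown here. As an illustration only, an upsert that leaves existing products untouched might look like the sketch below, assuming the `$setOnInsert` update operator is used so that the document is written only when no match exists. `upsert_missing` and `products` are hypothetical names, and `collection` stands for `db.Catalog`.

```python
def upsert_missing(collection, products):
    """Upsert each product by Sku without modifying documents that already exist.

    `collection` is assumed to behave like a pymongo Collection (e.g. db.Catalog);
    `products` is a hypothetical iterable of product dicts that each have a "Sku" key.
    Returns the number of documents that were newly inserted.
    """
    inserted = 0
    for prod in products:
        result = collection.update_one(
            {"Sku": prod["Sku"]},    # match on the unique identifier
            {"$setOnInsert": prod},  # applied only when the upsert inserts
            upsert=True,
        )
        # upserted_id is None when an existing document matched the filter.
        if result.upserted_id is not None:
            inserted += 1
    return inserted
```

This replaces the separate count and insert with a single call per product, but it is still one round trip per document.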
Is there a way to reduce the insert/update execution time to 10 min or less?