
I'm looping over the IDs in a JSON response and yielding a FormRequest for each one; the callback creates, fills, and yields an item. The only problem: exactly one item is produced, no matter how many times the loop runs, and I can't figure out why.

import json
import logging

from scrapy import FormRequest


def access_data(self, res):
    # Receive all the IDs, then request the details for each one.
    res_json = json.loads(res.body.decode("utf-8"))
    for a in res_json['data']:
        logging.warning(a['id'])
        req = FormRequest(
            url='https://my_url',
            cookies=my_cookies,  # dict of cookies
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'}
        )
        yield req

def fill_details(self, res):
    logging.warning("annonce")
    item = MyItem()
    item['html'] = res.xpath('//body//text()').extract()
    item['existe'] = True
    item['ip_proxy'] = None
    item['launch_time'] = str(mySpider.init_time)
    yield item

To be sure everything is clear: when I run this, the log message "annonce" is printed only once, while the logging of a['id'] inside my request loop is printed many times, and I can't find a way to fix this.

Ayra
  • Please also share some or all of the job logs, if that doesn't bother you (i.e. if they won't expose any sensitive data of yours). – starrify Dec 06 '18 at 14:01
  • No problem. The numbers are the IDs that I print during the loop, while the word "annonce" is printed when the item-writing function starts:
    2018-12-06 14:29:32 [root] WARNING: 22297026
    2018-12-06 14:29:32 [root] WARNING: 21037236
    2018-12-06 14:29:32 [root] WARNING: 19488143
    2018-12-06 14:29:32 [root] WARNING: 18730440
    2018-12-06 14:29:33 [root] WARNING: annonce
    2018-12-06 14:30:25 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 4 pages/min), scraped 1 items (at 1 items/min)
    2018-12-06 14:33:38 [scrapy.core.engine] INFO: Closing spider (finished)
    – Ayra Dec 06 '18 at 14:17
  • I don't know how to force a line break in a comment, sorry. – Ayra Dec 06 '18 at 14:18

1 Answer


I found the way! In case anyone has the same problem: since my URL is always the same (only the formdata changes), Scrapy's duplicate filter takes over and drops the requests as duplicates. Set dont_filter to True on the FormRequest to make it work.
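For reference, here is what the change looks like applied to the request from the question (a minimal sketch; the URL, cookies variable, and form field name are the placeholders from the original code):

req = FormRequest(
    url='https://my_url',
    cookies=my_cookies,
    method='POST',
    callback=self.fill_details,
    formdata={'valeur': str(a['id'])},
    headers={'X-Requested-With': 'XMLHttpRequest'},
    dont_filter=True,  # bypass Scrapy's duplicate filter for this request
)
yield req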

Ayra
  • I agree that the duplicate filter is very likely the cause, but Scrapy's default duplicate filter is designed to also take the body into account for POST requests. It is therefore rather unexpected that the requests are being filtered when you provide different `formdata`. There must be more to dig into, but unfortunately I don't have sufficient details. Anyway, congrats on getting the issue resolved. – starrify Dec 06 '18 at 16:41
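A quick way to check starrify's point yourself (this snippet is not from the thread; it uses Scrapy's request_fingerprint helper, which existed in the Scrapy versions current at the time and was later deprecated in favour of scrapy.utils.request.fingerprint): two POST requests to the same URL with different form bodies get different fingerprints, so the default RFPDupeFilter should not treat them as duplicates.

from scrapy import FormRequest
from scrapy.utils.request import request_fingerprint

# Two requests to the same URL whose only difference is the POST body.
r1 = FormRequest('https://my_url', formdata={'valeur': '22297026'})
r2 = FormRequest('https://my_url', formdata={'valeur': '21037236'})

# The default fingerprint hashes the method, URL, and body,
# so these two values differ.
print(request_fingerprint(r1))
print(request_fingerprint(r2))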