
I'm scraping some pages and extracting textual information and, where present, images. Since not all pages have images, I would like to create subfolders named after the page titles and store each page's images in its own subfolder. For example, if an image is found on a page titled Junior Architect, a subfolder named Junior Architect should be created in the IMAGES_STORE directory.

This is what I have so far in the spider:

class JobsSpider(scrapy.Spider):

[...]

    def parse_listing(self, response): #called by "parse" method

        [...]

        image_thumbs = response.xpath('//div[@id="thumbs"]/a/@href').extract()

        for image in image_thumbs:
            yield scrapy.Request(image, callback=self.parse_images, meta={'image': image})

        image_swipe_wrap = response.xpath('//div[@class="swipe-wrap"]/div/img/@src').extract()

        for image in image_swipe_wrap:
            yield scrapy.Request(image, callback=self.parse_images, meta={'image': image})

        [...]

        yield CraigslistItem(date=[date], link=[link], text=[text], compensation=[compensation],
                             employment_type=[employment_type], lat=[lat], long=[long],
                             description=[description])

    def parse_images(self, response):

        image_urls = response.meta['image']

        yield CraigslistItem(image_urls=[image_urls])

image_thumbs and image_swipe_wrap store the URLs of images to be downloaded.
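
One thing that may matter here: because each image URL is yielded as its own item from parse_images, the image items never carry the page title. A possible alternative, sketched under that assumption, is to merge the two lists and pass the result as image_urls on the same item that holds text. The helper name merge_image_urls is hypothetical and not part of Scrapy.

```python
def merge_image_urls(*url_lists):
    """Merge several URL lists into one, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

In parse_listing this could be used as image_urls=merge_image_urls(image_thumbs, image_swipe_wrap) in the single CraigslistItem that is already yielded, which would make the separate requests to parse_images unnecessary.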

This is my items.py:

class CraigslistItem(scrapy.Item):
    date = scrapy.Field()
    link = scrapy.Field()
    text = scrapy.Field()
    compensation = scrapy.Field()
    employment_type = scrapy.Field()
    lat = scrapy.Field()
    long = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

The settings.py:

ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 1,
   'craigslist.pipelines.CraigslistPipeline': 2
}

IMAGES_STORE = '/home/higo/Imagens/craiglist'

DOWNLOAD_DELAY = 2

The textual fields (date, link, text, compensation, employment_type, lat, long and description) are being extracted correctly. The image URLs are also extracted correctly, but so far I haven't found an approach that downloads the images and stores them in subfolders named after the page title (this information is stored in the text field): they are all being saved in IMAGES_STORE/full.

I've tried a couple of approaches in pipelines.py, without success. Could someone help me with an implementation that works the way I want?
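
For reference, the kind of path I'm after could be built roughly as below. This is only a sketch: image_subfolder_path is a hypothetical helper, the character filter is an assumption about which characters are unsafe in folder names, and the SHA-1 file name simply mirrors how the default pipeline names files under full/.

```python
import hashlib
import re


def image_subfolder_path(title, url):
    """Return '<sanitized title>/<sha1 of url>.jpg', relative to IMAGES_STORE."""
    # Replace characters that are commonly invalid in folder names (assumed set).
    folder = re.sub(r'[\\/:*?"<>|]+', "_", title).strip() or "untitled"
    # Hash the URL so every image gets a stable, unique file name.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{folder}/{digest}.jpg"
```

Presumably this would be called from a subclass of ImagesPipeline that overrides file_path(self, request, response=None, info=None, *, item=None), returning image_subfolder_path(item['text'][0], request.url). The item keyword argument is available in Scrapy 2.4 and later, and it only helps if the item that carries image_urls also carries text. ITEM_PIPELINES would then need to list that subclass instead of 'scrapy.pipelines.images.ImagesPipeline'.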
