I'm scraping some pages and extracting textual information and, when present, images. Since not all pages have images, I would like to create subfolders named after the page titles and store each page's images in the matching subfolder. For example, if an image is found on a page titled Junior Architect, a subfolder named Junior Architect should be created in the IMAGES_STORE directory.
This is what I have so far in the spider code:
class JobsSpider(scrapy.Spider):
    [...]
    def parse_listing(self, response):  # called by "parse" method
        [...]
        image_thumbs = response.xpath('//div[@id="thumbs"]/a/@href').extract()
        for image in image_thumbs:
            yield scrapy.Request(image, callback=self.parse_images, meta={'image': image})
        image_swipe_wrap = response.xpath('//div[@class="swipe-wrap"]/div/img/@src').extract()
        for image in image_swipe_wrap:
            yield scrapy.Request(image, callback=self.parse_images, meta={'image': image})
        [...]
        yield CraigslistItem(date=[date], link=[link], text=[text], compensation=[compensation],
                             employment_type=[employment_type], lat=[lat], long=[long],
                             description=[description])

    def parse_images(self, response):
        image_urls = response.meta['image']
        yield CraigslistItem(image_urls=[image_urls])
image_thumbs and image_swipe_wrap store the URLs of images to be downloaded.
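One thing worth noting about the spider above is that the image-only items yielded by parse_images do not carry the page title, so a pipeline cannot know which subfolder they belong to. A minimal sketch of an alternative, assuming the URLs can simply be attached to the same item as the title: the helper names `collect_image_urls` and `build_listing_item` are hypothetical, and a plain dict stands in for CraigslistItem so the snippet runs without Scrapy.

```python
from urllib.parse import urljoin


def collect_image_urls(page_url, *url_groups):
    """Merge several lists of image URLs into one de-duplicated list.

    Relative URLs are resolved against page_url and the order of first
    appearance is preserved.
    """
    seen = {}
    for group in url_groups:
        for url in group:
            # dict keys keep insertion order, which gives an ordered "set"
            seen.setdefault(urljoin(page_url, url), None)
    return list(seen)


def build_listing_item(page_url, title, image_thumbs, image_swipe_wrap):
    """Return one item that holds both the title and its image URLs."""
    return {
        # values are wrapped in lists, mirroring text=[text] in the spider
        "text": [title],
        "image_urls": collect_image_urls(page_url, image_thumbs, image_swipe_wrap),
    }


# Example with made-up URLs (example.org is a placeholder domain)
item = build_listing_item(
    "https://example.org/jobs/1.html",
    "Junior Architect",
    ["/img/a.jpg"],
    ["/img/a.jpg", "https://example.org/img/b.jpg"],
)
```

In parse_listing this would amount to passing image_urls=... to the same CraigslistItem that already receives text=[text], instead of issuing one scrapy.Request per image.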
This is my items.py:
class CraigslistItem(scrapy.Item):
    date = scrapy.Field()
    link = scrapy.Field()
    text = scrapy.Field()
    compensation = scrapy.Field()
    employment_type = scrapy.Field()
    lat = scrapy.Field()
    long = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
The settings.py:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'craigslist.pipelines.CraigslistPipeline': 2,
}
IMAGES_STORE = '/home/higo/Imagens/craiglist'
DOWNLOAD_DELAY = 2
The textual information (date, link, text, compensation, employment_type, lat, long and description) is being extracted correctly. The images are also being extracted correctly, but so far I haven't found an approach that downloads them and stores them in subfolders named after the page title (this information is stored in the text field of the item): they are all being saved in IMAGES_STORE/full.
I've tried, without success, some approaches in pipelines.py like this and this one. Could someone help me with an implementation that works the way I want?
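For reference, one possible direction is to override the `file_path` method of the images pipeline so that the returned relative path starts with the title instead of `full`. The sketch below is an assumption, not a tested solution for this project: `TitleFolderPathMixin` and `safe_folder_name` are hypothetical names, the mixin is kept free of Scrapy imports so it can be run on its own, and it assumes the title sits in the item's text field as a one-element list.

```python
import hashlib
import re


def safe_folder_name(title):
    """Turn a page title into a single, filesystem-friendly folder name."""
    # Replace path separators and characters that are invalid on common
    # filesystems, then trim spaces, dots and underscores from both ends.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]+', "_", title).strip(" ._")
    return cleaned or "untitled"


class TitleFolderPathMixin:
    """Hypothetical mixin: makes images land in IMAGES_STORE/<title>/."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # The items in the question wrap every value in a list (text=[text]).
        text = item.get("text") if item is not None else None
        if isinstance(text, (list, tuple)):
            text = text[0] if text else ""
        # Name the file after a SHA-1 hash of the image URL, so two images
        # from the same page never collide inside the title folder.
        image_id = hashlib.sha1(request.url.encode("utf-8")).hexdigest()
        return f"{safe_folder_name(text or '')}/{image_id}.jpg"
```

In pipelines.py this would presumably be combined with Scrapy's own class, for example `class TitleFolderImagesPipeline(TitleFolderPathMixin, ImagesPipeline): pass`, with that class listed in ITEM_PIPELINES in place of 'scrapy.pipelines.images.ImagesPipeline'. It relies on the `item` keyword argument that `file_path` receives in Scrapy 2.4 and later, and on the image URLs being on the same item as the title; the items yielded by parse_images carry only image_urls, so for them the sketch would fall back to an "untitled" folder.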