4

I'm trying to override default path full/hash.jpg to <dynamic>/hash.jpg, I've tried How to download scrapy images in a dyanmic folder using following code:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))

        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

but I get:

Traceback (most recent call last):
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
    --- <exception caught here> ---
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
    exceptions.OSError: [Errno 2] No such file or directory

I don't know what's wrong, also is there any other way to change the path? Thanks

Community
  • 1
  • 1
eneepo
  • 1,287
  • 3
  • 19
  • 32
  • Can you print the path variable and verify it is a valid path? Can you also copyh full error traceback? I'd imagine os.rename() is where the prob is? – Dashing Adam Hughes Jan 18 '15 at 07:24
  • I added the full traceback. Also I printed the paths: `full/bc404f7f5e2ef9732d96d349f87cc66fa9f4479f.jpg` and `hola/bc404f7f5e2ef9732d96d349f87cc66fa9f4479f.jpg`. – eneepo Jan 18 '15 at 07:55
  • I agree with you I think problem is with os.rename(). Shouldn't paths be absolute? Thanks – eneepo Jan 18 '15 at 07:56
  • Are you on windows by chance? From os.rename, says On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file. So it could be the file already exists and it's giving you an misleading error? – Dashing Adam Hughes Jan 18 '15 at 08:46
  • No, I'm on linux. You mentioned a good thing but what if dst folder doesn't exists? Will os.rename create it? I should check it out – eneepo Jan 18 '15 at 10:18
  • Thanks, it worked. By creating directories before os.rename problem solved! – eneepo Jan 18 '15 at 12:17

4 Answers4

7

I have created a pipeline inherited from ImagesPipeline and overridden file_path method and used it instead of standard ImagesPipeline

class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)
slavugan
  • 1,334
  • 16
  • 17
  • 1
    Can you please show us all the files code item.py, spider.py and all important. I am new to scrapy and this code is not working for me. – The Gr8 Adakron Jun 20 '17 at 12:31
  • do you using scrapy with django?, maybe you forgot to add StoreImgPipeline to pipelines instead of origin ImagesPipeline in spider config. – slavugan Jun 20 '17 at 17:47
  • No! I was simply using scrapy start projects like a basic scrapper but I want to categorize all the images into diffrnt folders, everything is fine it's still going in /full/ folder. what is this "realty-sc" it's a directory I should create right? – The Gr8 Adakron Jun 21 '17 at 06:40
  • 1
    'realty-sc' it is just folder where I wanted to store images. scrapy should create it automatically. file_path function just returns path where scraped file will be saved, you can make it what you like. – slavugan Jun 21 '17 at 10:52
  • Thanx for helping. It worked but otherways around using mapping. – The Gr8 Adakron Jun 21 '17 at 10:56
3

Problem raises because dst folder doesn't exists, and quick solution is:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])


        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
eneepo
  • 1,287
  • 3
  • 19
  • 32
  • 1
    I researched and understood the expression `for result in []` and `x in results` but have not been able to find any documentation to help me understand what the condition `for result in [x for ok, x in results if ok]:` means. I have searched "for statements in python" and var in list if var. Can someone point me in the right direction? – secuaz Jan 29 '15 at 10:48
  • 2
    You should search for `List Comprehensions`, please read [this tutorial](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) on puthon docs. I hope it solves your issue, let me know if it was still unclear. – eneepo Jan 29 '15 at 16:18
  • I still get the error `[Errno 2] No such file or directory` I am writing a new question with my code in it. I don't see why It wont work – secuaz Feb 02 '15 at 16:04
  • 1
    Including your imports with python code is always a nice thing to do. – Ashley Feb 06 '16 at 19:07
  • 1
    Why not changing path/name of image in `def file_path` instead of doing it in `def item_completed`, because `item_completed` has already downloaded images, while `def file_path` is called before image download – Umair Ayub May 29 '17 at 15:19
0

To dynamically set the path for images downloaded by a scrapy spider prior to downloading images rather than moving them afterward, I created a custom pipeline overriding the get_media_requests and file_path methods.

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [Request(url, meta={'f1':item.get('field1'), 'f2':item.get('field2'), 'f3':item.get('field3'), 'f4':item.get('field4')}) for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                      'please use file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'], request.meta['f3'], request.meta['f4'], image_guid)

This approach assumes you define a scrapy.Item in your spider and replace, e.g., "field1" with your particular field name. Setting Request.meta in get_media_requests allows item field values to be used in setting download directories for each item, as shown in the return statement for file_path. Scrapy will create the directories automatically if they don't exist.

Custom pipeline class definitions are saved in my project's pipelines.py. Methods here are adapted directly from the default scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Includes and additional methods can be copied from that file as needed.

Crystal
  • 1
  • 1
-1

the solution that @neelix give is the best one , but i'm trying to use it and i found some strange results , some documents are moved but not all the documents. So i replaced :

if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

and i imported shutil library , then my code is :

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])


        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        shutil.move(path, target_path)

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

i hope that it will work also for u guys :)

midodesign
  • 248
  • 3
  • 8