
Input: an image URL as it appears in at least one Common Crawl snapshot (I also have the image itself available). Output: pages in Common Crawl that link to or embed the image.

There is clip-retrieval, which lets me search for images with an image or text query; the results consist of URLs of matching images.
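For reference, this is roughly how I query it; the backend URL and index name below are my understanding of the publicly documented defaults, so treat them as assumptions:

```python
# Sketch of a clip-retrieval query; the service URL and index name are assumptions
# and may need to be adjusted to whatever backend is currently available.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # assumed public knn backend
    indice_name="laion5B-L-14",              # assumed index name
)

# Query by text or by a local image; each result carries the *image* URL plus
# caption and similarity, but no page that links to or embeds the image.
results = client.query(text="red vintage bicycle")
# results = client.query(image="my_query_image.jpg")
for r in results[:5]:
    print(r["url"], r.get("caption"))
```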

This is great, but I also need at least one page linking to each image so I can extract some metadata. I could obviously get this by going through the entire Common Crawl, looking for pages that link to any of my images, and saving that data alongside each image, but that is really expensive for a one-off search.
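To illustrate the cost, this is roughly what that brute-force scan would look like for a single WAT file (Common Crawl's extracted link metadata). The target URL is a placeholder, the JSON field names follow my reading of the WAT format, and links in WAT records are often relative, so a real scan would also need URL resolution and would have to run over tens of thousands of these files per crawl:

```python
# Brute-force sketch: find pages in one WAT file that link to / embed a target URL.
# Assumes warcio is installed; field names are my assumption from the WAT docs.
import json
from warcio.archiveiterator import ArchiveIterator

TARGET = "https://example.com/images/photo.jpg"  # placeholder image URL

def pages_linking_to(wat_path, target):
    hits = []
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            try:
                envelope = json.loads(record.content_stream().read())["Envelope"]
            except (ValueError, KeyError):
                continue
            html_meta = (envelope.get("Payload-Metadata", {})
                                 .get("HTTP-Response-Metadata", {})
                                 .get("HTML-Metadata", {}))
            links = html_meta.get("Links", [])
            # Links are often relative in WAT; a real scan would resolve them
            # against the page URL before comparing.
            if any(link.get("url") == target for link in links):
                hits.append(envelope["WARC-Header-Metadata"]["WARC-Target-URI"])
    return hits

print(pages_linking_to("example.wat.gz", TARGET))
```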

But it seems to me that it would be generally useful to be able to find all pages that link to a given page (or embed it, in the case of an image URL). Has someone made such an index available? Based on a quick back-of-the-napkin calculation, I assume such an index would fit on a fast SSD.
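For transparency, here is the napkin math; every number is a rough assumption on my part:

```python
# Rough size estimate for a page-level link index of one crawl.
pages_per_crawl = 3_000_000_000   # assumed ~3 billion pages in a recent crawl
links_per_page = 50               # assumed average outlinks/embeds per page
bytes_per_edge = 16               # e.g. two 64-bit URL hashes per edge

total_tb = pages_per_crawl * links_per_page * bytes_per_edge / 1e12
print(f"~{total_tb:.1f} TB")      # ~2.4 TB, which fits on a single large NVMe SSD
```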

I can find tools for building web graphs from Common Crawl, but no page-level graph available for download.

