Questions tagged [web-crawling]

Web crawling is typically done to build and maintain a web index. The crawled data is compared with previous versions of the same dataset to detect changes. Web scraping is similar to web crawling, except that the bot (crawler) also extracts specific data from the pages it visits.

Web crawling is a method of programmatically assessing a website's content by systematically browsing the site with a bot ("crawling") for the purpose of web indexing. Bots, also known as spiders or FOAF web scutters, are scripts typically programmed to copy a site's content so that it can be compared with the site's content at a future or past date (web indexing). The differences are typically used to update search engine result pages. Bots can typically validate hyperlinks, HTML, CSS, and JavaScript (web platform technologies), which lets them make decisions about a site's links, content (SEO), and functionality (AJAX and accessibility support are two examples).
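The crawl-then-compare loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular search engine does it: the `fetch` callable, the `example.test` URLs, and the function names `crawl` and `diff_snapshots` are all hypothetical, and in-memory dictionaries stand in for real HTTP requests. A real crawler would also need to honor robots.txt, rate-limit itself, and stay within the hosts it is allowed to visit.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns {url: sha256 of content}.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    maps a URL to its HTML text, or returns None if the page is unavailable.
    """
    snapshot = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(snapshot) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        snapshot[url] = hashlib.sha256(html.encode("utf-8")).hexdigest()
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            target, _ = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return snapshot


def diff_snapshots(old, new):
    """Report which URLs were added, removed, or changed between two crawls."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }


if __name__ == "__main__":
    # Two in-memory versions of a made-up site stand in for real HTTP fetches.
    site_v1 = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": "<p>first version</p>",
        "http://example.test/b": '<a href="/">home</a>',
    }
    site_v2 = {
        "http://example.test/": '<a href="/a">A</a> <a href="/c">C</a>',
        "http://example.test/a": "<p>second version</p>",
        "http://example.test/c": "<p>new page</p>",
    }
    old = crawl("http://example.test/", site_v1.get)
    new = crawl("http://example.test/", site_v2.get)
    print(diff_snapshots(old, new))
```

Hashing each page keeps the stored snapshot small; an indexer that needs to know what changed, rather than only that something changed, would keep the content itself.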

53 questions
7 votes, 2 answers

Exclusion lists when crawling web directories

I have a case where I'm building a connector between a federated search engine and an archive that's serving all of its data via HTTP. They've been kind enough to have their pipeline generate a series of index files for each day that I can use…
Joe
  • 4,445
  • 1
  • 18
  • 40
1 vote, 0 answers

Are there any automated techniques one can use to gather data online for a dataset?

I am a developer myself and would like to use the latest technologies to build some open data datasets. I would like to know if you are aware of any techniques or algorithms one can use to automate gathering data on the internet instead of manually doing…
Mathematics
  • 445
  • 3
  • 10
0 votes, 0 answers

Is there a Common Crawl index to search for pages embedding or linking to a specific exact image?

Input: Image URL as it appears in at least one Common Crawl (I also have the image itself available). Output: Pages in Common Crawl that link to or embed the image. There is clip-retrieval, which allows me to search for images based on an image or…
Nobody
  • 101
  • 1