
I'm running a crawler on my website to test for broken links and such.

It starts by using a URL like www.domain.com.

One curious thing is that it is showing directories with no internal links. For example, directory /example_dir/ is showing up in the crawl tree, but I can't find any internal link to that directory within the pages.

How could this be happening and is there a way to prevent it?

edeneye

2 Answers


What tool are you using to crawl your site?

Crawlers typically find new pages by following links, so the odds are you have a link pointing to those directories. It may not be intentional — for example, a dynamic link that is pulling up bad data without throwing an error. If you aren't using Xenu's Link Sleuth, I recommend it: it will tell you which pages had the links that led it to crawl those directories.
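To see why every crawled page implies a link somewhere, here is a minimal sketch of how a link-following crawler works. The site contents are hypothetical and held in memory (a real crawler would fetch pages over HTTP); the point is that each discovered URL can be traced back to the page that linked to it:

```python
from collections import deque

# Hypothetical in-memory "site": URL path -> list of link targets on that page.
# A real crawler would fetch each page over HTTP and parse its HTML instead.
site = {
    "/": ["/about/", "/products/"],
    "/about/": ["/"],
    "/products/": ["/example_dir/"],   # the easily overlooked link
    "/example_dir/": [],
}

def crawl(start):
    """Breadth-first crawl: every page discovered was reached via some link."""
    found_via = {start: None}          # page -> page that linked to it
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in site.get(page, []):
            if target not in found_via:
                found_via[target] = page
                queue.append(target)
    return found_via

origins = crawl("/")
print(origins["/example_dir/"])  # '/products/' -- the page holding the link
```

A tool like Xenu does essentially this bookkeeping for you, which is why it can report which page led it into a directory.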

John Conde
    It was Netsparker, which is a web vulnerability scanner. It doesn't check for broken links like I thought.

    I grabbed Xenu and it works great, but it didn't show the directory in question.

    My guess is that Netsparker is picking it up from the robots.txt file, and that's the difference here.

    – edeneye Nov 03 '11 at 19:03
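That robots.txt theory is plausible: a Disallow rule names a path outright, so any tool that parses the file learns the directory exists even if no page links to it. A fragment like this (the path here is just illustrative) would be enough:

```
User-agent: *
Disallow: /example_dir/
```

Security scanners routinely read robots.txt for exactly this reason — disallowed paths are often the interesting ones — while a pure link checker like Xenu only follows links it finds in pages.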

My guess is that John is right: you must have a link somewhere. It might not show on the page, but the spider is finding it.

Don't forget that markup like this can happen: <a href="/my_dir/"></a>. Although it renders as nothing to the user, the spider will still follow it.
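You can confirm that an empty anchor still exposes its target with a few lines of Python's standard-library HTML parser — the href attribute is extracted regardless of whether the anchor has any visible text (this is a sketch, not how any particular crawler is implemented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href seen in <a> start tags, visible text or not."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
# The anchor below has no text content, yet its target is still found.
parser.feed('<p>Some text</p><a href="/my_dir/"></a>')
print(parser.links)  # ['/my_dir/']
```

Anything that walks start tags — which is how crawlers extract links — will pick this up, which is why a directory can appear in the crawl tree with no link you can see on the rendered page.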

Stephen Ostermiller
TheAlbear