I have the following:

  • A library of XML files.
  • An aspx page (I guess an "Application Page") that takes the name of one of these files as a query-string parameter and renders it to HTML.
  • A way of listing the XML files, but with links to the aspx page using the correct query-string rather than the XML file itself. This means the users don't ever see the XML files (unless they go looking for them, which isn't really a problem); they just get the nice rendered output.

This is all fairly clunky, but works. (Basically I've been asked to take an existing XML-to-HTML rendering system and use SharePoint for version control, access etc.)

However, where it really falls down is in searching the files. I'm using Search Server Express, and I think I'm just indexing the content of the XML files. This means I can search inside the files, but the links (and titles) I get in the search results point at the XML files.

What I'd like is for the search results to link to the aspx page, with the correct query-string parameter, rather than the XML files.

A bonus would be to have SSE actually crawl the rendered output of the aspx page rather than the XML source files, but this is secondary to just having the links point to the right place.

Any ideas?

Alex Angas
Rawling

1 Answer

This can be achieved by defining your content sources and crawl rules so that the crawler indexes only the pages you want, i.e. the aspx pages but not the XML files. Both are configured from the Search management page in Central Administration.

If you want to prevent crawling of XML files entirely, you can also configure search to omit them on the Manage File Types page.

You will need to create some kind of "index page" containing links to your aspx page, one for each query-string value. You could perhaps use a Content Query Web Part or an XSLT List View Web Part to roll up all the XML files, with an XSLT transformation that creates the links to the aspx page. When the crawler processes this page, it will put all the URLs for the aspx page into the search index.
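As a rough illustration of what such an index page needs to contain, here is a minimal Python sketch that turns a list of XML file names into an HTML list of links to the rendering page. The page path `/_layouts/render.aspx`, the query-string parameter `file`, and the sample file names are assumptions made up for the example; substitute whatever your own aspx page expects. In SharePoint itself the same output would come from the web part and XSLT described above, not from a script.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical values: replace with the real page path and parameter name.
RENDER_PAGE = "/_layouts/render.aspx"
QUERY_PARAM = "file"


def render_url(xml_name: str) -> str:
    """Build the link to the rendering page for one XML file name."""
    return f"{RENDER_PAGE}?{urlencode({QUERY_PARAM: xml_name})}"


def build_index_page(xml_names: list[str]) -> str:
    """Return a bare HTML page with one crawlable link per XML file."""
    items = "\n".join(
        f'    <li><a href="{escape(render_url(name), quote=True)}">'
        f"{escape(name)}</a></li>"
        for name in xml_names
    )
    return f"<html>\n  <body>\n  <ul>\n{items}\n  </ul>\n  </body>\n</html>"


if __name__ == "__main__":
    # Sample file names, invented for the example.
    print(build_index_page(["intro.xml", "user guide.xml"]))
```

The only point of the sketch is that every rendered page must be reachable through a plain anchor on the index page, with the query string properly encoded, so that the crawler has something to follow.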

SPDoctor
  • I've managed to use a crawl rule to exclude the XML files; however, I've not managed to include the aspx page. I essentially need to tell the crawler "crawl this aspx page several times, using each of the following query strings", but don't know how to do that. – Rawling Apr 05 '11 at 10:14
  • Okay, crawl the XML files, but only to follow links; don't index the XML files themselves. Set this with a crawl rule: enter a path with a wildcard, select "Include all items in this path", and then choose "Follow links on the URL without crawling the URL itself". – SPDoctor Apr 05 '11 at 10:29
  • I'm assuming the XML file contains the link to the aspx page... – SPDoctor Apr 05 '11 at 10:30
  • I'm not sure how that helps; the XML files don't necessarily include any links. – Rawling Apr 05 '11 at 10:33
  • Okay, then you will have to create an index page or something. There is no way for the search crawler to find your pages if nothing links to them. I will update the answer. – SPDoctor Apr 05 '11 at 10:37
  • Cheers, adding an index page seems to have done the job. – Rawling Apr 05 '11 at 10:59
  • No, I'm mistaken... it seems to be fine if you search for the title of one of the rendered pages (which also happens to be what I made the anchor text on the index page), but doesn't actually index the content of the pages. – Rawling Apr 05 '11 at 11:35
  • Do you have anything that links to your index page? Either add your index page as a start address for your content source in search administration, or create a new content source with the index page as the start address. – SPDoctor Apr 05 '11 at 12:55
  • Yes, I have the index page as the sole start address for a content source. I've a feeling it's crawling the output of the page before my XSLT is being applied to the lists; I'll see if I can change that. – Rawling Apr 05 '11 at 13:04
  • Things to check: web parts indexing enabled in site settings (search and offline availability); crawl account has access to your index page. Check crawl log for index page. – SPDoctor Apr 05 '11 at 13:24
  • Got it working now. The trick was not only to add the index page as a content source, but also to add a crawl rule matching the aspx-plus-query links, with the "Follow Complex URLs" option set. – Rawling Apr 06 '11 at 07:50
  • Yes, of course. Glad you got it working. – SPDoctor Apr 06 '11 at 08:27
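
To picture why the final crawl rule mattered, the following Python sketch is a toy model of the decision described in the last comments. It is not SharePoint's actual matching logic: it simply treats `*` in the rule path as "any run of characters" and refuses URLs containing a question mark unless the complex-URL option is set. The server name, page path, and parameter name are hypothetical.

```python
import re

# Hypothetical rule path matching the aspx-plus-query links.
RULE_PATH = "http://server/_layouts/render.aspx?file=*"


def matches_rule(url: str, rule_path: str) -> bool:
    """True if url matches rule_path, where '*' stands for any characters."""
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    return re.fullmatch(pattern, url, flags=re.IGNORECASE) is not None


def rule_allows(url: str, rule_path: str, crawl_complex_urls: bool) -> bool:
    """Toy model: an include rule only admits '?' URLs if the option is set."""
    if not matches_rule(url, rule_path):
        return False
    if "?" in url and not crawl_complex_urls:
        return False
    return True


if __name__ == "__main__":
    link = "http://server/_layouts/render.aspx?file=intro.xml"
    print(rule_allows(link, RULE_PATH, crawl_complex_urls=False))
    print(rule_allows(link, RULE_PATH, crawl_complex_urls=True))
```

In this model the links on the index page are only admitted once the option is enabled, which matches the behaviour reported in the thread: the index page alone was not enough until the rule for the query-string URLs was added.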