I'm scraping a series of brick-and-mortar retailer websites that use a cascading store directory pattern, in order to gather store location data. I'm having trouble writing regexes so that each page is handled by the appropriate rule. The page URL structure follows this pattern:

  • /state.html
  • /state/city.html
  • /state/city/store.html
  • /state/city/store/service.html

I want to follow links from the State and City pages, collect data from the Store pages, and ignore the Service pages.
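To make the four levels concrete, here is a minimal sketch that classifies a URL by path depth. It uses `[^/]+` so that a single path segment can never span a `/`. The host `www.example.com` and the sample paths are hypothetical, and the patterns assume two-letter lowercase state codes, as in the list of states used for the State extractor.

```python
import re

# Hypothetical host, used only for illustration.
BASE = "https://www.example.com"

# One pattern per directory level. [^/]+ matches exactly one path segment,
# and the dot before "html" is escaped so it matches a literal dot.
LEVEL_PATTERNS = {
    "state": re.compile(r"^https?://[^/]+/[a-z]{2}\.html$"),
    "city": re.compile(r"^https?://[^/]+/[a-z]{2}/[^/]+\.html$"),
    "store": re.compile(r"^https?://[^/]+/[a-z]{2}/[^/]+/[^/]+\.html$"),
    "service": re.compile(r"^https?://[^/]+/[a-z]{2}/[^/]+/[^/]+/[^/]+\.html$"),
}


def classify(url):
    """Return the directory level of url, or None if it fits no level."""
    for level, pattern in LEVEL_PATTERNS.items():
        if pattern.search(url):
            return level
    return None


if __name__ == "__main__":
    for path in (
        "/ca.html",
        "/ca/san-diego.html",
        "/ca/san-diego/1234-main-st.html",
        "/ca/san-diego/1234-main-st/pharmacy.html",
    ):
        print(path, "->", classify(BASE + path))
```

Because each pattern is anchored at both ends and limited to a fixed number of segments, no URL can fall into more than one level.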

The allow parameter of the State LinkExtractor (le_store_state) is set so that only certain states are followed, and that is working fine. The allow parameter of the City LinkExtractor (le_store_city) seems to be ignored. The allow parameter of the Store LinkExtractor (le_store) seems to be catching City pages, because the spider then tries to parse them as stores. I'm not sure what pattern to use in the deny parameter so that the crawl reaches the Store level but ignores the Service level.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# State level: follow only the listed states.
le_store_state = LinkExtractor(
    allow=[
        '/ak.html',
        '/az.html',
        '/ca.html',
        '/co.html',
        '/hi.html',
        '/id.html',
        '/mt.html',
        '/nm.html',
        '/nv.html',
        '/or.html',
        '/ut.html',
        '/wa.html',
        '/wy.html',
    ],
    restrict_css='div.Directory-content a.Directory-listLink',
)

# City level: intended to match /state/city.html.
le_store_city = LinkExtractor(
    allow=r"[a-z]{2}\/[^()]+.html$",
    restrict_css='div.Directory-content h2 a',
)

# Store level: intended to match /state/city/store.html.
le_store = LinkExtractor(
    allow=r"[a-z]{2}\/[^()]+\/[^()]+.html$",
)

rules = (
    Rule(le_store_state, follow=True),
    Rule(le_store_city, follow=True),
    Rule(le_store, callback='parse_stores', follow=False),
)

As is probably clear, I found these regexes rather than writing them myself, and I don't understand what "[^()]+" is doing.
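For reference, `[^()]+` is a negated character class: it matches one or more characters that are anything except `(` or `)`. That includes `/`, so it can run across several path segments. The unescaped `.` before `html` also matches any character, not only a literal dot. The sketch below checks the two patterns with Python's `re` module against hypothetical sample URLs (the host `www.example.com` and the paths are made up). It assumes, as I understand the Scrapy behaviour, that `allow` patterns are searched anywhere in the absolute URL.

```python
import re

# The two patterns exactly as used in the City and Store extractors.
city_pattern = re.compile(r"[a-z]{2}\/[^()]+.html$")
store_pattern = re.compile(r"[a-z]{2}\/[^()]+\/[^()]+.html$")

# Hypothetical absolute URLs, one per directory level.
urls = {
    "state": "https://www.example.com/ca.html",
    "city": "https://www.example.com/ca/san-diego.html",
    "store": "https://www.example.com/ca/san-diego/1234-main-st.html",
    "service": "https://www.example.com/ca/san-diego/1234-main-st/pharmacy.html",
}


def matching_levels(pattern):
    """List the levels whose sample URL the pattern matches via search()."""
    return [level for level, url in urls.items() if pattern.search(url)]


city_matches = matching_levels(city_pattern)
store_matches = matching_levels(store_pattern)

if __name__ == "__main__":
    print("city pattern matches: ", city_matches)
    print("store pattern matches:", store_matches)
```

With these sample URLs the city pattern matches all four levels, and the store pattern matches city, store, and service. The unanchored `[a-z]{2}\/` can match the `om/` at the end of `.com/`, which shifts every pattern one level up, and `[^()]+` then absorbs any remaining slashes. Replacing `[^()]+` with `[^/]+`, escaping the dot, and anchoring the pattern to the start of the path would be one way to restrict each pattern to a single level.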

CwnAnnwn