I'm trying to write a simple script to check a webpage for a specific value:

$("a#infgHeader").text() == "Delivered";

I'd like to automate this from a Bash script to be run at an interval. I'm also fine with using Python. I need to essentially make an HTTP request, get the response, and have a way to intelligently query the result. Is there a library which will help me with the querying part?

Naftuli Kay

5 Answers

XPath is great for querying HTML.

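Something like this (`text()` selects the element's text content, whereas `@text` would select an attribute named `text`):

//a[@id='infgHeader']/text()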

In Chrome's developer tools, you can use the search box in the Elements tab to test the expression.

Quick run in terminal:

$ echo '<div id="test" text="foo">Hello</div>' | xpath '//div[@id="test"]/@text'
Found 1 nodes:
-- NODE --
 text="foo"
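
Since the question also allows Python, here is a minimal sketch of the same idea using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The `SAMPLE` markup and the `header_text` helper are hypothetical, made up for illustration. It assumes the page is well-formed XML; a strict parser like this one raises an error on sloppy HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; a real page would be fetched over HTTP.
SAMPLE = '<html><body><a id="infgHeader">Delivered</a></body></html>'


def header_text(markup):
    """Return the text of the <a id="infgHeader"> element, or None if absent."""
    # ET.fromstring is a strict XML parser: it raises ET.ParseError on malformed markup.
    root = ET.fromstring(markup)
    # ElementTree understands a limited XPath subset, including [@attr='value'].
    link = root.find(".//a[@id='infgHeader']")
    if link is None:
        return None
    return (link.text or "").strip()


if __name__ == "__main__":
    print(header_text(SAMPLE) == "Delivered")
```

The script can be run from Bash at an interval (for example from cron), with the printed result or an exit code used to signal whether the value matched.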
ebaxt
  • Hooray for xPath! I was wondering if it would be of help here. I didn't know because HTML != XML, but hey, if it works, it works. – Naftuli Kay Feb 29 '12 at 19:33
  • `xpath` works poorly with not-strictly-XML HTML code. When running it on a 100-line HTML snippet, it freezes for a minute then dies with a "mismatched tag" error, apparently because the code had `` and not ``. – Tgr Mar 23 '16 at 14:48
  • Yes, xpath is not reliable. Guess I'll use [regular expressions to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) then. – phil294 Mar 17 '18 at 10:31
http://pypi.python.org/pypi/spynner/1.10

Spynner will let you select elements from the DOM using jQuery syntax.

There are also other libraries that let you parse HTML, such as BeautifulSoup and lxml.
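
As a rough, dependency-free illustration of the parsing approach, the sketch below uses Python's built-in `html.parser` rather than Spynner, BeautifulSoup, or lxml. The `LinkTextFinder` class, the `link_text` helper, and the `SAMPLE` markup are hypothetical names invented for this example. Unlike a strict XML parser, `html.parser` tolerates unclosed tags.

```python
from html.parser import HTMLParser

# Hypothetical sample with sloppy HTML (unclosed <p> and <br>).
SAMPLE = '<div><p>Status:<br><a id="infgHeader" href="#"> Delivered </a></div>'


class LinkTextFinder(HTMLParser):
    """Collect the text inside the first <a> element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False   # True once the target <a> has been seen
        self.inside = False  # True while we are between <a ...> and </a>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.found and tag == "a" and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.inside = True

    def handle_endtag(self, tag):
        if self.inside and tag == "a":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def link_text(html, target_id="infgHeader"):
    """Return the stripped text of <a id=target_id>, or None if it is missing."""
    finder = LinkTextFinder(target_id)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip() if finder.found else None


if __name__ == "__main__":
    print(link_text(SAMPLE) == "Delivered")
```

For a live page, the HTML string could come from `urllib.request.urlopen(url).read().decode()`, where `url` is whatever page you want to check. This only sees the HTML as served, not changes made later by JavaScript.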

dm03514
Alex MacCaw wrote up a nice post that does just what you're asking using node.js / JavaScript. There are a LOT of capabilities it brings too.

http://alexmaccaw.com/posts/node_jquery_xml_parsing

Joshua
I have recently done something like this using Node.js and jsdom. Both are well documented and have a low barrier to entry.

OlduwanSteve
Parsing HTML is not trivial for general websites: the HTML may not be perfect, and the DOM can be modified by JavaScript on the fly, so parsing the raw HTML may not make sense in such cases.

The best way is to use a browser and access the DOM directly. For that you can use a headless browser like PhantomJS, which you can script to check whatever you need to check.

Anurag Uniyal