Programmatically determining the form of the English indefinite article

Question

Not sure if this is the correct place for this question. I am writing some code to place the correct indefinite article before a given noun. To do this I am looking at the first letter of the noun to decide whether to use 'a' or 'an'. I know this won't always give the correct output (e.g. An unicorn, A hour), as it is the sound rather than the actual letter that needs to be considered. To correct this I am looking for a list of all nouns where using the first letter will give the wrong result. Does anyone know where I can find such a list?

I doubt you'll ever find a complete list, because e.g. a mathematician will come along and invent the term "univariant" and now your list is out of date. — Draconis, Jan 06 '23 at 20:32
It doesn't need to be 100% complete, just able to catch the common exceptions, but obviously the more the merrier! — RedPython, Jan 06 '23 at 20:47
This varies between different variants of English, by the way. — Adam Bittlingmayer, Jan 08 '23 at 08:16
I don't think that language rules usually involve what is "all wrong". — Lambie, Jan 08 '23 at 16:47

Araucaria - him · Answer 1 · 2023-01-07T23:43:08.863

Even given a comprehensive list of such nouns, it won't be anything like sufficient. The reason is, of course, that the choice between a and an depends not on whether the head noun in the noun phrase begins with a vowel sound, but on whether the word immediately following the article does.

We can see this contrast in (1)-(4) below:

(1) an elephant

(2) a big elephant

(3) a university

(4) an unusual university

In (1) the head noun elephant begins with /e/ [or /ɛ/ if you prefer that symbol for the same phoneme]. Because the indefinite article occurs directly before this vowel, we see the form an. In (2) however, the article precedes the consonant /b/ in big and so we see the form a. We see the same kind of pattern in (3) and (4), where university begins with the consonant /j/, but unusual begins with the vowel /ʌ/.

You therefore need a list of all words that are spelled with an initial consonant letter but pronounced with an initial vowel sound, or vice versa, and that could occur directly after an indefinite article in English. In addition to nouns, this would have to include other determiners (e.g. numerals), prepositions, adjectives, adverbs, and verbs. It's probably safest to get a list of all such words in English regardless of their word category!
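As a rough sketch of that idea, the function below picks the article from the word that immediately follows it, using the first letter by default and two small override sets. The names `choose_article`, `with_article`, `VOWEL_SOUND_EXCEPTIONS` and `CONSONANT_SOUND_EXCEPTIONS` are hypothetical, and the word lists are a handful of illustrative entries, not a complete inventory.

```python
# Words spelled with an initial consonant letter but pronounced with an
# initial vowel sound (illustrative sample only, not exhaustive).
VOWEL_SOUND_EXCEPTIONS = {"hour", "honest", "honor", "heir"}

# Words spelled with an initial vowel letter but pronounced with an
# initial consonant sound (illustrative sample only, not exhaustive).
CONSONANT_SOUND_EXCEPTIONS = {"university", "unicorn", "european", "one", "useful"}


def choose_article(next_word: str) -> str:
    """Return 'a' or 'an' for the word that directly follows the article."""
    word = next_word.lower()
    if word in VOWEL_SOUND_EXCEPTIONS:
        return "an"
    if word in CONSONANT_SOUND_EXCEPTIONS:
        return "a"
    # Fallback: the first-letter heuristic described in the question.
    return "an" if word.startswith(tuple("aeiou")) else "a"


def with_article(phrase: str) -> str:
    """Prefix a non-empty phrase with the article chosen from its first word."""
    first_word = phrase.split()[0]
    return f"{choose_article(first_word)} {phrase}"
```

Called on the four examples, `with_article("big elephant")` looks only at "big" and `with_article("unusual university")` looks only at "unusual", so the head noun never enters the decision.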

I think this answer is critical to solving the problem. You correctly pointed out that this is a purely phonetic phenomenon and has nothing to do with intrinsic noun attributes (as if they were some kind of morphemic attribute or something). In a way it probably makes the problem way simpler. You don't need a list of "irregular nouns". You only need data on the pronunciation of the words, and then a super simple computer implementation of the a / an rule (which I think is extremely regular; I do not know if there are any exceptions to the rule). — Julius Hamilton, Jan 09 '23 at 18:11
Yes. No grammar at all is needed. Sound follows sound and determines sound. — jlawler, Jan 09 '23 at 18:59

score 4 · Answer 2 · answered Jan 06 '23 at 21:53

You need a pronouncing dictionary: the CMU dictionary is pretty easy to use. Using the web interface, you can try to look up "univariant", which directs you to Logios, which may generate a satisfactory pronunciation by rule.
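A minimal sketch of how such a dictionary could drive the choice, assuming entries in the CMU layout of a word followed by ARPABET phonemes, where vowel phonemes carry a trailing stress digit (0, 1 or 2). The sample lines are embedded for illustration and should be checked against the real dictionary file; `parse_entries` and `article_for` are hypothetical helper names.

```python
from typing import Dict, List, Optional

# A few lines in the CMU dictionary's "WORD  PHONEMES" layout, embedded for
# illustration; in practice these would be read from the downloaded file.
SAMPLE_ENTRIES = """\
ELEPHANT  EH1 L AH0 F AH0 N T
HOUR  AW1 ER0
UNICORN  Y UW1 N IH0 K AO2 R N
UNIVERSITY  Y UW2 N AH0 V ER1 S AH0 T IY0
UNUSUAL  AH0 N Y UW1 ZH AH0 W AH0 L
"""


def parse_entries(text: str) -> Dict[str, List[str]]:
    """Map each lowercase word to its list of phonemes."""
    entries: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and ';;;' comment lines
        word, *phonemes = line.split()
        entries.setdefault(word.lower(), phonemes)  # keep the first pronunciation
    return entries


def article_for(word: str, entries: Dict[str, List[str]]) -> Optional[str]:
    """Return 'a' or 'an' from the first phoneme, or None if the word is unknown."""
    phonemes = entries.get(word.lower())
    if not phonemes:
        return None
    # Vowel phonemes end in a stress digit; consonant phonemes do not.
    return "an" if phonemes[0][-1].isdigit() else "a"
```

A word missing from the dictionary, such as "univariant", comes back as `None` here, so the caller can decide whether to fall back to a rule-generated pronunciation or to the first-letter guess.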

user6726

score 1 · Answer 3 · answered Jan 09 '23 at 17:58

I think it would be easier to generate your own data because then you don't need to hunt around for a resource that may or may not exist.

This isn't perfect but it's a start. Gonna update it asap.

Choose a starting URL for a crawler (Google, Wikipedia, YouTube, or any website).
Request the webpage HTML source; extract the English text and the words that follow an indefinite article, and store them.
Continue to crawl the URLs found in the webpage for as long as you want.

an_a.py

import scrapy
import spacy

# Load the NLP model once, rather than once per crawled page
nlp = spacy.load("en_core_web_sm")


class AnACrawler(scrapy.Spider):
    name = "ana_crawler"
    start_urls = [
        "https://www.wikipedia.org",
    ]
    visited_urls = set()  # set to store visited URLs

    def parse(self, response):
        # Extract the visible text of the page instead of parsing raw HTML.
        # Per-sentence language detection would need an extra spaCy
        # extension, which is not set up here, so it is omitted.
        text = " ".join(response.xpath("//body//text()").getall())
        doc = nlp(text)

        # Collect (article, following word) pairs for each "a" / "an"
        pairs = []
        for token in doc:
            is_article = token.pos_ == "DET" and token.lower_ in ("a", "an")
            if is_article and token.i + 1 < len(doc):
                pairs.append((token.lower_, token.nbor().text))

        # Save the text and the pairs to a file
        with open("text_and_nouns.txt", "a", encoding="utf-8") as f:
            f.write(text + "\n")
            f.write(str(pairs) + "\n")

        # Follow hyperlinks to crawl more pages
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)  # resolve relative links
            if url.startswith("http") and url not in self.visited_urls:
                self.visited_urls.add(url)
                yield scrapy.Request(url, self.parse)
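
Once pairs like `("an", "hour")` have been collected, one possible next step (an assumption, not part of the crawler above) is to keep only the words whose most frequently observed article disagrees with the first-letter rule. `find_exceptions` and `min_count` are hypothetical names, and the sample pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def find_exceptions(pairs, min_count=2):
    """Return {word: article} where the usual article contradicts the first letter."""
    counts = defaultdict(Counter)
    for article, word in pairs:
        counts[word.lower()][article.lower()] += 1

    exceptions = {}
    for word, seen in counts.items():
        article, n = seen.most_common(1)[0]
        if n < min_count:
            continue  # too few observations to trust
        naive = "an" if word.startswith(tuple("aeiou")) else "a"
        if article != naive:
            exceptions[word] = article
    return exceptions


# Made-up sample data standing in for crawler output.
sample_pairs = [
    ("an", "hour"), ("an", "hour"), ("a", "hour"),
    ("a", "unicorn"), ("a", "unicorn"),
    ("an", "apple"), ("an", "apple"),
    ("a", "dog"), ("a", "dog"),
    ("an", "umbrella"),
]
exceptions = find_exceptions(sample_pairs)
```

Taking the majority article per word gives some tolerance for typos in the crawled text, and the `min_count` threshold drops words seen too rarely to judge.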
