I have an input string, parts of which do not contain actual words (for example, mathematical formulas such as x^2 = y_2 + 4). I would like to split the input string into substrings according to whether each one consists of actual English words. For example:

If my string was:

"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

then I would like it split into a list like:

["Taking the derivative of: ", "f(x) = \int_{0}^{1} z^3, ", "we can see that we always get ", "x^2 = y_2 + 4 ", "which is the same as taking the double integral of ", "g(x)"]

How can I accomplish this? I don't think regex will work for this, or at least I'm not aware of any method in regex that detects the longest substrings of English words (including commas, periods, semicolons, etc).

1 Answer

You can simply use the pyenchant library, as mentioned in this post:

import enchant
d = enchant.Dict("en_US")
print(d.check("Hello"))

Output:

True

You can install it by running pip install pyenchant on the command line. In your case, you have to loop through all the space-separated tokens in the string and check whether each one is an English word. Here is the full code to do it:

import enchant

d = enchant.Dict("en_US")

# Raw string, so the backslash in \int is kept literally.
text = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

# split() with no argument splits on any whitespace and never yields empty tokens.
tokens = text.split()

# Keep only the tokens that the dictionary recognises.
wordlst = [token for token in tokens if d.check(token)]

print(wordlst)

Output:

['Taking', 'the', 'derivative', 'we', 'can', 'see', 'that', 'we', 'always', 'get', '4', 'which', 'is', 'the', 'same', 'as', 'taking', 'the', 'double', 'integral', 'of']
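The loop above keeps only the individual tokens that pass the dictionary check; it does not yet produce the alternating chunks the question asks for. One possible way to get them is to group consecutive tokens by whether they look like English, for example with itertools.groupby. The following is only a sketch under some assumptions: split_runs, looks_english and VOCAB are hypothetical names introduced here, and the tiny VOCAB set stands in for a real dictionary so that the example runs without pyenchant. Tokens are stripped of surrounding punctuation and must be purely alphabetic before the lookup, because the output above shows that the dictionary accepts '4' while rejecting 'of:'.

```python
from itertools import groupby
import string


def split_runs(text, is_word):
    """Group whitespace-separated tokens into alternating word / non-word runs."""

    def looks_english(token):
        # Drop surrounding punctuation so that "of:" is looked up as "of".
        core = token.strip(string.punctuation)
        # Require letters only, so "4", "=" or "x^2" never count as words.
        return core.isalpha() and is_word(core)

    tokens = text.split()
    # groupby collects consecutive tokens that share the same True/False key.
    return [" ".join(group) for _, group in groupby(tokens, key=looks_english)]


# Hypothetical stand-in vocabulary (an assumption for this demo only).
# With pyenchant installed, enchant.Dict("en_US").check could be passed instead.
VOCAB = {
    "taking", "the", "derivative", "of", "we", "can", "see", "that",
    "always", "get", "which", "is", "same", "as", "double", "integral",
}

text = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

runs = split_runs(text, lambda word: word.lower() in VOCAB)

for run in runs:
    print(run)
```

With this stand-in vocabulary the string is split into six runs, alternating between English text and formulas. Unlike the list in the question, the runs carry no trailing spaces, because the tokens are re-joined with single spaces. Formula fragments that happen to be real words would still be misclassified, so this is a heuristic rather than a complete solution.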

Hope that this helps!

Sushil