I condensed the nutrition facts from a very messy webpage, stripping out the spaces, odd Unicode characters, etc. Collapsing it all into a single string was the simplest way I could find to search through it.

"Nutrition FactsServing Size 6 ozlCalories 121*Percent Daily Values (DV)\r are based on a 2,000 calorie diet.Amount/Serving%DV*Amount/Serving%DV*Total Fat 2.6g3%Tot. Carb. 20.4g7% Sat. Fat 0.5g3% Dietary Fiber 2.9g12% Trans Fat 0g Sugars 0.5g Cholesterol 0mg0%Protein 4.3g Sodium 349.4mg15% Vitamin D - mcg 0%Calcium 2%Iron 9%Potassium 3%" (the pdvs follow each amount)

I want to convert the contents of the string above into an organized dictionary with the format below. (I filled in the values to match the string above.)

master =
{
  "nutrition": {
    "amountPerServing": {
      "servingSize": "6 ozl",
      "calories": 121,
      "totalFat": "2.6g",
      "saturatedFat": "0.5g",
      "transFat": "0g",
      "cholesterol": "0mg",
      "sodium": "349.4mg",
      "totalCarbohydrate": "20.4g",
      "dietaryFiber": "2.9g",
      "totalSugars": "0.5g",
      "protein": "4.3g"
    },
    "percentDailyValue": {
      "totalFat": 3,
      "saturatedFat": 3,
      "cholesterol": 0,
      "sodium": 15,
      "totalCarbohydrate": 7,
      "dietaryFiber": 12,
      "vitaminD": 0,
      "calcium": 2,
      "iron": 9,
      "potassium": 3
    }
  }
}

This is my current code:

# (.*?) as opposed to (.*); non-greedy expression; https://blog.finxter.com/python-regex-greedy-vs-non-greedy-quantifiers/
master["nutrition"]["amountPerServing"]["servingSize"] = re.search("ize (.*?)Cal", string).group(1)
master["nutrition"]["amountPerServing"]["calories"] = re.search("ies (.*?)*Per", string).group(1)

servingSize renders fine as "6 ozl", but calories just returns "". Using .group(0) for calories gives ies 121*Per instead. Also, I am using non-greedy matching because greedy matching causes a lot of backtracking and slows the program down heavily. I tried changing the length of the text each pattern has to match. I'm not sure how to parse this string properly and efficiently with the same style of matching, as this is to be implemented in a web application.
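For reference, here is a minimal, self-contained sketch of the snippet above (the condensed text is shortened here, and escaping the * as \* in the second pattern, so it matches the literal asterisk after 121, is an assumption rather than what I currently have):

import re

# Shortened version of the condensed label text from the question
string = "Nutrition FactsServing Size 6 ozlCalories 121*Percent Daily Values (DV) are based on a 2,000 calorie diet."

master = {"nutrition": {"amountPerServing": {}, "percentDailyValue": {}}}

# Non-greedy (.*?) keeps each match as short as possible
master["nutrition"]["amountPerServing"]["servingSize"] = re.search("ize (.*?)Cal", string).group(1)

# In the pattern "ies (.*?)*Per" the bare * acts as a quantifier on the group;
# escaping it as \* (an assumption) makes it match the literal asterisk instead
master["nutrition"]["amountPerServing"]["calories"] = re.search(r"ies (.*?)\*Per", string).group(1)

print(master["nutrition"]["amountPerServing"])  # {'servingSize': '6 ozl', 'calories': '121'}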

  • Without additional context I can't be sure, but I suspect the facts were probably more structured (in a table, for example) on the web page before you removed the extra characters. If that's the case, I suggest using some sort of [html parsing library](https://stackoverflow.com/a/6325277/11659881) (a rough sketch follows these comments). Although you do mention it was messy, so you may have considered this. – Kraigolas Jun 02 '22 at 03:51
  • @Kraigolas Yeah, the data on the webpage weren't in proper rows; some of it lay as normal text, and other parts were chopped off into separate rows, etc. – Harsh Dadhich Jun 02 '22 at 03:57
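To illustrate the html-parsing suggestion above, a rough sketch using BeautifulSoup; the table markup and the "nutrition-facts" class here are hypothetical, since the original page source isn't shown:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical snippet of page markup; the real page structure is not shown in the question
html = "<table class='nutrition-facts'><tr><td>Total Fat 2.6g</td><td>3%</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Pull the text of each cell instead of regex-searching one flattened string
table = soup.find("table", class_="nutrition-facts")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)  # ['Total Fat 2.6g', '3%']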

0 Answers