I condensed the nutrition facts from a very messy webpage, stripping out the extra spaces, odd Unicode characters, etc. It was the best way I could find to get everything into a single string that I can simply search through.
"Nutrition FactsServing Size 6 ozlCalories 121*Percent Daily Values (DV)\r are based on a 2,000 calorie diet.Amount/Serving%DV*Amount/Serving%DV*Total Fat 2.6g3%Tot. Carb. 20.4g7% Sat. Fat 0.5g3% Dietary Fiber 2.9g12% Trans Fat 0g Sugars 0.5g Cholesterol 0mg0%Protein 4.3g Sodium 349.4mg15% Vitamin D - mcg 0%Calcium 2%Iron 9%Potassium 3%" (the pdvs follow each amount)
I want to convert the contents of the string above into an organized dictionary in the format below. (I filled in the values to match the string above.)
master = {
    "nutrition": {
        "amountPerServing": {
            "servingSize": "6 ozl",
            "calories": 121,
            "totalFat": "2.6g",
            "saturatedFat": "0.5g",
            "transFat": "0g",
            "cholesterol": "0mg",
            "sodium": "349.4mg",
            "totalCarbohydrate": "20.4g",
            "dietaryFiber": "2.9g",
            "totalSugars": "0.5g",
            "protein": "4.3g"
        },
        "percentDailyValue": {
            "totalFat": 3,
            "saturatedFat": 3,
            "cholesterol": 0,
            "sodium": 15,
            "totalCarbohydrate": 7,
            "dietaryFiber": 12,
            "vitaminD": 0,
            "calcium": 2,
            "iron": 9,
            "potassium": 3
        }
    }
}
This is my current code:
# (.*?) as opposed to (.*); non-greedy expression; https://blog.finxter.com/python-regex-greedy-vs-non-greedy-quantifiers/
master["nutrition"]["amountPerServing"]["servingSize"] = re.search("ize (.*?)Cal", string).group(1)
master["nutrition"]["amountPerServing"]["calories"] = re.search("ies (.*?)*Per", string).group(1)
servingSize comes out fine as "6 ozl", but calories just returns "". Using .group(0) for calories just gives "ies 121*Per". Also, I am using non-greedy matching because greedy matching causes a lot of backtracking and slows the program down heavily. I tried changing the length of the text the pattern has to match. I'm not sure how to parse this string properly and efficiently with this same kind of matching, as this is to be implemented in a web application.
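For reference, a likely cause of the empty calories result: in "ies (.*?)*Per" the asterisk is not escaped, so it acts as a quantifier on the group rather than matching the literal "*" in the string, and .group(1) then reports only the last (possibly empty) repetition. Writing it as \* should match the literal character. Below is a minimal sketch of a label-driven parser built on that fix. The names text, AMOUNT_LABELS, PERCENT_ONLY_LABELS, and parse_nutrition are hypothetical, and the patterns assume every amount ends in "g" or "mg" and that labels appear exactly as in the sample string.

```python
import re

# Sample string from the question; "\r" is the carriage return left over from the page.
text = ("Nutrition FactsServing Size 6 ozlCalories 121*Percent Daily Values (DV)\r"
        " are based on a 2,000 calorie diet.Amount/Serving%DV*Amount/Serving%DV*"
        "Total Fat 2.6g3%Tot. Carb. 20.4g7% Sat. Fat 0.5g3% Dietary Fiber 2.9g12%"
        " Trans Fat 0g Sugars 0.5g Cholesterol 0mg0%Protein 4.3g Sodium 349.4mg15%"
        " Vitamin D - mcg 0%Calcium 2%Iron 9%Potassium 3%")

# Hypothetical lookup tables: label as it appears on the page -> dictionary key.
AMOUNT_LABELS = {
    "Total Fat": "totalFat",
    "Sat. Fat": "saturatedFat",
    "Trans Fat": "transFat",
    "Cholesterol": "cholesterol",
    "Sodium": "sodium",
    "Tot. Carb.": "totalCarbohydrate",
    "Dietary Fiber": "dietaryFiber",
    "Sugars": "totalSugars",
    "Protein": "protein",
}
PERCENT_ONLY_LABELS = {
    "Vitamin D": "vitaminD",
    "Calcium": "calcium",
    "Iron": "iron",
    "Potassium": "potassium",
}


def parse_nutrition(text):
    amounts, percents = {}, {}

    m = re.search(r"Serving Size (.*?)Calories", text)
    if m:
        amounts["servingSize"] = m.group(1)

    # \* matches a literal asterisk; a bare * would quantify the group instead.
    m = re.search(r"Calories (\d+)\*", text)
    if m:
        amounts["calories"] = int(m.group(1))

    for label, key in AMOUNT_LABELS.items():
        # Amount with its unit, optionally followed directly by "<digits>%".
        pattern = re.escape(label) + r" (\d+(?:\.\d+)?(?:mg|g))(?:(\d+)%)?"
        m = re.search(pattern, text)
        if not m:
            continue
        amounts[key] = m.group(1)
        if m.group(2) is not None:
            percents[key] = int(m.group(2))

    for label, key in PERCENT_ONLY_LABELS.items():
        # Skip any non-digit filler (e.g. " - mcg ") before the percentage.
        m = re.search(re.escape(label) + r"\D*?(\d+)%", text)
        if m:
            percents[key] = int(m.group(1))

    return {"nutrition": {"amountPerServing": amounts, "percentDailyValue": percents}}


master = parse_nutrition(text)
```

Each pattern is anchored on a specific label and uses narrow character classes (\d, \D) instead of a broad .*, which keeps backtracking small regardless of greedy or non-greedy quantifiers. Labels missing from a given page are simply skipped rather than raising an AttributeError on .group().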