
I am working on a word cloud problem. I thought my result covered the requirements, since it produces a word cloud without the uninteresting words or punctuation, but apparently it does not. I cannot figure out what I am missing.

The script needs to process the text, remove punctuation, ignore case, skip words that are not purely alphabetic, count word frequencies, and ignore uninteresting or irrelevant words. The calculate_frequencies function should return a dictionary; the wordcloud module then generates the image from that dictionary.

My code:

def calculate_frequencies(file_contents):
    # Here is a list of punctuations and uninteresting words you can use to process your text
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    uninteresting_words = ["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just", \
    "in", "for", "so" ,"on", "says", "not", "into", "because", "could", "out", "up", "back", "about"]
    
    # LEARNER CODE START HERE
    frequencies = {}
    words = file_contents.split()
    final_words = []

    for item in words:
        item = item.lower()

        if item in punctuations:
            words = words.replace(item, "")

        if item not in uninteresting_words and item.isalpha() == True:
            final_words.append(item)

    for final in final_words:
        if final not in frequencies:
            frequencies[final] = 0
        else:
            frequencies[final] += 1

    # wordcloud
    cloud = wordcloud.WordCloud()
    cloud.generate_from_frequencies(frequencies)
    return cloud.to_array()
Ethan
    `words.replace` while you are iterating over `words` is probably a bad idea – OneCricketeer Sep 08 '21 at 00:15
  • You haven't said why it is deemed insufficient. Split probably only splits on whitespace, and punctuation follows words. Um, and you might want to do something pythonesque with a filter over a stream or something. – Maarten Bodewes Sep 08 '21 at 00:17

1 Answer


As written, the line `words = words.replace(item, "")` raises an AttributeError whenever it executes, because `words` is a list and `.replace` is a string method, not a list method. Your counting loop also initializes each new word to 0 instead of 1, so every frequency comes out one low.
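A minimal reproduction of that failure (a sketch to show the error, not part of your script):

```python
# Lists have no .replace method; calling it raises AttributeError.
words = ["hello", "world"]
try:
    words = words.replace("hello", "")
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'replace'
```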


To get just the counts, see the code below.

For stripping punctuation, see: Best way to strip punctuation from a string

For counting, use a Counter

import string
from collections import Counter

uninteresting_words = {"the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
"we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
"their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
"have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
"all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just", \
"in", "for", "so" ,"on", "says", "not", "into", "because", "could", "out", "up", "back", "about"}

def calculate_frequencies(s):
  # uninteresting_words is only read here, so a global declaration is unnecessary
  words = (x.lower().strip().translate(str.maketrans('', '', string.punctuation)) for x in s.strip().split())

  c = Counter(words)
  for x in uninteresting_words:
    if x in c:
      del c[x]
  return c

print(calculate_frequencies('this is a string! A very fancy string?'))
# Counter({'string': 2, 'fancy': 1})
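An equivalent approach (a sketch, not the only way) is to filter the stopwords out of the stream before counting, which avoids the deletion loop afterwards. The stopword set here is shortened for illustration; use the full list from the question:

```python
import string
from collections import Counter

# Shortened stopword set for illustration only
uninteresting_words = {"this", "is", "a", "very"}

def calculate_frequencies(s):
    # Build the translation table once, then lowercase and strip punctuation
    table = str.maketrans('', '', string.punctuation)
    words = (x.lower().translate(table) for x in s.split())
    # Filter before counting instead of deleting entries afterwards;
    # also drop tokens that become empty after stripping
    return Counter(w for w in words if w and w not in uninteresting_words)

print(calculate_frequencies('this is a string! A very fancy string?'))
# Counter({'string': 2, 'fancy': 1})
```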

For a WordCloud, you shouldn't need to count anything yourself, as it does that for you. Notice there is a stopwords parameter and a process_text function whose default regex pattern already ignores punctuation - https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html
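To see roughly what that default tokenization does, here is a stdlib-only sketch. It assumes the default pattern is `\w[\w']+` as described in the linked docs (the real `process_text` also handles stopword removal and optional plural collapsing internally):

```python
import re
from collections import Counter

text = "This is a string! A very fancy string?"
stopwords = {"this", "is", "a", "very"}  # illustrative subset

# The pattern only matches word characters, so punctuation is simply
# never captured; no explicit stripping step is needed.
tokens = re.findall(r"\w[\w']+", text.lower())
counts = Counter(t for t in tokens if t not in stopwords)
print(counts)  # Counter({'string': 2, 'fancy': 1})
```

Note that `\w[\w']+` only matches tokens of two or more characters, which is why single-letter words like "a" drop out even before the stopword filter.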

OneCricketeer