I'm trying to write code that reads through a list of files and counts how often a particular event occurs in each file. But I'm having a lot of trouble just reading the files.

I have gotten the frequency-counting code to work when I specify the file names myself, but I want to generalize it so that I don't have to edit the script every time I run it.

Below is my work-in-progress code for opening and reading the files in a folder:

import os

path = "/Users/Desktop/PracticeCode/TextFiles"

for filename in os.listdir(path):
    with open(filename, 'rU') as f:
        contents = f.read()
        print(filename)
        print(contents)

I don't know what 'rU' means but saw others using 'rU' to open files in a list. Using 'r' results in a similar error.

I expected this to print the name and contents of each file in the folder, but I get the error below. I have no idea how to fix this and would appreciate any feedback.

I think the error message states that something is wrong with file encoding. If this is correct, can someone explain why I don't get this error when specifying files explicitly?

with open(filename, 'rU') as f:
Traceback (most recent call last):
  File "counting_code_2", line 8, in <module>
    contents = f.read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

Edit: I'm posting some lines of the file I've been using to develop the code. It's a text file of Pride and Prejudice.

The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org

Title: Pride and Prejudice

Author: Jane Austen

Posting Date: August 26, 2008 [EBook #1342] Release Date: June, 1998 Last updated: February 15, 2015]

Language: English

Character set encoding: ASCII

* START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE *

Produced by Anonymous Volunteers

PRIDE AND PREJUDICE

By Jane Austen

Chapter 1

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

Edit 2: I have added a line of code just before the with statement:

if filename != '.DS_Store':

This has removed the encoding error. However, there is still an indentation error after the read function. Is my syntax okay?

  • Post a couple of lines of the file you're reading; the traceback says there is a decode error. You probably need to add the `encoding` parameter to `with open(filename, 'rU', encoding='something') as f:` – Trenton McKinney Sep 11 '19 at 16:34
  • Possible duplicate of [UnicodeDecodeError: 'utf-8' codec can't decode byte](https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte) – Trenton McKinney Sep 11 '19 at 16:39
  • Hmm, the file says ASCII but 0x80 isn't in ASCII... Are you sure it's a plain text file? Try checking with `file` (shell command). – wjandrea Sep 11 '19 at 16:52
  • I commented out the opening and read file stuff, and just printed the files in the folder. There is something called .DS_Store which doesn't appear when I normally view my folder. I assume I'd have to specify to look for .txt file using an endswith function? – ChemistryCoding Sep 11 '19 at 16:54
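The `endswith` idea from the last comment can be sketched as below. This is a minimal illustration, not the asker's actual code: the helper name `list_text_files` and the `.txt` suffix are assumptions. It also joins the folder onto each name, since `os.listdir` returns bare file names rather than full paths.

```python
import os


def list_text_files(folder, suffix=".txt"):
    """Return full paths of the regular files in `folder` whose names end with `suffix`.

    `list_text_files` is a hypothetical helper name used for illustration.
    """
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Hidden files such as .DS_Store do not end with ".txt", so they are skipped.
        # The isfile check also skips sub-folders that happen to match the suffix.
        if name.endswith(suffix) and os.path.isfile(full_path):
            paths.append(full_path)
    return paths


# Example call (path taken from the question):
# list_text_files("/Users/Desktop/PracticeCode/TextFiles")
```

Filtering on the wanted suffix is usually more robust than excluding `.DS_Store` by name, because any other non-text file in the folder is skipped as well.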

1 Answer

  1. `rU` is a deprecated mode (universal newlines) that has been superseded by the `newline` parameter of `open()`. Try `rt` instead. This is unlikely to have caused the error, though.

  2. Your error says that it is an encoding problem. You may need to add `encoding=...` (whichever encoding the file was saved in) to your `open()` call.
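Putting the two points together with the filter from the question's second edit, a hedged sketch might look like the following. The function name `read_text_files` and the default of UTF-8 are assumptions; the sample file declares ASCII, which is a subset of UTF-8, but other files may need a different `encoding`. Note that the `with` block is indented one level under the `if`, which is the usual cause of an indentation error after adding a condition.

```python
import os


def read_text_files(folder, encoding="utf-8"):
    """Yield (filename, contents) for every .txt file in `folder`.

    `read_text_files` is a hypothetical name; the encoding default is an assumption.
    """
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".txt"):
            # Everything that depends on the `if` sits one indent level deeper.
            # Joining the folder lets open() work from any working directory.
            with open(os.path.join(folder, filename), "r", encoding=encoding) as f:
                yield filename, f.read()


# Example loop (path taken from the question):
# for name, contents in read_text_files("/Users/Desktop/PracticeCode/TextFiles"):
#     print(name)
#     print(contents)
```

If a file still raises `UnicodeDecodeError`, it was probably saved in another encoding, and passing that encoding explicitly is the next thing to try.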

Matthew Gaiser