
I have a bunch of HTML files whose markup I would like to fix with the help of bs4. But once I run the code, all the files are just empty (luckily I made a backup before running my script on the folder).

This is what I have so far:

from bs4 import BeautifulSoup
import os
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        file = open(entry.path, 'w+')
        soup = BeautifulSoup(file, 'html.parser')
        file.write(soup.prettify())
        print(colored('Success', 'green'))
        file.close()

The expected result is that each file is read, prettified, and saved.

petezurich
Adam

3 Answers


You have truncated the files by opening them in 'w+' mode, which empties a file as soon as it is opened. Take a look at this answer, which explains in detail which mode you require.

More information can be found in the Python docs for 2.7 and for Python 3.
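
The difference between the modes can be checked on a throwaway file. The following is a minimal sketch using only the standard library; the file name demo.html and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical throwaway file, used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.html')
with open(demo_path, 'w') as f:
    f.write('<p>hello</p>')

# 'r' opens for reading and leaves the contents untouched
with open(demo_path, 'r') as f:
    read_with_r = f.read()

# 'w+' truncates the file on open, so there is nothing left to read
with open(demo_path, 'w+') as f:
    read_with_w_plus = f.read()

print(repr(read_with_r))       # '<p>hello</p>'
print(repr(read_with_w_plus))  # ''
```

This is the same thing that happens in the question: by the time BeautifulSoup reads from the file object, the file is already empty.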

Saif Asif
  • Alright, but I want to truncate the file and then save it with new content (prettified) – Adam Jan 15 '20 at 17:03
  • Then you need to read the file contents first, in `r` (read-only) mode, and then open the file in `w` mode to write the new contents – Saif Asif Jan 15 '20 at 17:04

Opening the file with 'w+' deletes what is in it before you can read it. Solution:

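import os

from bs4 import BeautifulSoup
from termcolor import colored  # assumed source of colored(); the question does not show this import

# path is the folder holding the HTML files, as in the question
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        # Read and parse first, while the file still has its contents
        readFile = open(entry.path, 'r')
        soup = BeautifulSoup(readFile, 'html.parser')
        readFile.close()
        # Only now reopen for writing, which truncates the file
        writeFile = open(entry.path, 'w')
        writeFile.write(soup.prettify())
        writeFile.close()
        print(colored('Success', 'green'))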
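
If a single open call is preferred, 'r+' mode may be an option, since it allows reading and writing without truncating on open. The sketch below is only an illustration: rewrite_in_place is a hypothetical helper, and the lambda is a stand-in for the real transformation so that the example runs without bs4.

```python
import os
import tempfile


def rewrite_in_place(file_path, transform):
    """Read file_path, apply transform to its text, and write the result back."""
    # 'r+' opens for reading and writing and keeps the existing contents
    with open(file_path, 'r+') as f:
        new_text = transform(f.read())
        f.seek(0)      # go back to the start before writing
        f.write(new_text)
        f.truncate()   # drop leftover bytes if the new text is shorter


# Demonstration on a throwaway file with a stand-in transform
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.html')
with open(demo_path, 'w') as f:
    f.write('<p>hello world</p>')

rewrite_in_place(demo_path, lambda text: text.replace(' world', ''))

with open(demo_path) as f:
    result = f.read()
print(result)
```

For the question's use case, the transform could be something like lambda text: BeautifulSoup(text, 'html.parser').prettify().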



You've used the 'w+' mode to open the file. This clears (truncates) all file content as soon as the file is opened.

Use 'r' to read the file contents, process them, and then use 'w+' (plain 'w' also works) to overwrite the file with the processed contents.

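import os

from bs4 import BeautifulSoup
from termcolor import colored  # assumed source of colored(); the question does not show this import

# path is the folder holding the HTML files, as in the question
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        # Read the whole file; the with block closes it afterwards
        with open(entry.path, 'r') as f:
            contents = f.read()
        soup = BeautifulSoup(contents, 'html.parser')
        # 'w+' truncates the file, which is safe now that it has been parsed
        with open(entry.path, 'w+') as f:
            f.write(soup.prettify())
        print(colored('Success', 'green'))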
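
The same read-then-write order can also be expressed with pathlib, where each file is fully read and closed before it is written. This is a rough sketch, not the only way to do it: rewrite_html_files is a hypothetical helper, the utf-8 default is an assumption about the files, and str.upper stands in for the prettify step so the example runs without bs4.

```python
import tempfile
from pathlib import Path


def rewrite_html_files(folder, transform, encoding='utf-8'):
    """Apply transform to every .html file directly inside folder; return the rewritten paths."""
    rewritten = []
    for html_file in sorted(Path(folder).glob('*.html')):
        # read_text opens, reads, and closes the file before anything is written
        text = html_file.read_text(encoding=encoding)
        html_file.write_text(transform(text), encoding=encoding)
        rewritten.append(html_file)
    return rewritten


# Demonstration in a throwaway folder with a stand-in transform
folder = Path(tempfile.mkdtemp())
(folder / 'a.html').write_text('<p>a</p>', encoding='utf-8')
(folder / 'notes.txt').write_text('keep me', encoding='utf-8')

done = rewrite_html_files(folder, str.upper)
print([p.name for p in done])
```

Files that do not end in .html are left alone, which mirrors the endswith('html') check in the loops above.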

For more info about the modes for opening files in Python, see these resources:

Excellent StackOverflow answers

Manpagez

Python documentation

Suyash