I have already implemented HTML to DOCX conversion in Python: I parse the HTML with BeautifulSoup, traverse every tag recursively, and build the DOCX document with the python-docx library.

Now I want to do the reverse and convert DOCX to an HTML string. I read about opening an existing document with the python-docx library (https://python-docx.readthedocs.io/en/latest/user/documents.html). However, I could not find a way to traverse each document object and convert it into an HTML string.

Is there a way to do this kind of reverse parsing? I have tried the libraries https://pypi.org/project/docx2html/ and https://pypi.org/project/mammoth/. However, I found that they ignore some styles, and I would like to write the code myself instead of relying on a library.

Any help is greatly appreciated.

Gaurav Bagul
  • A possible solution is here: https://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python and here: https://github.com/mwilliamson/python-mammoth – Rufat Sep 06 '19 at 07:14
  • It is also possible to access the MS Office SaveAs HTML function through the Windows COM (OLE) interface. – Rufat Sep 06 '19 at 07:44

1 Answer

Here is a solution for converting DOCX to HTML through the Windows COM (OLE) interface to MS Office. It requires Windows with MS Word and the pywin32 package installed:

import win32com.client
import win32com.client.dynamic


class WordSaveFormat:
    wdFormatNone = None
    wdFormatHTML = 8


class WordOle:
    def __init__( self, filename ):
        self.wordApp = win32com.client.dynamic.Dispatch( 'Word.Application' )
        self.filename = filename
        self.wordDoc = self.wordApp.Documents.Open( filename )

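    def save( self, newFilename = None, wordSaveFormat = WordSaveFormat.wdFormatNone ):
        if newFilename:
            self.filename = newFilename
            if wordSaveFormat is None:
                # No explicit format given: let Word pick the format itself
                self.wordDoc.SaveAs( newFilename )
            else:
                self.wordDoc.SaveAs( newFilename, wordSaveFormat )
        else:
            self.wordDoc.Save()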

    def close( self ):
        self.wordDoc.Close( SaveChanges = 0 )
        # Quit Word explicitly; dropping the reference alone can leave WINWORD.EXE running
        self.wordApp.Quit()
        del self.wordApp

    def show( self ):
        self.wordApp.Visible = 1

    def hide( self ):
        self.wordApp.Visible = 0


wordOle = WordOle( "D:\\TestDoc.docx" )
wordOle.show()
wordOle.save( "D:\\TestDoc.html", WordSaveFormat.wdFormatHTML )
# wordOle.save( "D:\\TestDoc2.docx", WordSaveFormat.wdFormatNone )
wordOle.close()
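
If Word is not available, or you want to walk the document yourself as the question asks, here is a minimal sketch that uses only the standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The function names (docx_to_html, paragraph_to_html, run_to_html) are made up for this example. It also assumes that headings use the style ids Heading1 to Heading6, which is what Word's default English template produces but is not guaranteed for every document. Only direct bold and italic run formatting is handled. Formatting inherited from styles, as well as lists, table structure and images, is ignored, so treat it as a starting point rather than a complete converter.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace in ElementTree's "{uri}tag" notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(rpr, tag):
    """True if a toggle property such as <w:b/> is present and not switched off."""
    if rpr is None:
        return False
    el = rpr.find(W + tag)
    if el is None:
        return False
    return el.get(W + "val", "true") not in ("0", "false")


def run_to_html(run):
    """Convert one <w:r> run, honouring direct bold/italic formatting only."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    html = escape(text)
    rpr = run.find(W + "rPr")
    if _is_on(rpr, "i"):
        html = "<em>" + html + "</em>"
    if _is_on(rpr, "b"):
        html = "<strong>" + html + "</strong>"
    return html


def paragraph_to_html(par):
    """Convert one <w:p>; style ids Heading1..Heading6 (an assumption) map to <h1>..<h6>."""
    style = par.find(W + "pPr/" + W + "pStyle")
    style_id = style.get(W + "val", "") if style is not None else ""
    tag = "p"
    if style_id.startswith("Heading") and style_id[7:] in list("123456"):
        tag = "h" + style_id[7:]
    body = "".join(run_to_html(r) for r in par.iter(W + "r"))
    return "<{0}>{1}</{0}>".format(tag, body)


def docx_to_html(source):
    """Read word/document.xml from a .docx path or file object and return HTML."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    # iter() also reaches paragraphs nested in tables, so table text is flattened
    return "\n".join(paragraph_to_html(p) for p in body.iter(W + "p"))
```

Usage would look like html = docx_to_html( "D:\\TestDoc.docx" ). To cover more styles, extend run_to_html with further run properties (for example w:u for underline) and resolve paragraph and character styles from word/styles.xml.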
Rufat