Extract text from pptx and export to Excel

Question

I used the code below to extract text from .pptx files:

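import glob

from pptx import Presentation

# Print the text of every shape on every slide of each .pptx file in the folder
for eachfile in glob.glob(r"C:\Users\Desktop\powerpoint file\*.pptx"):
    prs = Presentation(eachfile)

    for slide in prs.slides:
        for shape in slide.shapes:
            # Skip shapes (pictures, for example) that have no text attribute
            if hasattr(shape, "text"):
                print(shape.text)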

Input pptx looks like this.

The output looks like the following (it is not in order). I am also trying to export it to Excel. Please help.

Expected output:

Please check the question https://stackoverflow.com/questions/13437727/writing-to-an-excel-spreadsheet for a number of suggestions on how to write data to Excel. — Marek Grzenkowicz, Mar 12 '21 at 12:36
@MarekGrzenkowicz, thanks, but I was not able to extract the text by page. Please guide me. Thanks — mani 05, Mar 12 '21 at 14:36

score 0 · Answer 1 · answered Mar 12 '21 at 19:05

The shapes appearing on a slide form a sequence ordered by z-order; that is, shapes later in the list are "on top of" shapes earlier in the list.

They do not appear in some sort of "the order you would read them left-to-right, top-to-bottom" order.

If you want the text to appear in the order it might most naturally be scanned by a reader, you'll need to consider the position of each shape and perhaps sort them by (top, left). While that's probably a start, it's likely you'll need more sophisticated rules to account for things like "columns" of text that are scanned differently.
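As a starting point, the (top, left) sort could be sketched roughly as below. This is a minimal sketch, not a complete solution: the function names `reading_order` and `slide_texts` are hypothetical, and treating a missing position as 0 is an assumption. It does nothing about columns or grouped shapes.

```python
def reading_order(shapes):
    """Return the text-bearing shapes sorted top-to-bottom, then left-to-right."""
    with_text = [s for s in shapes if getattr(s, "has_text_frame", False)]
    # top/left are distances in EMU; they may be None when a shape has no
    # explicit position, so fall back to 0 (an assumption of this sketch).
    return sorted(with_text, key=lambda s: (s.top or 0, s.left or 0))


def slide_texts(path):
    """Collect (slide_number, text) pairs for one .pptx file."""
    # python-pptx is imported here so reading_order() stays dependency-free.
    from pptx import Presentation

    rows = []
    for slide_no, slide in enumerate(Presentation(path).slides, start=1):
        for shape in reading_order(slide.shapes):
            rows.append((slide_no, shape.text_frame.text))
    return rows
```

Calling `slide_texts` on a file path gives the text grouped by slide number, which also covers the "by page" part of the question.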

This problem arises because, unlike a Microsoft Word document, the content of a PowerPoint slide is not flowed and has no natural "content" or "reading" sequence, only a "visual" sequence of which shape is stacked on top of which other shape.
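Whatever order you settle on, writing the result to Excel is a separate step. Here is a rough sketch, assuming the openpyxl package is installed and that the text has already been collected per slide into a dict. The names `build_rows` and `save_rows` and the "Slide"/"Text" column headers are hypothetical choices for illustration.

```python
def build_rows(texts_by_slide):
    """Flatten {slide_number: [text, ...]} into worksheet rows with a header."""
    rows = [("Slide", "Text")]
    for slide_no in sorted(texts_by_slide):
        for text in texts_by_slide[slide_no]:
            rows.append((slide_no, text))
    return rows


def save_rows(rows, xlsx_path):
    """Write the rows to the first sheet of a new workbook."""
    # openpyxl is third-party; imported here so build_rows() has no dependencies.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)  # one worksheet row per (slide number, text) pair
    wb.save(xlsx_path)
```

For example, `save_rows(build_rows({1: ["Title", "Body"]}), "slides.xlsx")` would produce a two-column sheet with one row per text block.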
