I used the code below to extract text from .pptx files:

import glob

from pptx import Presentation

for eachfile in glob.glob(r"C:\Users\Desktop\powerpoint file\*.pptx"):
    prs = Presentation(eachfile)
    
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)

The input .pptx looks like this:

[Screenshots of the input slides; images not available]

The output is shown below; it is not in order. I am also trying to export it to Excel. Please help.

[Screenshot of the output; image not available]

Expected output:

[Screenshot of the expected output; image not available]

Jignasha Royala
mani 05

  • Please check the question https://stackoverflow.com/questions/13437727/writing-to-an-excel-spreadsheet to see a number of suggestions how to write data to Excel. – Marek Grzenkowicz Mar 12 '21 at 12:36
  • @MarekGrzenkowicz, thanks, but I was not able to extract the text by page. Please guide me. Thanks – mani 05 Mar 12 '21 at 14:36

1 Answer

The shapes on a slide form a sequence ordered by z-order: shapes later in the list are "on top of" shapes earlier in the list.

They do not appear in some sort of "the order you would read them left-to-right, top-to-bottom" order.

If you want the text to appear in the order it might most naturally be scanned by a reader, you'll need to consider the position of each shape and perhaps sort them by (top, left). While that's probably a start, it's likely you'll need more sophisticated rules to account for things like "columns" of text that are scanned differently.
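As a starting point, the (top, left) sort could look something like the sketch below. The function names shapes_in_reading_order, text_rows and rows_from_file are made up for illustration and are not part of python-pptx. It assumes each shape exposes top and left (python-pptx reports these as offsets from the slide's top-left corner) and treats a missing position as 0. It does not handle columns, grouped shapes or tables.

```python
def shapes_in_reading_order(shapes):
    """Return the text-bearing shapes sorted top-to-bottom, then left-to-right."""
    # Same filter as in the question: skip shapes without a .text attribute.
    with_text = [s for s in shapes if hasattr(s, "text")]
    # A position may be None; fall back to 0 so the sort never compares None.
    return sorted(with_text, key=lambda s: (s.top or 0, s.left or 0))


def text_rows(slides):
    """Yield (slide_number, text) pairs, one per shape, slide by slide."""
    for number, slide in enumerate(slides, start=1):
        for shape in shapes_in_reading_order(slide.shapes):
            yield number, shape.text


def rows_from_file(path):
    """Open a .pptx file and return its (slide_number, text) rows as a list."""
    # Imported here so the helpers above work without python-pptx installed.
    from pptx import Presentation  # third-party: pip install python-pptx

    return list(text_rows(Presentation(path).slides))
```

Calling rows_from_file(eachfile) inside the glob loop from the question would give one (slide number, text) row per shape, which is a convenient shape for writing to a spreadsheet afterwards. Shapes whose tops differ by only a few units will still sort strictly by top, so rounding top to a coarser grid may be needed for rows that are meant to be read left to right.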

This problem arises because, unlike a Microsoft Word document, the content of a PowerPoint slide is not flowed and has no natural "content" or "reading" sequence, only a "visual" sequence of which shape is stacked on top of which other shape.

scanny