1

I am trying to read a PDF using Python, and the content has many newline (CRLF) characters. I tried removing them using the code below:

from tika import parser

filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)

But the output remains unchanged. I also tried using double backslashes, which didn't fix the issue. Can someone please advise?

Leni
  • 2
    What sort of data structure is "content"? Post a sample of it to help us help you? – Nick Feb 19 '19 at 07:34
  • This example is not reproducible without knowing what `content` contains. – Ignatius Feb 19 '19 at 07:35
  • You can't just read a literal PDF file and make text replacements like this. You need a Python library which can parse PDF content. – Tim Biegeleisen Feb 19 '19 at 07:37
  • content is a string. I checked it using type(content). @TimBiegeleisen I use the text after parsing the file with tika, as you can see in the code. – Leni Feb 19 '19 at 07:40
  • I've never heard of tika before, but after a quick google search I'm 99% sure it's not a pdf parser. – Aran-Fey Feb 19 '19 at 07:50

4 Answers

7
content = content.replace("\\r\\n", "")

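You need to double-escape them. If the text contains the literal characters backslash-r backslash-n rather than actual line breaks, "\r\n" never matches, but "\\r\\n" does.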
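
Whether the doubled backslashes help depends on what the string actually holds. A minimal sketch of the two cases, using made-up sample strings rather than real Tika output (the helper name strip_crlf is hypothetical):

```python
# Hypothetical sample strings; real Tika output may look different.
real_crlf = "line one\r\nline two"        # holds an actual CR and LF character
literal_crlf = "line one\\r\\nline two"   # holds the four characters \ r \ n


def strip_crlf(text):
    """Remove real CRLF pairs and literal backslash-r backslash-n sequences."""
    return text.replace("\r\n", "").replace("\\r\\n", "")


# repr() shows which of the two cases a string falls into.
print(repr(real_crlf))
print(repr(literal_crlf))

print(strip_crlf(real_crlf))
print(strip_crlf(literal_crlf))
```

Printing repr(content) on the real data is a quick way to check which pattern, if either, is present before choosing a replacement.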

Albert
2

I don't have access to your PDF file, so I processed one on my system. I also don't know whether you need to remove all newlines or just double newlines. The code below replaces double newlines with single ones, which makes the output more readable.

Please let me know if this works for your current needs.

from tika import parser

filename = 'myfile.pdf'

# Parse the PDF
parsedPDF = parser.from_file(filename)

# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]

# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')

#####################################
# Do something with the PDF
#####################################
print(pdf)
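
A single replace('\n\n', '\n') pass only collapses non-overlapping pairs, so a run of four newlines still leaves two behind. A minimal sketch of one alternative, using the standard re module; the helper name collapse_newlines and the sample text are illustrative assumptions, not part of the code above:

```python
import re


def collapse_newlines(text):
    """Collapse every run of line breaks (LF or CRLF) into a single LF."""
    return re.sub(r"(?:\r?\n)+", "\n", text)


# Made-up sample with runs of different lengths and mixed line endings.
sample = "first\n\n\n\nsecond\r\n\r\nthird"
print(collapse_newlines(sample))
```

The pattern also matches a lone line break, so CRLF endings are normalized to LF as a side effect.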
Life is complex
0
print(open('myfile.txt').read().replace('\n', ''))
Ryan M
  • 5
    So what is this meant to do? How does this answer the question? Please [edit] your answer and explain the answer. Additionally please read [answer] – The Grand J Mar 19 '21 at 03:43
0

If you are having issues with different forms of line break, try the str.splitlines() method and then re-join the result using the string you're after. Like this:

content = "".join(line for line in content.splitlines() if line)

Then you just have to change the string inside the quotes to whatever you want to join on. This approach handles every line boundary that str.splitlines() recognizes, not just \n and \r\n. Be aware, though, that str.splitlines() returns a list, not an iterator, so for large strings it will noticeably increase memory usage. In those cases, you are better off using the file stream or io.StringIO and reading line by line.
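
A minimal sketch of the line-by-line variant, assuming the text is already in memory and wrapped in io.StringIO; the generator name iter_nonempty_lines and the sample string are hypothetical. Note that iterating a stream like this splits on \n only, so it does not cover the full set of boundaries that str.splitlines() handles:

```python
import io


def iter_nonempty_lines(stream):
    """Yield each non-empty line of a text stream without its line ending."""
    for line in stream:
        stripped = line.rstrip("\r\n")
        if stripped:
            yield stripped


# Made-up sample mixing CRLF and LF endings, with one empty line.
content = "first\r\n\r\nsecond\nthird\n"
joined = " ".join(iter_nonempty_lines(io.StringIO(content)))
print(joined)
```

The same generator accepts an open file object, which avoids holding a second, split copy of the whole text in memory.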

Aaron Ford