Channel: How to extract text from a PDF file? - Stack Overflow

How to extract text from a PDF file?

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 package (version 1.27.2), and have the following script:

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output, which differs from the text in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 45' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)%
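
To rule out a display problem, the extracted string can be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper named letter_ratio (my own name, not part of PyPDF2), that measures how much of the non-whitespace output is alphabetic:

```python
def letter_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


# Short excerpt copied from the start of the garbled output shown above.
garbled = '! " # $ % # $ % &% $'

print(letter_ratio(garbled))        # share of letters in the garbled excerpt
print(letter_ratio("Hello world"))  # share of letters in ordinary prose
```

A ratio near zero for a page that visibly contains prose suggests that the extraction step itself, rather than the printing, is producing the wrong characters.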

How can I extract the text exactly as it appears in the PDF document?
