PyPDF2 is a pure-Python library built as a PDF toolkit. It is capable of extracting document information (title, author, …) and more. Being pure Python, it should run on any Python platform without any dependencies on external libraries. It can also work entirely on StringIO objects rather than file streams, allowing for PDF manipulation in memory.
The PdfFileReader Class
class PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
Initializes a PdfFileReader object. This operation can take some time, as the PDF stream’s cross-reference tables are read into memory.
decrypt(password)
When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.
It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.
Parameters: | password (str) – The password to match. |
---|---|
Returns: | 0 if the password failed, 1 if the password matched the user password, and 2 if the password matched the owner password. |
Return type: | int |
Raises NotImplementedError: | if the document uses an unsupported encryption method. |
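As a rough illustration of the return codes described above, the following sketch wraps decrypt() in a small helper. The names unlock and DECRYPT_RESULTS are our own and not part of PyPDF2, and reader is assumed to be a PdfFileReader instance from a PyPDF2 release that still provides this class.

```python
# Meaning of the integer returned by PdfFileReader.decrypt().
DECRYPT_RESULTS = {
    0: "password failed",
    1: "matched the user password",
    2: "matched the owner password",
}


def unlock(reader, password):
    """Return True if `reader` can be read, decrypting it when necessary.

    `unlock` is a hypothetical helper; `reader` is assumed to be a PdfFileReader.
    """
    if not reader.isEncrypted:
        return True  # nothing to decrypt
    result = reader.decrypt(password)  # may raise NotImplementedError
    print(DECRYPT_RESULTS.get(result, "unexpected result"))
    return result != 0
```

Either a result of 1 or 2 is treated as success here, since both passwords yield a usable decryption key.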
documentInfo
Read-only property that accesses the getDocumentInfo() function.
getDestinationPageNumber(destination)
Retrieves the page number of a given Destination object.
Parameters: | destination (Destination) – The destination whose page number is wanted. Should be an instance of Destination. |
---|---|
Returns: | the page number, or -1 if the page is not found |
Return type: | int |
getDocumentInfo()
Retrieves the PDF file’s document information dictionary, if it exists. Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function.
Returns: | the document information of this PDF file |
---|---|
Return type: | DocumentInformation or None if none exists. |
getFields(tree=None, retval=None, fileobj=None)
Extracts field data if this PDF contains interactive form fields. The tree and retval parameters are for recursive use.
Parameters: | fileobj – A file object (usually a text file) to write a report to on all interactive form fields found. |
---|---|
Returns: | A dictionary where each key is a field name, and each value is a Field object. By default, the mapping name is used for keys. |
Return type: | dict, or None if form data could not be located. |
getFormTextFields()
Retrieves form fields from the document with textual data (inputs, dropdowns).
getNamedDestinations(tree=None, retval=None)
Retrieves the named destinations present in the document.
Returns: | a dictionary which maps names to Destinations. |
---|---|
Return type: | dict |
getNumPages()
Calculates the number of pages in this PDF file.
Returns: | number of pages |
---|---|
Return type: | int |
Raises PdfReadError: | if the file is encrypted and restrictions prevent this action. |
getOutlines(node=None, outlines=None)
Retrieves the document outline present in the document.
Returns: | a nested list of Destinations. |
---|---|
getPage(pageNumber)
Retrieves a page by number from this PDF file.
Parameters: | pageNumber (int) – The page number to retrieve (pages begin at zero) |
---|---|
Returns: | a PageObject instance. |
Return type: | PageObject |
getPageLayout()
Get the page layout. See setPageLayout() for a description of valid layouts.
Returns: | Page layout currently being used. |
---|---|
Return type: | str, None if not specified |
getPageMode()
Get the page mode. See setPageMode() for a description of valid modes.
Returns: | Page mode currently being used. |
---|---|
Return type: | str, None if not specified |
getPageNumber(page)
Retrieves the page number of a given PageObject.
Parameters: | page (PageObject) – The page whose page number is wanted. Should be an instance of PageObject. |
---|---|
Returns: | the page number, or -1 if the page is not found |
Return type: | int |
getXmpMetadata()
Retrieves XMP (Extensible Metadata Platform) data from the PDF document root.
Returns: | an XmpInformation instance that can be used to access XMP metadata from the document. |
---|---|
Return type: | XmpInformation, or None if no metadata was found on the document root. |
isEncrypted
Read-only boolean property showing whether this PDF file is encrypted. Note that this property, if true, will remain true even after the decrypt() method is called.
namedDestinations
Read-only property that accesses the getNamedDestinations() function.
numPages
Read-only property that accesses the getNumPages() function.
outlines
Read-only property that accesses the getOutlines() function.
pageLayout
Read-only property accessing the getPageLayout() method.
pageMode
Read-only property accessing the getPageMode() method.
pages
Read-only property that emulates a list based upon the getNumPages() and getPage() methods.
xmpMetadata
Read-only property that accesses the getXmpMetadata() function.
In this Python tutorial, we will discuss what PyPDF2 is and walk through the various methods of the PdfFileReader class, with examples.
We will learn about the PdfFileReader class and its methods. It is a class from the PyPDF2 module that is widely used to access and manipulate PDF files in Python.
PyPDF2 Python Library
- Python is used for a wide variety of purposes and has libraries and classes for all kinds of tasks. One such task is reading text from a PDF.
- PyPDF2 offers classes that help us read, merge, and write PDF files.
- PdfFileReader is used to perform all the operations related to reading a file.
- PdfFileMerger is used to merge multiple PDF files together.
- PdfFileWriter is used to perform write operations on a PDF.
- Each of these classes has various functions that let a programmer perform operations on a PDF.
- At the time of writing, PyPDF2 had stopped receiving updates after the Python 3.5 era, but it is still used to work with PDFs. In this tutorial, we cover the PdfFileReader class and point out which functions are deprecated or broken.
Install PyPDF2 in Python
To use the PyPDF2 library in Python, we first need to install it. PyPDF2 is a pure-Python package, so you can install it using pip (assuming pip is in your system’s path) by running the following command:
python -m pip install pypdf2
With conda, the equivalent command is conda install -c conda-forge pypdf2.
After reading this tutorial, you will know what each function in the PdfFileReader class does. We also demonstrate an example for each function.
PdfFileReader in Python
- PdfFileReader in Python offers functions that help in reading and viewing a PDF file. Using these functions you can query the PDF by page number, content, page mode, and so on.
- The first step is to import the PyPDF2 module:
import PyPDF2
- The next step is to create a file object for the PDF by opening it with a second argument, rb, which means read binary. We have used a PDF file named ‘sample’, stored in the same directory as the main program.
- Next, the PdfFileReader function is used to read the object that holds the PDF file. It also accepts a few more arguments.
- Here is the explanation of all four arguments:
- stream: The object that holds the PDF file. In our case it is pdfObj.
- strict: Determines whether the user should be warned of all problems found while reading the PDF; it also causes some correctable problems to be treated as fatal. Set it to False to relax this. By default it is True.
- warndest: Destination for logging warnings (default is sys.stderr).
- overwriteWarnings: Determines whether to override Python’s warnings.py module with a custom implementation (default is True).
- The screenshot of the implementation showed three things:
- The files are listed on the left side. ‘sample’ is the PDF file that we have used in this program.
- All the above code is in the centre.
- The terminal showed an error the first time we ran the program, so we installed the PyPDF2 module. When we ran the program again nothing was printed, because so far we have only read the file.
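The steps above can be sketched as follows. This is a minimal sketch, assuming a PyPDF2 release that still provides PdfFileReader; the helper name open_reader is our own, and sample.pdf is only the example file name used in this tutorial.

```python
def open_reader(path, strict=True):
    """Open `path` in binary mode and wrap the stream in a PdfFileReader.

    Returns (file_object, reader). The caller should close the file object
    when finished, because the reader keeps reading from it.
    `open_reader` is a hypothetical helper, not part of PyPDF2.
    """
    # Imported here so the helper can be defined before PyPDF2 is installed.
    from PyPDF2 import PdfFileReader

    pdf_obj = open(path, "rb")  # "rb" = read binary
    reader = PdfFileReader(pdf_obj, strict=strict)
    return pdf_obj, reader


# Example use (assumes sample.pdf sits next to the script):
# pdf_obj, reader = open_reader("sample.pdf")
# ... work with reader ...
# pdf_obj.close()
```

Nothing is printed by this step on its own; the reader object is only useful once one of its methods or properties is used.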
PdfFileReader python example
In this section, we will cover the functions of the PdfFileReader class. Our approach is to explain each function in the simplest way and to demonstrate an example for each. So let us see a few PdfFileReader Python examples.
Get PDF information using PdfFileReader in Python
PdfFileReader provides a documentInfo property, which gives us the information about a PDF file in Python.
- It retrieves the PDF document information in a dictionary format, if it exists.
- documentInfo is a property, not a method. Calling it as documentInfo() raises the following error:
TypeError: 'DocumentInformation' object is not callable
- If you are seeing the above error, simply remove the () from documentInfo.
Example:
Here is an example of using the documentInfo property.
Code Snippet:
In this code, we have displayed the information of sample.pdf in Python.
Output:
In this output, you can notice that the information of sample.pdf is displayed in a dictionary format.
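A minimal sketch of reading the document information, assuming reader is a PdfFileReader instance. The helper name document_info is our own and not part of PyPDF2.

```python
def document_info(reader):
    """Return the document information as a plain dict ({} if there is none).

    `document_info` is a hypothetical helper; `reader` is assumed to be a
    PdfFileReader.
    """
    info = reader.documentInfo  # a property, so no parentheses
    return dict(info) if info is not None else {}
```

The keys of the returned dictionary are PDF names such as /Title or /Author, depending on what the file actually stores.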
Get the page number of a destination in PDF using PdfFileReader in Python
PdfFileReader provides a method getDestinationPageNumber(destination) which gives us the page number that a destination points to.
- It takes a Destination object (for example, an entry from the outlines or the named destinations) as its argument, not a page number.
- It returns the page number of that destination, or -1 if the page is not found.
- It is helpful when you have a destination, such as an index entry, and want to know which page it leads to.
- At the time of writing, PyPDF2 had not been updated since the Python 3.5 era, so there are a few bugs and broken functions. In our tests this function worked reliably only with Python 3.5 or below.
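One way to obtain Destination objects is from the named destinations of the document. The sketch below assumes reader is a PdfFileReader instance; named_destination_pages is a hypothetical helper name, and whether the calls succeed on a given file depends on the bugs mentioned above.

```python
def named_destination_pages(reader):
    """Map each named destination to its zero-based page number.

    A value of -1 means the page could not be found.
    `named_destination_pages` is a hypothetical helper, not part of PyPDF2.
    """
    pages = {}
    for name, destination in reader.getNamedDestinations().items():
        pages[name] = reader.getDestinationPageNumber(destination)
    return pages
```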
Get field data from PDF using PdfFileReader in Python
PdfFileReader provides a method getFields(tree=None, retval=None, fileobj=None) which extracts field data from an interactive PDF in Python.
- The tree and retval parameters are for recursive use.
- This function extracts field data if the PDF contains interactive form fields.
- Interactive forms are those in which users can fill in information.
- These interactive PDFs may not work if downloaded directly, so we have used Python code to download them in a working state.
Code Snippet:
You can find interactive forms on the internet and download them with Python code. Simply provide the path of the interactive PDF file. In our case, we are downloading it from https://royalegroupnyc.com
Here is the code snippet to read the interactive PDF in Python.
Output:
In this output, you can notice that all the information is fetched in a dictionary format. If the PDF does not contain interactive fields, None is returned.
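A minimal sketch of reading the field data, assuming reader is a PdfFileReader instance opened on an interactive PDF. The helper name field_values is our own; it relies on each Field behaving like a dictionary in which the /V entry holds the field's value.

```python
def field_values(reader):
    """Return {field name: value} for an interactive PDF, or {} if it has no form.

    `field_values` is a hypothetical helper, not part of PyPDF2.
    """
    fields = reader.getFields()  # None when no form data is found
    if fields is None:
        return {}
    # Each Field behaves like a dictionary; "/V" holds the field's value.
    return {name: field.get("/V") for name, field in fields.items()}
```

Fields that have not been filled in come back with a value of None.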
Get text data from fields in PDF using PdfFileReader in Python
PdfFileReader provides a method getFormTextFields() to extract text data from an interactive PDF in Python.
- This function retrieves the text data that the user has entered in the interactive PDF.
- The data is returned in a dictionary format.
- If you see the following error, the PDF does not contain interactive fields:
TypeError: 'NoneType' object is not iterable
- The major difference between getFields() and getFormTextFields() is that getFields() returns all the field information, whereas getFormTextFields() returns only the values entered in the text fields.
Code Snippet:
In this code, we have used this function in the last line, where it displays the output.
Output:
In this output, you can notice in the terminal section that Name has the value None. This means that no value was entered for that field in the PDF.
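The TypeError mentioned above can be avoided by checking getFields() first. This is a minimal sketch, assuming reader is a PdfFileReader instance; text_field_values is a hypothetical helper name.

```python
def text_field_values(reader):
    """Return the text-field values, or {} when the PDF has no form fields.

    `text_field_values` is a hypothetical helper, not part of PyPDF2.
    """
    if reader.getFields() is None:
        # getFormTextFields() would raise TypeError in this case.
        return {}
    return reader.getFormTextFields()
```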
Get to the named Destinations in PDF using PdfFileReader in Python
PdfFileReader provides a method getNamedDestinations(tree=None, retval=None) to get the named destinations of a PDF in Python.
- This function retrieves the named destinations present in the document.
- It returns an empty dictionary if no named destination is found.
Code Snippet:
In this code, this function is used in the last line. It displays the named destinations present in Smallpdf.pdf.
Output:
In this output, you can notice on the terminal that empty curly braces are returned. It means that no named destination is present in Smallpdf.pdf.
Get the total page count of PDF using PdfFileReader in Python
PdfFileReader provides a method getNumPages() which returns the total number of pages in the PDF file in Python.
- This function returns the total number of pages in the PDF file.
- The same value is available through the read-only numPages property.
Code Snippet:
In this code, this function is used in the last line, where it displays the number of pages in ‘sample.pdf’.
Output:
In this output, you can notice the result on the terminal. The sample.pdf has a total of 8 pages.
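Because pages are numbered from zero, the page count also tells you which page numbers are valid. A minimal sketch, assuming reader is a PdfFileReader instance; page_numbers is a hypothetical helper name.

```python
def page_numbers(reader):
    """Return the valid zero-based page numbers of the PDF as a range.

    `page_numbers` is a hypothetical helper, not part of PyPDF2.
    """
    return range(reader.getNumPages())
```

For a PDF with 8 pages, the valid page numbers run from 0 to 7.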
Get outlines in the PDF using PdfFileReader in Python
PdfFileReader provides a method getOutlines(node=None, outlines=None) which retrieves the outlines of the PDF file in Python.
- This function retrieves the outline present in the PDF file.
- In other words, it retrieves a nested list of destinations.
- Outlines are the bookmarks that PDF viewers show in a navigation panel; they act as a table of contents for the document. They are different from the annotations or markings that reviewers add to pages.
- Because PyPDF2 had not been updated since the Python 3.5 era, some things are broken. In our tests, getOutlines() was one of the functions that stopped working properly after Python 3.5.
- We tried it, but it did not work for us: the output was empty even after adding outlines to the PDF.
- We will update the blog once we find a solution.
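When getOutlines() does return data, the result is a nested list, which can be flattened as in the sketch below. The helper name outline_titles is our own, and the sketch assumes each entry exposes a title attribute, as Destination objects do.

```python
def outline_titles(outlines, depth=0):
    """Flatten the nested list from getOutlines() into (depth, title) pairs.

    `outline_titles` is a hypothetical helper, not part of PyPDF2.
    """
    flat = []
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            flat.extend(outline_titles(item, depth + 1))
        else:
            flat.append((depth, item.title))
    return flat
```

It would be called as outline_titles(reader.getOutlines()), where reader is a PdfFileReader instance.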
Jump to a specific page of PDF using PdfFileReader in Python
PdfFileReader provides a method getPage(pageNumber) which retrieves a specific page so that its content can be viewed.
- This function returns the page at the provided page number as a PageObject.
- To extract the content in a readable format, we have to use a function named extractText().
- extractText() is a function of the PageObject class. Using this function we can read the content of the PDF.
Code Snippet:
In this code, we are using a single-page PDF file named Smallpdf.pdf. In the last line of the code we have passed page number 0 as an argument and applied the extractText() function to display the content.
Output:
In this output, you can see that the data from page 0, that is, the first page, is displayed on the screen. The content is in human-readable format.
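A minimal sketch of the same idea, assuming reader is a PdfFileReader instance. The helper name page_text and its range check are our own additions.

```python
def page_text(reader, page_number=0):
    """Return the text of one page; page numbers start at zero.

    `page_text` is a hypothetical helper, not part of PyPDF2.
    """
    if not 0 <= page_number < reader.getNumPages():
        raise IndexError("page number out of range")
    page = reader.getPage(page_number)  # a PageObject
    return page.extractText()
```

The quality of the extracted text depends on how the PDF was produced, so the result may differ from what a viewer displays.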
Get Page Mode of PDF using PdfFileReader in Python
PdfFileReader provides a method getPageMode() which returns the page mode of a PDF in Python.
- This function is used to get the page mode.
- There are various valid page modes; None is returned if no mode is specified.
Get Page Layout of the PDF using PdfFileReader in Python
PdfFileReader provides a method getPageLayout() which returns the page layout of a PDF in Python.
- This function is used to get the page layout of the PDF.
- There are various valid layouts; None is returned if no layout is specified.
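The two getters can be combined as in the sketch below, assuming reader is a PdfFileReader instance. The constant and helper names are our own; the listed values are the page modes and layouts accepted by setPageMode() and setPageLayout().

```python
# Valid page modes and page layouts, listed here for reference.
PAGE_MODES = (
    "/UseNone", "/UseOutlines", "/UseThumbs",
    "/FullScreen", "/UseOC", "/UseAttachments",
)
PAGE_LAYOUTS = (
    "/NoLayout", "/SinglePage", "/OneColumn", "/TwoColumnLeft",
    "/TwoColumnRight", "/TwoPageLeft", "/TwoPageRight",
)


def describe_view(reader):
    """Return the page mode and layout, using 'not specified' for None.

    `describe_view` is a hypothetical helper, not part of PyPDF2.
    """
    mode = reader.getPageMode()      # e.g. "/UseOutlines", or None
    layout = reader.getPageLayout()  # e.g. "/SinglePage", or None
    return {
        "mode": mode or "not specified",
        "layout": layout or "not specified",
    }
```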
Get Encryption information of the PDF using PdfFileReader in Python
PdfFileReader provides an isEncrypted property which allows us to check whether the PDF file is encrypted in Python.
- It shows whether the PDF is encrypted or not.
- The value is a boolean (True/False). isEncrypted is a property, so it is not callable.
- If the property is True, it will remain True even after the file is decrypted with decrypt().
- In our example, Smallpdf.pdf is not encrypted, which is why the output is False.
With this, we have completed the Python PdfFileReader class and its functions. There were two functions that were broken in our tests: getOutlines() and getDestinationPageNumber().
Here are the PdfFileReader Python examples that we covered:
- Get PDF information using PdfFileReader in Python
- How to get the page number of a destination in PDF using PdfFileReader in Python
- Get field data from PDF using PdfFileReader in Python
- How to get text data from fields in PDF using PdfFileReader in Python
- Get to the named Destinations in PDF using PdfFileReader in Python
- How to get the total page count of PDF using PdfFileReader in Python
- How to get outlines in the PDF using PdfFileReader in Python
- Jump to a specific page of PDF using PdfFileReader in Python
- How to get Page Mode of PDF using PdfFileReader in Python
- Get Page Layout of the PDF using PdfFileReader in Python
- How to get Encryption information of the PDF using PdfFileReader in Python