A complete word processing with Python | by Himalaya Bir Shrestha | Jan, 2023

By Jessie Hobb On Jan 25, 2023

Reading pdf file, utilizing regular expressions, exporting to Excel and Word document, and converting it back to pdf format

Recently for a self-study project, I had to go through an 800-page pdf file. Each chapter of the file contained a common set of questions. And I needed the answers to specific questions in each chapter. Now it’d take me forever to go through each page of the document and assess the answers to those questions. I was wondering if there was a quick way to scan through each page and extract only the relevant information for me from the file. I figured out a Pythonic way to do the same. In this post, I am going to share how I was able to read the pdf file, extract only relevant information from each chapter of the file, export the data into Excel and editable word document, and convert it back to pdf format using different packages in Python. Let’s get started.

Data

Rather than an 800-page document, I am going to use a 4-page pdf file as an example. During the final days of high school, my classmates passed around a diary called “Auto book” as a memory to collect the interests, preferences, and contact information of each other. The pdf file I am using contains dummy information about four imaginary friends named Ram, Shyam, Hari, and Shiva. The file contains information such as their nationality, date of birth, preferences (food, fruit, sports, player, movie, actor), favorite quotes, aim, views on politics, and message to the world.

Pdf file called autobook.pdf containing information and messages from four friends. Image by Author.

It’d be easy to extract information for a few friends by copying and pasting directly from the pdf file. However, if the pdf file is large, it’d be much more efficient and precise to do it using Python. The following sections show how it is done step by step in Python.

1. Reading pdf document using PyPDF2 or PyMuPDF packages

a. Read the first page using PyPDF2

To read the text in the pdf file using Python, I use a package called PyPDF2 and its PdfReader class. In the code snippet below, I read just the first page of the pdf file and extract the text from it.

Script to read the first page of the pdf file. Image by Author.

b. Read the entire text of the file using PyPDF2

To read the entire text from the pdf file, I use a function called extract_text_from_pdf as shown below. First, the function opens the pdf file for reading in binary format and initializes the reading object. An empty list called pdf_text is initialized. Next, while looping through each page of the pdf file, the content of each page is extracted and appended to the list.

import PyPDF2

def extract_text_from_pdf(pdf_file: str) -> list[str]:
    # Open the pdf file for reading in binary format
    with open(pdf_file, 'rb') as pdf:
        # Initialize a PdfReader object
        reader = PyPDF2.PdfReader(pdf)
        # Start an empty list
        pdf_text = []
        # Loop through each page in the document
        for page in reader.pages:
            content = page.extract_text()
            pdf_text.append(content)
        return pdf_text

When the file is passed as an argument to the function above, it returns a list in which each element holds the text of one page. For the given file autobook.pdf, extract_text_from_pdf() returns a list of 4 elements, one per page, which I store as extracted_text.

The elements inside the extracted_text can also be joined as a single element using:

all_text = [''.join(extracted_text)]
len(all_text) #returns 1

all_text returns a list containing only one element for the entire text in all the pages of the pdf file.

c. Alternative way to read the entire text of the pdf file using the PyMuPDF package

Alternatively, I came across a package called PyMuPDF to read the entire text in the pdf as shown below:

# install using: pip install PyMuPDF
import fitz

file = "autobook.pdf"  # path to the pdf file
with fitz.open(file) as doc:
    text = ""
    for page in doc:
        # append the characters on each page
        text += page.get_text()
print("Pdf file is read as a collection of", len(text), "text elements.")
# prints 1786 for this file: one element per character

First, the pdf file is opened as a doc. text is initialized as an empty string. By looping through each page in the doc, the text on each page is appended to text. Because text is a single string, its length counts individual characters: the 1786 elements here include every letter, space, new line, and punctuation mark.

2. RegEx

RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. Python has a built-in module called re for this purpose.

From all the text in the given pdf file, I wanted to extract only the specific information. Below I describe the functions I used for this purpose, although there could be much wider use cases of RegEx.

a. findall

When the given pattern matches in the string/text, the findall function returns the list of all the matches.

In the code snippet below, x, y and z return all the matches for Name, Nationality, and Country in the text. There are four occurrences of Name, three occurrences of Nationality, and a single occurrence of Country in the text in the pdf.

Findall function is used to return the list of all matches. Image by Author.
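
A minimal sketch of this step using only the built-in re module. The sample_text string is a made-up stand-in for the text extracted from autobook.pdf (its values are placeholders, not the real file contents), arranged so that the counts match the ones described above.

```python
import re

# Hypothetical stand-in for the text extracted from the pdf.
# Hari's profile uses "Country" instead of "Nationality".
sample_text = (
    "Name: Ram\nNationality: Nepali\n"
    "Name: Shyam\nNationality: Nepali\n"
    "Name: Hari\nCountry: Nepal\n"
    "Name: Shiva\nNationality: Nepali\n"
)

# findall returns a list with one entry per match of the pattern
x = re.findall("Name", sample_text)
y = re.findall("Nationality", sample_text)
z = re.findall("Country", sample_text)

print(len(x), len(y), len(z))  # 4 3 1
```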

b. sub

The sub function is used to substitute/replace one or more matches with a string.

In the given text, the Nationality is referred to as Country in the case of a friend named Hari. To replace the Country with Nationality, first I compiled a regular expression pattern for Country. Next, I used the sub method to replace the pattern with the new word and created a new string called new_text. In new_text, I find four occurrences of Nationality unlike three in the previous case.

Sub function used to substitute/find and replace in the string. Image by Author.
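
The same idea can be sketched as follows. As before, sample_text is a made-up stand-in for the extracted pdf text, not the real file contents.

```python
import re

# Hypothetical stand-in for the text extracted from the pdf.
sample_text = (
    "Name: Ram\nNationality: Nepali\n"
    "Name: Shyam\nNationality: Nepali\n"
    "Name: Hari\nCountry: Nepal\n"
    "Name: Shiva\nNationality: Nepali\n"
)

# Compile a pattern for the word to be replaced
pattern = re.compile("Country")
# sub returns a new string; the original text is left unchanged
new_text = pattern.sub("Nationality", sample_text)

print(len(re.findall("Nationality", sample_text)))  # 3
print(len(re.findall("Nationality", new_text)))     # 4
```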

c. finditer

The finditer function returns an iterator over the matches of a pattern in a string, where each match is a Match object.

In the given text, the text between the Name and Nationality fields contains the actual names of the friends, and the text between the Nationality and Date of Birth fields contains the actual nationalities. I created the following function called find_between() to find the text between any two words present in consecutive order in the given text.

import re

def find_between(first_word, last_word, text):
    """Find the characters between first_word and last_word."""
    pattern = rf"(?P<start>{first_word})(?P<match>.+?)(?P<end>{last_word})"
    # Returns an iterator called matches based on the pattern in the string.
    # The re.DOTALL flag allows the '.' character to include new lines in matching
    matches = re.finditer(pattern, text, re.DOTALL)
    new_list = []
    for match in matches:
        new_list.append(match.group("match"))
    return new_list

The key element of the above function is the pattern. It is set up to capture the characters between first_word and last_word in the given text, and the lazy quantifier .+? stops at the first occurrence of last_word. The finditer function returns an iterator over all non-overlapping matches in the string. For each match, the iterator returns a Match object. An empty list called new_list is initialized. By looping through the matches, the captured text of each match is appended to new_list, which is returned by the function.

In this way, I was able to create the lists for each field such as names, nationalities, date of birth, preferences, and so on from the pdf file as shown in the code snippet below:

Using the find_between function to extract relevant profile information of each friend from the pdf. Image by Author.
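
A minimal sketch of how find_between could be applied. The sample_text string and its "Field: value" layout are assumptions made for illustration, and the strip call that removes the leftover colon, spaces, and new lines is an addition that may not be needed for the real file.

```python
import re


def find_between(first_word, last_word, text):
    """Return the text found between first_word and last_word."""
    pattern = rf"(?P<start>{first_word})(?P<match>.+?)(?P<end>{last_word})"
    return [m.group("match") for m in re.finditer(pattern, text, re.DOTALL)]


# Hypothetical stand-in for the extracted pdf text (placeholder values).
sample_text = (
    "Name: Ram\nNationality: Nepali\nDate of Birth: 2000-01-01\n"
    "Name: Shyam\nNationality: Nepali\nDate of Birth: 2000-02-02\n"
)

# Each raw match looks like ": Ram\n", so strip the separators around it
names = [s.strip(": \n") for s in find_between("Name", "Nationality", sample_text)]
nationalities = [
    s.strip(": \n")
    for s in find_between("Nationality", "Date of Birth", sample_text)
]

print(names)          # ['Ram', 'Shyam']
print(nationalities)  # ['Nepali', 'Nepali']
```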

Note:

The ‘.’ special character in a regular expression matches any character in the text/string except the new line. However, with the re.DOTALL flag, the ‘.’ character can match any character including the new line.
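
The effect of the flag can be seen in a short sketch; the three-line string below is made up for illustration.

```python
import re

# The value sits on its own line between the two field labels
text = "Name\nRam\nNationality"
pattern = r"Name(.+?)Nationality"

# Without re.DOTALL, '.' cannot cross the new lines, so nothing matches
without_flag = re.search(pattern, text)
# With re.DOTALL, '.' also matches '\n', so the value is captured
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group(1)))  # '\nRam\n'
```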

3. Exporting data to Excel

a. Pandas dataframe from lists

In the step above, I got the lists for each profile field for each friend. In this step, I convert these lists into a pandas dataframe:

import pandas as pd

df = pd.DataFrame()
df["Name"] = names
df["Nationality"] = nationalities
df["Date of Birth"] = dobs
df["Favorite Food"] = foods
df["Favorite Fruit"] = fruits
df["Favorite Sports"] = sports
df["Favorite Player"] = players
df["Favorite Movie"] = movies
df["Favorite Actor"] = actors
df["Favorite Quotes"] = quotes
df["Aim"] = aims
df["Views on Politics"] = politics
df["Messages"] = messages
df = df.T
df.columns = df.iloc[0]
df.drop(index = "Name", inplace = True)
df

The dataframe df looks as shown below:

Deriving pandas dataframe from the lists of each profile field for each friend. Image by Author.

b. Conditional formatting using pandas dataframe

A pandas dataframe supports conditional formatting similar to Excel. Suppose I want to highlight the cells containing the name of my favorite player Lionel Messi in df. This can be done using the df.style.applymap() function as shown below:

Applying background color in selected cells using df.style.applymap function.
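
A minimal sketch of such a styling step. The function names highlight_messi and export_highlighted are hypothetical, df is assumed to be the dataframe built in step 3a, and the output path is up to the caller. Styler.applymap was renamed to Styler.map in pandas 2.1, so the sketch uses whichever one is available. Styling requires the jinja2 package, and writing *.xlsx files requires openpyxl.

```python
def highlight_messi(value):
    """Return a CSS rule for cells that contain the favorite player."""
    return "background-color: yellow" if value == "Lionel Messi" else ""


def export_highlighted(df, path):
    """Apply the highlight to every cell of df and export it to Excel."""
    styler = df.style
    # Styler.map exists from pandas 2.1 onwards; older versions use applymap
    apply_elementwise = getattr(styler, "map", None) or styler.applymap
    styled = apply_elementwise(highlight_messi)
    styled.to_excel(path, engine="openpyxl")
    return styled


# Usage, assuming df from step 3a and a hypothetical file name:
# export_highlighted(df, "friends.xlsx")
```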

When the file is exported as *.xlsx format in line [28], the exported file also contains yellow highlight for the cell containing Lionel Messi.

4. Exporting from Python to word format

a. Creating word document using Python-docx

To export data from Python to Word format, I use a package called python-docx. The Document object created with this package allows adding the different elements of a Word document, such as headings and paragraphs.

In the code below, I add the heading Name for each friend at first followed by a paragraph containing the actual name of the friend. This is followed by the headings and the corresponding texts for each profile field. I add a page break at the end of the profile of each friend.

from docx import Document

document = Document()
for column in df.columns:
    document.add_heading("Name")
    p = document.add_paragraph()
    p.add_run(column)
    for index in df.index:
        document.add_heading(index)
        p = document.add_paragraph()
        p.add_run(df.loc[index, column])
    # add page break after profile of each friend
    document.add_page_break()

The code above helps to yield a word document of the following format after saving it:

Word document that is to be generated by the code above. Image by Author.

b. Highlight paragraph using Python-docx

The python-docx package helps to generate a Word document with many of the features available in the Microsoft Word application. For example, text can be added in different font styles, colors, and sizes, along with features such as bold, italic, and underline.
Let’s say I want to create a section called Favorites at the end of the document and highlight the text in the document. It can be done with the following code:

from docx.enum.text import WD_COLOR_INDEX
document.add_heading("Favorites")
p = document.add_paragraph()
p.add_run("This section consists of favorite items of each friend.").font.highlight_color=WD_COLOR_INDEX.YELLOW

c. Create a table using Python-docx

The python-docx package also allows the creation of tables in the Word document directly from Python. Suppose I want to add a table consisting of the favorite items of each friend in the Favorites section at the end of the document. A table can be created using document.add_table(rows = nrows, cols = ncols). Furthermore, the text needs to be defined for each cell of the table.
In the code below, I define a table object with 8 rows and 5 columns. Next, I define the table header and first column. By looping through the dataframe df, I define the text for each cell inside the table based on the favorite item of each friend.

table = document.add_table(rows = 8, cols = 5)
# Adding heading in the 1st row of the table
column1 = table.rows[0].cells
column1[0].text = 'Items'
# table header
for i in range(1, len(df.columns)+1):
    column1[i].text = df.columns[i-1]
# first column in the table for labels
for i in range(1,8):
    table.cell(i,0).text = df.index[i+1]
for i in range(2, 9):
    for j in range(1, 5):
        table.cell(i-1, j).text = df.iloc[i, j-1]
# define table style
table.style = "Light Grid Accent 1"

d. Save the document.

The document is saved as a *.docx format file using:

document.save("../output/python_to_word.docx")

The final page of the document comprising favorites section and the table looks as follows:

Favorites section and table created by following steps b and c above. File is saved as *.docx format using code in step d. Image by Author.

5. Converting word document to pdf format.

To convert a document from Word *.docx format to *.pdf format using Python, I came across a package called docx2pdf. The Word document can be converted to pdf format using the convert function of the package as convert(input_path, output_path).

Converting word document from *.docx to *.pdf format. Image by Author.
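
A minimal sketch of the conversion step. The helper names pdf_path_for and docx_to_pdf are hypothetical. docx2pdf drives an installed copy of Microsoft Word, so the import is kept inside the function that needs it and the sketch can still be loaded on machines without Word.

```python
from pathlib import Path


def pdf_path_for(docx_path):
    """Return the path of the pdf file that sits next to the docx file."""
    return Path(docx_path).with_suffix(".pdf")


def docx_to_pdf(docx_path):
    """Convert a Word document to pdf and return the output path."""
    # Imported here because docx2pdf needs Microsoft Word to be installed
    from docx2pdf import convert

    output_path = pdf_path_for(docx_path)
    convert(str(docx_path), str(output_path))
    return output_path


# Usage with the file saved in step 4d:
# docx_to_pdf("../output/python_to_word.docx")
```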

The output folder looks as follows for me:

Output folder comprising the Excel file, Word document, and pdf file, all exported using Python. Image by Author.

Conclusion

Scanning through a pdf file and extracting only the necessary information can be very time-consuming and stressful. There are different packages available in Python that help to automate this process, reduce the manual effort, and extract precise information more efficiently.

In this post, I use a dummy example of a pdf file containing common fields/sections/headings in the profile of four friends and extract the relevant information for each field for each friend. First, I used the PyPDF2 or PyMuPDF package to read the pdf file and print out the entire text. Second, I used Regular Expressions (RegEx) to detect patterns and find the matches for each pattern in the text to extract only relevant information. Third, I converted the lists of information for each profile field for each friend into a pandas dataframe and exported it to an Excel file. Next, I created a Word file using the python-docx package. And finally, I converted the Word file into pdf format again using the docx2pdf package.

The notebook and the input pdf file for this post are available in this GitHub repository. Thank you for reading!