A Recommendation System For Academic Research (And Other Data Types)! | by Benjamin McCloskey | Mar, 2023


Photo by Shubham Dhage on Unsplash

Many of the projects people develop today begin with a crucial first step: active research. Investigating what other people have done and building on their work is important for your project’s ability to add value. Not only should you learn from the conclusions others have reached, but you will also want to figure out what you shouldn’t do in your own project to ensure its success.

As I worked through my thesis, I started collecting many different types of research files. For example, I had collections of academic publications I read through as well as Excel sheets containing the results of different experiments. As I completed the research for my thesis, I wondered: Is there a way to create a recommendation system that can compare all the research in my archive and help guide me in my next project?

In fact, there is!

Note: Not only does this work for a repository of research collected from various search engines, it will also work for any directory containing various types of documents.

I developed this recommendation system with my team using Python 3.

There are several libraries (APIs) that support this recommendation system, and researching what each one can do may be beneficial for your own learning.

import string 
import csv
from io import StringIO
from pptx import Presentation
import docx2txt
import PyPDF2
import spacy
import pandas as pd
import numpy as np
import nltk
import re
import openpyxl
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import STOPWORDS as SW
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
import networkx as nx
from networkx.algorithms.shortest_paths import weighted
import glob

The Hurdle

One big hurdle I had to overcome was the recommendation system’s need to compare different types of files. For example, I wanted to see whether an Excel spreadsheet contains information similar or connected to the information within a PowerPoint presentation or an academic journal PDF. The trick to doing this was reading every file type into Python and transforming each object into a single string of words. This normalizes all the data and allows for the calculation of a similarity metric.

PDF Reading Class

The first class we will look at for this project is the pdfReader class, which formats a PDF so it is readable in Python. Of all the file formats, I would argue that PDFs are among the most important, since many of the journal articles downloaded from research repositories such as Google Scholar come in PDF format.

class pdfReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """A function which returns a one line string of the pdf.

        Returns:
            content (str): A one line string of the pdf.
        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfReader(p)
        num_pages = len(pdf.pages)
        for i in range(0, num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        # Collapse whitespace and strip "x of y" page-number artifacts
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)

        return content

    def pdf_reader(self):
        """A function which opens a .pdf file and returns a
        Python-readable pdf object.

        Returns:
            read_pdf: A PyPDF2.PdfReader object for the file.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)

        return read_pdf

    def pdf_info(self) -> dict:
        """A function which returns an information dictionary of a pdf.

        Returns:
            pdf_info_dict (dict): A dictionary containing the metadata
            of the object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        pdf_info_dict = {}
        # .metadata replaces the deprecated .documentInfo attribute
        for key, value in read_pdf.metadata.items():
            pdf_info_dict[re.sub('/', "", key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """A function which returns a dictionary of the object where
        the keys are the pages and the text within the pages are the
        values.

        Returns:
            pdf_dict (dict): A dictionary of pages and text.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        pdf_dict = {}
        # len(read_pdf.pages) gives the page count; pages[i] replaces
        # the deprecated getPage(i)
        for i in range(len(read_pdf.pages)):
            text = read_pdf.pages[i].extract_text()
            pdf_dict[i] = text
        return pdf_dict
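
As a quick usage sketch (the file name here is a hypothetical placeholder, not a file from the original post):

# 'paper.pdf' stands in for any PDF in your database
reader = pdfReader('paper.pdf')
full_text = reader.PDF_one_pager()   # the entire document as one string
metadata = reader.pdf_info()         # e.g. author, title, creation date
pages = reader.pdf_dictionary()      # {page_number: page_text}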

Microsoft Excel Reader

Sometimes researchers will include Excel sheets of their results with their publications. Being able to read the column names, and even the values, can help the system recommend results similar to what you are searching for. For example, what if you were researching the past performance of a certain stock? Maybe you search for its name and ticker symbol, which are annotated in a historical-performance Excel sheet. This recommendation system would surface that Excel sheet to help with your research.


class xlsxReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def xlsx_text(self) -> str:
        """A function which returns the text of an
        Excel document as a string.

        Returns:
            text(str): String of text of a document.
        """
        inputExcelFile = self.file_path
        text = str()
        wb = openpyxl.load_workbook(inputExcelFile)
        # Save each sheet of the workbook as a CSV file, then read it back
        for sn in wb.sheetnames:
            excelFile = pd.read_excel(inputExcelFile, engine='openpyxl', sheet_name=sn)
            excelFile.to_csv("ResultCsvFile.csv", index=None, header=True)

            with open("ResultCsvFile.csv", "r") as csvFile:
                lines = csvFile.read().split(",")  # "\r\n" if needed
                for val in lines:
                    if val != '':
                        text += val + ' '
                text = text.replace('\ufeff', '')
                text = text.replace('\n', ' ')
        return text

CSV File Reader

The csvReader class allows CSV files to be included in your database and used in the system’s recommendations.


class csvReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def csv_text(self) -> str:
        """A function which returns the text of a
        csv document as a string.

        Returns:
            text(str): String of text of a document.
        """
        text = str()
        with open(self.file_path, "r") as csvFile:
            lines = csvFile.read().split(",")  # "\r\n" if needed
            for val in lines:
                text += val + ' '
            text = text.replace('\ufeff', '')
            text = text.replace('\n', ' ')
        return text

Microsoft PowerPoint Reader

Here’s a helpful class. Not many people think about the valuable information stored within the bodies of PowerPoint presentations. These presentations are by and large created to present key ideas and information to an audience. The following class will help relate any PowerPoints in your database to other bodies of information, in hopes of steering you toward connected pieces of work.

class pptReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def ppt_text(self) -> str:
        """A function which returns the text of a
        Microsoft PowerPoint document as a string.

        Returns:
            text(str): String of text of a document.
        """
        prs = Presentation(self.file_path)
        text = str()
        # Walk every slide, shape, paragraph, and run to collect the text
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text += ' ' + run.text

        return text

Microsoft Word Document Reader

The final class for this system is a Microsoft Word document reader. Word documents are another valuable source of information, since many people write reports recording their findings and ideas in Word format. The class utilizes the docx2txt package and returns a string of the text within a given Word document.

class wordDocReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def word_reader(self) -> str:
        """A function which returns the text of a
        Microsoft Word document as a string.

        Returns:
            text(str): String of text of a document.
        """
        text = docx2txt.process(self.file_path)
        # Collapse newlines, non-breaking spaces, and tabs into spaces
        text = text.replace('\n', ' ')
        text = text.replace('\xa0', ' ')
        text = text.replace('\t', ' ')
        return text

That’s a wrap for the classes used in today’s project. Please note: there are tons of other file types you can use to enhance your recommendation system. A version of the code currently in development will accept images and try to relate them to other documents within a database!

Preprocessing

Let’s look at how to preprocess this data. This recommendation system was built for a repository of academic research, so it was important to break the text down using preprocessing steps guided by Natural Language Processing (NLP).

The data processing class is simply called dataprocessor, and the first function within the class is a part-of-speech tagger.

class dataprocessor:
    def __init__(self):
        return

    @staticmethod
    def get_wordnet_pos(text: str) -> str:
        """Map a POS tag to the first character lemmatize() accepts.

        Inputs:
            text(str): A string of text

        Returns:
            The WordNet part-of-speech constant for the word
            (defaults to wordnet.NOUN)
        """
        tag = nltk.pos_tag([text])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}

        return tag_dict.get(tag, wordnet.NOUN)

This function tags the parts of speech in a word and will come in handy later in the project.
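
For instance, a quick check of the tagger (the example words here are my own, not from the original post):

print(dataprocessor.get_wordnet_pos('running'))   # 'v' (wordnet.VERB)
print(dataprocessor.get_wordnet_pos('quickly'))   # 'r' (wordnet.ADV)
print(dataprocessor.get_wordnet_pos('research'))  # 'n' (wordnet.NOUN)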

Second, there is a function that conducts the normal NLP steps many of us have seen before, shown in the code below. These steps are:

  1. Lowercase each word.
  2. Remove the punctuation.
  3. Remove digits (I only wanted to look at non-numeric information; this step could be taken out if desired).
  4. Stopword removal.
  5. Lemmatization. This is where the get_wordnet_pos() function comes in handy for including parts of speech!

    @staticmethod
    def preprocess(text: str) -> str:
        """A function that preprocesses text through the
        steps of Natural Language Processing (NLP).

        Inputs:
            text(str): A string of text

        Returns:
            text(str): A processed string of text
        """
        # lowercase
        text = text.lower()

        # punctuation removal
        text = "".join([i for i in text if i not in string.punctuation])

        # digit removal (only for all-numeric tokens)
        text = [x for x in text.split(' ') if not x.isnumeric()]

        # stopword removal
        stopwords = nltk.corpus.stopwords.words('english')
        custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
        stopwords.extend(custom_stopwords)

        text = [i for i in text if i not in stopwords]
        text = ' '.join(word for word in text)

        # lemmatization
        lm = WordNetLemmatizer()
        text = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word)) for word in text.split(' ')]
        text = ' '.join(word for word in text)

        text = re.sub(' +', ' ', text)

        return text

Next, there is a function to read all of the files into the system.

    @staticmethod
    def data_reader(list_file_names):
        """A function that reads in the data from a directory of files.

        Inputs:
            list_file_names(list): List of the filepaths in a directory.

        Returns:
            text_list (list): A list where each value is a string of text
                for each file in the directory
            file_dict(dict): Dictionary where the keys are indices and the
                values are tuples of (filepath, filename) for each file
                that was successfully read
        """
        text_list = []
        file_dict = dict()
        reader = dataprocessor()
        for file in list_file_names:
            # Use the file extension to choose the matching reader class
            temp = file.split('.')
            filetype = temp[-1]
            if filetype == "pdf":
                file_pdf = pdfReader(file)
                text = file_pdf.PDF_one_pager()

            elif filetype == "docx":
                word_doc_reader = wordDocReader(file)
                text = word_doc_reader.word_reader()

            elif filetype == "pptx" or filetype == 'ppt':
                ppt_reader = pptReader(file)
                text = ppt_reader.ppt_text()

            elif filetype == "csv":
                csv_reader = csvReader(file)
                text = csv_reader.csv_text()

            elif filetype == 'xlsx':
                xl_reader = xlsxReader(file)
                text = xl_reader.xlsx_text()
            else:
                print('File type {} not supported!'.format(filetype))
                continue

            text = reader.preprocess(text)
            text_list.append(text)
            # Record only files that were actually read so the indices of
            # text_list and file_dict stay aligned
            file_dict[len(text_list) - 1] = (file, file.split('/')[-1])
        return text_list, file_dict

As this is the first version of this system, I want to foot stomp that the code can be adapted to include many other file types!

The next function is database_processor(), which is used to process all of the files within your given database. The input is the file dictionary along with each file’s associated string of text (already preprocessed). The strings of text are then vectorized using sklearn’s TfidfVectorizer. What is that exactly? Basically, it transforms all the text into feature vectors based on the frequency of each given word, so we can measure how closely related documents are using similarity formulas from vector arithmetic.
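
To see what TfidfVectorizer actually produces, here is a tiny self-contained illustration (the toy sentences are my own, not files from the database):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["generative adversarial networks generate images",
        "machine learning models learn from data"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix: one row per document
print(vec.get_feature_names_out())   # vocabulary learned from the corpus
print(X.shape)                       # (2, number_of_unique_terms)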

    @staticmethod
    def database_processor(file_dict, text_list: list):
        """A function that transforms the text of each file within the
        database into a vector.

        Inputs:
            file_dict(dict): Dictionary where the keys are indices and the
                values are tuples of (filepath, filename)
            text_list (list): A list where each value is a string of the text
                for each file in the directory

        Returns:
            list_dense(list): A list of the files' text turned into vectors.
            vectorizer: The vectorizer used to transform the strings of text
            file_vector_dict(dict): A dictionary where the file names are the
                keys and the vectors of each file's text are the values.
        """
        file_vector_dict = dict()
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(text_list)
        feature_names = vectorizer.get_feature_names_out()  # vocabulary, kept for inspection
        matrix = vectors.todense()
        list_dense = matrix.tolist()
        for i in range(len(list_dense)):
            file_vector_dict[file_dict[i][1]] = list_dense[i]

        return list_dense, vectorizer, file_vector_dict

The reason the vectorizer is fit on the database is that when a user gives a list of terms to search for, those words are vectorized based on their frequency within said database. This is also the biggest weakness of the current system: as the database grows, the time and computation needed for calculating similarities will increase and slow the system down. One recommendation given during a quality control meeting was to use reinforcement learning for recommending different articles of data.

Next, we can use an input processor that turns any words provided into a vector. This is analogous to typing a request into a search engine.

    @staticmethod
    def input_processor(text, TDIF_vectorizor):
        """A function which accepts a string of text and vectorizes it
        using a pretrained TF-IDF vectorizer.

        Inputs:
            text(str): A string of text
            TDIF_vectorizor: A pretrained TF-IDF vectorizer

        Returns:
            words(list): The input text in vectorized form.
        """
        # Repeat earlier query words more often so they carry more weight
        words = ''
        total_words = len(text.split(' '))
        for word in text.split(' '):
            words += (word + ' ') * total_words
            total_words -= 1

        words = [words[:-1]]
        words = TDIF_vectorizor.transform(words)
        words = words.todense()
        words = words.tolist()
        return words

Since all of the information within and given to the database will be vectors, we can use cosine similarity to compute the angle between the vectors. The closer the score is to 1 (an angle of 0), the more similar the two vectors are.
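
As a quick illustration with toy vectors (my own, not vectors from the database):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 0.0, 1.0]])
b = np.array([[2.0, 0.0, 2.0]])  # same direction as a -> similarity 1.0
c = np.array([[0.0, 1.0, 0.0]])  # orthogonal to a -> similarity 0.0
print(cosine_similarity(a, b))   # [[1.]]
print(cosine_similarity(a, c))   # [[0.]]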

    @staticmethod
    def similarity_checker(vector_1, vector_2):
        """A function which accepts two vectors and computes their
        cosine similarity.

        Inputs:
            vector_1: A numerical vector
            vector_2: A numerical vector

        Returns:
            The cosine similarity score as a 2D array
        """
        # sklearn expects 2D inputs, so promote any 1D vector first
        vector_1 = np.asarray(vector_1)
        vector_2 = np.asarray(vector_2)
        if np.ndim(vector_1) == 1:
            vector_1 = np.expand_dims(vector_1, axis=0)
        if np.ndim(vector_2) == 1:
            vector_2 = np.expand_dims(vector_2, axis=0)
        return cosine_similarity(vector_1, vector_2)

Once we can find the similarity score between two vectors, rankings can be created between the words being searched and the documents located within the database.

    @staticmethod
    def recommender(vector_file_list, query_vector, file_dict):
        """A function which accepts a list of file vectors, a query vector,
        and a dictionary pairing the list of vectors with their original
        values and file names.

        Inputs:
            vector_file_list(list): A list of vectors
            query_vector: A numerical vector for the query
            file_dict(dict): A dictionary of filenames and text relating to
                the list of vectors

        Returns:
            final_recommendation (list): A list of the final recommended files
            similarity_list[:len(final_recommendation)] (list): A list of the
                similarity scores of the final recommendations.
        """
        similarity_list = []
        score_dict = dict()
        for i, file_vector in enumerate(vector_file_list):
            x = dataprocessor.similarity_checker(file_vector, query_vector)
            score_dict[file_dict[i][1]] = x[0][0]
            similarity_list.append(x)
        similarity_list = sorted(similarity_list, reverse=True)
        # Recommend the top 50% of files by similarity score
        recommended = sorted(score_dict.items(),
                             key=lambda x: -x[1])[:int(np.round(.5 * len(similarity_list)))]

        final_recommendation = []
        for i in range(len(recommended)):
            final_recommendation.append(recommended[i][0])
        # A graph-based re-ranking below handles more than 3 recommendations
        return final_recommendation, similarity_list[:len(final_recommendation)]

The vector file list is the list of vectors we created from the files earlier. The query vector is a vector of the words being searched. The file dictionary, created earlier, uses file names for the keys and the files’ text as values. Similarities are computed, and then a ranking is created that favors the pieces of information most similar to the queried words. But what if there are more than 3 recommendations? Incorporating elements of networks and graph theory adds an extra level of computational benefit to this system and creates more confident recommendations.

Page Rank Theory

Let’s take a quick detour and go over the theory of PageRank. Don’t get me wrong, cosine similarity is a powerful computation for measuring the similarity between vectors, but incorporating PageRank-style ranking into your recommendation algorithm allows for similarity comparisons across multiple vectors (the data within your database).

PageRank was first designed by Larry Page to rank websites by measuring their importance [1]. The basic idea is that a website can be deemed “more important” if more websites link to it. Drawing from this idea, a node in a graph can be ranked as more important the shorter its edge distances to other nodes are. The shorter the collective distance a node has to the other nodes in a graph, the more important said node is.

Today we will use one variation of PageRank called eigenvector centrality. Eigenvector centrality is like PageRank in that it measures the connections between the nodes of a graph, assigning higher scores for stronger connections. The biggest difference? Eigenvector centrality also accounts for the importance of the nodes connected to a given node when estimating how important that node is. This is akin to saying that a person who knows lots of important people may be very important themselves through those strong relationships. All in all, the two algorithms are implemented in very similar ways.
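
Here is a minimal sketch of eigenvector centrality in networkx (the toy graph and weights are illustrative, not drawn from the database):

import networkx as nx

G = nx.Graph()
G.add_edge('a', 'b', weight=0.9)  # strong tie
G.add_edge('a', 'c', weight=0.8)  # strong tie
G.add_edge('b', 'c', weight=0.5)
G.add_edge('c', 'd', weight=0.1)  # weak tie
scores = nx.eigenvector_centrality(G, weight='weight')
# 'a' scores highest: it has the strongest ties to the other central nodes
print(sorted(scores.items(), key=lambda x: -x[1]))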

For this database, after the vectors are computed, they can be placed into a graph where their edge distance is determined by their similarity to other vectors.

    @staticmethod
    def ranker(recommendation_val, file_vec_dict):
        """A function which accepts a list of recommended files and a
        dictionary of the files within the database and their vectors.

        Inputs:
            recommendation_val(list): A list of recommendations found
                through cosine similarity
            file_vec_dict(dict): A dictionary of the filenames as keys and
                their text in vectors as the values.

        Returns:
            ec_recommended(list): The recommendations ranked by the
                eigenvector centrality algorithm.
        """
        my_graph = nx.Graph()
        for i in range(len(recommendation_val)):
            file_1 = recommendation_val[i]
            for j in range(len(recommendation_val)):
                file_2 = recommendation_val[j]

                if i != j:
                    # Calculate the similarity score between the two files (the edge weight)
                    edge_dist = cosine_similarity([file_vec_dict[recommendation_val[i]]],
                                                  [file_vec_dict[recommendation_val[j]]])
                    # Add an edge between file 1 and file 2 carrying that weight
                    my_graph.add_edge(file_1, file_2, weight=edge_dist)

        # Rank the graph's nodes with eigenvector centrality
        rec = nx.eigenvector_centrality(my_graph)
        # Keep every ranked file, sorted from most to least central
        ec_recommended = sorted(rec.items(), key=lambda x: -x[1])[:int(np.round(len(rec)))]

        return ec_recommended

Okay, now what? We have the recommendations created by using the cosine similarity between each data point in the database, and recommendations computed by the eigenvector centrality algorithm. Which recommendations should we output? Both!

    @staticmethod
    def weighted_final_rank(sim_list, ec_recommended, final_recommendation):
        """A function which accepts the list of similarity values found
        through cosine similarity, the recommendations found through
        eigenvector centrality, and the final recommendations produced
        by cosine similarity.

        Inputs:
            sim_list(list): A list of all of the similarity values for the
                files within the database.
            ec_recommended(list): A list of the recommendations found using
                the eigenvector centrality algorithm.
            final_recommendation (list): A list of the final recommendations
                found by using cosine similarity.

        Returns:
            weighted_final_recommend(list): A list of the final
                recommendations for the files in the database.
        """
        final_dict = dict()

        # 80% of the weight goes to the cosine similarity score and 20%
        # to the eigenvector centrality score
        for i in range(len(sim_list)):
            val = (.8 * sim_list[final_recommendation.index(ec_recommended[i][0])].squeeze()) \
                  + (.2 * ec_recommended[i][1])
            final_dict[ec_recommended[i][0]] = val

        weighted_final_recommend = sorted(final_dict.items(),
                                          key=lambda x: -x[1])[:int(np.round(len(final_dict)))]

        return weighted_final_recommend

The final function of this script weighs the different recommendations produced by cosine similarity and eigenvector centrality. Currently, 80% of the weight is given to the cosine similarity recommendations and 20% to the eigenvector centrality recommendations. The final recommendations are computed from these weights and aggregated to produce recommendations that are representative of all the similarity computations in the system. The weights can easily be changed by the developer to reflect which batch of recommendations they feel is more important.
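
If you want those weights to be adjustable, one small refactor (my own sketch, not part of the original repository) is to expose them as a parameter:

# Hypothetical variant of weighted_final_rank with an adjustable weight
def weighted_final_rank_adjustable(sim_list, ec_recommended, final_recommendation,
                                   cos_weight=0.8):
    ec_weight = 1.0 - cos_weight  # the two weights always sum to 1
    final_dict = dict()
    for i in range(len(sim_list)):
        cos_score = sim_list[final_recommendation.index(ec_recommended[i][0])].squeeze()
        final_dict[ec_recommended[i][0]] = (cos_weight * cos_score
                                            + ec_weight * ec_recommended[i][1])
    return sorted(final_dict.items(), key=lambda x: -x[1])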

Let’s do a quick example with this code. The documents within my database are all in the formats previously discussed and pertain to different areas of machine learning. More documents in the database relate to Generative Adversarial Networks (GANs), so I would expect those to be recommended first when “Generative Adversarial Network” is the query term.

path = '/content/drive/MyDrive/database/'
db = [f for f in glob.glob(path + '*')]

research_documents, file_dictionary = dataprocessor.data_reader(db)
list_files, vectorizer, file_vec_dict = dataprocessor.database_processor(file_dictionary, research_documents)
query = 'Generative Adversarial Networks'
query = dataprocessor.preprocess(query)
query = dataprocessor.input_processor(query, vectorizer)
recommendation, sim_list = dataprocessor.recommender(list_files, query, file_dictionary)
ec_recommendation = dataprocessor.ranker(recommendation, file_vec_dict)
final_weighted_recommended = dataprocessor.weighted_final_rank(sim_list, ec_recommendation, recommendation)
print(final_weighted_recommended)

Running this block of code produces the following recommendations, along with the weight value for each recommendation.

[(‘GAN_presentation.pptx’, 0.3411272882084124), (‘Using GANs to Augment UAV Data_V2.docx’, 0.16293615818015078), (‘GANS_DAY_1.docx’, 0.12546058188955278), (‘ml_pdf.pdf’, 0.10864164490536887)]

Let’s try one more. What if I query “Machine Learning”?

[(‘ml_pdf.pdf’, 0.31244922151487337), (‘GAN_presentation.pptx’, 0.18170070184645432), (‘GANS_DAY_1.docx’, 0.14825501243059303), (‘Using GANs to Augment UAV Data_V2.docx’, 0.1309153863914564)]

Aha! As expected, the first document recommended is an introductory brief on machine learning! I only used 7 documents for this example, and the more documents you add, the more recommendations you will receive!

Today we looked at how you can create a recommendation system for the files you collect (especially if you are collecting research for a project). The main feature of this system is that it goes one step beyond computing the cosine similarity of vectors by adopting the eigenvector centrality algorithm for more concise and better recommendations. Try it out today, and I hope it helps you get a better understanding of how related the pieces of data you possess are.

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here (I receive a small commission when you do this)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Sources

  1. https://www.geeksforgeeks.org/page-rank-algorithm-implementation/
  2. Full Code: https://github.com/benmccloskey/Research_recommend_model/blob/main/app.py

