PDF Parsing Dashboard with Plotly Dash

By Benjamin McCloskey



An introduction to how to read and display PDF files in your next dashboard.

PDF Parser (Image from Author)

I have recently taken an interest in using PDF files for my Natural Language Processing (NLP) projects, and you may be wondering why. PDF documents contain tons of information that can be extracted and used to build various types of machine learning models, as well as to find patterns in data. The problem? PDF files are tricky to work with in Python. Furthermore, as I began creating a dashboard for a customer with Plotly Dash, I could find little to no information on how to ingest and parse PDF files in a Plotly dashboard. That changes today: I am going to share how you can upload and work with PDF files in a Plotly dashboard!

import pandas as pd 
from dash import dcc, Dash, html, dash_table
import base64
import datetime
import io
import PyPDF2
from dash.dependencies import Input, Output, State
import re
import dash_bootstrap_components as dbc
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

Most of the packages above are what you will typically find when deploying a Dash app. For example, dash is the main Plotly API we will use, and dcc, Dash, html, and dash_table are some of the main modules and classes we need for adding functionality. When reading PDFs in Python, I tend to use PyPDF2, but there are other libraries you can explore, and you should always use what works best for your project.
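Before wiring anything into Dash, it can help to confirm that PyPDF2 can actually open and extract text from your document. Here is a minimal sanity-check sketch; the file path is just a placeholder, and note that newer releases of PyPDF2 (and its successor, pypdf) expose a PdfReader class with a pages list, while this article uses the older PdfFileReader interface.

import PyPDF2

# Quick sanity check that PyPDF2 can open and extract text from a PDF
# (the path below is just a placeholder).
with open('files/example.pdf', 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    print(f"Pages: {reader.numPages}")
    print(reader.getPage(0).extractText()[:200])  # first 200 characters of page 1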

The supporting functions add interactivity to the Dash app. I created two classes for this app that allow a PDF to be parsed. The first is the pdfReader class. The functions in this class are all related to reading a PDF into Python and transforming its contents into a usable form. Additionally, some of the functions can extract the metadata stored within a PDF (i.e., creation date, author, etc.).

class pdfReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """A function that accepts a file path to a pdf
        as input and returns a one-line string of the pdf.

        Parameters:
        file_path (str): The file path to the pdf.

        Returns:
        one_page_pdf (str): A one-line string of the pdf.
        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfFileReader(p)
        num_pages = pdf.numPages
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)

        return content

    def pdf_reader(self) -> PyPDF2.PdfFileReader:
        """A function that opens a .pdf file and
        returns a Python-readable pdf object.

        Parameters:
        self (obj): An object of the class pdfReader

        Returns:
        read_pdf: A Python-readable .pdf object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)

        return read_pdf

    def pdf_info(self) -> dict:
        """A function which returns an information dictionary
        of an object.

        Parameters:
        self (obj): An object of the pdfReader class.

        Returns:
        pdf_info_dict (dict): A dictionary containing the
        metadata of the object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)
        pdf_info_dict = {}
        for key, value in read_pdf.documentInfo.items():
            pdf_info_dict[re.sub('/', "", key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """A function which returns a dictionary of
        the object where the keys are the pages
        and the text within the pages are the values.

        Parameters:
        self (obj): An object of the pdfReader class.

        Returns:
        pdf_dict (dict): A dictionary mapping page numbers to page text.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)
        length = read_pdf.numPages
        pdf_dict = {}
        for i in range(length):
            page = read_pdf.getPage(i)
            text = page.extract_text()
            pdf_dict[i] = text
        return pdf_dict

    def get_publish_date(self) -> str:
        """A function that accepts an information dictionary of an object
        in the pdfReader class and returns the creation date of the
        object (if applicable).

        Parameters:
        self (obj): An object of the pdfReader class

        Returns:
        pub_date (str): The publication date, which is assumed to be the
        creation date (if applicable).
        """
        info_dict_pdf = self.pdf_info()
        pub_date = 'None'
        try:
            publication_date = info_dict_pdf['CreationDate']
            publication_date = datetime.datetime.strptime(publication_date.replace("'", ""), "D:%Y%m%d%H%M%S%z")
            pub_date = publication_date.isoformat()[0:10]
        except Exception:
            pass
        return str(pub_date)
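To make the intended usage concrete, here is a minimal sketch of the class in action; the file name below is a placeholder for a PDF sitting in your own files folder.

# Minimal usage sketch for pdfReader (the file name is a placeholder).
reader = pdfReader('/Users/benmccloskey/Desktop/pdf_dashboard/files/example.pdf')

text = reader.PDF_one_pager()        # whole document as a single string
metadata = reader.pdf_info()         # e.g. Author, CreationDate, Producer
created = reader.get_publish_date()  # 'YYYY-MM-DD', or 'None' if unavailable

print(created)
print(text[:300])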

The second class is pdfParser, which performs the main parsing operations we wish to run on the PDF in Python. Its functions include:

  1. get_emails() – a function that finds all of the email addresses within a string of text.
  2. get_dates() – a function that locates dates within a string of text. For this dashboard, we are going to find the date on which the PDF was downloaded.
  3. get_summary() – a function that creates a summary of a body of text based on the importance of words and the percentage of the original text the user wants kept in the summary.
  4. get_urls_domains() – a function that finds all of the URLs and domain names within a string of text.
class pdfParser:
    def __init__(self):
        return

    @staticmethod
    def get_emails(text: str) -> str:
        """A function that accepts a string of text and
        returns any email addresses located within the text.

        Parameters:
        text (str): A string of text

        Returns:
        emails (str): The set of emails located within
        the string of text.
        """
        email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
        email_set = set()
        email_set.update(email_pattern.findall(text))

        return str(email_set)

    @staticmethod
    def get_dates(text: str, info_dict_pdf: dict) -> str:
        # Use spaCy's named entity recognizer to pull out DATE entities,
        # filtering anything that looks like an IP address or version number.
        date_label = ['DATE']
        nlp = spacy.load('en_core_web_lg')
        doc = nlp(text)

        dates_pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
        dates = set(ent.text for ent in doc.ents if ent.label_ in date_label)
        filtered_dates = set(date for date in dates if not dates_pattern.match(date))

        return str(filtered_dates)

    @staticmethod
    def get_summary(text: str, per: float) -> str:
        # Frequency-based extractive summary: keep the highest-scoring
        # sentences until the summary reaches the fraction `per` of the
        # original sentence count.
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)
        word_frequencies = {}
        for word in doc:
            if word.text.lower() not in list(STOP_WORDS):
                if word.text.lower() not in punctuation:
                    # Store lowercase tokens so the lookup below matches.
                    if word.text.lower() not in word_frequencies.keys():
                        word_frequencies[word.text.lower()] = 1
                    else:
                        word_frequencies[word.text.lower()] += 1
        max_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = word_frequencies[word] / max_frequency
        sentence_tokens = [sent for sent in doc.sents]
        sentence_scores = {}
        for sent in sentence_tokens:
            for word in sent:
                if word.text.lower() in word_frequencies.keys():
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]
        select_length = int(len(sentence_tokens) * per)
        summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
        final_summary = [sent.text for sent in summary]
        summary = ' '.join(final_summary)
        return summary

    @staticmethod
    def get_urls_domains(text: str) -> tuple:
        """A function that accepts a string of text and
        returns any urls and domain names located within the text.

        Parameters:
        text (str): A string of text.

        Returns:
        url_set (str): The set of urls located within the text.
        domain_set (str): The set of domain names located within the text.
        """
        url_end = 'com,gov,edu,org,mil,net,au,in,ca,br,it,mx,ai,fr,tw,il,uk,int,arpa,co,us,info,xyz,ly,site,biz,bz'
        url_endings = [end for end in url_end.split(',')]
        url_regex = r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.(?:' + '|'.join(url_endings) + r')[^\s]+'
        url_reg_pattern = re.compile(url_regex, re.IGNORECASE)
        url_reg_list = url_reg_pattern.findall(text)

        url_set = set()
        url_set.update(url_reg_list)

        domain_set = set()
        domain_regex = r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)'
        domain_pattern = re.compile(domain_regex, re.IGNORECASE)
        for url in url_set:
            domain_set.update(domain_pattern.findall(url))

        return str(url_set), str(domain_set)
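Because these are all static methods, they can be called directly on any string. Here is a small sketch on a made-up sample string (the regex-based methods run without spaCy models, so they are quick to try):

# Small sketch of the pdfParser static methods on a made-up string.
sample = ("Contact the authors at [email protected] for details. "
          "The dataset description lives at https://www.example.com/data today.")

print(pdfParser.get_emails(sample))        # the set of email addresses found
print(pdfParser.get_urls_domains(sample))  # (set of urls, set of domain names)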

Throughout today's example, we will parse the paper ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton [1].

Setup

You will need to create a directory that holds the PDF files you wish to analyze in your dashboard. You will then want to initialize your Dash app.

directory = '/Users/benmccloskey/Desktop/pdf_dashboard/files'

app = Dash(__name__, external_stylesheets=[dbc.themes.CYBORG],suppress_callback_exceptions=True)

I made a folder called “files” and simply placed the PDF I wanted to upload in that folder.

PDF Parser Dashboard (Image from Author)

As shown above, the dashboard is very basic when it is first initialized, but this is expected! No information has been provided to the app yet, and I did not want the app to be too busy at the beginning.

The Skeleton

The next step is to create the skeleton of the Dash app. This is the scaffolding on which we can place different pieces to ultimately create the final architecture. First, we will add a title and style it the way we want. You can get the numbers associated with different colors here!

app.layout = html.Div(children=[html.Div(children=[html.H1(children='PDF Parser',
                                                            style={'textAlign': 'center',
                                                                   'color': '#7FDBFF'})])

Second, we can use dcc.Upload, which allows us to actually upload data into our Dash app.

This is where I met my first challenge.

The example offered by Plotly Dash did not show how to read in PDF files. I configured the following code block and appended it to the ingestion code provided by Plotly.

if 'csv' in filename:
    # Assume that the user uploaded a CSV file
    df = pd.read_csv(
        io.StringIO(decoded.decode('utf-8')))
elif 'xls' in filename:
    # Assume that the user uploaded an Excel file
    df = pd.read_excel(io.BytesIO(decoded))
elif 'pdf' in filename:
    pdf = pdfReader(directory + '/' + filename)
    text = pdf.PDF_one_pager()
    emails = pdfParser.get_emails(text)
    ddate = pdf.get_publish_date()
    summary = pdfParser.get_summary(text, 0.1)
    df = pd.DataFrame({'Text': [text], 'Summary': [summary],
                       'Download Date': [ddate], 'Emails': [emails]})

The main change needed was to have the function check whether a file name ends in “pdf”. Then, I was able to pull the PDF from the directory I originally initialized. Please note: with the current state of the code, any PDF file uploaded to the dashboard that is not in the directory folder cannot be read and parsed by the dashboard. This is a future update! If you have any functions that work with text, this is where you can add them.
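If you would rather not depend on that local folder, one possible workaround (a sketch, not the approach used in the rest of this article) is to read the uploaded bytes directly, since PyPDF2's reader accepts any file-like object:

import io
import PyPDF2

def pdf_text_from_upload(decoded: bytes) -> str:
    # Sketch: extract text straight from the decoded upload bytes,
    # so the PDF does not need to live in the local files directory.
    read_pdf = PyPDF2.PdfFileReader(io.BytesIO(decoded))
    pages = [read_pdf.getPage(i).extractText() for i in range(read_pdf.numPages)]
    return " ".join(" ".join(pages).split())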

As previously stated, the package I used for reading the PDF files was PyPDF2, but there are many others out there that one can use. Instantiating the classes at the beginning of the dashboard code greatly reduced the clutter within the skeleton. Another trick I found helpful was to parse the PDF right when it is uploaded and store the resulting data frame for later use. For this example, we are going to display the text, summary, download date, and email addresses associated with a PDF file.

Upload Function (Image from Author)

Once you click the upload button, a popup window of your file directory will open and you can select the PDF file you wish to parse (do not forget that for this setup, the PDF must be in your dashboard’s working directory!).

Once the file is uploaded, we can finish the function by returning the file name with the DateTime and creating a Plotly Dash checklist of the different features within the PDF we want our dashboard to parse out and display. Shown below is the full body of the dashboard before the callbacks are instantiated.

app.layout = html.Div(children=[
    html.Div(children=[html.H1(children='PDF Parser',
                               style={'textAlign': 'center',
                                      'color': '#7FDBFF'})]),

    html.Div([
        dcc.Upload(
            id='upload-data',
            children=html.Div([
                'Drag and Drop or ',
                html.A('Select Files')
            ]),
            style={
                'width': '100%',
                'height': '60px',
                'lineHeight': '60px',
                'borderWidth': '1px',
                'borderStyle': 'dashed',
                'borderRadius': '5px',
                'textAlign': 'center',
                'margin': '10px'
            },
            # Allow multiple files to be uploaded
            multiple=True
        ),
        # Returns info above the datatable
        html.Div(id='output-datatable'),
        # Output for the datatable
        html.Div(id='output-data-upload')
    ]),

])


def parse_contents(contents, filename, date):
    content_type, content_string = contents.split(',')

    decoded = base64.b64decode(content_string)
    try:
        if 'csv' in filename:
            # Assume that the user uploaded a CSV file
            df = pd.read_csv(
                io.StringIO(decoded.decode('utf-8')))
        elif 'xls' in filename:
            # Assume that the user uploaded an Excel file
            df = pd.read_excel(io.BytesIO(decoded))
        elif 'pdf' in filename:
            pdf = pdfReader(directory + '/' + filename)
            text = pdf.PDF_one_pager()
            emails = pdfParser.get_emails(text)
            ddate = pdf.get_publish_date()
            summary = pdfParser.get_summary(text, 0.1)
            urls, domains = pdfParser.get_urls_domains(text)
            df = pd.DataFrame({'Text': [text], 'Summary': [summary],
                               'Download Date': [ddate], 'Emails': [emails],
                               'URLs': [urls], 'Domain Names': [domains]})
    except Exception as e:
        print(e)
        return html.Div([
            'There was an error processing this file.'
        ])

    return html.Div([
        html.H5(filename),  # return the filename
        html.H6(datetime.datetime.fromtimestamp(date)),  # last modified date
        dcc.Checklist(id='checklist', options=[
            {"label": "Text", "value": "Text"},
            {"label": "Summary", "value": "Summary"},
            {"label": "Download Date", "value": "Download Date"},
            {"label": "Email Addresses", "value": "Email Addresses"}
        ],
            value=[]),
        html.Hr(),
        dcc.Store(id='stored-data', data=df.to_dict('records')),

        html.Hr(),  # horizontal line

    ])

Dashboard Checklist

The checklist is where we begin adding some functionality to the dashboard. Using the dcc.Checklist component, we can create a list that allows a user to select which features of the PDF they want displayed. For example, maybe the user wants to look at one feature at a time so they are not overwhelmed and can inspect one specific pattern. Or maybe the user wants to compare the summary to the entire body of the text to see if there is any information missing from the summary that may need to be added. What is convenient about the checklist is that you can look at different batches of features found within a PDF, which helps reduce sensory overload.

Checklist (Image from author)

Once the PDF is uploaded, a checklist is displayed with none of the items selected. See in the next section what happens when we select different parameters in the checklist!

Callbacks

Callbacks are extremely important in Plotly dashboards. They add functionality to the dashboard and make it interactive for the user. The first callback handles the upload function and ensures that our data can be uploaded.

The second callback displays the selected data in our data table. It does this by calling upon the saved data frame created when a PDF is uploaded. Which boxes of the checklist are selected determines what information is currently displayed in the data table.

@app.callback(Output('output-datatable', 'children'),
              Input('upload-data', 'contents'),
              State('upload-data', 'filename'),
              State('upload-data', 'last_modified'))
def update_output(list_of_contents, list_of_names, list_of_dates):
    if list_of_contents is not None:
        children = [
            parse_contents(c, n, d) for c, n, d in
            zip(list_of_contents, list_of_names, list_of_dates)]
        return children


@app.callback(Output('output-data-upload', 'children'),
              Input('checklist', 'value'),
              Input('stored-data', 'data'))
def table_update(options_chosen, df_dict):
    if options_chosen == []:
        return []
    df_copy = pd.DataFrame(df_dict)
    text = df_copy['Text']
    emails = df_copy['Emails']
    ddate = df_copy['Download Date']
    summary = df_copy['Summary']
    value_dct = {}
    for val in options_chosen:
        if val == 'Text':
            value_dct[val] = text
        if val == 'Summary':
            value_dct['Summary'] = summary
        if val == 'Download Date':
            value_dct['Download Date'] = ddate
        if val == 'Email Addresses':
            value_dct['Email'] = emails
    dff = pd.DataFrame(value_dct)
    return dash_table.DataTable(
        dff.to_dict('records'),
        [{'name': i, 'id': i} for i in dff.columns],
        export_format="csv",
        style_data={
            'whiteSpace': 'normal',
            'height': 'auto',
            'textAlign': 'left',
            'backgroundColor': 'rgb(50, 50, 50)',
            'color': 'white'},
        style_header={'textAlign': 'left',
                      'backgroundColor': 'rgb(30, 30, 30)',
                      'color': 'white'
                      })

For example, what if we want to see the emails, URLs, and domain names located within the PDF? By checking those selections in the checklist, that data will populate the table and display itself on the dashboard.

Checklist Example (Image from Author)
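The checklist and second callback shown above only cover the Text, Summary, Download Date, and Email Addresses columns. A sketch of the two small additions needed for URLs and domain names (assuming parse_contents stores 'URLs' and 'Domain Names' columns, as in the version shown earlier) might look like this:

# In the dcc.Checklist options inside parse_contents, two extra entries:
{"label": "URLs", "value": "URLs"},
{"label": "Domain Names", "value": "Domain Names"},

# And inside table_update, two more branches:
if val == 'URLs':
    value_dct['URLs'] = df_copy['URLs']
if val == 'Domain Names':
    value_dct['Domain Names'] = df_copy['Domain Names']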

Finally, while we could easily copy and paste the entries into another spreadsheet or database of our choice, the “export” button in the top left allows us to download and save the table to a CSV file!

And that’s it! These few blocks of code are all it takes to perform basic parsing functions on a PDF in a Plotly dashboard.

Today we looked at one direction a data scientist can take to upload a PDF to a Plotly dashboard and display certain content to a user. This can be super useful for someone who has to pull specific information from PDFs (e.g., a customer's email, name, and phone number) and needs to analyze that information. While this dashboard is basic, it can be tweaked for different styles of PDFs. For example, maybe you want to create a parsing dashboard that displays the citations in a research paper PDF you are reading so you can save those citations or use them to find other papers. Try this code out today, and if you add any cool functions, let me know!
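As a starting point for that citation idea, here is a hedged sketch of one more static method you could drop into pdfParser; it only catches simple bracketed reference markers such as "[1]" or "[12, 13]", nothing more sophisticated.

@staticmethod
def get_citation_markers(text: str) -> str:
    # Sketch: find simple bracketed citation markers like [1] or [12, 13].
    citation_pattern = re.compile(r'\[\d{1,3}(?:,\s*\d{1,3})*\]')
    return str(set(citation_pattern.findall(text)))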

If you enjoy today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Full Code

import pandas as pd
from dash import dcc, Dash, html, dash_table
import base64
import datetime
import io
import PyPDF2
from dash.dependencies import Input, Output, State
import re
import dash_bootstrap_components as dbc
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = Dash(__name__, external_stylesheets=[dbc.themes.CYBORG], suppress_callback_exceptions=True)


class pdfReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """A function that accepts a file path to a pdf
        as input and returns a one-line string of the pdf.

        Parameters:
        file_path (str): The file path to the pdf.

        Returns:
        one_page_pdf (str): A one-line string of the pdf.
        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfFileReader(p)
        num_pages = pdf.numPages
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)

        return content

    def pdf_reader(self) -> PyPDF2.PdfFileReader:
        """A function that opens a .pdf file and
        returns a Python-readable pdf object.

        Parameters:
        self (obj): An object of the class pdfReader

        Returns:
        read_pdf: A Python-readable .pdf object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)

        return read_pdf

    def pdf_info(self) -> dict:
        """A function which returns an information dictionary
        of an object associated with the pdfReader class.

        Parameters:
        self (obj): An object of the pdfReader class.

        Returns:
        pdf_info_dict (dict): A dictionary containing the
        metadata of the object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)
        pdf_info_dict = {}
        for key, value in read_pdf.documentInfo.items():
            pdf_info_dict[re.sub('/', "", key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """A function which returns a dictionary of
        the object where the keys are the pages
        and the text within the pages are the values.

        Parameters:
        self (obj): An object of the pdfReader class.

        Returns:
        pdf_dict (dict): A dictionary mapping page numbers to page text.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfFileReader(opener)
        length = read_pdf.numPages
        pdf_dict = {}
        for i in range(length):
            page = read_pdf.getPage(i)
            text = page.extract_text()
            pdf_dict[i] = text
        return pdf_dict

    def get_publish_date(self) -> str:
        """A function that accepts an information dictionary of an object
        in the pdfReader class and returns the creation date of the
        object (if applicable).

        Parameters:
        self (obj): An object of the pdfReader class

        Returns:
        pub_date (str): The publication date, which is assumed to be the
        creation date (if applicable).
        """
        info_dict_pdf = self.pdf_info()
        pub_date = 'None'
        try:
            publication_date = info_dict_pdf['CreationDate']
            publication_date = datetime.datetime.strptime(publication_date.replace("'", ""), "D:%Y%m%d%H%M%S%z")
            pub_date = publication_date.isoformat()[0:10]
        except Exception:
            pass
        return str(pub_date)


class pdfParser:
    def __init__(self):
        return

    @staticmethod
    def get_emails(text: str) -> str:
        """A function that accepts a string of text and
        returns any email addresses located within the text.

        Parameters:
        text (str): A string of text

        Returns:
        emails (str): The set of emails located within
        the string of text.
        """
        email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
        email_set = set()
        email_set.update(email_pattern.findall(text))

        return str(email_set)

    @staticmethod
    def get_dates(text: str, info_dict_pdf: dict) -> str:
        # Use spaCy's named entity recognizer to pull out DATE entities,
        # filtering anything that looks like an IP address or version number.
        date_label = ['DATE']
        nlp = spacy.load('en_core_web_lg')
        doc = nlp(text)

        dates_pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
        dates = set(ent.text for ent in doc.ents if ent.label_ in date_label)
        filtered_dates = set(date for date in dates if not dates_pattern.match(date))

        return str(filtered_dates)

    @staticmethod
    def get_summary(text: str, per: float) -> str:
        # Frequency-based extractive summary: keep the highest-scoring
        # sentences until the summary reaches the fraction `per` of the
        # original sentence count.
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)
        word_frequencies = {}
        for word in doc:
            if word.text.lower() not in list(STOP_WORDS):
                if word.text.lower() not in punctuation:
                    # Store lowercase tokens so the lookup below matches.
                    if word.text.lower() not in word_frequencies.keys():
                        word_frequencies[word.text.lower()] = 1
                    else:
                        word_frequencies[word.text.lower()] += 1
        max_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = word_frequencies[word] / max_frequency
        sentence_tokens = [sent for sent in doc.sents]
        sentence_scores = {}
        for sent in sentence_tokens:
            for word in sent:
                if word.text.lower() in word_frequencies.keys():
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]
        select_length = int(len(sentence_tokens) * per)
        summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
        final_summary = [sent.text for sent in summary]
        summary = ' '.join(final_summary)
        return summary


directory = '/Users/benmccloskey/Desktop/pdf_dashboard/files'

app.layout = html.Div(children=[
    html.Div(children=[html.H1(children='PDF Parser',
                               style={'textAlign': 'center',
                                      'color': '#7FDBFF'})]),

    html.Div([
        dcc.Upload(
            id='upload-data',
            children=html.Div([
                'Drag and Drop or ',
                html.A('Select Files')
            ]),
            style={
                'width': '100%',
                'height': '60px',
                'lineHeight': '60px',
                'borderWidth': '1px',
                'borderStyle': 'dashed',
                'borderRadius': '5px',
                'textAlign': 'center',
                'margin': '10px'
            },
            # Allow multiple files to be uploaded
            multiple=True
        ),
        # Returns info above the datatable
        html.Div(id='output-datatable'),
        # Output for the datatable
        html.Div(id='output-data-upload')
    ]),

])


def parse_contents(contents, filename, date):
    content_type, content_string = contents.split(',')

    decoded = base64.b64decode(content_string)
    try:
        if 'csv' in filename:
            # Assume that the user uploaded a CSV file
            df = pd.read_csv(
                io.StringIO(decoded.decode('utf-8')))
        elif 'xls' in filename:
            # Assume that the user uploaded an Excel file
            df = pd.read_excel(io.BytesIO(decoded))
        elif 'pdf' in filename:
            pdf = pdfReader(directory + '/' + filename)
            text = pdf.PDF_one_pager()
            emails = pdfParser.get_emails(text)
            ddate = pdf.get_publish_date()
            summary = pdfParser.get_summary(text, 0.1)
            df = pd.DataFrame({'Text': [text], 'Summary': [summary],
                               'Download Date': [ddate], 'Emails': [emails]})
    except Exception as e:
        print(e)
        return html.Div([
            'There was an error processing this file.'
        ])

    return html.Div([
        html.H5(filename),  # return the filename
        html.H6(datetime.datetime.fromtimestamp(date)),  # last modified date
        dcc.Checklist(id='checklist', options=[
            {"label": "Text", "value": "Text"},
            {"label": "Summary", "value": "Summary"},
            {"label": "Download Date", "value": "Download Date"},
            {"label": "Email Addresses", "value": "Email Addresses"}
        ],
            value=[]),

        html.Hr(),
        dcc.Store(id='stored-data', data=df.to_dict('records')),

        html.Hr(),  # horizontal line

    ])


@app.callback(Output('output-datatable', 'children'),
              Input('upload-data', 'contents'),
              State('upload-data', 'filename'),
              State('upload-data', 'last_modified'))
def update_output(list_of_contents, list_of_names, list_of_dates):
    if list_of_contents is not None:
        children = [
            parse_contents(c, n, d) for c, n, d in
            zip(list_of_contents, list_of_names, list_of_dates)]
        return children


@app.callback(Output('output-data-upload', 'children'),
              Input('checklist', 'value'),
              Input('stored-data', 'data'))
def table_update(options_chosen, df_dict):
    if options_chosen == []:
        return []
    df_copy = pd.DataFrame(df_dict)
    text = df_copy['Text']
    emails = df_copy['Emails']
    ddate = df_copy['Download Date']
    summary = df_copy['Summary']
    value_dct = {}
    for val in options_chosen:
        if val == 'Text':
            value_dct[val] = text
        if val == 'Summary':
            value_dct['Summary'] = summary
        if val == 'Download Date':
            value_dct['Download Date'] = ddate
        if val == 'Email Addresses':
            value_dct['Email'] = emails
    dff = pd.DataFrame(value_dct)
    return dash_table.DataTable(
        dff.to_dict('records'),
        [{'name': i, 'id': i} for i in dff.columns],
        export_format="csv",
        style_data={
            'whiteSpace': 'normal',
            'height': 'auto',
            'textAlign': 'left',
            'backgroundColor': 'rgb(50, 50, 50)',
            'color': 'white'},
        style_header={'textAlign': 'left',
                      'backgroundColor': 'rgb(30, 30, 30)',
                      'color': 'white'
                      })


if __name__ == '__main__':
    app.run_server(debug=True)

  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Communications of the ACM 60.6 (2017): 84–90.

