Write a script to find all PDFs that contain specific words in seconds: text extraction and search automation

Luis Garcia Fuentes
3 min read · Apr 21, 2021


This is a short tutorial that solves one very specific problem. If you have a folder full of PDFs and want to know, within seconds, which of them contain a certain word or sentence, this post should help you.

In my case, the problem is that the accounting department often processes some invoices incorrectly. Luckily for me, the wrong invoices can be identified simply by checking which invoices contain certain words. However, doing this by hand would be a very painful task, as there are hundreds of invoices.

Photo by Beatriz Pérez Moya on Unsplash

Why might my solution be relevant to you when others exist? I was not able to find one that worked for me: the recommended libraries either did not support Python 3 or, in the case of PyPDF2, could not read all of my invoices (for some reason, PyPDF2 was unable to read roughly 50% of my PDFs, and I could not get past this obstacle with that library).

Below is the code for the base scenario, where you have a folder location and want to search every PDF in that folder for a specific word or sentence. It uses the ‘fitz’ module, which is provided by the PyMuPDF package.

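import os
import fitz  # provided by the PyMuPDF package

path = r'C:\Users\luisgarcuser\xxxxx'
files = os.listdir(path)
for file in files:
    doc = fitz.open(os.path.join(path, file))
    for page in doc:
        text = page.get_text()  # called getText() in older PyMuPDF releases
        # print(text)
        result = text.find('Word of interest')
        if result != -1:
            print(file)
            break  # one matching page is enough, move on to the next file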

Where: the variable ‘path’ contains the folder location, the variable ‘files’ contains a list of all files in this location, and the rest of the code is a for-loop that reads and searches each PDF in the folder in turn.

View of variable ‘files’ (using Spyder). We can see all pdfs have been listed. We loop over this list next.
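If the folder also holds files that are not PDFs, the list can be filtered before the loop starts. The sketch below is one possible way to do that with the standard library only; the helper name ‘list_pdfs’ is my own and not part of the original script.

```python
import os


def list_pdfs(folder):
    """Return the full paths of the .pdf files directly inside folder, sorted."""
    candidates = (
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")  # also accepts .PDF
    )
    # Skip anything that is not a regular file, e.g. a sub-folder named "old.pdf".
    return sorted(p for p in candidates if os.path.isfile(p))
```

Looping over ‘list_pdfs(path)’ instead of ‘files’ gives full paths directly, so the path no longer has to be glued to the file name inside the loop.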

The loop text extraction: each file in the list is opened with the ‘open’ function of the ‘fitz’ library, which returns a document object that we store in ‘doc’. We can then iterate over the pages of that document and save the text of the current page in the variable ‘text’. This seems complicated, but it is not. All we are doing is translating each PDF, one page at a time, into a string variable that holds its text.
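If you would rather have the text of a whole PDF in one string instead of one page at a time, the pages can be joined. This is only a sketch: the names ‘join_pages’ and ‘pdf_text’ are hypothetical, and it assumes a PyMuPDF release where the page method is called ‘get_text’ (older releases call it ‘getText’).

```python
def join_pages(pages):
    """Join the text of every page object into one string, one page per block."""
    return "\n".join(page.get_text() for page in pages)


def pdf_text(pdf_path):
    """Open one PDF with PyMuPDF and return all of its text."""
    # Imported here so that join_pages can be tried without PyMuPDF installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        return join_pages(doc)
```

Searching the joined string means a file is checked once rather than once per page, at the cost of holding the full text of the document in memory.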

Translation from invoice pdf to text in Python variable. Blurred for privacy

So far we have extracted the text of each PDF page and saved it in the variable ‘text’. We now need to search it for the keywords that should flag an invoice for our attention. We achieve this using the variable ‘result’, which equals -1 if the script was not able to find the ‘Word of interest’ and is 0 or greater if it finds a match. Note that this method is case sensitive.

This works because the method ‘text.find’ returns the index at which the searched word first appears. An index of -1 means that there was no match, and we can leverage this to list all PDFs where we do find a match. We do this with an if statement: if the result is not equal to -1, we print the name of the invoice in our console.
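To make the return values concrete, here is a tiny example on a made-up piece of invoice text (the string is invented for illustration). It also shows why the comparison has to be against -1: a match at the very start of the text returns 0, which Python would treat as false in a plain ‘if result:’.

```python
text = "Invoice 2021-044\nPayment terms: 30 days"

print(text.find("Invoice"))          # 0: match at the very start of the text
print(text.find("Payment"))          # 17: index where the word begins
print(text.find("payment"))          # -1: find() is case sensitive
print(text.lower().find("payment"))  # 17: lower-casing the text first ignores case
```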

This script takes seconds to run and gets the results I am looking for. You can play around with the contents in the for loop if you need to use multiple words as flag criteria. I hope you find this useful!
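As a starting point for several flag words, something like the sketch below could replace the single ‘find’ call. The helper name ‘matching_keywords’ and the example keywords are my own inventions. It ignores case and also collapses line breaks, on the assumption that a sentence in the extracted text may be split across lines.

```python
def matching_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case and line breaks."""
    # Collapse every run of whitespace (including newlines) into a single space.
    normalized = " ".join(text.split()).casefold()
    return [
        kw for kw in keywords
        if " ".join(kw.split()).casefold() in normalized
    ]


page_text = "INVOICE 17\nVAT reverse\ncharge applies"
print(matching_keywords(page_text, ["reverse charge", "credit note"]))
# prints ['reverse charge']
```

Inside the loop, ‘if matching_keywords(text, keywords):’ flags a file when any keyword is present, while comparing the length of the returned list with ‘len(keywords)’ flags it only when all of them are.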

Cheers,
