Introduction
PDF, the Portable Document Format, is one of the most common file formats today. It is widely used across industries, from government offices and healthcare to personal work. As a result, a large amount of unstructured data exists in PDF format, and extracting that data to generate meaningful information is a common task for data scientists.
There are several Python libraries dedicated to working with PDF documents, such as PyPDF2. In this tutorial, we will use Camelot.
Why Camelot?
- You are in control: unlike other libraries and tools, which either give good results or fail miserably (with nothing in between), Camelot gives you the power to tweak table extraction. (This is essential because everything in the real world, including PDF table extraction, is messy.)
- Bad tables can be discarded based on metrics such as accuracy and whitespace, without having to look at each table manually.
- Each table is a pandas DataFrame, which integrates seamlessly into data analysis and ETL workflows.
- Export to multiple formats, including CSV, JSON, Excel, HTML and SQLite.
Let's start
Before installing the Camelot library we have to install Ghostscript. Once Ghostscript is installed, we can install camelot-py.
Run the command below:
pip install "camelot-py[cv]"
Once camelot-py is installed, we are ready to start. We will try to extract a state-wise GST revenue table from this PDF document.
import camelot
If Camelot is installed, Python will not print an error message; if it is not, you will see an ImportError.
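A small check like the following can confirm the installation before running the rest of the tutorial. It is only a sketch: the helper name `is_installed` is made up for this example, and it relies on the standard library's `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("camelot"):
    print("Camelot is available")
else:
    print('Camelot is missing; run: pip install "camelot-py[cv]"')
```

Unlike a bare import, this does not stop the script with a traceback, so it can be used to print a friendlier hint.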
# Syntax of the camelot.read_pdf function
camelot.read_pdf(
    filepath,
    pages='1',
    password=None,
    flavor='lattice',
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs,
)
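The pages argument in the signature above is a string of comma-separated page numbers, and Camelot's documentation gives '1,3,4', '1,4-end' and 'all' as examples. Pages are counted from 1. To make the format concrete, here is a rough sketch of how such a string could be expanded into page numbers. This is not Camelot's own code: the function `expand_pages` and its `page_count` parameter are hypothetical and exist only for illustration.

```python
def expand_pages(spec, page_count):
    """Expand a page string such as "1,4-end" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if part == "all":
            pages.update(range(1, page_count + 1))
        elif "-" in part:
            start, end = part.split("-")
            last = page_count if end == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    # Reject anything outside the document, including page 0.
    bad = [p for p in pages if p < 1 or p > page_count]
    if bad:
        raise ValueError(f"page numbers out of range: {sorted(bad)}")
    return sorted(pages)


print(expand_pages("1-3", page_count=5))      # [1, 2, 3]
print(expand_pages("1,4-end", page_count=5))  # [1, 4, 5]
```

The sketch rejects page 0 on purpose, since a range that starts at 0 does not refer to the first page.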
If you have to extract tables from several pages, you must give the page numbers. Camelot counts pages from 1, so the first three pages are "1-3".
tables2=camelot.read_pdf('gst-revenue-collection-march2020.pdf', flavor="stream", pages="1-3")
tables2
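This gives you the full list of tables found on the requested pages of the PDF document. We can choose a table by passing its index; which index holds which table depends on how many tables Camelot detects in the document.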
tables2[2] # 2 is the index
tables2[2].parsing_report
The code above gives you details such as accuracy, whitespace and page number. Note that this table is on page 2.
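The parsing report is what makes it possible to discard bad tables without inspecting each one. The sketch below filters a list of report dictionaries by accuracy and whitespace. The function name `keep_good_tables`, the threshold values and the sample reports are all assumptions made up for illustration; only the dictionary keys follow the shape of Camelot's parsing_report.

```python
def keep_good_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the indices of reports that pass both thresholds."""
    return [
        i
        for i, report in enumerate(reports)
        if report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    ]


# Made-up reports shaped like Camelot's parsing_report dictionaries.
sample_reports = [
    {"accuracy": 99.1, "whitespace": 12.0, "order": 1, "page": 1},
    {"accuracy": 61.4, "whitespace": 78.5, "order": 1, "page": 2},
    {"accuracy": 97.3, "whitespace": 20.8, "order": 2, "page": 2},
]
print(keep_good_tables(sample_reports))  # [0, 2]
```

With real data, the list of reports could be built as `[t.parsing_report for t in tables2]`, and the thresholds should be tuned to the document at hand.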
The following code returns the extracted table as a pandas DataFrame.
df2=tables2[2].df
df2
In this case the table is split across two pages, so we need a workaround: extract the second part as well and join the two.
tables2[3]
tables2[3].parsing_report
Here you can see that this table was extracted from page 3.
df3=tables2[3].df
df3
The following code appends df3 to df2.
import pandas as pd
df4=pd.concat([df2, df3])  # DataFrame.append was removed in pandas 2.0
df4
df5 = df4[1:]
df5.head()
new_header = df5.iloc[0]
df5 = df5[1:]
df5.columns = new_header
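The combine-and-promote steps above can be tried on toy data without a PDF. The sketch below uses two made-up DataFrames as stand-ins for df2 and df3, assuming, as is typical for Camelot output, that every cell is a string and that the real column names sit in a data row rather than in the header. The variable names are invented for this example.

```python
import pandas as pd

# Made-up stand-ins for df2 and df3: string cells, columns labelled 0, 1, ...
part_one = pd.DataFrame([["", ""], ["State", "Revenue"], ["A", "10"]])
part_two = pd.DataFrame([["B", "20"], ["C", "30"]])

combined = pd.concat([part_one, part_two], ignore_index=True)

body = combined[1:]                       # drop the stray first row
header = body.iloc[0]                     # the row holding the column names
result = body[1:].reset_index(drop=True)  # keep only the data rows
result.columns = list(header)             # promote that row to the header

print(result)
```

Resetting the index is optional, but it avoids duplicate row labels after the two parts are joined.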
There you have it: we have extracted a table from a PDF, and we can now export this data to the local system in any format.
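Because the result is an ordinary DataFrame, the export can be done with pandas itself. The following is a minimal sketch, assuming a small made-up table in place of the cleaned data; the file names, the table name `gst` and the temporary output directory are arbitrary choices for illustration.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

import pandas as pd

# Made-up data standing in for the cleaned table (df5 in this tutorial).
table = pd.DataFrame({"State": ["A", "B"], "Revenue": ["10", "20"]})

out_dir = Path(tempfile.mkdtemp())
table.to_csv(out_dir / "gst.csv", index=False)
table.to_json(out_dir / "gst.json", orient="records")

# SQLite export through pandas' to_sql and the standard sqlite3 module.
with closing(sqlite3.connect(out_dir / "gst.sqlite")) as conn:
    table.to_sql("gst", conn, index=False, if_exists="replace")
    row_count = conn.execute("SELECT COUNT(*) FROM gst").fetchone()[0]

written = sorted(p.name for p in out_dir.iterdir())
print(written, row_count)
```

Camelot's table objects also offer their own export helpers, such as `to_csv`, which may be more convenient when no cleaning is needed first.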
Conclusion
Extracting tabular data from a PDF with the Camelot library is really easy. A lot of unstructured data lives in PDF format, and once the tables are extracted we can do a great deal of analysis and visualization based on our business needs.
I hope this post helps you and saves you a good amount of time. Let me know if you have any suggestions.
HAPPY CODING.
About the Author
Prabhat Kumar – Associate analyst
I am an engineer currently working as an associate analyst at a leading multinational company. I am an innovation enthusiast who loves learning new things. I believe every piece of information has a story, and I love reading those stories.
Prabhat Pathak (LinkedIn profile) is an Associate Analyst.