An amazing Python library for extracting tabular data from PDF files

Introduction

PDF (Portable Document Format) is one of the most common file formats today. It is widely used across industries, in government offices, in healthcare, and even for personal work. As a result, a large amount of unstructured data exists in PDF format, and extracting that data to generate meaningful information is a common task for data scientists.

There are several Python libraries dedicated to working with PDF documents, such as PyPDF2. In this tutorial, we will use Camelot.

Why Camelot?

  • You are in control: unlike other libraries and tools, which either give a good result or fail miserably with nothing in between, Camelot gives you the power to tweak table extraction. (This is essential because everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without having to look at each table manually.
  • Each table is a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML, and SQLite.
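
The second point can be sketched in a few lines. This is only an illustration, not part of Camelot: keep_good_tables and its threshold values are hypothetical, and the one thing assumed from Camelot is that each table exposes a parsing_report dictionary with accuracy and whitespace entries (both percentages). Stand-in objects with made-up numbers are used so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    Hypothetical helper: the default thresholds are arbitrary examples.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-ins that mimic the parsing_report attribute of Camelot tables.
# With real data you would pass the result of camelot.read_pdf(...) instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.1, "whitespace": 12.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 45.3, "whitespace": 70.5, "order": 1, "page": 2}),
]

good = keep_good_tables(fake_tables)
print(len(good))  # 1
```

Sensible thresholds depend on the document, so it is worth printing a few parsing reports before settling on values.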

Let's start

Before installing Camelot we have to install Ghostscript. Once Ghostscript is installed, we can install camelot-py.

Run the command below:

pip install "camelot-py[cv]"

Once camelot-py is installed, we are ready to start. We will extract a state-wise GST revenue table from this PDF document.

import camelot

If Camelot is installed, the import runs without printing anything; if it is not, you will see an ImportError.
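
If you would rather check for the library without triggering the exception, the standard library can look the module up first. A small sketch; is_installed is a hypothetical helper name, not something Camelot provides:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("camelot"):
    print("camelot is available")
else:
    print('camelot is missing; install it with: pip install "camelot-py[cv]"')
```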

# Syntax of the camelot.read_pdf function 
camelot.read_pdf(
    filepath,
    pages='1',
    password=None,
    flavor='lattice',
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs,
)

To extract tables from several pages, pass the page numbers as a string through the pages argument; both comma-separated numbers and ranges are accepted.

tables2=camelot.read_pdf('gst-revenue-collection-march2020.pdf', flavor="stream", pages="0-3")
tables2
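
The pages argument is a string of comma-separated page numbers and ranges. When the page numbers arrive as integers from elsewhere in a program, a small helper can build that string. A minimal sketch; pages_arg is a hypothetical helper, not part of Camelot:

```python
def pages_arg(page_numbers):
    """Build a pages string such as "1-3,5" from a collection of integers.

    Consecutive page numbers are collapsed into "start-end" ranges.
    """
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    parts = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            prev = page  # still inside the current run
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


print(pages_arg([3, 1, 2, 5]))  # 1-3,5
```

The result can be passed straight to read_pdf, for example camelot.read_pdf(path, pages=pages_arg([2, 3])).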

read_pdf returns a TableList containing every table found on the requested pages. We can pick a single table by passing its index.

tables2[2]  # 2 is the index 

tables2[2].parsing_report

The parsing report gives details such as accuracy, whitespace, and page number. Note that this table comes from page 2.
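
When a document yields many tables, reading the reports one by one gets tedious. The sketch below groups table indices by the page recorded in each parsing report, which makes it easier to spot a table that continues on the next page. table_indices_by_page is a hypothetical helper, and the stand-in objects carry made-up values; the only assumption about Camelot is the page key in parsing_report.

```python
from collections import defaultdict
from types import SimpleNamespace


def table_indices_by_page(tables):
    """Map each page number to the list indices of the tables found on it."""
    by_page = defaultdict(list)
    for index, table in enumerate(tables):
        by_page[table.parsing_report["page"]].append(index)
    return dict(by_page)


# Stand-ins for a TableList; real tables come from camelot.read_pdf(...).
stand_ins = [
    SimpleNamespace(parsing_report={"page": 1}),
    SimpleNamespace(parsing_report={"page": 1}),
    SimpleNamespace(parsing_report={"page": 2}),
    SimpleNamespace(parsing_report={"page": 3}),
]

print(table_indices_by_page(stand_ins))  # {1: [0, 1], 2: [2], 3: [3]}
```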

The following code returns the extracted table as a pandas DataFrame.

df2=tables2[2].df
df2  

In this case the table is split across two pages, so we need a workaround: extract the second part as well and join the two pieces.

tables2[3]
tables2[3].parsing_report

Here you can see that this table was extracted from page 3.

df3=tables2[3].df
df3

The following code stacks df2 and df3 into a single DataFrame.

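import pandas as pd
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df4 = pd.concat([df2, df3])
df4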

df5 = df4[1:]  # drop the first row
df5.head()
new_header = df5.iloc[0]  # the next row holds the column names
df5 = df5[1:]  # keep only the rows below the header
df5.columns = new_header
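
The same clean-up can be wrapped in a function and tried on toy data. This is a sketch under assumptions: stitch_and_promote_header is a hypothetical name, the two small frames are made-up stand-ins for df2 and df3, and header_row says which row of the combined frame holds the column names.

```python
import pandas as pd


def stitch_and_promote_header(parts, header_row=0):
    """Concatenate table fragments and use one of the rows as the header.

    Rows above header_row are dropped, mirroring the df4[1:] step.
    """
    combined = pd.concat(parts, ignore_index=True)
    header = combined.iloc[header_row]
    body = combined.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = header.tolist()
    return body


# Made-up fragments: a title row, a header row, then data split in two.
first_part = pd.DataFrame([["Report", ""], ["State", "Revenue"], ["A", "10"]])
second_part = pd.DataFrame([["B", "20"], ["C", "30"]])

result = stitch_and_promote_header([first_part, second_part], header_row=1)
print(result.columns.tolist())  # ['State', 'Revenue']
print(len(result))              # 3
```

Camelot returns every cell as text, so numeric columns usually still need converting (for example with pd.to_numeric) before any analysis.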

There you have it: we have extracted a table from a PDF, and we can now export the data to the local system in whichever format we need.
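
Because the result is an ordinary DataFrame, pandas' own writers can handle the export. A minimal sketch with a made-up frame standing in for df5; export_frame, the output folder, and the file stem are all hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_frame(df, out_dir, stem="gst_table"):
    """Write df as CSV and JSON into out_dir and return the two paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"
    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records")  # needs unique column names
    return csv_path, json_path


# Made-up stand-in for the extracted table.
demo = pd.DataFrame({"State": ["A", "B"], "Revenue": ["10", "20"]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = export_frame(demo, tmp)
    print(csv_path.read_text().splitlines()[0])  # State,Revenue
```

Camelot's table objects also offer their own writers, such as to_csv, to_json, to_excel, to_html, and to_sqlite, if you prefer to export before any clean-up.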

Conclusion

Extracting tabular data from a PDF with the Camelot library is really easy. A lot of unstructured data lives in PDF files, and once the tables are extracted we can run whatever analysis and visualization our business needs call for.

I hope this post helps you and saves you a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

About the Author

Prabhat Kumar – Associate analyst

I am an engineer currently working at a leading multinational company as an associate analyst. I am an innovation enthusiast who loves learning new things. I believe every piece of information has a story, and I love reading those stories.

Prabhat Pathak is an Associate Analyst.
