An amazing Python library for extracting tabular data from PDF files

Introduction

The Portable Document Format (PDF) is one of the most common file formats in use today. It is widely used across industries, in government offices, healthcare, and even for personal work. As a result, a large amount of unstructured data exists in PDF format, and extracting this data to generate meaningful information is a common task for data scientists.

There are several Python libraries dedicated to working with PDF documents, such as PyPDF2. In this tutorial, we will use Camelot.

Why Camelot?

You are in control: unlike other libraries and tools that either give good results or fail miserably (with nothing in between), Camelot gives you the power to tweak table extraction. (This is essential because everything in the real world, including PDF table extraction, is messy.)
Bad tables can be discarded based on metrics such as accuracy and whitespace, without having to look at each table manually.
Each table is a pandas DataFrame, which integrates seamlessly into data analysis and ETL workflows.
Export to multiple formats, including JSON, Excel, HTML and SQLite.
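
As a rough sketch of the second point, the helper below keeps only the tables whose parsing report passes some thresholds. It assumes each table exposes a parsing_report dictionary with accuracy and whitespace keys (both percentages), as Camelot tables do. The function name keep_good_tables and the threshold values are illustrative choices, not part of Camelot.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds.

    `tables` can be any iterable of objects with a `parsing_report` dict,
    for example the result of camelot.read_pdf(...).
    The default thresholds are arbitrary illustrative values.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        accurate = report["accuracy"] >= min_accuracy
        dense = report["whitespace"] <= max_whitespace
        if accurate and dense:
            good.append(table)
    return good


# Stand-in objects so the sketch runs without a PDF; real code would pass
# the list returned by camelot.read_pdf(...) instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 45.0, "whitespace": 80.0, "order": 1, "page": 2}),
]
good_tables = keep_good_tables(fake_tables)
print(len(good_tables))
```

Sensible thresholds depend on the document, so it is worth looking at a few reports before settling on values.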

Let's start

Before installing Camelot we have to install Ghostscript. Once Ghostscript is installed, we can install camelot-py.

Run the command below:

pip install "camelot-py[cv]"

Once camelot-py is installed, we are ready to start. We will try to extract a statewide GST revenue table from this PDF document.

import camelot

If Camelot is installed, Python will not print anything; if it is not, you will see an ImportError.
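
If you would rather check for the library without triggering the error, one option is the standard library's importlib. This is only a small sketch; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("camelot"):
    print("camelot is available")
else:
    print('camelot is missing; try: pip install "camelot-py[cv]"')
```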

# Syntax of the camelot.read_pdf function 
camelot.read_pdf(
    filepath,
    pages='1',
    password=None,
    flavor='lattice',
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs,
)

If the tables you need are on several pages, you must pass the page numbers through the pages argument.

tables2=camelot.read_pdf('gst-revenue-collection-march2020.pdf', flavor="stream", pages="0-3")
tables2

This gives you a list of all the tables found in the PDF document. We can select a table by passing its index.

tables2[2]  # 2 is the index

tables2[2].parsing_report

The above code gives you details such as accuracy, whitespace and page number. Note that the page number here is 2.
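
To see at a glance where each table sits, the reports can be grouped by page. This is a minimal sketch, assuming every table has a parsing_report dictionary with a page key. The helper tables_by_page is hypothetical, and the stand-in objects (with made-up values) replace a real Camelot result so the snippet runs without a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def tables_by_page(tables):
    """Map each page number to the list indexes of the tables found on it."""
    pages = defaultdict(list)
    for index, table in enumerate(tables):
        pages[table.parsing_report["page"]].append(index)
    return dict(pages)


# Stand-ins for Camelot tables; the values are made up for illustration.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 98.0, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 97.5, "page": 2}),
    SimpleNamespace(parsing_report={"accuracy": 99.1, "page": 2}),
]
page_index = tables_by_page(fake_tables)
print(page_index)
```

With a real result you would call tables_by_page(tables2). Tables on consecutive pages are a hint, not a proof, that one table may continue across a page break.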

The following code returns the extracted table as a pandas DataFrame.

df2=tables2[2].df
df2

In this case the table is split across two pages, so we need a workaround: we also extract the part that continues on the next page.

tables2[3]
tables2[3].parsing_report

Here you can see that this table was extracted from page 3.

df3=tables2[3].df
df3

The following code appends df3 to df2.

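import pandas as pd
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two parts
df4 = pd.concat([df2, df3], ignore_index=True)
df4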

df5 = df4[1:]
df5.head()
new_header = df5.iloc[0]
df5 = df5[1:]
df5.columns = new_header
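
The same clean-up can be written as one reusable function. The sketch below assumes the pieces are pandas DataFrames with the same number of columns and that one row of the combined table holds the column names. The names combine_and_promote_header and header_row are hypothetical, and the small frames are made-up stand-ins for df2 and df3.

```python
import pandas as pd


def combine_and_promote_header(frames, header_row=0):
    """Stack DataFrames vertically and use one row as the column names.

    Rows up to and including `header_row` (a position in the combined
    frame) are dropped from the data.
    """
    combined = pd.concat(frames, ignore_index=True)
    header = combined.iloc[header_row]
    body = combined.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = list(header)
    return body


# Made-up stand-ins for the two halves of a table split across pages.
part1 = pd.DataFrame([["GST revenue", ""], ["State", "Amount"], ["A", "10"]])
part2 = pd.DataFrame([["B", "20"], ["C", "30"]])

# header_row=1 mirrors the tutorial: skip the first row, use the next as header.
result = combine_and_promote_header([part1, part2], header_row=1)
print(result)
```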

There you have it: we have extracted a table from a PDF, and we can now export this data in any format to the local system.
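
Since the result is an ordinary pandas DataFrame, the usual pandas writers apply. Here is a minimal sketch, with a made-up frame standing in for df5 and a temporary directory standing in for your output folder; the file names are arbitrary.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned table (df5 in the tutorial).
table = pd.DataFrame({"State": ["A", "B"], "Amount": ["10", "20"]})

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "gst_revenue.csv"
json_path = out_dir / "gst_revenue.json"

# index=False leaves the row numbers out of the CSV file.
table.to_csv(csv_path, index=False)
# orient="records" writes one JSON object per table row.
table.to_json(json_path, orient="records")

print(csv_path.read_text())
```

Camelot tables also have their own to_csv, to_json, to_excel, to_html and to_sqlite methods, which write the table as extracted, before any clean-up such as the header fix.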

Conclusion

Extracting tabular data from PDFs with the Camelot library is really easy. Since a lot of unstructured data lives in PDF format, once the tables are extracted we can do a great deal of analysis and visualization based on our business needs.

I hope this post helps you and saves you a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

About the Author

Prabhat Kumar – Associate analyst

I am an engineer currently working at a leading multinational company as an associate analyst. I am an innovation enthusiast who loves learning new things. I believe every piece of information has a story, and I love reading those stories.

Prabhat Pathak (LinkedIn profile) is an Associate Analyst.
