Data Warehouse with PostgreSQL in Python for Data Scientists

This post was released as part of the Data Science Blogathon

Introduction

A data warehouse generalizes and consolidates data in a multidimensional space. Building a data warehouse involves data cleansing, data integration, and data transformation, and it can be seen as an important preprocessing step for data mining.

At the same time, data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data at varying levels of granularity, which facilitates effective data mining and generalization. Many other data mining functions, such as association, classification, prediction and clustering, can be integrated with OLAP operations to drive interactive knowledge extraction.

This is why the data warehouse has become an important platform for scaled data analysis and OLAP. The data warehouse provides a constructive platform for data mining; therefore, data warehousing and OLAP form an essential step in the knowledge discovery in databases (KDD) process. This overview is essential for understanding the overall process of data mining and knowledge discovery.

Now let's understand the basic concept of a data warehouse.

Basic concept of a data warehouse:

Data warehousing provides architectures and tools for business professionals to organize, understand, and use their data systematically to make strategic decisions. Data warehousing systems are valuable tools in today's competitive and rapidly evolving world. In recent years, many companies and industries have spent millions of dollars building enterprise-wide data warehouses.

“So, what exactly is a data warehouse?” In general terms, a data warehouse is a repository of data that is maintained separately from an organization's operational databases. Data warehouses enable the integration of a range of application systems. Four keywords (subject-oriented, integrated, time-variant, and non-volatile) distinguish data warehouses from other data storage systems, such as relational database management systems (RDBMS), transaction processing systems, and other file systems.

There are three key components to implementing a data warehouse:

– Server

– Dashboard

– Indexing

Let's analyze each of these points in detail:

1) Server:

PostgreSQL

“PostgreSQL” is an open-source relational database management system (RDBMS). Although it is a structured database management system (DBMS), it also stores unstructured data. Most importantly, the PostgreSQL graphical interface makes it very easy to deploy and manage databases on the fly.

Before continuing, you must download and install Postgres using the PostgreSQL download link.

After the installation is complete, you can log in to the server by running the pgAdmin application, which will open a portal in your browser.

There is a default database labeled postgres; however, you can create your own database by right-clicking on the “Databases” menu and then selecting “Create” to define a new database.
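Alternatively, if you prefer SQL to the GUI, the two databases used later in this post (records_db and datasets_db are the names assumed below) can be created from the pgAdmin Query Tool or psql with a couple of statements:

CREATE DATABASE records_db;
CREATE DATABASE datasets_db;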

2) Python implementation:

Now that we have created our server and database, we first need to install the package called “sqlalchemy”, which will be used to connect to the database from Python. You can download and install this package by running the following command at the Anaconda prompt:

pip install sqlalchemy
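The other packages imported below can be installed the same way. The exact PyPI names here are my assumption of the usual ones (psycopg2-binary is the common binary distribution of the Postgres driver), so adjust them to your environment if needed:

pip install psycopg2-binary pandas streamlit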

Now let's import these must-have libraries in the Python script as follows:

from sqlalchemy import create_engine
import psycopg2
import pandas as pd
import streamlit as st

Now we need to establish a connection to our “records_db” database and create a new table where we can store our records. We also need to create another connection to the “datasets_db” database, where we can store our datasets.

# one engine per database: records_db for individual records, datasets_db for whole datasets
p_engine = create_engine("postgresql://<username>:<password>@localhost:5432/records_db")
p_engine_dataset = create_engine("postgresql://<username>:<password>@localhost:5432/datasets_db")
# create the records table (name as primary key, details as a text array) if it does not exist yet
p_engine.execute("CREATE TABLE IF NOT EXISTS records (name text PRIMARY KEY, details text[])")

Per the Postgres naming convention, table names must begin with an underscore (_) or a letter ("a, b, c", not a number), must not contain hyphens (-), and must be shorter than 64 characters. For our “records” table, we will create a “name” field with a “text” data type declared as PRIMARY KEY, and a “details” field of type text[] (array), which is the Postgres notation for a one-dimensional array. Additionally, if you want to store your database credentials securely, save them in a configuration file and then invoke them as parameters in your code as needed.
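As a minimal sketch of that last point (assuming a configuration file named database.ini with a [postgresql] section; both names are illustrative, not from the original post), Python's built-in configparser can supply the credentials:

# database.ini -- keep this file out of version control
# [postgresql]
# user = <username>
# password = <password>
# host = localhost
# port = 5432

from configparser import ConfigParser
from sqlalchemy import create_engine

config = ConfigParser()
config.read("database.ini")                # load the credentials from the config file
pg = config["postgresql"]

# build the connection URL from the stored parameters instead of hard-coding them
p_engine = create_engine("postgresql://%s:%s@%s:%s/records_db" % (pg["user"], pg["password"], pg["host"], pg["port"]))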

Now let's create the following functions for reading, writing, updating, and listing our data to and from the database:

def write_record(name,details,p_engine):
    # insert a new record into the "records" table
    p_engine.execute("INSERT INTO records (name,details) VALUES ('%s','%s')" % (name,details))

def read_record(field,name,p_engine):
    # return a single field of the record whose name matches
    result = p_engine.execute("SELECT %s FROM records WHERE name='%s'" % (field,name))
    return result.first()[0]

def update_record(field,name,new_value,p_engine):
    # update one field of the record whose name matches
    p_engine.execute("UPDATE records SET %s='%s' WHERE name='%s'" % (field,new_value,name))

def write_dataset(name,dataset,p_engine):
    # store a pandas DataFrame as its own table, replacing it if it already exists
    dataset.to_sql('%s' % (name),p_engine,index=False,if_exists="replace",chunksize=1000)

def read_dataset(name,p_engine):
    # read a table back into a DataFrame; return an empty DataFrame if the table is missing
    try:
        dataset = pd.read_sql_table(name,p_engine)
    except Exception:
        dataset = pd.DataFrame([])
    return dataset

def list_datasets(p_engine):
    # list all user tables in the public schema
    datasets = p_engine.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name;")
    return datasets.fetchall()
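Before wiring these functions into the dashboard, you can sanity-check them from a plain Python session. A minimal sketch, where the record name and details are made-up examples:

write_record("alice", "{data scientist,python}", p_engine)
print(read_record("details", "alice", p_engine))    # expected: ['data scientist', 'python']
update_record("name", "alice", "alice_smith", p_engine)
print(list_datasets(p_engine_dataset))               # tables currently stored in datasets_db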

3) Dashboard:

Streamlit

“Streamlit” is a pure-Python web framework that lets us develop and deploy user interfaces (UIs) and applications in real time. Here we are using Streamlit to render the dashboard for interacting with the database.

In the code shown below, we use different text inputs to insert the values into our records, as well as arrays and names for our datasets. Then we use Streamlit functions to interactively visualize our dataset as a chart and also as a data frame.

st.title('Dashboard')
column_1, column_2 = st.beta_columns(2)    # renamed to st.columns in newer Streamlit releases

with column_1:
    st.header('Save records')
    name = st.text_input('Please enter name')
    details = st.text_input('Please enter your details (separated by comma ",")')
    details = ('{%s}' % (details))
    if st.button('Save record to database'):
        write_record(name,details,p_engine)
        st.info('Name: **%s** and details: **%s** saved to database' % (name,details[1:-1]))

    st.header('Update records')
    field = st.selectbox('Please select field to update',('name','details'))
    name_key = st.text_input('Please enter name of record that to be updated')    
    if field == 'name':
        updated_name = st.text_input('Please enter your updated name')
        if st.button('Update records'):
            update_record(field,name_key,updated_name,p_engine)
            st.info('Updated name to **%s** in record **%s**' % (updated_name,name_key))                
    elif field == 'details':
        updated_details = st.text_input('Please enter updated details (separated by comma)')
        updated_details = ('{%s}' % (updated_details))  
        if st.button('Update records'):
            update_record(field,name_key,updated_details,p_engine)
            st.info('Updated details to  **%s** in record **%s**' % (updated_details[1:-1],name_key))
            
    st.header('Read records')
    record_to_read = st.text_input('Please enter name of record to read')
    if st.button('Search'):
        read_name = read_record('name',record_to_read,p_engine)
        read_details = read_record('details',record_to_read,p_engine)
        st.info('Record name is **%s**, record details is **%s**' % (read_name,str(read_details)[1:-1]))

with column_2:
    st.header('Save datasets')
    dataset = st.file_uploader('Please upload dataset')
    if dataset is not None:
        dataset = pd.read_csv(dataset)
        dataset_name = st.text_input('Please enter name for dataset')
        if st.button('Save dataset to database'):
            write_dataset('%s' % (dataset_name),dataset,p_engine_dataset)
            st.info('**%s** saved to database' % (dataset_name))

    try:
        read_title = st.empty()
        dataset_to_read = st.selectbox('Please select dataset to read',([x[0] for x in list_datasets(p_engine_dataset)]))
        read_title.header('Read datasets')
        if st.button('Read dataset'):
            df = read_dataset(dataset_to_read,p_engine_dataset)
            st.subheader('Chart')
            st.line_chart(df['value'])
            st.subheader('Dataframe')
            st.write(df)    
    except:
        pass

You can run your dashboard in a local browser from your machine by typing the following commands at the Anaconda prompt. First, change to the directory where your source code is saved.

cd C:\Users\your directory path...

Now we will run the following command to launch our application:

streamlit run file_name.py

Final result

To conclude, we have a dashboard that can be used to write, read, tokenize, update, upload, and visualize our data in real time. The beauty of this data warehouse is that it can be expanded so the user/host can store as much data as needed within the same structure.

Conclusion

I hope you liked my post. Share it with your friends and colleagues. Thanks!

The media shown in this post is not the property of DataPeaker and is used at the author's discretion.
