NoSQL Databases Every Data Scientist Should Know About (2020)

Overview

  • NoSQL databases are ubiquitous in the industry: a data scientist is expected to be familiar with these databases
  • Here we will see what a NoSQL database is and why you should learn about it.
  • We will also see the characteristics of 5 different NoSQL databases.

Introduction

Here's some advice I wish someone had given me when I started in data science: learn all you can about working with databases.

Here's a quick look at where your database knowledge will come into play:

  • You will face database questions in your data science interview.
  • You will work extensively with databases in your role as a data scientist, data analyst, business analyst, etc.
  • You will leverage your knowledge of databases to collect data for your data science project.

And much more!

The incontrovertible truth is that we are generating data at an unprecedented rate and scale right now. The simple fact that more than 8,500 tweets and 900 Instagram photos are posted every second blows my mind. How do today's databases cope with such volumes of data?

To handle this large amount of data, we need a distributed database system that can run on multiple nodes and is also partition tolerant. This means that even if some nodes fail or lose contact with each other, the system as a whole should keep working. So partition tolerance is a must. Now, according to the CAP theorem, a distributed system cannot guarantee partition tolerance, availability and consistency all at the same time.

Since partition tolerance is required, we have to trade off availability against consistency. For instance, in a banking app, a customer should see the correct balance regardless of where they are accessing it from. Results may be delayed for a few seconds, but they must be consistent.
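The trade-off can be made concrete with a small toy model. The sketch below is not a real database: `Replica`, `write_cp` and `write_ap` are hypothetical names, and a "partition" is simulated with a simple `reachable` flag. It only illustrates that, during a partition, a system either rejects the write (staying consistent) or accepts it (staying available, but letting replicas disagree).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of an account balance (hypothetical toy model)."""
    balance: int
    reachable: bool = True


def write_cp(replicas, value):
    """Consistency first: refuse the write unless every replica can apply it."""
    if not all(r.reachable for r in replicas):
        return False  # unavailable, but replicas never disagree
    for r in replicas:
        r.balance = value
    return True


def write_ap(replicas, value):
    """Availability first: update whichever replicas are reachable."""
    for r in replicas:
        if r.reachable:
            r.balance = value
    return True  # always accepted, but replicas may now disagree


# Simulate a network partition that cuts off the second replica.
cp = [Replica(100), Replica(100, reachable=False)]
ap = [Replica(100), Replica(100, reachable=False)]

cp_accepted = write_cp(cp, 50)
ap_accepted = write_ap(ap, 50)

print(cp_accepted, [r.balance for r in cp])  # False [100, 100]
print(ap_accepted, [r.balance for r in ap])  # True [50, 100]
```

A banking app would lean towards the first behaviour, while a social feed can usually live with the second.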

In this article, we will look at the different types of NoSQL databases, their characteristics, and when to use each type.

Table of Contents

  1. What is a NoSQL database?
  2. NoSQL database types
    1. Document-based database
    2. Key-value database
    3. Wide-column database
    4. Graph-based database
  3. Different NoSQL databases
    1. MongoDB
    2. Cassandra
    3. ElasticSearch
    4. Amazon DynamoDB
    5. HBase

What is a NoSQL database?

So, what is a NoSQL database?

You may have heard people say that a NoSQL database is any non-relational database that has no relationships between the data. Well, that's not quite true. NoSQL databases can also store relationships between data, just in a different way.

We can say that “NoSQL” stands for “Not Only SQL”. Here, data does not have to be split across multiple tables: related data can be kept together in a single data structure. When you work with a large amount of data, this means many queries can be answered without running expensive joins, which helps avoid performance lags. NoSQL databases are highly scalable and reliable, and they are designed to operate in a distributed environment.
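To make the "no joins" point concrete, here is a minimal sketch in plain Python (no database involved). The customer and order data, and the helper names `items_via_join` and `items_via_document`, are made up for illustration. The first half splits data across two "tables" and has to match rows; the second keeps everything related to a customer in one nested structure.

```python
# Relational style: data split across two "tables" linked by customer_id.
customers = [{"customer_id": 1, "name": "Asha"}]
orders = [
    {"order_id": 10, "customer_id": 1, "item": "keyboard"},
    {"order_id": 11, "customer_id": 1, "item": "monitor"},
]


def items_via_join(customer_id):
    """Emulate a join: scan the orders table for matching rows."""
    return [o["item"] for o in orders if o["customer_id"] == customer_id]


# NoSQL style: the orders are embedded in the customer record itself.
customer_doc = {
    "customer_id": 1,
    "name": "Asha",
    "orders": [
        {"order_id": 10, "item": "keyboard"},
        {"order_id": 11, "item": "monitor"},
    ],
}


def items_via_document(doc):
    """One lookup returns the customer together with all related orders."""
    return [o["item"] for o in doc["orders"]]


print(items_via_join(1))                 # ['keyboard', 'monitor']
print(items_via_document(customer_doc))  # ['keyboard', 'monitor']
```

The price of the second layout is duplication: if the same order data is needed elsewhere, it may have to be stored more than once.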

NoSQL database types

Now that we know what a NoSQL database is, let's explore the different types of NoSQL databases in this section.

1. Document-based NoSQL databases

Document-based databases store data as JSON-like objects. Each document is a set of key-value pairs, and a value can itself be a nested object or an array.

Document-based databases are easy for developers to work with, since documents map directly to objects in application code and JSON is a very common data format among web developers. They are very flexible and allow us to modify the structure at any time.

Some examples of document-based databases are MongoDB, OrientDB and BaseX.
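As a rough illustration of that flexibility, the sketch below models a "collection" as a plain Python list of dictionaries. It does not use any real database driver; the `users` data and the `find` helper are hypothetical. The point is only that two documents in the same collection can have different fields, and that a new field can appear without any schema migration.

```python
import json

# A "collection" is just a list of documents; no schema is declared up front.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    # A later document adds a nested object and an array, with no migration.
    {"_id": 2, "name": "Ben", "address": {"city": "Pune"}, "tags": ["admin", "beta"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Documents serialise naturally to JSON.
print(json.dumps(find(users, name="Ben")[0], indent=2))

# Code that reads optional fields has to handle their absence.
cities = [doc.get("address", {}).get("city") for doc in users]
print(cities)  # [None, 'Pune']
```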

2. Key-value databases

As the name suggests, a key-value database stores data as key-value pairs. Here, keys and values can be anything: strings, integers or even complex objects. These databases are easy to partition and scale horizontally very well. They can be really useful in session-oriented applications, where we are trying to capture the behavior of the client in a particular session.

Some examples are DynamoDB, Redis and Aerospike.
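The reason key-value data partitions so easily is that the key alone can decide where a record lives. Here is a minimal in-memory sketch of that idea, assuming a simple hash-modulo placement; `ShardedKVStore` is a hypothetical class, not the API of any of the products above, and real systems use more elaborate schemes such as consistent hashing.

```python
import hashlib


class ShardedKVStore:
    """Toy key-value store that spreads keys across in-memory shards."""

    def __init__(self, num_shards=3):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKVStore()
store.put("session:42", {"user": "asha", "cart": ["keyboard"]})
store.put("session:43", {"user": "ben", "cart": []})

print(store.get("session:42"))  # {'user': 'asha', 'cart': ['keyboard']}
print(store.get("session:99"))  # None
```

Each shard could live on a different machine, and a lookup never needs to consult more than one of them.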

3. Wide-column databases

This type of database stores data in records, similar to a relational database, but each record can hold a very large number of dynamic columns. Columns are grouped logically into column families.

For instance, where a relational database has several tables, a wide-column database has several column families instead.

Popular examples of this type of database are Cassandra and HBase.
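One way to picture the wide-column model is as a nested map: row key, then column family, then column name. The sketch below builds that structure with plain Python dictionaries. It is only a mental model with hypothetical names (`table`, `put`, `get_family`), not how Cassandra or HBase are actually queried.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one cell; new columns can be created on the fly."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read only one column family of one row."""
    return dict(table[row_key][family])


put("user1", "profile", "name", "Asha")
put("user1", "profile", "city", "Pune")
put("user1", "activity", "2020-09-09", "login")
put("user2", "profile", "name", "Ben")
# user2 gets a column that user1 does not have; no schema change is needed.
put("user2", "activity", "2020-09-10", "purchase")

print(get_family("user1", "profile"))      # {'name': 'Asha', 'city': 'Pune'}
print(sorted(table["user2"]["activity"]))  # ['2020-09-10']
```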

4. Graph-based databases

Graph-based databases store data in the form of nodes and edges. Nodes hold information about the main entities, such as people, places and products, and edges hold the relationships between them. These databases work best when you need to uncover relationships or patterns between your data points, as in social networks and recommendation engines.

Some of the examples are Neo4j, Amazon Neptune, etc.
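A small sketch shows why this layout suits relationship queries. It uses an adjacency map in plain Python rather than a real graph database, and the people, edges and `suggest_friends` function are invented for illustration. The query "friends of friends who are not yet friends" is the kind of traversal a recommendation engine might run.

```python
from collections import defaultdict

# Nodes carry entity data; edges carry relationships between nodes.
nodes = {
    "asha": {"type": "person"},
    "ben": {"type": "person"},
    "chen": {"type": "person"},
    "dana": {"type": "person"},
}
edges = [("asha", "ben"), ("ben", "chen"), ("ben", "dana"), ("asha", "dana")]

# Build an undirected adjacency map: node -> set of neighbours.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def suggest_friends(person):
    """Friends of friends who are not already friends with `person`."""
    direct = adjacency[person]
    candidates = set()
    for friend in direct:
        candidates |= adjacency[friend]
    return sorted(candidates - direct - {person})


print(suggest_friends("asha"))  # ['chen']
print(suggest_friends("chen"))  # ['asha', 'dana']
```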

Now, let's take a look at some specific NoSQL databases and their features.

List of the different NoSQL databases

1. MongoDB

MongoDB is the most widely used document-based database. It stores data as JSON-like documents.

According to the website stackshare.io, more than 3,400 companies use MongoDB in their tech stack. Uber, Google, eBay, Nokia and Coinbase are some of them.

When to use MongoDB?

  1. If you plan to integrate hundreds of different data sources, MongoDB's document-based model is an excellent choice, as it can provide a single unified view of the data.
  2. When you expect a lot of read and write operations from your application, and you can tolerate losing some data if a server crashes.
  3. You can use it to store clickstream data and use that data for customer behavior analysis.

2. Cassandra

Cassandra is an open source distributed database system that was initially built at Facebook (and inspired by Google's Bigtable). It is highly available and highly scalable. It can handle petabytes of data and thousands of concurrent requests per second.

Again according to stackshare.io, more than 400 companies use Cassandra in their tech stack. Facebook, Instagram, Netflix, Spotify and Coursera are some of them.

When to use Cassandra?

  1. When your use case requires more write operations than read operations.
  2. In situations where you need availability more than consistency. For instance, you can use it for social media websites, but it is a poor fit for banking.
  3. When your database queries need few joins and aggregations.
  4. Health trackers, meteorological data, order tracking and time series data are some good use cases for Cassandra.
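Time series data fits this write-heavy profile because readings are mostly appended and then read back in time order for one source. The sketch below imitates that access pattern in plain Python: readings are grouped by a (sensor, day) key and kept sorted by timestamp within each group. The names `partitions`, `write_reading` and `read_range` are hypothetical, and this is an analogy for the data layout, not Cassandra's API.

```python
import bisect
from collections import defaultdict

# (sensor_id, day) -> list of (timestamp, value) tuples, kept sorted by timestamp
partitions = defaultdict(list)


def write_reading(sensor_id, timestamp, value):
    """Append a reading to the partition for its sensor and day."""
    day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO-8601 string
    bisect.insort(partitions[(sensor_id, day)], (timestamp, value))


def read_range(sensor_id, day, start, end):
    """Return readings with start <= timestamp < end from one partition."""
    rows = partitions[(sensor_id, day)]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]


# Writes may arrive out of order; each one touches a single partition.
write_reading("s1", "2020-09-09T10:05:00", 21.5)
write_reading("s1", "2020-09-09T10:00:00", 21.0)
write_reading("s1", "2020-09-09T10:10:00", 22.1)
write_reading("s1", "2020-09-10T09:00:00", 19.8)

morning = read_range("s1", "2020-09-09", "2020-09-09T10:00:00", "2020-09-09T10:10:00")
print(morning)
# [('2020-09-09T10:00:00', 21.0), ('2020-09-09T10:05:00', 21.5)]
```

Note that the query names the sensor and the day up front, so no join or cross-partition aggregation is needed.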

3. ElasticSearch

This is also an open source distributed NoSQL database system. It is highly scalable and consistent. You can also think of it as an analytics engine: it can easily store, search and analyze large volumes of data.

If full-text search is part of your use case, ElasticSearch will be the best option for your technology stack. It even allows fuzzy match searching.

More than 3,000 companies use Elasticsearch in their technology stack, including Slack, Udemy, Medium and Stack Overflow.

When to use ElasticSearch?

  1. If your use case requires full-text search, Elasticsearch will be the best option.
  2. If your use case involves chatbots that answer most user queries, there is a high chance of misspellings in what people type. You can make use of Elasticsearch's built-in fuzzy matching.
  3. Elasticsearch is also useful for storing log data and analyzing it.
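To give a feel for what full-text search with fuzzy matching involves, here is a toy version in plain Python: an inverted index (term to document ids) combined with Levenshtein edit distance to tolerate typos. This is a simplified stand-in, not Elasticsearch's implementation or API; the sample `docs` and the functions `edit_distance` and `fuzzy_search` are assumptions made for the example.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


docs = {
    1: "reset my password",
    2: "track my order",
    3: "cancel my order",
}

# Inverted index: term -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def fuzzy_search(word, max_edits=1):
    """Return ids of documents with a term within max_edits of `word`."""
    hits = set()
    for term, ids in index.items():
        if edit_distance(word.lower(), term) <= max_edits:
            hits |= ids
    return sorted(hits)


print(fuzzy_search("pasword"))  # [1]
print(fuzzy_search("ordr"))     # [2, 3]
```

Comparing the query against every indexed term is fine for three documents, but a real search engine relies on far more efficient data structures for the same job.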

4. Amazon DynamoDB

It is a distributed key-value database system created by Amazon, and it is highly scalable. Unfortunately, it is not open source. It can easily handle 10 trillion requests per day, so you can see why it is widely used.

More than 700 companies use DynamoDB in their technology stack, including Snapchat, Lyft and Samsung.

When to use DynamoDB?

  1. If you are looking for a database that can handle simple key-value queries in very large numbers.
  2. If you are working with an OLTP workload, such as online ticket booking or banking, where the data must be highly consistent.

5. HBase

It is also a highly scalable open source distributed database system. HBase is written in Java and runs on top of the Hadoop Distributed File System (HDFS).

More than 70 companies use HBase in their technology stack, including Hike, Pinterest and HubSpot.

When to use HBase?

  1. You should have at least petabytes of data to process. If your data volume is small, you will not get the desired results.
  2. If your use case requires real-time, random access to data, HBase will be the right choice.
  3. If you want to store messages in real time for billions of people.

Final notes

This is by no means an exhaustive list. There are more NoSQL databases, but these are the most used in the industry.

If you have worked with any of these databases or any other NoSQL databases, let me know in the comment section below. I would love to hear about your experience!

There is a big difference between the data science we learn in courses and self-practice and the data science we do in industry. I recommend the free courses below to learn about analytics, machine learning and artificial intelligence:

  1. Introduction to AI/ML free course | Mobile app
  2. Introduction to AI/ML for business leaders | Mobile app
  3. Introduction to business analytics free course | Mobile app
