Dealing with Time Series Data: AWS DynamoDB vs GCP Bigtable

Overview

Data is a valuable resource, and in today’s fast-paced world it is increasingly stored in the cloud for accessibility, scalability and security. As businesses and individuals generate more and more data, they can expand their storage capacity in the cloud without having to invest in new hardware or infrastructure. Ask anybody with a technical background where to store your data and they will unhesitatingly say ‘cloud’, but it is just as important to decide which cloud service to store it in. Generally, you would choose a cloud provider’s relational database offering for structured data, and a non-relational database when you are unsure of your client’s growing needs or expect table schemas and structures to change.

Problem Statement

Let’s take the example of an API that returns weather data, with parameters such as ‘temperature’ and ‘atmospheric pressure’. The API can return a reading at any time of a given day, i.e. there are multiple readings of each weather parameter per day, taken at different points in time (in other words, time-series data).

The API might return JSON like the following on 18 March 2023 at 12:30 pm IST (07:00 UTC):

{
    "us-east-1-18-03-23": {
        "temperature": 28.4,
        "atm_pressure": 110045,
        "timestamp": 1679122800
    }
}

The Initial Approach

Since the data returned by the API is semi-structured, and since we cannot be sure whether the API will return other parameters in the future, we cannot rely on a fixed, predefined structure. Let’s explore non-relational databases for this scenario. DynamoDB, one of Amazon Web Services’ offerings for storing non-relational data, comes to mind. Let’s picture how data from the API can be stored in DynamoDB.

Image showing how a sample of 10 time-series records can be stored in DynamoDB

A DynamoDB table can be created in on-demand capacity mode, with a partition key named ‘region’ and a sort key named ‘timestamp’. (DynamoDB has no default TTL attribute: Time to Live is optional and, if old records should expire, it has to be enabled on a separate numeric attribute that holds the expiry time in epoch seconds.) We can have a column named ‘parameters’ to store ‘temperature’, ‘pressure’ or any future parameters coming from the API, and record their corresponding values in the ‘values’ column.
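To make the layout concrete, here is a minimal Python sketch of how one API response could be turned into a single DynamoDB item. It rests on a few assumptions that are not in the original design: the response key is taken to follow the shape `<region>-DD-MM-YY`, the ‘parameters’ and ‘values’ columns are modelled as parallel lists so that the (region, timestamp) key stays unique, and the table name `weather_readings` is hypothetical. Nothing here talks to AWS; `TABLE_PARAMS` only shows the arguments that could be passed to boto3’s `create_table` call.

```python
from typing import Any


def to_dynamodb_item(payload: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Turn one API response into one item keyed by (region, timestamp)."""
    # Assumption: a single top-level key shaped like "<region>-DD-MM-YY".
    key, reading = next(iter(payload.items()))
    region = key.rsplit("-", 3)[0]
    # Everything except the timestamp is treated as a weather parameter,
    # so a new parameter from the API needs no schema change.
    names = sorted(k for k in reading if k != "timestamp")
    return {
        "region": region,                   # partition key
        "timestamp": reading["timestamp"],  # sort key, epoch seconds
        "parameters": names,
        "values": [reading[n] for n in names],
    }


# Arguments that could be passed to boto3.client("dynamodb").create_table(**TABLE_PARAMS).
TABLE_PARAMS = {
    "TableName": "weather_readings",  # hypothetical name
    "AttributeDefinitions": [
        {"AttributeName": "region", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "region", "KeyType": "HASH"},      # partition key
        {"AttributeName": "timestamp", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity mode
}

sample = {
    "us-east-1-18-03-23": {
        "temperature": 28.4,
        "atm_pressure": 110045,
        "timestamp": 1679122800,
    }
}
item = to_dynamodb_item(sample)
print(item)
```

Only the two key attributes are declared up front; ‘parameters’ and ‘values’ are ordinary attributes, which is what lets future parameters arrive without a schema change.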

The Drawback

What if the data grows? DynamoDB scales, of course, but accessing the data now becomes a challenge. Data has to be available very quickly, and that gets harder if readings are recorded every minute of the same day, not to mention the different parameters that may arrive in the future. Access is constrained because a DynamoDB table’s primary key offers only two attributes to look data up by, the partition key and the sort key; anything else needs either a secondary index, which has to be planned and paid for separately, or a filter applied to items that have already been read. Now say there are hundreds to thousands of records, and most of them share the same partition key because the API returns data at any interval throughout the day: every query for that region lands on the same, ever-growing collection of items.
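The difference between a key-based lookup and a filter can be illustrated with a toy in-memory model. This is not DynamoDB itself, only a sketch of the access pattern: the `Partition` class, its method names and the ‘humidity’ parameter are hypothetical, and the model assumes one reading per minute for a single region.

```python
from bisect import bisect_left, bisect_right


class Partition:
    """All items sharing one partition key, kept sorted by the sort key."""

    def __init__(self, items):
        self.items = sorted(items, key=lambda it: it["timestamp"])
        self.sort_keys = [it["timestamp"] for it in self.items]

    def query(self, start_ts, end_ts):
        """Key-based read: binary search on the sort key, touch only the range."""
        lo = bisect_left(self.sort_keys, start_ts)
        hi = bisect_right(self.sort_keys, end_ts)
        return self.items[lo:hi], hi - lo  # (matches, items examined)

    def filter_by_parameter(self, name):
        """Non-key read: every item in the partition is examined, then filtered."""
        matches = [it for it in self.items if name in it["parameters"]]
        return matches, len(self.items)  # (matches, items examined)


DAY_START = 1679097600  # 18 March 2023 00:00 UTC
readings = [
    {
        "region": "us-east-1",
        "timestamp": DAY_START + 60 * minute,
        # Hypothetical: a 'humidity' parameter shows up once an hour.
        "parameters": ["temperature"] + (["humidity"] if minute % 60 == 0 else []),
    }
    for minute in range(24 * 60)
]
partition = Partition(readings)

first_hour, examined_by_key = partition.query(DAY_START, DAY_START + 3599)
humid, examined_by_filter = partition.filter_by_parameter("humidity")
print(len(first_hour), examined_by_key)   # one hour of data, read via the keys
print(len(humid), examined_by_filter)     # a few matches, whole partition read
```

The key-based read examines only the items it returns, while the filter examines the whole day to return a handful of matches, and that gap widens as the partition grows.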

The Solution

Google Cloud Platform offers a non-relational database known as Google Bigtable. How does Bigtable do it differently? Bigtable lets you define your own column families on a table, so every row key can hold values in several column families, each with any number of column qualifiers, all while still being a non-relational database. Strictly speaking, Bigtable indexes only the row key, but because rows are stored in sorted row-key order and a read can be restricted to a particular column family or column, a well-designed row key gives more ways to reach a specific value than DynamoDB’s two key attributes. If we were to store the same API time-series data in Bigtable, it would look something like the below:


_Image showing the same weather API data ingested into Bigtable in the GCP console, with ‘temperature’ and ‘atm_pressure’ as column families_

Google Bigtable is a distributed, highly scalable NoSQL database system designed to handle large amounts of structured data. It is used by Google to power its own services such as Gmail, Google Search, and Google Analytics. Bigtable was originally built on top of Google’s proprietary distributed file system, the Google File System (GFS), which has since been succeeded by Colossus, and it provides a simple data model: a sparse, distributed, persistent multidimensional sorted map. It allows for efficient reads and writes, automatic sharding, and replication across clusters, making it suitable for storing and processing massive amounts of data in real-time applications.

As seen in the image above, column family data can be queried against a row key to pinpoint a single record within a series of records in the table. Specific values can be returned by narrowing the read to particular column families and columns. Bigtable also records a timestamp (in UTC) for each value when it is inserted or updated, as seen below the record of a row key. Provided the row key is designed around the access pattern, Bigtable keeps data access fast as the volume of data or the number of parameters grows, while still retaining non-relational database characteristics like scalability.
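The row key, column family and timestamp behaviour described above can be sketched with a small in-memory model. This is a toy, not the Bigtable client library: the `TinyTable` class, the `region#timestamp` row key format and the `value` column qualifier are all assumptions made for illustration.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy sorted map: row key -> (family, qualifier) -> versioned cells."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed per table
        self.row_keys = []             # kept in sorted order
        self.rows = {}                 # row_key -> {(family, qualifier): [(ts, value)]}

    def write(self, row_key, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        cells = self.rows[row_key].setdefault((family, qualifier), [])
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def read_row(self, row_key, family=None):
        """Latest value per column, optionally restricted to one column family."""
        row = self.rows.get(row_key, {})
        return {
            col: cells[0][1]
            for col, cells in row.items()
            if family is None or col[0] == family
        }

    def scan_prefix(self, prefix):
        """Range scan over the sorted row keys that start with `prefix`."""
        i = bisect_left(self.row_keys, prefix)
        while i < len(self.row_keys) and self.row_keys[i].startswith(prefix):
            yield self.row_keys[i]
            i += 1


table = TinyTable(["temperature", "atm_pressure"])
table.write("us-east-1#1679122800", "temperature", "value", 28.4, ts=100)
table.write("us-east-1#1679122800", "atm_pressure", "value", 110045, ts=100)
table.write("us-east-1#1679122860", "temperature", "value", 28.6, ts=160)
table.write("eu-west-1#1679122800", "temperature", "value", 11.0, ts=100)
table.write("us-east-1#1679122800", "temperature", "value", 28.5, ts=200)  # later update

latest = table.read_row("us-east-1#1679122800", family="temperature")
us_rows = list(table.scan_prefix("us-east-1#"))
print(latest)
print(us_rows)
```

Because epoch-second timestamps of equal length sort lexicographically in time order, a `region#timestamp` row key keeps each region’s readings contiguous and chronological, which is what makes the prefix scan cheap.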

What am I compromising on by migrating from DynamoDB to Bigtable?

Bigtable’s architecture is centralised: a table is composed of rows, each of which describes a single entity, and columns, which contain individual values for each row. Each row is indexed by a single row key, and related columns are grouped together into column families. Bigtable’s data model is column oriented (a wide-column store), as opposed to DynamoDB’s key-value data model, which descends from Amazon’s decentralised Dynamo design. Several of the trade-offs below come from the original Dynamo and Bigtable designs, and the managed services have since narrowed some of them.
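In the original Bigtable design, replication of data happens only within a single data centre, whereas the Dynamo design behind DynamoDB replicates data across multiple data centres. As managed services, both can now replicate across regions (Bigtable through multi-cluster replication, DynamoDB through global tables), so this is a matter of configuration and cost rather than a hard limit.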
DynamoDB places items by hashing the partition key. In the Dynamo design it descends from, this is done with consistent hashing, where every node in the system is assigned to one or more points on a fixed circular space called a “ring”. Each data item, identified by a key, is assigned to a node by hashing its key with a hash function whose output is a point on the ring, and then walking the ring clockwise to find the first node that appears on it. The main advantage of this technique is that the addition or removal of a node only affects its immediate neighbours, while other nodes remain unaffected. Partitioning in Bigtable, by contrast, is key-range based, and data is ordered by row key. Row ranges are called tablets. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets. The Bigtable implementation includes a single master and multiple tablet servers. The master is responsible for assigning tablets to tablet servers, whereas tablet servers are responsible for handling read and write requests to the tablets that they serve, and for splitting tablets that have grown too large. Since each cell in a table belongs to a particular row, each row belongs to a particular tablet, and each tablet is assigned to exactly one tablet server at a time, it is very simple to find the node that stores the data of a specific cell. The only thing to do is to query a special METADATA table, which stores the location of a tablet under a row key.
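DynamoDB clients do not have to wait until their updates reach all the replicas, but in return a default (eventually consistent) read may return stale data unless a strongly consistent read is requested; in the original Dynamo design, clients even had to deal with multiple object versions on reads. Bigtable clients, on the other hand, enjoy a consistent view of their data, but in return they may have to wait in the presence of system failures. In short, DynamoDB leans towards availability at the cost of consistency, while Bigtable leans towards consistency at the cost of availability.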
The original Dynamo design leaves security-related requirements out of scope, since it assumes a trusted internal environment, while the Bigtable design has an adequate authorization mechanism in which access control rights are granted at column family level. For example, it can be configured that three different applications can access data from table 1, where the first application is only permitted to view personal data, the second can view and update personal data, and the third can view and update all users’ data. As managed services, both DynamoDB and Bigtable control access through their platform’s IAM.
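The two partitioning schemes compared above can be sketched side by side in Python. This is a simplified illustration under stated assumptions, not either system’s implementation: `HashRing` gives each node a single point on the ring (no virtual nodes) and uses SHA-256 as a stand-in hash function, `TabletMap` is a hypothetical stand-in for the METADATA lookup, and the node and server names are invented.

```python
import hashlib
from bisect import bisect_left, bisect_right


def ring_point(name: str) -> int:
    """Map a node name or data key to a point on the ring."""
    return int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing with one point per node."""

    def __init__(self, nodes):
        self.points = sorted((ring_point(n), n) for n in nodes)

    def add(self, node):
        self.points = sorted(self.points + [(ring_point(node), node)])

    def owner(self, key):
        # Walk clockwise: first node at or after the key's point, wrapping around.
        i = bisect_left(self.points, (ring_point(key), ""))
        return self.points[i % len(self.points)][1]


class TabletMap:
    """Key-range partitioning: tablet i holds the row keys below split_keys[i]."""

    def __init__(self, split_keys, servers):
        assert len(servers) == len(split_keys) + 1
        self.split_keys = sorted(split_keys)
        self.servers = servers

    def server_for(self, row_key):
        # Sorted split points make the lookup a binary search.
        return self.servers[bisect_right(self.split_keys, row_key)]


keys = [f"us-east-1#{1679122800 + 60 * i}" for i in range(200)]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}
ring.add("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to:", {after[k] for k in moved})

tablets = TabletMap(["m"], ["tablet-server-1", "tablet-server-2"])
print(tablets.server_for("eu-west-1#1679122800"))
print(tablets.server_for("us-east-1#1679122800"))
```

On the ring, every key that changes owner moves to the newly added node and nothing else is reshuffled, whereas the tablet map keeps neighbouring row keys on the same server, which suits time-ordered range scans but can concentrate load on one tablet.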