Microsoft Azure is huge! There are so many services that it is sometimes hard to determine which ones you should use when. This is also true when it comes to storing your data in Azure.
There are many different services in Azure that you can use to store your data in, but which one is right for you? Obviously, the answer is: it depends. And you might also use several different data storage options in one application. This article will show you which options you have to store data in Azure and for what scenarios (I think) they should be used.
Let’s go over the services that you can use to store your data:
If you need to store files and small rows of data at large scale, without advanced query capabilities, Azure Storage is your best bet.
Azure Storage consists out of multiple services that are each optimized for a certain usage scenario. They are described in this post, and here is a summary of them:
* Useful for storing files, small and large, like audio, video or VHD files
* Meant for storing small messages that are picked up by other applications. Queue Storage can help to decouple your applications
* Based on the SMB protocol, File Storage is meant to be mounted as a disk in a VM. It is very useful to use for lifting and shifting applications into the cloud
* Disk Storage is optimized for high I/O operations and can be used as a hard disk for a VM, like a server
All of these services share common features, like data encryption at rest and authentication and authorization. Additionally, by default, everything you store in Azure Storage is replicated 3 times within the chosen datacenter. You can choose to also have the storage replicate to another datacenters using the Geo-redundant storage tier, which than also creates 3 copies of your data in a secondary datacenter, making the total number of copies 6. This ensures that your data is safe in case of failure or disaster.
In general, Azure Storage is highly performant and inexpensive, but offers limited querying capabilities.
Azure Cosmos DB
If you want to store non-relational data and use that all over the world with great performance, than Azure Cosmos DB is a the database for you.
Another notable thing about Cosmos DB is the focus on performance. The database itself is highly performant, regardless of the API that you use and takes care of indexes automatically. There is no need for you to tune your database, this happens automatically. On top of this, performance is part of the SLA, where Microsoft promises a maximum of 10ms latency for read operations and 15ms latency for write operations.
Additionally, it is very easy to create geographic performance by having Cosmos DB replicate data to another region. You just indicate the region and data is replicated in real-time. And better yet, you can keep communicating with your app as you did before and Cosmos DB will figure out which database in which region is the most performant for you. Keep in mind though, that replicas are always read-only replicas, and you only have one writable node.
In summary, Cosmos DB is great at storing and retrieving complex, non-relational data that potentially needs geographic performance.
Azure SQL Database
When you need to store relational data in a transactional manner with advanced querying capabilities, Azure SQL Database is the service for you.
Azure SQL Database is the fully managed cloud equivalent of the on-premises SQL Server product that has been around for decades, and Azure SQL database has been around since the beginning of Azure.
You store data in tables with rows and columns and you can use stored procedures, user-defined functions and all of the T-SQL language spec. Azure SQL Database is a very mature and advanced database.
Additional to being a great place to store relational data, it offers advanced capabilities like automatic indexing, just like Cosmos DB does. It also offers dynamic data masking which you can use to mask sensitive data, like credit card information for certain users, without affecting any of the querying capabilities.
Azure SQL Databases can be extremely performant, especially when you use the In-Memory OTLP feature that allows you to store and retrieve data from memory, just like a cache does. Additionally, Azure SQL Database offers geo-replication, which enables you to create read-only replicas of your databases in other geographic regions, which can help to improve geographic performance.
Azure SQL Database is a great service for storing and retrieving complex, relational data, potentially optimized for geographic performance.
Azure Database for PostgreSQL and MySQL
Relatively new database offerings in Azure are those for PostgreSQL and MySQL. Both of these offerings offer a managed database experience, where you just spin up a database and don’t have to worry about keeping it up and running.
Just like Azure SQL Database, they make an incremental backup every 5 minutes and a full back up every hour that is stored on geo-redundant storage. This means that the backups are replicated 3 times within the primary datacenter and another 3 times in a datacenter in another geographical location. Because of this, you can restore your database from any restore point within 35 days (depending on the pricing tier that you choose) in case of failure or disaster.
Azure Database for MySQL is based on the open-source MySQL database. You should use this if you need to store relational data and query it with a rich SQL language. MySQL is typically used in the LAMP software stack, which stands for Linux, Apache, MySQL and PHP.
Just like with Azure SQL Database, you store data in tables with rows and columns and have things like constraints and relationships.
One thing that is different, is the tooling that you use with MySQL vs. Azure SQL Database. Azure SQL Database works with all the Microsoft tools (Visual Studio, SQL Server Management Studio), where you’d use other tools to work with MySQL, like MySQL Workbench.
I think the choice between MySQL and Azure SQL Database is one of preference and ecosystem. If you like MySQL and are familiar with the tools, go for it. If you work in the Microsoft ecosystem and are familiar with the tools, you probably prefer Azure SQL Database.
Azure Database for PostgreSQL is the fully managed version of the open-source PostgreSQL database. This is an object-relational database, which is different from a relational database like MySQL. You’d use it when you need to store data in table-like structures that support objects, classes and inheritance in the database schema and query language. You’d also use it if you need a lot of flexibility. With that I mean that PostgreSQL supports a lot of interoperability, including running stored procedures in more than a dozen programming languages.
Besides that PostgreSQL is an object-relational database, the tooling that you use is also different from what you use with MySQL and Azure SQL Database. For PostgreSQL you’d use other tools, like pgAdmin.
I think that the choice for Azure Database for PostgreSQL is one that you make when you are comfortable with PostgreSQL and its tooling and when you need specific features like the flexibility with stored procedures which you don’t get with MySQL and Azure SQL Database.
Azure Synapse Analytics (previously Azure SQL Data Warehouse)
If you need to store large amounts of data in structured, relational formats for reporting purposes, than Azure Synapse Analytics is for you.
Based on Azure SQL Database, Azure Synapse Analytics stores data in tables with rows and columns and has features like indexes, constraints and keys. The difference is that Azure Synapse Analytics is optimized for reporting purposes, where Azure SQL Database is optimized for transactional processing of data, which happens in most regular data entry applications. You could actually build a data warehouse with Azure SQL Database is you want, but it’s not meant for that as there are some differences.
Azure Synapse Analytics can host a lot of data in one place. In fact, there is no actual limit to the amount of data that you can store in Azure Synapse Analytics (besides your budget). The amount of data that you can store in one database is more limited in Azure SQL Database.
Azure Synapse Analytics is also optimized for executing complex queries, which makes it much more performant for reporting than Azure SQL Database. Additionally, Azure Synapse Analytics supports Polybase, which Azure SQL Database doesn’t. Also, Azure SQL Database has more features for supporting live applications, like geo-replication and In-Memory OTLP, which Azure Synapse Analytics doesn’t have.
Before you load data in SQL Data Warehouse, you define the data schema that you want to use – you define the way that the data is structured. By doing this, you can make the data ready for analysis and reporting tools, like PowerBI.
In summary, Azure Synapse Analytics is your preferred datastore if you need to store and retrieve large amounts of relational data for reporting purposes.
Azure Data Lake Store
Another store that is optimized for storing large amounts of data for reporting and analytical purposes is the Azure Data Lake Store. The Data Lake store is geared towards storing data in its native format, making it a great store for non-relational data.
The Data Lake Store works differently from SQL Data Warehouse in that you don’t define a data schema upfront where you do need to do that with a SQL Data Warehouse. You store data in whatever form it comes in and define a schema when you retrieve it.
The Data Lake Store is a place where you put large amounts of data that you want to analyze for information. For when you don’t yet know which questions you want the data to answer. Once you know the questions to ask and the format of the answer, you can transform the data in the Data Lake into more structured data with tools like Azure Data Lake Analytics, Azure HDInsight or Azure Analysis Services and move the structured data into Azure SQL Data warehouse to be visualized by tools like PowerBI.
Azure Data Lake Store is the datastore that you use for large amounts of semi- and unstructured, non-relational data that you need to process at scale.
What to use when
There are a lot of options for storing data in Azure. What to use when? The summary below might help.
*** OLAP stands for Online Analytical Processing, which is the processing and analyzing of data
** OLTP stands for Online Transactional Processing, which is for transaction-oriented applications, like data entry applications