Jan 4, 2024
Databases in 2023: A Year in Review
I am starting this new year the same way I ended the last: taking antibiotics because my biological daughter brought home a nasty sinus bug from Carnegie Mellon’s preschool. This was after my first wife betrayed me and gave me COVID. Nevertheless, it is time for my annual screed on last year’s major database happenings and trends since a lot has happened.
My goal is to keep my trenchant opinions fair-minded and avoid catchpenny headlines. If you are new to this game or cannot get enough of my uninhibited wit, you can also check out my other end-of-year database reviews for 2022 and 2021.
Want to view a video version of this blog with additional commentary? Check out the video companion below. You can keep scrolling to read the blog.
Rise of Vector Databases
This was clearly the year of the vector database. Although some of these systems have been around for a couple of years, the widespread interest in LLMs and services built on top of them (e.g., ChatGPT) brought them to the forefront this last year. Vector databases promise to provide deeper searches on data (especially unstructured data) based on the data's semantics rather than just its contents. That is, instead of searching for documents that contain exact keywords (e.g., "Wu-Tang Clan"), an application can search for documents that are related to topics (e.g., "hip-hop groups with songs about slinging").
The "magic" that facilitates these types of searches is transformers that convert data into a fixed-length, one-dimensional vector of floating point numbers called an embedding. The values in these embeddings are indecipherable to humans. Instead, their contents encode some relationship between the parameters and the transformer's training corpus. The size of these embedding vectors ranges from 100s of dimensions for simple transformers to 1000s for high-end models.
Suppose one generates embeddings for all the records in a database using a transformer. In that case, one can search for similar records for a given input by finding the record embeddings closest to the search embedding in high-dimensional space. But brute-force comparing all vectors to find the closest matches is horribly expensive.
The complexity of this search is O(N * d * k) where N is the number of embeddings, d is the size of each vector, and k is the number of matches you want -- if you don't know what this means, trust me, it is terrible. Thanks to Ankur Goyal (CMU'17) for the correction.
This is where vector DBMSs come into the picture. At its core, a vector DBMS is just a document database with a specialized index data structure to accelerate similarity search on embeddings. Instead of performing an exact match for the most similar vectors per query, these systems can use approximate searches to generate results that make the trade-off of being "good enough" in exchange for returning the result more quickly. That's it.
After being knocked down by the 2022 crash of blockchain database mumbo-jumbo, the VCs took a bracing huff of smelling salts and got excited once again. They turned out their money in big funding rounds for almost all the major players in the vector database game. For seed rounds in 2023, Marqo popped a $5.3m seed, Qdrant got a $7.5m seed, and Chroma pulled out a large $18m seed. Weaviate came up big with their $50m Series B in April 2023. But Pinecone led the pack in 2023 with a mouthwatering $100m Series B round. The vector databases companies were clearly in the right place at the right time.
💩 Andy's Take:
After LLMs became "mainstream" with ChatGPT in late 2022, it took less than one year for several DBMS vendors to add their own vector search extensions. These include SingleStore, Oracle, Rockset, and Clickhouse. Several PostgreSQL-derivative systems also announced support for vector search; some use the pgvector extension (Supabase, AlloyDB), while others use other open-source ANN libraries (Timescale, Neon). Leading NoSQL DBMSs like MongoDB and Cassandra also added vector indexes.
It is interesting to contrast the quick proliferation of vector support across multiple DBMSs with the rise of JSON data types. NoSQL systems that natively store JSON became popular in the late 2000s (e.g., MongoDB, CouchDB), but it took several years after that until the relational DBMS incumbents added support for JSON (e.g., PostgreSQL, Oracle, and MySQL added JSON data types in 2012, 2014, and 2015, respectively). The SQL standard added functions to operate on JSON data in SQL:2016, but it did not add an official JSON data type until SQL:2023. This delay is somewhat surprising given that many relational DBMSs already supported the conceptually similar XML data type.
There are two likely explanations for the quick proliferation of vector search indexes. The first is that similarity search via embeddings is such a compelling use case that every DBMS vendor rushed out their version and announced it immediately. The second is that the engineering effort to introduce what amounts to just a new access method and index data structure is small enough that it did not take that much work for the DBMS vendors to add vector search. Most vendors did not write their vector index from scratch and instead just integrated one of the several high-quality open-source libraries available (e.g., Microsoft DiskANN, Meta Faiss).
But if the engineering effort to add reasonable vector search capability to a DBMS is low, the danger is that the vector DBMS vendors do not have a large enough moat to keep existing DBMSs from encroaching on their turf.
I recently told the co-founders of both the Pinecone and Weaviate that there are two paths their systems could take (see my podcast interview with Weaviate’s CTO). The first way is that their customers will start using vector DBMSs as their "database of record," and then vendors will add better support for operational workloads. They will end up looking more like popular document DBMSs (e.g., MongoDB), and then in five years, they will add support for SQL like the NoSQL systems that preceded them. The alternative path is that vector DBMSs will remain secondary databases updated via changes from upstream operational DBMSs. This is how people use search engine DBMSs like Elastic and Vespa. In that case, the vector DBMSs can survive without expanding their query languages or having a more structured data model.
Side Comment: I recently recorded a Q&A session on vector vs. relational databases. You can hear my prediction that every relational DBMS will have a high-performance implementation of a vector index within the next five years.
SQL Keeps Getting Better
This upcoming year will be the 50th anniversary of the creation of SQL at IBM Research by Don Chamberlain and Ray Boyce (RIP). Originally known as SEQUEL (Structured English QUEry Language), SQL has been the de facto standard programming language for interacting with databases since the 1980s. Despite its age, SQL's usage and capabilities have increased, especially in the last decade.
This past year saw the latest incarnation of the ISO/IEC 9075 specification, better known as SQL:2023. The update includes many "nice to haves" that deal with frustrations and inconsistencies in various SQL dialects (e.g., ANY_VALUE). Two enhancements to SQL that further erode the need for alternative data models and query languages are worth mentioning here. Remember that just because the SQL specification includes something does not mean that your favorite relational DBMS will immediately support these new features.
Property Graph Queries (SQL/PGQ):
SQL now supports defining read-only queries on graphs. This allows an application to declare a property graph structure over existing tables. The example is for a graph in Oracle v23c that keeps track of which people are in which bands:
It is left up to the DBMS to decide whether to create an auxiliary data structure (e.g., adjacency matrix) for the property graph or just keep track of the meta-data. You can then write graph traversal queries in SQL using the MATCH keyword. The syntax builds on existing languages (e.g., Neo4j's Cypher, Oracle’s PGQL, and TigerGraph’s GSQL), and shares aspects of the emerging GQL standard. The following query returns the number of members in each band:
As of January 2024, the only DBMS that I am aware of that supports SQL/PGQ features is Oracle. There is an experimental branch of DuckDB that also supports SQL/PGQ, but I could not get my above example to work because their supported syntax is slightly different. You can learn more about SQL/PGQ from this great list of resources curated by CWI/DuckDB researcher Gabor Szarnyas.
Multi-Dimensional Arrays (SQL/MDA):
SQL has support for arrays since SQL:1999 introduced limited single-dimension, fixed-length array data types. SQL:2003 expanded the functionality to support nested arrays without a predefined maximum cardinality. The SQL/MDA update in SQL:2023 supports true multi-dimensional arrays with arbitrary dimensions using integer-based coordinates. Rasdaman's RQL heavily inspired the SQL/MDA syntax to provide structural and operational array constructs compatible with SQL and orthogonal to its set semantics. These enhancements allow applications to interact and manipulate multi-dimensional arrays entirely in SQL without having to export them, like to a Python notebook. The table below (source) shows different examples of using the
MDARRAY data type in a
CREATE TABLE statement:
Although the SQL/MDA specification has been available since 2019 as a tech report, it was not put into the official SQL standard until SQL:2023. As far as I know, no production-ready DBMS supports the SQL/MDA extensions other than Rasdaman. The only other prototype I could find is a fork of HSQLDB named ASQLDB.
💩 Andy's Take:
The SQL:2023 revision is the next stage in the continuous evolution and improvement of the ubiquitous language. SQL is not perfect, of course, and it is not truly portable since every DBMS has its quirks, proprietary features, and non-standard extensions. I am personally fond of PostgreSQL's :: cast operator shortcut.
SQL/PGQ is a big deal. However, I do not foresee it being an immediate deathblow for graph DBMSs, as there are already several ways to translate graph-oriented queries to SQL. Some DBMSs, including SQL Server and Oracle, provide built-in SQL extensions that make storing and querying graph data easier. Amazon Neptune is a graph-oriented veneer on top of their Aurora MySQL offering. Apache AGE provides an OpenCypher interface on top of PostgreSQL. I expect other major OLAP systems (e.g., Snowflake, Redshift, BigQuery) will support SQL/PGQ in the near future.
Adding SQL/PGQ in a DBMS is not as simple as adding support for the new syntax. There are several engineering considerations to ensure graph queries perform well. For example, graph queries perform multi-way joins to traverse the graph. But a problem arises when the intermediate results for these joins are larger than the base tables. A DBMS must use a worst-case optimal join (WCOJ) algorithm to execute such joins more efficiently than the usual hash join used when joining two tables. Another important technique is to use factorization to avoid materializing redundant intermediate results during joins. This type of compression helps the DBMS avoid blowing out its memory with the same join record over and over again.
I bring these optimizations up not to say that existing graph DBMSs implement them because, as far as I know, leading systems like Neo4j, TigerGraph, and others do not. The only graph-oriented system I am aware of is the embeddable Kuzu DBMS from the University of Waterloo. Most relational DBMSs also do not implement them (at least the open-source ones). The DuckDB experimental branch mentioned above implements both WCOJ and factorization optimizations and showed in a 2023 paper that it outperforms Neo4j on an industry-standard graph benchmark by up to 10x.
As I've said before, SQL was here before you were born and will be here when you die. I will continue to scoff at claims that natural language queries will completely replace SQL.
Side Comment: It’s been two years since I made my public bet that graph DBMSs will not overtake relational DBMSs in the database marketplace by 2030. So far I’m still good.
Troubles in the Sea Lion Kingdom
MariaDB was in the news a lot this past year, and not in a good way. We found out that the MariaDB Corporation (which is separate from the MariaDB Foundation) is apparently a dumpster fire. In 2022, the Corporation backdoor IPO-ed through a sketchy merger instrument known as a SPAC. But the stock ($MRDB) immediately lost 40% of its value three days after its IPO. Because the Corporation decided to speedrun its way to the NYSE and become a publicly traded company, its dirty laundry became public. By the end of 2023, the stock price had dropped by over 90% since its opening.
In the wake of these financial problems, the Corporation announced two stints of layoffs. The first was in April 2023, but they had another larger round in October 2023. The Corporation also announced they were killing off two products: Xpand and SkySQL. The Corporation acquired Xpand in 2018, when it was previously known as Clustrix; I visited Clustrix's SFO office in 2014 and thought it was a creepy ghost town back then (large office with half the lights turned off) . The history of SkySQL is more complicated. It was originally a separate company providing MariaDB as a service but then it merged with Monty Program AB in 2013. Then, in 2014, the combined Monty Program AB + SkySQL company became the MariaDB Corporation we have today. But in December 2023, the Corporation announced that SkySQL is not dead and is back on the streets as an independent company!
Things are so rotten at MariaDB Corporation that the Foundation's CEO wrote an article complaining about how their relationship with the Corporation has soured since the IPO and they are hoping to "reboot" it. Other problems include Microsoft announcing in September 2023 that they will no longer offer MariaDB as a managed Azure service. Microsoft will instead focus on supporting MySQL. And just in case you are not aware, MariaDB is a fork of MySQL that MySQL's original creator, Monty Widenus, started after Oracle announced its acquisition of Sun Microsystems in 2009. Recall that Sun bought MySQL AB in 2008 after Oracle bought InnoBase (makers of InnoDB) in 2005. But now MySQL is doing fine and MariaDB is one with problems. You don't need to watch movies or television shows for entertainment! You can get all the drama you need in your life through databases!
💩 Andy's Take:
The saviness of the database customer has changed for the better in the last ten years. No longer can companies "fake it until you make it" with flashy benchmark numbers, new query languages that promise to replace SQL, or celebrity endorsements. A DBMS's reputation matters more than ever. And the company building the system matters just as much. This means that it is vital that the DBMS software itself is solid and that the company that is building it has its act together.
And before you think being open-source insulates you from a company going under, few DBMS projects continue on and thrive when their founding for-profit company fails. PostgreSQL sort of counts even though the open-source version we have today is based on the UC Berkeley source code and not the commercial Illustra version (which was acquired by Informix in 1996). The only other example is after the company building the InfiniDB OLAP engine for MySQL went bankrupt in 2014, its GPLv2 source code was picked up and continued as MariaDB's ColumnStore.
Instead, there are plenty of examples where the DBMS withers away once the company paying for most (if not all) of the development goes away. The only two examples where the DBMS sort of lived on are Riak and RethinkDB. Basho went bankrupt in 2017, and now Riak is maintained by one person who works for the UK's NHS. The RethinkDB company died in 2017 (not surprising, given the founder's thoughts about women in tech), and the DBMS source code was moved over to The Linux Foundation. Despite the Foundation taking over, the RethinkDB project is on life support: the project put out a new release in 2023, but they are only hotfixes to deal with bit rot. You can also see other abandoned DBMS projects in The Apache Foundation's Attic.
Cloud-only DBaaS makes this issue even more pernicious because if the company fails (or starts to face financial pressure), it turns off the servers hosting your databases. Xeround gave their customers two weeks to migrate their databases when they shut them down in 2013. To cut costs, InfluxDB gave their customers six months before they deleted entire regions in July 2023, but people still were caught by surprise.
MariaDB is in a better position than the average database start-up because Monty and others established the non-profit foundation that has control of the open-source project. But it's a bad sign when you're a for-profit company for an open-source DBMS, and the non-profit organization helping you build that DBMS publicly says you're a hot mess! It's as bad as Chuck D feuding with Flavva over doing a Public Enemy reunion tour. Meanwhile, MySQL continues to improve, and Oracle has been a reasonably good corporate steward of the system (at least from an engineering perspective). But the MariaDB Corporation mess will further bolster society's movement towards PostgreSQL.
MariaDB also can't fail because, as far as I know, Monty does not have any more children to name databases after (e.g., MaxDB, MySQL, MariaDB).
Government Database Crash Grounds US Air Travel
On January 11, 2023, the Federal Aviation Administration (FAA) grounded all flights in the US because of a NOTAM system outage. The NOTAM system provides plaintext-encoded messages to pilots with warnings about unexpected changes or potential hazards they may encounter on their flight paths. When the NOTAM system crashed on the morning of January 11th, it halted the take-off of roughly 11,000 flights in the US. Other countries also operate independent NOTAM systems unaffected by the US NOTAM failure.
According to the FAA, the outage was due to a corrupt database file. An engineer from a third-party contractor tried to replace the file with a backup, but it turned out that the backup file was messed up, too. A similar problem also caused a 2008 outage of the FAA’s legacy infrastructure.
There is no public information about the DBMS the FAA uses for the NOTAM system. Some reporting suggests it is still running on two Philips DS714/81 mainframes from 1988. But these Philips DS714 machines did not have an operating system as we know them today; they are a relic of the 1960s mainframe era. That means the FAA could not use an off-the-shelf DBMS for this application in the 1980s, even though several existed (e.g., Oracle, Ingres, and Informix all supported various Unix platforms at the time). The best analysis I can find speculates that the NOTAM system managed the database itself, likely using flat files (e.g., CSV). The application code written by non-database experts in the 1980s was responsible for reading/writing records from the file, replicating to standby server, and maintaining data integrity in case of a crash.
💩 Andy's Take:
Running a mission-critical system on irreplaceable legacy hardware with a custom database access library written by in-house developers that have long retired is every database researcher’s worst nightmare. I am surprised it did not crash sooner (unless the 2008 failure was for the same system), so I guess we should give them credit for keeping it going for 35 years.
Sources claim that the NOTAM system only processed 20 messages per second. That is undoubtedly small by modern standards, but remember that the FAA provisioned this system in the 1980s. Database legend and 1998 Turing Award winner Jim Gray (RIP) wrote in 1985 about how “ordinary” DBMSs could execute about 50 transactions per second (txn/sec) and the very high-end ones could reach up to 200 txn/sec. For reference, five years ago, somebody achieved roughly 200 txn/sec running PostgreSQL on a Raspberry Pi 3 using a benchmark from the 1980s (i.e., TPC-B, which was based on TPC-A). Ignoring systems that use strongly consistent replication across data centers (which are bottlenecked by the speed of light), a modern single-node OLTP DBMS can achieve a throughput of millions of txn/sec for some workloads. NOTAM’s peak throughput of 20 messages per second was not pushing the state-of-the-art in the 1980s, and it certainly is not doing that today.
Because NOTAM did not separate the database from the application logic, independently upgrading those components was impossible. It is fair to fault them for this design choice since the virtues of the relational model were well-known by the mid-1980s. Not that SQL would have prevented this exact failure (it was a human error), but such independence makes the individual components less unwieldy and more manageable.
Nevertheless, the US government was already using commercially available relational DBMSs at the time. For example, the 1988 IPO filing for Stonebraker’s RTI (maker of Ingres) mentions that their existing customers include the Departments of Defense and Interior, branches of the military, and research labs. I am sure other departments in the US government were using IBM DB2 and Oracle at the time as well. So, unless there was something about the NOTAM operating environment that I am unaware of, they could have used a real DBMS.
I was flying back from CIDR 2023 in Amsterdam when this happened. Luckily, it did not affect inbound international flights, so our plane could land unaffected. But then I was stuck in Newark Airport because all domestic flights were grounded. If you’re unfamiliar with that airport, its off-brand terminal is cramped for regional flights like Newark to Pittsburgh. I didn’t want to stay long because I’m not trying to get my toe tagged in Newark.
Side Comment: See our write-up from last year why a NOTAM database crash is unlikely to happen if it was running Amazon RDS.
Database Money Stuff
Beyond the vector DBMS VC feeding frenzy mentioned above, there was still some activity in the VC space for other types of database systems. But overall, database funding activity this year was much more muted than in previous years.
Automated tuning start-up DBTune raised a $2.6m seed round in Europe. PostgresML got $4.5m for their seed round to build a DBaaS with custom extensions to invoke ML frameworks from SQL. TileDB announced their $34m series B in the fall to continue building their array DBMS. Despite being over 13 years old, SQReam got a $45m Series C for their GPU-accelerated DBMS. Neon landed a $46m Series B in August 2023 to expand their serverless PostgreSQL platform. And, of course, the funding winner again in 2023 is Databricks with their blowout $500m Series I in September 2023. Yes, this is a lot of money. But it is not as much as their $1.6b Series H in 2021.
Update 2024-01-05: I also forgot to mention that MotherDuck (the commercial version of DuckDB) scored their $52.5m series B in September 2023 (thanks Peter Boncz). Another aquatic animal-themed database product, DBeaver, got a $5m seed round for their beloved multi-DBMS administration tool (thanks Tianzhou Chen).
There were some acquisitions in the database space in 2023 as well. The biggest one happened at the beginning of the year when MarkLogic was bought by Progress Software for $355m in straight cash money. MarkLogic is one of the oldest XML DBMSs (circa 2001), while Progress owns OpenEdge, an even older DBMS (circa 1984). IBM acquired the Meta-spinoff Ahana that was attempting to commercialize PrestoDB (which is different from the hard fork PrestoSQL, since renamed to Trino). Multi-cloud database service provider Aiven acquired the AI-powered query rewriter EverSQL start-up. EnterpriseDB spent Bain Capital's money to acquire the Seafowl team building a PostgreSQL-compatible OLAP engine based on DataFusion. Snowflake acquired two start-ups from my database colleagues: (1) Sisu Data from ex-Stanford professor Peter Bailis and (2) Ponder (based on Modin) from Berkeley professor Aditya Parameswaran.
💩 Andy's Take:
My VC friends are telling me that they saw more new company pitches but wrote fewer checks in 2023 compared to previous years. This trend is across all start-up fields, and the database market is not immune to it. Anything loosely connected to AI + LLMs received the bulk of the attention (rightly so, as it is a new chapter in computing).
Although some of the US's macroeconomic indicators for 2023 are positive, the tech industry is still skittish, and everyone is still looking to cut costs. With OtterTune, customers wanted our database optimizations to be more aggressive in helping them reduce their database infrastructure costs in 2023. This differs from the company's earlier years when people came to us primarily to improve their DBMS's performance and stability. We plan to announce new features to help reduce database costs in 2024. Back at the university, I had a larger-than-normal number of students asking me to help them find database dev jobs this semester. These requests for help were surprising as CMU CS students have always been able to get good internships and full-time positions on their own (except that time one of my best undergrads rewrote our query optimizer but then couldn't find a summer internship because he forgot to ask me and he ended up doing web development for Dick's Sporting Goods near the Pittsburgh airport – he now works happily at Vertica).
If the US tech market continues this way, many database start-ups will struggle to reach the next stage in the next few years. The smaller DBMS start-ups will likely get gobbled up by the large tech companies or private equity, or just die. But the companies that have raised a lot of money with high valuations are heading into rough waters. As I said before, some may not be able to go IPO, and no big tech company will need their DBMS since everyone has their own these days. Hence, these larger DBMS companies will have three choices. They can take a down round to keep the lights on and the company going. They can get propped up on life support by private equity (e.g., Cloudera). Or they can get bought by an IT services company (e.g., Rocket, Actian) that puts the system into maintenance mode but continues to milk the licensing fees from trapped customers with legacy applications they cannot easily migrate. These three paths are less than ideal for a database company and should rightly scare off potential new customers.
I finish with a reminder that, once again, the question to ask is not if Databricks will IPO but rather when it will.
The Most Expensive Password Change Ever
The OG database don, Larry Ellison, was riding high in 2023. It was a banner year for him in an already outstanding career. In June 2023, he returned to his rightful position as the 4th richest person in the world. Oracle's stock price ($ORCL) increased by 22% in 2023, slightly under the S&P 500’s 24% return. Larry then went up to Redmond in September 2023 for the first time in his life. He appeared on stage with Microsoft's CEO Satya Nadella to announce that Oracle's DBMS is now available as a managed service in Azure's cloud platform. Later, in November 2023, shareholders voted overwhelmingly to keep Larry as Oracle's board chairman at age 79.
But the real big news in 2023 was how Elon Musk personally helped reset Larry's Twitter password after he invested $1b in Musk's takeover of the social media company. And with this $1b password reset, we were graced in October 2023 with Larry's second-ever tweet and his first new one in over a decade. Larry portended his upcoming trip to the University of Oxford, where he later announced the establishment of the Ellison Institute of Technology (EIT) on Oxford's campus.
💩 Andy's Take:
If you're like me, then you remember precisely where you were and what you were doing when Larry's mythic second tweet dropped. I couldn't care less what the tweet was about. Sure, the planned research topics are super interesting, but that's not the important part. It is the fact that Larry tweeted again that makes this a momentous occasion. Both Larry and I are pleased with the positive reception that he received from it. I know from personal experience (don't ask how) that Larry occasionally does read his Twitter feed and is especially fond of detailed start-up pitches, thirsty birthday greetings, and random shower thoughts.
Larry's tweet was so unexpected because you assume he is always too busy doing more grandiose activities in his life. After all, the man owns a MiG-29 fighter jet and a Hawaiian island. He has so many choices of better things to do. So when he takes time out of his day to write a message to let us know what he is doing on a social media site in decline, this is a significant life event for all of us. Yes, Larry had to ask a friend to help him reset his password first, and that friend happens to be the wealthiest person in the world. But who among us has not had to reset a parent's password? Maybe not after spending $1b first, but that's nothing when you're worth $103b.
More Awesomeness Next Year
I am looking forward to 2024 and hope to spend more time on databases. It will also mark the fourth anniversary of Dana, Bohan, and me establishing the OtterTune company. We have learned a lot and expanded our database optimization service well beyond the original academic prototype.
We plan to share some highlights and findings on improving real-world MySQL and PostgreSQL DBMSs using AI in the coming year. We also have some new enhancements in the works that will make it easier for anyone to maintain a healthy and happy database.