Dec 31, 2022

Databases in 2022: A Year in Review

Andy Pavlo


Another year has gone by, and I’m still alive. As such, it is an excellent time to reflect on what happened in the world of databases last year. It was quiet in the streets, as the benchmark wars between DBMS vendors have died down. I had fun writing last year’s retrospective, so I am excited to share the things that stood out to me from 2022 and my thoughts on them.

Big database funding has slowed big time

As I discussed last year, 2021 was a banner year for database funding. There was a lot of money being thrown at start-ups building new DBMSs as investors continued to search for the next Snowflake. The beginning of 2022 looked like it was going to be a repeat of the previous year, with lots of big funding round announcements.

The party started in February with Timescale’s $110m series C, Voltron Data’s $110m seed + series A, and dbt Labs’s $222m series D. Starburst announced their $250m series D in March to expand their Trino offering. Imply pulled out a $100m series D in May for their commercial version of Druid. DataStax got $115m in funding on its way to IPO in June. Lastly, SingleStore dropped their $116m series F in July and then extended it with another $30m in October.

There were several other smaller companies with impressive series A rounds in the first half of 2022, including Neon’s $30m series A for their serverless PostgreSQL offering, ReadySet’s $29m series A for their query caching layer, Convex’s $26m series A for their application framework built on PostgreSQL, and QuestDB’s $15m series A for their time-series DBMS. Although we’re not building a new DBMS or related infrastructure, OtterTune flexed with our own $12m series A in April.

But then, the massive funding rounds stopped in the second half of 2022. Although there were smaller rounds for early-stage start-ups, there were no more nine-figure dollar amounts for more seasoned companies.

RisingWave copped a $36m series A for their stream processing engine in October. Keebo raised a $10.5m series A for their Snowflake query accelerator. In November, we saw announcements for MotherDuck’s $45m seed + series A to commercialize a cloud version of DuckDB and EdgeDB’s $15m series A. Lastly, the SurrealDB brothers picked up a $6m seed round. I probably missed some others, but this is not meant to be an exhaustive list.

The only other notable financial event in databases was MariaDB’s disastrous public offering in December (via a SPAC), where the share price dropped by 40% on its first day of trading.

Andy's take

There are two reasons for the reduction in big funding rounds in 2022 compared to 2021. The most obvious reason is that the entire tech sector has cooled, partly fueled by concerns about inflation, interest rates, and the collapse of the crypto economy. The other reason is that everyone in a position to raise a large round already did so before things dried up.

For example, Starburst raised its series D in 2022 after its $100m series C in 2021. The database companies that raised huge rounds in the last two years will need to raise more money soon to keep the growth train going. Others have commented on the staggering amounts that these companies are getting.

The bad news is that these companies are in trouble unless the tech sector improves and big institutional investors start putting their money out on the street again. The market cannot sustain so many independent software vendors (ISVs) for databases. The only paths forward for these companies with billion-dollar valuations are IPO or bankruptcy. They are too expensive for most companies to acquire (unless the VCs are willing to take massive haircuts).

Furthermore, the major tech companies (e.g., Amazon, Google, Microsoft) that do large M&As already have their own cloud database offerings. Hence, it is not clear who will acquire these database start-ups. It does not make sense for Amazon to buy Clickhouse at their 2021 $2b valuation when they are already making billions per year from Redshift. This problem is not exclusive to OLAP database companies; OLTP database companies will face the same issue soon.

I am not the only one making such dire predictions about the fate of database start-ups. Gartner analysts predict that 50% of independent DBMS vendors will go out of business by 2025. I am obviously biased, but I think the companies that will survive will be the ones that work in front of DBMSs to improve/enhance them rather than replace them (e.g., dbt, ReadySet, Keebo, and OtterTune).

I cannot comment on whether the SPAC “speedrun to IPO” method, like what MariaDB did, is a good idea. Such financial instruments are outside of my area of expertise (i.e., databases). But since it is the same thing the previous US president is doing with his social media company, I assume that it’s probably shady AF.

Blockchain databases are still a stupid idea

There have been wild claims about how web3 represents a radical change in how people will build new applications. I had a student storm out of my class because I was teaching relational databases instead of web3. The core tenet of the web3 movement is storing state in a blockchain database.

Blockchains are essentially decentralized log-structured databases (i.e., ledgers) that maintain incremental checksums using some variation of Merkle trees, plus a Byzantine fault-tolerant (BFT) consensus protocol to determine the next update to install into the database. These incremental checksums are how a blockchain ensures that the database’s log records are immutable: clients use the checksums to verify that previous database updates have not been altered.
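The incremental-checksum idea can be sketched in a few lines of Python. This is a toy hash chain (the degenerate case of a Merkle structure) for illustration only; real blockchains batch records into blocks and use full Merkle trees, but the tamper-evidence property is the same:

```python
import hashlib

GENESIS = "0" * 64  # checksum of the (empty) chain before any records

def chain_checksum(prev_checksum: str, record: bytes) -> str:
    """Fold the next log record into the running checksum."""
    return hashlib.sha256(prev_checksum.encode() + record).hexdigest()

# Build the ledger: each entry's checksum covers all prior entries.
log = [b"alice pays bob 5", b"bob pays carol 2"]
checksums = []
prev = GENESIS
for record in log:
    prev = chain_checksum(prev, record)
    checksums.append(prev)

def verify(log, checksums):
    """Recompute the chain; any altered record breaks every later checksum."""
    prev = GENESIS
    for record, expected in zip(log, checksums):
        prev = chain_checksum(prev, record)
        if prev != expected:
            return False
    return True

assert verify(log, checksums)
log[0] = b"alice pays bob 500"   # tamper with history
assert not verify(log, checksums)
```

Note that nothing here requires decentralization: a single trusted server can publish the same checksums and get the same immutability guarantee, which is the crux of the argument below.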

Blockchains are a clever amalgamation of previous ideas. But the belief that a decentralized ledger is how everyone should build their OLTP applications is misguided. From a database perspective, they have nothing to offer over existing DBMS technologies for any practical use-case other than cryptocurrencies. Furthermore, any claim that blockchains provide better security and auditability in databases over existing DBMSs is simply wrong.

Cryptocurrency was the best-case scenario for blockchain databases, so it did not help their future when the crypto market crashed in 2022. I’ll ignore the collapse of FTX for this discussion because it appears to be straight-up fraud and has nothing to do with databases. I will point out, however, that FTX, like all other crypto exchanges, didn’t run their business on a blockchain database and instead used PostgreSQL.

Other blockchain database use cases not related to cryptocurrencies, like trading and gaming platforms, have fizzled out because they were impractical or outright scams.

Andy's take

One rule to follow when assessing a technology is that it is no longer “new” once IBM makes a television commercial about it. This means if there is no compelling use case for something by the time IBM starts advertising about it, then there will never be one.

For example, IBM was touting Linux as a hot new thing in a 2002 commercial, but thousands of companies were already using it as their primary server OS by then (including Google). So when IBM put out their blockchain commercial in 2018, I knew that the technology was going nowhere beyond cryptocurrencies because there was not a problem that a decentralized blockchain could solve that a centralized DBMS could not.

And it’s not surprising that IBM announced this year that they were shutting down their supply chain IT infrastructure overhaul project with shipping giant Maersk (the same thing they hyped in their commercial).

Blockchains are horribly inefficient compared to a well-written transactional DBMS controlled by a trusted authority that only allows trusted clients to connect directly. Almost every real-world interaction works this way except cryptocurrencies (see above) or illicit activities like trapping.

We are required to trust others to have a functioning society. For example, I authorize the company that hosts the OtterTune website to charge our credit card, and they trust a cloud provider to host their software. Nobody needs a blockchain database for these transactions.

Switching from a proof-of-work (PoW) to a less energy-intensive proof-of-stake (PoS) consensus mechanism does improve the performance of blockchain databases. But that only affects the database’s throughput; blockchain transaction latencies are still measured in tens of seconds. If the solution to these long latencies is to use a PoS blockchain with fewer participants, then the application is better off just using PostgreSQL and authenticating those participants.

See this great article from Tim Bray about internal discussions he had with AWS top brass about whether there was a viable use case for blockchains. Note that he says AWS concluded in 2016 that blockchain databases were a solution looking for a problem, two years before IBM came out with their commercial!

And although AWS did eventually release its QLDB service in 2018, it is not the same thing as a blockchain; it is a centralized verifiable ledger that does not use BFT consensus. Customer adoption of QLDB has been less than stellar, especially compared to Amazon’s wildly successful Aurora offerings.

Side Comment #1: I recently participated in a panel discussion at an SFO conference where SBF also flew in from the Bahamas to appear. I stuck around to see his talk. The audience went wild when SBF arrived on the stage. My contemporary Slack messages indicate I was not impressed with SBF’s “yep” responses to the moderator’s questions.

Side Comment #2: Three weeks before FTX imploded, somebody pointed out to Dana Van Aken and me that OtterTune had the same number of full-time engineers as FTX’s team in the Bahamas. This person then told us that since we had the same number of engineers, OtterTune should be more agile/aggressive like FTX and have $1b in ARR by now. Oops.

New database systems

This year there were several major announcements for new DBMS software.

  • Google AlloyDB: The biggest bombshell this year was when Google Cloud announced its new database service in May. Instead of building on top of Spanner, AlloyDB is a modified version of PostgreSQL that separates its compute and storage layers and supports WAL record processing directly in storage.

  • Snowflake Unistore: In June, Snowflake announced their new Unistore engine with “hybrid tables” to support lower latency transactions for DML operations. When queries update a table, the changes get propagated to Snowflake’s columnar storage. Somebody at SingleStore got a little feisty and mentioned that they have some patents around this space, but nothing has come of it.

  • MySQL Heatwave: After Oracle realized that Amazon makes more money off of MySQL than they do, they finally decided to build their own cloud offering for MySQL in 2020. But instead of just making an RDS clone, they extended MySQL with an in-memory vectorized OLAP engine called Heatwave. Last year Oracle announced that their MySQL service also supports automated database optimizations (but different from what OtterTune provides). This year, Oracle finally realized that they are not the leading cloud vendor and relented to support MySQL Heatwave on AWS.

  • Velox: Meta started building Velox as a new execution engine for PrestoDB in 2020. Two years later, they announced the project and published a VLDB paper about it. Velox is not a full DBMS: it does not come with a SQL parser, catalog, optimizer, or networking support. Instead, it’s a C++ extensible execution engine with a memory pool and storage connectors. One could use Velox to build a full-fledged DBMS around it.

  • InfluxDB IOx: Like Meta with Velox, the Influx squad has been working on their new IOx engine for the past two years. They finally announced that it went GA in October. InfluxDB built IOx from scratch based on DataFusion and Apache Arrow. Thankfully, they also ditched MMAP in the new system after I warned Influx’s CTO in 2017 that using MMAP was a bad idea.

Andy's take

Databases are the second most important thing in my life, so I enjoy seeing all the developments in the last year.

My hot take on AlloyDB is that it is a neat system, and an impressive amount of engineering went into it, but I still don’t know what is novel about it yet. AlloyDB’s architecture is similar to Amazon Aurora and Neon, where the DBMS storage has an additional compute layer to process WAL records independently of the compute nodes. Despite already having a solid database portfolio (e.g., Spanner, BigQuery), Google Cloud felt the need to build AlloyDB to try to catch up with Amazon and Microsoft.

The long-term trend to watch is the proliferation of frameworks like Velox, DataFusion, and Polars. Along with projects like Substrait, the commoditization of these query execution components means that all OLAP DBMSs will be roughly equivalent in the next five years.

Instead of building a new DBMS entirely from scratch or hard forking an existing system (e.g., how Firebolt forked Clickhouse), people are better off using an extensible framework like Velox. This means that every DBMS will have the same vectorized execution capabilities that were unique to Snowflake ten years ago. And since in the cloud, the storage layer is the same for everyone (e.g., Amazon controls EBS/S3), the critical differentiator between DBMS offerings will be things that are difficult to quantify, like UI/UX stuff and query optimization.

The loss of a database pioneer

On a more somber note, we lost Martin Kersten in July 2022. Martin was a researcher at CWI who was a leader on several influential database projects, including one of the first distributed in-memory DBMSs in the 1990s (PRISMA/DB) and one of the first columnar OLAP DBMSs in the 2000s (MonetDB). Martin was awarded a royal knighthood by the Dutch government in 2020, specifically for his work on databases.

The MonetDB codebase was a springboard for several other OLAP system projects. In the late 2000s, Peter Boncz and Marcin Żukowski forked it to make MonetDB/X100, which then was commercialized as Vectorwise (now known as Actian Vector). Marcin then went off to co-found Snowflake using a lot of the techniques that he developed on the original MonetDB code. More recently, Hannes Mühleisen created an embedded version of MonetDB called MonetDBLite, which he then rewrote again into what is now DuckDB.

Martin’s contribution to modern database systems cannot be overstated. If you use any modern analytical DBMS (e.g., Snowflake, Redshift, BigQuery, Clickhouse), then you are benefiting from many of the advancements developed by Martin and his students over the last 30 years.

Andy's take

I acknowledge that Martin was probably not well known to people outside the database research community compared to people like Mike Stonebraker. I always thought of Martin as the European version of Stonebraker: they are both prolific database researchers who are tall, thin, wear glasses, and are about the same age. But Martin was not some off-brand knockoff like a Nintendo Smitch.

Beyond the research, Martin was always generous with his time and eager to discuss database architectures with anyone. I last saw him at VLDB 2019 before the pandemic. He and I argued for almost an hour about why he felt that using MMAP in MonetDB was the right choice; he claimed that because MonetDB focuses on read-only OLAP workloads, MMAP was good enough. I felt terrible because he also had to deal with students watching my database courses on YouTube and then emailing him about why MonetDB made design choices that I claimed were inferior.

I encourage you to watch one of the last talks Martin gave in 2021 for our CMU-DB Seminar Series. I promised Martin I wouldn’t derail his talk by complaining about MonetDB’s use of MMAP. But if you watch the first 60 seconds, you will see I hired a Dutch person to record a fake royal intro for Martin.

Using a database fortune to save democracy

I always want to end my year-end retrospective articles on a happy note. Databases are supposed to make people feel good about their lives. They represent the culmination of scientific and engineering breakthroughs that enable us to organize data about any facet of modern life. Given this, my last story should make everyone feel good. It is an example of somebody doing the right thing for the right reason.

In May 2022, the Washington Post reported that Larry Ellison, Oracle founder and sailing enthusiast, participated in a November 2020 conference call with the US president and other conservative leaders about the recently completed election.

The call focused on different strategies that the president’s allies and roustabouts could employ to overturn the results of the presidential election. As the Post’s article points out, it is unclear why the administration included Larry on the call. One speculation is that given Larry’s obvious strong technical background, he might be in a good position to assess whether the allegations of using Italian satellites to manipulate air-gapped voting machines are legitimate.

Andy's take

Both Larry and I are sick of people making outlandish claims about his support for right-wing causes in the US. Some have even said this one phone call is the worst thing Larry has ever done. This is not true, and I personally know that it hurts Larry to read such statements about himself in the news and on social media. These journalists made it sound like Larry was doing something nefarious or indecent, like the time he made his pregnant third wife sign a prenup two hours before their wedding.

I can assure you that Larry was only trying to use his vast wealth as the 7th richest person in the world to help his country. His participation in this call is admirable and should be lauded. Free and fair elections are not a trivial affair, like a boat race where sometimes shenanigans are okay as long as you win. Larry has done other great things with his money that are overlooked, like spending $370m on anti-aging research so that he can live forever and investing $1b to help Elon Musk run(?) Twitter. So I stand by Larry’s actions in this example.

Next year is going to be lit

For me personally, there were a lot of changes in 2022:

  1. OtterTune raised our series A in April.

  2. My #1 Ph.D. student joined the University of Michigan as a new database professor.

  3. I was indefinitely “not fired” from Carnegie Mellon University in July (although I am still bound by the university’s “moral turpitude” clause for tenured faculty).

  4. I returned to teaching full-time in September.

I also contracted four non-COVID sinus infections from my biological daughter after she started preschool.

I am excited about where OtterTune is going next year. We plan to announce a complete rewrite of our database automation service in the first half of 2023. I have posted previews of the new features that we are working on now. We also had some legal issues with our record label due to transporting otters across state lines, but we’ve sorted out that mess and hope to release new albums in 2023.

P.S.: As always, please don’t forget to run ANALYZE on your databases over the holidays. Or let OtterTune automatically figure it out for you.
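If you want to see what that holiday ANALYZE actually does, here is a runnable miniature. It uses SQLite (whose ANALYZE likewise refreshes optimizer statistics) purely because it runs anywhere without a server; on PostgreSQL or MySQL you would issue the equivalent ANALYZE statement through your usual client:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX users_name_idx ON users (name)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("a",), ("b",), ("c",)])

# ANALYZE gathers table/index statistics that the query planner uses
# to pick better plans (e.g., whether an index scan is worthwhile).
conn.execute("ANALYZE")

# SQLite stores the gathered statistics in the sqlite_stat1 table.
stats = conn.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
```

Stale statistics are one of the most common causes of bad query plans, which is why the reminder is an annual tradition.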

Try OtterTune for free. Start using AI to optimize your PostgreSQL or MySQL databases running on Amazon RDS or Aurora.
