In a very crowded Moscone Center in San Francisco last week, data and AI contender Databricks held its annual confab, the Data and AI Summit (DAIS). The company says 16,000 people attended the event in person, with another 44,000 registered for online attendance. With DAIS immediately following archrival Snowflake’s Data Cloud Summit event, held in the same venue the week prior, Databricks had its work cut out for it. And it’s more than fair to say it delivered.
The DAIS Day One and Day Two keynotes — anchored by CEO Ali Ghodsi and CTO Matei Zaharia, respectively — totaled almost 6 hours of content, with numerous announcements pertaining to open table formats, AI, BI, the now open-sourced Unity Catalog, Apache Spark 4.0 and more. It was a smörgåsbord of stuff. In this post, I’ll do my best to cover what was announced, organizing it by category. In a follow-up post, I’ll compare Databricks’ announcements with those of Snowflake, not only to analyze the competition between the two, but also to understand how the developments at both companies are together shaping the broader data and analytics arena.
Open Table Formats = Open Standard Competition
Ghodsi kicked off the first-day keynote with a strategic overview, before passing the baton to a range of Databricks leaders, each addressing announcements in their areas. In his high-level introduction, Ghodsi said security and governance of the data estate are among the company’s biggest concerns, as is the fragmentation of the data stack, which he indicated was acute and unacceptable. Despite that fragmentation, however, Ghodsi and Databricks feel strongly that you should own your own data, store it in an open format, avoid vendor lock-in, and be able to use a number of engines on that data without having to make multiple copies of it.
That manifesto ended up being a great segue into a discussion of Databricks’ acquisition — announced just the week prior — of Tabular, the company founded by the creators of open table format Apache Iceberg. For those who haven’t been watching, there has been a format war going on between two open-source table formats. On the one hand, there is Delta Lake, created by Databricks itself, adopted by Microsoft for its Fabric platform, and supported now by DuckDB, something also announced during the Day One keynote. On the other hand is Iceberg, a format championed by companies like Google, Cloudera, Starburst, Confluent, Dremio and, most recently, Snowflake, which announced at its event the general availability (GA) of Iceberg as a natively supported format for Snowflake database tables.
Boardwalk and Park Place
As mighty as Snowflake’s announcement may have seemed, Databricks announced its acquisition of Tabular that same week and, with that announcement, news that it will formally employ Tabular CEO Ryan Blue, Project Management Committee (PMC) Chair for the Apache Iceberg project. Ostensibly also joining Databricks will be Tabular’s Head of Engineering, Daniel Weeks, co-creator of Iceberg and another of its PMC’s 16 members. With Iceberg’s leadership thus on board, and given that Databricks itself created Delta Lake and remains that open source project’s biggest committer, the company will have quintessential expertise in both formats.
Tabular’s Blue spoke at the Day One keynote event and said that the two formats would likely converge. Meanwhile, the Delta Lake Universal Format (UniForm) layer, which abstracts away differences between the two formats, will certainly benefit from the Tabular acquisition and will make it so that, in Ghodsi’s words, “you don’t have to pick which of the two silos, which of the two USB formats, do I have to store this in.”
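For readers curious what UniForm looks like at the table level, here is a minimal sketch in PySpark, following the table properties documented for Delta Lake 3.x; the catalog, table, and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Delta table with UniForm enabled, so that Iceberg-compatible
# engines can read it too. The table and columns are hypothetical; the two
# TBLPROPERTIES follow Delta Lake's UniForm documentation at the time of
# writing.
spark.sql("""
    CREATE TABLE main.sales.orders (order_id BIGINT, amount DOUBLE)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```

The Delta transaction log remains the source of truth; UniForm generates Iceberg metadata alongside it, which is what lets engines from both ecosystems read the same underlying Parquet files.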
Does all this mean Iceberg will survive and Delta Lake will be phased out? Does it mean the reverse? Or does it mean that the two will continue to coexist with trivial differences between them? Frankly, it’s unclear. With the acquisition announcement so recent, it seems Databricks may not even have decided yet. But there are communities and vendors with investments around both of these formats, and everyone’s going to need to come together if Databricks is serious about reducing the fragmentation that Ghodsi decried in his opening remarks.
AI: The Compound Word
Moving on to artificial intelligence, a big theme at DAIS was the idea of “Compound AI,” which focuses on building custom large language models (LLMs) that are fine-tuned on a customer’s own data, so that they have a contextual understanding of it as a baseline, rather than having that context passed in at query time via retrieval-augmented generation (RAG) prompting. Although RAG is wildly popular now, Databricks maintains that compound AI is key to making models easier to use, more precise, and less prone to the “hallucination” anomalies that plague LLMs in general. Zaharia, Ghodsi and others coauthored a paper on the subject that you can read for details.
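To make the architectural distinction concrete, here is a minimal sketch, built from hypothetical stand-in classes rather than any Databricks API; the point is simply where the domain context lives in each approach.

```python
from dataclasses import dataclass

# Hypothetical stand-ins, not Databricks APIs.
@dataclass
class Doc:
    text: str

class Retriever:
    def search(self, question: str, top_k: int = 5) -> list[Doc]:
        return [Doc(text="...relevant internal document...")][:top_k]

class LLM:
    def generate(self, prompt: str) -> str:
        return f"(model answer to: {prompt[:40]}...)"

retriever, base_llm, tuned_llm = Retriever(), LLM(), LLM()

def answer_with_rag(question: str) -> str:
    # RAG: fetch relevant documents at query time and stuff them into the prompt.
    context = "\n".join(d.text for d in retriever.search(question))
    return base_llm.generate(f"Context:\n{context}\n\nQuestion: {question}")

def answer_with_compound_ai(question: str) -> str:
    # Compound AI: the model was fine-tuned on the customer's data, so the
    # context lives in its weights rather than in the prompt.
    return tuned_llm.generate(question)
```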
At the DAIS Day One keynote, Databricks announced that more than 200,000 custom AI models have been built on its platform. Furthermore, the company said it will offer facilities for no-code fine-tuning of LLMs on your data. This will fulfill the vision of compound AI and help achieve what Databricks calls Data Intelligence, which it defines as AI based on your data, as opposed to General Intelligence, which is not. (Note: other companies, like Collibra, define the term “data intelligence” very differently, as a discipline largely focused on intelligent management of a customer’s data estate.)
Piecing Together the Mosaic
Mosaic AI is Databricks’ brand for the collective power of its long-established and continuously evolving machine learning capabilities, along with its newer capabilities for Generative AI (GenAI) and LLMs. At DAIS, the company announced that Mosaic AI will feature a new Agent Framework for building RAG applications, a Tools Catalog (which lets customers inventory and curate AI-relevant SQL functions, Python functions, model endpoints, remote functions, and retrievers), as well as evaluation and training capabilities. Furthermore, the vector search capabilities that had been available in preview are now generally available (GA). During the keynote, Databricks also discussed the Mosaic AI Gateway (detailed within this blog post), which abstracts away API differences between the various LLM and GenAI providers.
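To illustrate the kind of abstraction a gateway provides, here is a minimal sketch using the MLflow Deployments client, which Databricks has used as a front door for serving endpoints; the endpoint names below are hypothetical, and the exact request shape for a given endpoint may differ.

```python
from mlflow.deployments import get_deploy_client

# Hypothetical endpoint names; each would be configured in the gateway to
# front a different provider (OpenAI, Anthropic, a Databricks-served model).
client = get_deploy_client("databricks")

for endpoint in ["openai-chat", "anthropic-chat", "dbrx-chat"]:
    # Same client, same request shape, different model behind each endpoint --
    # that uniformity is the point of a gateway.
    response = client.predict(
        endpoint=endpoint,
        inputs={"messages": [{"role": "user", "content": "Summarize Q2 sales."}]},
    )
    print(endpoint, response)
```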
As an example of how these features can be applied productively, Databricks discussed a text-to-image AI model that the Mosaic AI research team built in partnership with Shutterstock, the massive stock photo bureau. The model, called ImageAI, is available online from Shutterstock, for a $7 monthly subscription.
Unity Catalog: Enhanced and Open Sourced
There are now so many components within the Databricks platform that it’s perhaps fitting that the one that ties most of the others together is called Unity Catalog (UC). UC started out as Databricks’ own implementation of a database table catalog, built to match and exceed the capabilities of the catalog offered by the Apache Hive project, a catalog that, though bare-bones, had become an industry standard. UC has grown to take on other workloads, though, including new ones announced at DAIS.
More UC news: the Databricks Lakehouse Monitoring feature — which monitors the statistical properties, data quality, and accuracy of tables and AI models in UC — is now GA. A new attribute-based access control (ABAC) feature is going into preview to complement UC’s foundational role-based access control (RBAC) feature. And if that weren’t enough, Databricks announced that UC can now function as a metrics store, allowing it to catalog centralized, certified definitions of key metrics used in analysis, thus avoiding the situation where different business units may define like-named metrics differently. Databricks said the feature will be compatible with metrics store capabilities offered by AtScale, Cube, and dbt.
UC OSS
Having its own catalog no longer makes Databricks unique among its peers, though. At its Data Cloud Summit event the week before DAIS, Snowflake announced a new catalog, specifically for Iceberg tables, called Polaris, which it said it would open source within 90 days. Perhaps in response, at DAIS, Databricks announced that it would be open sourcing UC, and in the Day Two keynote, Matei Zaharia did just that, making UC’s GitHub repo public, live on stage. Zaharia also announced that the project would be domiciled at the Linux Foundation’s “LF AI + Data” unit.
Zaharia explained that CIOs canvassed by Databricks want governance that offers open connectivity, handles both data and AI, and permits open access from any engine or client. UC aims to provide exactly that, as it can now be used to govern files, AI models, and AI tools, in addition to tables. And folks who want to keep using the Hive Metastore, or Amazon Web Services’ comparable Glue catalog, can now federate them into UC, so that the contents of those catalogs can be managed from UC as well.
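For a sense of what that federation might look like in practice, here is a minimal sketch following UC’s CREATE CONNECTION / CREATE FOREIGN CATALOG pattern; the connection type keyword, option keys, and all names here are illustrative assumptions rather than confirmed syntax, so check the Databricks documentation before relying on them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the external metastore as a UC connection. The type keyword and
# option keys are assumptions for illustration; consult the docs for the
# exact syntax your metastore requires.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS legacy_hms TYPE hive_metastore
    OPTIONS (host 'hms.example.com', port '9083')
""")

# Mirror it as a foreign catalog, so its tables become addressable through
# UC's three-level namespace, e.g. hms_mirror.default.web_logs.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS hms_mirror
    USING CONNECTION legacy_hms
""")
```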
Federation and Sharing
Federation doesn’t just extend to catalogs, though. Databricks announced that its Lakehouse Federation feature is now GA. Lakehouse Federation is essentially a data virtualization technology that brings tables from external Databricks lakehouses, as well as from non-Databricks platforms like MySQL, PostgreSQL, Snowflake, Google BigQuery, Amazon Redshift, and Microsoft’s Azure Synapse Analytics, into UC. The federation story continues, as Databricks announced that Delta Sharing technology now works with Lakehouse Federation sources as well as Databricks native tables. And Delta Sharing now lets Databricks customers share data with other organizations using not just Databricks, but also vanilla Apache Spark, Tableau, Power BI, Excel, and the Python Pandas library.
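On that last point, here is a minimal sketch of pulling shared data into Pandas with the open source delta-sharing package; the profile file path and the share, schema, and table names are hypothetical placeholders.

```python
import delta_sharing

# The profile file, obtained from the data provider, holds the sharing
# server's endpoint and a bearer token.
profile = "/path/to/provider.share"

# Table coordinates take the form <share>.<schema>.<table>.
table_url = profile + "#sales_share.retail.orders"

# Load the shared table straight into a Pandas DataFrame -- no Databricks
# workspace needed on the consuming side.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```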
Meanwhile, Databricks Clean Rooms, now in preview, provides a more granular data-sharing mechanism, facilitating the sharing of specific data — based on queries approved by the sharing and consuming parties — without sharing entire datasets.
BI, Serverless, and LakeFlow
All of this AI and catalog technology is great, but what about BI (business intelligence)? Databricks had a big announcement there, too. It launched a new component called AI/BI, the “Genie” feature of which allows data to be queried in Q&A fashion, much as one sends prompts to an LLM, and responds with both data result sets and visualizations. The underlying model is built automatically, combining the LLM’s general knowledge with the customer’s data and metadata.
At first blush, AI/BI looks like Databricks’ own implementation of functionality long offered by ThoughtSpot, allowing natural language exploration of a customer’s data. But Databricks’ offering is really more of an AI/BI hybrid, true to its name. As an example, if a user asks a question using terms or concepts that aren’t in AI/BI’s model, the platform will ask the user to explain them. The explanation is added to the model as a so-called “instruction,” and AI/BI will then attempt to provide an answer, based on its improved contextual understanding of the question. Instructions can also be entered directly, rather than at query time. Together, both types of instructions essentially help AI/BI build a natural language-based BI semantic model.
AI/BI also sports a dashboarding feature. This is not Databricks’ first foray into this realm, as Databricks SQL introduced a dashboard feature of its own. It’s not entirely clear that Databricks customers will want to build dashboards within the Databricks platform, rather than use full-fledged BI tools like Microsoft’s Power BI or Salesforce’s Tableau. But Databricks promises that those tools will also be able to tap into AI/BI’s power through APIs. Time will tell which combination customers like most.
Speaking of the Databricks platform, the company announced that all of its components will now be available in serverless form. At DAIS, Databricks said the serverless approach will eliminate cluster tuning, autoscale configuration, data layout setup, capacity planning, usage tracking, and Spark version management. The company also announced LakeFlow, which handles data ingestion, transformation and orchestration. Essentially, LakeFlow ties together the change data capture (CDC) technology Databricks took ownership of through its acquisition of Arcion Labs, with its own Jobs and Delta Live Tables features, providing a visual authoring interface with which to build out full data pipelines.
Spark 4, and More
That’s a ton of announcements for the commercial Databricks platform, but it’s still not the whole DAIS story. Let us not forget that Databricks was founded by the creators of Apache Spark, and its platform is built atop Spark technology. DAIS itself used to be called “Spark and AI Summit,” and simply “Spark Summit” prior to that. This means news around open source Apache Spark is still very important to the Databricks community.
Databricks co-founder Reynold Xin spent much of the Day Two keynote on Spark. He first mentioned that the platform now has much-improved Python usability. This has been a concern of the Python community, as Spark itself is written in a language called Scala, and that language often offered capabilities or advantages in working with Spark that Python couldn’t match. But Xin explained that Python is now a first-class language for working with Spark and, in fact, some Spark features are exclusive to Python. Xin also announced that 5 billion PySpark (the Python Spark library) queries are run every day on Spark 3.3 and higher, on the Databricks platform alone.
As popular as Spark 3 may be, though, Xin said Spark 4.0 will be coming later this year. Xin said Spark 4 will support ANSI SQL for querying data in tables and DataFrames, and that the new version has been co-designed with the Delta Lake and Iceberg projects and their communities.
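For a taste of what ANSI SQL mode changes, consider division by zero, which legacy Spark silently swallows; the flag below has been available as an opt-in since Spark 3.x, and Spark 4 is expected to flip it on by default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under ANSI semantics, invalid operations raise errors instead of
# silently returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT 1/0").show()  # raises a DIVIDE_BY_ZERO error under ANSI mode

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1/0").show()    # legacy behavior: returns NULL
```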
Moving to new versions of Spark has traditionally been difficult, as version dependencies can be complex, making migrations non-trivial. Never fear, though, as Spark 4 will include the GA release of the Spark Connect API. Spark Connect allows client applications to bind to a version of the Spark API, rather than to a version of the engine and its dependencies, which should make Spark version upgrades far smoother.
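Here is a minimal sketch of what that binding looks like from PySpark (the remote builder has shipped since Spark 3.4); the server address is a placeholder, and 15002 is Spark Connect’s default port.

```python
from pyspark.sql import SparkSession

# With Spark Connect, the application talks to the cluster over a thin
# gRPC client instead of embedding the engine and its JVM dependencies.
# "sc://" is the Spark Connect URL scheme.
spark = SparkSession.builder.remote("sc://spark-server:15002").getOrCreate()

# DataFrame operations build an unresolved plan on the client; execution
# happens on the remote server, whatever engine version it runs.
df = spark.range(10).filter("id % 2 = 0")
df.show()
```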
Many Announcements, Competitive Impact
This was a big post, matching the huge payload of announcements that Databricks brought to its annual summit at Moscone Center. The company and its platform have come a long way from offering the first cloud-hosted Spark environment to function independently of Apache Hadoop and Jupyter notebooks. The AI, MLOps, governance, data virtualization, SQL/BI and data engineering/pipelining capabilities are building Databricks out as a true end-to-end platform, akin to the Cloudera Data Platform and Microsoft Fabric.
Snowflake is building out its own platform in similar fashion, and the two companies have fostered a rivalry reminiscent of that between Cloudera and Hortonworks ten years ago, before those two companies merged at the very beginning of 2019. In a future post, I’ll cover the competition between, and aggregate output of, Databricks and Snowflake, covering Delta Lake vs. Iceberg, Unity Catalog vs. Polaris, Mosaic AI vs. Cortex AI, and the two companies’ partnerships and ecosystems.