Whether applications, information systems and computational resources are customer-facing or internal, revenue-generating or otherwise, one fact almost always holds true: they are built on data and are only as useful as their data assets.
When that data’s accurate, timely, complete and of dependable quality, these systems can meet any number of business objectives. However, as Anomalo CEO Elliot Shmukler told The New Stack, “The problem is all of those things break if you’re doing things wrong. Or, if your data is late. Or, if it’s missing.”
Contemporary data quality solutions account for these and countless other variables that diminish an organization's data quality and, with it, the output of its data-driven endeavors.
Moreover, they do so with a staggering amount of automation based on machine learning (ML), low-code and no-code constructs, and libraries of resources. They provide this functionality for data flowing through data pipelines, data at rest, and structured, semi-structured and unstructured data (including unstructured text).
With capabilities to speedily implement root cause analysis via natural language and pictorial descriptions, such platforms “help data teams find these kinds of issues with their data before things break,” Shmukler added. “Before the dashboards are wrong. Before the ML models based on that data go off their rails. Before the wrong decisions are made.”
Self-Supervised Learning
Anomalo’s machine learning is instrumental in enabling organizations to implement data quality monitoring quickly and, in most cases, with minimal effort. The offering primarily works by connecting to data warehouses or data lakes to begin “automatically monitoring any dataset … that you care about,” Shmukler indicated.
Anomalo’s machine learning models, which largely rely on self-supervised learning, monitor datasets without requiring users to define rules, write code or describe what quality data looks like. Although data-profiling techniques are also involved, the self-supervised learning models “train themselves on the history of the dataset, rather than any human-labeled data,” Shmukler said. Soon after datasets are selected, the platform begins monitoring for details like:
- Data freshness: This check determines if new data is coming in when it’s expected.
- Completeness: This metric assesses whether the data arrived at the expected volume or is missing segments, columnar information or other characteristics.
- Distribution: Distribution shifts indicate if datasets contain new, anomalous values.
- Columnar correlations: Anomalo can determine if there are atypical changes in the correlations between columns in tables.
By analyzing these and other factors, Shmukler said, the system’s “out of the box checks find 85 to 90 percent of all possible issues without you having to tell us what to look for.”
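To illustrate the general idea behind such history-trained checks, here is a minimal sketch of a volume monitor that learns what “normal” looks like from a table’s own past and flags departures from it. This is a toy example, not Anomalo’s models; the table, counts and threshold are invented for illustration.

```python
# Toy illustration of a history-trained ("self-supervised") volume check.
# The hypothetical history and threshold below are not Anomalo's implementation.
import pandas as pd

def volume_anomaly(daily_row_counts: pd.Series, todays_count: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates sharply from the table's own history."""
    mean = daily_row_counts.mean()
    std = daily_row_counts.std()
    if std == 0:
        return todays_count != mean
    return abs(todays_count - mean) / std > z_threshold

# Example: 30 days of history for a hypothetical orders table.
history = pd.Series([10_250, 10_410, 9_980, 10_300] * 7 + [10_150, 10_220])
print(volume_anomaly(history, todays_count=4_800))  # True: volume dropped sharply
```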
Deterministic Monitoring
Users can define rules and business logic for specialized, deterministic monitoring that covers the remaining 10 to 15 percent of data quality issues. The platform includes a library of basic rules, so organizations can “take three to four clicks to implement an easy, simple rule,” Shmukler said.
Examples include rules that check column values for nulls appearing where they shouldn’t be, verify that values are in the correct format, and detect differences between particular tables.
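The sketch below is a platform-agnostic illustration of that kind of deterministic rule, checking a column for unexpected nulls and malformed values; the DataFrame, column name and pattern are invented for the example and are not Anomalo’s rules library.

```python
# Platform-agnostic sketch of deterministic column rules: flag nulls where they
# shouldn't appear and values that don't match an expected format.
# The dataframe, column name and pattern are hypothetical.
import pandas as pd

def check_column_rules(df: pd.DataFrame, column: str, pattern: str, max_null_rate: float = 0.0) -> dict:
    null_rate = df[column].isna().mean()
    non_null = df[column].dropna().astype(str)
    format_violations = int((~non_null.str.match(pattern)).sum())
    return {"null_rate_ok": null_rate <= max_null_rate, "format_violations": format_violations}

orders = pd.DataFrame({"order_id": ["ORD-001", "ORD-002", None, "2_bad"]})
print(check_column_rules(orders, "order_id", pattern=r"ORD-\d{3}"))
# {'null_rate_ok': False, 'format_violations': 1}
```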
For customized use cases in which SQL is required, AI Assist, a recently released capability, utilizes GPT-4 to write SQL from natural language prompts. It can also correct code errors if users prefer writing their own SQL.
“With AI Assist you can just tell us what you’re trying to do, and we’ll write that SQL for you,” Shmukler said.
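The snippet below shows the general technique of turning a plain-English request into SQL with GPT-4. It is a generic sketch using the OpenAI Python client rather than Anomalo’s AI Assist, and the schema, prompt and table name are hypothetical.

```python
# Generic sketch of natural-language-to-SQL generation with GPT-4 via the
# OpenAI Python client (openai>=1.0). Not Anomalo's AI Assist; the schema,
# prompt and table name are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = "orders(order_id TEXT, region TEXT, amount NUMERIC, created_at TIMESTAMP)"
request = "Count yesterday's orders per region and flag regions with fewer than 100 orders."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Write a single SQL query for this schema: {schema}"},
        {"role": "user", "content": request},
    ],
)
print(response.choices[0].message.content)  # the generated SQL, ready for review
```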
Organizations can even issue checks for an entire set of metrics, up to approximately 100 at a time, for different datasets or for the same one. When monitoring activities from multiple customers in multiple regions, for example, users can employ this feature instead of creating individual checks for each metric for each customer in each region, which would be an arduous, time-consuming process.
With this capability, Shmukler said, the platform “will automatically look at that collection of metrics as a whole and identify the most anomalous, track them over time, and build individual models to understand how it’s moving.”
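One way to picture monitoring a collection of metrics as a whole is to score each metric against its own history and surface the most anomalous first. The sketch below is a simplified stand-in for that idea, not Anomalo’s approach; the metric names and figures are hypothetical.

```python
# Sketch of ranking a collection of metrics by how anomalous their latest
# value is relative to each metric's own history. All names and data are hypothetical.
import pandas as pd

def rank_anomalies(history: pd.DataFrame, latest: pd.Series) -> pd.Series:
    """history: rows = days, columns = metrics; latest: today's value per metric."""
    z_scores = (latest - history.mean()) / history.std().replace(0, 1)
    return z_scores.abs().sort_values(ascending=False)  # most anomalous metrics first

history = pd.DataFrame({
    "orders_us_east": [1000, 1020, 980, 1010, 995],
    "orders_eu_west": [400, 410, 395, 405, 402],
    "revenue_us_east": [52_000, 53_100, 51_800, 52_400, 52_900],
})
latest = pd.Series({"orders_us_east": 1005, "orders_eu_west": 120, "revenue_us_east": 52_500})
print(rank_anomalies(history, latest))  # orders_eu_west tops the list
```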
Root Cause Analysis
The simplicity Anomalo brings to automated data quality monitoring extends to rapidly ascertaining the root causes of data incidents. The system issues “automated root cause analysis when rules fail,” according to Shmukler, and when anomalies are detected. Several factors inform the determination of an issue’s cause, including data lineage as well as current and previous developments.
A synthesis of these factors is delivered through timely alerts that link back to the underlying system and “the root cause that we’ve computed,” Shmukler said. “There’s historical information so you can put this issue in context. All of those things are available to you, just a click away.”
Alerts incorporate natural language explanations and visualizations of any problems with either data or data pipelines.
Data Pipeline Support
Anomalo extends its monitoring to data pipelines and orchestration tools to implement measures like circuit breakers, which stop pipelines from transmitting substandard data, and binning, which separates quality data from poor-quality data so the former can continue through the pipeline.
Shmukler recounted a use case in which a real estate customer, receiving third-party data from numerous nationwide sources and government agencies, was struggling to match its listing price data with the correct tax data.
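To make the circuit-breaker and binning patterns concrete, here is a minimal, framework-agnostic sketch of both. It is not Anomalo’s pipeline integration; the check, column names and thresholds are assumed for illustration.

```python
# Framework-agnostic sketch of two pipeline patterns: a circuit breaker that
# halts the pipeline on bad data, and binning that lets good rows continue
# while quarantining bad ones. All names and thresholds are hypothetical.
import pandas as pd

class DataQualityError(Exception):
    pass

def circuit_breaker(df: pd.DataFrame, max_null_rate: float = 0.05) -> pd.DataFrame:
    """Stop the pipeline entirely if too many rows fail a basic check."""
    null_rate = df["price"].isna().mean()
    if null_rate > max_null_rate:
        raise DataQualityError(f"Null rate {null_rate:.1%} exceeds {max_null_rate:.1%}; halting pipeline")
    return df

def bin_rows(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows into good and bad bins; only the good bin flows downstream."""
    bad_mask = df["price"].isna() | (df["price"] <= 0)
    return df[~bad_mask], df[bad_mask]

listings = pd.DataFrame({"listing_id": [1, 2, 3, 4], "price": [350_000, None, 425_000, -1]})
good, quarantined = bin_rows(listings)
print(len(good), "rows continue;", len(quarantined), "rows quarantined")

try:
    circuit_breaker(listings)
except DataQualityError as err:
    print("Circuit breaker tripped:", err)
```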
Anomalo’s root cause analysis not only successfully pinpointed where the poor quality data came from but, Shmukler noted, “also identified which geography within the dataset was showing that bad data.”
Unstructured Text and Documents
The capacity to implement data quality monitoring merely by connecting to sources, stop data pipelines based on the quality of the data they contain, and automate natural language and pictorial explanations of the causes of data incidents is just the beginning for Anomalo.
The vendor, which was recognized as Databricks’ Emerging Partner of the Year, recently released a feature in which it deploys GPT-4 for unstructured text monitoring, allowing users to upload a corpus and “create summaries of the documents to assess their quality while looking at how rich is this content,” Shmukler said. “What grade level is it at? Are there duplicates?”
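As a rough illustration of two of the document-quality signals mentioned here, exact duplicates and reading grade level, the sketch below computes simple stand-ins for each. These heuristics are not the GPT-4-based analysis the vendor describes, and the sample documents are invented.

```python
# Simplified illustration of two document-quality signals: exact-duplicate
# detection via hashing and an approximate Flesch-Kincaid grade level.
# These heuristics are stand-ins, not the GPT-4-based analysis described above.
import hashlib
import re

def find_duplicates(documents: list[str]) -> set[int]:
    """Return indices of documents that exactly duplicate an earlier one."""
    seen, dupes = {}, set()
    for i, doc in enumerate(documents):
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:
            dupes.add(i)
        seen.setdefault(digest, i)
    return dupes

def approx_grade_level(text: str) -> float:
    """Crude Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / max(1, len(words))) - 15.59

docs = ["Data quality matters.", "Data quality matters.", "Pipelines ship rows downstream daily."]
print(find_duplicates(docs), round(approx_grade_level(docs[2]), 1))
```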
Such efforts are modernizing data quality’s overall utility by bringing it into the realm of generative AI. Anomalo is not only employing this technology but also doing so to help organizations fine-tune, train and implement retrieval-augmented generation (RAG) with their own language models.