
Recent NVIDIA research on the state of AI in 2024 found that almost 50% of companies across industries plan to run AI projects both in the cloud and on-premises. In other words, the future of AI infrastructure will be hybrid cloud and multicloud. This finding isn't surprising because, in most cases, the GPU resources for GenAI projects, and even large language model (LLM) training, will never be co-located with all the data needed to feed them.
While there has been a lot of discussion about localized infrastructure and the need for specialized storage systems to serve data to GPUs at very high performance, GPU data orchestration (the ability to quickly and efficiently move data from where it lives today to feed high-performance compute resources) is arguably the more important topic.
The concept of orchestration is well understood: Structured data orchestration is critical in Databricks’ platform, Run.ai is used for GPU resource orchestration, and Kubernetes is used for container orchestration. But what about orchestrating the unstructured data that makes up most of the data used for GenAI?
Moving large data sets between sites and clouds is complex, especially in the multivendor storage environments found in nearly every enterprise; few, if any, organizations keep all their data on a single vendor's storage. Identifying and accessing the correct data to migrate is tedious and fraught with potential errors and security risks. Many organizations fall back on manual, brute-force techniques, such as copying entire datasets from one storage system or cloud to another. This approach takes time, adds both CAPEX and OPEX, and slows innovation.
This problem has become more acute for modern high-performance workloads, with broad-based GPU shortages often forcing organizations to burst workflows into cloud-based GPU clusters or remote GPU-as-a-service providers. And even as more GPUs become available and are deployed, it is unlikely that all of an organization's GPU resources will sit in the same data center, whether because of availability or power constraints.
As we head into the next data cycle, organizations need direct global access to all their data to extract unrealized value.
Ensuring Data Is Where It Needs to Be When It Needs to Be There
AI workflows have multiple phases and, of course, span many use cases that can vary greatly. But as diverse as those use cases are, the common denominator is the need to collect data from many different sources, often in different locations and even from outside a single organization.
The fundamental problem is that access to data, whether by humans or AI models, is always funneled through a file system at some point. Traditionally, that file system has been embedded within the storage infrastructure. This infrastructure-centric approach means that when data needs to be used outside the storage platform it lives on today, or when different performance requirements or cost profiles dictate the use of other storage types, three things happen: multiple copies of data are generated (leading to dirty data sets), users and applications must navigate an increasingly complex web of access paths, and security risks grow as data is moved out of the governing file system.
This problem is particularly acute for AI workloads, where a critical first step is consolidating data from multiple sources to enable a global view across them all. AI workloads must have access to the complete dataset to classify and label the files, the first step in determining which data should be refined and passed to the next stage.
Each phase in the AI journey refines the data further. This process might include cleansing and LLM training or, in some cases, tuning existing LLMs through iterative inferencing runs to get closer to the desired output. Each step also has different compute and storage performance requirements, ranging from slower, less expensive mass storage systems and archives to high-performance, more costly NVMe storage and memory-loaded servers.
Overcome Data Gravity by Decoupling the File System From Infrastructure
Unlike traditional storage platforms that tie the file system to the infrastructure, modern data orchestration solutions work with any storage platform — whether at the edge, on-premises, or in the cloud — regardless of the vendor. These solutions create a high-performance, cross-platform, Parallel Global File System that unifies otherwise incompatible storage silos across multiple locations, including the cloud.
Of critical importance to AI workflows, data classification can be significantly enhanced with enriched metadata, which can then be used to automate data placement based on business objectives. Powerful metadata management capabilities let files and directories be tagged, manually or automatically, with user-defined custom metadata, creating a rich set of file classification information that streamlines the classification phase of AI workflows and simplifies later iterations.
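To make the tagging idea concrete, here is a minimal, hypothetical sketch using Linux extended attributes (the user.* namespace via Python's os.setxattr and os.getxattr). The paths and tag names are invented, and a platform like Hammerspace exposes its own metadata interface rather than raw xattrs; the sketch only illustrates the tag-then-filter pattern that feeds the classification phase.

```python
import os
from pathlib import Path

# Hypothetical sketch: tag files with custom classification metadata using Linux
# extended attributes (user.* namespace). A real orchestration platform exposes
# its own metadata API; this only illustrates the tag-then-filter concept.

def tag_file(path: str, key: str, value: str) -> None:
    """Attach a user-defined metadata tag to a file."""
    os.setxattr(path, f"user.{key}", value.encode())

def read_tag(path: str, key: str):
    """Read a tag back; returns None if the file has no such tag."""
    try:
        return os.getxattr(path, f"user.{key}").decode()
    except OSError:
        return None

def files_with_tag(root: str, key: str, value: str):
    """Yield files under root whose tag matches, e.g. everything awaiting labeling."""
    for p in Path(root).rglob("*"):
        if p.is_file() and read_tag(str(p), key) == value:
            yield p

if __name__ == "__main__":
    # Invented example paths and tag values.
    tag_file("/data/raw/scan_0001.tif", "ai_stage", "needs-labeling")
    for f in files_with_tag("/data/raw", "ai_stage", "needs-labeling"):
        print(f)
```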
GPU data orchestration allows IT administrators to automate data services across all storage silos and compute resources worldwide without interrupting users or applications. Tools like Hammerspace handle data orchestration in the background by decoupling the file system from the underlying infrastructure, ensuring high performance for GPU clusters, AI models, and data engineers. This unified, global metadata control plane gives all users and applications in every location seamless read/write access to the same files, not just copies.
This may all sound good, but you might think it won't work for large datasets due to the constraints of data gravity. Data orchestration systems for large data sets must be file granular and powered by a global metadata control plane, which overcomes many of the challenges of data gravity by allowing data to be accessed and managed seamlessly across different storage locations without physically moving extensive data sets. When data does need to be moved, it remains accessible while in flight because the global metadata control plane stays authoritative for any changes to the data.
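As a conceptual illustration only (the class and field names below are invented, not any vendor's API), a file-granular control plane reduces to a catalog that maps a stable logical path to one or more physical placements, so applications keep using a single path while the bytes move underneath them:

```python
from dataclasses import dataclass, field

# Conceptual toy only: a metadata control plane mapping logical file paths to
# physical placements across silos. Real systems do this with a distributed,
# transactional metadata service; all names here are invented.

@dataclass
class Placement:
    silo: str                 # e.g. "onprem-archive", "gpu-cluster-nvme"
    uri: str                  # where the bytes physically live
    in_flight: bool = False   # True while the data is being moved

@dataclass
class FileRecord:
    logical_path: str
    placements: list[Placement] = field(default_factory=list)

class MetadataControlPlane:
    def __init__(self):
        self._catalog: dict[str, FileRecord] = {}

    def register(self, logical_path: str, placement: Placement) -> None:
        rec = self._catalog.setdefault(logical_path, FileRecord(logical_path))
        rec.placements.append(placement)

    def resolve(self, logical_path: str) -> Placement:
        """Applications ask where to read/write; the answer can change as data
        is orchestrated, but the logical path never does."""
        rec = self._catalog[logical_path]
        # Prefer a settled placement, but data stays reachable mid-move.
        settled = [p for p in rec.placements if not p.in_flight]
        return (settled or rec.placements)[0]

cp = MetadataControlPlane()
cp.register("/projects/llm/corpus/part-0001.parquet",
            Placement("onprem-archive", "nfs://filer1/archive/part-0001.parquet"))
cp.register("/projects/llm/corpus/part-0001.parquet",
            Placement("gpu-cluster-nvme", "nfs://nvme2/cache/part-0001.parquet", in_flight=True))
print(cp.resolve("/projects/llm/corpus/part-0001.parquet").uri)
```

In a real system this catalog is distributed and transactional across sites; the point of the toy version is simply that every read and write goes through the metadata layer, which is why data remains accessible even while a copy is in flight.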
In essence, this approach allows organizations to overcome the limitations imposed by data gravity, enabling faster, more efficient data processing and analysis across distributed environments.
Unprecedented Flexibility To Adapt Legacy Environments to New Use Cases
Bridging the asynchronous distance gap between locations or clouds with a high-performance Parallel Global File System enables organizations to rapidly ramp up or down application, compute, and storage resources wherever and whenever needed, easily accommodating new use cases like those emerging in AI/DL workflows. This approach enables routine operations, such as replacing old storage with new platforms, to become a non-disruptive background activity. Data owners can do these things without the penalties associated with re-tooling existing on-premises infrastructures or interrupting user/application access to data. Organizations can get more life out of their existing computing and storage resources by automatically freeing up space on high-performance systems in the background without disruption.
Storage capacity, performance, cost centers, location, and more can now all become variables to trigger objective-based policies. Even the cost profiles of different storage types or between regions from the same cloud vendor can be used to create business rules for managing various data classes throughout their life cycle.
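Here is a hedged sketch of what such objective-based rules might look like; the tier names, attributes, and thresholds are assumptions made for illustration, not any vendor's policy language:

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative only: objective-based placement rules evaluated against file
# metadata. Tier names, attributes, and thresholds are invented for this sketch.

@dataclass
class FileInfo:
    path: str
    ai_stage: str          # e.g. "raw", "labeled", "training"
    days_since_access: int
    size_gb: float

@dataclass
class Rule:
    name: str
    matches: Callable[[FileInfo], bool]  # business-objective predicate
    target_tier: str                     # e.g. "gpu-nvme", "object-archive"

RULES: List[Rule] = [
    Rule("hot-training-data",
         lambda f: f.ai_stage == "training" and f.days_since_access <= 7,
         "gpu-nvme"),
    Rule("cold-raw-data",
         lambda f: f.ai_stage == "raw" and f.days_since_access > 30,
         "object-archive"),
]

def place(file: FileInfo, default_tier: str = "capacity-nas") -> str:
    """Return the storage tier that satisfies the first matching objective."""
    for rule in RULES:
        if rule.matches(file):
            return rule.target_tier
    return default_tier

print(place(FileInfo("/corpus/shard-42.parquet", "training", 2, 128.0)))  # gpu-nvme
print(place(FileInfo("/raw/video-007.mp4", "raw", 90, 512.0)))            # object-archive
```

The same pattern extends to cost ceilings, regional placement, and compliance constraints simply by adding predicates over those attributes.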
Data orchestration’s ability to provide global access and control across silos eliminates redundant copies, manual replication, fragmented data protection strategies, and other symptoms of data and storage sprawl. All data services are built into the software, giving IT administrators a simple way to automate such tasks with the skill sets they already have. This approach reduces the number of data copies and the number of software applications and point solutions required to manage a multi-silo data environment.
With emerging AI/DL requirements changing the traditional life cycles of unstructured data, GPU data orchestration gives organizations the flexibility to create a vendor-neutral data mesh that modernizes and streamlines their existing data environments using the infrastructure they already have.
In summary, GPU data orchestration opens new possibilities while reducing administration costs and time burdens. Architectures benefit from:
- Decoupling Data from Infrastructure: By decoupling the file system from the underlying storage infrastructure, the global metadata control plane ensures that data can be orchestrated at a granular level, meaning specific files or data sets can be managed independently of their physical location.
- Global Access and Efficiency: The global metadata layer provides a unified view of all data, regardless of where it is stored, enabling users and applications to access the same data in real time. This eliminates the need to create multiple copies or move large data sets, which is often a significant challenge given data gravity, where data sets become harder to move as they grow in size and complexity.
- Improved Performance and Agility: By orchestrating data at the file level, administrators can optimize data placement based on performance requirements, reducing latency by keeping data in proximity to GPU clusters, AI models, and other computational resources. This approach ensures that the data is where it needs to be, when it needs to be there, without being bogged down by the constraints of data gravity.