Vector search looks for similar vectors or data points in a data set based on their vector representations. Unlike proprietary vector databases such as Pinecone, Milvus, Qdrant and Weaviate, MyScale is built on the open source, SQL-compatible ClickHouse database.
SQL is an effective tool for managing relational databases. Combining SQL and vectors provides a powerful approach to tackling complex AI-related questions. Users can execute traditional SQL and vector queries on structured data and vector embeddings (data) to address complex queries and analyze high-dimensional data in a unified, efficient manner.
Advanced SQL Techniques for Complex Queries
Simple SQL queries are commands that perform straightforward data retrieval, usually from only one table at a time. Complex SQL queries go beyond standard requests by retrieving data from several tables and limiting the result set with multiple conditions.
A complex query could include features such as:
- Common table expressions
- Subqueries
- Joining to many tables and using different join types
Common Table Expressions
A common table expression (CTE) is a name you give a subquery within your main query. The main reason to do this is to simplify your query, making it easier to read and debug. It can sometimes improve performance, which is another benefit, but it’s mostly about readability and simplification.
Consider a scenario in which you want to determine the average age of customers who bought a particular product. You have a table of customer data, including their name, age and the products they purchased.
Here’s an example query to perform this calculation using a CTE:
This CTE — a temporary named result set (subquery) that can be referenced within a single query — is named product_customers
. It’s created using a SELECT
statement that retrieves the name and age columns from the customer_data
table for customers who purchased the product 'widget'
.
Moving the subquery to the top of the query and giving it a name makes it easier to understand what the query does. If your subquery selects a sample embedding vector, you could name your subquery something like target_vector_embed
. When you refer to this in the main query, you’ll see this name and know what it refers to.
This is also helpful if you have a long query and need the same logic in several places. You can define it at the top of the query and refer to it multiple times throughout your main query.
So consider using CTEs when having a subquery improves the readability of your query.
Subqueries
A subquery is a simple SQL command embedded within another query. By nesting queries, you can set up larger restrictions on the data included in the result set.
Subqueries can be used in several places within a query, but it’s easiest to start with the FROM
statement. Here’s an example of a basic subquery:
I’ll break down what happens when you run the above query:
First, the database runs the “inner query” — the part between the parentheses. If you run this independently, it produces a result set just like any other query. However, after the inner query runs, the outer query runs using the results from the inner query as its underlying table:
Subqueries must have names, which are added after parentheses (the same way you would add an alias to a regular table). This query uses the name sub
.
Using Subqueries in Conditional Logic
You can use subqueries in conditional logic (in conjunction with WHERE
, JOIN/ON
or CASE
). The following query returns all the entries from the same date as the specified entry in the data set:
This query works because the result of the subquery is only one cell. Most conditional logic will work with subqueries containing one-cell results. However, IN
is the only type of conditional logic that will work when the inner query contains multiple results:
Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN
clause) rather than as a table.
Joining Tables
join
produces a new table by combining columns from one or multiple tables by using values common to each. Different types of joins are:
INNER JOIN
: Only matching rows are returned.LEFT JOIN
: Nonmatching rows from the left table and matching rows are returned.RIGHT JOIN
: Nonmatching rows from the right table and matching rows are returned.FULL JOIN
: Nonmatching rows from both tables and matching rows are returned.CROSS JOIN
: Produces the Cartesian product of whole tables, as “join keys” are not specified.
Using Complex SQL and Vector Queries in MyScale
SQL vector database MyScale includes several features that help with complex SQL and vector queries.
Common Table Expressions
MyScale supports CTE and substitutes the code defined in the WITH
clause for the rest of the SELECT
query. Named subqueries can be included in the current and child query context anywhere table objects are allowed.
Vector search is a search method that represents data as vectors. It is commonly used in applications such as image, video and text search. MyScale uses the distance()
function to perform vector searches. It calculates the distance between a specified vector and all vector data in a specified column and returns the top candidates.
In some cases, if the specified vector is obtained from another table or the dimension of the specified vector is large and inconvenient to represent, you can use CTE or subquery.
Assume you have a vector table named photos
that stores metadata information linked to your photo library’s images, with id
, photo_id
and photo_embed
for the embedding vector.
The following example treats the result of a selection in CTE as a target vector to execute vector search:
Joins and Subqueries
There is limited support in MyScale for join
, and using subquery is recommended as a workaround. In MyScale, the vector search is based on the vector index on a table with a vector column. Although the distance()
function appears in the SELECT
clause, its value is calculated during vector search on the table, not after join
. The join
result may not be the expected result.
The following are possible workarounds:
- You can use the
distance()...WHERE...ORDER BY...LIMIT
query pattern in subqueries that utilize vector indexes and get expected results on vector tables. - You can also use subqueries in the
WHERE
clause to rewrite the join.
Assume you have another table, photo_meta
, that stores information about the photo library’s images with photo_id
, photo_author
, year
and title
. The following example retrieves relevant photos taken in 2023 from a collection of images:
Here’s what happens when you run the above query:
First, MyScale executes vector search on the table photos
to get the required column photo_id
and the value of the distance()
function for the top five relevant records:
Then, the join
runs using the results from the vector table as its underlying table:
Because the vector search doesn’t consider the year photos were taken, the result may be incorrect. To get the correct result, rewrite the join
query by using a subquery:
Improve Data Analysis
Advanced SQL techniques like CTEs, subqueries and joins can help you perform complex data analyses and manipulations with greater precision and efficiency. MyScale combines the power of SQL and vectors to provide a powerful approach to tackling complex AI-related questions. With MyScale, you can efficiently execute traditional SQL and vector queries on structured and vector data to address complex queries and analyze high-dimensional data in a unified and efficient manner.
If you are interested in learning more, please follow us on X (Twitter) or join our Discord community. Let’s build the future of data and AI together!
The post How To Run Complex Queries With SQL in Vector Databases appeared first on The New Stack.
Learn how to use common table expressions, subqueries and joins to perform complex SQL and vector search queries in MyScaleDB.