Although no large language model (LLM) has ever ridden a bike, they clearly understand the role of cycling within the domain of human transport. What they seem to offer software developers is something like semantic knowledge of the real world, combined with an understanding of the technical world. We saw this clearly in a recent post, when we were able to produce a simple book publishing SQL schema just by describing it in English.
While I was happy with how Llama 3 performed when creating schemas, a colleague from my Oracle days pointed out that the book publishing schema is a fairly well-known example. For the sake of a readable explanation, that is a good thing; but to stretch LLMs’ abilities a little further, in this post I’ll see how well an LLM can alter its own schema based on issues described in English. This time I’ll use OpenAI’s GPT-4o, since it did a good job of code reviewing for me recently.
As a jumping-off point, we’ll start with the same question we asked in that first article and summarize the answer, which is much as before. This time we also got an ERD (entity-relationship diagram) from GPT-4o, as well as a nice explanation of the relationships:
And similar to the previous attempt, it suggested this schema:
CREATE TABLE Author (
    author_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    birth_date DATE,
    nationality VARCHAR(50)
);

CREATE TABLE Publisher (
    publisher_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    address VARCHAR(255),
    contact_number VARCHAR(20),
    email VARCHAR(100)
);

CREATE TABLE Book (
    book_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(100),
    genre VARCHAR(50),
    publication_date DATE,
    isbn VARCHAR(20) UNIQUE,
    author_id INT,
    publisher_id INT,
    FOREIGN KEY (author_id) REFERENCES Author(author_id),
    FOREIGN KEY (publisher_id) REFERENCES Publisher(publisher_id)
);
One minor quibble: I prefer table names to be the plural of the objects they contain, which I believe is the accepted standard.
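For example, the Author table would become Authors; a purely cosmetic sketch of that convention (not something GPT-4o produced) looks like this:

CREATE TABLE Authors (
    author_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    birth_date DATE,
    nationality VARCHAR(50)
);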
The LLM described these relationship limitations:
So using the same data from last time, let’s check if we get the same result in our SQL playground, dbfiddle.
If we seed the data and add the view from last time…
INSERT INTO Author (first_name, last_name, birth_date) VALUES ('Iain', 'Banks', '1954-02-16');
INSERT INTO Author (first_name, last_name, birth_date) VALUES ('Iain', 'M Banks', '1954-02-16');
INSERT INTO Publisher (name, address) VALUES ('Abacus', 'London');
INSERT INTO Publisher (name, address) VALUES ('Orbit', 'New York');
INSERT INTO Book (title, author_id, publisher_id, publication_date) VALUES ('Consider Phlebas', 2, 2, '1988-04-14');
INSERT INTO Book (title, author_id, publisher_id, publication_date) VALUES ('The Wasp Factory', 1, 1, '1984-02-15');

CREATE VIEW ViewableBooks AS
SELECT Book.title 'Book',
       Author.first_name 'Author firstname',
       Author.last_name 'Author surname',
       Publisher.name 'Publisher',
       Book.publication_date
FROM Book, Publisher, Author
WHERE Book.author_id = Author.author_id
  AND Book.publisher_id = Publisher.publisher_id;
…we get the result view we wanted from dbfiddle in the table at the bottom here:
You can run this yourself from the link until it expires. The second surname, which includes the middle letter “M”, looks awkward. And we’ll look at an issue relating to that next.
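If the fiddle has expired by the time you read this, the result table is easy to reproduce locally with a simple query against the view; based on the seed data above, you should see something like the rows in the comments below:

SELECT * FROM ViewableBooks;

-- Consider Phlebas | Iain | M Banks | Orbit  | 1988-04-14
-- The Wasp Factory | Iain | Banks   | Abacus | 1984-02-15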
The First Alteration
As I mentioned in the previous article on SQL generation, “Iain Banks” and “Iain M Banks” are in fact the same author. Last time, we left this pseudonym issue unaddressed. So let’s ask the LLM to fix the problem:
This is good. This time, the LLM had to map the literary concept of a ‘pseudonym’ onto the schema design it had already produced, so it had to do more than simply retrieve a well-known solution. First of all, what about the new relationships:
This seems kosher. These are the newly altered tables:
CREATE TABLE Pseudonym (
    pseudonym_id INT AUTO_INCREMENT PRIMARY KEY,
    pseudonym VARCHAR(100),
    author_id INT,
    FOREIGN KEY (author_id) REFERENCES Author(author_id)
);

CREATE TABLE Book (
    book_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(100),
    genre VARCHAR(50),
    publication_date DATE,
    isbn VARCHAR(20) UNIQUE,
    pseudonym_id INT,
    publisher_id INT,
    FOREIGN KEY (pseudonym_id) REFERENCES Pseudonym(pseudonym_id),
    FOREIGN KEY (publisher_id) REFERENCES Publisher(publisher_id)
);
This feels correct, too. The schema now ties each book to a pseudonym rather than directly to the author. Let’s set up a fresh dbfiddle with the new schema, feed in suitably altered data, and see if we can get our nice result back:
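As a rough sketch (the exact statements in the fiddle may differ), the reseeded data and reworked view might look something like this, with both books now hanging off pseudonyms that belong to the same underlying author:

INSERT INTO Author (first_name, last_name, birth_date) VALUES ('Iain', 'Banks', '1954-02-16');
INSERT INTO Pseudonym (pseudonym, author_id) VALUES ('Iain Banks', 1);
INSERT INTO Pseudonym (pseudonym, author_id) VALUES ('Iain M Banks', 1);
INSERT INTO Publisher (name, address) VALUES ('Abacus', 'London');
INSERT INTO Publisher (name, address) VALUES ('Orbit', 'New York');
INSERT INTO Book (title, pseudonym_id, publisher_id, publication_date) VALUES ('Consider Phlebas', 2, 2, '1988-04-14');
INSERT INTO Book (title, pseudonym_id, publisher_id, publication_date) VALUES ('The Wasp Factory', 1, 1, '1984-02-15');

CREATE VIEW ViewableBooks AS
SELECT Book.title 'Book',
       Pseudonym.pseudonym 'Author',
       Publisher.name 'Publisher',
       Book.publication_date
FROM Book
JOIN Pseudonym ON Book.pseudonym_id = Pseudonym.pseudonym_id
JOIN Publisher ON Book.publisher_id = Publisher.publisher_id;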
It is actually a nicer table, now that the pseudonym column is just one field.
Another Alteration Request
Now, I’ll ask for one more schema alteration. We know that books can have multiple authors (you may remember that last time, Llama 3 suggested this without prompting), so we want GPT-4o to alter its schema again.
The one additional table is just this:
CREATE TABLE BookAuthor (
    book_id INT,
    pseudonym_id INT,
    PRIMARY KEY (book_id, pseudonym_id),
    FOREIGN KEY (book_id) REFERENCES Book(book_id),
    FOREIGN KEY (pseudonym_id) REFERENCES Pseudonym(pseudonym_id)
);
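To see what the junction table buys us, here is a quick illustration using made-up data of my own (assuming the two books from the previous fiddle are already seeded, so the new title gets book_id 3):

-- Hypothetical co-written title, credited to both pseudonyms.
INSERT INTO Book (title, publisher_id, publication_date) VALUES ('A Co-Written Title', 1, '1990-01-01');
INSERT INTO BookAuthor (book_id, pseudonym_id) VALUES (3, 1);
INSERT INTO BookAuthor (book_id, pseudonym_id) VALUES (3, 2);

-- List every pseudonym credited on that book.
SELECT Book.title, Pseudonym.pseudonym
FROM Book
JOIN BookAuthor ON Book.book_id = BookAuthor.book_id
JOIN Pseudonym ON BookAuthor.pseudonym_id = Pseudonym.pseudonym_id
WHERE Book.book_id = 3;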
So the relationship changes:
(Note the weird bracket error after describing the first couple of relationships. This has been repeated for all of the descriptions of relationships. It seems to be preventing the text “1:M” or “M:M” from printing — perhaps an emoji confusion?)
Also, of course, GPT-4o is following the conversation as one thread, taking previous work into its context. This much-heralded ability does indeed make working with it much more natural. Overall, it performed well (and very quickly) at parsing our English descriptions and altering its own suggested schema.
Before We Get Too Excited
Schemas are all about the relationships between things; they don’t require an intimate understanding of the things themselves. However, this does not mean the road is clear for LLMs to take over database design just yet.
Optimizing SQL queries and schemas has always been a bit of an art form. It requires understanding which common queries a design should serve best, how many tables they will touch, query interdependencies, index definitions, partitioning and so on. And that is before dealing with CAP theorem dilemmas: consistency versus availability. Underneath these technical abstractions are human expectations of data retrieval that are far from simple.
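As a tiny illustration of the kind of decision GPT-4o was never asked to make: if readers mostly look books up by title, a human designer would probably add an index, something like the hypothetical one below:

-- Illustrative only; not part of GPT-4o's output.
CREATE INDEX idx_book_title ON Book (title);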
I have no doubt that some mixture of LLM and specialization will deal with these engineering issues over time, but for now we should take the win with how well GPT-4o was able to produce and amend a healthy schema.